arXiv Papers - 2025-02-24

The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

Authors:Sheila Schoepp, Masoud Jafaripour, Yingyue Cao, Tianpei Yang, Fatemeh Abdollahi, Shadan Golestan, Zahin Sufiyan, Osmar R. Zaiane, Matthew E. Taylor
Date:2025-02-21 05:01:30

Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Meanwhile, Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged, exhibiting impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. In this survey, we review representative works in which LLMs and VLMs are used to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that categorizes these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We conclude by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, this survey establishes a framework for integrating LLMs and VLMs into RL, advancing approaches that unify natural language and visual understanding with sequential decision-making.

Reinforcement Learning for Ultrasound Image Analysis: A Comprehensive Review of Advances and Applications

Authors:Maha Ezzelarab, Midhila Madhusoodanan, Shrimanti Ghosh, Geetika Vadali, Jacob Jaremko, Abhilash Hareendranathan
Date:2025-02-20 19:37:49

Over the last decade, the use of machine learning (ML) approaches in medical applications has increased manifold. Most of these approaches are based on deep learning, which aims to learn representations from grid data (like medical images). However, reinforcement learning (RL) applications in medicine are relatively less explored. Medical applications often involve a sequence of subtasks that form a diagnostic pipeline, and RL is uniquely suited to optimize over such sequential decision-making tasks. Ultrasound (US) image analysis is a quintessential example of such a sequential decision-making task, where the raw signal captured by the US transducer undergoes a series of signal processing and image post-processing steps, generally leading to a diagnostic suggestion. The application of RL in US remains limited. Deep Reinforcement Learning (DRL), which combines deep learning and RL, holds great promise in optimizing these pipelines by enabling intelligent and sequential decision-making. This review paper surveys the applications of RL in US over the last decade. We provide a succinct overview of the theoretical framework of RL and its application in US image processing and review existing work in each aspect of the image analysis pipeline. A comprehensive search of Scopus filtered on relevance yielded 14 papers most relevant to this topic. These papers were further categorized based on their target applications: image classification, image segmentation, image enhancement, video summarization, and auto navigation and path planning. We also examined the type of RL approach used in each publication. Finally, we discuss key areas in healthcare where DRL approaches in US could be used for sequential decision-making. We analyze the opportunities, challenges, and limitations, providing insights into the future potential of DRL in US image analysis.

Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

Authors:Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun
Date:2025-02-20 18:39:41

A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations. In this work, we systematically analyze the performance of different RL and control-based methods under datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties, such as data diversity, trajectory quality, and environment variability, affect the performance of these approaches. Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels in generalization to novel environment layouts, trajectory stitching, and data-efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.

SPRIG: Stackelberg Perception-Reinforcement Learning with Internal Game Dynamics

Authors:Fernando Martinez-Lopez, Juntao Chen, Yingdong Lu
Date:2025-02-20 05:02:29

Deep reinforcement learning agents often struggle to coordinate perception and decision-making components effectively, particularly in environments with high-dimensional sensory inputs where feature relevance varies. This work introduces SPRIG (Stackelberg Perception-Reinforcement learning with Internal Game dynamics), a framework that models the internal perception-policy interaction within a single agent as a cooperative Stackelberg game. In SPRIG, the perception module acts as a leader, strategically processing raw sensory states, while the policy module follows, making decisions based on extracted features. SPRIG provides theoretical guarantees through a modified Bellman operator while preserving the benefits of modern policy optimization. Experimental results on the Atari BeamRider environment demonstrate SPRIG's effectiveness, achieving around 30% higher returns than standard PPO through its game-theoretical balance of feature extraction and decision-making.

Comprehensive Review on the Control of Heat Pumps for Energy Flexibility in Distribution Networks

Authors:Gustavo L. Aschidamini, Mina Pavlovic, Bradley A. Reinholz, Malcolm S. Metcalfe, Taco Niet, Mariana Resener
Date:2025-02-19 21:29:07

Decarbonization plans promote the transition to heat pumps (HPs), creating new opportunities for their energy flexibility in demand response programs, solar photovoltaic integration and optimization of distribution networks. This paper reviews scheduling-based and real-time optimization methods for controlling HPs with a focus on energy flexibility in distribution networks. Scheduling-based methods fall into two categories: rule-based controllers (RBCs), which rely on predefined control rules without explicitly seeking optimal solutions, and optimization models, which are designed to determine the optimal scheduling of operations. Real-time optimization is achieved through model predictive control (MPC), which relies on a predictive model to optimize decisions over a time horizon, and reinforcement learning (RL), which takes a model-free approach by learning optimal strategies through direct interaction with the environment. The paper also examines studies on the impact of HPs on distribution networks, particularly those leveraging energy flexibility strategies. Key takeaways suggest the need to validate control strategies for extreme cold-weather regions that require backup heaters, as well as develop approaches designed for demand charge schemes that integrate HPs with other controllable loads. From a grid impact assessment perspective, studies have focused primarily on RBCs for providing energy flexibility through HP operation, without addressing more advanced methods such as real-time optimization using MPC or RL-based algorithms. Incorporating these advanced control strategies could help identify key limitations, including the impact of varying user participation levels and the cost-benefit trade-offs associated with their implementation.

LLM should think and action as a human

Authors:Haun Leung, ZiNan Wang
Date:2025-02-19 06:58:34

Training large language models as chat assistants has recently become popular, but some prompts require multiple conversation turns between the user and the assistant. Multi-turn conversations raise several issues: the assistant's responses are prone to errors and may fail to help users achieve their goals, and the probability of error grows with the number of turns; it is difficult for the assistant to generate responses that follow different processes for the same prompt depending on actual needs; and the assistant needs to use tools, but the current approach is neither elegant nor efficient, and the number of tool calls is limited. The main reason for these issues is that large language models lack human-like thinking: they lack reasoning and planning ability, as well as the ability to execute plans. To solve these issues, we propose a thinking method based on a built-in chain of thought: in a multi-turn conversation, for each user prompt, the large language model thinks based on elements such as chat history, thinking context, action calls, memory, and knowledge, performs detailed reasoning and planning, and acts according to the plan. We also explore how the large language model can enhance its thinking ability through this method: we collect training datasets according to the thinking method and fine-tune the model through supervised learning; we then train a consistency reward model and use it as the reward function to fine-tune the model with reinforcement learning, so that the reinforced model produces output that follows this way of thinking. Our experimental results show that the reasoning and planning abilities of the large language model are enhanced and that the issues in multi-turn conversation are solved.

Physics-Aware Robotic Palletization with Online Masking Inference

Authors:Tianqi Zhang, Zheng Wu, Yuxin Chen, Yixiao Wang, Boyuan Liang, Scott Moura, Masayoshi Tomizuka, Mingyu Ding, Wei Zhan
Date:2025-02-19 05:39:41

The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook their intrinsic and physical properties, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy toward valid actions. Unlike previous methods that rely on heuristic stability assessments, which are difficult to carry out in physical scenarios, our framework utilizes online learning to dynamically train the action space mask, eliminating the need for manual heuristic design. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-art methods. Furthermore, we deploy our learned task planner in a real-world robotic palletizer, validating its practical applicability in operational settings.
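As a rough illustration of action space masking in general (not the authors' implementation), the sketch below applies a given validity mask to policy logits so that invalid actions receive zero probability. The function name, logits, and mask values are hypothetical; in the paper the mask is learned online, whereas here it is simply supplied as input.

```python
import math

def masked_softmax(logits, mask):
    """Softmax over logits in which masked-out (invalid) actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: four candidate placements, two rejected by the mask.
logits = [2.0, 0.5, 1.0, 3.0]
mask = [True, False, True, False]
probs = masked_softmax(logits, mask)
```

Note that the highest raw logit (index 3) is excluded here, so the policy can only sample from the two placements the mask allows.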

Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

Authors:Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
Date:2025-02-18 17:59:48

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.

Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks

Authors:Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
Date:2025-02-18 16:26:21

Automated Planning algorithms require a model of the domain that specifies the preconditions and effects of each action. Obtaining such a domain model is notoriously hard. Algorithms for learning domain models exist, yet it remains unclear whether learning a domain model and planning is an effective approach for numeric planning environments, i.e., where states include discrete and numeric state variables. In this work, we explore the benefits of learning a numeric domain model and compare it with alternative model-free solutions. As a case study, we use two tasks in Minecraft, a popular sandbox game that has been used as an AI challenge. First, we consider an offline learning setting, where a set of expert trajectories is available to learn from. This is the standard setting for learning domain models. We used the Numeric Safe Action Model Learning (NSAM) algorithm to learn a numeric domain model and solve new problems with the learned domain model and a numeric planner. We call this model-based solution NSAM_(+p), and compare it to several model-free Imitation Learning (IL) and Offline Reinforcement Learning (RL) algorithms. Empirical results show that some IL algorithms can learn faster to solve simple tasks, while NSAM_(+p) allows solving tasks that require long-term planning and enables generalizing to solve problems in larger environments. Then, we consider an online learning setting, where learning is done by moving an agent in the environment. For this setting, we introduce RAMP. In RAMP, observations collected during the agent's execution are used to simultaneously train an RL policy and learn a planning domain action model. This forms a positive feedback loop between the RL policy and the learned domain model. We demonstrate experimentally the benefits of using RAMP, showing that it finds more efficient plans and solves more problems than several RL baselines.

NTP-INT: Network Traffic Prediction-Driven In-band Network Telemetry for High-load Switches

Authors:Penghui Zhang, Hua Zhang, Yuqi Dai, Cheng Zeng, Jingyu Wang, Jianxin Liao
Date:2025-02-18 13:00:52

In-band network telemetry (INT) is essential to network management due to its real-time visibility. However, because of the rapid increase in network devices and services, it has become crucial to have targeted access to detailed network information in a dynamic network environment. This paper proposes an intelligent network telemetry system called NTP-INT to obtain more fine-grained network information on high-load switches. Specifically, NTP-INT consists of three modules: network traffic prediction module, network pruning module, and probe path planning module. Firstly, the network traffic prediction module adopts a Multi-Temporal Graph Neural Network (MTGNN) to predict future network traffic and identify high-load switches. Then, we design the network pruning algorithm to generate a subnetwork covering all high-load switches to reduce the complexity of probe path planning. Finally, the probe path planning module uses an attention-mechanism-based deep reinforcement learning (DRL) model to plan efficient probe paths in the network slice. The experimental results demonstrate that NTP-INT can acquire more precise network information on high-load switches while decreasing the control overhead by 50%.

Navigating Demand Uncertainty in Container Shipping: Deep Reinforcement Learning for Enabling Adaptive and Feasible Master Stowage Planning

Authors:Jaike van Twiller, Yossiri Adulyasak, Erick Delage, Djordje Grbic, Rune Møller Jensen
Date:2025-02-18 11:18:17

Reinforcement learning (RL) has shown promise in solving various combinatorial optimization problems. However, conventional RL faces challenges when dealing with real-world constraints, especially when action space feasibility is explicit and dependent on the corresponding state or trajectory. In this work, we focus on using RL in container shipping, often considered the cornerstone of global trade, by dealing with the critical challenge of master stowage planning. The main objective is to maximize cargo revenue and minimize operational costs while navigating demand uncertainty and various complex operational constraints, namely vessel capacity and stability, which must be dynamically updated along the vessel's voyage. To address this problem, we implement a deep reinforcement learning framework with feasibility projection to solve the master stowage planning problem (MPP) under demand uncertainty. The experimental results show that our architecture efficiently finds adaptive, feasible solutions for this multi-stage stochastic optimization problem, outperforming traditional mixed-integer programming and RL with feasibility regularization. Our AI-driven decision-support policy enables adaptive and feasible planning under uncertainty, optimizing operational efficiency and capacity utilization while contributing to sustainable and resilient global supply chains.
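To make the idea of feasibility projection concrete, here is a minimal sketch that projects a raw policy output onto a single capacity constraint. It is an assumption-laden stand-in: the function name and the constraint set {y : y >= 0, sum(y) <= capacity} are hypothetical, and the paper's actual constraints (including vessel stability) are not modeled. The code computes the Euclidean projection using the standard sort-based simplex projection.

```python
def project_to_capacity(x, capacity):
    """Euclidean projection of x onto {y : y >= 0, sum(y) <= capacity}.

    Assumes capacity > 0.
    """
    clipped = [max(v, 0.0) for v in x]
    if sum(clipped) <= capacity:
        # Capacity constraint is inactive: clipping alone is the projection.
        return clipped
    # Capacity constraint is active: project onto {y >= 0, sum(y) == capacity}.
    u = sorted(x, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - capacity) / j
        if value - candidate > 0:
            theta = candidate  # keep the threshold from the last valid index
    return [max(v - theta, 0.0) for v in x]

# Hypothetical raw allocation for three cargo classes, capacity of 2 units.
feasible = project_to_capacity([3.0, 1.0, -2.0], capacity=2.0)
```

A projection of this kind returns the closest feasible allocation to the policy's proposal, in contrast to a regularization penalty, which only discourages violations.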

Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport

Authors:Mingyang Sun, Pengxiang Ding, Weinan Zhang, Donglin Wang
Date:2025-02-18 08:22:20

Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning. The code will be released at https://github.com/Sunmmyy/OTPR.git.

Plant in Cupboard, Orange on Table, Book on Shelf. Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment

Authors:Jonathan Jordan, Sherzod Hakimov, David Schlangen
Date:2025-02-17 12:20:39

Large language models (LLMs) have risen to prominence as 'chatbots' for users to interact with via natural language. However, their ability to capture common-sense knowledge makes them seem promising as language-based planners of situated or embodied action as well. We have implemented a simple text-based environment -- similar to others that have previously been used for reinforcement learning of agents -- that simulates, very abstractly, a household setting. We use this environment and the detailed error-tracking capabilities we implemented for targeted benchmarking of LLMs on the problem of practical reasoning: going from goals and observations to actions. Our findings show that environmental complexity and game restrictions hamper performance, and concise action planning is demanding for current LLMs.

AI Generations: From AI 1.0 to AI 4.0

Authors:Jiahao Wu, Hengxu You, Jing Du
Date:2025-02-16 23:19:44

This paper proposes that Artificial Intelligence (AI) progresses through several overlapping generations: AI 1.0 (Information AI), AI 2.0 (Agentic AI), AI 3.0 (Physical AI), and now a speculative AI 4.0 (Conscious AI). Each of these AI generations is driven by shifting priorities among algorithms, computing power, and data. AI 1.0 ushered in breakthroughs in pattern recognition and information processing, fueling advances in computer vision, natural language processing, and recommendation systems. AI 2.0 built on these foundations through real-time decision-making in digital environments, leveraging reinforcement learning and adaptive planning for agentic AI applications. AI 3.0 extended intelligence into physical contexts, integrating robotics, autonomous vehicles, and sensor-fused control systems to act in uncertain real-world settings. Building on these developments, AI 4.0 puts forward the bold vision of self-directed AI capable of setting its own goals, orchestrating complex training regimens, and possibly exhibiting elements of machine consciousness. This paper traces the historical foundations of AI across roughly seventy years, mapping how changes in technological bottlenecks, from algorithmic innovation to high-performance computing to specialized data, have spurred each generational leap. It further highlights the ongoing synergies among AI 1.0, 2.0, 3.0, and 4.0, and explores the profound ethical, regulatory, and philosophical challenges that arise when artificial systems approach (or aspire to) human-like autonomy. Ultimately, understanding these evolutions and their interdependencies is pivotal for guiding future research, crafting responsible governance, and ensuring that AI's transformative potential benefits society as a whole.

Integrating Language Models for Enhanced Network State Monitoring in DRL-Based SFC Provisioning

Authors:Parisa Fard Moshiri, Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Emil Janulewicz
Date:2025-02-16 22:52:14

Efficient Service Function Chain (SFC) provisioning and Virtual Network Function (VNF) placement are critical for enhancing network performance in modern architectures such as Software-Defined Networking (SDN) and Network Function Virtualization (NFV). While Deep Reinforcement Learning (DRL) aids decision-making in dynamic network environments, its reliance on structured inputs and predefined rules limits adaptability in unforeseen scenarios. Additionally, incorrect actions by a DRL agent may require numerous training iterations to correct, potentially reinforcing suboptimal policies and degrading performance. This paper integrates DRL with Language Models (LMs), specifically Bidirectional Encoder Representations from Transformers (BERT) and DistilBERT, to enhance network management. By feeding final VNF allocations from DRL into the LM, the system can process and respond to queries related to SFCs, DCs, and VNFs, enabling real-time insights into resource utilization, bottleneck detection, and future demand planning. The LMs are fine-tuned to our domain-specific dataset using Low-Rank Adaptation (LoRA). Results show that BERT outperforms DistilBERT with a lower test loss (0.28 compared to 0.36) and higher confidence (0.83 compared to 0.74), though BERT requires approximately 46% more processing time.

Solving Online Resource-Constrained Scheduling for Follow-Up Observation in Astronomy: a Reinforcement Learning Approach

Authors:Yajie Zhang, Ce Yu, Chao Sun, Jizeng Wei, Junhan Ju, Shanjiang Tang
Date:2025-02-16 14:01:12

In the astronomical observation field, determining the allocation of observation resources of the telescope array and planning follow-up observations for targets of opportunity (ToOs) are indispensable components of astronomical scientific discovery. This problem is computationally challenging, given the online observation setting and the abundance of time-varying factors that can affect whether an observation can be conducted. This paper presents ROARS, a reinforcement learning approach for online astronomical resource-constrained scheduling. To capture the structure of the astronomical observation scheduling, we depict every schedule using a directed acyclic graph (DAG), illustrating the dependency of timing between different observation tasks within the schedule. Deep reinforcement learning is used to learn a policy that improves a feasible solution through iterative local rewriting until convergence. This avoids the challenge of obtaining a complete solution directly from scratch in astronomical observation scenarios, which is difficult due to the high computational complexity resulting from numerous spatial and temporal constraints. A simulation environment is developed based on real-world scenarios for experiments, to evaluate the effectiveness of our proposed scheduling approach. The experimental results show that ROARS surpasses 5 popular heuristics, adapts to various observation scenarios and learns effective strategies with hindsight.
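The DAG view of a schedule can be sketched with Python's standard library. The task names and dependencies below are hypothetical and are not taken from ROARS; the sketch only shows how timing dependencies between observation tasks constrain a valid execution order, and it does not implement the learned local-rewriting policy.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks that must finish first.
dependencies = {
    "slew_to_target": set(),
    "calibrate": {"slew_to_target"},
    "expose_filter_r": {"calibrate"},
    "expose_filter_g": {"calibrate"},
    "readout": {"expose_filter_r", "expose_filter_g"},
}

# Any topological order of the DAG is a dependency-respecting execution order.
order = list(TopologicalSorter(dependencies).static_order())
```

A local rewrite in this representation would amount to changing a small part of the graph (for example, reassigning one task) and checking that the result is still acyclic.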

Maximize Your Diffusion: A Study into Reward Maximization and Alignment for Diffusion-based Control

Authors:Dom Huh, Prasant Mohapatra
Date:2025-02-16 00:30:39

Diffusion-based planning, learning, and control methods present a promising branch of powerful and expressive decision-making solutions. Given the growing interest, such methods have undergone numerous refinements over the past years. However, despite these advancements, existing methods are limited in their investigations regarding general methods for reward maximization within the decision-making process. In this work, we study extensions of fine-tuning approaches for control applications. Specifically, we explore extensions and various design choices for four fine-tuning approaches: reward alignment through reinforcement learning, direct preference optimization, supervised fine-tuning, and cascading diffusion. We optimize their usage to merge these independent efforts into one unified paradigm. We show the utility of such propositions in offline RL settings and demonstrate empirical improvements over a rich array of control tasks.

Accelerated co-design of robots through morphological pretraining

Authors:Luke Strgar, Sam Kriegman
Date:2025-02-15 17:20:56

The co-design of robot morphology and neural control typically requires using reinforcement learning to approximate a unique control policy gradient for each body plan, demanding massive amounts of training data to measure the performance of each design. Here we show that a universal, morphology-agnostic controller can be rapidly and directly obtained by gradient-based optimization through differentiable simulation. This process of morphological pretraining allows the designer to explore non-differentiable changes to a robot's physical layout (e.g. adding, removing and recombining discrete body parts) and immediately determine which revisions are beneficial and which are deleterious using the pretrained model. We term this process "zero-shot evolution" and compare it with the simultaneous co-optimization of a universal controller alongside an evolving design population. We find the latter results in diversity collapse, a previously unknown pathology whereby the population -- and thus the controller's training data -- converges to similar designs that are easier to steer with a shared universal controller. We show that zero-shot evolution with a pretrained controller quickly yields a diversity of highly performant designs, and by fine-tuning the pretrained controller on the current population throughout evolution, diversity is not only preserved but significantly increased as superior performance is achieved.

Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

Authors:Zhiyuan Li, Wenshuai Zhao, Joni Pajarinen
Date:2025-02-14 13:23:18

Despite much progress in training distributed artificial intelligence (AI), building cooperative multi-agent systems with multi-agent reinforcement learning (MARL) faces challenges in sample efficiency, interpretability, and transferability. Unlike traditional learning-based methods that require extensive interaction with the environment, large language models (LLMs) demonstrate remarkable capabilities in zero-shot planning and complex reasoning. However, existing LLM-based approaches heavily rely on text-based observations and struggle with the non-Markovian nature of multi-agent interactions under partial observability. We present COMPASS, a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. COMPASS propagates entity information through multi-hop communication under partial observability. Evaluations on the improved StarCraft Multi-Agent Challenge (SMACv2) demonstrate COMPASS achieves up to 30% higher win rates than state-of-the-art MARL algorithms in symmetric scenarios.

Knowledge Integration Strategies in Autonomous Vehicle Prediction and Planning: A Comprehensive Survey

Authors:Kumar Manas, Adrian Paschke
Date:2025-02-13 19:32:41

This comprehensive survey examines the integration of knowledge-based approaches into autonomous driving systems, with a focus on trajectory prediction and planning. We systematically review methodologies for incorporating domain knowledge, traffic rules, and commonsense reasoning into these systems, spanning purely symbolic representations to hybrid neuro-symbolic architectures. In particular, we analyze recent advancements in formal logic and differential logic programming, reinforcement learning frameworks, and emerging techniques that leverage large foundation models and diffusion models for knowledge representation. Organized under a unified literature survey section, our discussion synthesizes the state-of-the-art into a high-level overview, supported by a detailed comparative table that maps key works to their respective methodological categories. This survey not only highlights current trends -- including the growing emphasis on interpretable AI, formal verification in safety-critical systems, and the increased use of generative models in prediction and planning -- but also outlines the challenges and opportunities for developing robust, knowledge-enhanced autonomous driving systems.

Generalizable Reinforcement Learning with Biologically Inspired Hyperdimensional Occupancy Grid Maps for Exploration and Goal-Directed Path Planning

Authors:Shay Snyder, Ryan Shea, Andrew Capodieci, David Gorsich, Maryam Parsa
Date:2025-02-13 15:10:45

Real-time autonomous systems utilize multi-layer computational frameworks to perform critical tasks such as perception, goal finding, and path planning. Traditional methods implement perception using occupancy grid mapping (OGM), segmenting the environment into discretized cells with probabilistic information. This classical approach is well-established and provides a structured input for downstream processes like goal finding and path planning algorithms. Recent approaches leverage a biologically inspired mathematical framework known as vector symbolic architectures (VSA), commonly known as hyperdimensional computing, to perform probabilistic OGM in hyperdimensional space. This approach, VSA-OGM, provides native compatibility with spiking neural networks, positioning VSA-OGM as a potential neuromorphic alternative to conventional OGM. However, for large-scale integration, it is essential to assess the performance implications of VSA-OGM on downstream tasks compared to established OGM methods. This study examines the efficacy of VSA-OGM against a traditional OGM approach, Bayesian Hilbert Maps (BHM), within reinforcement learning based goal finding and path planning frameworks, across a controlled exploration environment and an autonomous driving scenario inspired by the F1-Tenth challenge. Our results demonstrate that VSA-OGM maintains comparable learning performance across single and multi-scenario training configurations while improving performance on unseen environments by approximately 47%. These findings highlight the increased generalizability of policy networks trained with VSA-OGM over BHM, reinforcing its potential for real-world deployment in diverse environments.
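For readers unfamiliar with classical occupancy grid mapping, the following sketch shows the textbook log-odds update for discretized cells. It illustrates the traditional probabilistic OGM idea only, not VSA-OGM or Bayesian Hilbert Maps; the sensor probabilities `p_hit` and `p_miss` are assumed values chosen for illustration.

```python
import math

def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def update_cell(log_odds, hit, p_hit=0.7, p_miss=0.4):
    """Bayesian log-odds update of one cell from one sensor reading (0.5 prior)."""
    return log_odds + logit(p_hit if hit else p_miss)

def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# 3x3 grid initialised to the 0.5 prior (log-odds 0).
grid = [[0.0] * 3 for _ in range(3)]
for _ in range(3):
    grid[1][1] = update_cell(grid[1][1], hit=True)   # three "occupied" readings
grid[0][0] = update_cell(grid[0][0], hit=False)      # one "free" reading

occupied = probability(grid[1][1])
free = probability(grid[0][0])
```

Working in log-odds turns repeated Bayesian updates into simple additions, which is why this form is common in grid-based mapping.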

Autonomous Task Completion Based on Goal-directed Answer Set Programming

Authors:Alexis R. Tudor
Date:2025-02-13 11:46:56

Task planning for autonomous agents has typically been done using deep learning models and simulation-based reinforcement learning. This research proposes combining inductive learning techniques with goal-directed answer set programming to increase the explainability and reliability of systems for task breakdown and completion. Preliminary research has led to the creation of a Python harness that utilizes s(CASP) to solve task problems in a computationally efficient way. Although this research is in the early stages, we are exploring solutions to complex problems in simulated task completion.

A view on learning robust goal-conditioned value functions: Interplay between RL and MPC

Authors:Nathan P. Lawrence, Philip D. Loewen, Michael G. Forbes, R. Bhushan Gopaluni, Ali Mesbah
Date:2025-02-10 19:45:06

Reinforcement learning (RL) and model predictive control (MPC) offer a wealth of distinct approaches for automatic decision-making. Given the impact both fields have had independently across numerous domains, there is growing interest in combining the general-purpose learning capability of RL with the safety and robustness features of MPC. To this end, this paper presents a tutorial-style treatment of RL and MPC, treating them as alternative approaches to solving Markov decision processes. In our formulation, RL aims to learn a global value function through offline exploration in an uncertain environment, whereas MPC constructs a local value function through online optimization. This local-global perspective suggests new ways to design policies that combine robustness and goal-conditioned learning. Robustness is incorporated into the RL and MPC pipelines through a scenario-based approach. Goal-conditioned learning aims to alleviate the burden of engineering a reward function for RL. Combining the two leads to a single policy that unites a robust, high-level RL terminal value function with short-term, scenario-based MPC planning for reliable constraint satisfaction. This approach leverages the benefits of both RL and MPC, the effectiveness of which is demonstrated on classical control benchmarks.

ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

Authors:Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang
Date:2025-02-10 18:51:47

We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek-V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux

AgilePilot: DRL-Based Drone Agent for Real-Time Motion Planning in Dynamic Environments by Leveraging Object Detection

Authors:Roohan Ahmed Khan, Valerii Serpiva, Demetros Aschalew, Aleksey Fedoseev, Dzmitry Tsetserukou
Date:2025-02-10 17:54:30

Autonomous drone navigation in dynamic environments remains a critical challenge, especially when dealing with unpredictable scenarios including fast-moving objects with rapidly changing goal positions. While traditional planners and classical optimisation methods have been extensively used to address this dynamic problem, they often face real-time, unpredictable changes that ultimately lead to sub-optimal performance in terms of adaptiveness and real-time decision making. In this work, we propose a novel motion planner, AgilePilot, based on Deep Reinforcement Learning (DRL) that is trained in dynamic conditions, coupled with real-time Computer Vision (CV) for object detection during flight. The training-to-deployment framework bridges the Sim2Real gap, leveraging sophisticated reward structures that promote both safety and agility depending upon environment conditions. The system can rapidly adapt to changing environments, while achieving a maximum speed of 3.0 m/s in real-world scenarios. In comparison, our approach outperforms classical algorithms such as an Artificial Potential Field (APF) based motion planner by 3 times, both in performance and in tracking accuracy of dynamic targets, by using velocity predictions, while exhibiting a 90% success rate in 75 conducted experiments. This work highlights the effectiveness of DRL in tackling real-time dynamic navigation challenges, offering intelligent safety and agility.

Habitizing Diffusion Planning for Efficient and Effective Decision Making

Authors:Haofei Lu, Yifei Shen, Dongsheng Li, Junliang Xing, Dongqi Han
Date:2025-02-10 12:40:32

Diffusion models have shown great promise in decision-making, an approach also known as diffusion planning. However, their slow inference speeds limit their potential for broader real-world applications. Here, we introduce Habi, a general framework that transforms powerful but slow diffusion planning models into fast decision-making models, mimicking the cognitive process in the brain by which costly goal-directed behavior gradually transitions to efficient habitual behavior with repetitive practice. Even using a laptop CPU, the habitized model can achieve an average 800+ Hz decision-making frequency (faster than previous diffusion planners by orders of magnitude) on the standard offline reinforcement learning benchmark D4RL, while maintaining comparable or even higher performance compared to its corresponding diffusion planner. Our work proposes a fresh perspective on leveraging powerful diffusion models for real-world decision-making tasks. We also provide robust evaluations and analysis, offering insights from both biological and engineering perspectives for efficient and effective decision-making.

Towards Bio-inspired Heuristically Accelerated Reinforcement Learning for Adaptive Underwater Multi-Agents Behaviour

Authors:Antoine Vivien, Thomas Chaffre, Matthew Stephenson, Eva Artusi, Paulo Santos, Benoit Clement, Karl Sammut
Date:2025-02-10 02:47:33

This paper describes the problem of coordinating an autonomous Multi-Agent System (MAS) that aims to solve the coverage planning problem in a complex environment. The considered applications are the detection and identification of objects of interest while covering an area. These tasks, which are highly relevant for space applications, are also of interest across various domains including the underwater context, which is the focus of this study. In this context, coverage planning is traditionally modelled as a Markov Decision Process (MDP) where a coordinated MAS, a swarm of heterogeneous autonomous underwater vehicles, is required to survey an area and search for objects. This MDP is associated with several challenges: environment uncertainties, communication constraints, and an ensemble of hazards, including time-varying and unpredictable changes in the underwater environment. Multi-agent reinforcement learning (MARL) algorithms can solve highly non-linear problems using deep neural networks and scale well with an increasing number of agents. Nevertheless, most of the current results in the underwater domain are limited to simulation due to the long learning time of MARL algorithms. For this reason, a novel strategy is introduced to accelerate convergence by incorporating biologically inspired heuristics to guide the policy during training. The particle swarm optimization (PSO) method, which is inspired by the behaviour of a group of animals, is selected as the heuristic. It allows the policy to explore the highest quality regions of the action and state spaces from the beginning of training, optimizing the exploration/exploitation trade-off. The resulting agent requires fewer interactions to reach optimal performance. The method is applied to the MSAC algorithm and evaluated for a 2D covering area mission in a continuous control environment.
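
For readers unfamiliar with the PSO heuristic named above, here is a minimal sketch of the textbook particle swarm update on a toy objective. It does not show how the authors couple PSO with the policy during training; the `sphere` objective, the inertia and acceleration coefficients, and the search bounds are all assumptions for illustration.

```python
import random

def sphere(x):
    """Toy objective (assumed): squared distance from the origin."""
    return sum(v * v for v in x)

def pso(objective, dim=2, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Textbook global-best PSO; returns (best position, best value, history)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best
    history = [gbest_val]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history

best, best_val, history = pso(sphere)
```

The `history` list records the swarm-wide best value after each iteration; by construction it never increases, which is the property that makes the swarm's best-known point usable as a guiding signal.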

Data efficient Robotic Object Throwing with Model-Based Reinforcement Learning

Authors:Niccolò Turcato, Giulio Giacomuzzo, Matteo Terreran, Davide Allegro, Ruggero Carli, Alberto Dalla Libera
Date:2025-02-08 14:43:42

Pick-and-place (PnP) operations, featuring object grasping and trajectory planning, are fundamental in industrial robotics applications. Despite many advancements in the field, PnP is limited by workspace constraints, reducing flexibility. Pick-and-throw (PnT) is a promising alternative where the robot throws objects to target locations, leveraging extrinsic resources like gravity to improve efficiency and expand the workspace. However, PnT execution is complex, requiring precise coordination of high-speed movements and object dynamics. Solutions to the PnT problem are categorized into analytical and learning-based approaches. Analytical methods focus on system modeling and trajectory generation but are time-consuming and offer limited generalization. Learning-based solutions, in particular Model-Free Reinforcement Learning (MFRL), offer automation and adaptability but require extensive interaction time. This paper introduces a Model-Based Reinforcement Learning (MBRL) framework, MC-PILOT, which combines data-driven modeling with policy optimization for efficient and accurate PnT tasks. MC-PILOT accounts for model uncertainties and release errors, demonstrating superior performance in simulations and real-world tests with a Franka Emika Panda manipulator. The proposed approach generalizes rapidly to new targets, offering advantages over analytical and Model-Free methods.

Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following

Authors:Vivek Myers, Bill Chunyuan Zheng, Anca Dragan, Kuan Fang, Sergey Levine
Date:2025-02-08 05:26:29

Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps together. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.

LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning

Authors:Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
Date:2025-02-08 05:26:02

Developing intelligent agents for long-term cooperation in dynamic open-world scenarios is a major challenge in multi-agent systems. Traditional Multi-agent Reinforcement Learning (MARL) frameworks like centralized training decentralized execution (CTDE) struggle with scalability and flexibility. They require centralized long-term planning, which is difficult without custom reward functions, and face challenges in processing multi-modal data. CTDE approaches also assume fixed cooperation strategies, making them impractical in dynamic environments where agents need to adapt and plan independently. To address decentralized multi-agent cooperation, we propose Decentralized Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in a novel Multi-agent Crafter environment. Our generative agents, powered by Large Language Models (LLMs), are more scalable than traditional MARL agents by leveraging external knowledge and language for long-term planning and reasoning. Instead of fully sharing information from all past experiences, DAMCS introduces a multi-modal memory system organized as a hierarchical knowledge graph and a structured communication protocol to optimize agent cooperation. This allows agents to reason from past interactions and share relevant information efficiently. Experiments on novel multi-agent open-world tasks show that DAMCS outperforms both MARL and LLM baselines in task efficiency and collaboration. Compared to single-agent scenarios, the two-agent scenario achieves the same goal with 63% fewer steps, and the six-agent scenario with 74% fewer steps, highlighting the importance of adaptive memory and structured communication in achieving long-term goals. We publicly release our project at: https://happyeureka.github.io/damcs.

Seasonal Station-Keeping of Short Duration High Altitude Balloons using Deep Reinforcement Learning

Authors:Tristan K. Schuler, Chinthan Prasad, Georgiy Kiselev, Donald Sofge
Date:2025-02-07 15:42:26

Station-keeping short-duration high-altitude balloons (HABs) in a region of interest is a challenging path-planning problem due to partially observable, complex, and dynamic wind flows. Deep reinforcement learning is a popular strategy for solving the station-keeping problem. A custom simulation environment was developed to train and evaluate Deep Q-Learning (DQN) for short-duration HAB agents. To train the agents on realistic winds, synthetic wind forecasts were generated from aggregated historical radiosonde data to apply horizontal kinematics to simulated agents. The synthetic forecasts were closely correlated with ECMWF ERA5 Reanalysis forecasts, providing a realistic simulated wind field and seasonal and altitudinal variances between the wind models. DQN HAB agents were then trained and evaluated across different seasonal months. To highlight differences and trends in months with vastly different wind fields, a Forecast Score algorithm was introduced to independently classify forecasts based on wind diversity, and trends between station-keeping success and the Forecast Score were evaluated across all seasons.

Bilevel Multi-Armed Bandit-Based Hierarchical Reinforcement Learning for Interaction-Aware Self-Driving at Unsignalized Intersections

Authors:Zengqi Peng, Yubin Wang, Lei Zheng, Jun Ma
Date:2025-02-06 10:50:59

In this work, we present BiM-ACPPO, a bilevel multi-armed bandit-based hierarchical reinforcement learning framework for interaction-aware decision-making and planning at unsignalized intersections. Essentially, it proactively takes the uncertainties associated with surrounding vehicles (SVs) into consideration, which encompass those stemming from the driver's intention, interactive behaviors, and the varying number of SVs. Intermediate decision variables are introduced to enable the high-level RL policy to provide an interaction-aware reference, for guiding low-level model predictive control (MPC) and further enhancing the generalization ability of the proposed framework. By leveraging the structured nature of self-driving at unsignalized intersections, the training problem of the RL policy is modeled as a bilevel curriculum learning task, which is addressed by the proposed Exp3.S-based BiMAB algorithm. It is noteworthy that the training curricula are dynamically adjusted, thereby facilitating the sample efficiency of the RL training process. Comparative experiments are conducted in the high-fidelity CARLA simulator, and the results indicate that our approach achieves superior performance compared to all baseline methods. Furthermore, experimental results in two new urban driving scenarios clearly demonstrate the commendable generalization performance of the proposed method.
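
To make the bandit-driven curriculum idea above more concrete, the sketch below implements plain Exp3, the adversarial bandit algorithm on which Exp3.S builds. It is not the authors' BiMAB algorithm and omits the weight-sharing term that distinguishes Exp3.S; the three "curricula" and their fixed rewards in `simulate` are hypothetical stand-ins for a learning-progress signal.

```python
import math
import random

class Exp3:
    """Plain Exp3 bandit (not Exp3.S); rewards are assumed to lie in [0, 1]."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the normalized weights with uniform exploration."""
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        """Sample an arm index according to the current probabilities."""
        probs = self.probabilities()
        return self.rng.choices(range(self.n_arms), weights=probs)[0]

    def update(self, arm, reward):
        """Importance-weighted exponential update for the pulled arm only."""
        p = self.probabilities()[arm]
        estimate = reward / p           # unbiased estimate of the arm's reward
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)

def simulate(rounds=500):
    """Hypothetical setup: curriculum 2 yields the most learning progress."""
    rewards = [0.1, 0.1, 0.9]           # assumed per-curriculum rewards
    bandit = Exp3(n_arms=len(rewards))
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, rewards[arm])
    return bandit.probabilities()

final_probs = simulate()
```

After the simulated rounds, the sampling distribution concentrates on the curriculum with the highest reward while the `gamma` term keeps every curriculum sampled occasionally, which is the behavior a dynamically adjusted curriculum relies on.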

Transforming Multimodal Models into Action Models for Radiotherapy

Authors:Matteo Ferrante, Alessandra Carosi, Rolando Maria D Angelillo, Nicola Toschi
Date:2025-02-06 09:51:28

Radiotherapy is a crucial cancer treatment that demands precise planning to balance tumor eradication and preservation of healthy tissue. Traditional treatment planning (TP) is iterative, time-consuming, and reliant on human expertise, which can potentially introduce variability and inefficiency. We propose a novel framework to transform a large multimodal foundation model (MLM) into an action model for TP using a few-shot reinforcement learning (RL) approach. Our method leverages the MLM's extensive pre-existing knowledge of physics, radiation, and anatomy, enhancing it through a few-shot learning process. This allows the model to iteratively improve treatment plans using a Monte Carlo simulator. Our results demonstrate that this method outperforms conventional RL-based approaches in both quality and efficiency, achieving higher reward scores and more optimal dose distributions in simulations on prostate cancer data. This proof-of-concept suggests a promising direction for integrating advanced AI models into clinical workflows, potentially enhancing the speed, quality, and standardization of radiotherapy treatment planning.

Illuminating Spaces: Deep Reinforcement Learning and Laser-Wall Partitioning for Architectural Layout Generation

Authors:Reza Kakooee, Benjamin Dillenburger
Date:2025-02-06 09:35:24

Space layout design (SLD), occurring in the early stages of the design process, nonetheless influences both the functionality and aesthetics of the ultimate architectural outcome. The complexity of SLD necessitates innovative approaches to efficiently explore vast solution spaces. While image-based generative AI has emerged as a potential solution, it often relies on pixel-based space composition methods that lack an intuitive representation of architectural processes. This paper leverages deep Reinforcement Learning (RL), as it offers a procedural approach that intuitively mimics the process of human designers. Effectively using RL for SLD requires an explorative space composing method to generate desirable design solutions. We introduce "laser-wall", a novel space partitioning method that conceptualizes walls as emitters of imaginary light beams to partition spaces. This approach bridges vector-based and pixel-based partitioning methods, offering both flexibility and exploratory power in generating diverse layouts. We present two planning strategies: one-shot planning, which generates entire layouts in a single pass, and dynamic planning, which allows for adaptive refinement by continuously transforming laser-walls. Additionally, we introduce on-light and off-light wall transformations for smooth and fast layout refinement, as well as identity-less and identity-full walls for versatile room assignment. We developed SpaceLayoutGym, an open-source OpenAI Gym compatible simulator for generating and evaluating space layouts. The RL agent processes the input design scenarios and generates solutions following a reward function that balances geometrical and topological requirements. Our results demonstrate that the RL-based laser-wall approach can generate diverse and functional space layouts that satisfy both geometric constraints and topological requirements and that it is architecturally intuitive.

Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning

Authors:Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
Date:2025-02-06 07:23:40

Advances in artificial intelligence (AI), including foundation models (FMs), are increasingly transforming human society, with smart cities driving the evolution of urban living. Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities. In particular, ride-hailing vehicles can effectively facilitate flexible data collection and contribute towards urban intelligence, despite resource limitations. Therefore, this work explores a promising scenario, where edge-assisted vehicles perform joint tasks of order serving and the emerging foundation model fine-tuning using various urban data. However, integrating the VCS AI task with the conventional order serving task is challenging, due to their inconsistent spatio-temporal characteristics: (i) the distributions of ride orders and data points of interest (PoIs) may not coincide geographically, both following a priori unknown patterns; (ii) they have distinct forms of temporal effects, i.e., prolonged waiting makes orders become instantly invalid while data with increased staleness gradually reduces its utility for model fine-tuning. To overcome these obstacles, we propose an online framework based on multi-agent reinforcement learning (MARL) with careful augmentation. A new quality-of-service (QoS) metric is designed to characterize and balance the utility of the two joint tasks, under the effects of varying data volumes and staleness. We also integrate graph neural networks (GNNs) with MARL to enhance state representations, capturing graph-structured, time-varying dependencies among vehicles and across locations. Extensive experiments on our testbed simulator, utilizing various real-world foundation model fine-tuning tasks and the New York City Taxi ride order dataset, demonstrate the advantage of our proposed method.

TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint

Authors:Haotian Lin, Pengcheng Wang, Jeff Schneider, Guanya Shi
Date:2025-02-05 19:08:42

Model-based reinforcement learning algorithms that combine model-based planning and a learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often result in persistent value overestimation. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data generation policy that is always bootstrapped by the planner and the learned policy prior. To mitigate such a mismatch in a minimalist way, we propose a policy regularization term reducing out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly in 61-DoF humanoid tasks. View qualitative results at https://darthutopian.github.io/tdmpc_square/.

Deep Reinforcement Learning-Based Optimization of Second-Life Battery Utilization in Electric Vehicles Charging Stations

Authors:Rouzbeh Haghighi, Ali Hassan, Van-Hai Bui, Akhtar Hussain, Wencong Su
Date:2025-02-05 17:50:53

The rapid rise in electric vehicle (EV) adoption presents significant challenges in managing the vast number of retired EV batteries. Research indicates that second-life batteries (SLBs) from EVs typically retain considerable residual capacity, offering extended utility. These batteries can be effectively repurposed for use in EV charging stations (EVCS), providing a cost-effective alternative to new batteries and reducing overall planning costs. Integrating battery energy storage systems (BESS) with SLBs into EVCS is a promising strategy to alleviate system overload. However, efficient operation of EVCS with integrated BESS is hindered by uncertainties such as fluctuating EV arrival and departure times and variable power prices from the grid. This paper presents a deep reinforcement learning (DRL)-based planning framework for EV charging stations with BESS, leveraging SLBs. We employ the soft actor-critic (SAC) approach, training the model on a year's worth of data to account for seasonal variations, including weekdays and holidays. A tailored reward function enables effective offline training, allowing real-time optimization of EVCS operations under uncertainty.

Conditional Prediction by Simulation for Automated Driving

Authors:Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller
Date:2025-02-05 15:44:06

Modular automated driving systems commonly handle prediction and planning as sequential, separate tasks, thereby prohibiting cooperative maneuvers. To enable cooperative planning, this work introduces a prediction model that models the conditional dependencies between trajectories. For this, predictions are generated by a microscopic traffic simulation, with the individual traffic participants being controlled by a realistic behavior model trained via Adversarial Inverse Reinforcement Learning. By assuming various candidate trajectories for the automated vehicle, we generate predictions conditioned on each of them. Furthermore, our approach allows the candidate trajectories to adapt dynamically during the prediction rollout. Several example scenarios are available at https://conditionalpredictionbysimulation.github.io/.

Policy Guided Tree Search for Enhanced LLM Reasoning

Authors:Yang Li
Date:2025-02-04 22:08:20

Despite their remarkable capabilities, large language models often struggle with tasks requiring complex reasoning and planning. While existing approaches like Chain-of-Thought prompting and tree search techniques show promise, they are limited by their reliance on predefined heuristics and computationally expensive exploration strategies. We propose Policy-Guided Tree Search (PGTS), a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search. Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs.

Deep Reinforcement Learning Enabled Persistent Surveillance with Energy-Aware UAV-UGV Systems for Disaster Management Applications

Authors:Md Safwan Mondal, Subramanian Ramasamy, Pranav Bhounsule
Date:2025-02-04 19:11:02

Integrating Unmanned Aerial Vehicles (UAVs) with Unmanned Ground Vehicles (UGVs) provides an effective solution for persistent surveillance in disaster management. UAVs excel at covering large areas rapidly, but their range is limited by battery capacity. UGVs, though slower, can carry larger batteries for extended missions. By using UGVs as mobile recharging stations, UAVs can extend mission duration through periodic refueling, leveraging the complementary strengths of both systems. To optimize this energy-aware UAV-UGV cooperative routing problem, we propose a planning framework that determines optimal routes and recharging points between a UAV and a UGV. Our solution employs a deep reinforcement learning (DRL) framework built on an encoder-decoder transformer architecture with multi-head attention mechanisms. This architecture enables the model to sequentially select actions for visiting mission points and coordinating recharging rendezvous between the UAV and UGV. The DRL model is trained to minimize the age periods (the time gap between consecutive visits) of mission points, ensuring effective surveillance. We evaluate the framework across various problem sizes and distributions, comparing its performance against heuristic methods and an existing learning-based model. Results show that our approach consistently outperforms these baselines in both solution quality and runtime. Additionally, we demonstrate the DRL policy's applicability in a real-world disaster scenario as a case study and explore its potential for online mission planning to handle dynamic changes. Adapting the DRL policy for priority-driven surveillance highlights the model's generalizability for real-time disaster response.
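
The "age period" quantity defined above (the time gap between consecutive visits to a mission point) can be computed as in the sketch below. The visit schedule is made-up data, and summarizing each point by its largest gap is an assumption; the abstract does not state how the authors aggregate age periods in their objective.

```python
def age_periods(visit_times):
    """Gaps between consecutive visits to a single mission point."""
    ordered = sorted(visit_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def max_age_per_point(schedule):
    """Largest gap per mission point (0.0 if a point was visited at most once)."""
    return {point: max(age_periods(times), default=0.0)
            for point, times in schedule.items()}

# Hypothetical visit times (in minutes) for three mission points.
schedule = {"A": [0.0, 12.0, 20.0], "B": [5.0, 30.0], "C": [8.0]}

print(age_periods(schedule["A"]))   # [12.0, 8.0]
print(max_age_per_point(schedule))  # {'A': 12.0, 'B': 25.0, 'C': 0.0}
```

In this toy schedule, point B has the longest unobserved interval, so a routing policy that minimizes age periods would be pushed to revisit B sooner.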

Sequential Multi-objective Multi-agent Reinforcement Learning Approach for Predictive Maintenance

Authors:Yan Chen, Cheng Liu
Date:2025-02-04 07:42:58

Existing predictive maintenance (PdM) methods typically focus solely on whether to replace system components without considering the costs incurred by inspection. However, a well-considered approach should be able to minimize Remaining Useful Life (RUL) at engine replacement while maximizing inspection interval. To achieve this, multi-agent reinforcement learning (MARL) can be introduced. However, due to the sequential and mutually constraining nature of these two objectives, conventional MARL is not applicable. Therefore, this paper introduces a novel framework and develops a Sequential Multi-objective Multi-agent Proximal Policy Optimization (SMOMA-PPO) algorithm. Furthermore, to provide comprehensive and effective degradation information to RL agents, we also employ a Gated Recurrent Unit (GRU), quantile regression, and probability distribution fitting to develop a GRU-based RUL Prediction (GRP) model. Experiments demonstrate that the GRP method significantly improves the accuracy of RUL predictions in the later stages of system operation compared to existing methods. When incorporating its output into SMOMA-PPO, we achieve at least a 15% reduction in average RUL without unscheduled replacements (UR), nearly a 10% increase in inspection interval, and an overall decrease in maintenance costs. Importantly, our approach offers a new perspective for addressing multi-objective maintenance planning with sequential constraints, effectively enhancing system reliability and reducing maintenance expenses.

RAPID: Robust and Agile Planner Using Inverse Reinforcement Learning for Vision-Based Drone Navigation

Authors:Minwoo Kim, Geunsik Bae, Jinwoo Lee, Woojae Shin, Changseung Kim, Myong-Yol Choi, Heejung Shin, Hyondong Oh
Date:2025-02-04 06:42:08

This paper introduces a learning-based visual planner for agile drone flight in cluttered environments. The proposed planner generates collision-free waypoints in milliseconds, enabling drones to perform agile maneuvers in complex environments without building separate perception, mapping, and planning modules. Learning-based methods, such as behavior cloning (BC) and reinforcement learning (RL), demonstrate promising performance in visual navigation but still face inherent limitations. BC is susceptible to compounding errors due to limited expert imitation, while RL struggles with reward function design and sample inefficiency. To address these limitations, this paper proposes an inverse reinforcement learning (IRL)-based framework for high-speed visual navigation. By leveraging IRL, it is possible to reduce the number of interactions with simulation environments and improve capability to deal with high-dimensional spaces while preserving the robustness of RL policies. A motion primitive-based path planning algorithm collects an expert dataset with privileged map data from diverse environments, ensuring comprehensive scenario coverage. By leveraging both the acquired expert and learner dataset gathered from the agent's interactions with the simulation environments, a robust reward function and policy are learned across diverse states. While the proposed method is trained in a simulation environment only, it can be directly applied to real-world scenarios without additional training or tuning. The performance of the proposed method is validated in both simulation and real-world environments, including forests and various structures. The trained policy achieves an average speed of 7 m/s and a maximum speed of 8.8 m/s in real flight experiments. To the best of our knowledge, this is the first work to successfully apply an IRL framework for high-speed visual navigation of drones.

DHP: Discrete Hierarchical Planning for Hierarchical Reinforcement Learning Agents

Authors:Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
Date:2025-02-04 03:05:55

In this paper, we address the challenge of long-horizon visual planning tasks using Hierarchical Reinforcement Learning (HRL). Our key contribution is a Discrete Hierarchical Planning (DHP) method, an alternative to traditional distance-based approaches. We provide theoretical foundations for the method and demonstrate its effectiveness through extensive empirical evaluations. Our agent recursively predicts subgoals in the context of a long-term goal and receives discrete rewards for constructing plans as compositions of abstract actions. The method introduces a novel advantage estimation strategy for tree trajectories, which inherently encourages shorter plans and enables generalization beyond the maximum tree depth. The learned policy function allows the agent to plan efficiently, requiring only $\log N$ computational steps, making re-planning highly efficient. The agent, based on a soft actor-critic (SAC) framework, is trained using on-policy imagination data. Additionally, we propose a novel exploration strategy that enables the agent to generate relevant training examples for the planning modules. We evaluate our method on long-horizon visual planning tasks in a 25-room environment, where it significantly outperforms previous benchmarks in success rate and average episode length. Furthermore, an ablation study highlights the individual contributions of key modules to the overall performance.

Embrace Collisions: Humanoid Shadowing for Deployable Contact-Agnostics Motions

Authors:Ziwen Zhuang, Hang Zhao
Date:2025-02-03 15:57:54

Previous humanoid robot research treats the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, we humans use all body parts to interact with the world, e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment using body parts other than feet and hands brings significant challenges in both model-predictive control and reinforcement learning-based methods. An unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real time. The success of the zero-shot sim-to-real reinforcement learning method for humanoids heavily depends on the acceleration provided by GPU-based rigid-body physics simulators and on simplified collision detection. The lack of extreme torso movement in prior humanoid research makes all other components, such as termination conditions, motion commands, and reward designs, non-trivial to design. To address these potential challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot's motor action in real time. Using a GPU-accelerated rigid-body simulator, we train a humanoid whole-body control policy that follows the high-level motion command in the real world in real time, even with stochastic contacts, extremely large robot base rotations, and not-so-feasible motion commands. More details at https://project-instinct.github.io

Resilient UAV Trajectory Planning via Few-Shot Meta-Offline Reinforcement Learning

Authors:Eslam Eldeeb, Hirley Alves
Date:2025-02-03 11:39:12

Reinforcement learning (RL) has emerged as a promising component of future 5G-beyond and 6G systems. Its main advantage lies in its robust model-free decision-making in complex and large-dimension wireless environments. However, most existing RL frameworks rely on online interaction with the environment, which might not be feasible due to safety and cost concerns. Another problem with online RL is that the designed algorithm does not scale to dynamic or new environments. This work proposes a novel, resilient, few-shot meta-offline RL algorithm combining offline RL using conservative Q-learning (CQL) and meta-learning using model-agnostic meta-learning (MAML). The proposed algorithm can train RL models using static offline datasets without any online interaction with the environments. In addition, with the aid of MAML, the proposed model can be scaled up to new unseen environments. We showcase the proposed algorithm for optimizing an unmanned aerial vehicle (UAV)'s trajectory and scheduling policy to minimize the age-of-information (AoI) and transmission power of limited-power devices. Numerical results show that the proposed few-shot meta-offline RL algorithm converges faster than baseline schemes, such as deep Q-networks and CQL. In addition, it is the only algorithm that can achieve optimal joint AoI and transmission power using an offline dataset with few shots of data points and is resilient to network failures due to unprecedented environmental changes.
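
As an illustration of the offline component mentioned above, the following is a minimal NumPy sketch of the discrete-action conservative Q-learning objective in its commonly used log-sum-exp form. It is not the paper's implementation: the function name `cql_loss`, the weight `alpha`, and the toy inputs are assumptions, and the MAML outer loop is omitted.

```python
import numpy as np

def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """Discrete-action CQL loss (log-sum-exp variant), as a sketch.

    q_values:   (B, A) array of Q(s, a) for every action a
    actions:    (B,) integer actions stored in the offline dataset
    td_targets: (B,) bootstrapped targets, treated as constants
    alpha:      weight of the conservative penalty (hypothetical default)
    """
    rows = np.arange(q_values.shape[0])
    q_data = q_values[rows, actions]

    # Numerically stable log-sum-exp over actions.
    q_max = q_values.max(axis=1, keepdims=True)
    lse = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))

    # Push down Q on all actions, push up Q on dataset actions.
    conservative = (lse - q_data).mean()
    # Standard squared TD error on the dataset actions.
    td_error = 0.5 * ((q_data - td_targets) ** 2).mean()
    return td_error + alpha * conservative

# Toy batch: two states, three actions, all Q-values zero.
toy_loss = cql_loss(np.zeros((2, 3)), np.array([0, 1]), np.zeros(2))
```

The penalty term is non-negative because the log-sum-exp is never smaller than the Q-value of any single action, which is what discourages overestimating actions absent from the dataset.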

Actor Critic with Experience Replay-based automatic treatment planning for prostate cancer intensity modulated radiotherapy

Authors:Md Mainul Abrar, Parvat Sapkota, Damon Sprouts, Xun Jia, Yujie Chi
Date:2025-02-01 07:09:40

Background: Real-time treatment planning in IMRT is challenging due to complex beam interactions. AI has improved automation, but existing models require large, high-quality datasets and lack universal applicability. Deep reinforcement learning (DRL) offers a promising alternative by mimicking human trial-and-error planning. Purpose: Develop a stochastic policy-based DRL agent for automatic treatment planning with efficient training, broad applicability, and robustness against adversarial attacks using Fast Gradient Sign Method (FGSM). Methods: Using the Actor-Critic with Experience Replay (ACER) architecture, the agent tunes treatment planning parameters (TPPs) in inverse planning. Training is based on prostate cancer IMRT cases, using dose-volume histograms (DVHs) as input. The model is trained on a single patient case, validated on two independent cases, and tested on 300+ plans across three datasets. Plan quality is assessed using ProKnow scores, and robustness is tested against adversarial attacks. Results: Despite training on a single case, the model generalizes well. Before ACER-based planning, the mean plan score was 6.20 ± 1.84; after, 93.09% of cases achieved a perfect score of 9, with a mean of 8.93 ± 0.27. The agent effectively prioritizes optimal TPP tuning and remains robust against adversarial attacks. Conclusions: The ACER-based DRL agent enables efficient, high-quality treatment planning in prostate cancer IMRT, demonstrating strong generalizability and robustness.

Model-Free RL Agents Demonstrate System 1-Like Intentionality

Authors:Hal Ashton, Matija Franklin
Date:2025-01-30 12:21:50

This paper argues that model-free reinforcement learning (RL) agents, while lacking explicit planning mechanisms, exhibit behaviours that can be analogised to System 1 ("thinking fast") processes in human cognition. Unlike model-based RL agents, which operate akin to System 2 ("thinking slow") reasoning by leveraging internal representations for planning, model-free agents react to environmental stimuli without anticipatory modelling. We propose a novel framework linking the dichotomy of System 1 and System 2 to the distinction between model-free and model-based RL. This framing challenges the prevailing assumption that intentionality and purposeful behaviour require planning, suggesting instead that intentionality can manifest in the structured, reactive behaviours of model-free agents. By drawing on interdisciplinary insights from cognitive psychology, legal theory, and experimental jurisprudence, we explore the implications of this perspective for attributing responsibility and ensuring AI safety. These insights advocate for a broader, contextually informed interpretation of intentionality in RL systems, with implications for their ethical deployment and regulation.

Accelerated DC loadflow solver for topology optimization

Authors:Nico Westerbeck, Joost van Dijk, Jan Viebahn, Christian Merz, Dirk Witthaut
Date:2025-01-29 09:57:53

We present a massively parallel solver that accelerates DC loadflow computations for power grid topology optimization tasks. Our approach leverages low-rank updates of the Power Transfer Distribution Factors (PTDFs) to represent substation splits, line outages, and reconfigurations without ever refactorizing the system. Furthermore, we implement the core routines on Graphics Processing Units (GPUs), thereby exploiting their high-throughput architecture for linear algebra. A two-level decomposition separates changes in branch topology from changes in nodal injections, enabling additional speed-ups by an in-the-loop brute force search over injection variations at minimal additional cost. We demonstrate billion-loadflow-per-second performance on power grids of varying sizes in workload settings which are typical for gradient-free topology optimization such as Reinforcement Learning or Quality Diversity methods. While adopting the DC approximation sacrifices some accuracy and prohibits the computation of voltage magnitudes, we show that this sacrifice unlocks new scales of computational feasibility, offering a powerful tool for large-scale grid planning and operational topology optimization.
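
To make the idea of updating PTDFs without refactorizing concrete, here is a small NumPy sketch of the classical rank-1 line-outage update (line outage distribution factors). It is an assumption that this textbook formula resembles what the solver does internally. The helper names, the 4-bus network, and the susceptances are hypothetical, and the sketch runs on the CPU for a single outage only.

```python
import numpy as np

def ptdf_matrix(lines, b, n_bus, slack=0):
    """Return (PTDF, reduced incidence matrix) for a DC network.

    lines: list of (from_bus, to_bus); b: line susceptances, shape (L,).
    """
    A = np.zeros((len(lines), n_bus))
    for l, (i, j) in enumerate(lines):
        A[l, i], A[l, j] = 1.0, -1.0
    keep = [k for k in range(n_bus) if k != slack]
    A_r = A[:, keep]                       # drop the slack-bus column
    B_r = A_r.T @ (b[:, None] * A_r)       # reduced nodal susceptance matrix
    ptdf = (b[:, None] * A_r) @ np.linalg.inv(B_r)
    return ptdf, A_r

def flows_after_outage(ptdf, A_r, p_r, k):
    """Post-outage flows for line k via a rank-1 update of the base case."""
    f = ptdf @ p_r                         # base-case line flows
    h = ptdf @ A_r[k]                      # response of each line to a transfer across line k
    f_new = f + h * f[k] / (1.0 - h[k])    # valid while line k is not a bridge
    f_new[k] = 0.0                         # the outaged line carries nothing
    return f_new

# Hypothetical 4-bus, 5-line test network; bus 0 is the slack bus.
lines = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
b = np.array([1.0, 2.0, 1.5, 1.0, 2.5])
p_r = np.array([1.0, -0.5, -1.5])          # injections at buses 1..3
ptdf, A_r = ptdf_matrix(lines, b, n_bus=4)
outaged = 2
fast = flows_after_outage(ptdf, A_r, p_r, outaged)
```

The update reuses the base-case PTDF and costs one matrix-vector product per outage, which is the kind of operation that batches well on a GPU. The denominator becomes zero when the outaged line is a bridge, so a real implementation would need to detect islanding separately.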

On Rollouts in Model-Based Reinforcement Learning

Authors:Bernd Frauenknecht, Devdutt Subhasish, Friedrich Solowjow, Sebastian Trimpe
Date:2025-01-28 13:02:52

Model-based reinforcement learning (MBRL) seeks to enhance data efficiency by learning a model of the environment and generating synthetic rollouts from it. However, accumulated model errors during these rollouts can distort the data distribution, negatively impacting policy learning and hindering long-term planning. Thus, the accumulation of model errors is a key bottleneck in current MBRL methods. We propose Infoprop, a model-based rollout mechanism that separates aleatoric from epistemic model uncertainty and reduces the influence of the latter on the data distribution. Further, Infoprop keeps track of accumulated model errors along a model rollout and provides termination criteria to limit data corruption. We demonstrate the capabilities of Infoprop in the Infoprop-Dyna algorithm, reporting state-of-the-art performance in Dyna-style MBRL on common MuJoCo benchmark tasks while substantially increasing rollout length and data quality.

Towards General-Purpose Model-Free Reinforcement Learning

Authors:Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat
Date:2025-01-27 15:36:37

Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.

Multi-Agent Meta-Offline Reinforcement Learning for Timely UAV Path Planning and Data Collection

Authors:Eslam Eldeeb, Hirley Alves
Date:2025-01-27 14:47:19

Multi-agent reinforcement learning (MARL) has been widely adopted in high-performance computing and complex data-driven decision-making in the wireless domain. However, conventional MARL schemes face many obstacles in real-world scenarios. First, most MARL algorithms are online, which might be unsafe and impractical. Second, MARL algorithms are environment-specific, meaning network configuration changes require model retraining. This letter proposes a novel meta-offline MARL algorithm that combines conservative Q-learning (CQL) and model-agnostic meta-learning (MAML). CQL enables offline training by leveraging pre-collected datasets, while MAML ensures scalability and adaptability to dynamic network configurations and objectives. We propose two algorithm variants: independent training (M-I-MARL) and centralized training decentralized execution (M-CTDE-MARL). Simulation results show that the proposed algorithm outperforms conventional schemes, especially the CTDE approach, which achieves 50% faster convergence in dynamic scenarios than the benchmarks. The proposed framework enhances scalability, robustness, and adaptability in wireless communication systems by optimizing UAV trajectories and scheduling policies.

An Adaptable Budget Planner for Enhancing Budget-Constrained Auto-Bidding in Online Advertising

Authors:Zhijian Duan, Yusen Huo, Tianyu Wang, Zhilin Zhang, Yeshu Li, Chuan Yu, Jian Xu, Bo Zheng, Xiaotie Deng
Date:2025-01-26 08:00:23

In online advertising, advertisers commonly utilize auto-bidding services to bid for impression opportunities. A typical objective of the auto-bidder is to optimize the advertiser's cumulative value of winning impressions within specified budget constraints. However, such a problem is challenging due to the complex bidding environment faced by diverse advertisers. To address this challenge, we introduce ABPlanner, a few-shot adaptable budget planner designed to improve budget-constrained auto-bidding. ABPlanner is based on a hierarchical bidding framework that decomposes the bidding process into shorter, manageable stages. Within this framework, ABPlanner allocates the budget across all stages, allowing a low-level auto-bidder to bid based on the budget allocation plan. The adaptability of ABPlanner is achieved through a sequential decision-making approach, inspired by in-context reinforcement learning. For each advertiser, ABPlanner adjusts the budget allocation plan episode by episode, using data from previous episodes as a prompt for current decisions. This enables ABPlanner to quickly adapt to different advertisers with few-shot data, providing a sample-efficient solution. Extensive simulation experiments and real-world A/B testing validate the effectiveness of ABPlanner, demonstrating its capability to enhance the cumulative value achieved by auto-bidders.

Towards Efficient Multi-Objective Optimisation for Real-World Power Grid Topology Control

Authors:Yassine El Manyari, Anton R. Fuxjager, Stefan Zahlner, Joost Van Dijk, Alberto Castagna, Davide Barbieri, Jan Viebahn, Marcel Wasserer
Date:2025-01-24 21:40:19

Power grid operators face increasing difficulties in the control room as the increase in energy demand and the shift to renewable energy introduce new complexities in managing congestion and maintaining a stable supply. Effective grid topology control requires advanced tools capable of handling multi-objective trade-offs. While Reinforcement Learning (RL) offers a promising framework for tackling such challenges, existing Multi-Objective Reinforcement Learning (MORL) approaches fail to scale to the large state and action spaces inherent in real-world grid operations. Here we present a two-phase, efficient and scalable Multi-Objective Optimisation (MOO) method designed for grid topology control, combining an efficient RL learning phase with a rapid planning phase to generate day-ahead plans for unseen scenarios. We validate our approach using historical data from TenneT, a European Transmission System Operator (TSO), demonstrating minimal deployment time, generating day-ahead plans within 4-7 minutes with strong performance. These results underline the potential of our scalable method to support real-world power grid management, offering a practical, computationally efficient, and time-effective tool for operational planning. Based on current congestion costs and inefficiencies in grid operations, adopting our approach by TSOs could potentially save millions of euros annually, providing a compelling economic incentive for its integration in the control room.

Breaking the Pre-Planning Barrier: Real-Time Adaptive Coordination of Mission and Charging UAVs Using Graph Reinforcement Learning

Authors:Yuhan Hu, Yirong Sun, Yanjun Chen, Xinghao Chen
Date:2025-01-24 13:42:00

Unmanned Aerial Vehicles (UAVs) are pivotal in applications such as search and rescue and environmental monitoring, excelling in intelligent perception tasks. However, their limited battery capacity hinders long-duration and long-distance missions. Charging UAVs (CUAVs) offer a potential solution by recharging mission UAVs (MUAVs), but existing methods rely on impractical pre-planned routes, failing to enable organic cooperation and limiting mission efficiency. We introduce a novel multi-agent deep reinforcement learning model named Heterogeneous Graph Attention Multi-agent Deep Deterministic Policy Gradient (HGAM), designed to dynamically coordinate MUAVs and CUAVs. This approach maximizes data collection, geographical fairness, and energy efficiency by allowing UAVs to adapt their routes in real time to current task demands and environmental conditions without pre-planning. Our model uses heterogeneous graph attention networks (GATs) to represent heterogeneous agents and facilitate efficient information exchange. It operates within an actor-critic framework. Simulation results show that our model significantly improves cooperation among heterogeneous UAVs, outperforming existing methods in several metrics, including data collection rate and charging efficiency.

MARL-OT: Multi-Agent Reinforcement Learning Guided Online Fuzzing to Detect Safety Violation in Autonomous Driving Systems

Authors:Linfeng Liang, Xi Zheng
Date:2025-01-24 12:34:04

Autonomous Driving Systems (ADSs) are safety-critical, as real-world safety violations can result in significant losses. Rigorous testing is essential before deployment, with simulation testing playing a key role. However, ADSs are typically complex, consisting of multiple modules such as perception and planning, or well-trained end-to-end autonomous driving systems. Offline methods, such as the Genetic Algorithm (GA), can only generate predefined trajectories for dynamics, which struggle to cause safety violations for ADSs rapidly and efficiently in different scenarios due to their evolutionary nature. Online methods, such as single-agent reinforcement learning (RL), can quickly adjust the dynamics' trajectory online to adapt to different scenarios, but they struggle to capture complex corner cases of ADS arising from the intricate interplay among multiple vehicles. Multi-agent reinforcement learning (MARL) has a strong ability in cooperative tasks. On the other hand, it faces its own challenges, particularly with convergence. This paper introduces MARL-OT, a scalable framework that leverages MARL to detect safety violations of ADS resulting from surrounding vehicles' cooperation. MARL-OT employs MARL for high-level guidance, triggering various dangerous scenarios for the rule-based online fuzzer to explore potential safety violations of ADS, thereby generating dynamic, realistic safety violation scenarios. Our approach improves the detected safety violation rate by up to 136.2% compared to the state-of-the-art (SOTA) testing technique.

Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight

Authors:Angel Romero, Ashwin Shenai, Ismail Geles, Elie Aljalbout, Davide Scaramuzza
Date:2025-01-24 10:24:39

Autonomous drone racing has risen as a challenging robotic benchmark for testing the limits of learning, perception, planning, and control. Expert human pilots are able to agilely fly a drone through a race track by mapping the real-time feed from a single onboard camera directly to control commands. Recent works in autonomous drone racing attempting direct pixel-to-commands control policies (without explicit state estimation) have relied on either intermediate representations that simplify the observation space or performed extensive bootstrapping using Imitation Learning (IL). This paper introduces an approach that learns policies from scratch, allowing a quadrotor to autonomously navigate a race track by directly mapping raw onboard camera pixels to control commands, just as human pilots do. By leveraging model-based reinforcement learning (RL), specifically DreamerV3, we train visuomotor policies capable of agile flight through a race track using only raw pixel observations. While model-free RL methods such as PPO struggle to learn under these conditions, DreamerV3 efficiently acquires complex visuomotor behaviors. Moreover, because our policies learn directly from pixel inputs, the perception-aware reward term employed in previous RL approaches to guide the training process is no longer needed. Our experiments demonstrate in both simulation and real-world flight how the proposed approach can be deployed on agile quadrotors. This approach advances the frontier of vision-based autonomous flight and shows that model-based RL is a promising direction for real-world robotics.

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Authors:Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev
Date:2025-01-22 20:08:53

Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.

Attention-Driven Hierarchical Reinforcement Learning with Particle Filtering for Source Localization in Dynamic Fields

Authors:Yiwei Shi, Mengyue Yang, Qi Zhang, Weinan Zhang, Cunjia Liu, Weiru Liu
Date:2025-01-22 18:45:29

In many real-world scenarios, such as gas leak detection or environmental pollutant tracking, solving the Inverse Source Localization and Characterization problem involves navigating complex, dynamic fields with sparse and noisy observations. Traditional methods face significant challenges, including partial observability, temporal and spatial dynamics, out-of-distribution generalization, and reward sparsity. To address these issues, we propose a hierarchical framework that integrates Bayesian inference and reinforcement learning. The framework leverages an attention-enhanced particle filtering mechanism for efficient and accurate belief updates, and incorporates two complementary execution strategies: Attention Particle Filtering Planning and Attention Particle Filtering Reinforcement Learning. These approaches optimize exploration and adaptation under uncertainty. Theoretical analysis proves the convergence of the attention-enhanced particle filter, while extensive experiments across diverse scenarios validate the framework's superior accuracy, adaptability, and computational efficiency. Our results highlight the framework's potential for broad applications in dynamic field estimation tasks.

AdaWM: Adaptive World Model based Planning for Autonomous Driving

Authors:Hang Wang, Xin Ye, Feng Tao, Chenbin Pan, Abhirup Mallik, Burhaneddin Yaman, Liu Ren, Junshan Zhang
Date:2025-01-22 18:34:51

World model based reinforcement learning (RL) has emerged as a promising approach for autonomous driving, which learns a latent dynamics model and uses it to train a planning policy. To speed up the learning process, the pretrain-finetune paradigm is often used, where online RL is initialized by a pretrained model and a policy learned offline. However, naively performing such initialization in RL may result in dramatic performance degradation during the online interactions in the new task. To tackle this challenge, we first analyze the performance degradation and identify two primary root causes therein: the mismatch of the planning policy and the mismatch of the dynamics model, due to distribution shift. We further analyze the effects of these factors on performance degradation during finetuning, and our findings reveal that the choice of finetuning strategies plays a pivotal role in mitigating these effects. We then introduce AdaWM, an Adaptive World Model based planning method, featuring two key steps: (a) mismatch identification, which quantifies the mismatches and informs the finetuning strategy, and (b) alignment-driven finetuning, which selectively updates either the policy or the model as needed using efficient low-rank updates. Extensive experiments on the challenging CARLA driving tasks demonstrate that AdaWM significantly improves the finetuning process, resulting in more robust and efficient performance in autonomous driving systems.

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Authors:Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
Date:2025-01-22 16:53:08

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes including 2-step environments with LLMs representing delegated oversight and encoded reasoning and longer-horizon gridworld environments representing sensor tampering.
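
The contrast between ordinary RL and the myopic-plus-approval objective described above can be sketched with a two-step toy problem. Everything in this example is hypothetical (the action names, rewards, and approval scores are invented for illustration), and it reflects only the abstract's description, not the paper's training setup.

```python
# Toy two-step problem: the first action determines the second-step reward.
# "setup_hack" prepares a second step that collects a large but undesired reward.
GAMMA = 1.0

step1_reward = {"honest": 0.0, "setup_hack": 0.0}
step2_reward = {"honest": 1.0, "setup_hack": 10.0}   # reward as measured, hack included
approval = {"honest": 1.0, "setup_hack": 0.0}        # overseer's far-sighted judgment

def ordinary_rl_value(action):
    """Ordinary RL credits the first action with the downstream reward."""
    return step1_reward[action] + GAMMA * step2_reward[action]

def mona_value(action):
    """Myopic objective: immediate reward plus approval, no downstream credit."""
    return step1_reward[action] + approval[action]

ordinary_choice = max(step1_reward, key=ordinary_rl_value)
mona_choice = max(step1_reward, key=mona_value)
```

In this toy setting the ordinary objective prefers the action that sets up the hack, because the later reward flows back to it, while the myopic objective prefers the approved action. The approval signal only needs to judge whether the first step looks like a good idea; it does not need to detect the hack itself.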

Reinforcement Learning Constrained Beam Search for Parameter Optimization of Paper Drying Under Flexible Constraints

Authors:Siyuan Chen, Hanshen Yu, Jamal Yagoobi, Chenhui Shao
Date:2025-01-21 23:16:19

Existing approaches to enforcing design constraints in Reinforcement Learning (RL) applications often rely on training-time penalties in the reward function or training/inference-time invalid action masking, but these methods either cannot be modified after training, or are limited in the types of constraints that can be implemented. To address this limitation, we propose Reinforcement Learning Constrained Beam Search (RLCBS) for inference-time refinement in combinatorial optimization problems. This method respects flexible, inference-time constraints that support exclusion of invalid actions and forced inclusion of desired actions, and employs beam search to maximize sequence probability for more sensible constraint incorporation. RLCBS is extensible to RL-based planning and optimization problems that do not require real-time solution, and we apply the method to optimize process parameters for a novel modular testbed for paper drying. An RL agent is trained to minimize energy consumption across varying machine speed levels by generating optimal dryer module and air supply temperature configurations. Our results demonstrate that RLCBS outperforms NSGA-II under complex design constraints on drying module configurations at inference-time, while providing a 2.58-fold or higher speed improvement.
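
A minimal sketch of inference-time constrained beam search over a policy's action probabilities is given below, supporting exclusion of banned actions and forced inclusion of required ones. The function name, the constraint interface, and the fixed toy policy are assumptions for illustration and are not taken from the paper.

```python
import math

def constrained_beam_search(step_log_probs, length, beam_width,
                            banned=frozenset(), required=frozenset()):
    """Return the (sequence, log-probability) with the highest score found.

    step_log_probs(prefix) -> dict mapping action -> log-probability.
    banned:   actions that may never be selected.
    required: actions that must each appear at least once.
    """
    beams = [((), 0.0)]
    for t in range(length):
        steps_left_after = length - t - 1
        candidates = []
        for seq, score in beams:
            missing = required - set(seq)
            for action, log_p in step_log_probs(seq).items():
                if action in banned:
                    continue
                # Prune prefixes that can no longer fit all required actions.
                if len(missing - {action}) > steps_left_after:
                    continue
                candidates.append((seq + (action,), score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0] if beams else None

def toy_policy(prefix):
    """Hypothetical stand-in for a trained RL policy; ignores the prefix."""
    return {"A": math.log(0.6), "B": math.log(0.3), "C": math.log(0.1)}

best_seq, best_score = constrained_beam_search(
    toy_policy, length=3, beam_width=3,
    banned=frozenset({"A"}), required=frozenset({"C"}))
```

Because the constraints are applied only while decoding, they can be changed after training without touching the policy, which is the property the abstract emphasizes.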

A Survey of World Models for Autonomous Driving

Authors:Tuo Feng, Wenguan Wang, Yi Yang
Date:2025-01-20 04:00:02

Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling, fundamentally transforming how vehicles interpret dynamic scenes and execute safe decision-making. In particular, world models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving, proposing a three-tiered taxonomy: 1) Generation of Future Physical World, covering image-, BEV-, OG-, and PC-based generation methods that enhance scene evolution modeling through diffusion models and 4D occupancy forecasting; 2) Behavior Planning for Intelligent Agents, combining rule-driven and learning-based paradigms with cost map optimization and reinforcement learning for trajectory generation in complex traffic conditions; 3) Interaction Between Prediction and Planning, achieving multi-agent collaborative decision-making through latent space diffusion and memory-augmented architectures. The study further analyzes training paradigms including self-supervised learning, multimodal pretraining, and generative data augmentation, while evaluating world models' performance in scene understanding and motion prediction tasks. Future research must address key challenges in self-supervised representation learning, long-tail scenario generation, and multimodal fusion to advance the practical deployment of world models in complex urban environments. Overall, our comprehensive analysis provides a theoretical framework and technical roadmap for harnessing the transformative potential of world models in advancing safe and reliable autonomous driving solutions.

Enhancing UAV Path Planning Efficiency Through Accelerated Learning

Authors:Joseanne Viana, Boris Galkin, Lester Ho, Holger Claussen
Date:2025-01-17 12:05:24

Unmanned Aerial Vehicles (UAVs) are increasingly essential in various fields such as surveillance, reconnaissance, and telecommunications. This study aims to develop a learning algorithm for the path planning of UAV wireless communication relays, which can reduce storage requirements and accelerate Deep Reinforcement Learning (DRL) convergence. Assuming the system possesses terrain maps of the area and can estimate user locations using localization algorithms or direct GPS reporting, it can input these parameters into the learning algorithms to achieve optimized path planning performance. However, higher resolution terrain maps are necessary to extract topological information such as terrain height, object distances, and signal blockages. This requirement increases memory and storage demands on UAVs while also lengthening convergence times in DRL algorithms. Similarly, defining the telecommunication coverage map in UAV wireless communication relays using these terrain maps and user position estimations demands higher memory and storage utilization for the learning path planning algorithms. Our approach reduces path planning training time by applying a dimensionality reduction technique based on Principal Component Analysis (PCA), sample combination, Prioritized Experience Replay (PER), and the combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE) loss calculations in the coverage map estimates, thereby enhancing a Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. The proposed solution reduces the convergence episodes needed for basic training by approximately four times compared to the traditional TD3.
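
Two of the ingredients named above can be sketched in a few lines of NumPy: a loss that mixes MSE and MAE, and the standard proportional sampling rule used in Prioritized Experience Replay. The mixing weight `weight` and the exponent `alpha` are assumed parameters; the abstract does not state how the two losses are combined or which PER settings are used.

```python
import numpy as np

def combined_loss(pred, target, weight=0.5):
    """Weighted mix of MSE and MAE (the weighting scheme is an assumption)."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    return weight * mse + (1.0 - weight) * mae

def per_probabilities(priorities, alpha=0.6):
    """Proportional PER: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = np.asarray(priorities, dtype=float) ** alpha
    return scaled / scaled.sum()

loss = combined_loss([1.0, 3.0], [0.0, 0.0])          # MSE = 5, MAE = 2
probs = per_probabilities([1.0, 1.0, 2.0], alpha=1.0)
```

The MAE term grows only linearly with the error, so the mix is less dominated by outliers in the coverage-map estimate than pure MSE, while the MSE term keeps gradients informative near zero error.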

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Authors:Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
Date:2025-01-16 18:59:10

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.

Interoceptive Robots for Convergent Shared Control in Collaborative Construction Work

Authors:Xiaoshan Zhou, Carol C. Menassa, Vineet R. Kamat
Date:2025-01-16 04:50:15

Building autonomous mobile robots (AMRs) with optimized efficiency and adaptive capabilities, able to respond to changing task demands and dynamic environments, is a strongly desired goal for advancing construction robotics. Such robots can play a critical role in enabling automation, reducing operational carbon footprints, and supporting modular construction processes. Inspired by the adaptive autonomy of living organisms, we introduce interoception, which centers on the robot's internal state representation, as a foundation for developing self-reflection and conscious learning to enable continual learning and adaptability in robotic agents. In this paper, we factorize internal state variables and mathematical properties as "cognitive dissonance" in shared control paradigms, where human interventions occasionally occur. We offer a new perspective on how interoception can help build adaptive motion planning in AMRs by integrating the legacy of heuristic costs from grid/graph-based algorithms with recent advances in neuroscience and reinforcement learning. Declarative and procedural knowledge extracted from human semantic inputs is encoded into a hypergraph model that overlaps with the spatial configuration of onsite layout for path planning. In addition, we design a velocity-replay module using an encoder-decoder architecture with few-shot learning to enable robots to replicate velocity profiles in contextualized scenarios for multi-robot synchronization and handover collaboration. These "cached" knowledge representations are demonstrated in simulated environments for multi-robot motion planning and stacking tasks. The insights from this study pave the way toward artificial general intelligence in AMRs, fostering their progression from complexity to competence in construction automation.

Inferring Transition Dynamics from Value Functions

Authors:Jacob Adamczyk
Date:2025-01-15 19:00:47

In reinforcement learning, the value function is typically trained to solve the Bellman equation, which connects the current value to future values. This temporal dependency hints that the value function may contain implicit information about the environment's transition dynamics. By rearranging the Bellman equation, we show that a converged value function encodes a model of the underlying dynamics of the environment. We build on this insight to propose a simple method for inferring dynamics models directly from the value function, potentially mitigating the need for explicit model learning. Furthermore, we explore the challenges of next-state identifiability, discussing conditions under which the inferred dynamics model is well-defined. Our work provides a theoretical foundation for leveraging value functions in dynamics modeling and opens a new avenue for bridging model-free and model-based reinforcement learning.
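
The rearrangement idea can be illustrated in the simplest possible setting. The sketch below assumes a tabular, deterministic chain under a fixed policy, where the Bellman equation reads V(s) = r(s) + gamma * V(s'), so V(s') = (V(s) - r(s)) / gamma. The toy chain, the reward vector, and the nearest-value lookup are all assumptions made for illustration and are not taken from the paper; the lookup only works when every state has a distinct value, which is the identifiability issue the abstract mentions.

```python
GAMMA = 0.9
TRUE_NEXT = [1, 2, 3, 3]        # hypothetical chain: 0 -> 1 -> 2 -> 3 (absorbing)
REWARD = [0.0, 0.0, 1.0, 0.0]   # hypothetical per-state rewards


def evaluate(next_state, reward, gamma, sweeps=100):
    """Iterate V(s) = r(s) + gamma * V(next(s)) until (practically) converged."""
    values = [0.0] * len(reward)
    for _ in range(sweeps):
        values = [reward[s] + gamma * values[next_state[s]]
                  for s in range(len(reward))]
    return values


def infer_next_states(values, reward, gamma):
    """Recover next(s) from V alone by rearranging the Bellman equation."""
    inferred = []
    for s in range(len(values)):
        target = (values[s] - reward[s]) / gamma  # the value that V(next(s)) must have
        # Pick the state whose value is closest to the target.
        inferred.append(min(range(len(values)),
                            key=lambda c: abs(values[c] - target)))
    return inferred


values = evaluate(TRUE_NEXT, REWARD, GAMMA)
recovered = infer_next_states(values, REWARD, GAMMA)
```

If two states shared the same value, the lookup could not tell them apart, which is why the inferred model is only well-defined under additional conditions.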

Average-Reward Reinforcement Learning with Entropy Regularization

Authors:Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni
Date:2025-01-15 19:00:46

The average-reward formulation of reinforcement learning (RL) has drawn increased interest in recent years due to its ability to solve temporally-extended problems without discounting. Independently, RL algorithms have benefited from entropy-regularization: an approach used to make the optimal policy stochastic, thereby more robust to noise. Despite the distinct benefits of the two approaches, the combination of entropy regularization with an average-reward objective is not well-studied in the literature and there has been limited development of algorithms for this setting. To address this gap in the field, we develop algorithms for solving entropy-regularized average-reward RL problems with function approximation. We experimentally validate our method, comparing it with existing algorithms on standard benchmarks for RL.

EVAL: EigenVector-based Average-reward Learning

Authors:Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni
Date:2025-01-15 19:00:45

In reinforcement learning, two objective functions have been developed extensively in the literature: discounted and averaged rewards. The generalization to an entropy-regularized setting has led to improved robustness and exploration for both of these objectives. Recently, the entropy-regularized average-reward problem was addressed using tools from large deviation theory in the tabular setting. This method has the advantage of linearity, providing access to both the optimal policy and average reward-rate through properties of a single matrix. In this paper, we extend that framework to more general settings by developing approaches based on function approximation by neural networks. This formulation reveals new theoretical insights into the relationship between different objectives used in RL. Additionally, we combine our algorithm with a posterior policy iteration scheme, showing how our approach can also solve the average-reward RL problem without entropy-regularization. Using classic control benchmarks, we experimentally find that our method compares favorably with other algorithms in terms of stability and rate of convergence.

Optimization of Link Configuration for Satellite Communication Using Reinforcement Learning

Authors:Tobias Rohe, Michael Kölle, Jan Matheis, Rüdiger Höpfl, Leo Sünkel, Claudia Linnhoff-Popien
Date:2025-01-14 16:04:46

Satellite communication is a key technology in our modern connected world. With increasingly complex hardware, one challenge is to efficiently configure links (connections) on a satellite transponder. Planning an optimal link configuration is extremely complex and depends on many parameters and metrics. The optimal use of the limited resources, bandwidth and power of the transponder is crucial. Such an optimization problem can be approximated using metaheuristic methods such as simulated annealing, but recent research results also show that reinforcement learning can achieve comparable or even better performance on optimization problems. However, there have not yet been any studies on link configuration on satellite transponders. In order to close this research gap, a transponder environment was developed as part of this work. For this environment, the performance of the reinforcement learning algorithm PPO was compared with the metaheuristic simulated annealing in two experiments. The results show that simulated annealing delivers better results for this static problem than the PPO algorithm. However, the research also underlines the potential of reinforcement learning for optimization problems.

Cooperative Patrol Routing: Optimizing Urban Crime Surveillance through Multi-Agent Reinforcement Learning

Authors:Juan Palma-Borda, Eduardo Guzmán, María-Victoria Belmonte
Date:2025-01-14 11:20:19

The effective design of patrol strategies is a difficult and complex problem, especially in medium and large areas. The objective is to plan, in a coordinated manner, the optimal routes for a set of patrols in a given area, in order to achieve maximum coverage of the area, while also trying to minimize the number of patrols. In this paper, we propose a multi-agent reinforcement learning (MARL) model, based on a decentralized partially observable Markov decision process, to plan unpredictable patrol routes within an urban environment represented as an undirected graph. The model attempts to maximize a target function that characterizes the environment within a given time frame. Our model has been tested to optimize police patrol routes in three medium-sized districts of the city of Malaga. The aim was to maximize surveillance coverage of the most crime-prone areas, based on actual crime data in the city. To address this problem, several MARL algorithms have been studied, and among these the Value Decomposition Proximal Policy Optimization (VDPPO) algorithm exhibited the best performance. We also introduce a novel metric, the coverage index, for the evaluation of the coverage performance of the routes generated by our model. This metric is inspired by the predictive accuracy index (PAI), which is commonly used in criminology to detect hotspots. Using this metric, we have evaluated the model under various scenarios in which the number of agents (or patrols), their starting positions, and the level of information they can observe in the environment have been modified. Results show that the coordinated routes generated by our model achieve a coverage of more than $90\%$ of the $3\%$ of graph nodes with the highest crime incidence, and $65\%$ for $20\%$ of these nodes; $3\%$ and $20\%$ represent the coverage standards for police resource allocation.

RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation

Authors:Zixuan Chen, Jing Huo, Yangtao Chen, Yang Gao
Date:2025-01-11 18:11:07

Efficient control in long-horizon robotic manipulation is challenging due to complex representation and policy learning requirements. Model-based visual reinforcement learning (RL) has shown great potential in addressing these challenges but still faces notable limitations, particularly in handling sparse rewards and complex visual features in long-horizon environments. To address these limitations, we propose the Recognize-Sense-Plan-Act (RSPA) pipeline for long-horizon tasks and further introduce RoboHorizon, an LLM-assisted multi-view world model tailored for long-horizon robotic manipulation. In RoboHorizon, pre-trained LLMs generate dense reward structures for multi-stage sub-tasks based on task language instructions, enabling robots to better recognize long-horizon tasks. Keyframe discovery is then integrated into the multi-view masked autoencoder (MAE) architecture to enhance the robot's ability to sense critical task sequences, strengthening its multi-stage perception of long-horizon processes. Leveraging these dense rewards and multi-view representations, a robotic world model is constructed to efficiently plan long-horizon tasks, enabling the robot to reliably act through RL algorithms. Experiments on two representative benchmarks, RLBench and FurnitureBench, show that RoboHorizon outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% improvement in task success rates on RLBench's 4 short-horizon tasks and a 29.23% improvement on 6 long-horizon tasks from RLBench and 3 furniture assembly tasks from FurnitureBench.

DRL-Based Medium-Term Planning of Renewable-Integrated Self-Scheduling Cascaded Hydropower to Guide Wholesale Market Participation

Authors:Xianbang Chen, Yikui Liu, Neng Fan, Lei Wu
Date:2025-01-08 20:57:14

For self-scheduling cascaded hydropower (S-CHP) facilities, medium-term planning is a critical step that coordinates water availability over the medium-term horizon, providing water usage guidance for their short-term operations in wholesale market participation. Typically, medium-term planning strategies (e.g., reservoir storage targets at the end of each short-term period) are determined by either optimization methods or rules of thumb. However, with the integration of variable renewable energy sources (VRESs), optimization-based methods suffer from deviations between the anticipated and actual reservoir storage, while rules of thumb could be financially conservative, thereby compromising short-term operating profitability in wholesale market participation. This paper presents a deep reinforcement learning (DRL)-based framework to derive medium-term planning policies for VRES-integrated S-CHPs (VS-CHPs), which can leverage contextual information underneath individual short-term periods and train planning policies by their induced short-term operating profits in wholesale market participation. The proposed DRL-based framework offers two practical merits. First, its planning strategies consider both seasonal requirements of reservoir storage and needs for short-term operating profits. Second, it adopts a multi-parametric programming-based strategy to accelerate the expensive training process associated with multi-step short-term operations. Finally, the DRL-based framework is evaluated on a real-world VS-CHP, demonstrating its advantages over current practice.

Online Reinforcement Learning-Based Dynamic Adaptive Evaluation Function for Real-Time Strategy Tasks

Authors:Weilong Yang, Jie Zhang, Xunyun Liu, Yanqing Ye
Date:2025-01-07 14:36:33

Effective evaluation of real-time strategy tasks requires adaptive mechanisms to cope with dynamic and unpredictable environments. This study proposes a method to improve evaluation functions for real-time responsiveness to battlefield situation changes, utilizing an online reinforcement learning-based dynamic weight adjustment mechanism within the real-time strategy game. Building on traditional static evaluation functions, the method employs gradient descent in online reinforcement learning to update weights dynamically, incorporating weight decay techniques to ensure stability. Additionally, the AdamW optimizer is integrated to adjust the learning rate and decay rate of online reinforcement learning in real time, further reducing the dependency on manual parameter tuning. Round-robin competition experiments demonstrate that this method significantly enhances the application effectiveness of the Lanchester combat model evaluation function, Simple evaluation function, and Simple Sqrt evaluation function in planning algorithms including IDABCD, IDRTMinimax, and Portfolio AI. The method achieves a notable improvement in scores, with the enhancement becoming more pronounced as the map size increases. Furthermore, the increase in evaluation function computation time induced by this method is kept below 6% for all evaluation functions and planning algorithms. The proposed dynamic adaptive evaluation function demonstrates a promising approach for real-time strategy task evaluation.
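
A rough sketch of what an online, AdamW-style weight update for a linear evaluation function might look like is given below. The class name, the squared-error objective, the feature vector, and all hyperparameter values are assumptions for illustration; the abstract does not specify the loss, the features, or how the target signal is obtained.

```python
import math


class AdamWWeights:
    """Linear evaluation function whose weights are tuned online (hypothetical)."""

    def __init__(self, n, lr=0.05, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
        self.w = [0.0] * n   # evaluation-function weights
        self.m = [0.0] * n   # first-moment (mean) estimates of the gradient
        self.v = [0.0] * n   # second-moment estimates of the gradient
        self.t = 0           # number of updates performed so far
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.eps, self.weight_decay = eps, weight_decay

    def evaluate(self, features):
        return sum(w * x for w, x in zip(self.w, features))

    def update(self, features, target):
        """One online step on the squared error; returns the pre-update loss."""
        error = self.evaluate(features) - target
        self.t += 1
        for i, x in enumerate(features):
            grad = 2.0 * error * x
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grad * grad
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            # Decoupled weight decay: applied to the weight, not mixed into grad.
            self.w[i] -= self.lr * (m_hat / (math.sqrt(v_hat) + self.eps)
                                    + self.weight_decay * self.w[i])
        return error * error


model = AdamWWeights(2)
for _ in range(500):
    loss = model.update([1.0, 2.0], 3.0)  # hypothetical features and target
```

The per-weight normalization by the second-moment estimate is what lets the effective step size adapt without manual tuning, while the decay term pulls weights toward zero for stability.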

Sim-to-Real Transfer for Mobile Robots with Reinforcement Learning: from NVIDIA Isaac Sim to Gazebo and Real ROS 2 Robots

Authors:Sahar Salimpour, Jorge Peña-Queralta, Diego Paez-Granados, Jukka Heikkonen, Tomi Westerlund
Date:2025-01-06 10:26:16

Unprecedented agility and dexterous manipulation have been demonstrated with controllers based on deep reinforcement learning (RL), with a significant impact on legged and humanoid robots. Modern tooling and simulation platforms, such as NVIDIA Isaac Sim, have been enabling such advances. This article focuses on demonstrating the applications of Isaac in local planning and obstacle avoidance as one of the most fundamental ways in which a mobile robot interacts with its environment. Although there is extensive research on proprioception-based RL policies, the article highlights less standardized and reproducible approaches to exteroception. At the same time, the article aims to provide a base framework for end-to-end local navigation policies and to show how a custom robot can be trained in such a simulation environment. We benchmark end-to-end policies against the state-of-the-art Nav2, the navigation stack in the Robot Operating System (ROS). We also cover the sim-to-real transfer process by demonstrating zero-shot transferability of policies trained in the Isaac simulator to real-world robots. This is further evidenced by the tests with different simulated robots, which show the generalization of the learned policy. Finally, the benchmarks demonstrate comparable performance to Nav2, opening the door to quick deployment of state-of-the-art end-to-end local planners for custom robot platforms, but importantly furthering the possibilities by expanding the state and action spaces or task definitions for more complex missions. Overall, with this article we introduce the most important steps, and aspects to consider, in deploying RL policies for local path planning and obstacle avoidance with Isaac Sim training, Gazebo testing, and ROS 2 for real-time inference in real robots. The code is available at https://github.com/sahars93/RL-Navigation.

First-place Solution for Streetscape Shop Sign Recognition Competition

Authors:Bin Wang, Li Jing
Date:2025-01-06 07:20:36

Text recognition technology applied to street-view storefront signs is increasingly utilized across various practical domains, including map navigation, smart city planning analysis, and business value assessments in commercial districts. This technology holds significant research and commercial potential. Nevertheless, it faces numerous challenges. Street view images often contain signboards with complex designs and diverse text styles, complicating the text recognition process. A notable advancement in this field was introduced by our team in a recent competition. We developed a novel multistage approach that integrates multimodal feature fusion, extensive self-supervised training, and a Transformer-based large model. Furthermore, innovative techniques such as BoxDQN, which relies on reinforcement learning, and text rectification methods were employed, leading to impressive outcomes. Comprehensive experiments have validated the effectiveness of these methods, showcasing our potential to enhance text recognition capabilities in complex urban environments.

Horizon Generalization in Reinforcement Learning

Authors:Vivek Myers, Catherine Ji, Benjamin Eysenbach
Date:2025-01-06 01:42:46

We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization in other areas of machine learning (e.g., normalization layers make a network invariant to scale, and therefore generalize to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Thus, such a policy trained to reach nearby goals should succeed at reaching arbitrarily-distant goals. Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results and recall findings from prior work in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.

A View of the Certainty-Equivalence Method for PAC RL as an Application of the Trajectory Tree Method

Authors:Shivaram Kalyanakrishnan, Sheel Shah, Santhosh Kumar Guguloth
Date:2025-01-05 20:37:34

Reinforcement learning (RL) enables an agent interacting with an unknown MDP $M$ to optimise its behaviour by observing transitions sampled from $M$. A natural entity that emerges in the agent's reasoning is $\widehat{M}$, the maximum likelihood estimate of $M$ based on the observed transitions. The well-known certainty-equivalence method (CEM) dictates that the agent update its behaviour to $\widehat{\pi}$, which is an optimal policy for $\widehat{M}$. Not only is CEM intuitive, it has been shown to enjoy minimax-optimal sample complexity in some regions of the parameter space for PAC RL with a generative model (Agarwal et al., 2020). A seemingly unrelated algorithm is the "trajectory tree method" (TTM) (Kearns et al., 1999), originally developed for efficient decision-time planning in large POMDPs. This paper presents a theoretical investigation that stems from the surprising finding that CEM may indeed be viewed as an application of TTM. The qualitative benefits of this view are (1) new and simple proofs of sample complexity upper bounds for CEM, in fact under a (2) weaker assumption on the rewards than is prevalent in the current literature. Our analysis applies to both non-stationary and stationary MDPs. Quantitatively, we obtain (3) improvements in the sample-complexity upper bounds for CEM both for non-stationary and stationary MDPs, in the regime that the "mistake probability" $\delta$ is small. Additionally, we show (4) a lower bound on the sample complexity for finite-horizon MDPs, which establishes the minimax-optimality of our upper bound for non-stationary MDPs in the small-$\delta$ regime.
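
The certainty-equivalence recipe described above (estimate $\widehat{M}$ by maximum likelihood, then act optimally in $\widehat{M}$) can be sketched for a small tabular, discounted MDP. The function names, the known reward table, and the use of value iteration as the planner are assumptions for illustration; the paper's analysis is not tied to this particular implementation.

```python
from collections import Counter


def estimate_model(samples, n_states, n_actions):
    """Maximum likelihood transition estimate from (s, a, s_next) triples.

    Assumes every (s, a) pair was sampled at least once, as a generative
    model would allow; unsampled pairs are left as all-zero rows.
    """
    counts = Counter(samples)
    totals = Counter((s, a) for s, a, _ in samples)
    p_hat = [[[0.0] * n_states for _ in range(n_actions)]
             for _ in range(n_states)]
    for (s, a, s_next), c in counts.items():
        p_hat[s][a][s_next] = c / totals[(s, a)]
    return p_hat


def plan(p_hat, reward, gamma=0.9, sweeps=500):
    """Value iteration on the estimated model; returns (values, greedy policy)."""
    n_states, n_actions = len(p_hat), len(p_hat[0])
    values = [0.0] * n_states

    def q(s, a):
        return reward[s][a] + gamma * sum(
            p * v for p, v in zip(p_hat[s][a], values))

    for _ in range(sweeps):
        values = [max(q(s, a) for a in range(n_actions))
                  for s in range(n_states)]
    policy = [max(range(n_actions), key=lambda a, s=s: q(s, a))
              for s in range(n_states)]
    return values, policy


# Hypothetical data: 2 states, 2 actions, 4 samples per state-action pair.
samples = ([(0, 0, 0)] * 3 + [(0, 0, 1)] + [(0, 1, 1)] * 4
           + [(1, 0, 0)] * 4 + [(1, 1, 1)] * 4)
reward = [[0.0, 0.0], [1.0, 1.0]]   # reward[s][a], assumed known here
p_hat = estimate_model(samples, n_states=2, n_actions=2)
values, policy = plan(p_hat, reward)
```

Sample-complexity results for CEM ask how many samples per state-action pair are needed before the policy returned by a planner like `plan` is near-optimal in the true MDP with probability at least $1 - \delta$.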

Securing Integrated Sensing and Communication Against a Mobile Adversary: A Stackelberg Game with Deep Reinforcement Learning

Authors:Milad Tatar Mamaghani, Xiangyun Zhou, Nan Yang, A. Lee Swindlehurst
Date:2025-01-04 12:18:41

In this paper, we study a secure integrated sensing and communication (ISAC) system employing a full-duplex base station with sensing capabilities against a mobile proactive adversarial target, a malicious unmanned aerial vehicle (M-UAV). We develop a game-theoretic model to enhance communication security, radar sensing accuracy, and power efficiency. The interaction between the legitimate network and the mobile adversary is formulated as a non-cooperative Stackelberg game (NSG), where the M-UAV acts as the leader and strategically adjusts its trajectory to improve its eavesdropping ability while conserving power and avoiding obstacles. In response, the legitimate network, acting as the follower, dynamically allocates resources to minimize network power usage while ensuring required secrecy rates and sensing performance. To address this challenging problem, we propose a low-complexity successive convex approximation (SCA) method for network resource optimization combined with a deep reinforcement learning (DRL) algorithm for adaptive M-UAV trajectory planning through sequential interactions and learning. Simulation results demonstrate the efficacy of the proposed method in addressing security challenges of dynamic ISAC systems in 6G, i.e., achieving a Stackelberg equilibrium with robust performance while mitigating the adversary's ability to intercept network signals.

Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning

Authors:Zhaoyuan Gu, Junheng Li, Wenlan Shen, Wenhao Yu, Zhaoming Xie, Stephen McCrory, Xianyi Cheng, Abdulaziz Shamsah, Robert Griffin, C. Karen Liu, Abderrahmane Kheddar, Xue Bin Peng, Yuke Zhu, Guanya Shi, Quan Nguyen, Gordon Cheng, Huijun Gao, Ye Zhao
Date:2025-01-03 22:00:53

Humanoid robots have great potential to perform various human-level skills. These skills involve locomotion, manipulation, and cognitive capabilities. Driven by advances in machine learning and the strength of existing model-based approaches, these capabilities have progressed rapidly, but often separately. Therefore, a timely overview of current progress and future trends in this fast-evolving field is essential. This survey first summarizes the model-based planning and control that have been the backbone of humanoid robotics for the past three decades. We then explore emerging learning-based methods, with a focus on reinforcement learning and imitation learning that enhance the versatility of loco-manipulation skills. We examine the potential of integrating foundation models with humanoid embodiments, assessing the prospects for developing generalist humanoid agents. In addition, this survey covers emerging research for whole-body tactile sensing that unlocks new humanoid skills that involve physical interactions. The survey concludes with a discussion of the challenges and future trends.

Proposing Hierarchical Goal-Conditioned Policy Planning in Multi-Goal Reinforcement Learning

Authors:Gavin B. Rens
Date:2025-01-03 09:37:54

Humanoid robots must master numerous tasks with sparse rewards, posing a challenge for reinforcement learning (RL). We propose a method combining RL and automated planning to address this. Our approach uses short goal-conditioned policies (GCPs) organized hierarchically, with Monte Carlo Tree Search (MCTS) planning using high-level actions (HLAs). Instead of primitive actions, the planning process generates HLAs. A single plan-tree, maintained during the agent's lifetime, holds knowledge about goal achievement. This hierarchy enhances sample efficiency and speeds up reasoning by reusing HLAs and anticipating future actions. Our Hierarchical Goal-Conditioned Policy Planning (HGCPP) framework uniquely integrates GCPs, MCTS, and hierarchical RL, potentially improving exploration and planning in complex tasks.

Exploiting NOMA Transmissions in Multi-UAV-assisted Wireless Networks: From Aerial-RIS to Mode-switching UAVs

Authors:Songhan Zhao, Shimin Gong, Bo Gu, Lanhua Li, Bin Lyu, Dinh Thai Hoang, Changyan Yi
Date:2024-12-29 14:52:13

In this paper, we consider an aerial reconfigurable intelligent surface (ARIS)-assisted wireless network, where multiple unmanned aerial vehicles (UAVs) collect data from ground users (GUs) by using the non-orthogonal multiple access (NOMA) method. The ARIS provides enhanced channel controllability to improve the NOMA transmissions and reduce the co-channel interference among UAVs. We also propose a novel dual-mode switching scheme, where each UAV equipped with both an ARIS and a radio frequency (RF) transceiver can adaptively perform passive reflection or active transmission. We aim to maximize the overall network throughput by jointly optimizing the UAVs' trajectory planning and operating modes, the ARIS's passive beamforming, and the GUs' transmission control strategies. We propose an optimization-driven hierarchical deep reinforcement learning (O-HDRL) method to decompose it into a series of subproblems. Specifically, the multi-agent deep deterministic policy gradient (MADDPG) adjusts the UAVs' trajectory planning and mode switching strategies, while the passive beamforming and transmission control strategies are tackled by the optimization methods. Numerical results reveal that the O-HDRL efficiently improves the learning stability and reward performance compared to the benchmark methods. Meanwhile, the dual-mode switching scheme is verified to achieve a higher throughput performance compared to the fixed ARIS scheme.

SatFlow: Scalable Network Planning for LEO Mega-Constellations

Authors:Sheng Cen, Qiying Pan, Yifei Zhu, Bo Li
Date:2024-12-29 14:25:06

Low-earth-orbit (LEO) satellite communication networks have evolved into mega-constellations with hundreds to thousands of satellites interconnected by inter-satellite links (ISLs). Network planning, which plans for network resources and architecture to improve the network performance and save operational costs, is crucial for satellite network management. However, due to the large scale of mega-constellations, high dynamics of satellites, and complex distribution of real-world traffic, it is extremely challenging to conduct scalable network planning on mega-constellations with high performance. In this paper, we propose SatFlow, a distributed and hierarchical network planning framework to plan for the network topology, traffic allocation, and fine-grained ISL terminal power allocation for mega-constellations. To tackle the hardness of the original problem, we decompose the grand problem into two hierarchical sub-problems, tackled by two-tier modules. A multi-agent reinforcement learning approach is proposed for the upper-level module so that the overall laser energy consumption and ISL operational costs can be minimized; a distributed alternating step algorithm is proposed for the lower-level module so that the laser energy consumption can be minimized with low time complexity for a given topology. Extensive simulations on various mega-constellations validate SatFlow's scalability on the constellation size, reducing the flow violation ratio by up to 21.0% and reducing the total costs by up to 89.4%, compared with various state-of-the-art benchmarks.

Exploiting Hybrid Policy in Reinforcement Learning for Interpretable Temporal Logic Manipulation

Authors:Hao Zhang, Hao Wang, Xiucai Huang, Wenrui Chen, Zhen Kan
Date:2024-12-29 03:34:53

Reinforcement Learning (RL) based methods have been increasingly explored for robot learning. However, RL based methods often suffer from low sampling efficiency in the exploration phase, especially for long-horizon manipulation tasks, and generally neglect the semantic information from the task level, resulting in delayed convergence or even task failure. To tackle these challenges, we propose a Temporal-Logic-guided Hybrid policy framework (HyTL) which leverages three-level decision layers to improve the agent's performance. Specifically, the task specifications are encoded via linear temporal logic (LTL) to improve performance and offer interpretability. A waypoint planning module is designed with the feedback from the LTL-encoded task level as a high-level policy to improve the exploration efficiency. The middle-level policy selects which behavior primitives to execute, and the low-level policy specifies the corresponding parameters to interact with the environment. We evaluate HyTL on four challenging manipulation tasks, which demonstrate its effectiveness and interpretability. Our project is available at: https://sites.google.com/view/hytl-0257/.

Scalable Hierarchical Reinforcement Learning for Hyper Scale Multi-Robot Task Planning

Authors:Xuan Zhou, Xiang Shi, Lele Zhang, Chen Chen, Hongbo Li, Lin Ma, Fang Deng, Jie Chen
Date:2024-12-27 09:07:11

To improve the efficiency of warehousing systems and meet huge customer orders, we aim to solve the challenges of the curse of dimensionality and dynamic properties in hyper scale multi-robot task planning (MRTP) for robotic mobile fulfillment systems (RMFS). Existing research indicates that hierarchical reinforcement learning (HRL) is an effective method to reduce these challenges. Based on that, we construct an efficient multi-stage HRL-based multi-robot task planner for hyper scale MRTP in RMFS, and the planning process is represented with a special temporal graph topology. To ensure optimality, the planner is designed with a centralized architecture, but this also brings the challenges of scaling up and generalization, which require policies to maintain performance for various unlearned scales and maps. To tackle these difficulties, we first construct a hierarchical temporal attention network (HTAN) to ensure the basic ability of handling inputs with unfixed lengths, and then design multi-stage curricula for hierarchical policy learning to further improve the scaling up and generalization ability while avoiding catastrophic forgetting. Additionally, we notice that policies with hierarchical structure suffer from unfair credit assignment similar to that in multi-agent reinforcement learning. Inspired by this, we propose a hierarchical reinforcement learning algorithm with a counterfactual rollout baseline to improve learning performance. Experimental results demonstrate that our planner outperforms other state-of-the-art methods on various MRTP instances in both simulated and real-world RMFS. Also, our planner can successfully scale up to hyper scale MRTP instances in RMFS with up to 200 robots and 1000 retrieval racks on unlearned maps while keeping superior performance over other methods.

Autonomous Option Invention for Continual Hierarchical Reinforcement Learning and Planning

Authors:Rashmeet Kaur Nayyar, Siddharth Srivastava
Date:2024-12-20 23:04:52

Abstraction is key to scaling up reinforcement learning (RL). However, autonomously learning abstract state and action representations to enable transfer and generalization remains a challenging open problem. This paper presents a novel approach for inventing, representing, and utilizing options, which represent temporally extended behaviors, in continual RL settings. Our approach addresses streams of stochastic problems characterized by long horizons, sparse rewards, and unknown transition and reward functions. Our approach continually learns and maintains an interpretable state abstraction, and uses it to invent high-level options with abstract symbolic representations. These options meet three key desiderata: (1) composability for solving tasks effectively with lookahead planning, (2) reusability across problem instances for minimizing the need for relearning, and (3) mutual independence for reducing interference among options. Our main contributions are approaches for continually learning transferable, generalizable options with symbolic representations, and for integrating search techniques with RL to efficiently plan over these learned options to solve new problems. Empirical results demonstrate that the resulting approach effectively learns and transfers abstract knowledge across problem instances, achieving superior sample efficiency compared to state-of-the-art methods.

Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues

Authors:Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Yiheng Sun, Zerui Chen, Ming Liu, Bing Qin
Date:2024-12-19 07:06:01

Recent advancements in proactive dialogues have garnered significant attention, particularly for more complex objectives (e.g. emotion support and persuasion). Unlike traditional task-oriented dialogues, proactive dialogues demand advanced policy planning and adaptability, requiring rich scenarios and comprehensive policy repositories to develop such systems. However, existing approaches tend to rely on Large Language Models (LLMs) for user simulation and online learning, leading to biases that diverge from realistic scenarios and result in suboptimal efficiency. Moreover, these methods depend on manually defined, context-independent, coarse-grained policies, which not only incur high expert costs but also raise concerns regarding their completeness. In our work, we highlight the potential for automatically discovering policies directly from raw, real-world dialogue records. To this end, we introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Specifically, we employ a variant of the Variational Autoencoder to discover fine-grained policies represented as latent vectors. After automatically annotating the data with these latent policy labels, we propose an Offline Hierarchical Reinforcement Learning (RL) algorithm in the latent space to develop effective policy planning capabilities. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM.

Neural-Network-Driven Reward Prediction as a Heuristic: Advancing Q-Learning for Mobile Robot Path Planning

Authors:Yiming Ji, Kaijie Yun, Yang Liu, Zongwu Xie, Hong Liu
Date:2024-12-17 08:19:40

Q-learning is a widely used reinforcement learning technique for solving path planning problems. It primarily involves the interaction between an agent and its environment, enabling the agent to learn an optimal strategy that maximizes cumulative rewards. Although many studies have reported the effectiveness of Q-learning, it still faces slow convergence issues in practical applications. To address this issue, we propose the NDR-QL method, which utilizes neural network outputs as heuristic information to accelerate the convergence process of Q-learning. Specifically, we improved the dual-output neural network model by introducing a start-end channel separation mechanism and enhancing the feature fusion process. After training, the proposed NDR model can output a narrowly focused optimal probability distribution, referred to as the guideline, and a broadly distributed suboptimal distribution, referred to as the region. Subsequently, based on the guideline prediction, we calculate the continuous reward function for the Q-learning method, and based on the region prediction, we initialize the Q-table with a bias. We conducted training, validation, and path planning simulation experiments on public datasets. The results indicate that the NDR model outperforms previous methods by up to 5% in prediction accuracy. Furthermore, the proposed NDR-QL method improves the convergence speed of the baseline Q-learning method by 90% and also surpasses the previously improved Q-learning methods in path quality metrics.
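
The idea of initializing the Q-table with a bias can be illustrated with a minimal tabular sketch. This is not the NDR-QL implementation: the learned region prediction is replaced by a hypothetical Manhattan-distance heuristic (`distance_bias`), the grid has no obstacles, and the reward values and hyperparameters are assumptions chosen only for illustration.

```python
import random

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up


def distance_bias(grid_size, goal, gamma=0.95):
    """Hypothetical stand-in for a learned 'region' prediction:
    states closer to the goal receive a larger initial Q-value."""
    return {
        (r, c): gamma ** (abs(r - goal[0]) + abs(c - goal[1]))
        for r in range(grid_size)
        for c in range(grid_size)
    }


def q_learning(grid_size, start, goal, q_bias=None, episodes=200,
               alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning on an empty grid with an optionally biased Q-table."""
    rng = random.Random(seed)
    q_bias = q_bias or {}
    # Every action in a state starts from that state's bias (0.0 if none given).
    q = {(r, c): [q_bias.get((r, c), 0.0)] * len(ACTIONS)
         for r in range(grid_size) for c in range(grid_size)}
    max_steps = 10 * grid_size * grid_size  # cap on episode length
    steps_per_episode = []
    for _ in range(episodes):
        state, steps = start, 0
        while state != goal and steps < max_steps:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            dr, dc = ACTIONS[a]
            # Moves that would leave the grid keep the agent at the border.
            nxt = (min(max(state[0] + dr, 0), grid_size - 1),
                   min(max(state[1] + dc, 0), grid_size - 1))
            reward = 1.0 if nxt == goal else -0.01  # assumed reward values
            bootstrap = 0.0 if nxt == goal else gamma * max(q[nxt])
            q[state][a] += alpha * (reward + bootstrap - q[state][a])
            state, steps = nxt, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode
```

Comparing `steps_per_episode` for `q_bias=None` and `q_bias=distance_bias(...)` on the same grid gives a rough feel for how an informed initialization can shorten early episodes; the abstract's reported speed-up refers to the authors' own model and experiments, not to this sketch.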

Equivariant Action Sampling for Reinforcement Learning and Planning

Authors:Linfeng Zhao, Owen Howell, Xupeng Zhu, Jung Yeon Park, Zhewen Zhang, Robin Walters, Lawson L. S. Wong
Date:2024-12-16 17:51:14

Reinforcement learning (RL) algorithms for continuous control tasks require accurate sampling-based action selection. Many tasks, such as robotic manipulation, contain inherent problem symmetries. However, correctly incorporating symmetry into sampling-based approaches remains a challenge. This work addresses the challenge of preserving symmetry in sampling-based planning and control, a key component for enhancing decision-making efficiency in RL. We introduce an action sampling approach that enforces the desired symmetry. We apply our proposed method to a coordinate regression problem and show that the symmetry aware sampling method drastically outperforms the naive sampling approach. We furthermore develop a general framework for sampling-based model-based planning with Model Predictive Path Integral (MPPI). We compare our MPPI approach with standard sampling methods on several continuous control tasks. Empirical demonstrations across multiple continuous control environments validate the effectiveness of our approach, showcasing the importance of symmetry preservation in sampling-based action selection.
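
As a rough illustration of why symmetry in the sample set matters, the sketch below combines the standard MPPI softmin weighting with a sample set that is closed under a sign flip. The sign-flip group, the helper names (`symmetrize`, `mppi_update`), and the temperature value are assumptions for illustration; the paper's equivariant sampling scheme is more general and is not reproduced here.

```python
import math


def mppi_weights(costs, temperature=1.0):
    """Softmin weights over sampled trajectory costs (standard MPPI weighting)."""
    c_min = min(costs)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def symmetrize(perturbations):
    """Assumed example symmetry: close the sample set under negation,
    so each perturbation appears together with its mirror image."""
    return perturbations + [[-e for e in eps] for eps in perturbations]


def mppi_update(nominal, perturbations, costs, temperature=1.0):
    """Shift the nominal action sequence by the cost-weighted mean perturbation."""
    weights = mppi_weights(costs, temperature)
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t, u in enumerate(nominal)
    ]
```

With a mirrored sample set and a cost that treats mirrored samples identically, the weighted perturbations cancel and the update leaves the nominal sequence unchanged, which is the behaviour a symmetric problem should produce. A sample set drawn without this structure generally introduces a spurious bias.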

Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning

Authors:Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, Soujanya Poria
Date:2024-12-16 16:58:28

Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, Emma-X. Emma-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which can help mitigate hallucination in grounding subtask reasoning generation. Experimental results demonstrate that Emma-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.

Learning UAV-based path planning for efficient localization of objects using prior knowledge

Authors:Rick van Essen, Eldert van Henten, Gert Kootstra
Date:2024-12-16 12:39:02

UAVs are becoming popular for various object search applications in agriculture, but they usually use time-consuming row-by-row flight paths. This paper presents a deep-reinforcement-learning method for path planning to efficiently localize objects of interest using UAVs with a minimal flight-path length. The method uses some global prior knowledge with uncertain object locations and limited resolution in combination with a local object map created using the output of an object detection network. The search policy could be learned using deep Q-learning. We trained the agent in simulation, allowing thorough evaluation of the object distribution, typical errors in the perception system and prior knowledge, and different stopping criteria. When objects were non-uniformly distributed over the field, the agent found the objects quicker than a row-by-row flight path, showing that it learns to exploit the distribution of objects. Detection errors and quality of prior knowledge had only a minor effect on the performance, indicating that the learned search policy was robust to errors in the perception system and did not need detailed prior knowledge. Without prior knowledge, the learned policy was still comparable in performance to a row-by-row flight path. Finally, we demonstrated that it is possible to learn the appropriate moment to end the search task. The applicability of the approach for object search on a real drone was comprehensively discussed and evaluated. Overall, we conclude that the learned search policy increased the efficiency of finding objects using a UAV, and can be applied in real-world conditions when the specified assumptions are met.

Are Expressive Models Truly Necessary for Offline RL?

Authors:Guan Wang, Haoyi Niu, Jianxiong Li, Li Jiang, Jianming Hu, Xianyuan Zhan
Date:2024-12-15 17:33:56

Among various branches of offline reinforcement learning (RL) methods, goal-conditioned supervised learning (GCSL) has gained increasing popularity as it formulates the offline RL problem as a sequential modeling task, thereby bypassing the notoriously difficult credit assignment challenge of value learning in the conventional RL paradigm. Sequential modeling, however, requires capturing accurate dynamics across long horizons in trajectory data to ensure reasonable policy performance. To meet this requirement, leveraging large, expressive models has become a popular choice in recent literature, which, however, comes at the cost of significantly increased computation and inference latency. Contradictory yet promising, we reveal that lightweight models as simple as shallow 2-layer MLPs can also enjoy accurate dynamics consistency and significantly reduced sequential modeling errors against large expressive models by adopting a simple recursive planning scheme: recursively planning coarse-grained future sub-goals based on current and target information, and then executing the action with a goal-conditioned policy learned from data relabeled with these sub-goal ground truths. We term our method Recursive Skip-Step Planning (RSP). Simple yet effective, RSP enjoys great efficiency improvements thanks to its lightweight structure, and substantially outperforms existing methods, reaching new SOTA performances on the D4RL benchmark, especially in multi-stage long-horizon tasks.
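
One possible reading of the recursive scheme described above can be sketched as follows. The learned sub-goal predictor and goal-conditioned policy are replaced by hypothetical stand-ins (`midpoint` and `step_towards`), and the recursion depth is an assumed parameter; the abstract does not specify these details.

```python
def plan_subgoal(state, goal, predict_subgoal, depth):
    """Recursively refine a distant goal into a nearer sub-goal.

    Each call to predict_subgoal(state, target) is assumed to return a
    coarse intermediate target between the current state and `target`.
    """
    subgoal = goal
    for _ in range(depth):
        subgoal = predict_subgoal(state, subgoal)
    return subgoal


def midpoint(state, target):
    """Hypothetical stand-in for a learned sub-goal predictor."""
    return tuple((s + t) / 2 for s, t in zip(state, target))


def step_towards(state, subgoal, max_step=1.0):
    """Hypothetical stand-in for a goal-conditioned policy:
    move each coordinate towards the sub-goal by at most max_step."""
    return tuple(
        s + max(-max_step, min(max_step, g - s))
        for s, g in zip(state, subgoal)
    )
```

For example, starting at `(0.0, 0.0)` with goal `(16.0, 8.0)` and depth 3, the midpoint predictor yields the sub-goals `(8.0, 4.0)`, `(4.0, 2.0)`, and finally `(2.0, 1.0)`, which the policy then acts on. In this reading, each predictor only has to reason over a coarse skip rather than the full horizon, which is one way a small model could stay accurate.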

Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation

Authors:Sukai Huang, Trevor Cohn, Nir Lipovetzky
Date:2024-12-14 04:23:14

The capability of Large Language Models (LLMs) to plan remains a topic of debate. Some critics argue that strategies to boost LLMs' reasoning skills are ineffective in planning tasks, while others report strong outcomes merely from training models on a planning corpus. This study reassesses recent strategies by developing an end-to-end LLM planner and employing diverse metrics for a thorough evaluation. We find that merely fine-tuning LLMs on a corpus of planning instances does not lead to robust planning skills, as indicated by poor performance on out-of-distribution test sets. At the same time, we find that various strategies, including Chain-of-Thought, do enhance the probability of a plan being executable. This indicates progress towards better plan quality, despite not directly enhancing the final validity rate. Among the strategies we evaluated, reinforcement learning with our novel 'Longest Contiguous Common Subsequence' reward emerged as the most effective, contributing to both plan validity and executability. Overall, our research addresses key misconceptions in the LLM-planning literature; we validate incremental progress in plan executability, although plan validity remains a challenge. Hence, future strategies should focus on both these aspects, drawing insights from our findings.
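
The abstract does not define the reward precisely, so the following is only a sketch of what a longest-contiguous-common-subsequence score between a generated plan and a reference plan might look like. Comparing against a single reference plan and normalizing by its length are assumptions, as are the function names.

```python
def longest_contiguous_common_subsequence(a, b):
    """Length of the longest run of consecutive items shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: run length ending at a[i-1], b[j-1]
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lccs_reward(generated_plan, reference_plan):
    """Assumed normalization: matched run length over reference length."""
    if not reference_plan:
        return 0.0
    run = longest_contiguous_common_subsequence(generated_plan, reference_plan)
    return run / len(reference_plan)
```

With `generated_plan = ["pick a", "stack a b", "pick c"]` and `reference_plan = ["unstack b a", "pick a", "stack a b", "putdown c"]`, the longest shared run has length 2, giving a reward of 0.5. A score of this kind gives partial credit for correct stretches of a plan even when the whole plan is not valid.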

Advances in Transformers for Robotic Applications: A Review

Authors:Nikunj Sanghai, Nik Bear Brown
Date:2024-12-13 23:02:15

The introduction of the Transformer architecture has brought about significant breakthroughs in Deep Learning (DL), particularly within Natural Language Processing (NLP). Since their inception, Transformers have outperformed many traditional neural network architectures due to their "self-attention" mechanism and their scalability across various applications. In this paper, we cover the use of Transformers in Robotics. We go through recent advances and trends in Transformer architectures and examine their integration into robotic perception, planning, and control for autonomous systems. Furthermore, we review past work and recent research on the use of Transformers in Robotics as pre-trained foundation models and the integration of Transformers with Deep Reinforcement Learning (DRL) for autonomous systems. We discuss how different Transformer variants are being adapted in robotics for reliable planning and perception, increasing human-robot interaction, long-horizon decision-making, and generalization. Finally, we address limitations and challenges, offering insight and suggestions for future research directions.

Reconfigurable Intelligent Surface for Internet of Robotic Things

Authors:Wanli Ni, Ruyu Luo, Xinran Zhang, Peng Wang, Wen Wang, Hui Tian
Date:2024-12-12 09:51:55

With the rapid development of artificial intelligence, robotics, and Internet of Things, multi-robot systems are progressively acquiring human-like environmental perception and understanding capabilities, empowering them to complete complex tasks through autonomous decision-making and interaction. However, the Internet of Robotic Things (IoRT) faces significant challenges in terms of spectrum resources, sensing accuracy, communication latency, and energy supply. To address these issues, a reconfigurable intelligent surface (RIS)-aided IoRT network is proposed to enhance the overall performance of robotic communication, sensing, computation, and energy harvesting. In the case studies, by jointly optimizing parameters such as transceiver beamforming, robot trajectories, and RIS coefficients, solutions based on multi-agent deep reinforcement learning and multi-objective optimization are proposed to solve problems such as beamforming design, path planning, target sensing, and data aggregation. Numerical results are provided to demonstrate the effectiveness of the proposed solutions in improving communication quality, sensing accuracy, computation error, and energy efficiency of RIS-aided IoRT networks.

Learning Sketch Decompositions in Planning via Deep Reinforcement Learning

Authors:Michael Aichmüller, Hector Geffner
Date:2024-12-11 17:45:31

In planning and reinforcement learning, the identification of common subgoal structures across problems is important when goals are to be achieved over long horizons. Recently, it has been shown that such structures can be expressed as feature-based rules, called sketches, over a number of classical planning domains. These sketches split problems into subproblems which then become solvable in low polynomial time by a greedy sequence of IW$(k)$ searches. Methods for learning sketches using feature pools and min-SAT solvers have been developed, yet they face two key limitations: scalability and expressivity. In this work, we address these limitations by formulating the problem of learning sketch decompositions as a deep reinforcement learning (DRL) task, where general policies are sought in a modified planning problem where the successor states of a state s are defined as those reachable from s through an IW$(k)$ search. The sketch decompositions obtained through this method are experimentally evaluated across various domains, and problems are regarded as solved by the decomposition when the goal is reached through a greedy sequence of IW$(k)$ searches. While our DRL approach for learning sketch decompositions does not yield interpretable sketches in the form of rules, we demonstrate that the resulting decompositions can often be understood in a crisp manner.

SimuDICE: Offline Policy Optimization Through World Model Updates and DICE Estimation

Authors:Catalin E. Brita, Stephan Bongers, Frans A. Oliehoek
Date:2024-12-09 13:35:46

In offline reinforcement learning, deriving an effective policy from a pre-collected set of experiences is challenging due to the distribution mismatch between the target policy and the behavioral policy used to collect the data, as well as the limited sample size. Model-based reinforcement learning improves sample efficiency by generating simulated experiences using a learned dynamic model of the environment. However, these synthetic experiences often suffer from the same distribution mismatch. To address these challenges, we introduce SimuDICE, a framework that iteratively refines the initial policy derived from offline data using synthetically generated experiences from the world model. SimuDICE enhances the quality of these simulated experiences by adjusting the sampling probabilities of state-action pairs based on stationary DIstribution Correction Estimation (DICE) and the estimated confidence in the model's predictions. This approach guides policy improvement by balancing experiences similar to those frequently encountered with ones that have a distribution mismatch. Our experiments show that SimuDICE achieves performance comparable to existing algorithms while requiring fewer pre-collected experiences and planning steps, and it remains robust across varying data collection policies.

Strategizing Equitable Transit Evacuations: A Data-Driven Reinforcement Learning Approach

Authors:Fang Tang, Han Wang, Maria Laura Delle Monache
Date:2024-12-08 02:17:38

As natural disasters become increasingly frequent, the need for efficient and equitable evacuation planning has become more critical. This paper proposes a data-driven, reinforcement learning-based framework to optimize bus-based evacuations with an emphasis on improving both efficiency and equity. We model the evacuation problem as a Markov Decision Process solved by reinforcement learning, using real-time transit data from General Transit Feed Specification and transportation networks extracted from OpenStreetMap. The reinforcement learning agent dynamically reroutes buses from their scheduled location to minimize total passengers' evacuation time while prioritizing equity-priority communities. Simulations on the San Francisco Bay Area transportation network indicate that the proposed framework achieves significant improvements in both evacuation efficiency and equitable service distribution compared to traditional rule-based and random strategies. These results highlight the potential of reinforcement learning to enhance system performance and urban resilience during emergency evacuations, offering a scalable solution for real-world applications in intelligent transportation systems.

Policy-shaped prediction: avoiding distractions in model-based reinforcement learning

Authors:Miles Hutson, Isaac Kauvar, Nick Haber
Date:2024-12-08 00:21:37

Model-based reinforcement learning (MBRL) is a promising route to sample-efficient policy optimization. However, a known vulnerability of reconstruction-based MBRL consists of scenarios in which detailed aspects of the world are highly predictable, but irrelevant to learning a good policy. Such scenarios can lead the model to exhaust its capacity on meaningless content, at the cost of neglecting important environment dynamics. While existing approaches attempt to solve this problem, we highlight its continuing impact on leading MBRL methods -- including DreamerV3 and DreamerPro -- with a novel environment where background distractions are intricate, predictable, and useless for planning future actions. To address this challenge we develop a method for focusing the capacity of the world model through synergy of a pretrained segmentation model, a task-aware reconstruction loss, and adversarial learning. Our method outperforms a variety of other approaches designed to reduce the impact of distractors, and is an advance towards robust model-based reinforcement learning.

Learning Soft Driving Constraints from Vectorized Scene Embeddings while Imitating Expert Trajectories

Authors:Niloufar Saeidi Mobarakeh, Behzad Khamidehi, Chunlin Li, Hamidreza Mirkhani, Fazel Arasteh, Mohammed Elmahgiubi, Weize Zhang, Kasra Rezaee, Pascal Poupart
Date:2024-12-07 18:29:28

The primary goal of motion planning is to generate safe and efficient trajectories for vehicles. Traditionally, motion planning models are trained using imitation learning to mimic the behavior of human experts. However, these models often lack interpretability and fail to provide clear justifications for their decisions. We propose a method that integrates constraint learning into imitation learning by extracting driving constraints from expert trajectories. Our approach utilizes vectorized scene embeddings that capture critical spatial and temporal features, enabling the model to identify and generalize constraints across various driving scenarios. We formulate the constraint learning problem using a maximum entropy model, which scores the motion planner's trajectories based on their similarity to the expert trajectory. By separating the scoring process into distinct reward and constraint streams, we improve both the interpretability of the planner's behavior and its attention to relevant scene components. Unlike existing constraint learning methods that rely on simulators and are typically embedded in reinforcement learning (RL) or inverse reinforcement learning (IRL) frameworks, our method operates without simulators, making it applicable to a wider range of datasets and real-world scenarios. Experimental results on the InD and TrafficJams datasets demonstrate that incorporating driving constraints enhances model interpretability and improves closed-loop performance.

AI Planning: A Primer and Survey (Preliminary Report)

Authors:Dillon Z. Chen, Pulkit Verma, Siddharth Srivastava, Michael Katz, Sylvie Thiébaux
Date:2024-12-07 04:00:25

Automated decision-making is a fundamental topic that spans multiple sub-disciplines in AI: reinforcement learning (RL), AI planning (AP), foundation models, and operations research, among others. Despite recent efforts to "bridge the gaps" between these communities, there remain many insights that have not yet transcended the boundaries. Our goal in this paper is to provide a brief and non-exhaustive primer on ideas well-known in AP, but less so in other sub-disciplines. We do so by introducing the classical AP problem and representation, and extensions that handle uncertainty and time through the Markov Decision Process formalism. Next, we survey state-of-the-art techniques and ideas for solving AP problems, focusing on their ability to exploit problem structure. Lastly, we cover subfields within AP for learning structure from unstructured inputs and learning to generalise to unseen scenarios and situations.

Intersection-Aware Assessment of EMS Accessibility in NYC: A Data-Driven Approach

Authors:Haoran Su, Joseph Y. J. Chow
Date:2024-12-05 17:38:03

Emergency response times are critical in densely populated urban environments like New York City (NYC), where traffic congestion significantly impedes emergency vehicle (EMV) mobility. This study introduces an intersection-aware emergency medical service (EMS) accessibility model to evaluate and improve EMV travel times across NYC. Integrating intersection density metrics, road network characteristics, and demographic data, the model identifies vulnerable regions with inadequate EMS coverage. The analysis reveals that densely interconnected areas, such as parts of Staten Island, Queens, and Manhattan, experience significant accessibility deficits due to intersection delays and sparse medical infrastructure. To address these challenges, this study explores the adoption of EMVLight, a multi-agent reinforcement learning framework, which demonstrates the potential to reduce intersection delays by 50%, increasing EMS accessibility to 95% of NYC residents within the critical benchmark of 4 minutes. Results indicate that advanced traffic signal control (TSC) systems can alleviate congestion-induced delays while improving equity in emergency response. The findings provide actionable insights for urban planning and policy interventions to enhance EMS accessibility and ensure timely care for underserved populations.

AI-Driven Day-to-Day Route Choice

Authors:Leizhen Wang, Peibo Duan, Zhengbing He, Cheng Lyu, Xin Chen, Nan Zheng, Li Yao, Zhenliang Ma
Date:2024-12-04 14:13:38

Understanding travelers' route choices can help policymakers devise optimal operational and planning strategies for both normal and abnormal circumstances. However, existing choice modeling methods often rely on predefined assumptions and struggle to capture the dynamic and adaptive nature of travel behavior. Recently, Large Language Models (LLMs) have emerged as a promising alternative, demonstrating remarkable ability to replicate human-like behaviors across various fields. Despite this potential, their capacity to accurately simulate human route choice behavior in transportation contexts remains doubtful. To satisfy this curiosity, this paper investigates the potential of LLMs for route choice modeling by introducing an LLM-empowered agent, "LLMTraveler." This agent integrates an LLM as its core, equipped with a memory system that learns from past experiences and makes decisions by balancing retrieved data and personality traits. The study systematically evaluates the LLMTraveler's ability to replicate human-like decision-making through two stages of day-to-day (DTD) congestion games: (1) analyzing its route-switching behavior in single origin-destination (OD) pair scenarios, where it demonstrates patterns that align with laboratory data but cannot be fully explained by traditional models, and (2) testing its capacity to model adaptive learning behaviors in multi-OD scenarios on the Ortuzar and Willumsen (OW) network, producing results comparable to Multinomial Logit (MNL) and Reinforcement Learning (RL) models. These experiments demonstrate that the framework can partially replicate human-like decision-making in route choice while providing natural language explanations for its decisions. This capability offers valuable insights for transportation policymaking, such as simulating traveler responses to new policies or changes in the network.
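
For readers unfamiliar with the Multinomial Logit (MNL) baseline mentioned above, a minimal sketch of its route-choice probabilities follows. Using negative travel cost as the utility and the sensitivity parameter `theta` are common textbook conventions assumed here, not details taken from the paper.

```python
import math


def mnl_probabilities(costs, theta=1.0):
    """Multinomial Logit choice probabilities over routes.

    Each route's utility is assumed to be -theta * cost, so cheaper
    routes are chosen more often; theta controls cost sensitivity.
    """
    utilities = [-theta * c for c in costs]
    u_max = max(utilities)  # subtract the maximum for numerical stability
    exp_u = [math.exp(u - u_max) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]
```

Two routes with equal travel cost are chosen with probability 0.5 each. As `theta` grows, the choice concentrates on the cheapest route; as it approaches zero, the choice becomes uniform regardless of cost.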

Experience-driven discovery of planning strategies

Authors:Ruiqi He, Falk Lieder
Date:2024-12-04 08:20:03

One explanation for how people can plan efficiently despite limited cognitive resources is that we possess a set of adaptive planning strategies and know when and how to use them. But how are these strategies acquired? While previous research has studied how individuals learn to choose among existing strategies, little is known about the process of forming new planning strategies. In this work, we propose that new planning strategies are discovered through metacognitive reinforcement learning. To test this, we designed a novel experiment to investigate the discovery of new planning strategies. We then present metacognitive reinforcement learning models and demonstrate their capability for strategy discovery as well as show that they provide a better explanation of human strategy discovery than alternative learning mechanisms. However, when fitted to human data, these models exhibit a slower discovery rate than humans, leaving room for improvement.

Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous Autonomous Surface Vehicles with Deep Reinforcement Learning

Authors:Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín
Date:2024-12-03 09:32:02

This paper presents a model-free deep reinforcement learning framework for informative path planning with heterogeneous fleets of autonomous surface vehicles to locate and collect plastic waste. The system employs two teams of vehicles: scouts and cleaners. Coordination between these teams is achieved through a deep reinforcement learning approach, allowing agents to learn strategies to maximize cleaning efficiency. The primary objective is for the scout team to provide an up-to-date contamination model, while the cleaner team collects as much waste as possible following this model. This strategy leads to heterogeneous teams that optimize fleet efficiency through inter-team cooperation supported by a tailored reward function. Different training runs of the proposed algorithm are compared with other state-of-the-art heuristics in two distinct scenarios, one with high convexity and another with narrow corridors and challenging access. The results show that algorithms based on deep reinforcement learning outperform other benchmark heuristics, exhibiting superior adaptability. In addition, training with greedy actions further enhances performance, particularly in scenarios with intricate layouts.

The Problem of Social Cost in Multi-Agent General Reinforcement Learning: Survey and Synthesis

Authors:Kee Siong Ng, Samuel Yang-Zhao, Timothy Cadogan-Cowper
Date:2024-12-03 02:22:55

The AI safety literature is full of examples of powerful AI agents that, in blindly pursuing a specific and usually narrow objective, end up causing unacceptable and even catastrophic collateral damage to others. In this paper, we consider the problem of social harms that can result from actions taken by learning and utility-maximising agents in a multi-agent environment. The problem of measuring social harms or impacts in such multi-agent settings, especially when the agents are artificial generally intelligent (AGI) agents, was listed as an open problem in Everitt et al., 2018. We attempt a partial answer to that open problem in the form of market-based mechanisms to quantify and control the cost of such social harms. The proposed setup captures many well-studied special cases and is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways: (i) the underlying environment is a history-based general reinforcement learning environment like in AIXI; (ii) the reinforcement-learning agents participating in the environment can have different learning strategies and planning horizons. To demonstrate the practicality of the proposed setup, we survey some key classes of learning algorithms and present a few applications, including a discussion of the Paperclips problem and pollution control with a cap-and-trade system.

Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support

Authors:Anubha Mahajan, Shreya Hegde, Ethan Shay, Daniel Wu, Aviva Prins
Date:2024-12-03 00:30:19

In India, the majority of farmers are classified as small or marginal, making their livelihoods particularly vulnerable to economic losses due to market saturation and climate risks. Effective crop planning can significantly impact their expected income, yet existing decision support systems (DSS) often provide generic recommendations that fail to account for real-time market dynamics and the interactions among multiple farmers. In this paper, we evaluate the viability of three multi-agent reinforcement learning (MARL) approaches for optimizing total farmer income and promoting fairness in crop planning: Independent Q-Learning (IQL), where each farmer acts independently without coordination, Agent-by-Agent (ABA), which sequentially optimizes each farmer's policy in relation to the others, and the Multi-agent Rollout Policy, which jointly optimizes all farmers' actions for global reward maximization. Our results demonstrate that while IQL offers computational efficiency with linear runtime, it struggles with coordination among agents, leading to lower total rewards and an unequal distribution of income. Conversely, the Multi-agent Rollout policy achieves the highest total rewards and promotes equitable income distribution among farmers but requires significantly more computational resources, making it less practical for large numbers of agents. ABA strikes a balance between runtime efficiency and reward optimization, offering reasonable total rewards with acceptable fairness and scalability. These findings highlight the importance of selecting appropriate MARL approaches in DSS to provide personalized and equitable crop planning recommendations, advancing the development of more adaptive and farmer-centric agricultural decision-making systems.

Hierarchical Object-Oriented POMDP Planning for Object Rearrangement

Authors:Rajesh Mangannavar, Alan Fern, Prasad Tadepalli
Date:2024-12-02 10:19:36

We present an online planning framework for solving multi-object rearrangement problems in partially observable, multi-room environments. Current object rearrangement solutions, primarily based on Reinforcement Learning or hand-coded planning methods, often lack adaptability to diverse challenges. To address this limitation, we introduce a novel Hierarchical Object-Oriented Partially Observed Markov Decision Process (HOO-POMDP) planning approach. This approach comprises (a) an object-oriented POMDP planner generating sub-goals, (b) a set of low-level policies for sub-goal achievement, and (c) an abstraction system converting the continuous low-level world into a representation suitable for abstract planning. We evaluate our system on varying numbers of objects, rooms, and problem types in AI2-THOR simulated environments with promising results.

Generating Freeform Endoskeletal Robots

Authors:Muhan Li, Lingji Kong, Sam Kriegman
Date:2024-12-02 01:40:04

The automatic design of embodied agents (e.g. robots) has existed for 31 years and is experiencing a renaissance of interest in the literature. To date, however, the field has remained narrowly focused on two kinds of anatomically simple robots: (1) fully rigid, jointed bodies; and (2) fully soft, jointless bodies. Here we bridge these two extremes with the open-ended creation of terrestrial endoskeletal robots: deformable soft bodies that leverage jointed internal skeletons to move efficiently across land. Simultaneous de novo generation of external and internal structures is achieved by (i) modeling 3D endoskeletal body plans as integrated collections of elastic and rigid cells that directly attach to form soft tissues anchored to compound rigid bodies; (ii) encoding these discrete mechanical subsystems into a continuous yet coherent latent embedding; (iii) optimizing the sensorimotor coordination of each decoded design using model-free reinforcement learning; and (iv) navigating this smooth yet highly non-convex latent manifold using evolutionary strategies. This yields an endless stream of novel species of "higher robots" that, like all higher animals, harness the mechanical advantages of both elastic tissues and skeletal levers for terrestrial travel. It also provides a plug-and-play experimental platform for benchmarking evolutionary design and representation learning algorithms in complex hierarchical embodied systems.

Learning Dynamic Weight Adjustment for Spatial-Temporal Trajectory Planning in Crowd Navigation

Authors:Muqing Cao, Xinhang Xu, Yizhuo Yang, Jianping Li, Tongxing Jin, Pengfei Wang, Tzu-Yi Hung, Guosheng Lin, Lihua Xie
Date:2024-11-30 18:53:34

Robot navigation in dense human crowds poses a significant challenge due to the complexity of human behavior in dynamic and obstacle-rich environments. In this work, we propose a dynamic weight adjustment scheme using a neural network to predict the optimal weights of objectives in an optimization-based motion planner. We adopt a spatial-temporal trajectory planner and incorporate diverse objectives to achieve a balance among safety, efficiency, and goal achievement in complex and dynamic environments. We design the network structure, observation encoding, and reward function to effectively train the policy network using reinforcement learning, allowing the robot to adapt its behavior in real time based on environmental and pedestrian information. Simulation results show improved safety compared to the fixed-weight planner and the state-of-the-art learning-based methods, and verify the ability of the learned policy to adaptively adjust the weights based on the observed situations. The approach's feasibility is demonstrated in a navigation task using an autonomous delivery robot across a crowded corridor over a 300 m distance.

PlanCritic: Formal Planning with Human Feedback

Authors:Owen Burns, Dana Hughes, Katia Sycara
Date:2024-11-30 00:58:48

Real-world planning problems are often too complex to be effectively tackled by a single unaided human. To alleviate this, some recent work has focused on developing collaborative planning systems to assist humans in complex domains, with bridging the gap between the system's problem representation and the real world being a key consideration. Transferring the speed and correctness that formal planners provide to real-world planning problems is greatly complicated by the dynamic and online nature of such tasks. Formal specifications of task and environment dynamics frequently lack constraints on some behaviors or goal conditions relevant to the way a human operator prefers a plan to be carried out. While adding constraints to the representation with the objective of increasing its realism risks slowing down the planner, we posit that the same benefits can be realized without sacrificing speed by modeling this problem as an online preference learning task. As part of a broader cooperative planning system, we present a feedback-driven plan critic. This method makes use of reinforcement learning with human feedback in conjunction with a genetic algorithm to directly optimize a plan with respect to natural-language user preferences despite the non-differentiability of traditional planners. Directly optimizing the plan bridges the gap between research into more efficient planners and research into planning with language models by utilizing the convenience of natural language to guide the output of formal planners. We demonstrate the effectiveness of our plan critic at adhering to user preferences on a disaster recovery task, and observe improved performance compared to an LLM-only neurosymbolic approach.

PDDLFuse: A Tool for Generating Diverse Planning Domains

Authors:Vedant Khandelwal, Amit Sheth, Forest Agostinelli
Date:2024-11-29 17:52:39

Various real-world challenges require planning algorithms that can adapt to a broad range of domains. Traditionally, the creation of planning domains has relied heavily on human implementation, which limits the scale and diversity of available domains. While recent advancements have leveraged generative AI technologies such as large language models (LLMs) for domain creation, these efforts have predominantly focused on translating existing domains from natural language descriptions rather than generating novel ones. In contrast, the concept of domain randomization, which has been highly effective in reinforcement learning, enhances performance and generalizability by training on a diverse array of randomized new domains. Inspired by this success, our tool, PDDLFuse, aims to bridge this gap in Planning Domain Definition Language (PDDL). PDDLFuse is designed to generate new, diverse planning domains that can be used to validate new planners or test foundational planning models. We have developed methods to adjust the domain generator's parameters to modulate the difficulty of the domains it generates. This adaptability is crucial as existing domain-independent planners often struggle with more complex problems. Initial tests indicate that PDDLFuse efficiently creates intricate and varied domains, representing a significant advancement over traditional domain generation methods and making a contribution towards planning research.

SANGO: Socially Aware Navigation through Grouped Obstacles

Authors:Rahath Malladi, Amol Harsh, Arshia Sangwan, Sunita Chauhan, Sandeep Manjanna
Date:2024-11-29 06:29:46

This paper introduces SANGO (Socially Aware Navigation through Grouped Obstacles), a novel method that ensures socially appropriate behavior by dynamically grouping obstacles and adhering to social norms. Using deep reinforcement learning, SANGO trains agents to navigate complex environments leveraging the DBSCAN algorithm for obstacle clustering and Proximal Policy Optimization (PPO) for path planning. The proposed approach improves safety and social compliance by maintaining appropriate distances and reducing collision rates. Extensive experiments conducted in custom simulation environments demonstrate SANGO's superior performance in significantly reducing discomfort (by up to 83.5%), reducing collision rates (by up to 29.4%) and achieving higher successful navigation in dynamic and crowded scenarios. These findings highlight the potential of SANGO for real-world applications, paving the way for advanced socially adept robotic navigation systems.

NeoHebbian Synapses to Accelerate Online Training of Neuromorphic Hardware

Authors:Shubham Pande, Sai Sukruth Bezugam, Tinish Bhattacharya, Ewelina Wlazlak, Anjan Chakaravorty, Bhaswar Chakrabarti, Dmitri Strukov
Date:2024-11-27 12:06:15

Neuromorphic systems that employ advanced synaptic learning rules, such as the three-factor learning rule, require synaptic devices of increased complexity. Herein, a novel neoHebbian artificial synapse utilizing ReRAM devices has been proposed and experimentally validated to meet this demand. This synapse features two distinct state variables: a neuron coupling weight and an "eligibility trace" that dictates synaptic weight updates. The coupling weight is encoded in the ReRAM conductance, while the "eligibility trace" is encoded in the local temperature of the ReRAM and is modulated by applying voltage pulses to a physically co-located resistive heating element. The utility of the proposed synapse has been investigated using two representative tasks: first, temporal signal classification using Recurrent Spiking Neural Networks (RSNNs) employing the e-prop algorithm, and second, Reinforcement Learning (RL) for path planning tasks in feedforward networks using a modified version of the same learning rule. System-level simulations, accounting for various device and system-level non-idealities, confirm that these synapses offer a robust solution for the fast, compact, and energy-efficient implementation of advanced learning rules in neuromorphic hardware.

Scalable Multi-Objective Reinforcement Learning with Fairness Guarantees using Lorenz Dominance

Authors:Dimitris Michailidis, Willem Röpke, Diederik M. Roijers, Sennay Ghebreab, Fernando P. Santos
Date:2024-11-27 10:16:25

Multi-Objective Reinforcement Learning (MORL) aims to learn a set of policies that optimize trade-offs between multiple, often conflicting objectives. MORL is computationally more complex than single-objective RL, particularly as the number of objectives increases. Additionally, when objectives involve the preferences of agents or groups, ensuring fairness is socially desirable. This paper introduces a principled algorithm that incorporates fairness into MORL while improving scalability to many-objective problems. We propose using Lorenz dominance to identify policies with equitable reward distributions and introduce $\lambda$-Lorenz dominance to enable flexible fairness preferences. We release a new, large-scale real-world transport planning environment and demonstrate that our method encourages the discovery of fair policies, showing improved scalability in two large cities (Xi'an and Amsterdam). Our methods outperform common multi-objective approaches, particularly in high-dimensional objective spaces.
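
To make the fairness criterion concrete, the sketch below implements the standard textbook definition of Lorenz dominance: a return vector is mapped to its Lorenz vector (cumulative sums of its components sorted in ascending order), and one vector Lorenz-dominates another when its Lorenz vector Pareto-dominates the other's. The function names and example vectors are hypothetical, and the $\lambda$-Lorenz relaxation introduced in the paper is not shown because the abstract does not define it.

```python
from itertools import accumulate


def lorenz_vector(returns):
    """Cumulative sums of the returns sorted from worst-off to best-off."""
    return list(accumulate(sorted(returns)))


def pareto_dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def lorenz_dominates(a, b):
    """True if a's Lorenz vector Pareto-dominates b's Lorenz vector."""
    return pareto_dominates(lorenz_vector(a), lorenz_vector(b))


def lorenz_front(candidates):
    """Keep only the vectors that no other candidate Lorenz-dominates."""
    return [c for c in candidates
            if not any(lorenz_dominates(other, c) for other in candidates)]


# Hypothetical per-group returns of four policies (two objectives each).
policies = [(4, 4), (7, 1), (2, 2), (1, 9)]
front = lorenz_front(policies)
```

In this toy set, (4, 4) and (7, 1) are Pareto-incomparable, yet (4, 4) Lorenz-dominates (7, 1) because it gives the same total with a better-off worst group. This is why the Lorenz front is typically a subset of the Pareto front, which helps when the number of objectives is large.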

Self-reconfiguration Strategies for Space-distributed Spacecraft

Authors:Tianle Liu, Zhixiang Wang, Yongwei Zhang, Ziwei Wang, Zihao Liu, Yizhai Zhang, Panfeng Huang
Date:2024-11-26 06:05:44

This paper proposes a distributed on-orbit spacecraft assembly algorithm, where future spacecraft can assemble modules with different functions on orbit to form a spacecraft structure with specific functions. This form of spacecraft organization has the advantages of reconfigurability, fast mission response and easy maintenance. Reasonable and efficient on-orbit self-reconfiguration algorithms play a crucial role in realizing the benefits of distributed spacecraft. This paper adopts the framework of imitation learning combined with reinforcement learning for strategy learning of module handling order. A robot arm motion algorithm is then designed to execute the handling sequence. We achieve the self-reconfiguration handling task by creating a map on the surface of the module, completing the path point planning of the robotic arm using A*. The joint planning of the robotic arm is then accomplished through forward and reverse kinematics. Finally, the results are presented in Unity3D.

Don't Command, Cultivate: An Exploratory Study of System-2 Alignment

Authors:Yuhang Wang, Yuxiang Zhang, Yanxu Zhu, Xinyan Wen, Jitao Sang
Date:2024-11-26 03:27:43

The o1 system card identifies the o1 models as the most robust within OpenAI, with their defining characteristic being the progression from rapid, intuitive thinking to slower, more deliberate reasoning. This observation motivated us to investigate the influence of System-2 thinking patterns on model safety. In our preliminary research, we conducted safety evaluations of the o1 model, including complex jailbreak attack scenarios using adversarial natural language prompts and mathematical encoding prompts. Our findings indicate that the o1 model demonstrates relatively improved safety performance; however, it still exhibits vulnerabilities, particularly against jailbreak attacks employing mathematical encoding. Through detailed case analysis, we identified specific patterns in the o1 model's responses. We also explored the alignment of System-2 safety in open-source models using prompt engineering and supervised fine-tuning techniques. Experimental results show that some simple methods to encourage the model to carefully scrutinize user requests are beneficial for model safety. Additionally, we proposed an implementation plan for process supervision to enhance safety alignment. The implementation details and experimental results will be provided in future versions.

CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning

Authors:Duo Wu, Jinghe Wang, Yuan Meng, Yanning Zhang, Le Sun, Zhi Wang
Date:2024-11-25 12:05:49

Utilizing large language models (LLMs) for tool planning has emerged as a promising avenue for developing general AI systems, where LLMs automatically schedule external tools (e.g. vision models) to tackle complex tasks based on task descriptions. To push this paradigm toward practical applications, it is crucial for LLMs to consider tool execution costs (e.g. execution time) for tool planning. Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans whose costs outweigh task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. Specifically, CATP-LLM incorporates a tool planning language that enables the LLM to generate non-sequential plans with multiple branches for efficient concurrent tool execution and cost reduction. Moreover, it designs a cost-aware offline reinforcement learning algorithm to fine-tune the LLM to optimize the performance-cost trade-off in tool planning. Given the lack of public cost-related datasets, we further present OpenCATP, the first platform for cost-aware planning evaluation. Experiments on OpenCATP show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone, with an average of 28.2%-30.2% higher plan performance and 24.7%-45.8% lower costs even on the challenging planning tasks. The codes of CATP-LLM and OpenCATP will be publicly available.

Can flocking aid the path planning of microswimmers in turbulent flows?

Authors:Akanksha Gupta, Jaya Kumar Alageshan, Kolluru Venkata Kiran, Rahul Pandit
Date:2024-11-24 16:25:33

We show that flocking of microswimmers can enhance the efficacy of reinforcement-learning-based path planning in turbulent flows. In particular, we develop a machine-learning strategy that incorporates Vicsek-model-type flocking in microswimmer assemblies in a statistically homogeneous and isotropic turbulent flow in two dimensions (2D). We build on the adversarial-reinforcement-learning strategy of Ref. [alageshan2020machine] for non-interacting microswimmers in turbulent flows. Such microswimmers aim to move optimally from an initial position to a target. We demonstrate that our flocking-aided version of the adversarial-reinforcement-learning strategy of Ref. [alageshan2020machine] can be superior to earlier microswimmer path-planning strategies.
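
For readers unfamiliar with Vicsek-type flocking, the following sketch shows one update step of the standard 2D Vicsek model: each particle adopts the mean heading of all neighbours within a fixed radius (itself included), perturbed by uniform angular noise, and then moves at constant speed. It deliberately omits the turbulent background flow, the reinforcement-learning controller, and periodic boundaries, none of which are specified in the abstract; the function name and parameters are hypothetical.

```python
import math
import random


def vicsek_step(positions, headings, radius, speed, noise, rng, dt=1.0):
    """Advance all particles by one step of the standard Vicsek alignment rule."""
    new_headings = []
    for xi, yi in positions:
        sum_x = sum_y = 0.0
        # Sum the unit heading vectors of every neighbour within `radius`.
        for (xj, yj), theta in zip(positions, headings):
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                sum_x += math.cos(theta)
                sum_y += math.sin(theta)
        mean_heading = math.atan2(sum_y, sum_x)
        # Uniform angular noise in [-noise/2, noise/2].
        new_headings.append(mean_heading + rng.uniform(-noise / 2, noise / 2))
    # Move each particle along its new heading at constant speed.
    new_positions = [
        (x + speed * math.cos(theta) * dt, y + speed * math.sin(theta) * dt)
        for (x, y), theta in zip(positions, new_headings)
    ]
    return new_positions, new_headings


rng = random.Random(0)
positions = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)]
headings = [0.0, math.pi / 2, math.pi]
positions, headings = vicsek_step(positions, headings, radius=1.0,
                                  speed=0.1, noise=0.0, rng=rng)
```

With zero noise, the two nearby particles converge to their common mean heading while the isolated third particle keeps its own, which is the alignment effect a flocking-aided strategy would add on top of each swimmer's individual steering.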

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

Authors:Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel
Date:2024-11-20 18:54:32

Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, all areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, including tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.

Effective Analog ICs Floorplanning with Relational Graph Neural Networks and Reinforcement Learning

Authors:Davide Basso, Luca Bortolussi, Mirjana Videnovic-Misic, Husni Habal
Date:2024-11-20 12:11:12

Analog integrated circuit (IC) floorplanning is typically a manual process with the placement of components (devices and modules) planned by a layout engineer. This process is further complicated by the interdependence of floorplanning and routing steps, numerous electric and layout-dependent constraints, as well as the high level of customization expected in analog design. This paper presents a novel automatic floorplanning algorithm based on reinforcement learning. It is augmented by a relational graph convolutional neural network model for encoding circuit features and positional constraints. The combination of these two machine learning methods enables knowledge transfer across different circuit designs with distinct topologies and constraints, increasing the generalization ability of the solution. Applied to 6 industrial circuits, our approach surpassed established floorplanning techniques in terms of speed, area and half-perimeter wire length. When integrated into a procedural generator for layout completion, overall layout time was reduced by 67.3% with an 8.3% mean area reduction compared to manual layout.

Upside-Down Reinforcement Learning for More Interpretable Optimal Control

Authors:Juan Cardenas-Cartagena, Massimiliano Falzari, Marco Zullich, Matthia Sabatelli
Date:2024-11-18 10:44:20

Model-Free Reinforcement Learning (RL) algorithms either learn how to map states to expected rewards or search for policies that can maximize a certain performance function. Model-Based algorithms, by contrast, aim to learn an approximation of the underlying model of the RL environment and then use it in combination with planning algorithms. Upside-Down Reinforcement Learning (UDRL) is a novel learning paradigm that aims to learn how to predict actions from states and desired commands. This task is formulated as a Supervised Learning problem and has successfully been tackled by Neural Networks (NNs). In this paper, we investigate whether function approximation algorithms other than NNs can also be used within a UDRL framework. Our experiments, performed over several popular optimal control benchmarks, show that tree-based methods like Random Forests and Extremely Randomized Trees can perform just as well as NNs with the significant benefit of resulting in policies that are inherently more interpretable than NNs, therefore paving the way for more transparent, safe, and robust RL.
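
The supervised formulation can be illustrated by how one recorded episode is turned into training examples. The sketch below assumes the common UDRL convention in which a command is the pair (desired return, desired horizon) and each time step is labelled with the return and number of steps actually observed until the end of the episode; the abstract itself only says "states and desired commands", so this convention, the function name, and the toy episode are assumptions.

```python
def udrl_training_pairs(states, actions, rewards):
    """Convert one episode into ((state, desired_return, horizon), action) pairs.

    states[t], actions[t], rewards[t] describe step t of a finished episode.
    The command attached to step t is what actually happened afterwards:
    the return collected from t to the end, and the number of steps left.
    """
    n_steps = len(actions)
    pairs = []
    return_to_go = 0.0
    # Walk backwards so the return-to-go is accumulated in one pass.
    for t in reversed(range(n_steps)):
        return_to_go += rewards[t]
        horizon = n_steps - t
        pairs.append(((states[t], return_to_go, horizon), actions[t]))
    pairs.reverse()
    return pairs


# Toy episode with hypothetical state labels and discrete actions.
pairs = udrl_training_pairs(
    states=["s0", "s1", "s2"],
    actions=[0, 1, 0],
    rewards=[1.0, 0.0, 2.0],
)
```

Any classifier or regressor that maps the (state, desired return, horizon) input to an action can then be fitted on such pairs, which is what makes it possible to swap a neural network for a tree-based model.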

BMP: Bridging the Gap between B-Spline and Movement Primitives

Authors:Weiran Liao, Ge Li, Hongyi Zhou, Rudolf Lioutikov, Gerhard Neumann
Date:2024-11-15 16:35:26

This work introduces B-spline Movement Primitives (BMPs), a new Movement Primitive (MP) variant that leverages B-splines for motion representation. B-splines are a well-known concept in motion planning due to their ability to generate complex, smooth trajectories with only a few control points while satisfying boundary conditions, i.e., passing through a specified desired position with desired velocity. However, current usages of B-splines tend to ignore the higher-order statistics in trajectory distributions, which limits their usage in imitation learning (IL) and reinforcement learning (RL), where modeling trajectory distribution is essential. In contrast, MPs are commonly used in IL and RL for their capacity to capture trajectory likelihoods and correlations. However, MPs are constrained by their abilities to satisfy boundary conditions and usually need extra terms in learning objectives to satisfy velocity constraints. By reformulating B-splines as MPs, represented through basis functions and weight parameters, BMPs combine the strengths of both approaches, allowing B-splines to capture higher-order statistics while retaining their ability to satisfy boundary conditions. Empirical results in IL and RL demonstrate that BMPs broaden the applicability of B-splines in robot learning and offer greater expressiveness compared to existing MP variants.

The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning

Authors:Moritz Schneider, Robert Krug, Narunas Vaskevicius, Luigi Palmieri, Joschka Boedecker
Date:2024-11-15 13:21:26

Visual Reinforcement Learning (RL) methods often require extensive amounts of data. As opposed to model-free RL, model-based RL (MBRL) offers a potential solution with efficient data utilization through planning. Additionally, RL lacks generalization capabilities for real-world tasks. Prior work has shown that incorporating pre-trained visual representations (PVRs) enhances sample efficiency and generalization. While PVRs have been extensively studied in the context of model-free RL, their potential in MBRL remains largely unexplored. In this paper, we benchmark a set of PVRs on challenging control tasks in a model-based RL setting. We investigate the data efficiency, generalization capabilities, and the impact of different properties of PVRs on the performance of model-based agents. Our results, perhaps surprisingly, reveal that for MBRL current PVRs are not more sample efficient than learning representations from scratch, and that they do not generalize better to out-of-distribution (OOD) settings. To explain this, we analyze the quality of the trained dynamics model. Furthermore, we show that data diversity and network architecture are the most important contributors to OOD generalization performance.

DNN Task Assignment in UAV Networks: A Generative AI Enhanced Multi-Agent Reinforcement Learning Approach

Authors:Xin Tang, Qian Chen, Wenjie Weng, Binhan Liao, Jiacheng Wang, Xianbin Cao, Xiaohuan Li
Date:2024-11-13 02:41:02

Unmanned Aerial Vehicles (UAVs) possess high mobility and flexible deployment capabilities, prompting the development of UAVs for various application scenarios within the Internet of Things (IoT). The unique capabilities of UAVs give rise to increasingly critical and complex tasks in uncertain and potentially harsh environments. The substantial amount of data generated from these applications necessitates processing and analysis through deep neural networks (DNNs). However, UAVs encounter challenges due to their limited computing resources when managing DNN models. This paper presents a joint approach that combines multi-agent reinforcement learning (MARL) and generative diffusion models (GDM) for assigning DNN tasks to a UAV swarm, aimed at reducing latency from task capture to result output. To address these challenges, we first consider the task size of the target area to be inspected and the shortest flying path as optimization constraints, employing a greedy algorithm to resolve the subproblem with a focus on minimizing the UAV's flying path and the overall system cost. In the second stage, we introduce a novel DNN task assignment algorithm, termed GDM-MADDPG, which utilizes the reverse denoising process of GDM to replace the actor network in multi-agent deep deterministic policy gradient (MADDPG). This approach generates specific DNN task assignment actions based on agents' observations in a dynamic environment. Simulation results indicate that our algorithm performs favorably compared to benchmarks in terms of path planning, Age of Information (AoI), energy consumption, and task load balancing.

Navigation with QPHIL: Quantizing Planner for Hierarchical Implicit Q-Learning

Authors:Alexi Canesse, Mathieu Petitbois, Ludovic Denoyer, Sylvain Lamprier, Rémy Portelas
Date:2024-11-12 12:49:41

Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. An existing challenge with Offline RL is the signal-to-noise ratio, i.e. how to mitigate incorrect policy updates due to errors in value estimates. Towards this, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouple high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach leveraging a learned quantizer of the space. This quantization enables the training of a simpler zone-conditioned low-level policy and simplifies planning, which is reduced to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our proposed approach achieves state-of-the-art results in complex long-distance navigation environments.

Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

Authors:Ruiquan Huang, Yingbin Liang, Jing Yang
Date:2024-11-12 03:22:56

Distributionally robust offline reinforcement learning (RL) aims to find a policy that performs the best under the worst environment within an uncertainty set using an offline dataset collected from a nominal model. While recent advances in robust RL focus on Markov decision processes (MDPs), robust non-Markovian RL is limited to the planning problem, where the transitions in the uncertainty set are known. In this paper, we study the learning problem of robust offline non-Markovian RL. Specifically, when the nominal model admits a low-rank structure, we propose a new algorithm, featuring a novel dataset distillation and a lower confidence bound (LCB) design for robust values under different types of uncertainty sets. We also derive new dual forms for these robust values in non-Markovian RL, making our algorithm more amenable to practical implementation. By further introducing a novel type-I concentrability coefficient tailored for offline low-rank non-Markovian decision processes, we prove that our algorithm can find an $\epsilon$-optimal robust policy using $O(1/\epsilon^2)$ offline samples. Moreover, we extend our algorithm to the case when the nominal model does not have a specific structure. With a new type-II concentrability coefficient, the extended algorithm also enjoys polynomial sample efficiency under all the different types of uncertainty sets.

Research on reinforcement learning based warehouse robot navigation algorithm in complex warehouse layout

Authors:Keqin Li, Lipeng Liu, Jiajing Chen, Dezhi Yu, Xiaofan Zhou, Ming Li, Congyu Wang, Zhao Li
Date:2024-11-09 09:44:03

Efficiently finding the optimal path and making real-time decisions in a complex warehouse layout is a key problem. This paper proposes a new method, Proximal Policy-Dijkstra (PP-D), that combines Proximal Policy Optimization (PPO) with Dijkstra's algorithm. PP-D achieves efficient strategy learning and real-time decision making through PPO, and uses Dijkstra's algorithm to plan the globally optimal path, thus ensuring high navigation accuracy and significantly improving the efficiency of path planning. Specifically, PPO enables robots to quickly adapt and optimize action strategies in dynamic environments through its stable policy updating mechanism, while Dijkstra's algorithm ensures globally optimal path planning in static environments. Finally, comparison experiments against the traditional algorithm show that PP-D has significant advantages in improving the accuracy of navigation prediction and enhancing the robustness of the system. Especially in complex warehouse layouts, PP-D finds the optimal path more accurately and reduces collisions and stagnation. These results demonstrate the reliability and effectiveness of the proposed navigation algorithm for robots in complex warehouse layouts.
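
The global-planning half of such a scheme can be sketched with a plain Dijkstra search over an occupancy grid. The example below is a generic illustration, not the authors' PP-D implementation: the grid encoding (1 for a shelf or obstacle, 0 for free floor), the unit step cost, the 4-connected moves, and the function name are all assumptions, and the PPO component that would handle dynamic obstacles is not shown.

```python
import heapq


def dijkstra_grid(grid, start, goal):
    """Return the shortest 4-connected path from start to goal, or None.

    grid[r][c] == 1 marks a blocked cell; every move costs 1.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None
    # Walk the predecessor links back from the goal to rebuild the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical layout: a row of shelves with a single aisle at column 2.
warehouse = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = dijkstra_grid(warehouse, (0, 0), (2, 0))
```

In a hybrid design of this kind, the static path would serve as a reference route, while a learned policy decides moment to moment how to follow or deviate from it when other robots or obstacles appear.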

Evaluating Robustness of Reinforcement Learning Algorithms for Autonomous Shipping

Authors:Bavo Lesy, Ali Anwar, Siegfried Mercelis
Date:2024-11-07 17:55:07

Recently, there has been growing interest in autonomous shipping due to its potential to improve maritime efficiency and safety. The use of advanced technologies, such as artificial intelligence, can address the current navigational and operational challenges in autonomous shipping. In particular, inland waterway transport (IWT) presents a unique set of challenges, such as crowded waterways and variable environmental conditions. In such dynamic settings, the reliability and robustness of autonomous shipping solutions are critical factors for ensuring safe operations. This paper examines the robustness of benchmark deep reinforcement learning (RL) algorithms, implemented for IWT within an autonomous shipping simulator, and their ability to generate effective motion planning policies. We demonstrate that a model-free approach can achieve an adequate policy in the simulator, successfully navigating port environments never encountered during training. We focus particularly on Soft Actor-Critic (SAC), which we show to be inherently more robust to environmental disturbances compared to MuZero, a state-of-the-art model-based RL algorithm. In this paper, we take a significant step towards developing robust, applied RL frameworks that can be generalized to various vessel types and navigate complex port and inland environments and scenarios.

UEVAVD: A Dataset for Developing UAV's Eye View Active Object Detection

Authors:Xinhua Jiang, Tianpeng Liu, Li Liu, Zhen Liu, Yongxiang Liu
Date:2024-11-07 01:10:05

Occlusion is a longstanding difficulty that challenges UAV-based object detection. Many works address this problem by adapting the detection model. However, few of them exploit the fact that the UAV could fundamentally improve detection performance by changing its viewpoint. Active Object Detection (AOD) offers an effective way to achieve this purpose. Through Deep Reinforcement Learning (DRL), AOD endows the UAV with the ability to plan its path autonomously in search of observations that are more conducive to target identification. Unfortunately, no dataset is available for developing UAV AOD methods. To fill this gap, we release a UAV's eye view active vision dataset named UEVAVD and hope it can facilitate research on the UAV AOD problem. Additionally, we improve the existing DRL-based AOD method by incorporating inductive bias when learning the state representation. First, due to the partial observability, we use a gated recurrent unit to extract state representations from the observation sequence instead of the single-view observation. Second, we pre-decompose the scene with the Segment Anything Model (SAM) and filter out the irrelevant information with the derived masks. With these practices, the agent can learn an active viewing policy with better generalization capability. The effectiveness of our innovations is validated by experiments on the UEVAVD dataset. Our dataset will soon be available at https://github.com/Leo000ooo/UEVAVD_dataset.

The Unreasonable Effectiveness of LLMs for Query Optimization

Authors:Peter Akioyamen, Zixuan Yi, Ryan Marcus
Date:2024-11-05 07:10:00

Recent work in database query optimization has used complex machine learning strategies, such as customized reinforcement learning schemes. Surprisingly, we show that LLM embeddings of query text contain useful semantic information for query optimization. Specifically, we show that a simple binary classifier deciding between alternative query plans, trained only on a small number of labeled embedded query vectors, can outperform existing heuristic systems. Although we only present some preliminary results, an LLM-powered query optimizer could provide significant benefits, both in terms of performance and simplicity.
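As a rough illustration of the "simple binary classifier" idea above, the sketch below trains a tiny logistic-regression model that picks between two alternative plans from an embedded query vector. The 2-dimensional toy vectors, the labels, and the names train_logistic and predict are hypothetical; the abstract does not state which classifier, embedding model, or training procedure the authors use.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit weights and bias by full-batch gradient descent on the logistic loss."""
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 to pick the alternative plan, 0 to keep the default plan."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0 else 0


# Toy stand-ins for embedded queries (assumption: real embeddings are much larger).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]  # 1 = alternative plan was faster for this query
weights, bias = train_logistic(embeddings, labels)
```

In practice the vectors would come from an LLM embedding endpoint and the labels from measured plan runtimes; both are outside the scope of this sketch.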

Diversity Progress for Goal Selection in Discriminability-Motivated RL

Authors:Erik M. Lintunen, Nadia M. Ady, Christian Guckelsberger
Date:2024-11-03 10:47:39

Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically-motivated goal-conditioned RL: "Diversity Progress" (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent's certainty of following the true goal being pursued. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution -- a known issue with some prior approaches. We end with plans to take this proof-of-concept forward.
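One common way to express a discriminability-based intrinsic reward, used in DIAYN-style skill discovery, is r = log q(g | s) - log p(g), where q is a learned discriminator over goals and p is the goal prior. The sketch below computes that quantity; whether the paper uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def discriminability_reward(
    discriminator_probs: Sequence[float],
    goal_index: int,
    goal_prior: Sequence[float],
) -> float:
    """Reward is high when the discriminator is confident the agent pursues the true goal."""
    return math.log(discriminator_probs[goal_index]) - math.log(goal_prior[goal_index])


prior = [0.25, 0.25, 0.25, 0.25]
# Discriminator no better than the prior: zero reward.
r_uninformative = discriminability_reward([0.25, 0.25, 0.25, 0.25], 0, prior)
# Discriminator confident in the true goal: positive reward.
r_confident = discriminability_reward([0.7, 0.1, 0.1, 0.1], 0, prior)
# Discriminator confident in a different goal: negative reward.
r_confused = discriminability_reward([0.1, 0.7, 0.1, 0.1], 0, prior)
```

A goal-selection policy such as the one described above would then change the prior over goals, rather than sampling them uniformly.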

Learning World Models for Unconstrained Goal Navigation

Authors:Yuanlin Duan, Wensen Mao, He Zhu
Date:2024-11-03 01:35:06

Learning world models offers a promising avenue for goal-conditioned reinforcement learning with sparse rewards. By allowing agents to plan actions or exploratory goals without direct interaction with the environment, world models enhance exploration efficiency. The quality of a world model hinges on the richness of data stored in the agent's replay buffer, with expectations of reasonable generalization across the state space surrounding recorded trajectories. However, challenges arise in generalizing learned world models to state transitions backward along recorded trajectories or between states across different trajectories, hindering their ability to accurately model real-world dynamics. To address these challenges, we introduce a novel goal-directed exploration algorithm, MUN (short for "World Models for Unconstrained Goal Navigation"). This algorithm is capable of modeling state transitions between arbitrary subgoal states in the replay buffer, thereby facilitating the learning of policies to navigate between any "key" states. Experimental results demonstrate that MUN strengthens the reliability of world models and significantly improves the policy's capacity to generalize across new goal settings.

Enhancing Model-Based Step Adaptation for Push Recovery through Reinforcement Learning of Step Timing and Region

Authors:Tobias Egle, Yashuai Yan, Dongheui Lee, Christian Ott
Date:2024-11-01 19:51:37

This paper introduces a new approach to enhance the robustness of humanoid walking under strong perturbations, such as substantial pushes. Effective recovery from external disturbances requires bipedal robots to dynamically adjust their stepping strategies, including footstep positions and timing. Unlike most advanced walking controllers that restrict footstep locations to a predefined convex region, substantially limiting recoverable disturbances, our method leverages reinforcement learning to dynamically adjust the permissible footstep region, expanding it to a larger, effectively non-convex area and allowing cross-over stepping, which is crucial for counteracting large lateral pushes. Additionally, our method adapts footstep timing in real time to further extend the range of recoverable disturbances. Based on these adjustments, feasible footstep positions and DCM trajectory are planned by solving a QP. Finally, we employ a DCM controller and an inverse dynamics whole-body control framework to ensure the robot effectively follows the trajectory.

Zonal RL-RRT: Integrated RL-RRT Path Planning with Collision Probability and Zone Connectivity

Authors:AmirMohammad Tahmasbi, MohammadSaleh Faghfoorian, Saeed Khodaygan, Aniket Bera
Date:2024-10-31 17:57:51

Path planning in high-dimensional spaces poses significant challenges, particularly in achieving both time efficiency and a fair success rate. To address these issues, we introduce a novel path-planning algorithm, Zonal RL-RRT, that leverages kd-tree partitioning to segment the map into zones while addressing zone connectivity, ensuring seamless transitions between zones. By breaking down the complex environment into multiple zones and using Q-learning as the high-level decision-maker, our algorithm achieves a 3x improvement in time efficiency compared to basic sampling methods such as RRT and RRT* in forest-like maps. Our approach outperforms heuristic-guided methods like BIT* and Informed RRT* by 1.5x in terms of runtime while maintaining robust and reliable success rates across 2D to 6D environments. Compared to learning-based methods like NeuralRRT* and MPNetSMP, as well as the heuristic RRT*J, our algorithm demonstrates, on average, 1.5x better performance in the same environments. We also evaluate the effectiveness of our approach through simulations of the UR10e arm manipulator in the MuJoCo environment. A key observation of our approach lies in its use of zone partitioning and Reinforcement Learning (RL) for adaptive high-level planning allowing the algorithm to accommodate flexible policies across diverse environments, making it a versatile tool for advanced path planning.
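The kd-tree zoning step mentioned above can be sketched roughly as follows: the map's bounding box is split recursively at the median of sample points along alternating axes, and each leaf becomes a zone. This is a minimal sketch under assumptions; the names kd_zones and zone_of, the choice of median splits, and the sample points are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)


def kd_zones(points: Sequence[Point], lo: Point, hi: Point,
             depth: int, axis: int = 0) -> List[Box]:
    """Split the box [lo, hi] at the median point along alternating axes."""
    if depth == 0 or len(points) < 2:
        return [(lo, hi)]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    split = pts[mid][axis]
    left_hi = hi[:axis] + (split,) + hi[axis + 1:]
    right_lo = lo[:axis] + (split,) + lo[axis + 1:]
    nxt = (axis + 1) % len(lo)
    return (kd_zones(pts[:mid], lo, left_hi, depth - 1, nxt)
            + kd_zones(pts[mid:], right_lo, hi, depth - 1, nxt))


def zone_of(p: Point, zones: Sequence[Box]) -> int:
    """Index of the first zone whose box contains p."""
    for i, (lo, hi) in enumerate(zones):
        if all(l <= x <= h for l, x, h in zip(lo, p, hi)):
            return i
    raise ValueError("point lies outside the map")


samples = [(1.0, 1.0), (2.0, 5.0), (6.0, 2.0), (7.0, 7.0)]
zones = kd_zones(samples, lo=(0.0, 0.0), hi=(8.0, 8.0), depth=2)
```

A high-level learner such as Q-learning could then treat zone indices as its states and "move to adjacent zone" as its actions, leaving the low-level path inside each zone to a sampling-based planner.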

VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning

Authors:Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B. Tenenbaum, Tom Silver, João F. Henriques, Kevin Ellis
Date:2024-10-30 16:11:05

Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the strengths of symbolic and neural knowledge representations. We outline an online algorithm for inventing such predicates and learning abstract world models. We compare our approach to hierarchical reinforcement learning, vision-language model planning, and symbolic predicate invention approaches, on both in- and out-of-distribution tasks across five simulated robotic domains. Results show that our approach offers better sample complexity, stronger out-of-distribution generalization, and improved interpretability.

SoftCTRL: Soft conservative KL-control of Transformer Reinforcement Learning for Autonomous Driving

Authors:Minh Tri Huynh, Duc Dung Nguyen
Date:2024-10-30 07:18:00

In recent years, motion planning for urban self-driving vehicles (SDVs) has become a popular problem due to the complex interaction of road components. To tackle this, many methods have relied on large-scale, human-sampled data processed through imitation learning (IL). Although effective, IL alone cannot adequately handle safety and reliability concerns. Combining IL with reinforcement learning (RL) by adding the KL divergence between the RL and IL policies to the RL loss can alleviate IL's weakness but suffers from over-conservation caused by the covariate shift of IL. To address this limitation, we introduce a method that combines IL with RL using an implicit entropy-KL control that offers a simple way to reduce the over-conservation characteristic. In particular, we validate our method on different challenging simulated urban scenarios from the unseen dataset, indicating that although IL can perform well in imitation tasks, our proposed method significantly improves robustness (over 17% reduction in failures) and generates human-like driving behavior.
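The baseline described above, an RL loss with an added KL term that keeps the RL policy near the IL policy, can be sketched for discrete action distributions as below. This illustrates only that baseline, not the paper's implicit entropy-KL control; the direction KL(pi_RL || pi_IL), the coefficient name beta, and the function names are assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_regularized_loss(rl_loss: float, pi_rl: Sequence[float],
                        pi_il: Sequence[float], beta: float) -> float:
    """RL loss plus a penalty for drifting away from the imitation policy."""
    return rl_loss + beta * kl_divergence(pi_rl, pi_il)


# Same policies: no penalty is added.
loss_same = kl_regularized_loss(1.0, [0.5, 0.5], [0.5, 0.5], beta=0.1)
# A deterministic RL policy against a uniform IL policy costs beta * log(2).
loss_far = kl_regularized_loss(1.0, [1.0, 0.0], [0.5, 0.5], beta=0.1)
```

A large beta pins the RL policy to the IL policy, which is one way to read the over-conservation issue the abstract describes.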

Energy-Aware Multi-Agent Reinforcement Learning for Collaborative Execution in Mission-Oriented Drone Networks

Authors:Ying Li, Changling Li, Jiyao Chen, Christine Roinou
Date:2024-10-29 22:43:26

Mission-oriented drone networks have been widely used for structural inspection, disaster monitoring, border surveillance, etc. Due to the limited battery capacity of drones, mission execution strategy impacts network performance and mission completion. However, collaborative execution is a challenging problem for drones in such a dynamic environment as it also involves efficient trajectory design. We leverage multi-agent reinforcement learning (MARL) to manage the challenge in this study, letting each drone learn to collaboratively execute tasks and plan trajectories based on its current status and environment. Simulation results show that the proposed collaborative execution model can successfully complete the mission at least 80% of the time, regardless of task locations and lengths, and can even achieve a 100% success rate when the task density is not too sparse. To the best of our knowledge, our work is one of the pioneering studies on leveraging MARL for collaborative execution in mission-oriented drone networks; the unique value of this work lies in drone battery level driving our model design.

Predicting Future Actions of Reinforcement Learning Agents

Authors:Stephen Chung, Scott Niekum, David Krueger
Date:2024-10-29 18:48:18

As reinforcement learning agents become increasingly deployed in real-world scenarios, predicting future agent actions and events during deployment is important for facilitating better human-agent interaction and preventing catastrophic outcomes. This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non-planning. We employ two approaches: the inner state approach, which involves predicting based on the inner computations of the agents (e.g., plans or neuron activations), and a simulation-based approach, which involves unrolling the agent in a learned world model. Our results show that the plans of explicitly planning agents are significantly more informative for prediction than the neuron activations of the other types. Furthermore, using internal plans proves more robust to model quality compared to simulation-based approaches when predicting actions, while the results for event prediction are more mixed. These findings highlight the benefits of leveraging inner states and simulations to predict future agent actions and events, thereby improving interaction and safety in real-world deployments.

AutoGLM: Autonomous Foundation Agents for GUIs

Authors:Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang
Date:2024-10-28 17:05:10

We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs). While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence. This limitation underscores the importance of developing foundation agents capable of learning through autonomous environmental interactions by reinforcing existing models. Focusing on Web Browser and Phone as representative GUI scenarios, we have developed AutoGLM as a practical foundation agent system for real-world GUI interactions. Our approach integrates a comprehensive suite of techniques and infrastructures to create deployable agent systems suitable for user delivery. Through this development, we have derived two key insights: First, the design of an appropriate "intermediate interface" for GUI control is crucial, enabling the separation of planning and grounding behaviors, which require distinct optimization for flexibility and accuracy respectively. Second, we have developed a novel progressive training framework that enables self-evolving online curriculum reinforcement learning for AutoGLM. Our evaluations demonstrate AutoGLM's effectiveness across multiple domains. For web browsing, AutoGLM achieves a 55.2% success rate on VAB-WebArena-Lite (improving to 59.1% with a second attempt) and 96.2% on OpenTable evaluation tasks. In Android device control, AutoGLM attains a 36.2% success rate on AndroidLab (VAB-Mobile) and 89.7% on common tasks in popular Chinese apps.

Combining Deep Reinforcement Learning with a Jerk-Bounded Trajectory Generator for Kinematically Constrained Motion Planning

Authors:Seyed Adel Alizadeh Kolagar, Mehdi Heydari Shahna, Jouni Mattila
Date:2024-10-28 10:36:32

Deep reinforcement learning (DRL) is emerging as a promising method for adaptive robotic motion and complex task automation, effectively addressing the limitations of traditional control methods. However, ensuring safety throughout both the learning process and policy deployment remains a key challenge due to the risky exploration inherent in DRL, as well as the discrete nature of actions taken at intervals. These discontinuities, despite being part of a continuous action space, can lead to abrupt changes between successive actions, causing instability and unsafe intermediate states. To address these challenges, this paper proposes an integrated framework that combines DRL with a jerk-bounded trajectory generator (JBTG) and a robust low-level control strategy, significantly enhancing the safety, stability, and reliability of robotic manipulators. The low-level controller ensures the precise execution of DRL-generated commands, while the JBTG refines these motions to produce smooth, continuous trajectories that prevent abrupt or unsafe actions. The framework also includes pre-calculated safe velocity zones for smooth braking, preventing joint limit violations and ensuring compliance with kinematic constraints. This approach not only guarantees the robustness and safety of the robotic system but also optimizes motion control, making it suitable for practical applications. The effectiveness of the proposed framework is demonstrated through its application to a highly complex heavy-duty manipulator.

Enhancing Battery Storage Energy Arbitrage with Deep Reinforcement Learning and Time-Series Forecasting

Authors:Manuel Sage, Joshua Campbell, Yaoyao Fiona Zhao
Date:2024-10-25 23:18:43

Energy arbitrage is one of the most profitable sources of income for battery operators, generating revenues by buying and selling electricity at different prices. Forecasting these revenues is challenging due to the inherent uncertainty of electricity prices. Deep reinforcement learning (DRL) emerged in recent years as a promising tool, able to cope with uncertainty by training on large quantities of historical data. However, without access to future electricity prices, DRL agents can only react to the currently observed price and not learn to plan battery dispatch. Therefore, in this study, we combine DRL with time-series forecasting methods from deep learning to enhance the performance on energy arbitrage. We conduct a case study using price data from Alberta, Canada, which is characterized by irregular price spikes and is highly non-stationary. This data is challenging to forecast even when state-of-the-art deep learning models consisting of convolutional layers, recurrent layers, and attention modules are deployed. Our results show that energy arbitrage with DRL-enabled battery control still significantly benefits from these imperfect predictions, but only if predictors for several horizons are combined. Grouping multiple predictions for the next 24-hour window, accumulated rewards increased by 60% for deep Q-networks (DQN) compared to the experiments without forecasts. We hypothesize that multiple predictors, despite their imperfections, convey useful information regarding the future development of electricity prices through a "majority vote" principle, enabling the DRL agent to learn more profitable control policies.
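One plausible way to combine predictors for several horizons is to append each forecast to the agent's observation, as in the sketch below. The observation layout, the name build_observation, and the two naive stand-in forecasters are hypothetical; the abstract does not describe the actual state representation or the forecasting models' interfaces.

```python
from typing import Callable, Dict, List, Sequence

# A forecaster maps the price history to one predicted future price (assumed interface).
Forecaster = Callable[[Sequence[float]], float]


def build_observation(
    price_history: Sequence[float],
    state_of_charge: float,
    forecasters: Dict[int, Forecaster],
) -> List[float]:
    """Current price, battery state of charge, then one forecast per horizon in hours."""
    obs = [price_history[-1], state_of_charge]
    for horizon in sorted(forecasters):
        obs.append(forecasters[horizon](price_history))
    return obs


# Naive placeholders standing in for trained deep learning predictors.
toy_forecasters: Dict[int, Forecaster] = {
    1: lambda h: h[-1],             # persistence: the next hour equals the last price
    24: lambda h: sum(h) / len(h),  # historical mean as a day-ahead guess
}
observation = build_observation([30.0, 50.0, 40.0], 0.5, toy_forecasters)
```

The resulting vector would be fed to the DQN in place of the forecast-free observation.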

An Enhanced Hierarchical Planning Framework for Multi-Robot Autonomous Exploration

Authors:Gengyuan Cai, Luosong Guo, Xiangmao Chang
Date:2024-10-25 08:20:06

The autonomous exploration of environments by multi-robot systems is a critical task with broad applications in rescue missions, exploration endeavors, and beyond. Current approaches often rely on either greedy frontier selection or end-to-end deep reinforcement learning (DRL) methods, yet these methods are frequently hampered by limitations such as short-sightedness, overlooking long-term implications, and convergence difficulties stemming from the intricate high-dimensional learning space. To address these challenges, this paper introduces an innovative integration strategy that combines the low-dimensional action space efficiency of frontier-based methods with the far-sightedness and optimality of DRL-based approaches. We propose a three-tiered planning framework that first identifies frontiers in free space, creating a sparse map representation that lightens data transmission burdens and reduces the DRL action space's dimensionality. Subsequently, we develop a multi-graph neural network (mGNN) that incorporates states of potential targets and robots, leveraging policy-based reinforcement learning to compute affinities, thereby superseding traditional heuristic utility values. Lastly, we implement local routing planning through subsequence search, which avoids exhaustive sequence traversal. Extensive validation across diverse scenarios and comprehensive simulation results demonstrate the effectiveness of our proposed method. Compared to baseline approaches, our framework achieves environmental exploration with fewer time steps and a notable reduction of over 30% in data transmission, showcasing its superiority in terms of efficiency and performance.

Multi-UAV Behavior-based Formation with Static and Dynamic Obstacles Avoidance via Reinforcement Learning

Authors:Yuqing Xie, Chao Yu, Hongzhi Zang, Feng Gao, Wenhao Tang, Jingyi Huang, Jiayu Chen, Botian Xu, Yi Wu, Yu Wang
Date:2024-10-24 07:31:59

Formation control of multiple Unmanned Aerial Vehicles (UAVs) is vital for practical applications. This paper tackles the task of behavior-based UAV formation while avoiding static and dynamic obstacles during directed flight. We present a two-stage reinforcement learning (RL) training pipeline to tackle the challenges of multi-objective optimization, large exploration spaces, and the sim-to-real gap. The first stage searches in a simplified scenario for a linear utility function that balances all task objectives simultaneously, whereas the second stage applies the utility function in complex scenarios, utilizing curriculum learning to navigate large exploration spaces. Additionally, we apply an attention-based observation encoder to enhance formation maintenance and manage varying obstacle quantity. Experiments in simulation and the real world demonstrate that our method outperforms planning-based and RL-based baselines regarding collision-free rate and formation maintenance in scenarios with static, dynamic, and mixed obstacles.

SPIRE: Synergistic Planning, Imitation, and Reinforcement Learning for Long-Horizon Manipulation

Authors:Zihan Zhou, Animesh Garg, Dieter Fox, Caelan Garrett, Ajay Mandlekar
Date:2024-10-23 17:42:07

Robot learning has proven to be a general and effective technique for programming manipulators. Imitation learning is able to teach robots solely from human demonstrations but is bottlenecked by the capabilities of the demonstrations. Reinforcement learning uses exploration to discover better behaviors; however, the space of possible improvements can be too large to start from scratch. And for both techniques, the learning difficulty increases in proportion to the length of the manipulation task. Accounting for this, we propose SPIRE, a system that first uses Task and Motion Planning (TAMP) to decompose tasks into smaller learning subproblems and second combines imitation and reinforcement learning to maximize their strengths. We develop novel strategies to train learning agents when deployed in the context of a planning system. We evaluate SPIRE on a suite of long-horizon and contact-rich robot manipulation problems. We find that SPIRE outperforms prior approaches that integrate imitation learning, reinforcement learning, and planning by 35% to 50% in average task performance, is 6 times more data efficient in the number of human demonstrations needed to train proficient agents, and learns to complete tasks nearly twice as efficiently. View https://sites.google.com/view/spire-corl-2024 for more details.

Learning Versatile Skills with Curriculum Masking

Authors:Yao Tang, Zhihui Xie, Zichuan Lin, Deheng Ye, Shuai Li
Date:2024-10-23 10:17:13

Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme.

DyPNIPP: Predicting Environment Dynamics for RL-based Robust Informative Path Planning

Authors:Srujan Deolasee, Siva Kailas, Wenhao Luo, Katia Sycara, Woojun Kim
Date:2024-10-22 17:07:26

Informative path planning (IPP) is an important planning paradigm for various real-world robotic applications such as environment monitoring. IPP involves planning a path that can learn an accurate belief of the quantity of interest, while adhering to planning constraints. Traditional IPP methods typically require high computation time during execution, giving rise to reinforcement learning (RL) based IPP methods. However, the existing RL-based methods do not consider spatio-temporal environments which involve their own challenges due to variations in environment characteristics. In this paper, we propose DyPNIPP, a robust RL-based IPP framework, designed to operate effectively across spatio-temporal environments with varying dynamics. To achieve this, DyPNIPP incorporates domain randomization to train the agent across diverse environments and introduces a dynamics prediction model to capture and adapt the agent actions to specific environment dynamics. Our extensive experiments in a wildfire environment demonstrate that DyPNIPP outperforms existing RL-based IPP algorithms by significantly improving robustness and performing across diverse environment conditions.

QuasiNav: Asymmetric Cost-Aware Navigation Planning with Constrained Quasimetric Reinforcement Learning

Authors:Jumman Hossain, Abu-Zaher Faridee, Derrik Asher, Jade Freeman, Theron Trout, Timothy Gregory, Nirmalya Roy
Date:2024-10-22 03:39:21

Autonomous navigation in unstructured outdoor environments is inherently challenging due to the presence of asymmetric traversal costs, such as varying energy expenditures for uphill versus downhill movement. Traditional reinforcement learning methods often assume symmetric costs, which can lead to suboptimal navigation paths and increased safety risks in real-world scenarios. In this paper, we introduce QuasiNav, a novel reinforcement learning framework that integrates quasimetric embeddings to explicitly model asymmetric costs and guide efficient, safe navigation. QuasiNav formulates the navigation problem as a constrained Markov decision process (CMDP) and employs quasimetric embeddings to capture directionally dependent costs, allowing for a more accurate representation of the terrain. This approach is combined with adaptive constraint tightening within a constrained policy optimization framework to dynamically enforce safety constraints during learning. We validate QuasiNav across three challenging navigation scenarios (undulating terrains, asymmetric hill traversal, and directionally dependent terrain traversal), demonstrating its effectiveness in both simulated and real-world environments. Experimental results show that QuasiNav significantly outperforms conventional methods, achieving higher success rates, improved energy efficiency, and better adherence to safety constraints.
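A quasimetric behaves like a distance but may differ by direction: d(a, b) need not equal d(b, a), while d(a, a) = 0 and the triangle inequality still hold. The hand-written cost below illustrates this with an uphill penalty. It is only an illustration of the concept; the paper learns quasimetric embeddings, and the name traversal_cost and the parameter climb_weight are hypothetical.

```python
import math
from typing import Tuple

Waypoint = Tuple[float, float, float]  # (x, y, elevation)


def traversal_cost(a: Waypoint, b: Waypoint, climb_weight: float = 2.0) -> float:
    """Horizontal distance plus a penalty charged only for elevation gained."""
    horizontal = math.hypot(b[0] - a[0], b[1] - a[1])
    climb = max(0.0, b[2] - a[2])
    return horizontal + climb_weight * climb


valley: Waypoint = (0.0, 0.0, 0.0)
ridge: Waypoint = (3.0, 4.0, 10.0)
uphill = traversal_cost(valley, ridge)    # 5 horizontal + 2 * 10 climb
downhill = traversal_cost(ridge, valley)  # 5 horizontal, no climb penalty
```

A symmetric metric would assign both directions the same cost, which is the simplification the abstract attributes to traditional methods.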

Reinforced Imitative Trajectory Planning for Urban Automated Driving

Authors:Di Zeng, Ling Zheng, Xiantong Yang, Yinong Li
Date:2024-10-21 03:04:29

Reinforcement learning (RL) faces challenges in trajectory planning for urban automated driving due to the poor convergence of RL and the difficulty in designing reward functions. The convergence problem is alleviated by combining RL with supervised learning. However, most existing approaches only reason one step ahead and lack the capability to plan for multiple future steps. Besides, although inverse reinforcement learning holds promise for solving the reward function design issue, existing methods for automated driving impose a linear structure assumption on reward functions, making them difficult to apply to urban automated driving. In light of these challenges, this paper proposes a novel RL-based trajectory planning method that integrates RL with imitation learning to enable multi-step planning. Furthermore, a transformer-based Bayesian reward function is developed, providing effective reward signals for RL in urban scenarios. Moreover, a hybrid-driven trajectory planning framework is proposed to enhance safety and interpretability. The proposed methods were validated on the large-scale real-world urban automated driving nuPlan dataset. The results demonstrated the significant superiority of the proposed methods over the baselines in terms of the closed-loop metrics. The code is available at https://github.com/Zigned/nuplan_zigned.

Action abstractions for amortized sampling

Authors:Oussama Boussif, Léna Néhale Ezzine, Joseph D Viviano, Michał Koziarski, Moksh Jain, Nikolay Malkin, Emmanuel Bengio, Rim Assouel, Yoshua Bengio
Date:2024-10-19 19:22:50

As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which takes many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and 'chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.
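The chunking step described above can be sketched as counting fixed-length action subsequences across high-reward trajectories and promoting the most widespread one to a new action. This is a simplified sketch under assumptions: the fixed chunk length, the once-per-trajectory counting, and the names most_common_chunk and extend_action_space are hypothetical, and the abstract does not specify the authors' actual extraction procedure.

```python
from collections import Counter
from typing import List, Sequence, Tuple, Union

Action = str
Chunk = Tuple[Action, ...]


def most_common_chunk(trajectories: Sequence[Sequence[Action]], length: int) -> Chunk:
    """Subsequence of the given length that appears in the most trajectories."""
    counts: Counter = Counter()
    for traj in trajectories:
        # Use a set so each chunk is counted at most once per trajectory.
        seen = {tuple(traj[i:i + length]) for i in range(len(traj) - length + 1)}
        counts.update(seen)
    if not counts:
        raise ValueError("no trajectory is long enough for this chunk length")
    # Sorting first makes tie-breaking deterministic (lexicographically smallest wins).
    return max(sorted(counts), key=lambda c: counts[c])


def extend_action_space(actions: Sequence[Union[Action, Chunk]],
                        chunk: Chunk) -> List[Union[Action, Chunk]]:
    """Return a new action list with the chunk appended as one high-level action."""
    return list(actions) + [chunk]


high_reward = [["L", "U", "U", "R"], ["U", "U", "R", "D"], ["D", "U", "U"]]
chunk = most_common_chunk(high_reward, length=2)
actions = extend_action_space(["L", "R", "U", "D"], chunk)
```

Repeating this after each round of policy training would grow the action space iteratively, as the abstract describes.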

GUIDEd Agents: Enhancing Navigation Policies through Task-Specific Uncertainty Abstraction in Localization-Limited Environments

Authors:Gokul Puthumanaillam, Paulo Padrao, Jose Fuentes, Leonardo Bobadilla, Melkior Ornik
Date:2024-10-19 18:46:17

Autonomous vehicles performing navigation tasks in complex environments face significant challenges due to uncertainty in state estimation. In many scenarios, such as stealth operations or resource-constrained settings, accessing high-precision localization comes at a significant cost, forcing robots to rely primarily on less precise state estimates. Our key observation is that different tasks require varying levels of precision in different regions: a robot navigating a crowded space might need precise localization near obstacles but can operate effectively with less precision elsewhere. In this paper, we present a planning method for integrating task-specific uncertainty requirements directly into navigation policies. We introduce Task-Specific Uncertainty Maps (TSUMs), which abstract the acceptable levels of state estimation uncertainty across different regions. TSUMs align task requirements and environmental features using a shared representation space, generated via a domain-adapted encoder. Using TSUMs, we propose Generalized Uncertainty Integration for Decision-Making and Execution (GUIDE), a policy conditioning framework that incorporates these uncertainty requirements into robot decision-making. We find that TSUMs provide an effective way to abstract task-specific uncertainty requirements, and conditioning policies on TSUMs enables the robot to reason about the context-dependent value of certainty and adapt its behavior accordingly. We show how integrating GUIDE into reinforcement learning frameworks allows the agent to learn navigation policies that effectively balance task completion and uncertainty management without explicit reward engineering. We evaluate GUIDE on various real-world robotic navigation tasks and find that it demonstrates significant improvement in task completion rates compared to baseline methods that do not explicitly consider task-specific uncertainty.

MARLIN: Multi-Agent Reinforcement Learning Guided by Language-Based Inter-Robot Negotiation

Authors:Toby Godfrey, William Hunt, Mohammad D. Soorati
Date:2024-10-18 11:20:00

Multi-agent reinforcement learning is a key method for training multi-robot systems over a series of episodes in which robots are rewarded or punished according to their performance; only once the system is trained to a suitable standard is it deployed in the real world. If the system is not trained enough, the task will likely not be completed and could pose a risk to the surrounding environment. Therefore, reaching high performance in a shorter training period can lead to significant reductions in time and resource consumption. We introduce Multi-Agent Reinforcement Learning guided by Language-based Inter-Robot Negotiation (MARLIN), which makes the training process both faster and more transparent. We equip robots with large language models that negotiate and debate the task, producing a plan that is used to guide the policy during training. We dynamically switch between using reinforcement learning and the negotiation-based approach throughout training. This offers an increase in training speed when compared to standard multi-agent reinforcement learning and allows the system to be deployed to physical hardware earlier. As robots negotiate in natural language, we can better understand the behaviour of the robots individually and as a collective. We compare the performance of our approach to multi-agent reinforcement learning and a large language model to show that our hybrid method trains faster at little cost to performance.

Towards Effective Planning Strategies for Dynamic Opinion Networks

Authors:Bharath Muppasani, Protik Nag, Vignesh Narayanan, Biplav Srivastava, Michael N. Huhns
Date:2024-10-18 00:13:56

In this study, we investigate the under-explored intervention planning aimed at disseminating accurate information within dynamic opinion networks by leveraging learning strategies. Intervention planning involves identifying key nodes (search) and exerting control (e.g., disseminating accurate or official information through the nodes) to mitigate the influence of misinformation. However, as the network size increases, the problem becomes computationally intractable. To address this, we first introduce a ranking algorithm to identify key nodes for disseminating accurate information, which facilitates the training of neural network classifiers that provide generalized solutions for the search and planning problems. Second, we mitigate the complexity of label generation, which becomes challenging as the network grows, by developing a reinforcement learning-based centralized dynamic planning framework. We analyze these NN-based planners for opinion networks governed by two dynamic propagation models. Each model incorporates both binary and continuous opinion and trust representations. Our experimental results demonstrate that the ranking algorithm-based classifiers provide plans that enhance infection rate control, especially with increased action budgets for small networks. Further, we observe that the reward strategies focusing on key metrics, such as the number of susceptible nodes and infection rates, outperform those prioritizing faster blocking strategies. Additionally, our findings reveal that graph convolutional network-based planners facilitate scalable centralized plans that achieve lower infection rates (higher control) across various network configurations, including Watts-Strogatz topology, varying action budgets, varying initial infected nodes, and varying degrees of infected nodes.

Reward-free World Models for Online Imitation Learning

Authors:Shangzhe Li, Zhiao Huang, Hao Su
Date:2024-10-17 23:13:32

Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.

Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning

Authors:Yoav Alon, Cristina David
Date:2024-10-17 12:47:31

Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM's space exploration: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics, which were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, denoting the binary classification, and the intermediate reasoning steps. Our approach compares positively against CoT and ToT.

Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

Authors:Jiayu Chen, Wentse Chen, Jeff Schneider
Date:2024-10-15 03:36:43

Offline reinforcement learning (RL) is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving the data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, there could be various MDPs that behave identically on the offline dataset and so dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further introduce a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation input. The proposed algorithm significantly outperforms state-of-the-art model-based and model-free offline RL methods on twelve D4RL MuJoCo benchmark tasks and three target tracking tasks in a challenging, stochastic tokamak control simulator.

Traversability-Aware Legged Navigation by Learning from Real-World Visual Data

Authors:Hongbo Zhang, Zhongyu Li, Xuanqi Zeng, Laura Smith, Kyle Stachowicz, Dhruv Shah, Linzhu Yue, Zhitao Song, Weipeng Xia, Sergey Levine, Koushil Sreenath, Yun-hui Liu
Date:2024-10-14 15:25:55

The enhanced mobility brought by legged locomotion empowers quadrupedal robots to navigate through complex and unstructured environments. However, optimizing agile locomotion while accounting for the varying energy costs of traversing different terrains remains an open challenge. Most previous work focuses on planning trajectories with traversability cost estimation based on human-labeled environmental features. However, this human-centric approach is insufficient because it does not account for the varying capabilities of the robot locomotion controllers over challenging terrains. To address this, we develop a novel traversability estimator in a robot-centric manner, based on the value function of the robot's locomotion controller. This estimator is integrated into a new learning-based RGBD navigation framework. The framework employs multiple training stages to develop a planner that guides the robot in avoiding obstacles and hard-to-traverse terrains while reaching its goals. The training of the navigation planner is directly performed in the real world using a sample efficient reinforcement learning method that utilizes both online data and offline datasets. Through extensive benchmarking, we demonstrate that the proposed framework achieves the best performance in accurate traversability cost estimation and efficient learning from multi-modal data (including the robot's color and depth vision, as well as proprioceptive feedback) for real-world training. Using the proposed method, a quadrupedal robot learns to perform traversability-aware navigation through trial and error in various real-world environments with challenging terrains that are difficult to classify using depth vision alone. Moreover, the robot demonstrates the ability to generalize the learned navigation skills to unseen scenarios. Video can be found at https://youtu.be/RSqnIWZ1qks.

Generalization of Compositional Tasks with Logical Specification via Implicit Planning

Authors:Duo Xu, Faramarz Fekri
Date:2024-10-13 00:57:10

In this study, we address the challenge of learning generalizable policies for compositional tasks defined by logical specifications. These tasks consist of multiple temporally extended sub-tasks. Due to the sub-task inter-dependencies and the sparse-reward issue in long-horizon tasks, existing reinforcement learning (RL) approaches, such as task-conditioned and goal-conditioned policies, continue to struggle with slow convergence and sub-optimal performance in generalizing to compositional tasks. To overcome these limitations, we introduce a new hierarchical RL framework that enhances the efficiency and optimality of task generalization. At the high level, we present an implicit planner specifically designed for generalizing compositional tasks. This planner selects the next sub-task and estimates the multi-step return for completing the remaining task from the current state. It learns a latent transition model and performs planning in the latent space by using a graph neural network (GNN). Subsequently, the high-level planner's selected sub-task guides the low-level agent to effectively handle long-horizon tasks, while the multi-step return encourages the low-level policy to account for future sub-task dependencies, enhancing its optimality. We conduct comprehensive experiments to demonstrate the framework's advantages over previous methods in terms of both efficiency and optimality.

SAPIENT: Mastering Multi-turn Conversational Recommendation with Strategic Planning and Monte Carlo Tree Search

Authors:Hanwen Du, Bo Peng, Xia Ning
Date:2024-10-12 16:21:33

Conversational Recommender Systems (CRS) proactively engage users in interactive dialogues to elicit user preferences and provide personalized recommendations. Existing methods train Reinforcement Learning (RL)-based agents with greedy action selection or sampling strategies, and may suffer from suboptimal conversational planning. To address this, we present SAPIENT, a novel Monte Carlo Tree Search (MCTS)-based CRS framework. SAPIENT consists of a conversational agent (S-agent) and a conversational planner (S-planner). S-planner builds a conversational search tree with MCTS based on the initial actions proposed by S-agent to find conversation plans. The best conversation plans from S-planner are used to guide the training of S-agent, creating a self-training loop where S-agent can iteratively improve its capability for conversational planning. Furthermore, we propose an efficient variant, SAPIENT-e, that trades off training efficiency against performance. Extensive experiments on four benchmark datasets validate the effectiveness of our approach, showing that SAPIENT outperforms the state-of-the-art baselines.
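
To make the tree-search step more concrete, the following is a minimal sketch of the child-selection rule used in generic UCT-style MCTS. It is not SAPIENT's actual planner: the names `uct_score` and `select_child`, the exploration constant `c`, and the representation of children as (total value, visit count) pairs are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def uct_score(total_value: float, visits: int, parent_visits: int,
              c: float = 1.4) -> float:
    """UCB1 score: average value plus an exploration bonus (c is assumed)."""
    if visits == 0:
        # Unvisited children are always tried first.
        return float("inf")
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Tuple[float, int]], c: float = 1.4) -> int:
    """Return the index of the child with the highest UCT score.

    Each child is a hypothetical (total_value, visits) pair, e.g. one
    candidate conversational action and its accumulated statistics.
    """
    parent_visits = sum(visits for _, visits in children)
    scores = [uct_score(value, visits, parent_visits, c)
              for value, visits in children]
    return scores.index(max(scores))


# Toy statistics for three candidate actions at one tree node.
stats = [(5.0, 10), (1.0, 1), (0.0, 0)]
next_child = select_child(stats)
```

In this toy node the third action has never been visited, so it is selected before the other two are compared on their averages.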

HG2P: Hippocampus-inspired High-reward Graph and Model-Free Q-Gradient Penalty for Path Planning and Motion Control

Authors:Haoran Wang, Yaoru Sun, Zeshen Tang
Date:2024-10-12 11:46:31

Goal-conditioned hierarchical reinforcement learning (HRL) decomposes complex reaching tasks into a sequence of simple subgoal-conditioned tasks, showing significant promise for addressing long-horizon planning in large-scale environments. This paper bridges the goal-conditioned HRL based on graph-based planning to brain mechanisms, proposing a hippocampus-striatum-like dual-controller hypothesis. Inspired by the brain mechanisms of organisms (i.e., the high-reward preferences observed in hippocampal replay) and instance-based theory, we propose a high-return sampling strategy for constructing memory graphs, improving sample efficiency. Additionally, we derive a model-free lower-level Q-function gradient penalty to resolve the model dependency issues present in prior work, improving the generalization of Lipschitz constraints in applications. Finally, we integrate these two extensions, High-reward Graph and model-free Gradient Penalty (HG2P), into the state-of-the-art framework ACLG, proposing a novel goal-conditioned HRL framework, HG2P+ACLG. Experimentally, the results demonstrate that our method outperforms state-of-the-art goal-conditioned HRL algorithms on a variety of long-horizon navigation tasks and robotic manipulation tasks.

ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Authors:Yarden As, Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Stelian Coros, Andreas Krause
Date:2024-10-12 10:46:02

Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe learns a well-calibrated probabilistic model of the system and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics, while enforcing pessimism w.r.t. the safety constraints. Under regularity assumptions on the constraints and dynamics, we show that ActSafe guarantees safety during learning while also obtaining a near-optimal policy in finite time. In addition, we propose a practical variant of ActSafe that builds on latest model-based RL advancements and enables safe exploration even in high-dimensional settings such as visual control. We empirically show that ActSafe obtains state-of-the-art performance in difficult exploration tasks on standard safe deep RL benchmarks while ensuring safety during learning.
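
As a rough illustration of being optimistic about reward while staying pessimistic about safety, the sketch below chooses among a few candidate actions using confidence bounds. The function `select_action`, the `Estimate` record, the scaling factor `beta`, and the toy numbers are assumptions for illustration and are not taken from the paper, which plans over a learned dynamics model rather than fixed per-action estimates.

```python
from typing import NamedTuple, Optional, Sequence


class Estimate(NamedTuple):
    """Hypothetical per-action model output: means and epistemic std devs."""
    reward_mean: float
    reward_std: float
    cost_mean: float
    cost_std: float


def select_action(estimates: Sequence[Estimate], cost_budget: float,
                  beta: float = 1.0) -> Optional[int]:
    """Return the best action index, or None if no action passes the safety check."""
    best_index = None
    best_value = float("-inf")
    for index, est in enumerate(estimates):
        # Pessimism for safety: assume the cost sits at its upper bound.
        pessimistic_cost = est.cost_mean + beta * est.cost_std
        if pessimistic_cost > cost_budget:
            continue
        # Optimism for exploration: assume the reward sits at its upper bound.
        optimistic_reward = est.reward_mean + beta * est.reward_std
        if optimistic_reward > best_value:
            best_value = optimistic_reward
            best_index = index
    return best_index


candidates = [
    Estimate(reward_mean=1.0, reward_std=0.1, cost_mean=0.2, cost_std=0.1),
    Estimate(reward_mean=0.8, reward_std=0.6, cost_mean=0.3, cost_std=0.1),
    Estimate(reward_mean=2.0, reward_std=0.5, cost_mean=0.6, cost_std=0.5),
]
chosen = select_action(candidates, cost_budget=0.5)
```

Here the third action has the highest reward estimate but is rejected because its pessimistic cost exceeds the budget; among the remaining two, the more uncertain one wins because of the optimism bonus.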

Hierarchical Universal Value Function Approximators

Authors:Rushiv Arora
Date:2024-10-11 17:09:26

There have been key advancements in building universal approximators for multi-goal collections of reinforcement learning value functions -- key elements in estimating long-term returns of states in a parameterized manner. We extend this to hierarchical reinforcement learning, using the options framework, by introducing hierarchical universal value function approximators (H-UVFAs). This allows us to leverage the added benefits of scaling, planning, and generalization expected in temporal abstraction settings. We develop supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions: $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$. Finally, we demonstrate generalization of the H-UVFAs and show they outperform corresponding UVFAs.
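
One simple way to picture value functions built from separate embeddings is a multilinear product, sketched below. This combination rule is an assumption chosen for illustration and is not claimed to be the paper's architecture; `multilinear_value` and the toy embedding vectors are hypothetical.

```python
from typing import Sequence


def multilinear_value(*embeddings: Sequence[float]) -> float:
    """Sum over dimensions of the elementwise product of all embeddings."""
    dims = {len(e) for e in embeddings}
    if len(dims) != 1:
        raise ValueError("all embeddings must share one dimension")
    total = 0.0
    for components in zip(*embeddings):
        product = 1.0
        for component in components:
            product *= component
        total += product
    return total


# Toy two-dimensional embeddings (made-up numbers).
state = [1.0, 2.0]
goal = [0.5, 1.0]
option = [2.0, 1.0]
action = [1.0, 3.0]

# Upper-level value over options, in the role of Q(s, g, o).
q_option = multilinear_value(state, goal, option)
# Lower-level value over primitive actions, in the role of Q(s, g, o, a).
q_action = multilinear_value(state, goal, option, action)
```

Because each factor has its own embedding, a value for an unseen combination of state, goal, and option can be read off from embeddings learned on other combinations, which is the kind of generalization the abstract refers to.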

Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

Authors:Devdhar Patel, Hava Siegelmann
Date:2024-10-11 16:54:07

Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a "temporal recall" mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning.

Optimizing Vital Sign Monitoring in Resource-Constrained Maternal Care: An RL-Based Restless Bandit Approach

Authors:Niclas Boehmer, Yunfan Zhao, Guojun Xiong, Paula Rodriguez-Diaz, Paola Del Cueto Cibrian, Joseph Ngonzi, Adeline Boatin, Milind Tambe
Date:2024-10-10 21:20:07

Maternal mortality remains a significant global public health challenge. One promising approach to reducing maternal deaths occurring during facility-based childbirth is through early warning systems, which require the consistent monitoring of mothers' vital signs after giving birth. Wireless vital sign monitoring devices offer a labor-efficient solution for continuous monitoring, but their scarcity raises the critical question of how to allocate them most effectively. We devise an allocation algorithm for this problem by modeling it as a variant of the popular Restless Multi-Armed Bandit (RMAB) paradigm. In doing so, we identify and address novel, previously unstudied constraints unique to this domain, which render previous approaches for RMABs unsuitable and significantly increase the complexity of the learning and planning problem. To overcome these challenges, we adopt the popular Proximal Policy Optimization (PPO) algorithm from reinforcement learning to learn an allocation policy by training a policy and value function network. We demonstrate in simulations that our approach outperforms the best heuristic baseline by up to a factor of $4$.

Offline Hierarchical Reinforcement Learning via Inverse Optimization

Authors:Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues
Date:2024-10-10 14:00:21

Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed.

Variations in Multi-Agent Actor-Critic Frameworks for Joint Optimizations in UAV Swarm Networks: Recent Evolution, Challenges, and Directions

Authors:Muhammad Morshed Alam, Muhammad Yeasir Aarafat, Tamim Hossain
Date:2024-10-09 07:22:40

Autonomous unmanned aerial vehicle (UAV) swarm networks (UAVSNs) can effectively execute surveillance, connectivity, and computing services to ground users (GUs). These missions require trajectory planning, UAV-GUs association, task offloading, next-hop selection, and the allocation of resources such as transmit power, bandwidth, caching, and computing to improve network performance. Owing to the highly dynamic topology, limited resources, and non-availability of global knowledge, optimizing network performance in UAVSNs is very intricate. Hence, it requires an adaptive joint optimization framework that can tackle both discrete and continuous decision variables to ensure optimal network performance under dynamic constraints. A multi-agent deep reinforcement learning-based adaptive actor-critic framework can efficiently address these problems. This paper investigates the recent evolutions of actor-critic frameworks to deal with joint optimization problems in UAVSNs. In addition, challenges and potential solutions are addressed as research directions.

Cooperative and Asynchronous Transformer-based Mission Planning for Heterogeneous Teams of Mobile Robots

Authors:Milad Farjadnasab, Shahin Sirouspour
Date:2024-10-08 21:14:09

Cooperative mission planning for heterogeneous teams of mobile robots presents a unique set of challenges, particularly when operating under communication constraints and limited computational resources. To address these challenges, we propose the Cooperative and Asynchronous Transformer-based Mission Planning (CATMiP) framework, which leverages multi-agent reinforcement learning (MARL) to coordinate distributed decision making among agents with diverse sensing, motion, and actuation capabilities, operating under sporadic ad hoc communication. A Class-based Macro-Action Decentralized Partially Observable Markov Decision Process (CMacDec-POMDP) is also formulated to effectively model asynchronous decision-making for heterogeneous teams of agents. The framework utilizes an asynchronous centralized training and distributed execution scheme that is developed based on the Multi-Agent Transformer (MAT) architecture. This design allows a single trained model to generalize to larger environments and accommodate varying team sizes and compositions. We evaluate CATMiP in a 2D grid-world simulation environment and compare its performance against planning-based exploration methods. Results demonstrate CATMiP's superior efficiency, scalability, and robustness to communication dropouts, highlighting its potential for real-world heterogeneous mobile robot systems. The code is available at https://github.com/mylad13/CATMiP.

Effort Allocation for Deadline-Aware Task and Motion Planning: A Metareasoning Approach

Authors:Yoonchang Sung, Shahaf S. Shperberg, Qi Wang, Peter Stone
Date:2024-10-08 08:56:07

In robot planning, tasks can often be achieved through multiple options, each consisting of several actions. This work specifically addresses deadline constraints in task and motion planning, aiming to find a plan that can be executed within the deadline despite uncertain planning and execution times. We propose an effort allocation problem, formulated as a Markov decision process (MDP), to find such a plan by leveraging metareasoning perspectives to allocate computational resources among the given options. We formally prove the NP-hardness of the problem by reducing it from the knapsack problem. Both a model-based approach, where transition models are learned from past experience, and a model-free approach, which overcomes the unavailability of prior data acquisition through reinforcement learning, are explored. For the model-based approach, we investigate Monte Carlo tree search (MCTS) to approximately solve the proposed MDP and further design heuristic schemes to tackle NP-hardness, leading to the approximate yet efficient algorithm called DP_Rerun. In experiments, DP_Rerun demonstrates promising performance comparable to MCTS while requiring negligible computation time.

AAAI Workshop on AI Planning for Cyber-Physical Systems -- CAIPI24

Authors:Oliver Niggemann, Gautam Biswas, Alexander Diedrich, Jonas Ehrhardt, René Heesch, Niklas Widulle
Date:2024-10-08 05:52:00

The workshop 'AI-based Planning for Cyber-Physical Systems', which took place on February 26, 2024, as part of the 38th Annual AAAI Conference on Artificial Intelligence in Vancouver, Canada, brought together researchers to discuss recent advances in AI planning methods for Cyber-Physical Systems (CPS). CPS pose a major challenge due to their complexity and data-intensive nature, which often exceeds the capabilities of traditional planning algorithms. The workshop highlighted new approaches such as neuro-symbolic architectures, large language models (LLMs), deep reinforcement learning and advances in symbolic planning. These techniques are promising when it comes to managing the complexity of CPS and have potential for real-world applications.

On the Modeling Capabilities of Large Language Models for Sequential Decision Making

Authors:Martin Klissarov, Devon Hjelm, Alexander Toshev, Bogdan Mazoure
Date:2024-10-08 03:12:57

Large pretrained models are showing increasingly better performance in reasoning and planning tasks across different modalities, opening the possibility to leverage them for complex sequential decision making problems. In this paper, we investigate the capabilities of Large Language Models (LLMs) for reinforcement learning (RL) across a diversity of interactive domains. We evaluate their ability to produce decision-making policies, either directly, by generating actions, or indirectly, by first generating reward models to train an agent with RL. Our results show that, even without task-specific fine-tuning, LLMs excel at reward modeling. In particular, crafting rewards through artificial intelligence (AI) feedback yields the most generally applicable approach and can enhance performance by improving credit assignment and exploration. Finally, in environments with unfamiliar dynamics, we explore how fine-tuning LLMs with synthetic data can significantly improve their reward modeling capabilities while mitigating catastrophic forgetting, further broadening their utility in sequential decision-making tasks.

Gen-Drive: Enhancing Diffusion Generative Driving Policies with Reward Modeling and Reinforcement Learning Fine-tuning

Authors:Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, Chen Lv
Date:2024-10-08 00:45:49

Autonomous driving necessitates the ability to reason about future interactions between traffic agents and to make informed evaluations for planning. This paper introduces the Gen-Drive framework, which shifts from the traditional prediction and deterministic planning framework to a generation-then-evaluation planning paradigm. The framework employs a behavior diffusion model as a scene generator to produce diverse possible future scenarios, thereby enhancing the capability for joint interaction reasoning. To facilitate decision-making, we propose a scene evaluator (reward) model, trained with pairwise preference data collected through VLM assistance, thereby reducing human workload and enhancing scalability. Furthermore, we utilize an RL fine-tuning framework to improve the generation quality of the diffusion model, rendering it more effective for planning tasks. We conduct training and closed-loop planning tests on the nuPlan dataset, and the results demonstrate that employing such a generation-then-evaluation strategy outperforms other learning-based approaches. Additionally, the fine-tuned generative driving policy shows significant enhancements in planning performance. We further demonstrate that utilizing our learned reward model for evaluation or RL fine-tuning leads to better planning performance compared to relying on human-designed rewards. Project website: https://mczhi.github.io/GenDrive.
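
Training a reward model from pairwise preferences is commonly done with a Bradley-Terry style loss, sketched below. The abstract does not state which loss Gen-Drive uses, so this choice is an assumption, and `preference_loss` and the example scores are hypothetical.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred scene wins.

    Under the Bradley-Terry model (an assumption here), the probability that
    the preferred scene wins is sigmoid(score_preferred - score_rejected).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the evaluator already ranks the pair correctly...
loss_correct = preference_loss(score_preferred=2.0, score_rejected=0.0)
# ...and large when it ranks the pair the wrong way round.
loss_wrong = preference_loss(score_preferred=0.0, score_rejected=2.0)
```

Minimizing this loss over many labeled pairs pushes the evaluator to assign higher scores to the scenes that were preferred.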

Diffusion Model Predictive Control

Authors:Guangyao Zhou, Sivaramakrishnan Swaminathan, Rajkumar Vasudeva Raju, J. Swaroop Guntupalli, Wolfgang Lehrach, Joseph Ortiz, Antoine Dedieu, Miguel Lázaro-Gredilla, Kevin Murphy
Date:2024-10-07 17:56:47

We propose Diffusion Model Predictive Control (D-MPC), a novel MPC approach that learns a multi-step action proposal and a multi-step dynamics model, both using diffusion models, and combines them for use in online MPC. On the popular D4RL benchmark, we show performance that is significantly better than existing model-based offline planning methods using MPC and competitive with state-of-the-art (SOTA) model-based and model-free reinforcement learning methods. We additionally illustrate D-MPC's ability to optimize novel reward functions at run time and adapt to novel dynamics, and highlight its advantages compared to existing diffusion-based planning baselines.
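
The idea of combining an action proposal with a dynamics model inside an MPC loop can be sketched as sample, roll out, score, and pick the best. In the sketch below, both learned diffusion models are replaced by toy stand-ins (a uniform random proposal and a one-dimensional integrator), and `plan_first_action` and all helper names are hypothetical; it shows the general pattern only and is not the paper's planner.

```python
import random
from typing import Callable, List


def plan_first_action(
    state: float,
    propose: Callable[[float], List[float]],
    rollout: Callable[[float, List[float]], List[float]],
    reward: Callable[[float], float],
    num_candidates: int = 64,
) -> float:
    """Sample action sequences, score predicted states, return the best first action."""
    best_return = float("-inf")
    best_plan: List[float] = []
    for _ in range(num_candidates):
        actions = propose(state)          # stand-in for the action proposal model
        states = rollout(state, actions)  # stand-in for the multi-step dynamics model
        total = sum(reward(s) for s in states)
        if total > best_return:
            best_return = total
            best_plan = actions
    # Execute only the first action, then replan at the next step.
    return best_plan[0]


rng = random.Random(0)


def toy_propose(state: float, horizon: int = 5) -> List[float]:
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def toy_rollout(state: float, actions: List[float]) -> List[float]:
    states = []
    for action in actions:
        state = state + action  # toy integrator dynamics
        states.append(state)
    return states


def toy_reward(state: float) -> float:
    return -abs(state)  # reward is highest at the origin


first_action = plan_first_action(3.0, toy_propose, toy_rollout, toy_reward)
```

Because the reward is passed in as a plain function, swapping in a different objective at run time only changes the scoring step, which is one way to read the abstract's point about optimizing novel rewards.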

Reinforcement Learning Control for Autonomous Hydraulic Material Handling Machines with Underactuated Tools

Authors:Filippo A. Spinelli, Pascal Egli, Julian Nubert, Fang Nan, Thilo Bleumer, Patrick Goegler, Stephan Brockes, Ferdinand Hofmann, Marco Hutter
Date:2024-10-07 14:47:28

The precise and safe control of heavy material handling machines presents numerous challenges due to the hard-to-model hydraulically actuated joints and the need for collision-free trajectory planning with a free-swinging end-effector tool. In this work, we propose an RL-based controller that commands the cabin joint and the arm simultaneously. It is trained in a simulation combining data-driven modeling techniques with first-principles modeling. On the one hand, we employ a neural network model to capture the highly nonlinear dynamics of the upper carriage turn hydraulic motor, incorporating explicit pressure prediction to handle delays better. On the other hand, we model the arm as velocity-controllable and the free-swinging end-effector tool as a damped pendulum using first principles. This combined model enhances our simulation environment, enabling the training of RL controllers that can be directly transferred to the real machine. Designed to reach steady-state Cartesian targets, the RL controller learns to leverage the hydraulic dynamics to improve accuracy, maintain high speeds, and minimize end-effector tool oscillations. Our controller, tested on a mid-size prototype material handler, is more accurate than an inexperienced operator and causes fewer tool oscillations. It demonstrates competitive performance even compared to an experienced professional driver.
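
To illustrate what modeling the free-swinging tool as a damped pendulum can look like, here is a minimal simulation sketch. The equation of motion (angular acceleration equal to -(g/L) sin(angle) minus a viscous damping term), the parameter values, and the function name `simulate_pendulum` are assumptions for illustration; the paper's actual model and coefficients are not given in the abstract.

```python
import math
from typing import List


def simulate_pendulum(
    angle0: float,
    rate0: float = 0.0,
    length: float = 2.0,     # assumed pendulum length in meters
    damping: float = 0.5,    # assumed viscous damping coefficient in 1/s
    dt: float = 0.001,
    duration: float = 10.0,
    gravity: float = 9.81,
) -> List[float]:
    """Integrate a damped pendulum with semi-implicit Euler; return the angles."""
    angle, rate = angle0, rate0
    angles = [angle]
    for _ in range(int(round(duration / dt))):
        accel = -(gravity / length) * math.sin(angle) - damping * rate
        rate += accel * dt   # update angular velocity first
        angle += rate * dt   # then advance the angle with the new velocity
        angles.append(angle)
    return angles


# Release the tool from 0.5 rad at rest and watch the swing die out.
trajectory = simulate_pendulum(angle0=0.5)
late_amplitude = max(abs(a) for a in trajectory[-1000:])
```

A controller trained against a model like this can be penalized on the swing amplitude, which is the quantity the abstract reports as tool oscillations.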

A Universal Formulation for Path-Parametric Planning and Control

Authors:Jon Arrizabalaga, Markus Ryll
Date:2024-10-07 00:26:29

This work presents a unified framework for path-parametric planning and control. This formulation is universal as it standardizes the entire spectrum of path-parametric techniques -- from traditional path following to more recent contouring or progress-maximizing Model Predictive Control and Reinforcement Learning -- under a single framework. The ingredients underlying this universality are twofold: First, we present a compact and efficient technique capable of computing singularity-free, smooth and differentiable moving frames. Second, we derive a spatial path parameterization of the Cartesian coordinates applicable to any arbitrary curve without prior assumptions on its parametric speed or moving frame, and that perfectly interplays with the aforementioned path parameterization method. The combination of these two ingredients leads to a planning and control framework that brings together existing path-parametric techniques in the literature. Aiming to unify all these approaches, we open source PACOR, a software library that implements the presented content, thereby providing a self-contained toolkit for the formulation of path-parametric planning and control methods.

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Authors:Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun
Date:2024-10-06 20:20:22

Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI.

YOLO-MARL: You Only LLM Once for Multi-agent Reinforcement Learning

Authors:Yuan Zhuang, Yi Shen, Zhili Zhang, Yuxiao Chen, Fei Miao
Date:2024-10-05 01:44:11

Advancements in deep multi-agent reinforcement learning (MARL) have positioned it as a promising approach for decision-making in cooperative games. However, it still remains challenging for MARL agents to learn cooperative strategies for some game environments. Recently, large language models (LLMs) have demonstrated emergent reasoning capabilities, making them promising candidates for enhancing coordination among the agents. However, due to the model size of LLMs, it can be expensive to frequently infer LLMs for actions that agents can take. In this work, we propose You Only LLM Once for MARL (YOLO-MARL), a novel framework that leverages the high-level task planning capabilities of LLMs to improve the policy learning process of multi-agents in cooperative games. Notably, for each game environment, YOLO-MARL only requires one time interaction with LLMs in the proposed strategy generation, state interpretation and planning function generation modules, before the MARL policy training process. This avoids the ongoing costs and computational time associated with frequent LLMs API calls during training. Moreover, the trained decentralized normal-sized neural network-based policies operate independently of the LLM. We evaluate our method across three different environments and demonstrate that YOLO-MARL outperforms traditional MARL algorithms.

Deep Reinforcement Learning for Delay-Optimized Task Offloading in Vehicular Fog Computing

Authors:Mohammad Parsa Toopchinezhad, Mahmood Ahmadi
Date:2024-10-04 14:42:33

The imminent rise of autonomous vehicles (AVs) is revolutionizing the future of transport. The Vehicular Fog Computing (VFC) paradigm has emerged to alleviate the load of compute-intensive and delay-sensitive AV programs via task offloading to nearby vehicles. Effective VFC requires an intelligent and dynamic offloading algorithm. As a result, this paper adapts Deep Reinforcement Learning (DRL) for VFC offloading. First, a simulation environment utilizing realistic hardware and task specifications, in addition to a novel vehicular movement model based on grid-planned cities, is created. Afterward, a DRL-based algorithm is trained and tested on the environment with the goal of minimizing global task delay. The DRL model displays impressive results, outperforming other greedy and conventional methods. The findings further demonstrate the effectiveness of the DRL model in minimizing queue congestion, especially when compared to traditional cloud computing methods that struggle to handle the demands of a large fleet of vehicles. This is corroborated by queuing theory, highlighting the self-scalability of the VFC-based DRL approach.

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

Authors:Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit H. Bermano, Michiel van de Panne
Date:2024-10-04 13:56:48

Motion diffusion models and Reinforcement Learning (RL) based control for physics-based simulations have complementary strengths for human motion generation. The former is capable of generating a wide variety of motions, adhering to intuitive control such as text, while the latter offers physically plausible motion and direct interaction with the environment. In this work, we present a method that combines their respective strengths. CLoSD is a text-driven RL physics-based controller, guided by diffusion generation for various tasks. Our key insight is that motion diffusion can serve as an on-the-fly universal planner for a robust RL controller. To this end, CLoSD maintains a closed-loop interaction between two modules -- a Diffusion Planner (DiP), and a tracking controller. DiP is a fast-responding autoregressive diffusion model, controlled by textual prompts and target locations, and the controller is a simple and robust motion imitator that continuously receives motion plans from DiP and provides feedback from the environment. CLoSD is capable of seamlessly performing a sequence of different tasks, including navigation to a goal location, striking an object with a hand or foot as specified in a text prompt, sitting down, and getting up. https://guytevet.github.io/CLoSD-page/

Hybrid Classical/RL Local Planner for Ground Robot Navigation

Authors:Vishnu D. Sharma, Jeongran Lee, Matthew Andrews, Ilija Hadžić
Date:2024-10-04 01:15:15

Local planning is an optimization process within a mobile robot navigation stack that searches for the best velocity vector, given the robot and environment state. Depending on how the optimization criteria and constraints are defined, some planners may be better than others in specific situations. We consider two conceptually different planners. The first planner explores the velocity space in real-time and has superior path-tracking and motion smoothness performance. The second planner was trained using reinforcement learning methods to produce the best velocity based on its training "experience". It is better at avoiding dynamic obstacles but at the expense of motion smoothness. We propose a simple yet effective meta-reasoning approach that takes advantage of both approaches by switching between planners based on the surroundings. We demonstrate the superiority of our hybrid planner, both qualitatively and quantitatively, over the individual planners on a live robot in different scenarios, achieving an improvement of 26% in the navigation time.
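
The switching idea can be sketched in a few lines. The rule below (use the RL planner whenever a dynamic obstacle is closer than a fixed radius) is a hypothetical stand-in: the abstract does not state the actual switching criterion, and `select_planner` and `switch_radius` are names invented for this example.

```python
def select_planner(dynamic_obstacle_distances, switch_radius=2.0):
    """Pick a local planner from the current surroundings.

    Returns "rl" when any dynamic obstacle is closer than switch_radius
    (in metres), otherwise "classical". The threshold rule is an
    assumption, not the paper's meta-reasoning method.
    """
    if any(d < switch_radius for d in dynamic_obstacle_distances):
        return "rl"
    return "classical"

# No moving obstacles nearby: prefer the smoother classical planner.
open_space_choice = select_planner([])
# A moving obstacle at 1.2 m: hand over to the RL planner.
crowded_choice = select_planner([5.0, 1.2])
```

A practical switcher would likely add hysteresis so the robot does not oscillate between planners near the threshold.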

Diffusion Meets Options: Hierarchical Generative Skill Composition for Temporally-Extended Tasks

Authors:Zeyu Feng, Hao Luan, Kevin Yuchen Ma, Harold Soh
Date:2024-10-03 11:10:37

Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instructions specified by linear temporal logic (LTL). Our method decomposes temporal tasks into a chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion-generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation. Demonstration videos are available online at: https://philiptheother.github.io/doppler/.

E2H: A Two-Stage Non-Invasive Neural Signal Driven Humanoid Robotic Whole-Body Control Framework

Authors:Yiqun Duan, Qiang Zhang, Jinzhao Zhou, Jingkai Sun, Xiaowei Jiang, Jiahang Cao, Jiaxu Wang, Yiqian Yang, Wen Zhao, Gang Han, Yijie Guo, Chin-Teng Lin
Date:2024-10-03 01:58:34

Recent advancements in humanoid robotics, including the integration of hierarchical reinforcement learning-based control and the utilization of LLM planning, have significantly enhanced the ability of robots to perform complex tasks. In contrast to the highly developed humanoid robots, the human factors involved remain relatively unexplored. Directly controlling humanoid robots with the brain has already appeared in many science fiction novels, such as Pacific Rim and Gundam. In this work, we present E2H (EEG-to-Humanoid), an innovative framework that pioneers the control of humanoid robots using high-frequency non-invasive neural signals. As the non-invasive signal quality remains too low for decoding precise spatial trajectories, we decompose the E2H framework into an innovative two-stage formation: 1) decoding neural signals (EEG) into semantic motion keywords, 2) utilizing LLM-facilitated motion generation with a precise motion imitation control policy to realize humanoid robotics control. The method of directly driving robots with brainwave commands offers a novel approach to human-machine collaboration, especially in situations where verbal commands are impractical, such as in cases of speech impairments, space exploration, or underwater exploration, unlocking significant potential. E2H offers an exciting glimpse into the future, holding immense potential for human-computer interaction.

Collaborative motion planning for multi-manipulator systems through Reinforcement Learning and Dynamic Movement Primitives

Authors:Siddharth Singh, Tian Xu, Qing Chang
Date:2024-10-01 14:52:05

Robotic tasks often require multiple manipulators to enhance task efficiency and speed, but this increases complexity in terms of collaboration, collision avoidance, and the expanded state-action space. To address these challenges, we propose a multi-level approach combining Reinforcement Learning (RL) and Dynamic Movement Primitives (DMP) to generate adaptive, real-time trajectories for new tasks in dynamic environments using a demonstration library. This method ensures collision-free trajectory generation and efficient collaborative motion planning. We validate the approach through experiments in the PyBullet simulation environment with UR5e robotic manipulators.

Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

Authors:Jie Cheng, Ruixi Qiao, Gang Xiong, Qinghai Miao, Yingwei Ma, Binhua Li, Yongbin Li, Yisheng Lv
Date:2024-10-01 10:25:03

A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation-based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games with 6 billion tokens of data to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data (approximately 4 trajectories) per game, demonstrating superior generalization. We will release codes and model weights at https://github.com/CJReinforce/JOWA.

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Authors:Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
Date:2024-10-01 03:47:00

Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they still struggle with failure recognition, limiting their real-world applicability. We introduce AHA, an open-source VLM designed to detect and reason about failures in robotic manipulation using natural language. By framing failure detection as a free-form reasoning task, AHA identifies failures and provides detailed, adaptable explanations across different robots, tasks, and environments. We fine-tuned AHA using FailGen, a scalable framework that generates the first large-scale dataset of robotic failure trajectories, the AHA dataset. FailGen achieves this by procedurally perturbing successful demonstrations from simulation. Despite being trained solely on the AHA dataset, AHA generalizes effectively to real-world failure datasets, robotic systems, and unseen tasks. It surpasses the second-best model (GPT-4o in-context learning) by 10.3% and exceeds the average performance of six compared models including five state-of-the-art VLMs by 35.3% across multiple metrics and datasets. We integrate AHA into three manipulation frameworks that utilize LLMs/VLMs for reinforcement learning, task and motion planning, and zero-shot trajectory generation. AHA's failure feedback enhances these policies' performances by refining dense reward functions, optimizing task planning, and improving sub-task verification, boosting task success rates by an average of 21.4% across all three tasks compared to GPT-4 models.

Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner

Authors:Chenyou Fan, Chenjia Bai, Zhao Shan, Haoran He, Yang Zhang, Zhen Wang
Date:2024-09-30 05:05:37

Diffusion models have demonstrated their capabilities in modeling trajectories of multi-tasks. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). They are costly due to the substantial human efforts required to collect expert data or design reward functions. To address these challenges, we aim to develop a versatile diffusion planner capable of leveraging large-scale inferior data that contains task-agnostic sub-optimal trajectories, with the ability to fast adapt to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner, which is generalizable for various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which can be sub-optimal and has wide data coverage. Then for downstream tasks, we adopt RL-based fine-tuning with task-specific rewards to quickly refine the diffusion planner, which aims to generate action sequences with higher task-specific returns. Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.

Generalizability of Graph Neural Networks for Decentralized Unlabeled Motion Planning

Authors:Shreyas Muthusamy, Damian Owerko, Charilaos I. Kanatsoulis, Saurav Agarwal, Alejandro Ribeiro
Date:2024-09-29 23:57:25

Unlabeled motion planning involves assigning a set of robots to target locations while ensuring collision avoidance, aiming to minimize the total distance traveled. The problem forms an essential building block for multi-robot systems in applications such as exploration, surveillance, and transportation. We address this problem in a decentralized setting where each robot knows only the positions of its $k$-nearest robots and $k$-nearest targets. This scenario combines elements of combinatorial assignment and continuous-space motion planning, posing significant scalability challenges for traditional centralized approaches. To overcome these challenges, we propose a decentralized policy learned via a Graph Neural Network (GNN). The GNN enables robots to determine (1) what information to communicate to neighbors and (2) how to integrate received information with local observations for decision-making. We train the GNN using imitation learning with the centralized Hungarian algorithm as the expert policy, and further fine-tune it using reinforcement learning to avoid collisions and enhance performance. Extensive empirical evaluations demonstrate the scalability and effectiveness of our approach. The GNN policy trained on 100 robots generalizes to scenarios with up to 500 robots, outperforming state-of-the-art solutions by 8.6% on average and significantly surpassing greedy decentralized methods. This work lays the foundation for solving multi-robot coordination problems in settings where scalability is important.
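
To make the centralized assignment objective concrete, the sketch below finds the robot-to-target matching with minimum total Euclidean distance. It uses brute-force enumeration rather than the Hungarian algorithm named in the abstract: on small instances it reaches the same minimum cost, but it runs in factorial time and is meant only as an illustration. The function name and the example coordinates are made up.

```python
import itertools
import math

def optimal_assignment(robots, targets):
    """Return (assignment, cost) minimizing total robot-to-target distance.

    assignment[i] is the index of the target given to robot i.
    Brute force over all permutations; a stand-in for the Hungarian
    algorithm that is only practical for a handful of robots.
    """
    n = len(robots)
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(robots[i], targets[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost

# Two robots and two targets: the cheapest matching sends each robot
# to the target directly above it.
robots = [(0.0, 0.0), (4.0, 0.0)]
targets = [(4.0, 1.0), (0.0, 1.0)]
assignment, total = optimal_assignment(robots, targets)
```

An imitation-learning setup of the kind described would treat such an optimal assignment as the expert label for each robot.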

Learning to Bridge the Gap: Efficient Novelty Recovery with Planning and Reinforcement Learning

Authors:Alicia Li, Nishanth Kumar, Tomás Lozano-Pérez, Leslie Kaelbling
Date:2024-09-28 03:41:25

The real world is unpredictable. Therefore, to solve long-horizon decision-making problems with autonomous robots, we must construct agents that are capable of adapting to changes in the environment during deployment. Model-based planning approaches can enable robots to solve complex, long-horizon tasks in a variety of environments. However, such approaches tend to be brittle when deployed into an environment featuring a novel situation that their underlying model does not account for. In this work, we propose to learn a "bridge policy" via Reinforcement Learning (RL) to adapt to such novelties. We introduce a simple formulation for such learning, where the RL problem is constructed with a special "CallPlanner" action that terminates the bridge policy and hands control of the agent back to the planner. This allows the RL policy to learn the set of states in which querying the planner and following the returned plan will achieve the goal. We show that this formulation enables the agent to rapidly learn by leveraging the planner's knowledge to avoid challenging long-horizon exploration caused by sparse reward. In experiments across three different simulated domains of varying complexity, we demonstrate that our approach is able to learn policies that adapt to novelty more efficiently than several baselines, including a pure RL baseline. We also demonstrate that the learned bridge policy is generalizable in that it can be combined with the planner to enable the agent to solve more complex tasks with multiple instances of the encountered novelty.
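
The control flow around the special action can be sketched as follows. Everything here is a toy assumption: the one-dimensional state, the hand-written `bridge_policy` (a learned policy would take its place), the `planner` that returns unit steps, and the rule that the planner only becomes useful from state 2 onward are all hypothetical and serve only to show how a "CallPlanner" action ends the bridge policy and returns control to the planner.

```python
CALL_PLANNER = "call_planner"

def run_episode(state, bridge_policy, planner, step, goal, max_steps=50):
    """Follow the bridge policy until it emits CALL_PLANNER, then run the plan.

    Returns True if the state equals the goal after the planner's plan
    has been executed, and False if the bridge policy never calls the planner.
    """
    for _ in range(max_steps):
        action = bridge_policy(state)
        if action == CALL_PLANNER:
            # The bridge policy terminates here; the planner takes over.
            for planned_action in planner(state, goal):
                state = step(state, planned_action)
            return state == goal
        state = step(state, action)
    return False

def step(state, action):
    # Toy dynamics on the integer line.
    return state + action

def planner(state, goal):
    # Toy planner: unit steps toward the goal.
    return [1] * (goal - state)

def bridge_policy(state):
    # Hand-written stand-in for a learned policy: move right until the
    # (hypothetical) novelty region below state 2 is cleared, then hand over.
    return 1 if state < 2 else CALL_PLANNER

reached_goal = run_episode(0, bridge_policy, planner, step, goal=5)
```

In an RL formulation, choosing `CALL_PLANNER` too early would lead to a failed plan and no reward, which is what lets the policy learn where the handover is safe.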

CurricuLLM: Automatic Task Curricula Design for Learning Complex Robot Skills using Large Language Models

Authors:Kanghyun Ryu, Qiayuan Liao, Zhongyu Li, Koushil Sreenath, Negar Mehr
Date:2024-09-27 01:48:16

Curriculum learning is a training mechanism in reinforcement learning (RL) that facilitates the achievement of complex policies by progressively increasing the task difficulty during training. However, designing effective curricula for a specific task often requires extensive domain knowledge and human intervention, which limits its applicability across various domains. Our core idea is that large language models (LLMs), with their extensive training on diverse language data and ability to encapsulate world knowledge, present significant potential for efficiently breaking down tasks and decomposing skills across various robotics environments. Additionally, the demonstrated success of LLMs in translating natural language into executable code for RL agents strengthens their role in generating task curricula. In this work, we propose CurricuLLM, which leverages the high-level planning and programming capabilities of LLMs for curriculum design, thereby enhancing the efficient learning of complex target tasks. CurricuLLM consists of: (Step 1) Generating a sequence of subtasks that aid target task learning in natural language form, (Step 2) Translating natural language descriptions of subtasks into executable task code, including the reward code and goal distribution code, and (Step 3) Evaluating trained policies based on trajectory rollout and subtask description. We evaluate CurricuLLM in various robotics simulation environments, ranging from manipulation and navigation to locomotion, to show that CurricuLLM can aid learning complex robot control tasks. In addition, we validate a humanoid locomotion policy learned through CurricuLLM in the real world. The code is provided at https://github.com/labicon/CurricuLLM

iWalker: Imperative Visual Planning for Walking Humanoid Robot

Authors:Xiao Lin, Yuhao Huang, Taimeng Fu, Xiaobin Xiong, Chen Wang
Date:2024-09-27 00:35:21

Humanoid robots, with the potential to perform a broad range of tasks in environments designed for humans, have been deemed crucial for the basis of general AI agents. When talking about planning and controlling, although traditional models and task-specific methods have been extensively studied over the past few decades, they are inadequate for achieving the flexibility and versatility needed for general autonomy. Learning approaches, especially reinforcement learning, are powerful and popular nowadays, but they are inherently "blind" during training, relying heavily on trials in simulation without proper guidance from physical principles or underlying dynamics. In response, we propose a novel end-to-end pipeline that seamlessly integrates perception, planning, and model-based control for humanoid robot walking. We refer to our method as iWalker, which is driven by imperative learning (IL), a self-supervising neuro-symbolic learning framework. This enables the robot to learn from arbitrary unlabeled data, significantly improving its adaptability and generalization capabilities. In experiments, iWalker demonstrates effectiveness in both simulated and real-world environments, representing a significant advancement toward versatile and autonomous humanoid robots.

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Authors:Zhenghao Peng, Wenjie Luo, Yiren Lu, Tianyi Shen, Cole Gulino, Ari Seff, Justin Fu
Date:2024-09-26 23:40:33

A major challenge in autonomous vehicle research is modeling agent behaviors, which has critical applications including constructing realistic and reliable simulations for off-board evaluation and forecasting traffic agents' motion for onboard planning. While supervised learning has shown success in modeling agents across various domains, these models can suffer from distribution shift when deployed at test-time. In this work, we improve the reliability of agent behaviors by closed-loop fine-tuning of behavior models with reinforcement learning. Our method demonstrates improved overall performance, as well as improved targeted metrics such as collision rate, on the Waymo Open Sim Agents challenge. Additionally, we present a novel policy evaluation benchmark to directly assess the ability of simulated agents to measure the quality of autonomous vehicle planners and demonstrate the effectiveness of our approach on this new benchmark.

Inverse Reinforcement Learning with Multiple Planning Horizons

Authors:Jiayu Yao, Weiwei Pan, Finale Doshi-Velez, Barbara E Engelhardt
Date:2024-09-26 16:55:31

In this work, we study an inverse reinforcement learning (IRL) problem where the experts are planning under a shared reward function but with different, unknown planning horizons. Without the knowledge of discount factors, the reward function has a larger feasible solution set, which makes it harder for existing IRL approaches to identify a reward function. To overcome this challenge, we develop algorithms that can learn a global multi-agent reward function with agent-specific discount factors that reconstruct the expert policies. We characterize the feasible solution space of the reward function and discount factors for both algorithms and demonstrate the generalizability of the learned reward function across multiple domains.

Navigation in a simplified Urban Flow through Deep Reinforcement Learning

Authors:Federica Tonti, Jean Rabault, Ricardo Vinuesa
Date:2024-09-26 15:05:15

The increasing number of unmanned aerial vehicles (UAVs) in urban environments requires a strategy to minimize their environmental impact, both in terms of energy efficiency and noise reduction. In order to reduce these concerns, novel strategies for developing prediction models and optimization of flight planning, for instance through deep reinforcement learning (DRL), are needed. Our goal is to develop DRL algorithms capable of enabling the autonomous navigation of UAVs in urban environments, taking into account the presence of buildings and other UAVs, optimizing the trajectories in order to reduce both energetic consumption and noise. This is achieved using fluid-flow simulations which represent the environment in which UAVs navigate and training the UAV as an agent interacting with an urban environment. In this work, we consider a domain represented by a two-dimensional flow field with obstacles, ideally representing buildings, extracted from a three-dimensional high-fidelity numerical simulation. The presented methodology, using PPO+LSTM cells, was validated by reproducing a simple but fundamental problem in navigation, namely Zermelo's problem, which deals with a vessel navigating in a turbulent flow, travelling from a starting point to a target location, optimizing the trajectory. The current method shows a significant improvement with respect to both a simple PPO and a TD3 algorithm, with a success rate (SR) of the PPO+LSTM trained policy of 98.7%, and a crash rate (CR) of 0.1%, outperforming both PPO (SR = 75.6%, CR = 18.6%) and TD3 (SR = 77.4%, CR = 14.5%). This is the first step towards DRL strategies which will guide UAVs in a three-dimensional flow field using real-time signals, making the navigation efficient in terms of flight time and avoiding damage to the vehicle.

Hierarchical End-to-End Autonomous Driving: Integrating BEV Perception with Deep Reinforcement Learning

Authors:Siyi Lu, Lei He, Shengbo Eben Li, Yugong Luo, Jianqiang Wang, Keqiang Li
Date:2024-09-26 09:14:16

End-to-end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird's-Eye-View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20%.

VertiSelector: Automatic Curriculum Learning for Wheeled Mobility on Vertically Challenging Terrain

Authors:Tong Xu, Chenhui Pan, Xuesu Xiao
Date:2024-09-26 02:02:58

Reinforcement Learning (RL) has the potential to enable extreme off-road mobility by circumventing complex kinodynamic modeling, planning, and control through simulated end-to-end trial-and-error learning experiences. However, most RL methods are sample-inefficient when training in a large number of manually designed simulation environments and struggle to generalize to the real world. To address these issues, we introduce VertiSelector (VS), an automatic curriculum learning framework designed to enhance learning efficiency and generalization by selectively sampling training terrain. VS prioritizes vertically challenging terrain with higher Temporal Difference (TD) errors when revisited, thereby allowing robots to learn at the edge of their evolving capabilities. By dynamically adjusting the sampling focus, VS significantly boosts sample efficiency and generalization within the VW-Chrono simulator built on the Chrono multi-physics engine. Furthermore, we provide simulation and physical results using VS on a Verti-4-Wheeler platform. These results demonstrate that VS can achieve 23.08% improvement in terms of success rate by efficiently sampling during training and robustly generalizing to the real world.
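
One simple way to prioritize terrain by TD error is to sample each terrain with probability proportional to the magnitude of its most recent TD error. The sketch below shows that scheme; the proportional rule, the `temperature` and `eps` parameters, and the terrain names are assumptions for illustration, since the abstract does not specify how VS turns TD errors into sampling weights.

```python
def terrain_sampling_probs(td_errors, temperature=1.0, eps=1e-6):
    """Map per-terrain TD errors to sampling probabilities.

    Each terrain's score is (|td_error| + eps) ** (1 / temperature);
    probabilities are the normalized scores. Lower temperature sharpens
    the focus on high-error terrain. Hypothetical scheme, not the
    paper's exact rule.
    """
    scores = {name: (abs(err) + eps) ** (1.0 / temperature)
              for name, err in td_errors.items()}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

# Terrain with the largest TD error receives the largest share of samples.
probs = terrain_sampling_probs({"flat": 0.1, "ramp": 0.4, "boulders": 1.5})
hardest = max(probs, key=probs.get)
```

The resulting dictionary could be passed as weights to `random.choices` when drawing the next training terrain.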

Cat-and-Mouse Satellite Dynamics: Divergent Adversarial Reinforcement Learning for Contested Multi-Agent Space Operations

Authors:Cameron Mehlman, Joseph Abramov, Gregory Falco
Date:2024-09-26 00:32:56

As space becomes increasingly crowded and contested, robust autonomous capabilities for multi-agent environments are gaining critical importance. Current autonomous systems in space primarily rely on optimization-based path planning or long-range orbital maneuvers, which have not yet proven effective in adversarial scenarios where one satellite is actively pursuing another. We introduce Divergent Adversarial Reinforcement Learning (DARL), a two-stage Multi-Agent Reinforcement Learning (MARL) approach designed to train autonomous evasion strategies for satellites engaged with multiple adversarial spacecraft. Our method enhances exploration during training by promoting diverse adversarial strategies, leading to more robust and adaptable evader models. We validate DARL through a cat-and-mouse satellite scenario, modeled as a partially observable multi-agent capture the flag game where two adversarial 'cat' spacecraft pursue a single 'mouse' evader. DARL's performance is compared against several benchmarks, including an optimization-based satellite path planner, demonstrating its ability to produce highly robust models for adversarial multi-agent space environments.

Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action

Authors:Xin Chen, Yifan Hu, Minda Zhao
Date:2024-09-25 17:56:02

Policy gradient methods are widely used in reinforcement learning. Yet, the nonconvexity of policy optimization imposes significant challenges in understanding the global convergence of policy gradient methods. For a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we develop a framework that provides a set of easily verifiable assumptions to ensure the Kurdyka-Lojasiewicz (KL) condition of the policy optimization. Leveraging the KL condition, policy gradient methods converge to the globally optimal policy with a non-asymptotic rate despite nonconvexity. Our results find applications in various control and operations models, including entropy-regularized tabular MDPs, Linear Quadratic Regulator (LQR) problems, stochastic inventory models, and stochastic cash balance problems, for which we show an $\epsilon$-optimal policy can be obtained using a sample size in $\tilde{\mathcal{O}}(\epsilon^{-1})$ and polynomial in terms of the planning horizon by stochastic policy gradient methods. Our result establishes the first sample complexity for multi-period inventory systems with Markov-modulated demands and stochastic cash balance problems in the literature.

AI-Driven Risk-Aware Scheduling for Active Debris Removal Missions

Authors:Antoine Poupon, Hugo de Rohan Willner, Pierre Nikitits, Adam Abdin
Date:2024-09-25 15:16:07

The proliferation of debris in Low Earth Orbit (LEO) represents a significant threat to space sustainability and spacecraft safety. Active Debris Removal (ADR) has emerged as a promising approach to address this issue, utilising Orbital Transfer Vehicles (OTVs) to facilitate debris deorbiting, thereby reducing future collision risks. However, ADR missions are substantially complex, necessitating accurate planning to make the missions economically viable and technically effective. Moreover, these servicing missions require a high level of autonomous capability to plan under evolving orbital conditions and changing mission requirements. In this paper, an autonomous decision-planning model based on Deep Reinforcement Learning (DRL) is developed to train an OTV to plan optimal debris removal sequencing. It is shown that using the proposed framework, the agent can find optimal mission plans and learn to update the planning autonomously to include risk handling of debris with high collision risk.

Multi-Robot Informative Path Planning for Efficient Target Mapping using Deep Reinforcement Learning

Authors:Apoorva Vashisth, Dipam Patel, Damon Conover, Aniket Bera
Date:2024-09-25 14:27:37

Autonomous robots are being employed in several mapping and data collection tasks due to their efficiency and low labor costs. In these tasks, the robots are required to map targets-of-interest in an unknown environment while constrained to a given resource budget such as path length or mission time. This is a challenging problem as each robot has to not only detect and avoid collisions with static obstacles in the environment but also model other robots' trajectories to avoid inter-robot collisions. We propose a novel deep reinforcement learning approach for multi-robot informative path planning to map targets-of-interest in an unknown 3D environment. A key aspect of our approach is an augmented graph that models other robots' trajectories to enable planning for communication and inter-robot collision avoidance. We train our decentralized reinforcement learning policy via the centralized training and decentralized execution paradigm. Once trained, our policy also scales to varying numbers of robots and does not require re-training. Our approach outperforms other state-of-the-art multi-robot target mapping approaches by 33.75% in terms of the number of discovered targets-of-interest. We open-source our code and model at: https://github.com/AccGen99/marl_ipp

Dynamic Obstacle Avoidance through Uncertainty-Based Adaptive Planning with Diffusion

Authors:Vineet Punyamoorty, Pascal Jutras-Dubé, Ruqi Zhang, Vaneet Aggarwal, Damon Conover, Aniket Bera
Date:2024-09-25 14:03:58

By framing reinforcement learning as a sequence modeling problem, recent work has enabled the use of generative models, such as diffusion models, for planning. While these models are effective in predicting long-horizon state trajectories in deterministic environments, they face challenges in dynamic settings with moving obstacles. Effective collision avoidance demands continuous monitoring and adaptive decision-making. While replanning at every timestep could ensure safety, it introduces substantial computational overhead due to the repetitive prediction of overlapping state sequences -- a process that is particularly costly with diffusion models, known for their intensive iterative sampling procedure. We propose an adaptive generative planning approach that dynamically adjusts replanning frequency based on the uncertainty of action predictions. Our method minimizes the need for frequent, computationally expensive, and redundant replanning while maintaining robust collision avoidance performance. In experiments, we obtain a 13.5% increase in the mean trajectory length and a 12.7% increase in mean reward over long-horizon planning, indicating a reduction in collision rates and an improved ability to navigate the environment safely.
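
A minimal sketch of an uncertainty-triggered replanning rule follows. The abstract does not say how uncertainty is estimated, so the variance across several sampled action predictions used here, along with the function names and the fixed `threshold`, is an assumption for illustration only.

```python
from statistics import pvariance


def action_uncertainty(action_samples):
    """Mean per-dimension variance across sampled action predictions.

    action_samples is a list of equally sized action vectors predicted
    for the same timestep (hypothetical uncertainty estimate).
    """
    dims = len(action_samples[0])
    per_dim = [pvariance([a[d] for a in action_samples]) for d in range(dims)]
    return sum(per_dim) / dims


def should_replan(action_samples, threshold):
    """Trigger a new plan only when the predictions disagree strongly."""
    return action_uncertainty(action_samples) > threshold
```

With a rule of this kind, the expensive generative planner runs only at timesteps where `should_replan` returns True; otherwise the agent keeps executing the existing plan.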

Revisiting Space Mission Planning: A Reinforcement Learning-Guided Approach for Multi-Debris Rendezvous

Authors:Agni Bandyopadhyay, Guenther Waxenegger-Wilfing
Date:2024-09-25 12:50:01

This research introduces a novel application of a masked Proximal Policy Optimization (PPO) algorithm from the field of deep reinforcement learning (RL), for determining the most efficient sequence of space debris visitation, utilizing the Lambert solver as per Izzo's adaptation for individual rendezvous. The aim is to optimize the sequence in which all the given debris should be visited so as to minimize the total rendezvous time for the entire mission. A neural network (NN) policy is developed, trained on simulated space missions with varying debris fields. After training, the neural network calculates approximately optimal paths using Izzo's adaptation of Lambert maneuvers. Performance is evaluated against standard heuristics in mission planning. The reinforcement learning approach demonstrates a significant improvement in planning efficiency by optimizing the sequence for debris rendezvous, reducing the total mission time by an average of approximately 10.96% and 13.66% compared to the Genetic and Greedy algorithms, respectively. The model on average identifies the most time-efficient sequence for debris visitation across various simulated scenarios with the fastest computational speed. This approach signifies a step forward in enhancing mission planning strategies for space debris clearance.
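
Action masking in a sequencing policy can be sketched as follows. This is an assumed, generic form of masking (already-visited debris get zero probability before the policy samples the next target); the function name and interface are hypothetical and not taken from the paper.

```python
import math


def masked_action_probs(logits, visited):
    """Softmax over policy logits with already-visited debris masked out.

    visited[i] is True if debris i has been visited; its probability becomes 0.
    """
    masked = [-math.inf if v else logit for logit, v in zip(logits, visited)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("all actions are masked")
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

Because masked entries receive exactly zero probability, the policy can never select a debris object twice, so every sampled sequence is a valid visitation order.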

OffRIPP: Offline RL-based Informative Path Planning

Authors:Srikar Babu Gadipudi, Srujan Deolasee, Siva Kailas, Wenhao Luo, Katia Sycara, Woojun Kim
Date:2024-09-25 11:30:59

Informative path planning (IPP) is a crucial task in robotics, where agents must design paths to gather valuable information about a target environment while adhering to resource constraints. Reinforcement learning (RL) has been shown to be effective for IPP; however, it requires environment interactions, which are risky and expensive in practice. To address this problem, we propose an offline RL-based IPP framework that optimizes information gain without requiring real-time interaction during training, offering safety and cost-efficiency by avoiding interaction, as well as superior performance and fast computation during execution -- key advantages of RL. Our framework leverages batch-constrained reinforcement learning to mitigate extrapolation errors, enabling the agent to learn from pre-collected datasets generated by arbitrary algorithms. We validate the framework through extensive simulations and real-world experiments. The numerical results show that our framework outperforms the baselines, demonstrating the effectiveness of the proposed approach.

Dashing for the Golden Snitch: Multi-Drone Time-Optimal Motion Planning with Multi-Agent Reinforcement Learning

Authors:Xian Wang, Jin Zhou, Yuanli Feng, Jiahao Mei, Jiming Chen, Shuo Li
Date:2024-09-25 08:09:52

Recent innovations in autonomous drones have facilitated time-optimal flight in single-drone configurations and enhanced maneuverability in multi-drone systems through the application of optimal control and learning-based methods. However, few studies have achieved time-optimal motion planning for multi-drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network for time-optimal multi-drone flight using multi-agent reinforcement learning. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision penalty inspired by optimization-based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training, while ensuring lightweight implementation. Extensive simulations show that, despite slight performance trade-offs compared to single-drone systems, our multi-drone approach maintains near-time-optimal performance with low collision rates. Real-world experiments validate our method, with two quadrotors using the same network as in simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m × 5.5 m × 2.0 m space across various tracks, relying entirely on onboard computation.
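
One way to picture a soft collision penalty is a term that is zero outside a safety distance and grows smoothly as two drones approach. The quadratic hinge form, the `safe_dist` and `weight` parameters, and the function name below are assumptions for illustration; the abstract does not specify the paper's actual penalty.

```python
import math


def soft_collision_penalty(pos_a, pos_b, safe_dist=1.0, weight=1.0):
    """Reward term that penalizes proximity smoothly instead of only on impact.

    Returns 0 when the drones are at least safe_dist apart, and a negative
    value that grows quadratically as the gap closes (hypothetical shape).
    """
    distance = math.dist(pos_a, pos_b)
    violation = max(0.0, safe_dist - distance)
    return -weight * violation ** 2
```

Compared with a hard penalty applied only on contact, a graded term like this gives the learner a signal before a collision happens, which is one plausible way to trade off flight speed against separation.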

Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

Authors:Jiankai Sun, Aidan Curtis, Yang You, Yan Xu, Michael Koehle, Leonidas Guibas, Sachin Chitta, Mac Schwager, Hui Li
Date:2024-09-24 20:42:42

Generalizable long-horizon robotic assembly requires reasoning at multiple levels of abstraction. End-to-end imitation learning (IL) has proven to be a promising approach, but it requires a large amount of demonstration data for training and often fails to meet the high-precision requirement of assembly tasks. Reinforcement Learning (RL) approaches have succeeded in high-precision assembly tasks, but suffer from sample inefficiency and hence are less competent at long-horizon tasks. To address these challenges, we propose a hierarchical modular approach, named ARCH (Adaptive Robotic Composition Hierarchy), which enables long-horizon high-precision assembly in contact-rich settings. ARCH employs a hierarchical planning framework, including a low-level primitive library of continuously parameterized skills and a high-level policy. The low-level primitive library includes essential skills for assembly tasks, such as grasping and inserting. These primitives consist of both RL and model-based controllers. The high-level policy, learned via imitation learning from a handful of demonstrations, selects the appropriate primitive skills and instantiates them with continuous input parameters. We extensively evaluate our approach on a real robot manipulation platform. We show that while trained on a single task, ARCH generalizes well to unseen tasks and outperforms baseline methods in terms of success rate and data efficiency. Videos can be found at https://long-horizon-assembly.github.io.

Multi-UAV Pursuit-Evasion with Online Planning in Unknown Environments by Deep Reinforcement Learning

Authors:Jiayu Chen, Chao Yu, Guosheng Li, Wenhao Tang, Xinyi Yang, Botian Xu, Huazhong Yang, Yu Wang
Date:2024-09-24 08:40:04

Multi-UAV pursuit-evasion, where pursuers aim to capture evaders, poses a key challenge for UAV swarm intelligence. Multi-agent reinforcement learning (MARL) has demonstrated potential in modeling cooperative behaviors, but most RL-based approaches remain constrained to simplified simulations with limited dynamics or fixed scenarios. Previous attempts to deploy RL policies to real-world pursuit-evasion are largely restricted to two-dimensional scenarios, such as ground vehicles or UAVs at fixed altitudes. In this paper, we address multi-UAV pursuit-evasion by considering UAV dynamics and physical constraints. We introduce an evader prediction-enhanced network to tackle partial observability in cooperative strategy learning. Additionally, we propose an adaptive environment generator within MARL training, enabling higher exploration efficiency and better policy generalization across diverse scenarios. Simulations show our method significantly outperforms all baselines in challenging scenarios, generalizing to unseen scenarios with a 100% capture rate. Finally, we derive a feasible policy via a two-stage reward refinement and deploy the policy on real quadrotors in a zero-shot manner. To our knowledge, this is the first work to derive and deploy an RL-based policy using collective thrust and body rates control commands for multi-UAV pursuit-evasion in unknown environments. The open-source code and videos are available at https://sites.google.com/view/pursuit-evasion-rl.

Autonomous Wheel Loader Navigation Using Goal-Conditioned Actor-Critic MPC

Authors:Aleksi Mäki-Penttilä, Naeim Ebrahimi Toulkani, Reza Ghabcheloo
Date:2024-09-24 04:06:01

This paper proposes a novel control method for an autonomous wheel loader, enabling time-efficient navigation to an arbitrary goal pose. Unlike prior works that combine high-level trajectory planners with Model Predictive Control (MPC), we directly enhance the planning capabilities of MPC by integrating a cost function derived from Actor-Critic Reinforcement Learning (RL). Specifically, we train an RL agent to solve the pose reaching task in simulation, then incorporate the trained neural network critic as both the stage and terminal cost of an MPC. We show through comprehensive simulations that the resulting MPC inherits the time-efficient behavior of the RL agent, generating trajectories that compare favorably against those found using trajectory optimization. We also deploy our method on a real wheel loader, where we successfully navigate to various goal poses.

NavRL: Learning Safe Flight in Dynamic Environments

Authors:Zhefan Xu, Xinming Han, Haoyu Shen, Hanyu Jin, Kenji Shimada
Date:2024-09-24 00:36:34

Safe flight in dynamic environments requires autonomous unmanned aerial vehicles (UAVs) to make effective decisions when navigating cluttered spaces with moving obstacles. Traditional approaches often decompose decision-making into hierarchical modules for prediction and planning. Although these handcrafted systems can perform well in specific settings, they might fail if environmental conditions change and often require careful parameter tuning. Additionally, their solutions could be suboptimal due to the use of inaccurate mathematical model assumptions and simplifications aimed at achieving computational efficiency. To overcome these limitations, this paper introduces the NavRL framework, a deep reinforcement learning-based navigation method built on the Proximal Policy Optimization (PPO) algorithm. NavRL utilizes our carefully designed state and action representations, allowing the learned policy to make safe decisions in the presence of both static and dynamic obstacles, with zero-shot transfer from simulation to real-world flight. Furthermore, the proposed method adopts a simple but effective safety shield for the trained policy, inspired by the concept of velocity obstacles, to mitigate potential failures associated with the black-box nature of neural networks. To accelerate the convergence, we implement the training pipeline using NVIDIA Isaac Sim, enabling parallel training with thousands of quadcopters. Simulation and physical experiments show that our method ensures safe navigation in dynamic environments and results in the fewest collisions compared to benchmarks in scenarios with dynamic obstacles.

ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback

Authors:Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang
Date:2024-09-23 08:58:48

Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM's task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users' usage habits. Our data and code will be released upon acceptance.

DROP: Dexterous Reorientation via Online Planning

Authors:Albert H. Li, Preston Culbertson, Vince Kurtz, Aaron D. Ames
Date:2024-09-22 19:00:53

Achieving human-like dexterity is a longstanding challenge in robotics, in part due to the complexity of planning and control for contact-rich systems. In reinforcement learning (RL), one popular approach has been to use massively-parallelized, domain-randomized simulations to learn a policy offline over a vast array of contact conditions, allowing robust sim-to-real transfer. Inspired by recent advances in real-time parallel simulation, this work considers instead the viability of online planning methods for contact-rich manipulation by studying the well-known in-hand cube reorientation task. We propose a simple architecture that employs a sampling-based predictive controller and vision-based pose estimator to search for contact-rich control actions online. We conduct thorough experiments to assess the real-world performance of our method, architectural design choices, and key factors for robustness, demonstrating that our simple sampling-based approach achieves performance comparable to prior RL-based works. Supplemental material: https://caltech-amber.github.io/drop.

Work Smarter Not Harder: Simple Imitation Learning with CS-PIBT Outperforms Large Scale Imitation Learning for MAPF

Authors:Rishi Veerapaneni, Arthur Jakobsson, Kevin Ren, Samuel Kim, Jiaoyang Li, Maxim Likhachev
Date:2024-09-22 15:36:29

Multi-Agent Path Finding (MAPF) is the problem of effectively finding efficient collision-free paths for a group of agents in a shared workspace. The MAPF community has largely focused on developing high-performance heuristic search methods. Recently, several works have applied various machine learning (ML) techniques to solve MAPF, usually involving sophisticated architectures, reinforcement learning techniques, and set-ups, but none using large amounts of high-quality supervised data. Our initial objective in this work was to show how simple large scale imitation learning of high-quality heuristic search methods can lead to state-of-the-art ML MAPF performance. However, we find that, at least with our model architecture, simple large scale (700k examples with hundreds of agents per example) imitation learning does not produce impressive results. Instead, we find that by using prior work that post-processes MAPF model predictions to resolve 1-step collisions (CS-PIBT), we can train a simple ML MAPF model in minutes that dramatically outperforms existing ML MAPF policies. This has serious implications for all future ML MAPF policies (with local communication) which currently struggle to scale. In particular, this finding implies that future learnt policies should (1) always use smart 1-step collision shields (e.g. CS-PIBT) and (2) always include the collision shield with greedy actions as a baseline (e.g. PIBT), and (3) it motivates future models to focus on longer-horizon, more complex planning, as 1-step collisions can be efficiently resolved.

Subassembly to Full Assembly: Effective Assembly Sequence Planning through Graph-based Reinforcement Learning

Authors:Chang Shu, Anton Kim, Shinkyu Park
Date:2024-09-20 16:32:32

This paper proposes an assembly sequence planning framework, named Subassembly to Assembly (S2A). The framework is designed to enable a robotic manipulator to assemble multiple parts in a prespecified structure by leveraging object manipulation actions. The primary technical challenge lies in the exponentially increasing complexity of identifying a feasible assembly sequence as the number of parts grows. To address this, we introduce a graph-based reinforcement learning approach, where a graph attention network is trained using a delayed reward assignment strategy. In this strategy, rewards are assigned only when an assembly action contributes to the successful completion of the assembly task. We validate the framework's performance through physics-based simulations, comparing it against various baselines to emphasize the significance of the proposed reward assignment approach. Additionally, we demonstrate the feasibility of deploying our framework in a real-world robotic assembly scenario.
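
The delayed reward assignment described above can be sketched as a post-episode labeling step. How the framework decides that an action "contributes" is not stated in the abstract, so the boolean `contributed` flags, the function name, and the unit reward are assumptions for illustration.

```python
def assign_delayed_rewards(contributed, success, reward=1.0):
    """Return one reward per step of a finished assembly episode.

    contributed[i] is True if step i's action is judged to have contributed
    to the final assembly (hypothetical flag). No step is rewarded unless
    the whole assembly task succeeded.
    """
    if not success:
        return [0.0] * len(contributed)
    return [reward if c else 0.0 for c in contributed]
```

Rewards are therefore withheld until the episode outcome is known, so actions in failed episodes, and non-contributing actions in successful ones, receive nothing.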

Human-Robot Cooperative Distribution Coupling for Hamiltonian-Constrained Social Navigation

Authors:Weizheng Wang, Chao Yu, Yu Wang, Byung-Cheol Min
Date:2024-09-20 15:17:51

Navigating in human-filled public spaces is a critical challenge for deploying autonomous robots in real-world environments. This paper introduces NaviDIFF, a novel Hamiltonian-constrained socially-aware navigation framework designed to address the complexities of human-robot interaction and socially-aware path planning. NaviDIFF integrates a port-Hamiltonian framework to model dynamic physical interactions and a diffusion model to manage uncertainty in human-robot cooperation. The framework leverages a spatial-temporal transformer to capture social and temporal dependencies, enabling more accurate pedestrian strategy predictions and port-Hamiltonian dynamics construction. Additionally, reinforcement learning from human feedback is employed to fine-tune robot policies, ensuring adaptation to human preferences and social norms. Extensive experiments demonstrate that NaviDIFF outperforms state-of-the-art methods in social navigation tasks, offering improved stability, efficiency, and adaptability.

From Cognition to Precognition: A Future-Aware Framework for Social Navigation

Authors:Zeying Gong, Tianshuai Hu, Ronghe Qiu, Junwei Liang
Date:2024-09-20 06:08:24

To navigate safely and efficiently in crowded spaces, robots should not only perceive the current state of the environment but also anticipate future human movements. In this paper, we propose a reinforcement learning architecture, namely Falcon, to tackle socially-aware navigation by explicitly predicting human trajectories and penalizing actions that block future human paths. To facilitate realistic evaluation, we introduce a novel SocialNav benchmark containing two new datasets, Social-HM3D and Social-MP3D. This benchmark offers large-scale photo-realistic indoor scenes populated with a reasonable number of human agents based on scene area size, incorporating natural human movements and trajectory patterns. We conduct a detailed experimental analysis with the state-of-the-art learning-based method and two classic rule-based path-planning algorithms on the new benchmark. The results demonstrate the importance of future prediction, and our method achieves the best task success rate of 55% while maintaining about 90% personal space compliance. We will release our code and datasets. Videos of demonstrations can be viewed at https://zeying-gong.github.io/projects/falcon/ .

Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case

Authors:Peng Chen, Pi Bu, Jun Song, Yuan Gao, Bo Zheng
Date:2024-09-19 16:30:25

Recently, large language model (LLM)-based agents have made significant advances across various fields. One of the most popular research areas involves applying these agents to video games. Traditionally, these methods have relied on game APIs to access in-game environmental and action data. However, this approach is limited by the availability of APIs and does not reflect how humans play games. With the advent of vision language models (VLMs), agents now have enhanced visual understanding capabilities, enabling them to interact with games using only visual inputs. Despite these advances, current approaches still face challenges in action-oriented tasks, particularly in action role-playing games (ARPGs), where reinforcement learning methods are prevalent but suffer from poor generalization and require extensive training. To address these limitations, we select an ARPG, "Black Myth: Wukong", as a research platform to explore the capability boundaries of existing VLMs in scenarios requiring visual-only input and complex action output. We define 12 tasks within the game, with 75% focusing on combat, and incorporate several state-of-the-art VLMs into this benchmark. Additionally, we will release a human operation dataset containing recorded gameplay videos and operation logs, including mouse and keyboard actions. Moreover, we propose a novel VARP (Vision Action Role-Playing) agent framework, consisting of an action planning system and a visual trajectory system. Our framework demonstrates the ability to perform basic tasks and succeed in 90% of easy and medium-level combat scenarios. This research aims to provide new insights and directions for applying multimodal agents in complex action game environments. The code and datasets will be made available at https://varp-agent.github.io/.

Robots that Learn to Safely Influence via Prediction-Informed Reach-Avoid Dynamic Games

Authors:Ravi Pandya, Changliu Liu, Andrea Bajcsy
Date:2024-09-18 17:15:21

Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach-avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal-driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high-dimensional (39-D) simulated human-robot collaborative manipulation task solved via offline game-theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst-case adversary, a safety controller that does not explicitly reason about influence, and an energy-function-based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution.

XP-MARL: Auxiliary Prioritization in Multi-Agent Reinforcement Learning to Address Non-Stationarity

Authors:Jianye Xu, Omar Sobhy, Bassam Alrifaee
Date:2024-09-18 10:10:55

Non-stationarity poses a fundamental challenge in Multi-Agent Reinforcement Learning (MARL), arising from agents simultaneously learning and altering their policies. This creates a non-stationary environment from the perspective of each individual agent, often leading to suboptimal or even unconverged learning outcomes. We propose an open-source framework named XP-MARL, which augments MARL with auxiliary prioritization to address this challenge in cooperative settings. XP-MARL is 1) founded upon our hypothesis that prioritizing agents and letting higher-priority agents establish their actions first would stabilize the learning process and thus mitigate non-stationarity and 2) enabled by our proposed mechanism called action propagation, where higher-priority agents act first and communicate their actions, providing a more stationary environment for others. Moreover, instead of using a predefined or heuristic priority assignment, XP-MARL learns priority-assignment policies with an auxiliary MARL problem, leading to a joint learning scheme. Experiments in a motion-planning scenario involving Connected and Automated Vehicles (CAVs) demonstrate that XP-MARL improves the safety of a baseline model by 84.4% and outperforms a state-of-the-art approach, which improves the baseline by only 12.8%. Code: github.com/cas-lab-munich/sigmarl
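
The action propagation mechanism can be pictured with a small sketch in which agents act in descending priority order and each agent sees the actions already chosen by higher-priority agents. This is an illustrative assumption, not the XP-MARL code: the function name, the dictionary-based interface, and the callable policies are hypothetical, and priorities are passed in rather than learned.

```python
def act_in_priority_order(priorities, policies, observations):
    """Let agents choose actions sequentially, highest priority first.

    priorities:   dict agent -> priority score (assumed given here)
    policies:     dict agent -> callable(observation, higher_actions) -> action
    observations: dict agent -> that agent's local observation
    """
    order = sorted(priorities, key=priorities.get, reverse=True)
    chosen = {}
    for agent in order:
        # Each agent receives a copy of the actions committed so far.
        chosen[agent] = policies[agent](observations[agent], dict(chosen))
    return chosen
```

For example, a lower-priority vehicle's policy can inspect `higher_actions` and yield when a higher-priority vehicle has already committed to entering a shared intersection.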

Optimizing Job Shop Scheduling in the Furniture Industry: A Reinforcement Learning Approach Considering Machine Setup, Batch Variability, and Intralogistics

Authors:Malte Schneevogt, Karsten Binninger, Noah Klarmann
Date:2024-09-18 09:12:40

This paper explores the potential application of Deep Reinforcement Learning in the furniture industry. To offer a broad product portfolio, most furniture manufacturers are organized as a job shop, which ultimately results in the Job Shop Scheduling Problem (JSSP). The JSSP is addressed with a focus on extending traditional models to better represent the complexities of real-world production environments. Existing approaches frequently fail to consider critical factors such as machine setup times or varying batch sizes. A concept for a model is proposed that provides a higher level of information detail to enhance scheduling accuracy and efficiency. The concept introduces the integration of DRL for production planning, particularly suited to batch production industries such as the furniture industry. The model extends traditional approaches to JSSPs by including job volumes, buffer management, transportation times, and machine setup times. This enables more precise forecasting and analysis of production flows and processes, accommodating the variability and complexity inherent in real-world manufacturing processes. The RL agent learns to optimize scheduling decisions. It operates within a discrete action space, making decisions based on detailed observations. A reward function guides the agent's decision-making process, thereby promoting efficient scheduling and meeting production deadlines. Two integration strategies for implementing the RL agent are discussed: episodic planning, which is suitable for low-automation environments, and continuous planning, which is ideal for highly automated plants. While episodic planning can be employed as a standalone solution, the continuous planning approach necessitates the integration of the agent with ERP and Manufacturing Execution Systems. This integration enables real-time adjustments to production schedules based on dynamic changes.

Synthesizing Evolving Symbolic Representations for Autonomous Systems

Authors:Gabriele Sartor, Angelo Oddi, Riccardo Rasconi, Vieri Giuliano Santucci, Rosa Meo
Date:2024-09-18 07:23:26

Recently, AI systems have made remarkable progress in various tasks. Deep Reinforcement Learning (DRL) is an effective tool for agents to learn policies in low-level state spaces to solve highly complex tasks. Researchers have introduced Intrinsic Motivation (IM) to the RL mechanism, which simulates the agent's curiosity, encouraging agents to explore interesting areas of the environment. This new feature has proved vital in enabling agents to learn policies without being given specific goals. However, even though DRL intelligence emerges through a sub-symbolic model, there is still a need for some form of abstraction to understand the knowledge collected by the agent. To this end, the classical planning formalism has been used in recent research to explicitly represent the knowledge an autonomous agent acquires and effectively reach extrinsic goals. Although classical planning usually offers limited expressive capabilities, PPDDL has proved useful for reviewing the knowledge gathered by an autonomous system and making causal correlations explicit, and it can be exploited to find a plan to reach any state the agent faces during its experience. This work presents a new architecture implementing an open-ended learning system able to synthesize its experience from scratch into a PPDDL representation and update it over time. Without a predefined set of goals and tasks, the system integrates intrinsic motivations to explore the environment in a self-directed way, exploiting the high-level knowledge acquired during its experience. The system explores the environment and iteratively (a) discovers options, (b) explores the environment using options, (c) abstracts the knowledge collected, and (d) plans. This paper proposes an alternative approach to implementing open-ended learning architectures, exploiting low-level and high-level representations to extend the system's knowledge in a virtuous loop.

Automating proton PBS treatment planning for head and neck cancers using policy gradient-based deep reinforcement learning

Authors:Qingqing Wang, Chang Chang
Date:2024-09-17 22:01:56

Proton pencil beam scanning (PBS) treatment planning for head and neck (H&N) cancers is a time-consuming and experience-demanding task where a large number of planning objectives are involved. Deep reinforcement learning (DRL) has recently been introduced to the planning processes of intensity-modulated radiation therapy and brachytherapy for prostate, lung, and cervical cancers. However, existing approaches are built upon the Q-learning framework and weighted linear combinations of clinical metrics, suffering from poor scalability and flexibility and only capable of adjusting a limited number of planning objectives in discrete action spaces. We propose an automatic treatment planning model using the proximal policy optimization (PPO) algorithm and a dose distribution-based reward function for proton PBS treatment planning of H&N cancers. Specifically, a set of empirical rules is used to create auxiliary planning structures from target volumes and organs-at-risk (OARs), along with their associated planning objectives. These planning objectives are fed into an in-house optimization engine to generate the spot monitor unit (MU) values. A decision-making policy network trained using PPO is developed to iteratively adjust the involved planning objective parameters in a continuous action space and refine the PBS treatment plans using a novel dose distribution-based reward function. Proton H&N treatment plans generated by the model show improved OAR sparing with equal or superior target coverage when compared with human-generated plans. Moreover, additional experiments on liver cancer demonstrate that the proposed method can be successfully generalized to other treatment sites. To the best of our knowledge, this is the first DRL-based automatic treatment planning model capable of achieving human-level performance for H&N cancers.

Leveraging Symmetry to Accelerate Learning of Trajectory Tracking Controllers for Free-Flying Robotic Systems

Authors:Jake Welde, Nishanth Rao, Pratik Kunapuli, Dinesh Jayaraman, Vijay Kumar
Date:2024-09-17 14:39:24

Tracking controllers enable robotic systems to accurately follow planned reference trajectories. In particular, reinforcement learning (RL) has shown promise in the synthesis of controllers for systems with complex dynamics and modest online compute budgets. However, the poor sample efficiency of RL and the challenges of reward design make training slow and sometimes unstable, especially for high-dimensional systems. In this work, we leverage the inherent Lie group symmetries of robotic systems with a floating base to mitigate these challenges when learning tracking controllers. We model a general tracking problem as a Markov decision process (MDP) that captures the evolution of both the physical and reference states. Next, we prove that symmetry in the underlying dynamics and running costs leads to an MDP homomorphism, a mapping that allows a policy trained on a lower-dimensional "quotient" MDP to be lifted to an optimal tracking controller for the original system. We compare this symmetry-informed approach to an unstructured baseline, using Proximal Policy Optimization (PPO) to learn tracking controllers for three systems: the Particle (a forced point mass), the Astrobee (a fully-actuated space robot), and the Quadrotor (an underactuated system). Results show that a symmetry-aware approach both accelerates training and reduces tracking error after the same number of training steps.

Agile Continuous Jumping in Discontinuous Terrains

Authors:Yuxiang Yang, Guanya Shi, Changyi Lin, Xiangyun Meng, Rosario Scalise, Mateo Guaman Castro, Wenhao Yu, Tingnan Zhang, Ding Zhao, Jie Tan, Byron Boots
Date:2024-09-17 06:42:50

We focus on agile, continuous, and terrain-adaptive jumping of quadrupedal robots in discontinuous terrains such as stairs and stepping stones. Unlike single-step jumping, continuous jumping requires accurately executing highly dynamic motions over long horizons, which is challenging for existing approaches. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Our framework enables a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and completes a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities. Experiment videos can be found at https://yxyang.github.io/jumping_cod/

DIGIMON: Diagnosis and Mitigation of Sampling Skew for Reinforcement Learning based Meta-Planner in Robot Navigation

Authors:Shiwei Feng, Xuan Chen, Zhiyuan Cheng, Zikang Xiong, Yifei Gao, Siyuan Cheng, Sayali Kate, Xiangyu Zhang
Date:2024-09-17 01:49:17

Robot navigation is increasingly crucial across applications like delivery services and warehouse management. The integration of Reinforcement Learning (RL) with classical planning has given rise to meta-planners that combine the adaptability of RL with the explainable decision-making of classical planners. However, the exploration capabilities of RL-based meta-planners during training are often constrained by the capabilities of the underlying classical planners. This constraint can result in limited exploration, thereby leading to sampling skew issues. To address these issues, our paper introduces a novel framework, DIGIMON, which begins with behavior-guided diagnosis for exploration bottlenecks within the meta-planner and follows up with a mitigation strategy that conducts up-sampling from diagnosed bottleneck data. Our evaluation shows 13.5%+ improvement in navigation performance, greater robustness in out-of-distribution environments, and a 4x boost in training efficiency. DIGIMON is designed as a versatile, plug-and-play solution, allowing seamless integration into various RL-based meta-planners.
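
The up-sampling idea mentioned above can be illustrated with a minimal sketch. This is not DIGIMON's actual diagnosis or mitigation procedure, which the abstract does not specify; it only shows generic weighted re-sampling, assuming some diagnosis step has already flagged certain episodes as bottlenecks. The function name, the `is_bottleneck` predicate, and the `boost` factor are all hypothetical.

```python
import random


def upsample_bottlenecks(episodes, is_bottleneck, batch_size, boost=5.0, rng=None):
    """Draw a training batch in which flagged episodes are over-represented.

    `is_bottleneck` is a hypothetical predicate standing in for a diagnosis step;
    `boost` is an assumed weight multiplier, not a value from the paper.
    """
    rng = rng or random.Random()
    # Flagged episodes get weight `boost`, all others weight 1.
    weights = [boost if is_bottleneck(ep) else 1.0 for ep in episodes]
    # Sampling with replacement, proportional to the weights.
    return rng.choices(episodes, weights=weights, k=batch_size)
```

With 10% of episodes flagged and `boost=5.0`, flagged episodes make up roughly a third of each batch in expectation, which counteracts the skew toward well-explored regions.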

Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens

Authors:Joseph Clinton, Robert Lieck
Date:2024-09-14 19:30:53

Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent's future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model's policy through the interpretable plan visualisations and attention map.

PIP-Loco: A Proprioceptive Infinite Horizon Planning Framework for Quadrupedal Robot Locomotion

Authors:Aditya Shirwatkar, Naman Saxena, Kishore Chandra, Shishir Kolathaya
Date:2024-09-14 13:51:37

A core strength of Model Predictive Control (MPC) for quadrupedal locomotion has been its ability to enforce constraints and provide interpretability of the sequence of commands over the horizon. However, despite being able to plan, MPC struggles to scale with task complexity, often failing to achieve robust behavior on rapidly changing surfaces. On the other hand, model-free Reinforcement Learning (RL) methods have outperformed MPC on multiple terrains, showing emergent motions, but they inherently lack any ability to handle constraints or perform planning. To address these limitations, we propose a framework that integrates proprioceptive planning with RL, allowing for agile and safe locomotion behaviors through the horizon. Inspired by MPC, we incorporate an internal model that includes a velocity estimator and a Dreamer module. During training, the framework learns an expert policy and an internal model that are co-dependent, facilitating exploration for improved locomotion behaviors. During deployment, the Dreamer module solves an infinite-horizon MPC problem, adapting actions and velocity commands to respect the constraints. We validate the robustness of our training framework through ablation studies on internal model components and demonstrate improved robustness to training noise. Finally, we evaluate our approach across multi-terrain scenarios in both simulation and hardware.

DexSim2Real$^{2}$: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation

Authors:Taoran Jiang, Liqian Ma, Yixuan Guan, Jiaojiao Meng, Weihang Chen, Zecui Zeng, Lusong Li, Dan Wu, Jing Xu, Rui Chen
Date:2024-09-13 12:00:57

Articulated object manipulation is ubiquitous in daily life. In this paper, we present DexSim2Real$^{2}$, a novel robot learning framework for goal-conditioned articulated object manipulation using both two-finger grippers and multi-finger dexterous hands. The key of our framework is constructing an explicit world model of unseen articulated objects through active one-step interactions. This explicit world model enables sampling-based model predictive control to plan trajectories achieving different manipulation goals without needing human demonstrations or reinforcement learning. It first predicts an interaction motion using an affordance estimation network trained on self-supervised interaction data or videos of human manipulation from the internet. After executing this interaction on the real robot, the framework constructs a digital twin of the articulated object in simulation based on the two point clouds before and after the interaction. For dexterous multi-finger manipulation, we propose to utilize eigengrasp to reduce the high-dimensional action space, enabling more efficient trajectory searching. Extensive experiments validate the framework's effectiveness for precise articulated object manipulation in both simulation and the real world using a two-finger gripper and a 16-DoF dexterous hand. The robust generalizability of the explicit world model also enables advanced manipulation strategies, such as manipulating with different tools.

CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks

Authors:Tianlong Wang, Junzhe Chen, Xueting Han, Jing Bai
Date:2024-09-13 08:59:31

Post-training, particularly reinforcement learning (RL) using self-play-generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task-specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with limited action space, LLMs operate in an infinite space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the action space on high-level abstract plans to enhance model generalization and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plan, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks, and 2) learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step preference obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC-C (+4.0%), MMLU-STEM (+2.2%), and BBH (+1.8%).

Online Decision MetaMorphFormer: A Casual Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence

Authors:Luo Ji, Runji Lin
Date:2024-09-11 15:22:43

Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge is adaptive to multiple tasks and universal environments. Despite there being increasing efforts in the field of Reinforcement Learning (RL) with the aid of transformers, most of them might be limited by the offline training pipeline, which prohibits exploration and generalization abilities. To address this limitation, we propose the framework of Online Decision MetaMorphFormer (ODM) which aims to achieve self-awareness, environment recognition, and action planning through a unified model architecture. Motivated by cognitive and behavioral psychology, an ODM agent is able to learn from others, recognize the world, and practice itself based on its own experience. ODM can also be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets. Through the use of pre-trained datasets, ODM can quickly warm up and learn the necessary knowledge to perform the desired task, while the target environment continues to reinforce the universal policy. Extensive online experiments as well as few-shot and zero-shot environmental tests are used to verify ODM's performance and generalization ability. The results of our study contribute to the study of general artificial intelligence in embodied and cognitive fields. Code, results, and video examples can be found on the website https://rlodm.github.io/odm/.

Enhancing Cross-domain Pre-Trained Decision Transformers with Adaptive Attention

Authors:Wenhao Zhao, Qiushui Xu, Linjie Xu, Lei Song, Jinyu Wang, Chunlai Zhou, Jiang Bian
Date:2024-09-11 03:18:34

Recently, the pre-training of decision transformers (DT) using a different domain, such as natural language text, has generated significant attention in offline reinforcement learning (Offline RL). Although this cross-domain pre-training approach achieves superior performance compared to training from scratch in environments requiring short-term planning ability, the mechanisms by which pre-training benefits the fine-tuning phase remain unclear. Furthermore, we point out that the cross-domain pre-training approach hinders the extraction of distant information in environments like PointMaze that require long-term planning ability, leading to performance that is much worse than training DT from scratch. This work first analyzes these issues and finds that the Markov Matrix, a component that exists in pre-trained attention heads, is the key to explaining the significant performance disparity of pre-trained models across different planning abilities. Inspired by our analysis, we propose a general method, GPT-DTMA, which equips a pre-trained DT with Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, and in long-term environments, it mitigates the negative impact caused by the Markov Matrix, achieving results comparable to those of DT trained from scratch.

Developing Path Planning with Behavioral Cloning and Proximal Policy Optimization for Path-Tracking and Static Obstacle Nudging

Authors:Mingyan Zhou, Biao Wang, Tian Tan, Xiatao Sun
Date:2024-09-09 02:54:24

In autonomous driving, end-to-end methods utilizing Imitation Learning (IL) and Reinforcement Learning (RL) are becoming more and more common. However, they do not involve explicit reasoning, as in classic robotics workflows and planning with horizons, resulting in implicit and myopic strategies. In this paper, we introduce a path planning method that uses Behavioral Cloning (BC) for path-tracking and Proximal Policy Optimization (PPO) for static obstacle nudging. It outputs lateral offset values to adjust the given reference waypoints and produces a modified path for different controllers. Experimental results show that the algorithm can follow paths in a way that mimics the expert performance of path-tracking controllers and can avoid collisions with fixed obstacles. The method makes a good attempt at planning with learning-based methods in path planning problems of autonomous driving.
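
To make the "lateral offset" output concrete, here is a minimal sketch of how per-waypoint offsets could be applied to a 2D reference path. It is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the path heading is estimated from neighbouring waypoints, and a positive offset is assumed to mean "to the left of the direction of travel".

```python
import math


def apply_lateral_offsets(waypoints, offsets):
    """Shift each (x, y) waypoint sideways by the matching offset.

    Assumption: positive offsets move the point to the left of the local
    path direction; the sign convention in the paper may differ.
    """
    shifted = []
    n = len(waypoints)
    for i, ((x, y), d) in enumerate(zip(waypoints, offsets)):
        # Local tangent from the neighbouring waypoints (one-sided at the ends).
        x0, y0 = waypoints[max(i - 1, 0)]
        x1, y1 = waypoints[min(i + 1, n - 1)]
        tx, ty = x1 - x0, y1 - y0
        norm = math.hypot(tx, ty)
        if norm == 0.0:
            # Degenerate tangent (e.g. a single waypoint): leave the point unchanged.
            shifted.append((x, y))
            continue
        # Left-hand unit normal of the tangent.
        nx, ny = -ty / norm, tx / norm
        shifted.append((x + d * nx, y + d * ny))
    return shifted
```

For a straight path along the x-axis, an offset of 0.5 at the middle waypoint moves it from (1, 0) to (1, 0.5), nudging the path around an obstacle on the right while the endpoints stay on the reference.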

Multi-agent Path Finding for Mixed Autonomy Traffic Coordination

Authors:Han Zheng, Zhongxia Yan, Cathy Wu
Date:2024-09-05 19:37:01

In the evolving landscape of urban mobility, the prospective integration of Connected and Automated Vehicles (CAVs) with Human-Driven Vehicles (HDVs) presents a complex array of challenges and opportunities for autonomous driving systems. While recent advancements in robotics have yielded Multi-Agent Path Finding (MAPF) algorithms tailored for agent coordination tasks characterized by simplified kinematics and complete control over agent behaviors, these solutions are inapplicable in mixed-traffic environments where uncontrollable HDVs must coexist and interact with CAVs. Addressing this gap, we propose the Behavior Prediction Kinematic Priority Based Search (BK-PBS), which leverages an offline-trained conditional prediction model to forecast HDV responses to CAV maneuvers, integrating these insights into a Priority Based Search (PBS) where the A* search proceeds over motion primitives to accommodate kinematic constraints. We compare BK-PBS with CAV planning algorithms derived from rule-based car-following models and reinforcement learning. Through comprehensive simulation on a highway merging scenario across diverse scenarios of CAV penetration rate and traffic density, BK-PBS outperforms these baselines in reducing collision rates and improving system-level travel delay. Our work is directly applicable to many scenarios of multi-human multi-robot coordination.

Reinforcement Learning Approach to Optimizing Profilometric Sensor Trajectories for Surface Inspection

Authors:Sara Roos-Hoefgeest, Mario Roos-Hoefgeest, Ignacio Alvarez, Rafael C. González
Date:2024-09-05 11:20:12

High-precision surface defect detection in manufacturing is essential for ensuring quality control. Laser triangulation profilometric sensors are key to this process, providing detailed and accurate surface measurements over a line. To achieve a complete and precise surface scan, accurate relative motion between the sensor and the workpiece is required. It is crucial to control the sensor pose to maintain optimal distance and relative orientation to the surface. It is also important to ensure uniform profile distribution throughout the scanning process. This paper presents a novel Reinforcement Learning (RL) based approach to optimize robot inspection trajectories for profilometric sensors. Building upon the Boustrophedon scanning method, our technique dynamically adjusts the sensor position and tilt to maintain optimal orientation and distance from the surface, while also ensuring a consistent profile distance for uniform and high-quality scanning. Utilizing a simulated environment based on the CAD model of the part, we replicate real-world scanning conditions, including sensor noise and surface irregularities. This simulation-based approach enables offline trajectory planning based on CAD models. Key contributions include the modeling of the state space, action space, and reward function, specifically designed for inspection applications using profilometric sensors. We use the Proximal Policy Optimization (PPO) algorithm to efficiently train the RL agent, demonstrating its capability to optimize inspection trajectories with profilometric sensors. To validate our approach, we conducted several experiments where a model trained on a specific training piece was tested on various parts in simulation. We also conducted a real-world experiment by executing the optimized trajectory, generated offline from a CAD model, to inspect a part using a UR3e robotic arm model.

Tractable Offline Learning of Regular Decision Processes

Authors:Ahana Deb, Roberto Cipollone, Anders Jonsson, Alessandro Ronca, Mohammad Sadegh Talebi
Date:2024-09-04 14:26:58

This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards on past interactions can be captured by some hidden finite-state automaton. For this reason, many RDP algorithms first reconstruct this unknown dependency using automata learning techniques. In this paper, we show that it is possible to overcome two strong limitations of previous offline RL algorithms for RDPs, notably RegORL. This can be accomplished via the introduction of two original techniques: the development of a new pseudometric based on formal languages, which removes a problematic dependency on $L_\infty^\mathsf{p}$-distinguishability parameters, and the adoption of Count-Min-Sketch (CMS), instead of naive counting. The former reduces the number of samples required in environments that are characterized by a low complexity in language-theoretic terms. The latter alleviates the memory requirements for long planning horizons. We derive the PAC sample complexity bounds associated with each of these techniques, and we validate the approach experimentally.
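
For readers unfamiliar with Count-Min-Sketch, the following minimal sketch shows the generic data structure: a fixed-size table of counters that replaces an exact per-item dictionary, trading a bounded overestimate for constant memory. It is not the paper's algorithm; the class layout, the default `width` and `depth`, and the use of salted BLAKE2b hashes are illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter using `depth` rows of `width` counters each."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"), salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the row-wise minimum
        # never underestimates the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Memory use is `width * depth` counters regardless of how many distinct items (for example, long history strings) are inserted, which is the property that helps with long planning horizons.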

USV-AUV Collaboration Framework for Underwater Tasks under Extreme Sea Conditions

Authors:Jingzehua Xu, Guanwen Xie, Xinqi Wang, Yimian Ding, Shuai Zhang
Date:2024-09-04 04:44:21

Autonomous underwater vehicles (AUVs) are valuable for ocean exploration due to their flexibility and ability to carry communication and detection units. Nevertheless, AUVs alone often face challenges in harsh and extreme sea conditions. This study introduces an unmanned surface vehicle (USV)-AUV collaboration framework, which includes high-precision multi-AUV positioning using USV path planning via Fisher information matrix optimization and reinforcement learning for multi-AUV cooperative tasks. Applied to a multi-AUV underwater data collection task scenario, extensive simulations validate the framework's feasibility and superior performance, highlighting exceptional coordination and robustness under extreme sea conditions. To accelerate relevant research in this field, we have made the simulation code (demo version) available as open-source.

Reinforcement Learning for Wheeled Mobility on Vertically Challenging Terrain

Authors:Tong Xu, Chenhui Pan, Xuesu Xiao
Date:2024-09-04 02:19:21

Off-road navigation on vertically challenging terrain, involving steep slopes and rugged boulders, presents significant challenges for wheeled robots both at the planning level to achieve smooth collision-free trajectories and at the control level to avoid rolling over or getting stuck. Considering the complex model of wheel-terrain interactions, we develop an end-to-end Reinforcement Learning (RL) system for an autonomous vehicle to learn wheeled mobility through simulated trial-and-error experiences. Using a custom-designed simulator built on the Chrono multi-physics engine, our approach leverages Proximal Policy Optimization (PPO) and a terrain difficulty curriculum to refine a policy based on a reward function that encourages progress towards the goal and penalizes excessive roll and pitch angles, which circumvents the need for complex and expensive kinodynamic modeling, planning, and control. Additionally, we present experimental results in the simulator and deploy our approach on a physical Verti-4-Wheeler (V4W) platform, demonstrating that RL can equip conventional wheeled robots with the previously unrealized potential of navigating vertically challenging terrain.
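
A reward of the kind described above (progress towards the goal, penalty for excessive roll and pitch) might look like the following minimal sketch. The functional form, the 30-degree threshold, and the weights are assumptions chosen for illustration; the abstract does not give the authors' actual reward terms.

```python
import math


def mobility_reward(prev_dist, curr_dist, roll, pitch,
                    max_angle=math.radians(30.0),
                    progress_weight=1.0, tilt_weight=0.5):
    """Per-step reward: distance gained towards the goal minus a tilt penalty.

    All parameter values are hypothetical. Distances are in metres and
    angles in radians.
    """
    # Positive when the vehicle moved closer to the goal during this step.
    progress = prev_dist - curr_dist
    # Only the part of each angle beyond the threshold is penalized.
    excess_roll = max(0.0, abs(roll) - max_angle)
    excess_pitch = max(0.0, abs(pitch) - max_angle)
    return progress_weight * progress - tilt_weight * (excess_roll + excess_pitch)
```

Penalizing only the excess beyond a threshold lets the vehicle tilt as much as the terrain requires while still discouraging poses that risk a rollover.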

A Deployed Online Reinforcement Learning Algorithm In An Oral Health Clinical Trial

Authors:Anna L. Trella, Kelly W. Zhang, Hinal Jajal, Inbal Nahum-Shani, Vivek Shetty, Finale Doshi-Velez, Susan A. Murphy
Date:2024-09-03 17:16:01

Dental disease is a prevalent chronic condition associated with substantial financial burden, personal suffering, and increased risk of systemic diseases. Despite widespread recommendations for twice-daily tooth brushing, adherence to recommended oral self-care behaviors remains sub-optimal due to factors such as forgetfulness and disengagement. To address this, we developed Oralytics, an mHealth intervention system designed to complement clinician-delivered preventative care for marginalized individuals at risk for dental disease. Oralytics incorporates an online reinforcement learning algorithm to determine optimal times to deliver intervention prompts that encourage oral self-care behaviors. We have deployed Oralytics in a registered clinical trial. The deployment required careful design to manage challenges specific to the clinical trials setting in the U.S. In this paper, we (1) highlight key design decisions of the RL algorithm that address these challenges and (2) conduct a re-sampling analysis to evaluate algorithm design decisions. A second phase (randomized control trial) of Oralytics is planned to start in spring 2025.

Grounding Language Models in Autonomous Loco-manipulation Tasks

Authors:Jin Wang, Nikos Tsagarakis
Date:2024-09-02 15:27:48

Humanoid robots with behavioral autonomy have consistently been regarded as ideal collaborators in our daily lives and promising representations of embodied intelligence. Compared to fixed-base robotic arms, humanoid robots offer a larger operational space while significantly increasing the difficulty of control and planning. Despite the rapid progress towards general-purpose humanoid robots, most studies remain focused on locomotion ability, with few investigations into whole-body coordination and task planning, thus limiting the potential to demonstrate long-horizon tasks involving both mobility and manipulation under open-ended verbal instructions. In this work, we propose a novel framework that learns, selects, and plans behaviors based on tasks in different scenarios. We combine reinforcement learning (RL) with whole-body optimization to generate robot motions and store them in a motion library. We further leverage the planning and reasoning features of the large language model (LLM), constructing a hierarchical task graph that comprises a series of motion primitives to bridge lower-level execution with higher-level planning. Experiments in simulation and the real world using the CENTAURO robot show that the language-model-based planner can efficiently adapt to new loco-manipulation tasks, demonstrating high autonomy from free-text commands in unstructured scenes.

Multiagent Reinforcement Learning Enhanced Decision-making of Crew Agents During Floor Construction Process

Authors:Bin Yang, Boda Liu, Yilong Han, Xin Meng, Yifan Wang, Hansi Yang, Jianzhuang Xia
Date:2024-09-02 08:35:59

Fine-grained simulation of floor construction processes is essential for supporting lean management and the integration of information technology. However, existing research does not adequately address the on-site decision-making of constructors in selecting tasks and determining their sequence within the entire construction process. Moreover, decision-making frameworks from computer science and robotics are not directly applicable to construction scenarios. To facilitate intelligent simulation in construction, this study introduces the Construction Markov Decision Process (CMDP). The primary contribution of this CMDP framework lies in its construction knowledge in decision, observation modifications and policy design, enabling agents to perceive the construction state and follow policy guidance to evaluate and reach a range of targets for optimizing the planning of construction activities. The CMDP is developed on the Unity platform, utilizing a two-stage training approach with the multi-agent proximal policy optimization algorithm. A case study demonstrates the effectiveness of this framework: the low-level policy successfully simulates the construction process in continuous space, facilitating policy testing and training focused on reducing conflicts and blockages among crews; and the high-level policy improves the spatio-temporal planning of construction activities, generating construction patterns in distinct phases and leading to the discovery of new construction insights.

Solving Integrated Process Planning and Scheduling Problem via Graph Neural Network Based Deep Reinforcement Learning

Authors:Hongpei Li, Han Zhang, Ziyan He, Yunkai Jia, Bo Jiang, Xiang Huang, Dongdong Ge
Date:2024-09-02 06:18:30

The Integrated Process Planning and Scheduling (IPPS) problem combines process route planning and shop scheduling to achieve high efficiency in manufacturing and maximize resource utilization, which is crucial for modern manufacturing systems. Traditional methods using Mixed Integer Linear Programming (MILP) and heuristic algorithms cannot balance solution quality and speed well when solving IPPS. In this paper, we propose a novel end-to-end Deep Reinforcement Learning (DRL) method. We model the IPPS problem as a Markov Decision Process (MDP) and employ a Heterogeneous Graph Neural Network (GNN) to capture the complex relationships among operations, machines, and jobs. To optimize the scheduling strategy, we use Proximal Policy Optimization (PPO). Experimental results show that, compared to traditional methods, our approach significantly improves solution efficiency and quality in large-scale IPPS instances, providing superior scheduling strategies for modern intelligent manufacturing systems.

Cooperative Path Planning with Asynchronous Multiagent Reinforcement Learning

Authors:Jiaming Yin, Weixiong Rao, Yu Xiao, Keshuang Tang
Date:2024-09-01 15:48:14

In this paper, we study the shortest path problem (SPP) with multiple source-destination pairs (MSD), namely MSD-SPP, to minimize the average travel time of all shortest paths. The inherent traffic capacity limits within a road network contribute to the competition among vehicles. Multi-agent reinforcement learning (MARL) models cannot offer effective and efficient path planning cooperation due to the asynchronous decision-making setting in MSD-SPP, where vehicles (a.k.a. agents) cannot simultaneously complete routing actions in the previous time step. To tackle the efficiency issue, we propose to divide an entire road network into multiple sub-graphs and subsequently execute a two-stage process of inter-region and intra-region route planning. To address the asynchronous issue, in the proposed asyn-MARL framework, we first design a global state, which exploits a low-dimensional vector to implicitly represent the joint observations and actions of multiple agents. Then we develop a novel trajectory collection mechanism to decrease the redundancy in training trajectories. Additionally, we design a novel actor network to facilitate the cooperation among vehicles heading towards the same or nearby destinations and a reachability graph aimed at preventing infinite loops in routing paths. On both synthetic and real road networks, our evaluation results demonstrate that our approach outperforms state-of-the-art planning approaches.

AgGym: An agricultural biotic stress simulation environment for ultra-precision management planning

Authors:Mahsa Khosravi, Matthew Carroll, Kai Liang Tan, Liza Van der Laan, Joscif Raigne, Daren S. Mueller, Arti Singh, Aditya Balu, Baskar Ganapathysubramanian, Asheesh Kumar Singh, Soumik Sarkar
Date:2024-09-01 14:55:45

Agricultural production requires careful management of inputs such as fungicides, insecticides, and herbicides to ensure a successful crop that is high-yielding, profitable, and of superior seed quality. Current state-of-the-art field crop management relies on coarse-scale crop management strategies, where entire fields are sprayed with pest and disease-controlling chemicals, leading to increased cost and sub-optimal soil and crop management. To overcome these challenges and optimize crop production, we utilize machine learning tools within a virtual field environment to generate localized management plans for farmers to manage biotic threats while maximizing profits. Specifically, we present AgGym, a modular, crop- and stress-agnostic simulation framework to model the spread of biotic stresses in a field and estimate yield losses with and without chemical treatments. Our validation with real data shows that AgGym can be customized with limited data to simulate yield outcomes under various biotic stress conditions. We further demonstrate that deep reinforcement learning (RL) policies can be trained using AgGym for designing ultra-precise biotic stress mitigation strategies with the potential to increase yield recovery with fewer chemicals and lower cost. Our proposed framework enables personalized decision support that can transform biotic stress management from being schedule-based and reactive to opportunistic and prescriptive. We also release the AgGym software implementation as a community resource and invite experts to contribute to this open-source, publicly available modular environment framework. The source code can be accessed at: https://github.com/SCSLabISU/AgGym.

MAPF-GPT: Imitation Learning for Multi-Agent Pathfinding at Scale

Authors:Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik
Date:2024-08-29 12:55:10

Multi-agent pathfinding (MAPF) is a problem that generally requires finding collision-free paths for multiple agents in a shared environment. Solving MAPF optimally, even under restrictive assumptions, is NP-hard, yet efficient solutions for this problem are critical for numerous applications, such as automated warehouses and transportation systems. Recently, learning-based approaches to MAPF have gained attention, particularly those leveraging deep reinforcement learning. Typically, such learning-based MAPF solvers are augmented with additional components like single-agent planning or communication. Orthogonally, in this work we rely solely on imitation learning that leverages a large dataset of expert MAPF solutions and a transformer-based neural network to create a foundation model for MAPF called MAPF-GPT. The latter is capable of generating actions without additional heuristics or communication. MAPF-GPT demonstrates zero-shot learning abilities when solving MAPF problems that are not present in the training dataset. We show that MAPF-GPT notably outperforms the current best-performing learnable MAPF solvers on a diverse range of problem instances and is computationally efficient during inference.

DECAF: a Discrete-Event based Collaborative Human-Robot Framework for Furniture Assembly

Authors:Giulio Giacomuzzo, Matteo Terreran, Siddarth Jain, Diego Romeres
Date:2024-08-28 20:26:32

This paper proposes a task planning framework for collaborative Human-Robot scenarios, specifically focused on assembling complex systems such as furniture. The human is characterized as an uncontrollable agent, implying for example that the agent is not bound by a pre-established sequence of actions and instead acts according to its own preferences. Meanwhile, the task planner reactively computes the optimal actions for the collaborative robot to complete the entire assembly task in the least time possible. We formalize the problem as a Discrete Event Markov Decision Problem (DE-MDP), a comprehensive framework that incorporates a variety of asynchronous behaviors, human changes of mind, and failure recovery as stochastic events. Although the problem could theoretically be addressed by constructing a graph of all possible actions, such an approach would be constrained by computational limitations. The proposed formulation offers an alternative solution utilizing Reinforcement Learning to derive an optimal policy for the robot. Experiments were conducted both in simulation and on a real system with human subjects assembling a chair in collaboration with a 7-DoF manipulator.

Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games

Authors:Nicholas R. Waytowich, Devin White, MD Sunbeam, Vinicius G. Goecks
Date:2024-08-28 17:08:56

Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. In this paper, we introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Unlike traditional reinforcement learning (RL) methods that require training for each new environment and reward function specification, these LLMs utilize pre-existing multimodal knowledge to directly engage with game environments. Our study assesses the performance of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their ability to understand and interact with complex visual scenes and formulate strategic responses. Our results show that these multimodal LLMs are not yet capable of being zero-shot low-level policies. Furthermore, we see that this is, in part, due to limitations in their visual and spatial reasoning. Additional results and videos are available on our project webpage: https://dev1nw.github.io/atari-gpt/.

Earth Observation Satellite Scheduling with Graph Neural Networks

Authors:Antoine Jacquet, Guillaume Infantes, Nicolas Meuleau, Emmanuel Benazera, Stéphanie Roussel, Vincent Baudoui, Jonathan Guerra
Date:2024-08-27 13:10:26

Earth Observation Satellite Planning (EOSP) is a difficult optimization problem of considerable practical interest. A set of requested observations must be scheduled on an agile Earth observation satellite while respecting constraints on their visibility window, as well as maneuver constraints that impose varying delays between successive observations. In addition, the problem is largely oversubscribed: there are many more candidate observations than can possibly be achieved. Therefore, one must select the set of observations that will be performed while maximizing their weighted cumulative benefit, and propose a feasible schedule for these observations. As previous work mostly focused on heuristic and iterative search algorithms, this paper presents a new technique for selecting and scheduling observations based on Graph Neural Networks (GNNs) and Deep Reinforcement Learning (DRL). GNNs are used to extract relevant information from the graphs representing instances of the EOSP, and DRL drives the search for optimal schedules. Our simulations show that it is able to learn on small problem instances and generalize to larger real-world instances, with very competitive performance compared to traditional approaches.

DynamicRouteGPT: A Real-Time Multi-Vehicle Dynamic Navigation Framework Based on Large Language Models

Authors:Ziai Zhou, Bin Zhou, Hao Liu
Date:2024-08-26 11:19:58

Real-time dynamic path planning in complex traffic environments presents challenges, such as varying traffic volumes and signal wait times. Traditional static routing algorithms like Dijkstra and A* compute shortest paths but often fail under dynamic conditions. Recent Reinforcement Learning (RL) approaches offer improvements but tend to focus on local optima, risking dead-ends or boundary issues. This paper proposes a novel approach based on causal inference for real-time dynamic path planning, balancing global and local optimality. We first use the static Dijkstra algorithm to compute a globally optimal baseline path. A distributed control strategy then guides vehicles along this path. At intersections, DynamicRouteGPT performs real-time decision-making for local path selection, considering real-time traffic, driving preferences, and unexpected events. DynamicRouteGPT integrates Markov chains, Bayesian inference, and large-scale pretrained language models like Llama3 8B to provide an efficient path planning solution. It dynamically adjusts to traffic scenarios and driver preferences and requires no pre-training, offering broad applicability across road networks. A key innovation is the construction of causal graphs for counterfactual reasoning, optimizing path decisions. Experimental results show that our method achieves state-of-the-art performance in real-time dynamic path planning for multiple vehicles while providing explainable path selections, offering a novel and efficient solution for complex traffic environments.
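
The abstract's first step, computing a globally optimal baseline path with static Dijkstra, can be illustrated with a minimal sketch. The graph layout, the function name `dijkstra_path`, and the toy network `ROADS` are assumptions made for illustration; they are not taken from the paper, and the later LLM-based intersection decisions are not modeled here.

```python
import heapq

def dijkstra_path(graph, source, target):
    """Return (cost, path) for the cheapest route, or (inf, []) if unreachable.

    graph maps each node to a list of (neighbor, non-negative edge cost) pairs.
    """
    best = {source: 0.0}          # cheapest known cost to each node
    parent = {}                   # back-pointers used to rebuild the path
    frontier = [(0.0, source)]    # min-heap ordered by cost
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue              # stale heap entry, a cheaper one was handled
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(frontier, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[target], path[::-1]

# Hypothetical toy road network; edge costs stand in for travel times.
ROADS = {
    "A": [("B", 4.0), ("C", 1.0)],
    "C": [("B", 2.0), ("D", 7.0)],
    "B": [("D", 3.0)],
    "D": [],
}
```

Calling `dijkstra_path(ROADS, "A", "D")` on this toy network gives the baseline route that a local decision-maker could then deviate from when traffic conditions change.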

Bridging the gap between Learning-to-plan, Motion Primitives and Safe Reinforcement Learning

Authors:Piotr Kicki, Davide Tateo, Puze Liu, Jonas Guenster, Jan Peters, Krzysztof Walas
Date:2024-08-26 07:44:53

Trajectory planning under kinodynamic constraints is fundamental for advanced robotics applications that require dexterous, reactive, and rapid skills in complex environments. These constraints, which may represent task, safety, or actuator limitations, are essential for ensuring the proper functioning of robotic platforms and preventing unexpected behaviors. Recent advances in kinodynamic planning demonstrate that learning-to-plan techniques can generate complex and reactive motions under intricate constraints. However, these techniques necessitate the analytical modeling of both the robot and the entire task, a limiting assumption when systems are extremely complex or when constructing accurate task models is prohibitive. This paper addresses this limitation by combining learning-to-plan methods with reinforcement learning, resulting in a novel integration of black-box learning of motion primitives and optimization. We evaluate our approach against state-of-the-art safe reinforcement learning methods, showing that our technique, particularly when exploiting task structure, outperforms baseline methods in challenging scenarios such as planning to hit in robot air hockey. This work demonstrates the potential of our integrated approach to enhance the performance and safety of robots operating under complex kinodynamic constraints.

Multi-Agent Target Assignment and Path Finding for Intelligent Warehouse: A Cooperative Multi-Agent Deep Reinforcement Learning Perspective

Authors:Qi Liu, Jianqi Gao, Dongjie Zhu, Zhongjian Qiao, Pengbin Chen, Jingxiang Guo, Yanjie Li
Date:2024-08-25 07:32:58

Multi-agent target assignment and path planning (TAPF) are two key problems in intelligent warehouses. However, most literature addresses only one of these two problems at a time. In this study, we propose a method to simultaneously solve target assignment and path planning from the perspective of cooperative multi-agent deep reinforcement learning (RL). To the best of our knowledge, this is the first work to model the TAPF problem for intelligent warehouses as a cooperative multi-agent deep RL problem, and the first to simultaneously address TAPF based on multi-agent deep RL. Furthermore, previous literature rarely considers the physical dynamics of agents. In this study, the physical dynamics of the agents are considered. Experimental results show that our method performs well in various task settings, which means that the target assignment is solved reasonably well and the planned path is nearly the shortest. Moreover, our method is more time-efficient than baselines.

Optimally Solving Simultaneous-Move Dec-POMDPs: The Sequential Central Planning Approach

Authors:Johan Peralez, Aurèlien Delage, Jacopo Castellini, Rafael F. Cunha, Jilles S. Dibangoye
Date:2024-08-23 15:01:37

The centralized training for decentralized execution paradigm emerged as the state-of-the-art approach to $\epsilon$-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely the sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of Bellman's principle of optimality, raising three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that $\epsilon$-optimal value functions are piecewise linear and convex in such sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods, e.g., the SARSA algorithm enhanced with these findings, while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against $\epsilon$-optimal simultaneous-move solvers confirm the superiority of our novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.
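
For readers unfamiliar with the single-agent method named above, the following is a minimal sketch of the standard tabular SARSA update. It is plain textbook SARSA, not the sequential-move variant developed in the paper; the function name, the dictionary-based Q-table, and the default step size and discount are assumptions for illustration.

```python
from collections import defaultdict

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one on-policy TD update to q[(s, a)] and return the new value.

    q is a mapping from (state, action) pairs to value estimates.
    a_next is the action the current policy actually takes in s_next.
    """
    # On terminal transitions there is no bootstrap term.
    target = r if done else r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]

# Unseen (state, action) pairs default to a value of 0.0.
q_table = defaultdict(float)
```

The update bootstraps from the action actually chosen in the next state, which is what makes SARSA on-policy.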

Intelligent OPC Engineer Assistant for Semiconductor Manufacturing

Authors:Guojin Chen, Haoyu Yang, Bei Yu, Haoxing Ren
Date:2024-08-23 00:49:36

Advancements in chip design and manufacturing have enabled the processing of complex tasks such as deep learning and natural language processing, paving the way for the development of artificial general intelligence (AGI). AI, on the other hand, can be leveraged to innovate and streamline semiconductor technology from planning and implementation to manufacturing. In this paper, we present Intelligent OPC Engineer Assistant, an AI/LLM-powered methodology designed to solve the core manufacturing-aware optimization problem known as optical proximity correction (OPC). The methodology involves a reinforcement learning-based OPC recipe search and a customized multi-modal agent system for recipe summarization. Experiments demonstrate that our methodology can efficiently build OPC recipes on various chip designs with specially handled design topologies, a task that typically requires the full-time effort of OPC engineers with years of experience.

Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction

Authors:Anthony GX-Chen, Kenneth Marino, Rob Fergus
Date:2024-08-21 17:59:31

In the face of difficult exploration problems in reinforcement learning, we study whether giving an agent an object-centric mapping (describing a set of items and their attributes) allows for more efficient learning. We found this problem is best solved hierarchically by modelling items at a higher level of state abstraction than pixels, and attribute changes at a higher level of temporal abstraction than primitive actions. This abstraction simplifies the transition dynamics by making specific future states easier to predict. We make use of this to propose a fully model-based algorithm that learns a discriminative world model, plans to explore efficiently with only a count-based intrinsic reward, and can subsequently plan to reach any discovered (abstract) states. We demonstrate the model's ability to (i) efficiently solve single tasks, (ii) transfer zero-shot and few-shot across item types and environments, and (iii) plan across long horizons. Across a suite of 2D crafting and MiniHack environments, we empirically show our model significantly outperforms state-of-the-art low-level methods (without abstraction), as well as performant model-free and model-based methods using the same abstraction. Finally, we show how to learn low-level object-perturbing policies via reinforcement learning, and the object mapping itself by supervised learning.
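
A count-based intrinsic reward over abstract states can be sketched as follows. The abstract does not give the exact bonus, so the inverse-square-root form used here is an assumption (a common choice in the exploration literature), and the class name `CountBonus` and the tuple encoding of abstract states are hypothetical.

```python
import math
from collections import Counter

class CountBonus:
    """Intrinsic reward that shrinks as an abstract state is visited more often."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = Counter()   # visit count per abstract state

    def reward(self, abstract_state):
        # abstract_state must be hashable, e.g. a tuple of (item, attribute) pairs.
        self.counts[abstract_state] += 1
        # Assumed bonus form: scale / sqrt(visit count).
        return self.scale / math.sqrt(self.counts[abstract_state])
```

Because the bonus depends only on visit counts, rarely seen abstract states keep a high reward and draw the planner toward them.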

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Authors:Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu
Date:2024-08-20 08:22:04

In this paper, we propose a new method, STRATEGIST, that utilizes LLMs to acquire new skills for playing multi-agent games through a self-improvement process. Our method gathers quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection, which can then be used to learn high-level strategic skills, such as how to evaluate states, that guide the low-level execution. We showcase how our method can be used in both action planning and dialogue generation in the context of games, achieving good performance on both tasks. Specifically, we demonstrate that our method can help train agents with better performance than both traditional reinforcement learning-based approaches and other LLM-based skill learning approaches in games including the Game of Pure Strategy (GOPS) and The Resistance: Avalon. STRATEGIST helps bridge the gap between foundation models and symbolic decision-making methods through its bi-level approach, leading to more robust decision-making.

Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMamba

Authors:Wall Kim
Date:2024-08-20 03:35:28

Sequence modeling with State Space models (SSMs) has demonstrated performance surpassing that of Transformers in various tasks, raising expectations for their potential to outperform the Decision Transformer and its enhanced variants in offline reinforcement learning (RL). However, decision models based on Mamba, a state-of-the-art SSM, failed to achieve superior performance compared to these enhanced Decision Transformers. We hypothesize that this limitation arises from information loss during the selective scanning phase. To address this, we propose the Decision MetaMamba (DMM), which augments Mamba with a token mixer in its input layer. This mixer explicitly accounts for the multimodal nature of offline RL inputs, comprising state, action, and return-to-go. The DMM demonstrates improved performance while significantly reducing parameter count compared to prior models. Notably, similar performance gains were achieved using a simple linear token mixer, emphasizing the importance of preserving information from proximate time steps rather than the specific design of the token mixer itself. This novel modification to Mamba's input layer represents a departure from conventional timestamp-based encoding approaches used in Transformers. By enhancing performance of Mamba in offline RL, characterized by memory efficiency and fast inference, this work opens new avenues for its broader application in future RL research.
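
The three input modalities mentioned above (state, action, and return-to-go) can be made concrete with a small sketch. It computes undiscounted returns-to-go and interleaves the tokens in the (return-to-go, state, action) order used by the Decision Transformer; the function names and the tagged-tuple token format are assumptions for illustration, and the DMM token mixer itself is not shown.

```python
def returns_to_go(rewards):
    """Return the undiscounted sum of future rewards at every time step."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

def interleave_tokens(states, actions, rewards):
    """Build the per-step token sequence (rtg, state, action) for a trajectory."""
    tokens = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens
```

Each time step therefore contributes three adjacent tokens of different modality, which is the structure an input-layer mixer would have to account for.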

Physics-Aware Combinatorial Assembly Sequence Planning using Data-free Action Masking

Authors:Ruixuan Liu, Alan Chen, Weiye Zhao, Changliu Liu
Date:2024-08-19 17:16:35

Combinatorial assembly uses standardized unit primitives to build objects that satisfy user specifications. This paper studies assembly sequence planning (ASP) for physical combinatorial assembly. Given the shape of the desired object, the goal is to find a sequence of actions for placing unit primitives to build the target object. In particular, we aim to ensure the planned assembly sequence is physically executable. However, ASP for combinatorial assembly is particularly challenging due to its combinatorial nature. To address the challenge, we employ deep reinforcement learning to learn a construction policy for placing unit primitives sequentially to build the desired object. Specifically, we design an online physics-aware action mask that filters out invalid actions, which effectively guides policy learning and ensures violation-free deployment. Finally, we apply the proposed method to Lego assembly with more than 250 3D structures. The experimental results demonstrate that the proposed method plans physically valid assembly sequences to build all structures, achieving a 100% success rate, whereas the best comparable baseline fails on more than 40 structures. Our implementation is available at https://github.com/intelligent-control-lab/PhysicsAwareCombinatorialASP.
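
The general idea of an action mask can be sketched as a softmax in which invalid actions receive zero probability. In the paper the mask comes from a physics-aware check; here the `valid` flags are simply passed in, and the function name `masked_softmax` is a hypothetical choice, so this shows only the masking step under those assumptions.

```python
import math

def masked_softmax(logits, valid):
    """Softmax over logits with invalid actions forced to probability zero.

    logits: list of floats, one per action.
    valid:  list of booleans of the same length; False marks a filtered action.
    """
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Invalid actions get a logit of -inf, so exp() maps them to exactly 0.0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    peak = max(masked)                       # subtract the max for stability
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Sampling from the returned distribution can never select a masked action, which is how a mask of this kind rules out invalid placements both during training and at deployment.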

Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

Authors:Zhiyong Wang, Dongruo Zhou, John C. S. Lui, Wen Sun
Date:2024-08-16 19:52:53

Learning a transition model via Maximum Likelihood Estimation (MLE) followed by planning inside the learned model is perhaps the most standard and simplest Model-based Reinforcement Learning (RL) framework. In this work, we show that such a simple Model-based RL scheme, when equipped with optimistic and pessimistic planning procedures, achieves strong regret and sample complexity bounds in online and offline RL settings. In particular, we demonstrate that under the conditions where the trajectory-wise reward is normalized between zero and one and the transition is time-homogeneous, it achieves nearly horizon-free and second-order bounds. Nearly horizon-free means that our bounds have no polynomial dependence on the horizon of the Markov Decision Process. A second-order bound is a type of instance-dependent bound that scales with the variances of the returns of the policies, which can be small when the system is nearly deterministic and (or) the optimal policy has small values. We highlight that our algorithms are simple, fairly standard, and indeed have been extensively studied in the RL literature: they learn a model via MLE, build a version space around the MLE solution, and perform optimistic or pessimistic planning depending on whether operating in the online or offline mode. These algorithms do not rely on additional specialized algorithmic designs such as learning variances and performing variance-weighted learning and thus can easily leverage non-linear function approximations. The simplicity of the algorithms also implies that our horizon-free and second-order regret analysis is actually standard and mainly follows the general framework of optimism/pessimism in the face of uncertainty.
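
In the tabular special case, the MLE of a transition model reduces to empirical next-state frequencies, which the sketch below computes. This is only an illustration of the model-learning step under a tabular assumption; the paper covers general function classes, and the version-space construction and optimistic or pessimistic planning are not shown. The function name and the data layout are hypothetical.

```python
from collections import Counter, defaultdict

def mle_transition_model(transitions):
    """Estimate P(s' | s, a) from (s, a, s') samples by normalized counts."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    model = {}
    for key, nexts in counts.items():
        total = sum(nexts.values())
        # The likelihood-maximizing categorical distribution is the
        # empirical frequency of each observed next state.
        model[key] = {s_next: n / total for s_next, n in nexts.items()}
    return model
```

State-action pairs that never appear in the data are absent from the returned model, which is one place where optimism or pessimism would have to fill the gap.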

From Decision to Action in Surgical Autonomy: Multi-Modal Large Language Models for Robot-Assisted Blood Suction

Authors:Sadra Zargarzadeh, Maryam Mirzaei, Yafei Ou, Mahdi Tavakoli
Date:2024-08-14 20:30:34

The rise of Large Language Models (LLMs) has impacted research in robotics and automation. While progress has been made in integrating LLMs into general robotics tasks, a noticeable void persists in their adoption in more specific domains such as surgery, where critical factors such as reasoning, explainability, and safety are paramount. Achieving autonomy in robotic surgery, which entails the ability to reason and adapt to changes in the environment, remains a significant challenge. In this work, we propose a multi-modal LLM integration in robot-assisted surgery for autonomous blood suction. The reasoning and prioritization are delegated to the higher-level task-planning LLM, and the motion planning and execution are handled by the lower-level deep reinforcement learning model, creating a distributed agency between the two components. As surgical operations are highly dynamic and may encounter unforeseen circumstances, blood clots and active bleeding were introduced to influence decision-making. Results showed that using a multi-modal LLM as a higher-level reasoning unit can account for these surgical complexities to achieve a level of reasoning previously unattainable in robot-assisted surgeries. These findings demonstrate the potential of multi-modal LLMs to significantly enhance contextual understanding and decision-making in robotic-assisted surgeries, marking a step toward autonomous surgical systems.

SigmaRL: A Sample-Efficient and Generalizable Multi-Agent Reinforcement Learning Framework for Motion Planning

Authors:Jianye Xu, Pan Hu, Bassam Alrifaee
Date:2024-08-14 16:16:51

This paper introduces an open-source, decentralized framework named SigmaRL, designed to enhance both sample efficiency and generalization of multi-agent Reinforcement Learning (RL) for motion planning of connected and automated vehicles. Most RL agents exhibit a limited capacity to generalize, often focusing narrowly on specific scenarios, and are usually evaluated in similar or even the same scenarios seen during training. Various methods have been proposed to address these challenges, including experience replay and regularization. However, how observation design in RL affects sample efficiency and generalization remains an under-explored area. We address this gap by proposing five strategies to design information-dense observations, focusing on general features that are applicable to most traffic scenarios. We train our RL agents using these strategies on an intersection and evaluate their generalization through numerical experiments across completely unseen traffic scenarios, including a new intersection, an on-ramp, and a roundabout. Incorporating these information-dense observations reduces training times to under one hour on a single CPU, and the evaluation results reveal that our RL agents can effectively zero-shot generalize. Code: github.com/cas-lab-munich/SigmaRL

Retrieval-Augmented Hierarchical in-Context Reinforcement Learning and Hindsight Modular Reflections for Task Planning with LLMs

Authors:Chuanneng Sun, Songjun Huang, Dario Pompili
Date:2024-08-12 22:40:01

Large Language Models (LLMs) have demonstrated remarkable abilities in various language tasks, making them promising candidates for decision-making in robotics. Inspired by Hierarchical Reinforcement Learning (HRL), we propose Retrieval-Augmented in-context reinforcement Learning (RAHL), a novel framework in which an LLM-based high-level policy decomposes a complex task into sub-tasks on the fly. The sub-tasks, defined by goals, are assigned to the low-level policy to complete. To improve the agent's performance in multi-episode execution, we propose Hindsight Modular Reflection (HMR), where, instead of reflecting on the full trajectory, we let the agent reflect on shorter sub-trajectories to improve reflection efficiency. We evaluated the decision-making ability of the proposed RAHL in three benchmark environments--ALFWorld, Webshop, and HotpotQA. The results show that RAHL achieves performance improvements of 9%, 42%, and 10%, respectively, over strong baselines within 5 episodes of execution. Furthermore, we also implemented RAHL on the Boston Dynamics SPOT robot. The experiment shows that the robot can scan the environment, find entrances, and navigate to new rooms controlled by the LLM policy.

Building Decision Making Models Through Language Model Regime

Authors:Yu Zhang, Haoxiang Liu, Feijun Jiang, Weihua Luo, Kaifu Zhang
Date:2024-08-12 12:04:14

We propose a novel approach for decision making problems leveraging the generalization capabilities of large language models (LLMs). Traditional methods such as expert systems, planning algorithms, and reinforcement learning often exhibit limited generalization, typically requiring the training of new models for each unique task. In contrast, LLMs demonstrate remarkable success in generalizing across varied language tasks, inspiring a new strategy for training decision making models. Our approach, referred to as "Learning then Using" (LTU), entails a two-stage process. Initially, the learning phase develops a robust foundational decision making model by integrating diverse knowledge from various domains and decision making contexts. The subsequent using phase refines this foundation model for specific decision making scenarios. Distinct from other studies that employ LLMs for decision making through supervised learning, our LTU method embraces a versatile training methodology that combines broad pre-training with targeted fine-tuning. Experiments in e-commerce domains such as advertising and search optimization have shown that the LTU approach outperforms traditional supervised learning regimes in decision making capabilities and generalization. The LTU approach is the first practical training architecture for both single-step and multi-step decision making tasks combined with LLMs, and it can be applied beyond game and robot domains. It provides a robust and adaptable framework for decision making and enhances the effectiveness and flexibility of various systems in tackling various challenges.

Mitigating Metropolitan Carbon Emissions with Dynamic Eco-driving at Scale

Authors:Vindula Jayawardana, Baptiste Freydt, Ao Qu, Cameron Hickert, Edgar Sanchez, Catherine Tang, Mark Taylor, Blaine Leonard, Cathy Wu
Date:2024-08-10 18:23:59

The sheer scale and diversity of transportation make it a formidable sector to decarbonize. Here, we consider an emerging opportunity to reduce carbon emissions: the growing adoption of semi-autonomous vehicles, which can be programmed to mitigate stop-and-go traffic through intelligent speed commands and, thus, reduce emissions. But would such dynamic eco-driving move the needle on climate change? A comprehensive impact analysis has been out of reach due to the vast array of traffic scenarios and the complexity of vehicle emissions. We address this challenge with large-scale scenario modeling efforts and by using multi-task deep reinforcement learning with a carefully designed network decomposition strategy. We perform an in-depth prospective impact assessment of dynamic eco-driving at 6,011 signalized intersections across three major US metropolitan cities, simulating a million traffic scenarios. Overall, we find that vehicle trajectories optimized for emissions can cut city-wide intersection carbon emissions by 11-22% without harming throughput or safety; under reasonable assumptions, this is equivalent to the national emissions of Israel and Nigeria, respectively. We find that 10% eco-driving adoption yields 25%-50% of the total reduction, and nearly 70% of the benefits come from 20% of intersections, suggesting near-term implementation pathways. However, the composition of this high-impact subset of intersections varies considerably across different adoption levels, with minimal overlap, calling for careful strategic planning for eco-driving deployments. Moreover, the impact of eco-driving, when considered jointly with projections of vehicle electrification and hybrid vehicle adoption, remains significant. More broadly, this work paves the way for large-scale analysis of traffic externalities, such as time, safety, and air quality, and the potential impact of solution strategies.

Trajectory Planning for Teleoperated Space Manipulators Using Deep Reinforcement Learning

Authors:Bo Xia, Xianru Tian, Bo Yuan, Zhiheng Li, Bin Liang, Xueqian Wang
Date:2024-08-10 07:08:09

Trajectory planning for teleoperated space manipulators involves challenges such as accurately modeling system dynamics, particularly in free-floating modes with non-holonomic constraints, and managing time delays that increase model uncertainty and affect control precision. Traditional teleoperation methods rely on precise dynamic models requiring complex parameter identification and calibration, while data-driven methods do not require prior knowledge but struggle with time delays. A novel framework utilizing deep reinforcement learning (DRL) is introduced to address these challenges. The framework incorporates three methods: Mapping, Prediction, and State Augmentation, to handle delays when delayed state information is received at the master end. The Soft Actor Critic (SAC) algorithm processes the state information to compute the next action, which is then sent to the remote manipulator for environmental interaction. Four environments are constructed using the MuJoCo simulation platform to account for variations in base and target fixation: fixed base and target, fixed base with rotated target, free-floating base with fixed target, and free-floating base with rotated target. Extensive experiments with both constant and random delays are conducted to evaluate the proposed methods. Results demonstrate that all three methods effectively address trajectory planning challenges, with State Augmentation showing superior efficiency and robustness.
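
One common way to realize state augmentation under a known, constant delay is to append the actions issued since the delayed observation was taken, so that the policy input again carries the information needed to predict the current state. The abstract does not spell out the paper's construction, so the class below is an assumed sketch with hypothetical names, and it does not cover the random-delay case.

```python
from collections import deque

class DelayAugmenter:
    """Concatenate a delayed state with the last `delay_steps` issued actions."""

    def __init__(self, delay_steps, action_dim):
        self.action_dim = action_dim
        # Start with zero actions so the augmented state has a fixed length.
        self.pending = deque(
            [[0.0] * action_dim for _ in range(delay_steps)],
            maxlen=delay_steps,
        )

    def record(self, action):
        # Store the newest action; the oldest one is dropped automatically.
        self.pending.append(list(action))

    def augment(self, delayed_state):
        flat = [value for action in self.pending for value in action]
        return list(delayed_state) + flat
```

The augmented vector has fixed length `len(state) + delay_steps * action_dim` and could be fed to an off-policy learner such as SAC in place of the raw delayed state.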

Comp-LTL: Temporal Logic Planning via Zero-Shot Policy Composition

Authors:Taylor Bergeron, Zachary Serlin, Kevin Leahy
Date:2024-08-08 04:49:24

This work develops a zero-shot mechanism, Comp-LTL, for an agent to satisfy a Linear Temporal Logic (LTL) specification given existing task primitives trained via reinforcement learning (RL). Autonomous robots often need to satisfy spatial and temporal goals that are unknown until run time. Prior work focuses on learning policies for executing a task specified using LTL, but they incorporate the specification into the learning process. Any change to the specification requires retraining the policy, either via fine-tuning or from scratch. We present a more flexible approach -- to learn a set of composable task primitive policies that can be used to satisfy arbitrary LTL specifications without retraining or fine-tuning. Task primitives can be learned offline using RL and combined using Boolean composition at deployment. This work focuses on creating and pruning a transition system (TS) representation of the environment in order to solve for deterministic, non-ambiguous, and feasible solutions to LTL specifications given an environment and a set of task primitive policies. We show that our pruned TS is deterministic, contains no unrealizable transitions, and is sound. We verify our approach via simulation and compare it to other state of the art approaches, showing that Comp-LTL is safer and more adaptable.
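
The abstract does not describe how the Boolean composition is computed. One way to realize it, assumed here rather than taken from the paper, is to combine the primitives' action-value tables pointwise, using max for disjunction and min for conjunction. The tables and names below are toy, hypothetical values.

```python
def compose_or(q_a, q_b):
    """Disjunction: each state-action pair takes the larger of the two values."""
    return {key: max(q_a[key], q_b[key]) for key in q_a}


def compose_and(q_a, q_b):
    """Conjunction: each state-action pair takes the smaller of the two values."""
    return {key: min(q_a[key], q_b[key]) for key in q_a}


def greedy_action(q, state, actions):
    """Pick the action with the highest composed value in the given state."""
    return max(actions, key=lambda action: q[(state, action)])


ACTIONS = ["left", "right"]
# Toy value tables for two primitives, e.g. "reach red" and "reach blue".
q_red = {("s0", "left"): 1.0, ("s0", "right"): 0.2}
q_blue = {("s0", "left"): 0.1, ("s0", "right"): 0.9}

either = greedy_action(compose_or(q_red, q_blue), "s0", ACTIONS)
both = greedy_action(compose_and(q_red, q_blue), "s0", ACTIONS)
```

No retraining is involved in this sketch: the composed table is computed from the stored primitives at deployment time.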

PLANRL: A Motion Planning and Imitation Learning Framework to Bootstrap Reinforcement Learning

Authors:Amisha Bhaskar, Zahiruddin Mahammad, Sachin R Jadhav, Pratap Tokekar
Date:2024-08-07 19:30:08

Reinforcement Learning (RL) has shown remarkable progress in simulation environments, yet its application to real-world robotic tasks remains limited due to challenges in exploration and generalization. To address these issues, we introduce PLANRL, a framework that chooses when the robot should use classical motion planning and when it should learn a policy. To further improve the efficiency in exploration, we use imitation data to bootstrap the exploration. PLANRL dynamically switches between two modes of operation: reaching a waypoint using classical techniques when away from the objects and reinforcement learning for fine-grained manipulation control when about to interact with objects. The PLANRL architecture is composed of ModeNet for mode classification, NavNet for waypoint prediction, and InteractNet for precise manipulation. By combining the strengths of RL and Imitation Learning (IL), PLANRL improves sample efficiency and mitigates distribution shift, ensuring robust task execution. We evaluate our approach across multiple challenging simulation environments and real-world tasks, demonstrating superior performance in terms of adaptability, efficiency, and generalization compared to existing methods. In simulations, PLANRL surpasses baseline methods by 10-15% in training success rates at 30k samples and by 30-40% during evaluation phases. In real-world scenarios, it demonstrates a 30-40% higher success rate on simpler tasks compared to baselines and uniquely succeeds in complex, two-stage manipulation tasks. Datasets and supplementary materials can be found at https://raaslab.org/projects/NAVINACT/.

HDPlanner: Advancing Autonomous Deployments in Unknown Environments through Hierarchical Decision Networks

Authors:Jingsong Liang, Yuhong Cao, Yixiao Ma, Hanqi Zhao, Guillaume Sartoretti
Date:2024-08-07 13:38:53

In this paper, we introduce HDPlanner, a deep reinforcement learning (DRL) based framework designed to tackle two core and challenging tasks for mobile robots: autonomous exploration and navigation, where the robot must optimize its trajectory adaptively to achieve the task objective through continuous interactions in unknown environments. Specifically, HDPlanner relies on novel hierarchical attention networks to empower the robot to reason about its belief across multiple spatial scales and sequence collaborative decisions, where our networks decompose long-term objectives into short-term informative task assignments and informative path planning. We further propose a contrastive learning-based joint optimization to enhance the robustness of HDPlanner. We empirically demonstrate that HDPlanner significantly outperforms state-of-the-art conventional and learning-based baselines on an extensive set of simulations, including hundreds of test maps and large-scale, complex Gazebo environments. Notably, HDPlanner achieves real-time planning with travel distances reduced by up to 35.7% compared to exploration benchmarks and by up to 16.5% compared to navigation benchmarks. Furthermore, we validate our approach on hardware, where it generates high-quality, adaptive trajectories in both indoor and outdoor environments, highlighting its real-world applicability without additional training.

Faster Model Predictive Control via Self-Supervised Initialization Learning

Authors:Zhaoxin Li, Letian Chen, Rohan Paleja, Subramanya Nageshrao, Matthew Gombolay
Date:2024-08-06 18:41:57

Optimization for robot control tasks, spanning various methodologies, includes Model Predictive Control (MPC). However, the complexity of the system, such as non-convex and non-differentiable cost functions and prolonged planning horizons, often drastically increases the computation time, limiting MPC's real-world applicability. Prior works in speeding up the optimization have limitations in solving convex problems and generalizing to held-out domains. To overcome this challenge, we develop a novel framework aiming at expediting optimization processes. In our framework, we combine offline self-supervised learning and online fine-tuning through reinforcement learning to improve the control performance and reduce optimization time. We demonstrate the effectiveness of our method on a novel, challenging Formula-1-track driving task, achieving 3.9% higher performance in optimization time and 3.6% higher performance in tracking accuracy on challenging holdout tracks.

Integrating Model-Based Footstep Planning with Model-Free Reinforcement Learning for Dynamic Legged Locomotion

Authors:Ho Jae Lee, Seungwoo Hong, Sangbae Kim
Date:2024-08-05 17:55:23

In this work, we introduce a control framework that combines model-based footstep planning with Reinforcement Learning (RL), leveraging desired footstep patterns derived from the Linear Inverted Pendulum (LIP) dynamics. Utilizing the LIP model, our method forward predicts robot states and determines the desired foot placement given the velocity commands. We then train an RL policy to track the foot placements without following the full reference motions derived from the LIP model. This partial guidance from the physics model allows the RL policy to integrate the predictive capabilities of the physics-informed dynamics and the adaptability characteristics of the RL controller without overfitting the policy to the template model. Our approach is validated on the MIT Humanoid, demonstrating that our policy can achieve stable yet dynamic locomotion for walking and turning. We further validate the adaptability and generalizability of our policy by extending the locomotion task to unseen, uneven terrain. During the hardware deployment, we have achieved forward walking speeds of up to 1.5 m/s on a treadmill and have successfully performed dynamic locomotion maneuvers such as 90-degree and 180-degree turns.
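
The forward prediction step can be illustrated with the closed-form solution of the standard LIP dynamics, x'' = (g / h)(x - p), where h is the constant CoM height and p the stance-foot position. The sketch below is an assumption-laden illustration, not the paper's planner: it uses the capture point x + v/ω, a standard LIP heuristic for bringing the CoM to rest, as a stand-in for the velocity-dependent foot placement rule, which the abstract does not give. Function names and numeric values are hypothetical.

```python
import math

GRAVITY = 9.81  # m/s^2


def lip_predict(x0, v0, foot, height, t):
    """Closed-form CoM position and velocity of the LIP after t seconds.

    Dynamics: x'' = (g / height) * (x - foot), with the stance foot fixed.
    """
    omega = math.sqrt(GRAVITY / height)
    c, s = math.cosh(omega * t), math.sinh(omega * t)
    x = foot + (x0 - foot) * c + (v0 / omega) * s
    v = (x0 - foot) * omega * s + v0 * c
    return x, v


def capture_point(x, v, height):
    """Foot position at which the LIP comes to rest above the foot."""
    omega = math.sqrt(GRAVITY / height)
    return x + v / omega


# Stepping onto the capture point makes the CoM converge to the foot location.
foot = capture_point(x=0.0, v=0.5, height=0.8)
x_end, v_end = lip_predict(x0=0.0, v0=0.5, foot=foot, height=0.8, t=3.0)
```

In the framework described above, the RL policy would be trained to track such desired foot placements rather than the full LIP reference motion.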

Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information

Authors:Yauwai Yim, Chunkit Chan, Tianyu Shi, Zheye Deng, Wei Fan, Tianshi Zheng, Yangqiu Song
Date:2024-08-05 15:36:46

Large language models (LLMs) have shown success in handling simple games with imperfect information and enabling multi-agent coordination, but their ability to facilitate practical collaboration against other agents in complex, imperfect information environments, especially in a non-English environment, still needs to be explored. This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games requiring agent collaboration under imperfect information, comparing their performance to established baselines using other types of agents. We propose a Theory of Mind (ToM) planning technique that allows LLM agents to adapt their strategy against various adversaries using only game rules, current state, and historical context as input. An external tool was incorporated to mitigate the challenge of dynamic and extensive action spaces in this card game. Our results show that although a performance gap exists between current LLMs and state-of-the-art reinforcement learning (RL) models, LLMs demonstrate ToM capabilities in this game setting. ToM planning consistently improves their performance against opposing agents, suggesting their ability to understand the actions of allies and adversaries and establish collaboration with allies. To encourage further research and understanding, we have made our codebase openly accessible.

Scalable Signal Temporal Logic Guided Reinforcement Learning via Value Function Space Optimization

Authors:Yiting He, Peiran Liu, Yiding Ji
Date:2024-08-04 04:34:29

The integration of reinforcement learning (RL) and formal methods has emerged as a promising framework for solving long-horizon planning problems. Conventional approaches typically involve abstraction of the state and action spaces and manually created labeling functions or predicates. However, the efficiency of these approaches deteriorates as the tasks become increasingly complex, which results in exponential growth in the size of labeling functions or predicates. To address these issues, we propose a scalable model-based RL framework, called VFSTL, which schedules pre-trained skills to follow unseen STL specifications without using hand-crafted predicates. Given a set of value functions obtained by goal-conditioned RL, we formulate an optimization problem to maximize the robustness value of Signal Temporal Logic (STL) defined specifications, which is computed using value functions as predicates. To further reduce the computation burden, we abstract the environment state space into the value function space (VFS). Then the optimization problem is solved by Model-Based Reinforcement Learning. Simulation results show that STL with value functions as predicates approximates the ground truth robustness and the planning in VFS directly achieves unseen specifications using data from sensors.
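
The idea of using value functions as STL predicates can be sketched with the standard quantitative semantics of STL: a predicate's robustness is its signed margin, "eventually" takes the maximum over time, "always" the minimum, and conjunction the minimum of its operands. The example below is a minimal sketch under these textbook semantics. The value traces and thresholds are made-up numbers, and the function names are hypothetical.

```python
def predicate(values, threshold):
    """Robustness trace of 'V(s_t) >= threshold' along a trajectory."""
    return [v - threshold for v in values]


def eventually(trace):
    """Robustness of F(phi) at time 0 over a finite trace."""
    return max(trace)


def always(trace):
    """Robustness of G(phi) at time 0 over a finite trace."""
    return min(trace)


# Made-up value-function outputs along one three-step trajectory.
reach_goal_values = [0.2, 0.6, 0.9]   # value of a "reach goal" skill
stay_safe_values = [0.9, 0.7, 0.8]    # value of a "stay safe" skill

# Specification: eventually reach the goal AND always stay safe.
rho_reach = eventually(predicate(reach_goal_values, threshold=0.8))
rho_safe = always(predicate(stay_safe_values, threshold=0.5))
rho_spec = min(rho_reach, rho_safe)   # conjunction; positive means satisfied
```

A planner in this style would search over skill schedules to maximize `rho_spec`.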

Coordinating Planning and Tracking in Layered Control Policies via Actor-Critic Learning

Authors:Fengjun Yang, Nikolai Matni
Date:2024-08-03 02:53:24

We propose a reinforcement learning (RL)-based algorithm to jointly train (1) a trajectory planner and (2) a tracking controller in a layered control architecture. Our algorithm arises naturally from a rewrite of the underlying optimal control problem that lends itself to an actor-critic learning approach. By explicitly learning a dual network to coordinate the interaction between the planning and tracking layers, we demonstrate the ability to achieve an effective consensus between the two components, leading to an interpretable policy. We theoretically prove that our algorithm converges to the optimal dual network in the Linear Quadratic Regulator (LQR) setting and empirically validate its applicability to nonlinear systems through simulation experiments on a unicycle model.

GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS

Authors:Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, Eugene Vinitsky
Date:2024-08-02 21:37:46

Multi-agent learning algorithms have been successful at generating superhuman planning in various games but have had limited impact on the design of deployed multi-agent planners. A key bottleneck in applying these techniques to multi-agent planning is that they require billions of steps of experience. To enable the study of multi-agent planning at scale, we present GPUDrive. GPUDrive is a GPU-accelerated, multi-agent simulator built on top of the Madrona Game Engine capable of generating over a million simulation steps per second. Observation, reward, and dynamics functions are written directly in C++, allowing users to define complex, heterogeneous agent behaviors that are lowered to high-performance CUDA. Despite these low-level optimizations, GPUDrive is fully accessible through Python, offering a seamless and efficient workflow for multi-agent, closed-loop simulation. Using GPUDrive, we train reinforcement learning agents on the Waymo Open Motion Dataset, achieving efficient goal-reaching in minutes and scaling to thousands of scenarios in hours. We open-source the code and pre-trained agents at https://github.com/Emerge-Lab/gpudrive.

Adaptive Planning with Generative Models under Uncertainty

Authors:Pascal Jutras-Dubé, Ruqi Zhang, Aniket Bera
Date:2024-08-02 18:07:53

Planning with generative models has emerged as an effective decision-making paradigm across a wide range of domains, including reinforcement learning and autonomous navigation. While continuous replanning at each timestep might seem intuitive because it allows decisions to be made based on the most recent environmental observations, it results in substantial computational challenges, primarily due to the complexity of the generative model's underlying deep learning architecture. Our work addresses this challenge by introducing a simple adaptive planning policy that leverages the generative model's ability to predict long-horizon state trajectories, enabling the execution of multiple actions consecutively without the need for immediate replanning. We propose to use the predictive uncertainty derived from a Deep Ensemble of inverse dynamics models to dynamically adjust the intervals between planning sessions. In our experiments conducted on locomotion tasks within the OpenAI Gym framework, we demonstrate that our adaptive planning policy allows for a reduction in replanning frequency to only about 10% of the steps without compromising the performance. Our results underscore the potential of generative modeling as an efficient and effective tool for decision-making.
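
The adaptive replanning rule can be sketched as follows: keep executing the current plan while the ensemble members agree, and replan once their disagreement exceeds a threshold. This is a simplified illustration, not the paper's procedure. It assumes disagreement is measured as the mean per-dimension variance of the ensemble's action predictions, and the threshold value and function names are hypothetical.

```python
from statistics import pvariance


def ensemble_disagreement(predictions):
    """Mean per-dimension variance across ensemble members' action predictions."""
    action_dim = len(predictions[0])
    per_dim = [pvariance([p[d] for p in predictions]) for d in range(action_dim)]
    return sum(per_dim) / action_dim


def steps_until_replan(plan_predictions, threshold):
    """Count how many planned steps run before disagreement exceeds the threshold."""
    executed = 0
    for step_predictions in plan_predictions:
        if ensemble_disagreement(step_predictions) > threshold:
            break  # uncertainty too high: trigger a new planning call here
        executed += 1
    return executed


# Three planned steps, each with predictions from a 2-member ensemble (1-D actions).
plan = [
    [[0.50], [0.50]],   # members agree
    [[0.40], [0.42]],   # small disagreement
    [[0.00], [1.00]],   # large disagreement
]
n_executed = steps_until_replan(plan, threshold=0.01)
```

With this toy plan, two of the three steps are executed before a replan is triggered.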

RESC: A Reinforcement Learning Based Search-to-Control Framework for Quadrotor Local Planning in Dense Environments

Authors:Zhaohong Liu, Wenxuan Gao, Yinshuai Sun, Peng Dong
Date:2024-08-01 04:29:34

Agile flight in complex environments poses significant challenges to current motion planning methods, as they often fail to fully leverage the quadrotor dynamic potential, leading to performance failures and reduced efficiency during aggressive maneuvers. Existing approaches frequently decouple trajectory optimization from control generation and neglect the dynamics, further limiting their ability to generate aggressive and feasible motions. To address these challenges, we introduce an enhanced Search-to-Control planning framework that integrates visibility path searching with reinforcement learning (RL) control generation, directly accounting for dynamics and bridging the gap between planning and control. Our method first extracts control points from collision-free paths using a proposed heuristic search, which are then refined by an RL policy to generate low-level control commands for the quadrotor controller, utilizing reduced-dimensional obstacle observations for efficient inference with lightweight neural networks. We validate the framework through simulations and real-world experiments, demonstrating improved time efficiency and dynamic maneuverability compared to existing methods, while confirming its robustness and applicability.

ProSpec RL: Plan Ahead, then Execute

Authors:Liangliang Liu, Yi Guan, BoRan Wang, Rujia Shen, Yi Lin, Chaoran Kong, Lian Yan, Jingchi Jiang
Date:2024-07-31 06:04:55

Imagining potential outcomes of actions before execution helps agents make more informed decisions, a prospective thinking ability fundamental to human cognition. However, mainstream model-free Reinforcement Learning (RL) methods lack the ability to proactively envision future scenarios, plan, and guide strategies. These methods typically rely on trial and error to adjust policy functions, aiming to maximize cumulative rewards or long-term value, even if such high-reward decisions place the environment in extremely dangerous states. To address this, we propose the Prospective (ProSpec) RL method, which makes higher-value, lower-risk optimal decisions by imagining future n-stream trajectories. Specifically, ProSpec employs a dynamic model to predict future states (termed "imagined states") based on the current state and a series of sampled actions. Furthermore, we integrate the concept of Model Predictive Control and introduce a cycle consistency constraint that allows the agent to evaluate and select the optimal actions from these trajectories. Moreover, ProSpec employs cycle consistency to mitigate two fundamental issues in RL: augmenting state reversibility to avoid irreversible events (low risk) and augmenting actions to generate numerous virtual trajectories, thereby improving data efficiency. We validated the effectiveness of our method on the DMControl benchmarks, where our approach achieved significant performance improvements. Code will be open-sourced upon acceptance.

QT-TDM: Planning With Transformer Dynamics Model and Autoregressive Q-Learning

Authors:Mostafa Kotb, Cornelius Weber, Muhammad Burhan Hafez, Stefan Wermter
Date:2024-07-26 16:05:26

Inspired by the success of the Transformer architecture in natural language processing and computer vision, we investigate the use of Transformers in Reinforcement Learning (RL), specifically in modeling the environment's dynamics using Transformer Dynamics Models (TDMs). We evaluate the capabilities of TDMs for continuous control in real-time planning scenarios with Model Predictive Control (MPC). While Transformers excel in long-horizon prediction, their tokenization mechanism and autoregressive nature lead to costly planning over long horizons, especially as the environment's dimensionality increases. To alleviate this issue, we use a TDM for short-term planning, and learn an autoregressive discrete Q-function using a separate Q-Transformer (QT) model to estimate a long-term return beyond the short-horizon planning. Our proposed method, QT-TDM, integrates the robust predictive capabilities of Transformers as dynamics models with the efficacy of a model-free Q-Transformer to mitigate the computational burden associated with real-time planning. Experiments in diverse state-based continuous control tasks show that QT-TDM is superior in performance and sample efficiency compared to existing Transformer-based RL models while achieving fast and computationally efficient inference.
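
The combination of short-horizon planning with a learned long-term value can be sketched as scoring each candidate action sequence by its discounted model-predicted rewards plus a discounted terminal value at the end of the horizon. The sketch below is a generic illustration of that pattern, not QT-TDM itself: the dynamics, reward, and terminal value are toy stand-ins for the learned Transformer models, and all names are hypothetical.

```python
def plan_score(state, actions, dynamics, reward, terminal_value, gamma=0.99):
    """Discounted return of a short rollout plus a bootstrapped terminal value."""
    total, discount = 0.0, 1.0
    for action in actions:
        total += discount * reward(state, action)
        state = dynamics(state, action)
        discount *= gamma
    # The terminal value stands in for the return beyond the planning horizon.
    return total + discount * terminal_value(state)


def best_plan(state, candidates, dynamics, reward, terminal_value, gamma=0.99):
    """Pick the candidate action sequence with the highest score."""
    return max(
        candidates,
        key=lambda seq: plan_score(state, seq, dynamics, reward, terminal_value, gamma),
    )


# Toy 1-D problem: move along a line toward a goal at position 3.
def toy_dynamics(s, a):
    return s + a


def toy_reward(s, a):
    return 0.0  # no reward inside the short horizon


def toy_value(s):
    return -abs(s - 3)  # closer to the goal is better


candidates = [(1, 1), (-1, -1), (1, -1)]
chosen = best_plan(0, candidates, toy_dynamics, toy_reward, toy_value, gamma=0.5)
score = plan_score(0, (1, 1), toy_dynamics, toy_reward, toy_value, gamma=0.5)
```

Because the in-horizon reward is zero here, the choice is driven entirely by the terminal value, which is the role the Q-function plays when the planning horizon is kept short.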

Online Planning in POMDPs with State-Requests

Authors:Raphael Avalos, Eugenio Bargiacchi, Ann Nowé, Diederik M. Roijers, Frans A. Oliehoek
Date:2024-07-26 15:20:50

In key real-world problems, full state information is sometimes available but only at a high cost, like activating precise yet energy-intensive sensors or consulting humans, thereby compelling the agent to operate under partial observability. For this scenario, we propose AEMS-SR (Anytime Error Minimization Search with State Requests), a principled online planning algorithm tailored for POMDPs with state requests. By representing the search space as a graph instead of a tree, AEMS-SR avoids the exponential growth of the search space originating from state requests. Theoretical analysis demonstrates AEMS-SR's $\varepsilon$-optimality, ensuring solution quality, while empirical evaluations illustrate its effectiveness compared with AEMS and POMCP, two SOTA online planning algorithms. AEMS-SR enables efficient planning in domains characterized by partial observability and costly state requests, offering practical benefits across various applications.

PP-TIL: Personalized Planning for Autonomous Driving with Instance-based Transfer Imitation Learning

Authors:Fangze Lin, Ying He, Fei Yu
Date:2024-07-26 07:51:11

Personalized motion planning holds significant importance within urban automated driving, catering to the unique requirements of individual users. Nevertheless, prior endeavors have frequently encountered difficulties in simultaneously addressing two crucial aspects: personalized planning within intricate urban settings and enhancing planning performance through data utilization. The challenge arises from the expensive and limited nature of user data, coupled with the scene state space tending towards infinity. These factors contribute to overfitting and poor generalization problems during model training. Hence, we propose an instance-based transfer imitation learning approach. This method facilitates knowledge transfer from extensive expert domain data to the user domain, presenting a fundamental resolution to these issues. We first pre-train a model using large-scale expert data. Subsequently, during the fine-tuning phase, we feed the batch data, which comprises expert and user data. Employing the inverse reinforcement learning technique, we extract the style feature distribution from user demonstrations, constructing the regularization term for the approximation of user style. In our experiments, we conducted extensive evaluations of the proposed method. Compared to the baseline methods, our approach mitigates the overfitting issue caused by sparse user data. Furthermore, we discovered that integrating the driving model with a differentiable nonlinear optimizer as a safety protection layer for end-to-end personalized fine-tuning results in superior planning performance.

Personalized and Context-aware Route Planning for Edge-assisted Vehicles

Authors:Dinesh Cyril Selvaraj, Falko Dressler, Carla Fabiana Chiasserini
Date:2024-07-25 12:14:12

Conventional route planning services typically offer the same routes to all drivers, focusing primarily on a few standardized factors such as travel distance or time, overlooking individual driver preferences. With the inception of autonomous vehicles expected in the coming years, where vehicles will rely on routes decided by such planners, there arises a need to incorporate the specific preferences of each driver, ensuring personalized navigation experiences. In this work, we propose a novel approach based on graph neural networks (GNNs) and deep reinforcement learning (DRL), aimed at customizing routes to suit individual preferences. By analyzing the historical trajectories of individual drivers, we classify their driving behavior and associate it with relevant road attributes as indicators of driver preferences. The GNN is capable of representing the road network as graph-structured data effectively, while DRL is capable of making decisions utilizing reward mechanisms to optimize route selection with factors such as travel costs, congestion level, and driver satisfaction. We evaluate our proposed GNN-based DRL framework using a real-world road network and demonstrate its ability to accommodate driver preferences, offering a range of route options tailored to individual drivers. The results indicate that our framework can select routes that accommodate drivers' preferences with up to a 17% improvement compared to a generic route planner, and reduce the travel time by 33% (afternoon) and 46% (evening) relative to the shortest distance-based approach.

RL-augmented MPC Framework for Agile and Robust Bipedal Footstep Locomotion Planning and Control

Authors:Seung Hyeon Bang, Carlos Arribalzaga Jové, Luis Sentis
Date:2024-07-25 00:51:19

This paper proposes an online bipedal footstep planning strategy that combines model predictive control (MPC) and reinforcement learning (RL) to achieve agile and robust bipedal maneuvers. While MPC-based foot placement controllers have demonstrated their effectiveness in achieving dynamic locomotion, their performance is often limited by the use of simplified models and assumptions. To address this challenge, we develop a novel foot placement controller that leverages a learned policy to bridge the gap between the use of a simplified model and the more complex full-order robot system. Specifically, our approach employs a unique combination of an ALIP-based MPC foot placement controller for sub-optimal footstep planning and the learned policy for refining footstep adjustments, enabling the resulting footstep policy to capture the robot's whole-body dynamics effectively. This integration synergizes the predictive capability of MPC with the flexibility and adaptability of RL. We validate the effectiveness of our framework through a series of experiments using the full-body humanoid robot DRACO 3. The results demonstrate significant improvements in dynamic locomotion performance, including better tracking of a wide range of walking speeds, enabling reliable turning and traversing challenging terrains while preserving the robustness and stability of the walking gaits compared to the baseline ALIP-based MPC approach.

Evaluating Uncertainties in Electricity Markets via Machine Learning and Quantum Computing

Authors:Shuyang Zhu, Ziqing Zhu, Linghua Zhu, Yujian Ye, Siqi Bu, Sasa Z. Djokic
Date:2024-07-23 11:46:13

The analysis of the decision-making process in electricity markets is crucial for understanding and resolving issues related to market manipulation and reduced social welfare. The traditional Multi-Agent Reinforcement Learning (MARL) method can model the decision-making of generation companies (GENCOs), but faces challenges due to uncertainties in policy functions, reward functions, and inter-agent interactions. Quantum computing offers a promising solution to resolve these uncertainties, and this paper introduces the Quantum Multi-Agent Deep Q-Network (Q-MADQN) method, which integrates variational quantum circuits into the traditional MARL framework. The main contributions of the paper are: identifying the correspondence between market uncertainties and quantum properties, proposing the Q-MADQN algorithm for simulating electricity market bidding, and demonstrating that Q-MADQN allows for a more thorough exploration and simulates more potential bidding strategies of profit-oriented GENCOs, compared to conventional methods, without compromising computational efficiency. The proposed method is illustrated on the IEEE 30-bus test network, confirming that it offers a more accurate model for simulating complex market dynamics.

Negotiating Control: Neurosymbolic Variable Autonomy

Authors:Georgios Bakirtzis, Manolis Chiou, Andreas Theodorou
Date:2024-07-23 07:44:17

Variable autonomy equips a system, such as a robot, with mixed initiatives such that it can adjust its independence level based on the task's complexity and the surrounding environment. Variable autonomy solves two main problems in robotic planning: the first is the problem of humans being unable to keep focus in monitoring and intervening during robotic tasks without appropriate human factor indicators, and the second is achieving mission success in unforeseen and uncertain environments in the face of static reward structures. An open problem in variable autonomy is developing robust methods to dynamically balance autonomy and human intervention in real-time, ensuring optimal performance and safety in unpredictable and evolving environments. We posit that addressing unpredictable and evolving environments through an addition of rule-based symbolic logic has the potential to make autonomy adjustments more contextually reliable and adding feedback to reinforcement learning through data from mixed-initiative control further increases efficacy and safety of autonomous behaviour.

ODGR: Online Dynamic Goal Recognition

Authors:Matan Shamir, Osher Elhadad, Matthew E. Taylor, Reuth Mirsky
Date:2024-07-23 06:52:52

Traditionally, Reinforcement Learning (RL) problems are aimed at optimization of the behavior of an agent. This paper proposes a novel take on RL, which is used to learn the policy of another agent, to allow real-time recognition of that agent's goals. Goal Recognition (GR) has traditionally been framed as a planning problem where one must recognize an agent's objectives based on its observed actions. Recent approaches have shown how reinforcement learning can be used as part of the GR pipeline, but are limited to recognizing predefined goals and lack scalability in domains with a large goal space. This paper formulates a novel problem, "Online Dynamic Goal Recognition" (ODGR), as a first step to address these limitations. Contributions include introducing the concept of dynamic goals into the standard GR problem definition, revisiting common approaches by reformulating them using ODGR, and demonstrating the feasibility of solving ODGR in a navigation domain using transfer learning. These novel formulations open the door for future extensions of existing transfer learning-based GR methods, which will be robust to changing and expansive real-time environments.

Diffusion Models as Optimizers for Efficient Planning in Offline RL

Authors:Renming Huang, Yunqiang Pei, Guoqing Wang, Yangming Zhang, Yang Yang, Peng Wang, Hengtao Shen
Date:2024-07-23 03:00:01

Diffusion models have shown strong competitiveness in offline reinforcement learning tasks by formulating decision-making as sequential generation. However, the practicality of these methods is limited due to the lengthy inference processes they require. In this paper, we address this problem by decomposing the sampling process of diffusion models into two decoupled subprocesses: 1) generating a feasible trajectory, which is a time-consuming process, and 2) optimizing the trajectory. With this decomposition approach, we are able to partially separate efficiency and quality factors, enabling us to simultaneously gain efficiency advantages and ensure quality assurance. We propose the Trajectory Diffuser, which utilizes a faster autoregressive model to handle the generation of feasible trajectories while retaining the trajectory optimization process of diffusion models. This allows us to achieve more efficient planning without sacrificing capability. To evaluate the effectiveness and efficiency of the Trajectory Diffuser, we conduct experiments on the D4RL benchmarks. The results demonstrate that our method achieves 3-10× faster inference speed compared to previous sequence modeling methods, while also outperforming them in terms of overall performance. Code: https://github.com/RenMing-Huang/TrajectoryDiffuser Keywords: Reinforcement Learning, Efficient Planning, Diffusion Model

On shallow planning under partial observability

Authors:Randy Lefebvre, Audrey Durand
Date:2024-07-22 17:34:07

Formulating a real-world problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting a discount factor for the learning objective (discounted cumulative rewards), which articulates the planning horizon of the agent. This work investigates the impact of the discount factor on the bias-variance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.
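
The link between the discount factor and the planning horizon can be made concrete with the usual rule of thumb that a discount factor γ corresponds to an effective horizon of about 1 / (1 - γ) steps, since the discount weights 1, γ, γ², ... sum to that value. The short sketch below illustrates this relationship. It is a general illustration, not the paper's analysis, and the function names are hypothetical.

```python
def effective_horizon(gamma):
    """Sum of the discount weights 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma):
    """Discounted cumulative reward, accumulated backwards for numerical simplicity."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


short_sighted = effective_horizon(0.5)   # about 2 steps
far_sighted = effective_horizon(0.9)     # about 10 steps
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Lowering γ shrinks the effective horizon, which is the sense in which a smaller discount factor yields shallower planning.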

Temporal Abstraction in Reinforcement Learning with Offline Data

Authors:Ranga Shaarad Ayyagari, Anurita Ghosh, Ambedkar Dukkipati
Date:2024-07-21 18:10:31

Standard reinforcement learning algorithms with a single policy perform poorly on tasks in complex environments involving sparse rewards, diverse behaviors, or long-term planning. This led to the study of algorithms that incorporate temporal abstraction by training a hierarchy of policies that plan over different time scales. The options framework has been introduced to implement such temporal abstraction by learning low-level options that act as extended actions controlled by a high-level policy. The main challenge in applying these algorithms to real-world problems is that they suffer from high sample complexity to train multiple levels of the hierarchy, which is impossible in online settings. Motivated by this, in this paper, we propose an offline hierarchical RL method that can learn options from existing offline datasets collected by other unknown agents. This is a very challenging problem due to the distribution mismatch between the learned options and the policies responsible for the offline dataset and to our knowledge, this is the first work in this direction. In this work, we propose a framework by which an online hierarchical reinforcement learning algorithm can be trained on an offline dataset of transitions collected by an unknown behavior policy. We validate our method on Gym MuJoCo locomotion environments and robotic gripper block-stacking tasks in the standard as well as transfer and goal-conditioned settings.

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Authors:Yanting Yang, Minghao Chen, Qibo Qiu, Jiahao Wu, Wenxiao Wang, Binbin Lin, Ziyu Guan, Xiaofei He
Date:2024-07-20 13:22:59

For a general-purpose robot to operate in reality, executing a broad range of instructions across various environments is imperative. Central to the reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models, such as CLIP, have shown remarkable performance in the domain of deep learning, paving the way for open-domain visual recognition. However, collecting data on robots executing various language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, only utilizing robot video data from a minimal amount of tasks in a singular environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model's ability to distinguish between successful and failed robot executions, we cluster failure video features to enable the model to identify patterns within. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.

Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review

Authors:Masatoshi Uehara, Yulai Zhao, Tommaso Biancalani, Sergey Levine
Date:2024-07-18 17:35:32

This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions. While diffusion models are widely known to provide excellent generative modeling capability, practical applications in domains such as biology require generating samples that maximize some desired metric (e.g., translation efficiency in RNA, docking score in molecules, stability in protein). In these cases, the diffusion model can be optimized not only to generate realistic samples but also to explicitly maximize the measure of interest. Such methods are based on concepts from reinforcement learning (RL). We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning, tailored specifically for fine-tuning diffusion models. We aim to explore fundamental aspects such as the strengths and limitations of different RL-based fine-tuning algorithms across various scenarios, the benefits of RL-based fine-tuning compared to non-RL-based approaches, and the formal objectives of RL-based fine-tuning (target distributions). Additionally, we aim to examine their connections with related topics such as classifier guidance, Gflownets, flow-based diffusion models, path integral control theory, and sampling from unnormalized distributions such as MCMC. The code of this tutorial is available at https://github.com/masa-ue/RLfinetuning_Diffusion_Bioseq

Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error

Authors:Ally Yalei Du, Lin F. Yang, Ruosong Wang
Date:2024-07-18 15:58:04

The recent work by Dong & Yang (2023) showed that for misspecified sparse linear bandits, one can obtain an $O\left(\epsilon\right)$-optimal policy using a polynomial number of samples when the sparsity is a constant, where $\epsilon$ is the misspecification error. This result is in sharp contrast to misspecified linear bandits without sparsity, which require an exponential number of samples to get the same guarantee. To study whether an analogous result is possible in the reinforcement learning setting, we consider the following problem: assuming the optimal $Q$-function is a $d$-dimensional linear function with sparsity $k$ and misspecification error $\epsilon$, can we obtain an $O\left(\epsilon\right)$-optimal policy using a number of samples polynomial in the feature dimension $d$? We first demonstrate why the standard approach based on Bellman backup or the existing optimistic value function elimination approach such as OLIVE (Jiang et al., 2017) achieves suboptimal guarantees for this problem. We then design a novel elimination-based algorithm to show one can obtain an $O\left(H\epsilon\right)$-optimal policy with sample complexity polynomial in the feature dimension $d$ and planning horizon $H$. Lastly, we complement our upper bound with an $\widetilde{\Omega}\left(H\epsilon\right)$ suboptimality lower bound, giving a complete picture of this problem.

Hyp2Nav: Hyperbolic Planning and Curiosity for Crowd Navigation

Authors:Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Pascal Mettes, Fabio Galasso
Date:2024-07-18 14:40:33

Autonomous robots are increasingly becoming a strong fixture in social environments. Effective crowd navigation requires not only safe yet fast planning, but should also enable interpretability and computational efficiency for working in real-time on embedded devices. In this work, we advocate for hyperbolic learning to enable crowd navigation and we introduce Hyp2Nav. Different from conventional reinforcement learning-based crowd navigation methods, Hyp2Nav leverages the intrinsic properties of hyperbolic geometry to better encode the hierarchical nature of decision-making processes in navigation tasks. We propose a hyperbolic policy model and a hyperbolic curiosity module that result in effective social navigation, best success rates, and returns across multiple simulation settings, using up to 6 times fewer parameters than competitor state-of-the-art models. With our approach, it even becomes possible to obtain policies that work in 2-dimensional embedding spaces, opening up new possibilities for low-resource crowd navigation and model interpretability. Insightfully, the internal hyperbolic representation of Hyp2Nav correlates with how much attention the robot pays to the surrounding crowds, e.g. due to multiple people occluding its pathway or to a few of them showing colliding plans, rather than to its own planned route. The code is available at https://github.com/GDam90/hyp2nav.

Autonomous Navigation of Unmanned Vehicle Through Deep Reinforcement Learning

Authors:Letian Xu, Jiabei Liu, Haopeng Zhao, Tianyao Zheng, Tongzhou Jiang, Lipeng Liu
Date:2024-07-18 05:18:59

This paper explores the method of achieving autonomous navigation of unmanned vehicles through Deep Reinforcement Learning (DRL). The focus is on using the Deep Deterministic Policy Gradient (DDPG) algorithm to address issues in high-dimensional continuous action spaces. The paper details the model of an Ackermann robot and the structure and application of the DDPG algorithm. Experiments were conducted in a simulation environment to verify the feasibility of the improved algorithm. The results demonstrate that the DDPG algorithm outperforms traditional Deep Q-Network (DQN) and Double Deep Q-Network (DDQN) algorithms in path planning tasks.

DITTO: A Visual Digital Twin for Interventions and Temporal Treatment Outcomes in Head and Neck Cancer

Authors:Andrew Wentzel, Serageldin Attia, Xinhua Zhang, Guadalupe Canahuate, Clifton David Fuller, G. Elisabeta Marai
Date:2024-07-18 02:36:18

Digital twin models are of high interest to Head and Neck Cancer (HNC) oncologists, who have to navigate a series of complex treatment decisions that weigh the efficacy of tumor control against toxicity and mortality risks. Evaluating individual risk profiles necessitates a deeper understanding of the interplay between different factors such as patient health, spatial tumor location and spread, and risk of subsequent toxicities that cannot be adequately captured through simple heuristics. To support clinicians in better understanding tradeoffs when deciding on treatment courses, we developed DITTO, a digital-twin and visual computing system that allows clinicians to analyze detailed risk profiles for each patient, and decide on a treatment plan. DITTO relies on a sequential Deep Reinforcement Learning digital twin (DT) to deliver personalized risk of both long-term and short-term disease outcome and toxicity risk for HNC patients. Based on a participatory collaborative design alongside oncologists, we also implement several visual explainability methods to promote clinical trust and encourage healthy skepticism when using our system. We evaluate the efficacy of DITTO through quantitative evaluation of performance and case studies with qualitative feedback. Finally, we discuss design lessons for developing clinical visual XAI applications for clinical end users.

Towards Collaborative Intelligence: Propagating Intentions and Reasoning for Multi-Agent Coordination with Large Language Models

Authors:Xihe Qiu, Haoyu Wang, Xiaoyu Tan, Chao Qu, Yujie Xiong, Yuan Cheng, Yinghui Xu, Wei Chu, Yuan Qi
Date:2024-07-17 13:14:00

Effective collaboration in multi-agent systems requires communicating goals and intentions between agents. Current agent frameworks often suffer from dependencies on single-agent execution and lack robust inter-module communication, frequently leading to suboptimal multi-agent reinforcement learning (MARL) policies and inadequate task coordination. To address these challenges, we present a framework for training large language models (LLMs) as collaborative agents to enable coordinated behaviors in cooperative MARL. Each agent maintains a private intention consisting of its current goal and associated sub-tasks. Agents broadcast their intentions periodically, allowing other agents to infer coordination tasks. A propagation network transforms broadcast intentions into teammate-specific communication messages, sharing relevant goals with designated teammates. The architecture of our framework is structured into planning, grounding, and execution modules. During execution, multiple agents interact in a downstream environment and communicate intentions to enable coordinated behaviors. The grounding module dynamically adapts comprehension strategies based on emerging coordination patterns, while feedback from execution agents influences the planning module, enabling the dynamic re-planning of sub-tasks. Results in collaborative environment simulation demonstrate intention propagation reduces miscoordination errors by aligning sub-task dependencies between agents. Agents learn when to communicate intentions and which teammates require task details, resulting in emergent coordinated behaviors. This demonstrates the efficacy of intention sharing for cooperative multi-agent RL based on LLMs.

Satisficing Exploration for Deep Reinforcement Learning

Authors:Dilip Arumugam, Saurabh Kumar, Ramki Gummadi, Benjamin Van Roy
Date:2024-07-16 21:28:03

A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world, however, attaining optimal performance may in fact be an entirely intractable endeavor and an agent may seldom find itself in a position to complete the requisite exploration for identifying an optimal policy. Recent work has leveraged tools from information theory to design agents that deliberately forgo optimal solutions in favor of sufficiently-satisfying or satisficing solutions, obtained through lossy compression. Notably, such agents may employ fundamentally different exploratory decisions to learn satisficing behaviors more efficiently than optimal ones that are more data intensive. While supported by a rigorous corroborating theory, the underlying algorithm relies on model-based planning, drastically limiting the compatibility of these ideas with function approximation and high-dimensional observations. In this work, we remedy this issue by extending an agent that directly represents uncertainty over the optimal value function allowing it to both bypass the need for model-based planning and to learn satisficing policies. We provide simple yet illustrative experiments that demonstrate how our algorithm enables deep reinforcement-learning agents to achieve satisficing behaviors. In keeping with previous work on this setting for multi-armed bandits, we additionally find that our algorithm is capable of synthesizing optimal behaviors, when feasible, more efficiently than its non-information-theoretic counterpart.

Walking the Values in Bayesian Inverse Reinforcement Learning

Authors:Ondrej Bajgar, Alessandro Abate, Konstantinos Gatsis, Michael A. Osborne
Date:2024-07-15 17:59:52

The goal of Bayesian inverse reinforcement learning (IRL) is recovering a posterior distribution over reward functions using a set of demonstrations from an expert optimizing for a reward unknown to the learner. The resulting posterior over rewards can then be used to synthesize an apprentice policy that performs well on the same or a similar task. A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood, often defined in terms of Q values: vanilla Bayesian IRL needs to solve the costly forward planning problem - going from rewards to the Q values - at every step of the algorithm, which may need to be done thousands of times. We propose to solve this by a simple change: instead of focusing on primarily sampling in the space of rewards, we can focus on primarily working in the space of Q-values, since the computation required to go from Q-values to reward is radically cheaper. Furthermore, this reversion of the computation makes it easy to compute the gradient allowing efficient sampling using Hamiltonian Monte Carlo. We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight - and illustrate its advantages on several tasks.
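To make the "Q-values to reward is cheap" observation concrete, the sketch below inverts the Bellman equation for a small tabular MDP: r(s, a) = Q(s, a) - gamma * E[V(s')]. It assumes known transition probabilities and a hard-max state value V(s) = max_a Q(s, a); both are illustrative assumptions, and the function name is hypothetical rather than part of ValueWalk.

```python
def rewards_from_q(q, transitions, gamma):
    """Recover r(s, a) = Q(s, a) - gamma * E_{s'}[max_a' Q(s', a')].

    q[s][a] is a Q-value; transitions[s][a] is a list of next-state
    probabilities. The hard max for the state value is an assumption.
    """
    # State values under the (assumed) greedy backup.
    values = [max(q_s) for q_s in q]
    rewards = []
    for s, q_s in enumerate(q):
        row = []
        for a, q_sa in enumerate(q_s):
            # Expected next-state value under the known transition model.
            expected_next = sum(p * v for p, v in zip(transitions[s][a], values))
            row.append(q_sa - gamma * expected_next)
        rewards.append(row)
    return rewards
```

This is a single pass over the Q-table with no iteration to a fixed point, whereas going from rewards to Q-values requires solving a planning problem.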

Cooperative Reward Shaping for Multi-Agent Pathfinding

Authors:Zhenyu Song, Ronghao Zheng, Senlin Zhang, Meiqin Liu
Date:2024-07-15 02:44:41

The primary objective of Multi-Agent Pathfinding (MAPF) is to plan efficient and conflict-free paths for all agents. Traditional multi-agent path planning algorithms struggle to achieve efficient distributed path planning for multiple agents. In contrast, Multi-Agent Reinforcement Learning (MARL) has been demonstrated as an effective approach to achieve this objective. By modeling the MAPF problem as a MARL problem, agents can achieve efficient path planning and collision avoidance through distributed strategies under partial observation. However, MARL strategies often lack cooperation among agents due to the absence of global information, which subsequently leads to reduced MAPF efficiency. To address this challenge, this letter introduces a unique reward shaping technique based on Independent Q-Learning (IQL). The aim of this method is to evaluate the influence of one agent on its neighbors and integrate such an interaction into the reward function, leading to active cooperation among agents. This reward shaping method facilitates cooperation among agents while operating in a distributed manner. The proposed approach has been evaluated through experiments across various scenarios with different scales and agent counts. The results are compared with those from other state-of-the-art (SOTA) planners. The evidence suggests that the approach proposed in this letter parallels other planners in numerous aspects, and outperforms them in scenarios featuring a large number of agents.

Optimal Defender Strategies for CAGE-2 using Causal Modeling and Tree Search

Authors:Kim Hammar, Neil Dhir, Rolf Stadler
Date:2024-07-12 18:34:55

The CAGE-2 challenge is considered a standard benchmark to compare methods for autonomous cyber defense. Current state-of-the-art methods evaluated against this benchmark are based on model-free (offline) reinforcement learning, which does not provide provably optimal defender strategies. We address this limitation and present a formal (causal) model of CAGE-2 together with a method that produces a provably optimal defender strategy, which we call Causal Partially Observable Monte-Carlo Planning (C-POMCP). It has two key properties. First, it incorporates the causal structure of the target system, i.e., the causal relationships among the system variables. This structure allows for a significant reduction of the search space of defender strategies. Second, it is an online method that updates the defender strategy at each time step via tree search. Evaluations against the CAGE-2 benchmark show that C-POMCP achieves state-of-the-art performance with respect to effectiveness and is two orders of magnitude more efficient in computing time than the closest competitor method.

Instruction Following with Goal-Conditioned Reinforcement Learning in Virtual Environments

Authors:Zoya Volovikova, Alexey Skrynnik, Petr Kuderov, Aleksandr I. Panov
Date:2024-07-12 14:19:36

In this study, we address the issue of enabling an artificial intelligence agent to execute complex language instructions within virtual environments. In our framework, we assume that these instructions involve intricate linguistic structures and multiple interdependent tasks that must be navigated successfully to achieve the desired outcomes. To effectively manage these complexities, we propose a hierarchical framework that combines the deep language comprehension of large language models with the adaptive action-execution capabilities of reinforcement learning agents. The language module (based on LLM) translates the language instruction into a high-level action plan, which is then executed by a pre-trained reinforcement learning agent. We have demonstrated the effectiveness of our approach in two different environments: in IGLU, where agents are instructed to build structures, and in Crafter, where agents perform tasks and interact with objects in the surrounding environment according to language commands.

Graph Neural Networks with Model-based Reinforcement Learning for Multi-agent Systems

Authors:Hanxiao Chen
Date:2024-07-12 13:21:35

Multi-agent systems (MAS) play a significant role in exploring machine intelligence and advanced applications. To investigate complicated interactions within MAS scenarios in depth, we propose the "GNN for MBRL" model, which combines state-spaced Graph Neural Networks with Model-based Reinforcement Learning to address specific MAS missions (e.g., Billiard-Avoidance, Autonomous Driving Cars). In detail, we first use a GNN model to predict future states and trajectories of multiple agents, then apply Model Predictive Control optimized with the Cross-Entropy Method (CEM) to help the ego-agent plan actions and successfully accomplish certain MAS tasks.
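The planning step can be illustrated with a generic Cross-Entropy Method loop for Model Predictive Control. This is a sketch under stated assumptions, not the paper's implementation: `toy_dynamics` is a hypothetical 1D stand-in for the learned GNN model, the cost is squared distance to a goal, and all hyperparameters are arbitrary.

```python
import random


def cem_plan(state, goal, dynamics, horizon=5, samples=64, elites=8,
             iterations=5, seed=0):
    """Return the first action of a CEM-optimized action sequence."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        scored = []
        for _ in range(samples):
            # Sample a candidate sequence, clipped to the action range [-1, 1].
            seq = [max(-1.0, min(1.0, rng.gauss(m, s)))
                   for m, s in zip(mean, std)]
            # Roll the model forward and accumulate squared distance to goal.
            x, cost = state, 0.0
            for a in seq:
                x = dynamics(x, a)
                cost += (x - goal) ** 2
            scored.append((cost, seq))
        scored.sort(key=lambda item: item[0])
        best = [seq for _, seq in scored[:elites]]
        # Refit the sampling distribution to the elite sequences.
        for t in range(horizon):
            column = [seq[t] for seq in best]
            mean[t] = sum(column) / elites
            var = sum((a - mean[t]) ** 2 for a in column) / elites
            std[t] = max(var ** 0.5, 1e-3)
    return mean[0]


def toy_dynamics(x, a):
    """Hypothetical stand-in for a learned model: a 1D point moved by a."""
    return x + a
```

In an MPC loop, the agent would execute only the returned first action, observe the new state, and call `cem_plan` again from there.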

Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment

Authors:Jayabrata Chowdhury, Venkataramanan Shivaraman, Sumit Dangi, Suresh Sundaram, P. B. Sujit
Date:2024-07-12 02:34:44

Autonomous Vehicle (AV) decision making in urban environments is inherently challenging due to the dynamic interactions with surrounding vehicles. For safe planning, the AV must understand the weightage of various spatiotemporal interactions in a scene. Contemporary works use colossal transformer architectures to encode interactions mainly for trajectory prediction, resulting in increased computational complexity. To address this issue without compromising spatiotemporal understanding and performance, we propose the simple Deep Attention Driven Reinforcement Learning (DADRL) framework, which dynamically assigns and incorporates the significance of surrounding vehicles into the ego's RL-driven decision-making process. We introduce an AV-centric spatiotemporal attention encoding (STAE) mechanism for learning the dynamic interactions with different surrounding vehicles. To understand map and route context, we employ a context encoder to extract features from context maps. The spatiotemporal representations combined with contextual encoding provide a comprehensive state representation. The resulting model is trained using the Soft Actor Critic (SAC) algorithm. We evaluate the proposed framework on the SMARTS urban benchmarking scenarios without traffic signals to demonstrate that DADRL outperforms recent state-of-the-art methods. Furthermore, an ablation study underscores the importance of the context encoder and spatiotemporal attention encoder in achieving superior performance.

MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility

Authors:Wayne Wu, Honglin He, Jack He, Yiran Wang, Chenda Duan, Zhizheng Liu, Quanyi Li, Bolei Zhou
Date:2024-07-11 17:56:49

Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while robot dogs and humanoids have recently emerged in the street. Micromobility enabled by AI for short-distance travel in public urban spaces is a crucial component of the future transportation system. Ensuring the generalizability and safety of AI models maneuvering mobile machines is essential. In this work, we present MetaUrban, a compositional simulation platform for AI-driven urban micromobility research. MetaUrban can construct an infinite number of interactive urban scenes from compositional elements, covering a vast array of ground plans, object placements, pedestrians, vulnerable road users, and other mobile agents' appearances and dynamics. We design point navigation and social navigation tasks as the pilot study using MetaUrban for urban micromobility research and establish various baselines of Reinforcement Learning and Imitation Learning. We conduct extensive evaluation across mobile machines, demonstrating that heterogeneous mechanical structures significantly influence the learning and execution of AI policies. We perform a thorough ablation study, showing that the compositional nature of the simulated environments can substantially improve the generalizability and safety of the trained mobile agents. MetaUrban will be made publicly available to provide research opportunities and foster safe and trustworthy embodied AI and micromobility in cities. The code and dataset will be publicly available.

Hierarchical Consensus-Based Multi-Agent Reinforcement Learning for Multi-Robot Cooperation Tasks

Authors:Pu Feng, Junkang Liang, Size Wang, Xin Yu, Xin Ji, Yiting Chen, Kui Zhang, Rongye Shi, Wenjun Wu
Date:2024-07-11 03:55:55

In multi-agent reinforcement learning (MARL), the Centralized Training with Decentralized Execution (CTDE) framework is pivotal but struggles due to a gap: global state guidance in training versus reliance on local observations in execution, lacking global signals. Inspired by human societal consensus mechanisms, we introduce the Hierarchical Consensus-based Multi-Agent Reinforcement Learning (HC-MARL) framework to address this limitation. HC-MARL employs contrastive learning to foster a global consensus among agents, enabling cooperative behavior without direct communication. This approach enables agents to form a global consensus from local observations, using it as an additional piece of information to guide collaborative actions during execution. To cater to the dynamic requirements of various tasks, consensus is divided into multiple layers, encompassing both short-term and long-term considerations. Short-term observations prompt the creation of an immediate, low-layer consensus, while long-term observations contribute to the formation of a strategic, high-layer consensus. This process is further refined through an adaptive attention mechanism that dynamically adjusts the influence of each consensus layer. This mechanism optimizes the balance between immediate reactions and strategic planning, tailoring it to the specific demands of the task at hand. Extensive experiments and real-world applications in multi-robot systems showcase our framework's superior performance, marking significant advancements over baselines.

Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models

Authors:Logan Cross, Violet Xiang, Agam Bhatia, Daniel LK Yamins, Nick Haber
Date:2024-07-09 17:57:15

Multi-agent reinforcement learning (MARL) methods struggle with the non-stationarity of multi-agent systems and fail to adaptively learn online when tested with novel agents. Here, we leverage large language models (LLMs) to create an autonomous agent that can handle these challenges. Our agent, Hypothetical Minds, consists of a cognitively-inspired architecture, featuring modular components for perception, memory, and hierarchical planning over two levels of abstraction. We introduce the Theory of Mind module that scaffolds the high-level planning process by generating hypotheses about other agents' strategies in natural language. It then evaluates and iteratively refines these hypotheses by reinforcing hypotheses that make correct predictions about the other agents' behavior. Hypothetical Minds significantly improves performance over previous LLM-agent and RL baselines on a range of competitive, mixed motive, and collaborative domains in the Melting Pot benchmark, including both dyadic and population-based environments. Additionally, comparisons against LLM-agent baselines and ablations reveal the importance of hypothesis evaluation and refinement for succeeding on complex scenarios.

A Unified Approach to Multi-task Legged Navigation: Temporal Logic Meets Reinforcement Learning

Authors:Jesse Jiang, Samuel Coogan, Ye Zhao
Date:2024-07-09 15:06:52

This study examines the problem of hopping robot navigation planning to achieve simultaneous goal-directed and environment exploration tasks. We consider a scenario in which the robot has mandatory goal-directed tasks defined using Linear Temporal Logic (LTL) specifications as well as optional exploration tasks represented using a reward function. Additionally, there exists uncertainty in the robot dynamics which results in motion perturbation. We first propose an abstraction of 3D hopping robot dynamics which enables high-level planning and a neural-network-based optimization for low-level control. We then introduce a Multi-task Product IMDP (MT-PIMDP) model of the system and tasks. We propose a unified control policy synthesis algorithm which enables both task-directed goal-reaching behaviors as well as task-agnostic exploration to learn perturbations and reward. We provide a formal proof of the trade-off induced by prioritizing either LTL or RL actions. We demonstrate our methods with simulation case studies in a 2D world navigation environment.

HiLMa-Res: A General Hierarchical Framework via Residual RL for Combining Quadrupedal Locomotion and Manipulation

Authors:Xiaoyu Huang, Qiayuan Liao, Yiming Ni, Zhongyu Li, Laura Smith, Sergey Levine, Xue Bin Peng, Koushil Sreenath
Date:2024-07-09 06:31:54

This work presents HiLMa-Res, a hierarchical framework leveraging reinforcement learning to tackle manipulation tasks while performing continuous locomotion using quadrupedal robots. Unlike most previous efforts that focus on solving a specific task, HiLMa-Res is designed to be general for various loco-manipulation tasks that require quadrupedal robots to maintain sustained mobility. The novel design of this framework tackles the challenges of integrating continuous locomotion control and manipulation using legs. It develops an operational space locomotion controller that can track arbitrary robot end-effector (toe) trajectories while walking at different velocities. This controller is designed to be general to different downstream tasks, and therefore, can be utilized in high-level manipulation planning policy to address specific tasks. To demonstrate the versatility of this framework, we utilize HiLMa-Res to tackle several challenging loco-manipulation tasks using a quadrupedal robot in the real world. These tasks span from leveraging state-based policy to vision-based policy, from training purely from the simulation data to learning from real-world data. In these tasks, HiLMa-Res shows better performance than other methods.

DiffPhyCon: A Generative Approach to Control Complex Physical Systems

Authors:Long Wei, Peiyan Hu, Ruiqi Feng, Haodong Feng, Yixuan Du, Tao Zhang, Rui Wang, Yue Wang, Zhi-Ming Ma, Tailin Wu
Date:2024-07-09 01:56:23

Controlling the evolution of complex physical systems is a fundamental task across science and engineering. Classical techniques suffer from limited applicability or huge computational costs. On the other hand, recent deep learning and reinforcement learning-based approaches often struggle to optimize long-term control sequences under the constraints of system dynamics. In this work, we introduce Diffusion Physical systems Control (DiffPhyCon), a new class of method to address the physical systems control problem. DiffPhyCon excels by simultaneously minimizing both the learned generative energy function and the predefined control objectives across the entire trajectory and control sequence. Thus, it can explore globally and plan near-optimal control sequences. Moreover, we enhance DiffPhyCon with prior reweighting, enabling the discovery of control sequences that significantly deviate from the training distribution. We test our method on three tasks: 1D Burgers' equation, 2D jellyfish movement control, and 2D high-dimensional smoke control, where our generated jellyfish dataset is released as a benchmark for complex physical system control research. Our method outperforms widely applied classical approaches and state-of-the-art deep learning and reinforcement learning methods. Notably, DiffPhyCon unveils an intriguing fast-close-slow-open pattern observed in the jellyfish, aligning with established findings in the field of fluid dynamics. The project website, jellyfish dataset, and code can be found at https://github.com/AI4Science-WestlakeU/diffphycon.

Collision Avoidance for Multiple UAVs in Unknown Scenarios with Causal Representation Disentanglement

Authors:Jiafan Zhuang, Zihao Xia, Gaofei Han, Boxi Wang, Wenji Li, Dongliang Wang, Zhifeng Hao, Ruichu Cai, Zhun Fan
Date:2024-07-04 17:09:36

Deep reinforcement learning (DRL) has achieved remarkable progress in online path planning tasks for multi-UAV systems. However, existing DRL-based methods often suffer from performance degradation when tackling unseen scenarios, since the non-causal factors in visual representations adversely affect policy learning. To address this issue, we propose a novel representation learning approach, i.e., causal representation disentanglement, which can identify the causal and non-causal factors in representations. After that, we only pass causal factors for subsequent policy learning and thus explicitly eliminate the influence of non-causal factors, which effectively improves the generalization ability of DRL models. Experimental results show that our proposed method can achieve robust navigation performance and effective collision avoidance especially in unseen scenarios, which significantly outperforms existing SOTA algorithms.

Solving Motion Planning Tasks with a Scalable Generative Model

Authors:Yihan Hu, Siqi Chai, Zhening Yang, Jingyu Qian, Kun Li, Wenxin Shao, Haichao Zhang, Wei Xu, Qiang Liu
Date:2024-07-03 03:57:05

As autonomous driving systems are being deployed to millions of vehicles, there is a pressing need to improve the system's scalability and safety and to reduce engineering cost. A realistic, scalable, and practical simulator of the driving world is highly desired. In this paper, we present an efficient solution based on generative models which learns the dynamics of the driving scenes. With this model, we can not only simulate the diverse futures of a given driving scenario but also generate a variety of driving scenarios conditioned on various prompts. Our innovative design allows the model to operate in both full-Autoregressive and partial-Autoregressive modes, significantly improving inference and training speed without sacrificing generative capability. This efficiency makes it ideal for being used as an online reactive environment for reinforcement learning, an evaluator for planning policies, and a high-fidelity simulator for testing. We evaluated our model against two real-world datasets: the Waymo motion dataset and the nuPlan dataset. On the simulation realism and scene generation benchmark, our model achieves state-of-the-art performance. In the planning benchmarks, our planner outperforms prior art. We conclude that the proposed generative model may serve as a foundation for a variety of motion planning tasks, including data generation, simulation, planning, and online training. Source code is public at https://github.com/HorizonRobotics/GUMP/

The path towards contact-based physical human-robot interaction

Authors:Mohammad Farajtabar, Marie Charbonneau
Date:2024-07-02 20:52:42

With the advancements in human-robot interaction (HRI), robots are now capable of operating in close proximity and engaging in physical interactions with humans (pHRI). Likewise, contact-based pHRI is becoming increasingly common as robots are equipped with a range of sensors to perceive human motions. Despite the presence of surveys exploring various aspects of HRI and pHRI, there is presently a gap in comprehensive studies that collect, organize and relate developments across all aspects of contact-based pHRI. It has become challenging to gain a comprehensive understanding of the current state of the field, thoroughly analyze the aspects that have been covered, and identify areas needing further attention. The present survey addresses this gap. While it includes key developments in pHRI, a particular focus is placed on contact-based interaction, which has numerous applications in industrial, rehabilitation and medical robotics. Across the literature, a common denominator is the importance of establishing a safe, compliant and human intention-oriented interaction. This endeavour encompasses aspects of perception, planning and control, and how they work together to enhance safety and reliability. Notably, the survey highlights the application of data-driven techniques: backed by a growing body of literature demonstrating their effectiveness, approaches like reinforcement learning and learning from demonstration have become key to improving robot perception and decision-making within complex and uncertain pHRI scenarios. As the field is still in its early stages, these observations may help guide future developments and steer research towards the responsible integration of physically interactive robots into workplaces, public spaces, and elements of private life.

PWM: Policy Learning with Large World Models

Authors:Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg
Date:2024-07-02 17:47:03

Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods. We introduce Policy learning with large World Models (PWM), a novel model-based RL algorithm that learns continuous control policies from large multi-task world models. By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning. Visualizations and code available at https://www.imgeorgiev.com/pwm

Physics-Informed Model and Hybrid Planning for Efficient Dyna-Style Reinforcement Learning

Authors:Zakariae El Asri, Olivier Sigaud, Nicolas Thome
Date:2024-07-02 12:32:57

Applying reinforcement learning (RL) to real-world applications requires addressing a trade-off between asymptotic performance, sample efficiency, and inference time. In this work, we demonstrate how to address this triple challenge by leveraging partial physical knowledge about the system dynamics. Our approach involves learning a physics-informed model to boost sample efficiency and generating imaginary trajectories from this model to learn a model-free policy and Q-function. Furthermore, we propose a hybrid planning strategy, combining the learned policy and Q-function with the learned model to enhance time efficiency in planning. Through practical demonstrations, we illustrate that our method improves the compromise between sample efficiency, time efficiency, and performance over state-of-the-art methods.

Universal Plans: One Action Sequence to Solve Them All!

Authors:Kalle G. Timperi, Alexander J. LaValle, Steven M. LaValle
Date:2024-07-02 09:26:21

This paper introduces the notion of a universal plan, which when executed, is guaranteed to solve all planning problems in a category, regardless of the obstacles, initial state, and goal set. Such plans are specified as a deterministic sequence of actions that are blindly applied without any sensor feedback. Thus, they can be considered as pure exploration in a reinforcement learning context, and we show that with basic memory requirements, they even yield optimal plans. Building upon results in number theory and theory of automata, we provide universal plans both for discrete and continuous (motion) planning and prove their (semi)completeness. The concepts are applied and illustrated through simulation studies, and several directions for future research are sketched.

Generation of Geodesics with Actor-Critic Reinforcement Learning to Predict Midpoints

Authors:Kazumi Kasaura
Date:2024-07-02 07:06:49

To find the shortest paths for all pairs of points on manifolds with infinitesimally defined metrics, we propose to generate them by recursively predicting midpoints, and we propose an actor-critic method to learn midpoint prediction. We prove the soundness of our approach and show experimentally that the proposed method outperforms existing methods on both local and global path planning tasks.
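
The recursive midpoint idea described in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, and the learned actor-critic midpoint predictor is replaced by an exact Euclidean midpoint purely as a placeholder.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, ...]


def euclidean_midpoint(a: Point, b: Point) -> Point:
    """Placeholder predictor (assumption): exact midpoint in flat Euclidean space."""
    return tuple((x + y) / 2.0 for x, y in zip(a, b))


def generate_path(start: Point, goal: Point,
                  midpoint: Callable[[Point, Point], Point],
                  depth: int) -> List[Point]:
    """Return 2**depth + 1 waypoints by recursively inserting predicted midpoints."""
    if depth == 0:
        return [start, goal]
    mid = midpoint(start, goal)
    left = generate_path(start, mid, midpoint, depth - 1)
    right = generate_path(mid, goal, midpoint, depth - 1)
    # The midpoint ends `left` and starts `right`, so drop one copy.
    return left + right[1:]


path = generate_path((0.0, 0.0), (4.0, 8.0), euclidean_midpoint, depth=2)
```

On a curved manifold, a learned predictor would be passed in place of `euclidean_midpoint`; the recursion itself would stay the same.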

Research on Autonomous Robots Navigation based on Reinforcement Learning

Authors:Zixiang Wang, Hao Yan, Yining Wang, Zhengjia Xu, Zhuoyue Wang, Zhizhong Wu
Date:2024-07-02 00:44:06

Reinforcement learning continuously optimizes decision-making based on real-time feedback reward signals through continuous interaction with the environment, demonstrating strong adaptive and self-learning capabilities. In recent years, it has become one of the key methods to achieve autonomous navigation of robots. In this work, an autonomous robot navigation method based on reinforcement learning is introduced. We use the Deep Q Network (DQN) and Proximal Policy Optimization (PPO) models to optimize the path planning and decision-making process through the continuous interaction between the robot and the environment, and the reward signals with real-time feedback. By combining the Q-value function with a deep neural network, the deep Q network can handle high-dimensional state spaces and thereby realize path planning in complex environments. Proximal policy optimization is a policy gradient-based method, which enables robots to explore and utilize environmental information more efficiently by optimizing policy functions. These methods not only improve the robot's navigation ability in unknown environments, but also enhance its adaptive and self-learning capabilities. Through multiple training and simulation experiments, we have verified the effectiveness and robustness of these models in various complex scenarios.
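
For readers unfamiliar with the two methods named in this abstract, the following sketch shows the textbook quantities at their core: the one-step temporal-difference target that a DQN regresses towards, and the clipped surrogate term that PPO maximizes per sample. The function names and default hyperparameters are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def td_target(reward: float, next_q_values: Sequence[float],
              gamma: float = 0.99, done: bool = False) -> float:
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward  # no bootstrapping past a terminal state
    return reward + gamma * max(next_q_values)


def ppo_clipped_term(ratio: float, advantage: float,
                     clip_eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Here `ratio` stands for the probability of the taken action under the new policy divided by its probability under the old policy; clipping removes the incentive to move that ratio far from 1 in a single update.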

Contractual Reinforcement Learning: Pulling Arms with Invisible Hands

Authors:Jibang Wu, Siyu Chen, Mengdi Wang, Huazheng Wang, Haifeng Xu
Date:2024-07-01 16:53:00

The agency problem emerges in today's large scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning economic interests of different stakeholders in the online learning problems through contract design. The problem, termed \emph{contractual reinforcement learning}, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent's action policy for their common interests through a set of payment rules contingent on the realization of next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms to untangle the challenges from robust design of contracts to the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\tilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\tilde{O}(T^{2/3})$ for the general problem that improves the existing analysis in online contract design with mild technical assumptions.

Let Hybrid A* Path Planner Obey Traffic Rules: A Deep Reinforcement Learning-Based Planning Framework

Authors:Xibo Li, Shruti Patel, Christof Büskens
Date:2024-07-01 12:00:10

Deep reinforcement learning (DRL) allows a system to interact with its environment and take actions by training an efficient policy that maximizes self-defined rewards. In autonomous driving, it can be used as a strategy for high-level decision making, whereas low-level algorithms such as the hybrid A* path planning have proven their ability to solve the local trajectory planning problem. In this work, we combine these two methods where the DRL makes high-level decisions such as lane change commands. After obtaining the lane change command, the hybrid A* planner is able to generate a collision-free trajectory to be executed by a model predictive controller (MPC). In addition, the DRL algorithm is able to keep the lane change command consistent within a chosen time-period. Traffic rules are implemented using linear temporal logic (LTL), which is then utilized as a reward function in DRL. Furthermore, we validate the proposed method on a real system to demonstrate its feasibility from simulation to implementation on real hardware.

Residual-MPPI: Online Policy Customization for Continuous Control

Authors:Pengcheng Wang, Chenran Li, Catherine Weaver, Kenta Kawamoto, Masayoshi Tomizuka, Chen Tang, Wei Zhan
Date:2024-07-01 01:53:07

Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have demonstrated significant potential in achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when there are additional requirements that were unforeseen during the original training phase. It is possible to fine-tune the policy to meet the new requirements, but this often requires collecting new data with the added requirements and access to the original training metric and policy parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the necessity for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at the execution time which we call Residual-MPPI. It is able to customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings. Also, Residual-MPPI only requires access to the action distribution produced by the prior policy, without additional knowledge regarding the original task. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent, Gran Turismo Sophy (GT Sophy) 1.0, in the challenging car racing scenario, Gran Turismo Sport (GTS) environment. Demo videos are available on our website: https://sites.google.com/view/residual-mppi

Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints

Authors:Jianuo Huang
Date:2024-06-30 16:05:31

In recent advancements in Multi-agent Reinforcement Learning (MARL), its application has extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action. Our framework is grounded in the Centralized Training with Decentralized Execution (CTDE) architecture, augmented by a Diffusion Model for prediction trajectory generation. Additionally, we incorporate a specialized algorithm to further ensure operational safety. We evaluate our model against baselines on the DSRL benchmark. Experiment results demonstrate that our model not only adheres to stringent safety constraints but also achieves superior performance compared to existing methodologies. This underscores the potential of our approach in advancing the safety and efficacy of MARL in real-world applications.

Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning

Authors:Bradley Burega, John D. Martin, Luke Kapeluck, Michael Bowling
Date:2024-06-27 22:24:46

We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes a probability with which states are queried during Dyna-style planning. Our study compares the aggregate, empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves efficiency of the planning process, which, as a consequence, improves the sample-efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and those that stall credit assignment. We believe these findings could prove useful, in future work, for designing model-based RL systems at scale.
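
To make the role of search control concrete, here is a minimal tabular sketch of one Dyna-style planning update in which the state to update is drawn from a search-control distribution. It assumes a deterministic learned model stored as a dictionary, and it holds the sampling probabilities fixed, whereas the paper tunes them online with meta-gradients. All names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

State = int
Action = int
# Hypothetical deterministic model: (state, action) -> (reward, next_state)
Model = Dict[Tuple[State, Action], Tuple[float, State]]
QTable = Dict[Tuple[State, Action], float]


def planning_step(q: QTable, model: Model, search_probs: Dict[State, float],
                  actions: List[Action], rng: random.Random,
                  alpha: float = 0.5, gamma: float = 0.9) -> None:
    """One Dyna-style planning update from a state drawn by search control."""
    states = list(search_probs)
    weights = [search_probs[s] for s in states]
    s = rng.choices(states, weights=weights, k=1)[0]
    a = rng.choice(actions)
    if (s, a) not in model:
        return  # the model has no prediction for this pair yet
    reward, s_next = model[(s, a)]
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (reward + gamma * best_next - old)


# Two-state toy model; search control puts all probability on state 0.
model: Model = {(0, 0): (1.0, 1), (1, 0): (0.0, 1)}
q: QTable = {}
rng = random.Random(0)
for _ in range(2):
    planning_step(q, model, {0: 1.0, 1: 0.0}, [0], rng)
```

Shifting probability mass in `search_probs` changes which simulated transitions are used for updates, which is the quantity a meta-gradient method would adapt.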

Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs

Authors:Tian Tian, Lin F. Yang, Csaba Szepesvári
Date:2024-06-26 17:57:13

The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives while maximizing cumulative reward. However, the current understanding of how to learn efficiently in a CMDP environment with a potentially infinite number of states remains under investigation, particularly when function approximation is applied to the value functions. In this paper, we address the learning problem given linear function approximation with $q_{\pi}$-realizability, where the value functions of all policies are linearly representable with a known feature map, a setting known to be more general and challenging than other linear settings. Utilizing a local-access model, we propose a novel primal-dual algorithm that, after $\tilde{O}(\text{poly}(d) \epsilon^{-3})$ queries, outputs with high probability a policy that strictly satisfies the constraints while nearly optimizing the value with respect to a reward function. Here, $d$ is the feature dimension and $\epsilon > 0$ is a given error. The algorithm relies on a carefully crafted off-policy evaluation procedure to evaluate the policy using historical data, which informs policy updates through policy gradients and conserves samples. To our knowledge, this is the first result achieving polynomial sample complexity for CMDP in the $q_{\pi}$-realizable setting.
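
As background for the primal-dual approach mentioned in this abstract, the generic constrained problem and its Lagrangian can be written as follows. This is the standard textbook formulation rather than the paper's exact notation: the initial state $s_0$, the reward and constraint value functions $V_r^{\pi}$ and $V_c^{\pi}$, the threshold $b$, and the weight vector $w_{\pi}$ are symbols assumed here for illustration.

```latex
% Constrained objective (generic form, assumed notation)
\max_{\pi} \; V_r^{\pi}(s_0) \quad \text{subject to} \quad V_c^{\pi}(s_0) \ge b

% Lagrangian relaxation solved by primal-dual methods
\max_{\pi} \, \min_{\lambda \ge 0} \; L(\pi, \lambda)
  = V_r^{\pi}(s_0) + \lambda \left( V_c^{\pi}(s_0) - b \right)

% q_pi-realizability with a known feature map \phi(s, a) \in \mathbb{R}^d
q^{\pi}(s, a) = \langle \phi(s, a), w_{\pi} \rangle \quad \text{for every policy } \pi
```

A primal step improves $\pi$ for the current multiplier $\lambda$, and a dual step increases $\lambda$ whenever the constraint is violated.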

Human-Object Interaction from Human-Level Instructions

Authors:Zhen Wu, Jiaman Li, Pei Xu, C. Karen Liu
Date:2024-06-25 17:46:28

Intelligent agents must autonomously interact with the environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for real-world applications.

BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO

Authors:Sebastian Dittert, Vincent Moens, Gianni De Fabritiis
Date:2024-06-25 12:17:44

We present BricksRL, a platform designed to democratize access to robotics for reinforcement learning research and education. BricksRL facilitates the creation, design, and training of custom LEGO robots in the real world by interfacing them with the TorchRL library for reinforcement learning agents. The integration of TorchRL with the LEGO hubs, via Bluetooth bidirectional communication, enables state-of-the-art reinforcement learning training on GPUs for a wide variety of LEGO builds. This offers a flexible and cost-efficient approach for scaling and also provides a robust infrastructure for robot-environment-algorithm communication. We present various experiments across tasks and robot configurations, providing built plans and training results. Furthermore, we demonstrate that inexpensive LEGO robots can be trained end-to-end in the real world to achieve simple tasks, with training times typically under 120 minutes on a normal laptop. Moreover, we show how users can extend the capabilities, exemplified by the successful integration of non-LEGO sensors. By enhancing accessibility to both robotics and reinforcement learning, BricksRL establishes a strong foundation for democratized robotic learning in research and educational settings.

Performance Comparison of Deep RL Algorithms for Mixed Traffic Cooperative Lane-Changing

Authors:Xue Yao, Shengren Hou, Serge P. Hoogendoorn, Simeon C. Calvert
Date:2024-06-25 07:49:25

Lane-changing (LC) is a challenging scenario for connected and automated vehicles (CAVs) because of the complex dynamics and high uncertainty of the traffic environment. This challenge can be handled by deep reinforcement learning (DRL) approaches, leveraging their data-driven and model-free nature. Our previous work proposed a cooperative lane-changing in mixed traffic (CLCMT) mechanism based on TD3 to facilitate an optimal lane-changing strategy. This study enhances the current CLCMT mechanism by considering both the uncertainty of the human-driven vehicles (HVs) and the microscopic interactions between HVs and CAVs. The state-of-the-art (SOTA) DRL algorithms DDPG, TD3, SAC, and PPO are used to solve the formulated MDP with continuous actions. A performance comparison among the four DRL algorithms demonstrates that the DDPG, TD3, and PPO algorithms can deal with uncertainty in traffic environments and learn well-performing LC strategies in terms of safety, efficiency, comfort, and ecology. The PPO algorithm outperforms the other three algorithms, achieving a higher reward, fewer exploration mistakes and crashes, and a more comfortable and ecological LC strategy. These improvements promise the CLCMT mechanism greater advantages in the LC motion planning of CAVs.

Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making

Authors:Vivek Myers, Chongyi Zheng, Anca Dragan, Sergey Levine, Benjamin Eysenbach
Date:2024-06-24 19:36:45

Temporal distances lie at the heart of many algorithms for planning, control, and reinforcement learning that involve reaching goals, allowing one to estimate the transit time between two states. However, prior attempts to define such temporal distances in stochastic settings have been stymied by an important limitation: these prior approaches do not satisfy the triangle inequality. This is not merely a definitional concern, but translates to an inability to generalize and find shortest paths. In this paper, we build on prior work in contrastive learning and quasimetrics to show how successor features learned by contrastive learning (after a change of variables) form a temporal distance that does satisfy the triangle inequality, even in stochastic settings. Importantly, this temporal distance is computationally efficient to estimate, even in high-dimensional and stochastic settings. Experiments in controlled settings and benchmark suites demonstrate that an RL algorithm based on these new temporal distances exhibits combinatorial generalization (i.e., "stitching") and can sometimes learn more quickly than prior methods, including those based on quasimetrics.

Probabilistic Subgoal Representations for Hierarchical Reinforcement learning

Authors:Vivienne Huiling Wang, Tinghuai Wang, Wenyan Yang, Joni-Kristian Kämäräinen, Joni Pajarinen
Date:2024-06-24 15:09:22

In goal-conditioned hierarchical reinforcement learning (HRL), a high-level policy specifies a subgoal for the low-level policy to reach. Effective HRL hinges on a suitable subgoal representation function, abstracting state space into latent subgoal space and inducing varied low-level behaviors. Existing methods adopt a subgoal representation that provides a deterministic mapping from state space to latent subgoal space. Instead, this paper utilizes Gaussian Processes (GPs) for the first probabilistic subgoal representation. Our method employs a GP prior on the latent subgoal space to learn a posterior distribution over the subgoal representation functions while exploiting the long-range correlation in the state space through learnable kernels. This enables an adaptive memory that integrates long-range subgoal information from prior planning steps, allowing the agent to cope with stochastic uncertainties. Furthermore, we propose a novel learning objective to facilitate the simultaneous learning of probabilistic subgoal representations and policies within a unified framework. In experiments, our approach outperforms state-of-the-art baselines not only in standard benchmarks but also in environments with stochastic elements and under diverse reward conditions. Additionally, our model shows promising capabilities in transferring low-level policies across different tasks.

Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning

Authors:Erin J. Talvitie, Zilei Shao, Huiying Li, Jinghan Hu, Jacob Boerma, Rory Zhao, Xintong Wang
Date:2024-06-23 04:23:15

In model-based reinforcement learning, simulated experiences from the learned model are often treated as equivalent to experience from the real environment. However, when the model is inaccurate, it can catastrophically interfere with policy learning. Alternatively, the agent might learn about the model's accuracy and selectively use it only when it can provide reliable predictions. We empirically explore model uncertainty measures for selective planning and show that best results require distribution insensitive inference to estimate the uncertainty over model-based updates. To that end, we propose and evaluate bounding-box inference, which operates on bounding-boxes around sets of possible states and other quantities. We find that bounding-box inference can reliably support effective selective planning.

Learning Abstract World Model for Value-preserving Planning with Options

Authors:Rafael Rodriguez-Sanchez, George Konidaris
Date:2024-06-22 13:41:02

General-purpose agents require fine-grained controls and rich sensory inputs to perform a wide range of tasks. However, this complexity often leads to intractable decision-making. Traditionally, agents are provided with task-specific action and observation spaces to mitigate this challenge, but this reduces autonomy. Instead, agents must be capable of building state-action spaces at the correct abstraction level from their sensorimotor experiences. We leverage the structure of a given set of temporally-extended actions to learn abstract Markov decision processes (MDPs) that operate at a higher level of temporal and state granularity. We characterize state abstractions necessary to ensure that planning with these skills, by simulating trajectories in the abstract MDP, results in policies with bounded value loss in the original MDP. We evaluate our approach in goal-based navigation environments that require continuous abstract states to plan successfully and show that abstract model learning improves the sample efficiency of planning and learning.

Deep UAV Path Planning with Assured Connectivity in Dense Urban Setting

Authors:Jiyong Oh, Syed M. Raza, Lusungu J. Mwasinga, Moonseong Kim, Hyunseung Choo
Date:2024-06-21 15:10:25

Unmanned Aerial Vehicle (UAV) services with 5G connectivity are an emerging field with numerous applications. Operator-controlled UAV flights and manual static flight configurations are major limitations for the wide adoption and scalability of UAV services. Several services depend on excellent UAV connectivity with a cellular network, and maintaining it is challenging on predetermined flight paths. This paper addresses these limitations by proposing a Deep Reinforcement Learning (DRL) framework for UAV path planning with assured connectivity (DUPAC). During UAV flight, DUPAC determines the best route from a defined source to the destination in terms of distance and signal quality. The viability and performance of DUPAC are evaluated under simulated real-world urban scenarios using the Unity framework. The results confirm that DUPAC achieves an autonomous UAV flight path similar to that of the base method, with only a 2% increment, while maintaining on average 9% better connection quality throughout the flight.

HYPERmotion: Learning Hybrid Behavior Planning for Autonomous Loco-manipulation

Authors:Jin Wang, Rui Dai, Weijie Wang, Luca Rossini, Francesco Ruscelli, Nikos Tsagarakis
Date:2024-06-20 18:21:24

Enabling robots to autonomously perform hybrid motions in diverse environments can be beneficial for long-horizon tasks such as material handling, household chores, and work assistance. This requires extensive exploitation of intrinsic motion capabilities, extraction of affordances from rich environmental information, and planning of physical interaction behaviors. Although recent progress has demonstrated impressive humanoid whole-body control abilities, existing methods still struggle to achieve versatility and adaptability for new tasks. In this work, we propose HYPERmotion, a framework that learns, selects and plans behaviors based on tasks in different scenarios. We combine reinforcement learning with whole-body optimization to generate motion for 38 actuated joints and create a motion library to store the learned skills. We apply the planning and reasoning features of large language models (LLMs) to complex loco-manipulation tasks, constructing a hierarchical task graph that comprises a series of primitive behaviors to bridge lower-level execution with higher-level planning. We leverage the interaction of distilled spatial geometry and 2D observation with a visual language model (VLM) to ground knowledge into a robotic morphology selector that chooses appropriate actions in single- or dual-arm, legged or wheeled locomotion. Experiments in simulation and the real world show that learned motions can efficiently adapt to new tasks, demonstrating high autonomy from free-text commands in unstructured scenes. Videos and website: hy-motion.github.io/

ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation

Authors:Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, Yi Wu
Date:2024-06-20 08:04:07

Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts parallelization strategies during training. Building upon this idea, we introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training given the desired algorithmic and hardware configurations. ReaLHF formulates the execution plan for RLHF as an augmented dataflow graph. Based on this formulation, ReaLHF employs a tailored search algorithm with a lightweight cost estimator to discover an efficient execution plan. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaLHF on the LLaMA-2 models with up to $4\times70$ billion parameters and 128 GPUs. The experiment results showcase ReaLHF's substantial speedups of $2.0-10.6\times$ compared to baselines. Furthermore, the execution plans generated by ReaLHF exhibit an average of $26\%$ performance improvement over heuristic approaches based on Megatron-LM. The source code of ReaLHF is publicly available at https://github.com/openpsi-project/ReaLHF .

Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing

Authors:Xinbo Zhao, Yingxue Zhang, Xin Zhang, Yu Yang, Yiqun Xie, Yanhua Li, Jun Luo
Date:2024-06-20 07:24:24

Enhancing diverse human decision-making processes in an urban environment is a critical issue across various applications, including ride-sharing vehicle dispatching, public transportation management, and autonomous driving. Offline reinforcement learning (RL) is a promising approach to learn and optimize human urban strategies (or policies) from pre-collected human-generated spatial-temporal urban data. However, standard offline RL faces two significant challenges: (1) data scarcity and data heterogeneity, and (2) distributional shift. In this paper, we introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach. MODA addresses the challenges of data scarcity and heterogeneity in a multi-task urban setting through Contrastive Data Sharing among tasks. This technique involves extracting latent representations of human behaviors by contrasting positive and negative data pairs. It then shares data presenting similar representations with the target task, facilitating data augmentation for each task. Moreover, MODA develops a novel model-based multi-task offline RL algorithm. This algorithm constructs a robust Markov Decision Process (MDP) by integrating a dynamics model with a Generative Adversarial Network (GAN). Once the robust MDP is established, any online RL or planning algorithm can be applied. Extensive experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA. The results demonstrate that MODA exhibits significant improvements compared to state-of-the-art baselines, showcasing its capability in advancing urban decision-making processes. We also made our code available to the research community.

Learned Graph Rewriting with Equality Saturation: A New Paradigm in Relational Query Rewrite and Beyond

Authors:George-Octavian Bărbulescu, Taiyi Wang, Zak Singh, Eiko Yoneki
Date:2024-06-19 21:11:19

Query rewrite systems perform graph substitutions using rewrite rules to generate optimal SQL query plans. Rewriting logical and physical relational query plans is proven to be an NP-hard sequential decision-making problem with a search space exponential in the number of rewrite rules. In this paper, we address the query rewrite problem by interleaving Equality Saturation and Graph Reinforcement Learning (RL). The proposed system, Aurora, rewrites relational queries by guiding Equality Saturation, a method from compiler literature to perform non-destructive graph rewriting, with a novel RL agent that embeds both the spatial structure of the query graph as well as the temporal dimension associated with the sequential construction of query plans. Our results show Graph Reinforcement Learning for non-destructive graph rewriting yields SQL plans orders of magnitude faster than existing equality saturation solvers, while also achieving competitive results against mainstream query optimisers.

Improving GFlowNets with Monte Carlo Tree Search

Authors:Nikita Morozov, Daniil Tiapkin, Sergey Samsonov, Alexey Naumov, Dmitry Vetrov
Date:2024-06-19 15:58:35

Generative Flow Networks (GFlowNets) treat sampling from distributions over compositional discrete spaces as a sequential decision-making problem, training a stochastic policy to construct objects step by step. Recent studies have revealed strong connections between GFlowNets and entropy-regularized reinforcement learning. Building on these insights, we propose to enhance planning capabilities of GFlowNets by applying Monte Carlo Tree Search (MCTS). Specifically, we show how the MENTS algorithm (Xiao et al., 2019) can be adapted for GFlowNets and used during both training and inference. Our experiments demonstrate that this approach improves the sample efficiency of GFlowNet training and the generation fidelity of pre-trained GFlowNet models.
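
As a rough illustration of the entropy-regularized connection mentioned above, the sketch below computes the softmax ("soft") value backup that maximum-entropy tree search methods such as MENTS use in place of a hard max, together with the Boltzmann policy it induces. The function names and the `temperature` parameter are illustrative assumptions; this is not the authors' GFlowNet adaptation.

```python
import math

def soft_value(q_values, temperature=1.0):
    """Soft state value: tau * log(sum(exp(Q / tau))), computed stably."""
    m = max(q_values)  # subtract the max before exponentiating
    return m + temperature * math.log(
        sum(math.exp((q - m) / temperature) for q in q_values)
    )

def soft_policy(q_values, temperature=1.0):
    """Boltzmann policy induced by the soft value: pi(a) = exp((Q(a) - V) / tau)."""
    v = soft_value(q_values, temperature)
    return [math.exp((q - v) / temperature) for q in q_values]

# Hypothetical Q-value estimates for the children of one tree node.
q = [1.0, 2.0, 0.5]
print(soft_value(q), soft_policy(q))
```

As the temperature approaches zero the soft value approaches the ordinary maximum, so the hard-max backup is recovered as a limiting case.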

Tactile Aware Dynamic Obstacle Avoidance in Crowded Environment with Deep Reinforcement Learning

Authors:Yung Chuen Ng, Qi Wen Lim, Chun Ye Tan, Zhen Hao Gan, Meng Yee Chuah
Date:2024-06-19 10:50:04

Mobile robots operating in crowded environments require the ability to navigate among humans and surrounding obstacles efficiently while adhering to safety standards and socially compliant mannerisms. This scale of the robot navigation problem may be classified as both a local path planning and a trajectory optimization problem. This work presents an array of force sensors that acts as a tactile layer to complement a LiDAR, making the robot aware of contact with surrounding objects in its immediate vicinity that the LiDAR does not detect. By incorporating the tactile layer, the robot can take more risks in its movements and possibly go right up to an obstacle or wall, and gently squeeze past it. In addition, we built a simulation platform in PyBullet that integrates the Robot Operating System (ROS) with reinforcement learning (RL). A touch-aware neural network model was trained on it to create an RL-based local path planner for dynamic obstacle avoidance. Our proposed method was demonstrated successfully on an omni-directional mobile robot that was able to navigate in a crowded environment with high agility and versatility in movement, while not being overly sensitive to nearby obstacles that are not in contact.

Act Better by Timing: A timing-Aware Reinforcement Learning for Autonomous Driving

Authors:Guanzhou Li, Jianping Wu, Yujing He
Date:2024-06-19 05:25:15

Autonomous vehicles inevitably encounter a vast array of scenarios in real-world environments. Addressing long-tail scenarios, particularly those involving intensive interactions with numerous traffic participants, remains one of the most significant challenges in achieving high-level autonomous driving. Reinforcement learning (RL) offers a promising solution for such scenarios and allows autonomous vehicles to continuously self-evolve during interactions. However, traditional RL often requires trial and error from scratch in new scenarios, resulting in inefficient exploration of unknown states. Integrating RL with planning-based methods can significantly accelerate the learning process. Additionally, conventional RL methods lack robust safety mechanisms, making agents prone to collisions in dynamic environments in pursuit of short-term rewards. Many existing safe RL methods depend on environment modeling to identify reliable safety boundaries for constraining agent behavior. However, explicit environmental models can fail to capture the complexity of dynamic environments comprehensively. Inspired by the observation that human drivers rarely take risks in uncertain situations, this study introduces the concept of action timing and proposes a timing-aware RL method. In this approach, a "timing imagination" process previews the execution results of the agent's strategies at different time scales. The optimal execution timing is then projected to each decision moment, generating a dynamic safety factor to constrain actions. A planning-based method serves as a conservative baseline strategy in uncertain states. In two representative interaction scenarios, an unsignalized intersection and a roundabout, the proposed model outperforms the benchmark models in driving safety.

Online Pareto-Optimal Decision-Making for Complex Tasks using Active Inference

Authors:Peter Amorese, Shohei Wakayama, Nisar Ahmed, Morteza Lahijanian
Date:2024-06-17 18:03:45

When a robot autonomously performs a complex task, it frequently must balance competing objectives while maintaining safety. This becomes more difficult in uncertain environments with stochastic outcomes. Enhancing transparency in the robot's behavior and aligning with user preferences are also crucial. This paper introduces a novel framework for multi-objective reinforcement learning that ensures safe task execution, optimizes trade-offs between objectives, and adheres to user preferences. The framework has two main layers: a multi-objective task planner and a high-level selector. The planning layer generates a set of optimal trade-off plans that guarantee satisfaction of a temporal logic task. The selector uses active inference to decide which generated plan best complies with user preferences and aids learning. Operating iteratively, the framework updates a parameterized learning model based on collected data. Case studies and benchmarks on both manipulation and mobile robots show that our framework outperforms other methods and (i) learns multiple optimal trade-offs, (ii) adheres to a user preference, and (iii) allows the user to adjust the balance between (i) and (ii).

Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Authors:Kenneth Li, Yiming Wang, Fernanda Viégas, Martin Wattenberg
Date:2024-06-17 18:01:32

We present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4's performance. We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.

Intersymbolic AI: Interlinking Symbolic AI and Subsymbolic AI

Authors:André Platzer
Date:2024-06-17 14:01:59

This perspective piece calls for the study of the new field of Intersymbolic AI, by which we mean the combination of symbolic AI, whose building blocks have inherent significance/meaning, with subsymbolic AI, whose entirety creates significance/effect despite the fact that individual building blocks escape meaning. Canonical kinds of symbolic AI are logic, games and planning. Canonical kinds of subsymbolic AI are (un)supervised machine and reinforcement learning. Intersymbolic AI interlinks the worlds of symbolic AI with its compositional symbolic significance and meaning and of subsymbolic AI with its summative significance or effect to enable culminations of insights from both worlds by going between and across symbolic AI insights with subsymbolic AI techniques that are being helped by symbolic AI principles. For example, Intersymbolic AI may start with symbolic AI to understand a dynamic system, continue with subsymbolic AI to learn its control, and end with symbolic AI to safely use the outcome of the learned subsymbolic AI controller in the dynamic system. The way Intersymbolic AI combines both symbolic and subsymbolic AI to increase the effectiveness of AI compared to either kind of AI alone is likened to the way that the combination of both conscious and subconscious thought increases the effectiveness of human thought compared to either kind of thought alone. Some successful contributions to the Intersymbolic AI paradigm are surveyed here but many more are considered possible by advancing Intersymbolic AI.

Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction

Authors:Zepeng Ding, Ruiyang Ke, Wenhao Huang, Guochao Jiang, Yanda Li, Deqing Yang, Jiaqing Liang
Date:2024-06-17 12:11:01

Existing research on large language models (LLMs) shows that they can solve information extraction tasks through multi-step planning. However, their extraction behavior on complex sentences and tasks is unstable, with issues such as false positives and missing elements emerging. We observe that decomposing complex extraction tasks and extracting them step by step can effectively improve LLMs' performance, and that the extraction order of entities significantly affects the final results of LLMs. This paper proposes a two-stage multi-step method for LLM-based information extraction and adopts the RL framework to execute the multi-step planning. We regard sequential extraction as a Markov decision process, build an LLM-based extraction environment, design a decision module to adaptively provide the optimal order for sequential entity extraction on different sentences, and utilize the DDQN algorithm to train the decision model. We also design rewards and evaluation metrics suitable for the extraction results of LLMs. We conduct extensive experiments on multiple public datasets to demonstrate the effectiveness of our method in improving the information extraction capabilities of LLMs.
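
The abstract does not spell out the training update. Assuming DDQN here refers to Double DQN, the sketch below shows the standard bootstrapped target: the online network selects the next action and the target network evaluates it. All argument names are hypothetical, and the Q-values are passed in as plain nested lists rather than produced by a real network.

```python
def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Return one Double DQN regression target per transition in the batch."""
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        # Action selection uses the online network's estimates...
        best_action = max(range(len(q_on)), key=lambda a: q_on[a])
        # ...while the value of that action comes from the target network.
        bootstrap = 0.0 if done else gamma * q_tg[best_action]
        targets.append(r + bootstrap)
    return targets

# Two hypothetical transitions; the second one ends its episode.
print(double_dqn_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    q_online_next=[[1.0, 2.0], [3.0, 0.0]],
    q_target_next=[[10.0, 20.0], [30.0, 40.0]],
    gamma=0.5,
))
```

Decoupling selection from evaluation is what distinguishes this target from the plain DQN target, which takes the maximum over the target network's own estimates.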

Multi-UAV Multi-RIS QoS-Aware Aerial Communication Systems using DRL and PSO

Authors:Marwan Dhuheir, Aiman Erbad, Ala Al-Fuqaha, Mohsen Guizani
Date:2024-06-16 17:53:56

Recently, Unmanned Aerial Vehicles (UAVs) have attracted the attention of researchers in academia and industry for providing wireless services to ground users in diverse scenarios like festivals, large sporting events, natural and man-made disasters due to their advantages in terms of versatility and maneuverability. However, the limited resources of UAVs (e.g., energy budget and different service requirements) can pose challenges for adopting UAVs for such applications. Our system model considers a UAV swarm that navigates an area, providing wireless communication to ground users with RIS support to improve the coverage of the UAVs. In this work, we introduce an optimization model with the aim of maximizing the throughput and UAV coverage through optimal path planning of UAVs and multi-RIS phase configurations. The formulated optimization is challenging to solve using standard linear programming techniques, limiting its applicability in real-time decision-making. Therefore, we introduce a two-step solution using deep reinforcement learning and particle swarm optimization. We conduct extensive simulations and compare our approach to two competitive solutions presented in the recent literature. Our simulation results demonstrate that our adopted approach is 20% better than the brute-force approach and 30% better than the baseline solution in terms of QoS.

UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Authors:Yuan Pu, Yazhe Niu, Zhenjie Yang, Jiyuan Ren, Hongsheng Li, Yu Liu
Date:2024-06-15 15:24:15

Learning predictive world models is crucial for enhancing the planning capabilities of reinforcement learning (RL) agents. Recently, MuZero-style algorithms, leveraging the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, these methods struggle to scale in heterogeneous scenarios with diverse dependencies and task variability. To overcome these limitations, we introduce UniZero, a novel approach that employs a modular transformer-based world model to effectively learn a shared latent space. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in the latent space. We show that UniZero significantly outperforms existing baselines in benchmarks that require long-term memory. Additionally, UniZero demonstrates superior scalability in multitask learning experiments conducted on Atari benchmarks. In standard single-task RL settings, such as Atari and DMControl, UniZero matches or even surpasses the performance of current state-of-the-art methods. Finally, extensive ablation studies and visual analyses validate the effectiveness and scalability of UniZero's design choices. Our code is available at https://github.com/opendilab/LightZero.

MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

Authors:Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li
Date:2024-06-15 10:47:36

Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio (TTA) technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing. In addition, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audio Content Planning, Generation, and Alignment (CPGA) framework is proposed, which includes a content planning module leveraging large language models for complex multi-modal prompt comprehension. Additionally, the training process is optimized using Proximal Policy Optimization-based reinforcement learning, significantly improving the alignment and auditory realism of generated foley audio. Experimental results demonstrate that our approach significantly advances the field of foley audio dubbing, providing robust solutions for the challenges of multi-modal dubbing. Even when utilizing the relatively lightweight GPT-2 model, our framework outperforms open-source multimodal large models such as LLaVA, DeepSeek-VL, and Moondream2. The dataset is available at https://github.com/borisfrb/MINT.

Mix Q-learning for Lane Changing: A Collaborative Decision-Making Method in Multi-Agent Deep Reinforcement Learning

Authors:Xiaojun Bi, Mingjie He, Yiwen Sun
Date:2024-06-14 06:44:19

Lane-changing decisions, which are crucial for autonomous vehicle path planning, face practical challenges due to rule-based constraints and limited data. Deep reinforcement learning has become a major research focus due to its advantages in data acquisition and interpretability. However, current models often overlook collaboration, which not only impacts overall traffic efficiency but also hinders the vehicle's own normal driving in the long run. To address this issue, this paper proposes a method named Mix Q-learning for Lane Changing (MQLC) that integrates a hybrid value Q network, taking into account both collective and individual benefits for the greater good. At the collective level, our method coordinates the individual Q and global Q networks by utilizing global information. This enables agents to effectively balance their individual interests with the collective benefit. At the individual level, we integrate a deep learning-based intent recognition module into our observation and enhance the decision network. These changes provide agents with richer decision information and more accurate feature extraction for improved lane-changing decisions. This strategy enables the multi-agent system to learn and formulate optimal decision-making strategies effectively. In extensive experiments, our MQLC model outperforms other state-of-the-art multi-agent decision-making methods, achieving significantly safer and faster lane-changing decisions.

BaSeNet: A Learning-based Mobile Manipulator Base Pose Sequence Planning for Pickup Tasks

Authors:Lakshadeep Naik, Sinan Kalkan, Sune L. Sørensen, Mikkel B. Kjærgaard, Norbert Krüger
Date:2024-06-12 21:31:32

In many applications, a mobile manipulator robot is required to grasp a set of objects distributed in space. This may not be feasible from a single base pose and the robot must plan the sequence of base poses for grasping all objects, minimizing the total navigation and grasping time. This is a Combinatorial Optimization problem that can be solved using exact methods, which provide optimal solutions but are computationally expensive, or approximate methods, which offer computationally efficient but sub-optimal solutions. Recent studies have shown that learning-based methods can solve Combinatorial Optimization problems, providing near-optimal and computationally efficient solutions. In this work, we present BASENET - a learning-based approach to plan the sequence of base poses for the robot to grasp all the objects in the scene. We propose a Reinforcement Learning based solution that learns the base poses for grasping individual objects and the sequence in which the objects should be grasped to minimize the total navigation and grasping costs using Layered Learning. As the problem has a varying number of states and actions, we represent states and actions as a graph and use Graph Neural Networks for learning. We show that the proposed method can produce comparable solutions to exact and approximate methods with significantly less computation time.

Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

Authors:Yuhui Wang, Qingyuan Wu, Weida Li, Dylan R. Ashley, Francesco Faccio, Chao Huang, Jürgen Schmidhuber
Date:2024-06-12 16:52:54

The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze -- a task which typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introducing an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on both 2D maze navigation environments and the ViZDoom 3D navigation benchmark. We find that our new method, named Dynamic Transition VIN (DT-VIN), easily scales to 5000 layers and casually solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments.
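
To make the link between planning depth and maze size concrete, here is a minimal tabular value iteration sketch on a small grid maze. It is plain value iteration with a fixed, hand-written transition model, not a VIN or DT-VIN (which learn a latent MDP end to end); the grid encoding, reward of 1 on reaching the goal, and default parameters are assumptions made for illustration.

```python
def value_iteration(grid, goal, gamma=0.95, iterations=200):
    """grid: list of strings ('#' wall, '.' free); goal: (row, col) terminal cell."""
    rows, cols = len(grid), len(grid[0])
    values = [[0.0] * cols for _ in range(rows)]
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(iterations):
        new_values = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] == '#' or (r, c) == goal:
                    continue  # walls and the terminal goal keep value 0
                best = float('-inf')
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    blocked = not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == '#'
                    if blocked:
                        nr, nc = r, c  # bumping into a wall leaves the agent in place
                    reward = 1.0 if (nr, nc) == goal else 0.0
                    best = max(best, reward + gamma * values[nr][nc])
                new_values[r][c] = best
        values = new_values
    return values

maze = ["...",
        ".#.",
        "..."]
print(value_iteration(maze, goal=(2, 2))[0][0])
```

Each sweep propagates goal information by one cell, so a start cell that is k moves from the goal stays at value 0 until at least k sweeps have run. In a VIN each sweep corresponds to one layer, which is why long mazes call for very deep planning modules.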

Efficient Adaptation in Mixed-Motive Environments via Hierarchical Opponent Modeling and Planning

Authors:Yizhe Huang, Anji Liu, Fanqi Kong, Yaodong Yang, Song-Chun Zhu, Xue Feng
Date:2024-06-12 08:48:06

Despite the recent successes of multi-agent reinforcement learning (MARL) algorithms, efficiently adapting to co-players in mixed-motive environments remains a significant challenge. One feasible approach is to hierarchically model co-players' behavior based on inferring their characteristics. However, these methods often encounter difficulties in efficient reasoning and utilization of inferred information. To address these issues, we propose Hierarchical Opponent modeling and Planning (HOP), a novel multi-agent decision-making algorithm that enables few-shot adaptation to unseen policies in mixed-motive environments. HOP is hierarchically composed of two modules: an opponent modeling module that infers others' goals and learns corresponding goal-conditioned policies, and a planning module that employs Monte Carlo Tree Search (MCTS) to identify the best response. Our approach improves efficiency by updating beliefs about others' goals both across and within episodes and by using information from the opponent modeling module to guide planning. Experimental results demonstrate that in mixed-motive environments, HOP exhibits superior few-shot adaptation capabilities when interacting with various unseen agents, and excels in self-play scenarios. Furthermore, the emergence of social intelligence during our experiments underscores the potential of our approach in complex multi-agent environments.

Hierarchical Reinforcement Learning for Swarm Confrontation with High Uncertainty

Authors:Qizhen Wu, Kexin Liu, Lei Chen, Jinhu Lü
Date:2024-06-12 05:12:10

In swarm robotics, confrontation including the pursuit-evasion game is a key scenario. High uncertainty caused by unknown opponents' strategies, dynamic obstacles, and insufficient training complicates the action space into a hybrid decision process. Although the deep reinforcement learning method is significant for swarm confrontation since it can handle various sizes, as an end-to-end implementation, it cannot deal with the hybrid process. Here, we propose a novel hierarchical reinforcement learning approach consisting of a target allocation layer, a path planning layer, and the underlying dynamic interaction mechanism between the two layers, which indicates the quantified uncertainty. It decouples the hybrid process into discrete allocation and continuous planning layers, with a probabilistic ensemble model to quantify the uncertainty and regulate the interaction frequency adaptively. Furthermore, to overcome the unstable training process introduced by the two layers, we design an integration training method including pre-training and cross-training, which enhances the training efficiency and stability. Experimental results in comparison, ablation, and real-robot studies validate the effectiveness and generalization performance of our proposed approach. In our defined experiments with twenty to forty agents, the win rate of the proposed method reaches around ninety percent, outperforming other traditional methods.

Carbon Market Simulation with Adaptive Mechanism Design

Authors:Han Wang, Wenhao Li, Hongyuan Zha, Baoxiang Wang
Date:2024-06-12 05:08:51

A carbon market is a market-based tool that incentivizes economic agents to align individual profits with the global utility, i.e., reducing carbon emissions to tackle climate change. Cap and trade stands as a critical principle based on allocating and trading carbon allowances (carbon emission credit), enabling economic agents to follow planned emissions and penalizing excess emissions. A central authority is responsible for introducing and allocating those allowances in cap and trade. However, the complexity of carbon market dynamics makes accurate simulation intractable, which in turn hinders the design of effective allocation strategies. To address this, we propose an adaptive mechanism design framework, simulating the market using hierarchical, model-free multi-agent reinforcement learning (MARL). Government agents allocate carbon credits, while enterprises engage in economic activities and carbon trading. This framework illustrates agents' behavior comprehensively. Numerical results show MARL enables government agents to balance productivity, equality, and carbon emissions. Our project is available at https://github.com/xwanghan/Carbon-Simulator.

Deep Multi-Objective Reinforcement Learning for Utility-Based Infrastructural Maintenance Optimization

Authors:Jesse van Remmerden, Maurice Kenter, Diederik M. Roijers, Charalampos Andriotis, Yingqian Zhang, Zaharah Bukhsh
Date:2024-06-10 11:28:25

In this paper, we introduce Multi-Objective Deep Centralized Multi-Agent Actor-Critic (MO-DCMAC), a multi-objective reinforcement learning (MORL) method for infrastructural maintenance optimization, an area traditionally dominated by single-objective reinforcement learning (RL) approaches. Previous single-objective RL methods combine multiple objectives, such as probability of collapse and cost, into a singular reward signal through reward-shaping. In contrast, MO-DCMAC can optimize a policy for multiple objectives directly, even when the utility function is non-linear. We evaluated MO-DCMAC using two utility functions, which use probability of collapse and cost as input. The first utility function is the Threshold utility, in which MO-DCMAC should minimize cost so that the probability of collapse is never above the threshold. The second is based on the Failure Mode, Effects, and Criticality Analysis (FMECA) methodology used by asset managers to assess maintenance plans. We evaluated MO-DCMAC, with both utility functions, in multiple maintenance environments, including ones based on a case study of the historical quay walls of Amsterdam. The performance of MO-DCMAC was compared against multiple rule-based policies based on heuristics currently used for constructing maintenance plans. Our results demonstrate that MO-DCMAC outperforms traditional rule-based policies across various environments and utility functions.
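
The abstract does not give the exact form of the Threshold utility, so the following is only a hypothetical sketch of how a non-linear, threshold-style utility over (cost, probability of collapse) can rank plans differently from a linear weighted sum. The function names, the `penalty` constant, and the example numbers are all assumptions.

```python
def threshold_utility(cost, p_collapse, threshold=0.1, penalty=1e6):
    """Hypothetical threshold-style utility: minimize cost, but heavily
    penalize any plan whose collapse probability exceeds the threshold."""
    if p_collapse > threshold:
        return -cost - penalty
    return -cost

def linear_utility(cost, p_collapse, weight=10.0):
    """Linear scalarization of the same two objectives, for comparison."""
    return -cost - weight * p_collapse

# Two hypothetical maintenance plans: (cost, probability of collapse).
safe_plan = (10.0, 0.05)
cheap_plan = (5.0, 0.20)

# The threshold utility prefers the safe plan; this linear weighting prefers the cheap one.
print(threshold_utility(*safe_plan) > threshold_utility(*cheap_plan))
print(linear_utility(*safe_plan) > linear_utility(*cheap_plan))
```

The contrast illustrates why a method that optimizes for the utility directly can behave differently from one trained on a single shaped reward.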

WoCoCo: Learning Whole-Body Humanoid Control with Sequential Contacts

Authors:Chong Zhang, Wenli Xiao, Tairan He, Guanya Shi
Date:2024-06-10 04:00:55

Humanoid activities involving sequential contacts are crucial for complex robotic interactions and operations in the real world and are traditionally solved by model-based motion planning, which is time-consuming and often relies on simplified dynamics models. Although model-free reinforcement learning (RL) has become a powerful tool for versatile and robust whole-body humanoid control, it still requires tedious task-specific tuning and state machine design and suffers from long-horizon exploration issues in tasks involving contact sequences. In this work, we propose WoCoCo (Whole-Body Control with Sequential Contacts), a unified framework to learn whole-body humanoid control with sequential contacts by naturally decomposing the tasks into separate contact stages. Such decomposition facilitates simple and general policy learning pipelines through task-agnostic reward and sim-to-real designs, requiring only one or two task-related terms to be specified for each task. We demonstrated that end-to-end RL-based controllers trained with WoCoCo enable four challenging whole-body humanoid tasks involving diverse contact sequences in the real world without any motion priors: 1) versatile parkour jumping, 2) box loco-manipulation, 3) dynamic clap-and-tap dancing, and 4) cliffside climbing. We further show that WoCoCo is a general framework beyond humanoid by applying it in 22-DoF dinosaur robot loco-manipulation tasks.

Diffusion-based Reinforcement Learning for Dynamic UAV-assisted Vehicle Twins Migration in Vehicular Metaverses

Authors:Yongju Tong, Jiawen Kang, Junlong Chen, Minrui Xu, Gaolei Li, Weiting Zhang, Xincheng Yan
Date:2024-06-08 09:53:56

Air-ground integrated networks can relieve communication pressure on ground transportation networks and provide 6G-enabled vehicular Metaverse service offloading in remote areas with sparse RoadSide Unit (RSU) coverage and in downtown areas where users have a high demand for vehicular services. Vehicle Twins (VTs) are the digital twins of physical vehicles that enable more immersive and realistic vehicular services; they can be offloaded and updated on RSUs to manage and provide vehicular Metaverse services to passengers and drivers. The high mobility of vehicles and the limited coverage of RSU signals necessitate VT migration to ensure service continuity when vehicles leave the signal coverage of RSUs. However, uneven VT task migration might overload some RSUs, which might result in increased service latency and thus degrade immersive experiences for users. In this paper, we propose a dynamic Unmanned Aerial Vehicle (UAV)-assisted VT migration framework in air-ground integrated networks, where UAVs act as aerial edge servers to assist ground RSUs during VT task offloading. In this framework, we propose a diffusion-based Reinforcement Learning (RL) algorithm, which can efficiently make immersive VT migration decisions in UAV-assisted vehicular networks. To balance the workload of RSUs and improve VT migration quality, we design a novel dynamic path planning algorithm based on a heuristic search strategy for UAVs. Simulation results show that the diffusion-based RL algorithm with UAV assistance performs better than other baseline schemes.

Planning Like Human: A Dual-process Framework for Dialogue Planning

Authors:Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Ming Liu, Zerui Chen, Bing Qin
Date:2024-06-08 06:52:47

In proactive dialogue, the challenge lies not just in generating responses but in steering conversations toward predetermined goals, a task where Large Language Models (LLMs) typically struggle due to their reactive nature. Traditional approaches to enhance dialogue planning in LLMs, ranging from elaborate prompt engineering to the integration of policy networks, either face efficiency issues or deliver suboptimal performance. Inspired by the dual-process theory in psychology, which identifies two distinct modes of thinking - intuitive (fast) and analytical (slow), we propose the Dual-Process Dialogue Planning (DPDP) framework. DPDP embodies this theory through two complementary planning systems: an instinctive policy model for familiar contexts and a deliberative Monte Carlo Tree Search (MCTS) mechanism for complex, novel scenarios. This dual strategy is further coupled with a novel two-stage training regimen: offline Reinforcement Learning for robust initial policy model formation followed by MCTS-enhanced on-the-fly learning, which ensures a dynamic balance between efficiency and strategic depth. Our empirical evaluations across diverse dialogue tasks affirm DPDP's superiority in achieving both high-quality dialogues and operational efficiency, outpacing existing methods.

SLOPE: Search with Learned Optimal Pruning-based Expansion

Authors:Davor Bokan, Zlatan Ajanovic, Bakir Lacevic
Date:2024-06-07 13:42:15

Heuristic search is often used for motion planning and pathfinding problems, for finding the shortest path in a graph while also promising completeness and optimal efficiency. The drawback is its space complexity, specifically storing all expanded child nodes in memory and sorting large lists of active nodes, which can be a problem in real-time scenarios with limited on-board computation. To combat this, we present the Search with Learned Optimal Pruning-based Expansion (SLOPE), which learns the distance of a node from a possible optimal path, unlike other approaches that learn a cost-to-go value. The unfavored nodes are then pruned according to this distance, which in turn reduces the size of the open list. This ensures that the search explores only the region close to optimal paths while lowering memory and computational costs. Unlike traditional learning methods, our approach is orthogonal to estimating cost-to-go heuristics, offering a complementary strategy for improving search efficiency. We demonstrate the effectiveness of our approach by evaluating it as a standalone search method and in conjunction with learned heuristic functions, achieving comparable-or-better node expansion metrics, while lowering the number of child nodes in the open list. Our code is available at https://github.com/dbokan1/SLOPE.

Sim-to-Real Transfer of Deep Reinforcement Learning Agents for Online Coverage Path Planning

Authors:Arvi Jonnarth, Ola Johansson, Michael Felsberg
Date:2024-06-07 13:24:19

Sim-to-real transfer presents a difficult challenge, where models trained in simulation are to be deployed in the real world. The distribution shift between the two settings leads to biased representations of the dynamics, and thus to suboptimal predictions in the real-world environment. In this work, we tackle the challenge of sim-to-real transfer of reinforcement learning (RL) agents for coverage path planning (CPP). In CPP, the task is for a robot to find a path that covers every point of a confined area. Specifically, we consider the case where the environment is unknown, and the agent needs to plan the path online while mapping the environment. We bridge the sim-to-real gap through a semi-virtual environment, including a real robot and real-time aspects, while utilizing a simulated sensor and obstacles to enable environment randomization and automated episode resetting. We investigate what level of fine-tuning is needed for adapting to a realistic setting, comparing to an agent trained solely in simulation. We find that a high inference frequency allows first-order Markovian policies to transfer directly from simulation, while higher-order policies can be fine-tuned to further reduce the sim-to-real gap. Moreover, they can operate at a lower frequency, thus reducing computational requirements. In both cases, our approaches transfer state-of-the-art results from simulation to the real domain, where direct learning would take in the order of weeks with manual interaction, that is, it would be completely infeasible.

MARLander: A Local Path Planning for Drone Swarms using Multiagent Deep Reinforcement Learning

Authors:Demetros Aschu, Robinroy Peter, Sausar Karaf, Aleksey Fedoseev, Dzmitry Tsetserukou
Date:2024-06-06 15:19:15

Achieving safe and precise landings for a swarm of drones poses a significant challenge, primarily attributed to conventional control and planning methods. This paper presents the implementation of multi-agent deep reinforcement learning (MADRL) techniques for the precise landing of a drone swarm at relocated target locations. The system is trained in a realistic simulated environment with a maximum velocity of 3 m/s in training spaces of 4 x 4 x 4 m and deployed utilizing Crazyflie drones with a Vicon indoor localization system. The experimental results revealed that the proposed approach achieved a landing accuracy of 2.26 cm on stationary and 3.93 cm on moving platforms surpassing a baseline method used with a Proportional-integral-derivative (PID) controller with an Artificial Potential Field (APF). This research highlights drone landing technologies that eliminate the need for analytical centralized systems, potentially offering scalability and revolutionizing applications in logistics, safety, and rescue missions.

Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents

Authors:Yoann Poupart
Date:2024-06-06 12:57:31

AI led chess systems to a superhuman level, yet these systems heavily rely on black-box algorithms. This is unsustainable in ensuring transparency to the end-user, particularly when these systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) were fathomable and contained human-understandable concepts. Yet, these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. In this respect, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess-agent plans. We primarily focused on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.

ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search

Authors:Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang
Date:2024-06-06 07:40:00

Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM. We release all code at https://github.com/THUDM/ReST-MCTS.

Adaptive Distance Functions via Kelvin Transformation

Authors:Rafael I. Cabral Muchacho, Florian T. Pokorny
Date:2024-06-05 12:33:11

The term safety in robotics is often understood as a synonym for avoidance. Although this perspective has led to progress in path planning and reactive control, a generalization of this perspective is necessary to include task semantics relevant to contact-rich manipulation tasks, especially during teleoperation and to ensure the safety of learned policies. We introduce the semantics-aware distance function and a corresponding computational method based on the Kelvin Transformation. The semantics-aware distance generalizes signed distance functions by allowing the zero level set to lie inside of the object in regions where contact is allowed, effectively incorporating task semantics -- such as object affordances and user intent -- in an adaptive implicit representation of safe sets. In validation experiments we show the capability of our method to adapt to time-varying semantic information, and to perform queries in sub-microsecond, enabling applications in reinforcement learning, trajectory optimization, and motion planning.

Open Grounded Planning: Challenges and Benchmark Construction

Authors:Shiguang Guo, Ziliang Deng, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun
Date:2024-06-05 03:46:52

The emergence of large language models (LLMs) has increasingly drawn attention to the use of LLMs for human-like planning. Existing work on LLM-based planning either focuses on leveraging the inherent language generation capabilities of LLMs to produce free-style plans, or employs reinforcement learning approaches to learn decision-making for a limited set of actions within restricted environments. However, both approaches exhibit significant discrepancies from the open and executable requirements in real-world planning. In this paper, we propose a new planning task: open grounded planning. The primary objective of open grounded planning is to ask the model to generate an executable plan based on a variable action set, thereby ensuring the executability of the produced plan. To this end, we establish a benchmark for open grounded planning spanning a wide range of domains. Then we test current state-of-the-art LLMs along with five planning approaches, revealing that existing LLMs and methods still struggle to address the challenges posed by grounded planning in open domains. The outcomes of this paper define and establish a foundational dataset for open grounded planning, and shed light on the potential challenges and future directions of LLM-based planning.

Reinforcement Learning with Lookahead Information

Authors:Nadav Merlis
Date:2024-06-04 12:29:51

We study reinforcement learning (RL) problems in which agents observe the reward or transition realizations at their current state before deciding which action to take. Such observations are available in many applications, including transactions, navigation and more. When the environment is known, previous work shows that this lookahead information can drastically increase the collected reward. However, outside of specific applications, existing approaches for interacting with unknown environments are not well-adapted to these observations. In this work, we close this gap and design provably-efficient learning algorithms able to incorporate lookahead information. To achieve this, we perform planning using the empirical distribution of the reward and transition observations, in contrast to vanilla approaches that only rely on estimated expectations. We prove that our algorithms achieve tight regret versus a baseline that also has access to lookahead information - linearly increasing the amount of collected reward compared to agents that cannot handle lookahead information.
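
To see why planning with the empirical reward distribution differs from planning with estimated expectations, consider a one-step toy problem. The sketch below is an illustrative assumption, not the paper's algorithm: `samples` is a hypothetical list of observed joint reward realizations, one tuple per round with one entry per action. An agent without lookahead can do no better than the maximum of the per-action means, whereas an agent that sees the realization before acting collects the mean of the per-round maxima.

```python
def value_without_lookahead(samples):
    """Best expected reward when the action is chosen before rewards are seen."""
    n = len(samples)
    n_actions = len(samples[0])
    means = [sum(s[a] for s in samples) / n for a in range(n_actions)]
    return max(means)

def value_with_lookahead(samples):
    """Expected reward when the realized rewards are observed before acting."""
    return sum(max(s) for s in samples) / len(samples)

# Two actions with independent fair-coin rewards, each joint outcome seen once.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
no_lookahead = value_without_lookahead(samples)
lookahead = value_with_lookahead(samples)
```

The gap between the two quantities is what an expectation-only planner cannot capture, since it depends on the joint distribution of rewards rather than on their means.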

Multi-Agent Reinforcement Learning Meets Leaf Sequencing in Radiotherapy

Authors:Riqiang Gao, Florin C. Ghesu, Simon Arberet, Shahab Basiri, Esa Kuusela, Martin Kraus, Dorin Comaniciu, Ali Kamen
Date:2024-06-03 23:55:20

In contemporary radiotherapy planning (RTP), a key module, leaf sequencing, is predominantly addressed by optimization-based approaches. In this paper, we propose a novel deep reinforcement learning (DRL) model, termed Reinforced Leaf Sequencer (RLS), in a multi-agent framework for leaf sequencing. The RLS model offers improvements to time-consuming iterative optimization steps via large-scale training and can control movement patterns through the design of reward mechanisms. We have conducted experiments on four datasets with four metrics and compared our model with a leading optimization sequencer. Our findings reveal that the proposed RLS model can achieve reduced fluence reconstruction errors and potentially faster convergence when integrated into an optimization planner. Additionally, RLS has shown promising results in a full artificial intelligence RTP pipeline. We hope this pioneering multi-agent RL leaf sequencer can foster future research on machine learning for RTP.

A New View on Planning in Online Reinforcement Learning

Authors:Kevin Roice, Parham Mohammad Panahi, Scott M. Jordan, Adam White, Martha White
Date:2024-06-03 17:45:19

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.
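
For readers unfamiliar with the Dyna architecture mentioned in this abstract, the following is a minimal tabular Dyna-Q sketch. It illustrates generic background planning (a model-free update followed by extra updates replayed from a learned model) and is not the GSP algorithm itself; the function name `dyna_q_update`, the deterministic one-step model, and the default hyperparameters are assumptions made for this example.

```python
import random
from collections import defaultdict

def dyna_q_update(Q, model, s, a, r, s_next, actions,
                  alpha=0.5, gamma=0.9, n_planning=5, rng=None):
    """One Dyna-Q step: real-experience update, model update, then planning."""
    rng = rng or random.Random(0)

    def backup(state, action, reward, next_state):
        # Standard Q-learning backup toward the one-step bootstrapped target.
        target = reward + gamma * max(Q[(next_state, b)] for b in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])

    backup(s, a, r, s_next)        # model-free update from the real transition
    model[(s, a)] = (r, s_next)    # remember the transition (deterministic model)
    for _ in range(n_planning):    # background planning with simulated experience
        (ps, pa), (pr, pn) = rng.choice(list(model.items()))
        backup(ps, pa, pr, pn)

Q = defaultdict(float)
model = {}
dyna_q_update(Q, model, "s0", "right", 1.0, "s1", ["left", "right"], n_planning=1)
```

Here the planning loop replays one-step transitions over the full state space; the abstract's point is that such learned models can be inaccurate when iterated, which motivates restricting planning to subgoals.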

Multi-Agent Transfer Learning via Temporal Contrastive Learning

Authors:Weihao Zeng, Joseph Campbell, Simon Stepputtis, Katia Sycara
Date:2024-06-03 14:42:14

This paper introduces a novel transfer learning framework for deep multi-agent reinforcement learning. The approach automatically combines goal-conditioned policies with temporal contrastive learning to discover meaningful sub-goals. The approach involves pre-training a goal-conditioned agent, finetuning it on the target domain, and using contrastive learning to construct a planning graph that guides the agent via sub-goals. Experiments on multi-agent coordination Overcooked tasks demonstrate improved sample efficiency, the ability to solve sparse-reward and long-horizon problems, and enhanced interpretability compared to baselines. The results highlight the effectiveness of integrating goal-conditioned policies with unsupervised temporal abstraction learning for complex multi-agent transfer learning. Compared to state-of-the-art baselines, our method achieves the same or better performances while requiring only 21.7% of the training samples.

Learning to Play Atari in a World of Tokens

Authors:Pranav Agarwal, Sheldon Andrews, Samira Ebrahimi Kahou
Date:2024-06-03 14:25:29

Model-based reinforcement learning agents utilizing transformers have shown improved sample efficiency due to their ability to model extended context, resulting in more accurate world models. However, for complex reasoning and planning tasks, these methods primarily rely on continuous representations. This complicates modeling of discrete properties of the real world such as disjoint object classes between which interpolation is not plausible. In this work, we introduce discrete abstract representations for transformer-based learning (DART), a sample-efficient method utilizing discrete representations for modeling both the world and learning behavior. We incorporate a transformer-decoder for auto-regressive world modeling and a transformer-encoder for learning behavior by attending to task-relevant cues in the discrete representation of the world model. For handling partial observability, we aggregate information from past time steps as memory tokens. DART outperforms previous state-of-the-art methods that do not use look-ahead search on the Atari 100k sample efficiency benchmark with a median human-normalized score of 0.790 and beats humans in 9 out of 26 games. We release our code at https://pranaval.github.io/DART/.

NeoRL: Efficient Exploration for Nonepisodic RL

Authors:Bhavya Sukhija, Lenart Treven, Florian Dörfler, Stelian Coros, Andreas Krause
Date:2024-06-03 10:14:32

We study the problem of nonepisodic reinforcement learning (RL) for nonlinear dynamical systems, where the system dynamics are unknown and the RL agent has to learn from a single trajectory, i.e., without resets. We propose Nonepisodic Optimistic RL (NeoRL), an approach based on the principle of optimism in the face of uncertainty. NeoRL uses well-calibrated probabilistic models and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics. Under continuity and bounded energy assumptions on the system, we provide a first-of-its-kind regret bound of $O(\Gamma_T \sqrt{T})$ for general nonlinear systems with Gaussian process dynamics. We compare NeoRL to other baselines on several deep RL environments and empirically demonstrate that NeoRL achieves the optimal average cost while incurring the least regret.

Satellites swarm cooperation for pursuit-attachment tasks with transformer-based reinforcement learning

Authors:Yonghao Li
Date:2024-06-03 07:17:16

The on-orbit intelligent planning of satellites swarm has attracted increasing attention from scholars. Especially in tasks such as the pursuit and attachment of non-cooperative satellites, satellites swarm must achieve coordinated cooperation with limited resources. The study proposes a reinforcement learning framework that integrates the transformer and expert networks. Firstly, under the constraints of incomplete information about non-cooperative satellites, an implicit multi-satellites cooperation strategy was designed using a communication sharing mechanism. Subsequently, for the characteristics of the pursuit-attachment tasks, the multi-agent reinforcement learning framework is improved by introducing transformers and expert networks inspired by transfer learning ideas. To address the issue of satellites swarm scalability, sequence modelling based on transformers is utilized to craft memory-augmented policy networks, meanwhile increasing the scalability of the swarm. By comparing the convergence curves with other algorithms, it is shown that the proposed method is qualified for pursuit-attachment tasks of satellites swarm. Additionally, simulations under different maneuvering strategies of non-cooperative satellites respectively demonstrate the robustness of the algorithm and the task efficiency of the swarm system. The success rate of pursuit-attachment tasks is analyzed through Monte Carlo simulations.

An Advanced Reinforcement Learning Framework for Online Scheduling of Deferrable Workloads in Cloud Computing

Authors:Hang Dong, Liwen Zhu, Zhao Shan, Bo Qiao, Fangkai Yang, Si Qin, Chuan Luo, Qingwei Lin, Yuwen Yang, Gurpreet Virdi, Saravan Rajmohan, Dongmei Zhang, Thomas Moscibroda
Date:2024-06-03 06:55:26

Efficient resource utilization and perfect user experience usually conflict with each other in cloud computing platforms. Great efforts have been invested in increasing resource utilization while trying not to affect users' experience on cloud computing platforms. In order to better utilize the remaining pieces of computing resources spread over the whole platform, deferrable jobs are provided with a discounted price to users. For this type of deferrable job, users are allowed to submit jobs that will run for a specific uninterrupted duration in a flexible range of time in the future with a great discount. With these deferrable jobs to be scheduled under the remaining capacity after deploying on-demand jobs, it remains a challenge to achieve high resource utilization and meanwhile shorten the waiting time for users as much as possible in an online manner. In this paper, we propose an online deferrable job scheduling method called Online Scheduling for DEferrable jobs in Cloud (OSDEC), where a deep reinforcement learning model is adopted to learn the scheduling policy, and several auxiliary tasks are utilized to provide better state representations and improve the performance of the model. With the integrated reinforcement learning framework, the proposed method can plan the deployment schedule well and achieve a short waiting time for users while maintaining a high resource utilization for the platform. The proposed method is validated on a public dataset and shows superior performance.

Research on the Application of Computer Vision Based on Deep Learning in Autonomous Driving Technology

Authors:Jingyu Zhang, Jin Cao, Jinghao Chang, Xinjin Li, Houze Liu, Zhenglin Li
Date:2024-06-01 16:41:24

This research aims to explore the application of deep learning in autonomous driving computer vision technology and its impact on improving system performance. By using advanced technologies such as convolutional neural networks (CNN), multi-task joint learning methods, and deep reinforcement learning, this article analyzes in detail the application process of deep learning in key areas: image recognition, real-time target tracking and classification, environment perception and decision support, and path planning and navigation. Research results show that the proposed system has an accuracy of over 98% in image recognition, target tracking and classification, and also demonstrates efficient performance and practicality in environmental perception and decision support, path planning and navigation. The conclusion points out that deep learning technology can significantly improve the accuracy and real-time response capabilities of autonomous driving systems. Although there are still challenges in environmental perception and decision support, with the advancement of technology, it is expected to achieve wider applications and greater potential in the future.

Task Planning for Object Rearrangement in Multi-room Environments

Authors:Karan Mirakhor, Sourav Ghosh, Dipanjan Das, Brojeshwar Bhowmick
Date:2024-06-01 14:23:58

Object rearrangement in a multi-room setup should produce a reasonable plan that reduces the agent's overall travel and the number of steps. Recent state-of-the-art methods fail to produce such plans because they rely on explicit exploration for discovering unseen objects due to partial observability and a heuristic planner to sequence the actions for rearrangement. This paper proposes a novel hierarchical task planner to efficiently plan a sequence of actions to discover unseen objects and rearrange misplaced objects within an untidy house to achieve a desired tidy state. The proposed method introduces several novel techniques, including (i) a method for discovering unseen objects using commonsense knowledge from large language models, (ii) a collision resolution and buffer prediction method based on Cross-Entropy Method to handle blocked goal and swap cases, (iii) a directed spatial graph-based state space for scalability, and (iv) deep reinforcement learning (RL) for producing an efficient planner. The planner interleaves the discovery of unseen objects and rearrangement to minimize the number of steps taken and overall traversal of the agent. The paper also presents new metrics and a benchmark dataset called MoPOR to evaluate the effectiveness of the rearrangement planning in a multi-room setting. The experimental results demonstrate that the proposed method effectively addresses the multi-room rearrangement problem.

Cognitive Manipulation: Semi-supervised Visual Representation and Classroom-to-real Reinforcement Learning for Assembly in Semi-structured Environments

Authors:Chuang Wang, Lie Yang, Ze Lin, Yizhi Liao, Gang Chen, Longhan Xie
Date:2024-06-01 08:55:34

Assembling a slave object into a fixture-free master object represents a critical challenge in flexible manufacturing. Existing deep reinforcement learning-based methods, while benefiting from visual or operational priors, often struggle with small-batch precise assembly tasks due to their reliance on insufficient priors and costly model development. To address these limitations, this paper introduces a cognitive manipulation and learning approach that utilizes skill graphs to integrate learning-based object detection with fine manipulation models into a cohesive modular policy. This approach enables the detection of the master object from both global and local perspectives to accommodate positional uncertainties and variable backgrounds, and a parametric residual policy to handle pose error and intricate contact dynamics effectively. Leveraging the skill graph, our method supports knowledge-informed semi-supervised learning for object detection and classroom-to-real reinforcement learning for fine manipulation. Simulation experiments on a gear-assembly task have demonstrated that the skill-graph-enabled coarse-operation planning and visual attention are essential for efficient learning and robust manipulation, showing substantial improvements of 13% in success rate and 15.4% in number of completion steps over competing methods. Real-world experiments further validate that our system is highly effective for robotic assembly in semi-structured environments.

HOPE: A Reinforcement Learning-based Hybrid Policy Path Planner for Diverse Parking Scenarios

Authors:Mingyang Jiang, Yueyuan Li, Songan Zhang, Siyuan Chen, Chunxiang Wang, Ming Yang
Date:2024-05-31 02:17:51

Automated parking stands as a highly anticipated application of autonomous driving technology. However, existing path planning methodologies fall short of addressing this need due to their inability to handle the diverse and complex parking scenarios in reality. While non-learning methods provide reliable planning results, they are vulnerable to intricate occasions, whereas learning-based ones are good at exploration but unstable in converging to feasible solutions. To leverage the strengths of both approaches, we introduce Hybrid pOlicy Path plannEr (HOPE). This novel solution integrates a reinforcement learning agent with Reeds-Shepp curves, enabling effective planning across diverse scenarios. HOPE guides the exploration of the reinforcement learning agent by applying an action mask mechanism and employs a transformer to integrate the perceived environmental information with the mask. To facilitate the training and evaluation of the proposed planner, we propose a criterion for categorizing the difficulty level of parking scenarios based on space and obstacle distribution. Experimental results demonstrate that our approach outperforms typical rule-based algorithms and traditional reinforcement learning methods, showing higher planning success rates and generalization across various scenarios. We also conduct real-world experiments to verify the practicability of HOPE. The code for our solution will be openly available on GitHub: https://github.com/jiamiya/HOPE.

From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems

Authors:Jianliang He, Siyu Chen, Fengzhuo Zhang, Zhuoran Yang
Date:2024-05-30 09:42:54

In this work, from a theoretical lens, we aim to understand why large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. To this end, consider a hierarchical reinforcement learning (RL) model where the LLM Planner and the Actor perform high-level task planning and low-level execution, respectively. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. Under proper assumptions on the pretraining data, we prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning. Additionally, we highlight the necessity for exploration beyond the subgoals derived from BAIL by proving that naively executing the subgoals returned by LLM leads to a linear regret. As a remedy, we introduce an $\epsilon$-greedy exploration strategy to BAIL, which is proven to incur sublinear regret when the pretraining error is small. Finally, we extend our theoretical framework to include scenarios where the LLM Planner serves as a world model for inferring the transition model of the environment and to multi-agent settings, enabling coordination among multiple Actors.
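
The exploration remedy named in this abstract can be sketched in a few lines. This is a generic epsilon-greedy wrapper rather than the paper's exact procedure: `llm_subgoal` stands for whatever subgoal the planner proposes, and `candidate_subgoals` is a hypothetical finite set of alternatives assumed for illustration.

```python
import random

def epsilon_greedy_subgoal(llm_subgoal, candidate_subgoals, epsilon, rng):
    """Follow the planner's subgoal, but explore uniformly with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(candidate_subgoals)  # explore: ignore the proposal
    return llm_subgoal                         # exploit: trust the proposal

rng = random.Random(0)
candidates = ["open door", "pick up key", "go to kitchen"]
exploit_choice = epsilon_greedy_subgoal("pick up key", candidates, 0.0, rng)
explore_choice = epsilon_greedy_subgoal("pick up key", candidates, 1.0, rng)
```

Setting `epsilon` to zero recovers the naive strategy of always executing the returned subgoal, which the abstract shows can incur linear regret.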

To RL or not to RL? An Algorithmic Cheat-Sheet for AI-Based Radio Resource Management

Authors:Lorenzo Maggi, Matthew Andrews, Ryo Koblitz
Date:2024-05-29 12:41:43

Several Radio Resource Management (RRM) use cases can be framed as sequential decision planning problems, where an agent (the base station, typically) makes decisions that influence the network utility and state. While Reinforcement Learning (RL) in its general form can address this scenario, it is known to be sample inefficient. Following the principle of Occam's razor, we argue that the choice of the solution technique for RRM should be guided by questions such as, "Is it a short or long-term planning problem?", "Is the underlying model known or does it need to be learned?", "Can we solve the problem analytically?" or "Is an expert-designed policy available?". A wide range of techniques exists to address these questions, including static and stochastic optimization, bandits, model predictive control (MPC) and, indeed, RL. We review some of these techniques that have already been successfully applied to RRM, and we believe that others, such as MPC, may present exciting research opportunities for the future.

Bridging the Gap between Partially Observable Stochastic Games and Sparse POMDP Methods

Authors:Tyler Becker, Zachary Sunberg
Date:2024-05-29 02:27:47

Many real-world decision problems involve the interaction of multiple self-interested agents with limited sensing ability. The partially observable stochastic game (POSG) provides a mathematical framework for modeling these problems; however, solving a POSG requires difficult reasoning over two critical factors: (1) information revealed by partial observations and (2) decisions other agents make. In the single-agent case, partially observable Markov decision process (POMDP) planning can efficiently address partial observability with particle filtering. In the multi-agent case, extensive form game solution methods account for other agents' decisions, but preclude belief approximation. We propose a unifying framework that combines POMDP-inspired state distribution approximation and game-theoretic equilibrium search on information sets. This paper lays a theoretical foundation for the approach by bounding errors due to belief approximation, and empirically demonstrates effectiveness with a numerical example. The new approach enables planning in POSGs with very large state spaces, paving the way for reliable autonomous interaction in real-world physical environments and complementing multi-agent reinforcement learning.

Model-Based Diffusion for Trajectory Optimization

Authors:Chaoyi Pan, Zeji Yi, Guanya Shi, Guannan Qu
Date:2024-05-28 22:14:25

Recent advances in diffusion models have demonstrated their strong capabilities in generating high-fidelity samples from complex distributions through an iterative refinement process. Despite the empirical success of diffusion models in motion planning and control, the model-free nature of these methods does not leverage readily available model information and limits their generalization to new scenarios beyond the training data (e.g., new robots with different dynamics). In this work, we introduce Model-Based Diffusion (MBD), an optimization approach using the diffusion process to solve trajectory optimization (TO) problems without data. The key idea is to explicitly compute the score function by leveraging the model information in TO problems, which is why we refer to our approach as model-based diffusion. Moreover, although MBD does not require external data, it can be naturally integrated with data of diverse qualities to steer the diffusion process. We also reveal that MBD has interesting connections to sampling-based optimization. Empirical evaluations show that MBD outperforms state-of-the-art reinforcement learning and sampling-based TO methods in challenging contact-rich tasks. Additionally, MBD's ability to integrate with data enhances its versatility and practical applicability, even with imperfect and infeasible data (e.g., partial-state demonstrations for high-dimensional humanoids), beyond the scope of standard diffusion models.

Extreme Value Monte Carlo Tree Search

Authors:Masataro Asai, Stephen Wissow
Date:2024-05-28 14:58:43

Despite being successful in board games and reinforcement learning (RL), UCT, a Monte-Carlo Tree Search (MCTS) combined with UCB1 Multi-Armed Bandit (MAB), has had limited success in domain-independent planning until recently. Previous work showed that UCB1, designed for $[0,1]$-bounded rewards, is not appropriate for estimating distance-to-go values, which are potentially unbounded in $\mathbb{R}$, such as heuristic functions used in classical planning, and then proposed combining MCTS with MABs designed for Gaussian reward distributions, which successfully improved the performance. In this paper, we further sharpen our understanding of ideal bandits for planning tasks. Existing work has two issues: First, while Gaussian MABs no longer over-specify the distances as $h\in [0,1]$, they under-specify them as $h\in [-\infty,\infty]$ while they are non-negative and can be further bounded in some cases. Second, there is no theoretical justification for Full-Bellman backup (Schulte & Keller, 2014) that backpropagates minimum/maximum of samples. We identified \emph{extreme value} statistics as a theoretical framework that resolves both issues at once and propose two bandits, UCB1-Uniform/Power, and apply them to MCTS for classical planning. We formally prove their regret bounds and empirically demonstrate their performance in classical planning.
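
As background for the bandits discussed in this abstract, the standard UCB1 rule for $[0,1]$-bounded rewards can be sketched as follows. This is the textbook reward-maximizing form, not the UCB1-Uniform/Power variants proposed in the paper; the helper names `ucb1_index` and `select_arm` are assumptions for illustration.

```python
import math

def ucb1_index(mean_reward, n_pulls, total_pulls):
    """UCB1 index: empirical mean plus an exploration bonus for [0,1] rewards."""
    if n_pulls == 0:
        return math.inf  # untried arms are always selected first
    return mean_reward + math.sqrt(2.0 * math.log(total_pulls) / n_pulls)

def select_arm(means, counts):
    """Pick the arm with the largest UCB1 index."""
    total = sum(counts)
    return max(range(len(means)),
               key=lambda i: ucb1_index(means[i], counts[i], total))

untried_first = select_arm([0.5, 0.0], [10, 0])
well_sampled = select_arm([0.9, 0.1], [100, 100])
```

The exploration bonus is calibrated to rewards in $[0,1]$, which is why applying it directly to unbounded heuristic distances is problematic, as the abstract notes.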

LNS2+RL: Combining Multi-Agent Reinforcement Learning with Large Neighborhood Search in Multi-Agent Path Finding

Authors:Yutong Wang, Tanishq Duhan, Jiaoyang Li, Guillaume Sartoretti
Date:2024-05-28 03:45:32

Multi-Agent Path Finding (MAPF) is a critical component of logistics and warehouse management, which focuses on planning collision-free paths for a team of robots in a known environment. Recent work introduced a novel MAPF approach, LNS2, which proposed to repair a quickly obtained set of infeasible paths via iterative replanning, by relying on a fast, yet lower-quality, prioritized planning (PP) algorithm. At the same time, there has been a recent push for Multi-Agent Reinforcement Learning (MARL) based MAPF algorithms, which exhibit improved cooperation over such PP algorithms, although inevitably remaining slower. In this paper, we introduce a new MAPF algorithm, LNS2+RL, which combines the distinct yet complementary characteristics of LNS2 and MARL to effectively balance their individual limitations and get the best of both worlds. During early iterations, LNS2+RL relies on MARL for low-level replanning, which we show eliminates far more collisions than a PP algorithm. There, our MARL-based planner allows agents to reason about past and future information to gradually learn cooperative decision-making through a finely designed curriculum learning. At later stages of planning, LNS2+RL adaptively switches to a PP algorithm to quickly resolve the remaining collisions, naturally trading off solution quality (number of collisions in the solution) and computational efficiency. Our comprehensive experiments on high-agent-density tasks across various team sizes, world sizes, and map structures consistently demonstrate the superior performance of LNS2+RL compared to many MAPF algorithms, including LNS2, LaCAM, EECBS, and SCRIMP. In maps with complex structures, the advantages of LNS2+RL are particularly pronounced, with LNS2+RL achieving a success rate of over 50% in nearly half of the tested tasks, while that of LaCAM, EECBS and SCRIMP falls to 0%.

DPN: Decoupling Partition and Navigation for Neural Solvers of Min-max Vehicle Routing Problems

Authors:Zhi Zheng, Shunyu Yao, Zhenkun Wang, Xialiang Tong, Mingxuan Yuan, Ke Tang
Date:2024-05-27 15:33:16

The min-max vehicle routing problem (min-max VRP) traverses all given customers by assigning several routes and aims to minimize the length of the longest route. Recently, reinforcement learning (RL)-based sequential planning methods have exhibited advantages in solving efficiency and optimality. However, these methods fail to exploit the problem-specific properties in learning representations, resulting in less effective features for decoding optimal routes. This paper considers the sequential planning process of min-max VRPs as two coupled optimization tasks: customer partition for different routes and customer navigation in each route (i.e., partition and navigation). To effectively process min-max VRP instances, we present a novel attention-based Partition-and-Navigation encoder (P&N Encoder) that learns distinct embeddings for partition and navigation. Furthermore, we utilize an inherent symmetry in decoding routes and develop an effective agent-permutation-symmetric (APS) loss function. Experimental results demonstrate that the proposed Decoupling-Partition-Navigation (DPN) method significantly surpasses existing learning-based methods in both single-depot and multi-depot min-max VRPs. Our code is available at
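
The min-max objective described above can be made concrete with a small sketch. It only evaluates a given assignment of customers to routes and says nothing about how DPN constructs routes; the Euclidean distance, the single depot, and the function names are assumptions made for illustration.

```python
import math


def route_length(depot, customers):
    """Length of the closed tour depot -> customers in order -> depot."""
    stops = [depot, *customers, depot]
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))


def min_max_objective(depot, routes):
    """Min-max VRP cost: the length of the longest route."""
    return max(route_length(depot, route) for route in routes)
```

Under this objective, shortening any route other than the longest one leaves the cost unchanged, which is why the partition of customers across routes matters as much as the ordering within each route.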

Any-step Dynamics Model Improves Future Predictions for Online and Offline Reinforcement Learning

Authors:Haoxin Lin, Yu-Yan Xu, Yihao Sun, Zhilong Zhang, Yi-Chen Li, Chengxing Jia, Junyin Ye, Jiaji Zhang, Yang Yu
Date:2024-05-27 10:33:53

Model-based methods in reinforcement learning offer a promising approach to enhance data efficiency by facilitating policy exploration within a dynamics model. However, accurately predicting sequential steps in the dynamics model remains a challenge due to the bootstrapping prediction, which attributes the next state to the prediction of the current state. This leads to accumulated errors during model roll-out. In this paper, we propose the Any-step Dynamics Model (ADM) to mitigate the compounding error by reducing bootstrapping prediction to direct prediction. ADM allows for the use of variable-length plans as inputs for predicting future states without frequent bootstrapping. We design two algorithms, ADMPO-ON and ADMPO-OFF, which apply ADM in online and offline model-based frameworks, respectively. In the online setting, ADMPO-ON demonstrates improved sample efficiency compared to previous state-of-the-art methods. In the offline setting, ADMPO-OFF not only demonstrates superior performance compared to recent state-of-the-art offline approaches but also offers better quantification of model uncertainty using only a single ADM.

RoboArm-NMP: a Learning Environment for Neural Motion Planning

Authors:Tom Jurgenson, Matan Sudry, Gal Avineri, Aviv Tamar
Date:2024-05-25 19:28:11

We present RoboArm-NMP, a learning and evaluation environment that allows simple and thorough evaluations of Neural Motion Planning (NMP) algorithms, focused on robotic manipulators. Our Python-based environment provides baseline implementations for learning control policies (either supervised or reinforcement learning based), a simulator based on PyBullet, data of solved instances using a classical motion planning solver, various representation learning methods for encoding the obstacles, and a clean interface between the learning and planning frameworks. Using RoboArm-NMP, we compare several prominent NMP design points, and demonstrate that the best methods mostly succeed in generalizing to unseen goals in a scene with fixed obstacles, but have difficulty in generalizing to unseen obstacle configurations, suggesting focus points for future research.

TD3 Based Collision Free Motion Planning for Robot Navigation

Authors:Hao Liu, Yi Shen, Chang Zhou, Yuelin Zou, Zijun Gao, Qi Wang
Date:2024-05-24 11:34:45

This paper addresses the challenge of collision-free motion planning in automated navigation within complex environments. Utilizing advancements in Deep Reinforcement Learning (DRL) and sensor technologies like LiDAR, we propose the TD3-DWA algorithm, an innovative fusion of the traditional Dynamic Window Approach (DWA) with the Twin Delayed Deep Deterministic Policy Gradient (TD3). This hybrid algorithm enhances the efficiency of robotic path planning by optimizing the sampling interval parameters of DWA to effectively navigate around both static and dynamic obstacles. The performance of the TD3-DWA algorithm is validated through various simulation experiments, demonstrating its potential to significantly improve the reliability and safety of autonomous navigation systems.

Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search

Authors:Nicola Dainese, Matteo Merler, Minttu Alakuijala, Pekka Marttinen
Date:2024-05-24 09:31:26

In this work we consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL). Calling code instead of LLMs for planning has potential to be more precise, reliable, interpretable, and extremely efficient. However, writing appropriate Code World Models requires the ability to understand complex instructions, to generate exact code with non-trivial logic and to self-debug a long program with feedback from unit tests and environment trajectories. To address these challenges, we propose Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS), a new code generation strategy for LLMs. To test our approach in an offline RL setting, we introduce the Code World Models Benchmark (CWMB), a suite of program synthesis and planning tasks comprised of 18 diverse RL environments paired with corresponding textual descriptions and curated trajectories. GIF-MCTS surpasses all baselines on the CWMB and two other benchmarks, and we show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed.

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Authors:Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long
Date:2024-05-24 05:29:12

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.

Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning

Authors:Siddhant Bhambri, Amrita Bhattacharjee, Durgesh Kalwar, Lin Guan, Huan Liu, Subbarao Kambhampati
Date:2024-05-24 03:53:57

Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward domains, and the problem is more pronounced in the case of stochastic transitions. To improve the sample efficiency, reward shaping is a well-studied approach to introduce intrinsic rewards that can help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function for all desirable states in the Markov Decision Process (MDP) is challenging, even for domain experts. Given that Large Language Models (LLMs) have demonstrated impressive performance across a multitude of natural language tasks, we aim to answer the following question: `Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent's sample efficiency?' To this end, we aim to leverage off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP. We further use this LLM-generated plan as a heuristic to construct the reward shaping signal for the downstream RL agent. By characterizing the type of abstraction based on the MDP horizon length, we analyze the quality of heuristics when generated using an LLM, with and without a verifier in the loop. Our experiments across multiple domains with varying horizon length and number of sub-goals, drawn from the BabyAI environment suite and the Household, Mario, and Minecraft domains, show 1) the advantages and limitations of querying LLMs with and without a verifier to generate a reward shaping heuristic, and 2) a significant improvement in the sample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated heuristics.
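
One common way to turn a plan into a shaping signal is potential-based reward shaping, sketched below. The abstract does not state which shaping form the authors use, so this is an assumption; the potential function (negative number of plan steps still ahead) and the state names in the usage note are likewise hypothetical.

```python
def potential(state, plan):
    """Hypothetical potential: minus the number of plan steps still ahead."""
    if state in plan:
        return -(len(plan) - 1 - plan.index(state))
    return -len(plan)  # states off the plan get the lowest potential


def shaped_reward(reward, state, next_state, plan, gamma=0.99):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state, plan) - potential(state, plan)
```

With a plan such as `["start", "key", "door", "goal"]` and `gamma=1.0`, a step from `"start"` to `"key"` earns a bonus of 1 and a step back earns -1, so progress along the plan is rewarded even when the environment reward is zero.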

Multi-turn Reinforcement Learning from Preference Human Feedback

Authors:Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
Date:2024-05-23 14:53:54

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal.

Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration

Authors:Yang Zhang, Shixin Yang, Chenjia Bai, Fei Wu, Xiu Li, Zhen Wang, Xuelong Li
Date:2024-05-23 08:33:19

Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. In particular, LLM planning for multi-agent collaboration requires communication among agents or credit assignment as feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that overly rely on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. This endows the LLM with the foresight to discern whether an action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents and query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://read-llm.github.io/.

Efficient Navigation of a Robotic Fish Swimming Across the Vortical Flow Field

Authors:Haodong Feng, Dehan Yuan, Jiale Miao, Jie You, Yue Wang, Yi Zhu, Dixia Fan
Date:2024-05-23 07:30:58

Navigating efficiently across vortical flow fields presents a significant challenge in various robotic applications. The dynamic and unsteady nature of vortical flows often disturbs the control of underwater robots, complicating their operation in hydrodynamic environments. Conventional control methods, which depend on accurate modeling, fail in these settings due to the complexity of fluid-structure interactions (FSI) caused by unsteady hydrodynamics. This study proposes a deep reinforcement learning (DRL) algorithm, trained in a data-driven manner, to enable efficient navigation of a robotic fish swimming across vortical flows. Our proposed algorithm incorporates the LSTM architecture and uses several recent consecutive observations as the state to address the issue of partial observation, often due to sensor limitations. We present a numerical study of navigation within a Karman vortex street, created by placing a stationary cylinder in a uniform flow, utilizing the immersed boundary-lattice Boltzmann method (IB-LBM). The aim is to train the robotic fish to discover efficient navigation policies, enabling it to reach a designated target point across the Karman vortex street from various initial positions. After training, the fish demonstrates the ability to rapidly reach the target from different initial positions, showcasing the effectiveness and robustness of our proposed algorithm. Analysis of the results reveals that the robotic fish can leverage velocity gains and pressure differences induced by the vortices to reach the target, underscoring the potential of our proposed algorithm in enhancing navigation in complex hydrodynamic environments.

Transformers for Image-Goal Navigation

Authors:Nikhilanj Pelluri
Date:2024-05-23 03:01:32

Visual perception and navigation have emerged as major focus areas in the field of embodied artificial intelligence. We consider the task of image-goal navigation, where an agent is tasked to navigate to a goal specified by an image, relying only on images from an onboard camera. This task is particularly challenging since it demands robust scene understanding, goal-oriented planning and long-horizon navigation. Most existing approaches typically learn navigation policies reliant on recurrent neural networks trained via online reinforcement learning. However, training such policies requires substantial computational resources and time, and performance of these models is not reliable on long-horizon navigation. In this work, we present a generative Transformer based model that jointly models image goals, camera observations and the robot's past actions to predict future actions. We use state-of-the-art perception models and navigation policies to learn robust goal conditioned policies without the need for real-time interaction with the environment. Our model demonstrates capability in capturing and associating visual information across long time horizons, helping in effective navigation. NOTE: This work was submitted as part of a Master's Capstone Project and must be treated as such. This is still an early work in progress and not the final version.

Uncertainty-Aware DRL for Autonomous Vehicle Crowd Navigation in Shared Space

Authors:Mahsa Golchoubian, Moojan Ghafurian, Kerstin Dautenhahn, Nasser Lashgarian Azad
Date:2024-05-22 20:09:21

Safe, socially compliant, and efficient navigation of low-speed autonomous vehicles (AVs) in pedestrian-rich environments necessitates considering pedestrians' future positions and interactions with the vehicle and others. Despite the inevitable uncertainties associated with pedestrians' predicted trajectories due to their unobserved states (e.g., intent), existing deep reinforcement learning (DRL) algorithms for crowd navigation often neglect these uncertainties when using predicted trajectories to guide policy learning. This omission limits the usability of predictions when they diverge from ground truth. This work introduces an integrated prediction and planning approach that incorporates the uncertainties of predicted pedestrian states in the training of a model-free DRL algorithm. A novel reward function encourages the AV to respect pedestrians' personal space, decrease speed during close approaches, and minimize the collision probability with their predicted paths. Unlike previous DRL methods, our model, designed for AV operation in crowded spaces, is trained in a novel simulation environment that reflects realistic pedestrian behaviour in a shared space with vehicles. Results show a 40% decrease in collision rate and a 15% increase in minimum distance to pedestrians compared to the state-of-the-art model that does not account for prediction uncertainty. Additionally, the approach outperforms model predictive control methods that incorporate the same prediction uncertainties in terms of both performance and computational time, while producing trajectories closer to human drivers in similar scenarios.

Dynamic Model Predictive Shielding for Provably Safe Reinforcement Learning

Authors:Arko Banerjee, Kia Rahmani, Joydeep Biswas, Isil Dillig
Date:2024-05-22 17:44:07

Among approaches for provably safe reinforcement learning, Model Predictive Shielding (MPS) has proven effective at complex tasks in continuous, high-dimensional state spaces, by leveraging a backup policy to ensure safety when the learned policy attempts to take risky actions. However, while MPS can ensure safety both during and after training, it often hinders task progress due to the conservative and task-oblivious nature of backup policies. This paper introduces Dynamic Model Predictive Shielding (DMPS), which optimizes reinforcement learning objectives while maintaining provable safety. DMPS employs a local planner to dynamically select safe recovery actions that maximize both short-term progress as well as long-term rewards. Crucially, the planner and the neural policy play a synergistic role in DMPS. When planning recovery actions for ensuring safety, the planner utilizes the neural policy to estimate long-term rewards, allowing it to observe beyond its short-term planning horizon. Conversely, the neural policy under training learns from the recovery plans proposed by the planner, converging to policies that are both high-performing and safe in practice. This approach guarantees safety during and after training, with bounded recovery regret that decreases exponentially with planning horizon depth. Experimental results demonstrate that DMPS converges to policies that rarely require shield interventions after training and achieve higher rewards compared to several state-of-the-art baselines.
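
The basic shielding loop that the abstract attributes to plain MPS can be sketched as follows: the learned action is accepted only if the backup policy could still keep the system safe afterwards. This is a minimal illustration under assumed deterministic dynamics, not the DMPS planner; the 1-D toy environment and all names are hypothetical.

```python
def is_recoverable(state, step, backup_policy, is_safe, horizon):
    """Roll out the backup policy; every visited state must be safe."""
    for _ in range(horizon):
        if not is_safe(state):
            return False
        state = step(state, backup_policy(state))
    return is_safe(state)


def shielded_action(state, learned_policy, backup_policy, step, is_safe, horizon=5):
    """Use the learned action only if the backup policy can recover afterwards."""
    proposed = learned_policy(state)
    if is_recoverable(step(state, proposed), step, backup_policy, is_safe, horizon):
        return proposed
    return backup_policy(state)


# Toy 1-D example (hypothetical): state = (position, velocity), wall at x = 10.
def step(state, accel):
    x, v = state
    v = max(0, v + accel)  # velocity never goes negative in this toy model
    return (x + v, v)


def is_safe(state):
    return state[0] < 10


def accelerate(state):  # stand-in for the learned policy
    return 1


def brake(state):  # stand-in for the backup policy
    return -1
```

In this toy setup the shield lets the vehicle accelerate from `(0, 0)` but overrides the learned action with braking at `(5, 3)`, where one more acceleration would make the wall unavoidable. The fallback is task-oblivious, which is the conservatism the abstract says DMPS aims to reduce.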

Deep Reinforcement Learning for Time-Critical Wilderness Search And Rescue Using Drones

Authors:Jan-Hendrik Ewers, David Anderson, Douglas Thomson
Date:2024-05-21 13:51:47

Traditional search and rescue methods in wilderness areas can be time-consuming and have limited coverage. Drones offer a faster and more flexible solution, but optimizing their search paths is crucial. This paper explores the use of deep reinforcement learning to create efficient search missions for drones in wilderness environments. Our approach leverages a priori data about the search area and the missing person in the form of a probability distribution map. This allows the deep reinforcement learning agent to learn optimal flight paths that maximize the probability of finding the missing person quickly. Experimental results show that our method achieves a significant improvement in search times compared to traditional coverage planning and search planning algorithms. In one comparison, deep reinforcement learning is found to outperform other algorithms by over $160\%$, a difference that can mean life or death in real-world search operations. Additionally, unlike previous work, our approach incorporates a continuous action space enabled by cubature, allowing for more nuanced flight patterns.

Physics-based Scene Layout Generation from Human Motion

Authors:Jianan Li, Tao Huang, Qingxu Zhu, Tien-Tsin Wong
Date:2024-05-21 02:36:37

Creating scenes for captured motions that achieve realistic human-scene interaction is crucial for 3D animation in movies or video games. As character motion is often captured in a blue-screened studio without real furniture or objects in place, there may be a discrepancy between the planned motion and the captured one. This gives rise to the need for automatic scene layout generation to relieve the burdens of selecting and positioning furniture and objects. Previous approaches cannot avoid artifacts like penetration and floating due to the lack of physical constraints. Furthermore, some heavily rely on specific data to learn the contact affordances, restricting the generalization ability to different motions. In this work, we present a physics-based approach that simultaneously optimizes a scene layout generator and simulates a moving human in a physics simulator. To attain plausible and realistic interaction motions, our method explicitly introduces physical constraints. To automatically recover and generate the scene layout, we minimize the motion tracking errors to identify the objects that can afford interaction. We use reinforcement learning to perform a dual-optimization of both the character motion imitation controller and the scene layout generator. To facilitate the optimization, we reshape the tracking rewards and devise pose prior guidance obtained from our estimated pseudo-contact labels. We evaluate our method using motions from SAMP and PROX, and demonstrate physically plausible scene layout reconstruction compared with the previous kinematics-based method.

Efficient Multi-agent Reinforcement Learning by Planning

Authors:Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, Chongjie Zhang
Date:2024-05-20 04:36:02

Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms integrating planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges. The expansive action space of multi-agent systems often necessitates leveraging the nearly-independent property of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a centralized model with Monte Carlo Tree Search (MCTS) for policy search. We design a novel network structure to facilitate distributed execution and parameter sharing. To enhance search efficiency in deterministic environments with sizable action spaces, we introduce two novel techniques: Optimistic Search Lambda (OS($\lambda$)) and Advantage-Weighted Policy Optimization (AWPO). Extensive experiments on the SMAC benchmark demonstrate that MAZero outperforms model-free approaches in terms of sample efficiency and provides comparable or better performance than existing model-based methods in terms of both sample and computational efficiency. Our code is available at https://github.com/liuqh16/MAZero.

Highway Graph to Accelerate Reinforcement Learning

Authors:Zidu Yin, Zhen Zhang, Dong Gong, Stefano V. Albrecht, Javen Q. Shi
Date:2024-05-20 02:09:07

Reinforcement Learning (RL) algorithms often struggle with low training efficiency. A common approach to address this challenge is integrating model-based planning algorithms, such as Monte Carlo Tree Search (MCTS) or Value Iteration (VI), into the environmental model. However, VI requires iterating over a large tensor which updates the value of the preceding state based on the succeeding state through value propagation, resulting in computationally intensive operations. To enhance the RL training efficiency, we propose improving the efficiency of the value learning process. In deterministic environments with discrete state and action spaces, we observe that on the sampled empirical state-transition graph, a non-branching sequence of transitions, termed a highway, can take the agent to another state without deviation through intermediate states. On these non-branching highways, the value-updating process can be streamlined into a single-step operation, eliminating the need for step-by-step updates. Building on this observation, we introduce the highway graph to model state transitions. The highway graph compresses the transition model into a compact representation, where edges can encapsulate multiple state transitions, enabling value propagation across multiple time steps in a single iteration. By integrating the highway graph into RL, the training process is significantly accelerated, particularly in the early stages of training. Experiments across four categories of environments demonstrate that our method learns significantly faster than established and state-of-the-art RL algorithms (often by a factor of 10 to 150) while maintaining equal or superior expected returns. Furthermore, a deep neural network-based agent trained using the highway graph exhibits improved generalization capabilities and reduced storage costs. Code is publicly available at https://github.com/coodest/highwayRL.
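
The single-step update over a non-branching chain can be illustrated with a small sketch. It assumes a standard discounted backup in a deterministic environment; the abstract does not give the exact update rule, and the function names are illustrative.

```python
def compress_highway(rewards, gamma):
    """Collapse a non-branching chain of rewards into a single edge."""
    ret, discount = 0.0, 1.0
    for r in rewards:
        ret += discount * r
        discount *= gamma
    return ret, discount  # (accumulated reward, gamma ** len(rewards))


def backup_over_highway(edge, value_at_end):
    """One-shot value update across the whole compressed edge."""
    ret, discount = edge
    return ret + discount * value_at_end


def backup_step_by_step(rewards, gamma, value_at_end):
    """Reference: the same update done one transition at a time."""
    value = value_at_end
    for r in reversed(rewards):
        value = r + gamma * value
    return value
```

Both routes give the same value for the first state of the chain, but the compressed edge is computed once and then reused in every later value-propagation sweep.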

Reinforcement learning

Authors:Sarod Yatawatta
Date:2024-05-16 18:03:17

Observing celestial objects and advancing our scientific knowledge about them involves tedious planning, scheduling, data collection and data post-processing. Many of these operational aspects of astronomy are guided and executed by expert astronomers. Reinforcement learning is a mechanism where we (as humans and astronomers) can teach agents of artificial intelligence to perform some of these tedious tasks. In this paper, we present a state-of-the-art overview of reinforcement learning and how it can benefit astronomy.

Continuous Transfer Learning for UAV Communication-aware Trajectory Design

Authors:Chenrui Sun, Gianluca Fontanesi, Swarna Bindu Chetty, Xuanyu Liang, Berk Canberk, Hamed Ahmadi
Date:2024-05-16 13:30:35

Deep Reinforcement Learning (DRL) emerges as a prime solution for Unmanned Aerial Vehicle (UAV) trajectory planning, offering proficiency in navigating high-dimensional spaces, adaptability to dynamic environments, and making sequential decisions based on real-time feedback. Despite these advantages, the use of DRL for UAV trajectory planning requires significant retraining when the UAV is confronted with a new environment, resulting in wasted resources and time. Therefore, it is essential to develop techniques that can reduce the overhead of retraining DRL models, enabling them to adapt to constantly changing environments. This paper presents a novel method to reduce the need for extensive retraining using a double deep Q network (DDQN) model as a pretrained base, which is subsequently adapted to different urban environments through Continuous Transfer Learning (CTL). Our method involves transferring the learned model weights and adapting the learning parameters, including the learning and exploration rates, to suit each new environment's specific characteristics. The effectiveness of our approach is validated in three scenarios, each with different levels of similarity. CTL significantly improves learning speed and success rates compared to DDQN models initiated from scratch. For similar environments, Transfer Learning (TL) improved stability, accelerated convergence by 65%, and facilitated 35% faster adaptation in dissimilar settings.

Optimizing Search and Rescue UAV Connectivity in Challenging Terrain through Multi Q-Learning

Authors:Mohammed M. H. Qazzaz, Syed A. R. Zaidi, Desmond C. McLernon, Abdelaziz Salama, Aubida A. Al-Hameed
Date:2024-05-16 12:23:51

Using Unmanned Aerial Vehicles (UAVs) in search and rescue (SAR) operations to navigate challenging terrain while maintaining reliable communication with the cellular network is a promising approach. This paper proposes a novel technique employing a reinforcement learning multi Q-learning algorithm to optimize UAV connectivity in such scenarios. We introduce a Strategic Planning Agent for efficient path planning and collision awareness and a Real-time Adaptive Agent to maintain optimal connection with the cellular base station. The agents are trained in a simulated environment using multi Q-learning, encouraging them to learn from experience and adjust their decision-making to diverse terrain complexities and communication scenarios. Evaluation results reveal the significance of the approach, highlighting successful navigation in environments with varying obstacle densities and the ability to maintain optimal connectivity using different frequency bands. This work paves the way for improved UAV autonomy and communication reliability in search and rescue operations.

Combining RL and IL using a dynamic, performance-based modulation over learning signals and its application to local planning

Authors:Francisco Leiva, Javier Ruiz-del-Solar
Date:2024-05-16 02:08:17

This paper proposes a method to combine reinforcement learning (RL) and imitation learning (IL) using a dynamic, performance-based modulation over learning signals. The proposed method combines RL and behavioral cloning (IL), or corrective feedback in the action space (interactive IL/IIL), by dynamically weighting the losses to be optimized, taking into account the backpropagated gradients used to update the policy and the agent's estimated performance. In this manner, RL and IL/IIL losses are combined by equalizing their impact on the policy's updates, while modulating said impact such that IL signals are prioritized at the beginning of the learning process, and as the agent's performance improves, the RL signals become progressively more relevant, allowing for a smooth transition from pure IL/IIL to pure RL. The proposed method is used to learn local planning policies for mobile robots, synthesizing IL/IIL signals online by means of a scripted policy. An extensive evaluation of the application of the proposed method to this task is performed in simulations, and it is empirically shown that it outperforms pure RL in terms of sample efficiency (achieving the same level of performance in the training environment utilizing approximately 4 times fewer experiences), while consistently producing local planning policies with better performance metrics (achieving an average success rate of 0.959 in an evaluation environment, outperforming pure RL by 12.5% and pure IL by 13.9%). Furthermore, the obtained local planning policies are successfully deployed in the real world without performing any major fine tuning. The proposed method can extend existing RL algorithms, and is applicable to other problems for which generating IL/IIL signals online is feasible. A video summarizing some of the real world experiments that were conducted can be found at https://youtu.be/mZlaXn9WGzw.
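
The weighting idea in this abstract can be illustrated with a minimal sketch. The formula below is an assumption made for illustration only, not the paper's actual rule: each loss is scaled by the inverse of its gradient norm (to equalize impact on the update) and by a performance estimate in [0, 1] (to shift from IL toward RL). The function name, parameters, and the `eps` constant are all hypothetical.

```python
def modulated_weights(grad_norm_rl, grad_norm_il, performance, eps=1e-8):
    """Return (w_rl, w_il) for a combined loss w_rl * L_rl + w_il * L_il.

    Hypothetical scheme: inverse gradient norms equalize the two signals,
    and `performance` in [0, 1] moves the emphasis from IL to RL.
    """
    p = min(max(performance, 0.0), 1.0)      # clamp the performance estimate
    w_rl = p / (grad_norm_rl + eps)          # RL grows with performance
    w_il = (1.0 - p) / (grad_norm_il + eps)  # IL dominates early on
    return w_rl, w_il


# Early in training (performance 0) only the IL term contributes;
# once performance reaches 1 only the RL term remains.
early = modulated_weights(2.0, 4.0, performance=0.0)
late = modulated_weights(2.0, 4.0, performance=1.0)
```

At a performance of 0.5 both weighted gradients have the same magnitude under this scheme, which is one simple reading of "equalizing their impact".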

Enhancing Reinforcement Learning in Sensor Fusion: A Comparative Analysis of Cubature and Sampling-based Integration Methods for Rover Search Planning

Authors:Jan-Hendrik Ewers, Sarah Swinton, David Anderson, Euan McGookin, Douglas Thomson
Date:2024-05-14 15:24:52

This study investigates the computational speed and accuracy of two numerical integration methods, cubature and sampling-based, for integrating an integrand over a 2D polygon. Using a group of rovers searching the Martian surface with a limited sensor footprint as a test bed, the relative error and computational time are compared as the area was subdivided to improve accuracy in the sampling-based approach. The results show that the sampling-based approach exhibits a $14.75\%$ deviation in relative error compared to cubature when it matches the computational performance at $100\%$. Furthermore, achieving a relative error below $1\%$ necessitates a $10000\%$ increase in relative time to calculate due to the $\mathcal{O}(N^2)$ complexity of the sampling-based method. It is concluded that for enhancing reinforcement learning capabilities and other high iteration algorithms, the cubature method is preferred over the sampling-based method.
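
To make the comparison concrete, here is a small sketch that contrasts a cubature rule with grid sampling on a single triangle. The unit triangle, the integrand `f`, and the grid resolution are assumptions chosen for illustration; the study itself works with general 2D polygons and its own integrand.

```python
def f(x, y):
    # Hypothetical integrand standing in for a sensor-coverage density.
    return x * y


def cubature_unit_triangle(func):
    """Three-point edge-midpoint rule on the triangle (0,0), (1,0), (0,1).

    This rule is exact for polynomials up to degree 2.
    """
    area = 0.5
    midpoints = [(0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
    return area / 3.0 * sum(func(x, y) for x, y in midpoints)


def grid_sampling_unit_triangle(func, n):
    """Sum func at the centres of an n-by-n grid, keeping points inside the triangle.

    The cost is n * n function evaluations, so halving the cell size
    quadruples the work.
    """
    cell_area = 1.0 / (n * n)
    total = 0.0
    for i in range(n):
        for j in range(n):
            x, y = (i + 0.5) / n, (j + 0.5) / n
            if x + y <= 1.0:
                total += func(x, y) * cell_area
    return total


exact = 1.0 / 24.0  # analytic integral of x*y over the unit triangle
cubature = cubature_unit_triangle(f)          # 3 evaluations
sampled = grid_sampling_unit_triangle(f, 200) # 40,000 evaluations
```

The cubature rule reaches the analytic value with three evaluations here because the integrand is a low-degree polynomial; for other integrands a higher-order rule would be needed.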

Radio Resource Management and Path Planning in Intelligent Transportation Systems via Reinforcement Learning for Environmental Sustainability

Authors:S. Norouzi, N. Azarasa, M. R. Abedi, N. Mokari, S. E. Seyedabrishami, H. Saeedi, E. A. Jorswieck
Date:2024-05-13 16:48:16

Efficient and dynamic path planning has become an important topic for urban areas with larger density of connected vehicles (CV) which results in reduction of travel time and directly contributes to environmental sustainability through reducing energy consumption. CVs exploit the cellular wireless vehicle-to-everything (C-V2X) communication technology to disseminate the vehicle-to-infrastructure (V2I) messages to the Base-station (BS) to improve situation awareness on urban roads. In this paper, we investigate radio resource management (RRM) in such a framework to minimize the age of information (AoI) so as to enhance path planning results. We use the fact that V2I messages with lower AoI value result in less error in estimating the road capacity and more accurate path planning. Through simulations, we compare road travel times and volume over capacity (V/C) against different levels of AoI and demonstrate the promising performance of the proposed framework.
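
For readers unfamiliar with the metric, the age of information at time t is commonly defined as t minus the generation time of the freshest update received so far. A minimal sketch follows; the function name and the `(generated_at, received_at)` tuple layout are assumptions for illustration, not the paper's notation.

```python
def age_of_information(t, updates):
    """Return the AoI at time t, or None if nothing has been received yet.

    `updates` is an iterable of (generated_at, received_at) pairs, e.g. the
    V2I messages that have reached the base station.
    """
    delivered = [generated for generated, received in updates if received <= t]
    if not delivered:
        return None
    return t - max(delivered)


# Three hypothetical V2I messages: (generation time, reception time) in seconds.
messages = [(0.0, 1.0), (2.0, 2.5), (4.0, 6.0)]
aoi_at_5 = age_of_information(5.0, messages)  # freshest delivered message was generated at t=2
```

Lower AoI means the base station's picture of the road is more recent, which is the quantity the radio resource management scheme aims to minimize.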

Space Processor Computation Time Analysis for Reinforcement Learning and Run Time Assurance Control Policies

Authors:Kyle Dunlap, Nathaniel Hamilton, Francisco Viramontes, Derrek Landauer, Evan Kain, Kerianne L. Hobbs
Date:2024-05-10 18:52:28

As the number of spacecraft on orbit continues to grow, it is challenging for human operators to constantly monitor and plan for all missions. Autonomous control methods such as reinforcement learning (RL) have the power to solve complex tasks while reducing the need for constant operator intervention. By combining RL solutions with run time assurance (RTA), safety of these systems can be assured in real time. However, in order to use these algorithms on board a spacecraft, they must be able to run in real time on space grade processors, which are typically outdated and less capable than state-of-the-art equipment. In this paper, multiple RL-trained neural network controllers (NNCs) and RTA algorithms were tested on commercial-off-the-shelf (COTS) and radiation tolerant processors. The results show that all NNCs and most RTA algorithms can compute optimal and safe actions in well under 1 second with room for further optimization before deploying in the real world.

Logic-Skill Programming: An Optimization-based Approach to Sequential Skill Planning

Authors:Teng Xue, Amirreza Razmjoo, Suhan Shetty, Sylvain Calinon
Date:2024-05-07 07:27:28

Recent advances in robot skill learning have unlocked the potential to construct task-agnostic skill libraries, facilitating the seamless sequencing of multiple simple manipulation primitives (a.k.a. skills) to tackle significantly more complex tasks. Nevertheless, determining the optimal sequence for independently learned skills remains an open problem, particularly when the objective is given solely in terms of the final geometric configuration rather than a symbolic goal. To address this challenge, we propose Logic-Skill Programming (LSP), an optimization-based approach that sequences independently learned skills to solve long-horizon tasks. We formulate a first-order extension of a mathematical program to optimize the overall cumulative reward of all skills within a plan, abstracted by the sum of value functions. To solve such programs, we use tensor train factorization to construct the value function space, and rely on alternations between symbolic search and skill value optimization to find the appropriate skill skeleton and optimal subgoal sequence. Experimental results indicate that the obtained value functions provide a superior approximation of cumulative rewards compared to state-of-the-art reinforcement learning methods. Furthermore, we validate LSP in three manipulation domains, encompassing both prehensile and non-prehensile primitives. The results demonstrate its capability to identify the optimal solution over the full logic and geometric path. The real-robot experiments showcase the effectiveness of our approach to cope with contact uncertainty and external disturbances in the real world.

Learning Robot Soccer from Egocentric Vision with Deep Reinforcement Learning

Authors:Dhruva Tirumala, Markus Wulfmeier, Ben Moran, Sandy Huang, Jan Humplik, Guy Lever, Tuomas Haarnoja, Leonard Hasenclever, Arunkumar Byravan, Nathan Batchelor, Neil Sreendra, Kushal Patel, Marlon Gwira, Francesco Nori, Martin Riedmiller, Nicolas Heess
Date:2024-05-03 18:41:13

We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially-observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision which can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors including object tracking and ball seeking that emerge when simply optimizing perception-agnostic soccer play. The agents display equivalent levels of performance and agility as policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world. Videos of the game-play and analyses can be seen on our website https://sites.google.com/view/vision-soccer .

Learning Robust Autonomous Navigation and Locomotion for Wheeled-Legged Robots

Authors:Joonho Lee, Marko Bjelonic, Alexander Reske, Lorenz Wellhausen, Takahiro Miki, Marco Hutter
Date:2024-05-03 00:29:20

Autonomous wheeled-legged robots have the potential to transform logistics systems, improving operational efficiency and adaptability in urban environments. Navigating urban environments, however, poses unique challenges for robots, necessitating innovative solutions for locomotion and navigation. These challenges include the need for adaptive locomotion across varied terrains and the ability to navigate efficiently around complex dynamic obstacles. This work introduces a fully integrated system comprising adaptive locomotion control, mobility-aware local navigation planning, and large-scale path planning within the city. Using model-free reinforcement learning (RL) techniques and privileged learning, we develop a versatile locomotion controller. This controller achieves efficient and robust locomotion over various rough terrains, facilitated by smooth transitions between walking and driving modes. It is tightly integrated with a learned navigation controller through a hierarchical RL framework, enabling effective navigation through challenging terrain and various obstacles at high speed. Our controllers are integrated into a large-scale urban navigation system and validated by autonomous, kilometer-scale navigation missions conducted in Zurich, Switzerland, and Seville, Spain. These missions demonstrate the system's robustness and adaptability, underscoring the importance of integrated control systems in achieving seamless navigation in complex environments. Our findings support the feasibility of wheeled-legged robots and hierarchical RL for autonomous navigation, with implications for last-mile delivery and beyond.

Adversarial Attacks on Reinforcement Learning Agents for Command and Control

Authors:Ahaan Dabholkar, James Z. Hare, Mark Mittrick, John Richardson, Nicholas Waytowich, Priya Narayanan, Saurabh Bagchi
Date:2024-05-02 19:28:55

Given the recent impact of Deep Reinforcement Learning in training agents to win complex games like StarCraft and DoTA (Defense of the Ancients), there has been a surge in research on exploiting learning-based techniques for professional wargaming, battlefield simulation, and modeling. Real-time strategy games and simulators have become a valuable resource for operational planning and military research. However, recent work has shown that such learning-based approaches are highly susceptible to adversarial perturbations. In this paper, we investigate the robustness of an agent trained for a Command and Control (C2) task in an environment that is controlled by an active adversary. The C2 agent is trained on custom StarCraft II maps using the state-of-the-art RL algorithms A3C and PPO. We empirically show that an agent trained using these algorithms is highly susceptible to noise injected by the adversary and investigate the effects these perturbations have on the performance of the trained agent. Our work highlights the urgent need to develop more robust training algorithms, especially for critical arenas like the battlefield.

Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

Authors:Murtaza Dalal, Tarun Chiruvolu, Devendra Chaplot, Ruslan Salakhutdinov
Date:2024-05-02 17:59:31

Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, out-performing language-based, classical, and end-to-end approaches. Video results and code at https://mihdalal.github.io/planseqlearn/

Non-iterative Optimization of Trajectory and Radio Resource for Aerial Network

Authors:Hyeonsu Lyu, Jonggyu Jang, Harim Lee, Hyun Jong Yang
Date:2024-05-02 14:21:29

We address a joint trajectory planning, user association, resource allocation, and power control problem to maximize proportional fairness in the aerial IoT network, considering practical end-to-end quality-of-service (QoS) and communication schedules. Though the problem is rather ancient, apart from the fact that the previous approaches have never considered user- and time-specific QoS, we point out a prevalent mistake in coordinate optimization approaches adopted by the majority of the literature. Coordinate optimization approaches, which repetitively optimize radio resources for a fixed trajectory and vice versa, generally converge to local optima when all variables are differentiable. However, these methods often stagnate at a non-stationary point, significantly degrading the network utility in mixed-integer problems such as joint trajectory and radio resource optimization. We circumvent this problem by converting the formulated problem into a Markov decision process (MDP). Exploiting the beneficial characteristics of the MDP, we design a non-iterative framework that cooperatively optimizes trajectory and radio resources without initial trajectory choice. The proposed framework can incorporate various trajectory-planning algorithms such as the genetic algorithm, tree search, and reinforcement learning. Extensive comparisons with diverse baselines verify that the proposed framework significantly outperforms the state-of-the-art method, nearly achieving the global optimum. Our implementation code is available at https://github.com/hslyu/dbspf.

Learning Tactile Insertion in the Real World

Authors:Daniel Palenicek, Theo Gruner, Tim Schneider, Alina Böhm, Janis Lenz, Inga Pfenning, Eric Krämer, Jan Peters
Date:2024-05-01 08:30:10

Humans have exceptional tactile sensing capabilities, which they can leverage to solve challenging, partially observable tasks that cannot be solved from visual observation alone. Research in tactile sensing attempts to unlock this new input modality for robots. Lately, these sensors have become cheaper and, thus, widely available. At the same time, the question of how to integrate them into control loops is still an active area of research, with central challenges being partial observability and the contact-rich nature of manipulation tasks. In this study, we propose to use Reinforcement Learning to learn an end-to-end policy, mapping directly from tactile sensor readings to actions. Specifically, we use Dreamer-v3 on a challenging, partially observable robotic insertion task with a Franka Research 3, both in simulation and on a real system. For the real setup, we built a robotic platform capable of resetting itself fully autonomously, allowing for extensive training runs without human supervision. Our preliminary results indicate that Dreamer is capable of utilizing tactile inputs to solve robotic manipulation tasks in simulation and reality. Furthermore, we find that providing the robot with tactile feedback generally improves task performance, though, in our setup, we do not yet include other sensing modalities. In the future, we plan to utilize our platform to evaluate a wide range of other Reinforcement Learning algorithms on tactile tasks.

Numeric Reward Machines

Authors:Kristina Levina, Nikolaos Pappas, Athanasios Karapantelakis, Aneta Vulgarakis Feljan, Jendrik Seipp
Date:2024-04-30 08:58:47

Reward machines inform reinforcement learning agents about the reward structure of the environment and often drastically speed up the learning process. However, reward machines only accept Boolean features such as robot-reached-gold. Consequently, many inherently numeric tasks cannot profit from the guidance offered by reward machines. To address this gap, we aim to extend reward machines with numeric features such as distance-to-gold. For this, we present two types of reward machines: numeric-Boolean and numeric. In a numeric-Boolean reward machine, distance-to-gold is emulated by two Boolean features distance-to-gold-decreased and robot-reached-gold. In a numeric reward machine, distance-to-gold is used directly alongside the Boolean feature robot-reached-gold. We compare our new approaches to a baseline reward machine in the Craft domain, where the numeric feature is the agent-to-target distance. We use cross-product Q-learning, Q-learning with counter-factual experiences, and the options framework for learning. Our experimental results show that our new approaches significantly outperform the baseline approach. Extending reward machines with numeric features opens up new possibilities of using reward machines in inherently numeric tasks.
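
The numeric-Boolean variant described above can be sketched as a two-state machine driven by the Boolean features distance-to-gold-decreased and robot-reached-gold. The reward values, the Manhattan distance, the class name, and the state labels `u0`/`u1` are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class NumericBooleanRewardMachine:
    """Two-state reward machine: u0 (searching) -> u1 (gold reached, terminal)."""
    step_bonus: float = 0.1   # hypothetical reward when the distance decreased
    goal_reward: float = 1.0  # hypothetical reward on reaching the gold
    state: str = "u0"

    def step(self, distance_decreased: bool, reached_gold: bool) -> float:
        if self.state == "u1":  # terminal state: no further reward
            return 0.0
        if reached_gold:
            self.state = "u1"
            return self.goal_reward
        return self.step_bonus if distance_decreased else 0.0


def labels(prev_pos, pos, gold):
    """Emulate the numeric distance-to-gold with the two Boolean features."""
    def dist(p):
        return abs(p[0] - gold[0]) + abs(p[1] - gold[1])  # Manhattan distance
    return dist(pos) < dist(prev_pos), pos == gold


machine = NumericBooleanRewardMachine()
first_reward = machine.step(*labels((0, 0), (1, 0), gold=(2, 0)))  # moved closer
```

A fully numeric reward machine would instead pass the distance itself to `step`, for example to scale the reward by how much closer the agent got.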

Socially Adaptive Path Planning Based on Generative Adversarial Network

Authors:Yao Wang, Yuqi Kong, Wenzheng Chi, Lining Sun
Date:2024-04-29 13:34:19

The natural interaction between robots and pedestrians in the process of autonomous navigation is crucial for the intelligent development of mobile robots, which requires robots to fully consider social rules and guarantee the psychological comfort of pedestrians. Among the research results in the field of robotic path planning, the learning-based socially adaptive algorithms have performed well in some specific human-robot interaction environments. However, human-robot interaction scenarios are diverse and constantly changing in daily life, and the generalization of robot socially adaptive path planning remains to be further investigated. In order to address this issue, this work proposes a new socially adaptive path planning algorithm by combining the generative adversarial network (GAN) with the Optimal Rapidly-exploring Random Tree (RRT*) navigation algorithm. Firstly, a GAN model with strong generalization performance is proposed to adapt the navigation algorithm to more scenarios. Secondly, a GAN model based Optimal Rapidly-exploring Random Tree navigation algorithm (GAN-RRT*) is proposed to generate paths in human-robot interaction environments. Finally, we propose a socially adaptive path planning framework named GAN-RTIRL, which combines the GAN model with Rapidly-exploring random Trees Inverse Reinforcement Learning (RTIRL) to improve the homotopy rate between planned and demonstration paths. In the GAN-RTIRL framework, the GAN-RRT* path planner can update the GAN model from the demonstration path. In this way, the robot can generate more anthropomorphic paths in human-robot interaction environments and has stronger generalization in more complex environments. Experimental results reveal that our proposed method can effectively improve the anthropomorphic degree of robot motion planning and the homotopy rate between planned and demonstration paths.

Generalize by Touching: Tactile Ensemble Skill Transfer for Robotic Furniture Assembly

Authors:Haohong Lin, Radu Corcodel, Ding Zhao
Date:2024-04-26 20:27:10

Furniture assembly remains an unsolved problem in robotic manipulation due to its long task horizon and nongeneralizable operations plan. This paper presents the Tactile Ensemble Skill Transfer (TEST) framework, a pioneering offline reinforcement learning (RL) approach that incorporates tactile feedback in the control loop. TEST's core design is to learn a skill transition model for high-level planning, along with a set of adaptive intra-skill goal-reaching policies. Such design aims to solve the robotic furniture assembly problem in a more generalizable way, facilitating seamless chaining of skills for this long-horizon task. We first sample demonstrations from a set of heuristic policies and trajectories consisting of a set of randomized sub-skill segments, enabling the acquisition of rich robot trajectories that capture skill stages, robot states, visual indicators, and crucially, tactile signals. Leveraging these trajectories, our offline RL method discerns skill termination conditions and coordinates skill transitions. Our evaluations highlight the proficiency of TEST on the in-distribution furniture assemblies, its adaptability to unseen furniture configurations, and its robustness against visual disturbances. Ablation studies further accentuate the pivotal role of two algorithmic components: the skill transition model and tactile ensemble policies. Results indicate that TEST can achieve a success rate of 90% and is over 4 times more efficient than the heuristic policy in both in-distribution and generalization settings, suggesting a scalable skill transfer approach for contact-rich manipulation.

Precise Object Placement Using Force-Torque Feedback

Authors:Osher Lerner, Zachary Tam, Michael Equi
Date:2024-04-26 19:25:01

Precise object manipulation and placement is a common problem for household robots, surgery robots, and robots working on in-situ construction. Prior work using computer vision, depth sensors, and reinforcement learning lacks the ability to reactively recover from planning errors, execution errors, or sensor noise. This work introduces a method that uses force-torque sensing to robustly place objects in stable poses, even in adversarial environments. On 46 trials, our method finds success rates of 100% for basic stacking, and 17% for cases requiring adjustment.

Adaptive speed planning for Unmanned Vehicle Based on Deep Reinforcement Learning

Authors:Hao Liu, Yi Shen, Wenjing Zhou, Yuelin Zou, Chang Zhou, Shuyao He
Date:2024-04-26 12:55:05

In order to solve the problem of frequent deceleration of unmanned vehicles when approaching obstacles, this article uses a Deep Q-Network (DQN) and its extension, the Double Deep Q-Network (DDQN), to develop a local navigation system that adapts to obstacles while maintaining optimal speed planning. By integrating improved reward functions and obstacle angle determination methods, the system demonstrates significant enhancements in maneuvering capabilities without frequent decelerations. Experiments conducted in simulated environments with varying obstacle densities confirm the effectiveness of the proposed method in achieving more stable and efficient path planning.
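
The difference between the two networks named here lies in how the bootstrap target is built. The sketch below shows the standard DQN and Double DQN targets for a single transition; the function names, plain-list Q-values, and default discount are illustrative choices, not details from the article.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN target: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Q-values for the next state under each network (made-up numbers).
q_online = [1.0, 2.0, 0.5]
q_target = [0.8, 1.0, 3.0]
y_dqn = dqn_target(1.0, q_target, gamma=0.5)              # uses max of q_target
y_ddqn = ddqn_target(1.0, q_online, q_target, gamma=0.5)  # uses q_target[argmax q_online]
```

Decoupling selection from evaluation keeps one network's overestimate (the 3.0 above) from flowing straight into the target, which is the usual motivation for preferring DDQN.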

Adapting Open-Source Large Language Models for Cost-Effective, Expert-Level Clinical Note Generation with On-Policy Reinforcement Learning

Authors:Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Chuck Outcalt, Jimeng Sun
Date:2024-04-25 15:34:53

Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic scored higher (4.2/5) in real-world readiness than physician-authored notes (4.1/5). Our cost analysis for inference shows that our LLaMA-Clinic model achieves a 3.75-fold cost reduction compared to an external generic LLM service. Additionally, we highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format, rather than relying on LLMs to determine this for clinical practice. We have made our newly created synthetic clinic dialogue-note dataset and the physician feedback dataset publicly available to foster future research.

The Power of Resets in Online Reinforcement Learning

Authors:Zakaria Mhammedi, Dylan J. Foster, Alexander Rakhlin
Date:2024-04-23 18:09:53

Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access -- particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach. First, we show that MDPs with low coverability (Xie et al. 2023) -- a general structural condition that subsumes Block MDPs and Low-Rank MDPs -- can be learned in a sample-efficient fashion with only $Q^{\star}$-realizability (realizability of the optimal state-value function); existing online RL algorithms require significantly stronger representation conditions. Second, as a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al. 2022) is tractable under local simulator access. The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation.

Planning the path with Reinforcement Learning: Optimal Robot Motion Planning in RoboCup Small Size League Environments

Authors:Mateus G. Machado, João G. Melo, Cleber Zanchettin, Pedro H. M. Braga, Pedro V. Cunha, Edna N. S. Barros, Hansenclever F. Bassani
Date:2024-04-23 18:01:30

This work investigates the potential of Reinforcement Learning (RL) to tackle robot motion planning challenges in the dynamic RoboCup Small Size League (SSL). Using a heuristic control approach, we evaluate RL's effectiveness in obstacle-free and single-obstacle path-planning environments. Ablation studies reveal significant performance improvements. Our method achieved a 60% time gain in obstacle-free environments compared to baseline algorithms. Additionally, our findings demonstrated dynamic obstacle avoidance capabilities, adeptly navigating around moving blocks. These findings highlight the potential of RL to enhance robot motion planning in the challenging and unpredictable SSL environment.

Graph Neural Networks and Reinforcement Learning for Proactive Application Image Placement

Authors:Antonios Makris, Theodoros Theodoropoulos, Evangelos Psomakelis, Emanuele Carlini, Matteo Mordacchini, Patrizio Dazzi, Konstantinos Tserpes
Date:2024-04-23 13:06:09

The shift from Cloud Computing to a Cloud-Edge continuum presents new opportunities and challenges for data-intensive and interactive applications. Edge computing has garnered a lot of attention from both industry and academia in recent years, emerging as a key enabler for meeting the increasingly strict demands of Next Generation applications. In Edge computing, the computations are placed closer to the end-users to facilitate low-latency and high-bandwidth applications and services. However, the distributed, dynamic, and heterogeneous nature of Edge computing presents a significant challenge for service placement. A critical aspect of Edge computing involves managing the placement of applications within the network system to minimize each application's runtime, considering the resources available on system devices and the capabilities of the system's network. The placement of application images must be proactively planned to minimize image transfer time and meet the strict demands of the applications. In this regard, this paper proposes an approach for proactive image placement that combines Graph Neural Networks and actor-critic Reinforcement Learning, which is evaluated empirically and compared against various solutions. The findings indicate that although the proposed approach may result in longer execution times in certain scenarios, it consistently achieves superior outcomes in terms of application placement.

Enhancing High-Speed Cruising Performance of Autonomous Vehicles through Integrated Deep Reinforcement Learning Framework

Authors:Jinhao Liang, Kaidi Yang, Chaopeng Tan, Jinxiang Wang, Guodong Yin
Date:2024-04-23 03:40:58

High-speed cruising scenarios with mixed traffic greatly challenge the road safety of autonomous vehicles (AVs). Unlike existing works that only look at fundamental modules in isolation, this work enhances AV safety in mixed-traffic high-speed cruising scenarios by proposing an integrated framework that synthesizes three fundamental modules, i.e., behavioral decision-making, path-planning, and motion-control modules. Considering that the integrated framework would increase the system complexity, a bootstrapped deep Q-Network (DQN) is employed to enhance the deep exploration of the reinforcement learning method and achieve adaptive decision making of AVs. Moreover, to make AV behavior understandable to surrounding human-driven vehicles (HDVs) and prevent unexpected operations caused by misinterpretations, we derive an inverse reinforcement learning (IRL) approach to learn the reward function of skilled drivers for the path planning of lane-changing maneuvers. Such a design enables AVs to achieve a human-like tradeoff between multi-performance requirements. Simulations demonstrate that the proposed integrated framework can guide AVs to take safe actions while guaranteeing high-speed cruising performance.

Research on Robot Path Planning Based on Reinforcement Learning

Authors:Wang Ruiqi
Date:2024-04-22 10:49:46

This project has conducted research on robot path planning based on Visual SLAM. The main work of this project is as follows: (1) Construction of Visual SLAM system. Research has been conducted on the basic architecture of Visual SLAM. A Visual SLAM system is developed based on ORB-SLAM3 system, which can conduct dense point cloud mapping. (2) The map suitable for two-dimensional path planning is obtained through map conversion. This part converts the dense point cloud map obtained by Visual SLAM system into an octomap and then performs projection transformation to the grid map. The map conversion converts the dense point cloud map containing a large amount of redundant map information into an extremely lightweight grid map suitable for path planning. (3) Research on path planning algorithm based on reinforcement learning. This project has conducted experimental comparisons between the Q-learning algorithm, the DQN algorithm, and the SARSA algorithm, and found that DQN is the algorithm with the fastest convergence and best performance in high-dimensional complex environments. This project has conducted experimental verification of the Visual SLAM system in a simulation environment. The experimental results obtained based on open-source dataset and self-made dataset prove the feasibility and effectiveness of the designed Visual SLAM system. At the same time, this project has also conducted comparative experiments on the three reinforcement learning algorithms under the same experimental condition to obtain the optimal algorithm under the experimental condition.
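
As a minimal sketch of two of the algorithms compared above, the tabular Q-learning and SARSA updates differ only in how they bootstrap from the next state. The function names, the NumPy Q-table layout, and the default `alpha` and `gamma` values are illustrative assumptions, not taken from the project.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the greedy action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

DQN replaces the table `Q` with a neural network, which is what makes it applicable to the high-dimensional environments mentioned in the abstract.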

Adaptive Social Force Window Planner with Reinforcement Learning

Authors:Mauro Martini, Noé Pérez-Higueras, Andrea Ostuni, Marcello Chiaberge, Fernando Caballero, Luis Merino
Date:2024-04-21 14:41:40

Human-aware navigation is a complex task for mobile robots, requiring an autonomous navigation system capable of achieving efficient path planning together with socially compliant behaviors. Social planners usually add costs or constraints to the objective function, leading to intricate tuning processes or tailoring the solution to the specific social scenario. Machine Learning can enhance planners' versatility and help them learn complex social behaviors from data. This work proposes an adaptive social planner, using a Deep Reinforcement Learning agent to dynamically adjust the weighting parameters of the cost function used to evaluate trajectories. The resulting planner combines the robustness of the classic Dynamic Window Approach, integrated with a social cost based on the Social Force Model, and the flexibility of learning methods to boost the overall performance on social navigation tasks. Our extensive experimentation on different environments demonstrates the general advantage of the proposed method over static cost planners.
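
The idea of adjusting cost weights can be sketched as follows: each candidate trajectory has several cost terms, and the chosen weights decide which trajectory wins. This is only an illustration; `select_trajectory`, the two-column cost layout (goal cost, social cost), and the weight values are hypothetical, and in the paper the weights come from a learned agent rather than being set by hand.

```python
import numpy as np

def select_trajectory(cost_terms, weights):
    """Return the index of the trajectory with the lowest weighted cost.

    cost_terms: array of shape (n_trajectories, n_terms)
    weights:    array of shape (n_terms,)
    """
    total = np.asarray(cost_terms, dtype=float) @ np.asarray(weights, dtype=float)
    return int(np.argmin(total))

# Hypothetical candidates: columns are (goal cost, social cost).
costs = [[1.0, 5.0],   # fast but passes close to people
         [3.0, 1.0]]   # slower but keeps its distance
```

With weights `[1.0, 0.1]` the first trajectory is selected, and with `[0.1, 1.0]` the second, which shows how changing the weights alone shifts the planner's behavior.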

Random Network Distillation Based Deep Reinforcement Learning for AGV Path Planning

Authors:Huilin Yin, Shengkai Su, Yinjia Lin, Pengju Zhen, Karin Festl, Daniel Watzenig
Date:2024-04-19 02:52:56

With the flourishing development of intelligent warehousing systems, the technology of Automated Guided Vehicle (AGV) has experienced rapid growth. Within intelligent warehousing environments, AGV is required to safely and rapidly plan an optimal path in complex and dynamic environments. Most research has studied deep reinforcement learning to address this challenge. However, in the environments with sparse extrinsic rewards, these algorithms often converge slowly, learn inefficiently or fail to reach the target. Random Network Distillation (RND), as an exploration enhancement, can effectively improve the performance of proximal policy optimization, especially enhancing the additional intrinsic rewards of the AGV agent which is in sparse reward environments. Moreover, most of the current research continues to use 2D grid mazes as experimental environments. These environments have insufficient complexity and limited action sets. To solve this limitation, we present simulation environments of AGV path planning with continuous actions and positions for AGVs, so that it can be close to realistic physical scenarios. Based on our experiments and comprehensive analysis of the proposed method, the results demonstrate that our proposed method enables AGV to more rapidly complete path planning tasks with continuous actions in our environments. A video of part of our experiments can be found at https://youtu.be/lwrY9YesGmw.
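
A minimal sketch of the Random Network Distillation idea follows: a predictor is trained to match a fixed, randomly initialized target, and the prediction error serves as the intrinsic reward, so rarely visited observations earn a larger bonus. The linear "networks", the class name `RND`, and all hyperparameters are simplifying assumptions for illustration; the paper's implementation uses neural networks together with proximal policy optimization.

```python
import numpy as np

class RND:
    def __init__(self, obs_dim, feat_dim=8, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W_target = rng.normal(size=(obs_dim, feat_dim))  # fixed, never trained
        self.W_pred = rng.normal(size=(obs_dim, feat_dim))    # trained to match target
        self.lr = lr

    def intrinsic_reward(self, obs):
        """Mean squared error between predictor and target features."""
        err = obs @ self.W_pred - obs @ self.W_target
        return float(np.mean(err ** 2))

    def update(self, obs):
        """One gradient step on the prediction error for this observation."""
        err = obs @ self.W_pred - obs @ self.W_target
        grad = 2.0 / err.size * np.outer(obs, err)
        self.W_pred -= self.lr * grad
```

Repeatedly calling `update` on the same observation drives its intrinsic reward toward zero, while observations the predictor has not been trained on keep a higher bonus.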

ASID: Active Exploration for System Identification in Robotic Manipulation

Authors:Marius Memmel, Andrew Wagenmaker, Chuning Zhu, Patrick Yin, Dieter Fox, Abhishek Gupta
Date:2024-04-18 16:35:38

Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer. Project website at https://weirdlabuw.github.io/asid

Trajectory Planning for Autonomous Vehicle Using Iterative Reward Prediction in Reinforcement Learning

Authors:Hyunwoo Park
Date:2024-04-18 11:02:01

Traditional trajectory planning methods for autonomous vehicles have several limitations. For example, heuristic and explicit simple rules limit generalizability and hinder complex motions. These limitations can be addressed using reinforcement learning-based trajectory planning. However, reinforcement learning suffers from unstable learning, and existing reinforcement learning-based trajectory planning methods do not consider the uncertainties. Thus, this paper proposes a reinforcement learning-based trajectory planning method for autonomous vehicles. The proposed method involves an iterative reward prediction approach that iteratively predicts expectations of future states. These predicted states are then used to forecast rewards and integrated into the learning process to enhance stability. Additionally, a method is proposed that utilizes uncertainty propagation to make the reinforcement learning agent aware of uncertainties. The proposed method was evaluated using the CARLA simulator. Compared to the baseline methods, the proposed method reduced the collision rate by 60.17%, and increased the average reward by 30.82 times. A video of the proposed method is available at https://www.youtube.com/watch?v=PfDbaeLfcN4.

Trajectory Planning Using Reinforcement Learning for Interactive Overtaking Maneuvers in Autonomous Racing Scenarios

Authors:Levent Ögretmen, Mo Chen, Phillip Pitschi, Boris Lohmann
Date:2024-04-16 15:35:34

Conventional trajectory planning approaches for autonomous racing are based on the sequential execution of prediction of the opposing vehicles and subsequent trajectory planning for the ego vehicle. If the opposing vehicles do not react to the ego vehicle, they can be predicted accurately. However, if there is interaction between the vehicles, the prediction loses its validity. For high interaction, instead of a planning approach that reacts exclusively to the fixed prediction, a trajectory planning approach is required that incorporates the interaction with the opposing vehicles. This paper demonstrates the limitations of a widely used conventional sampling-based approach within a highly interactive blocking scenario. We show that high success rates are achieved for less aggressive blocking behavior but that the collision rate increases with more significant interaction. We further propose a novel Reinforcement Learning (RL)-based trajectory planning approach for racing that explicitly exploits the interaction with the opposing vehicle without requiring a prediction. In contrast to the conventional approach, the RL-based approach achieves high success rates even for aggressive blocking behavior. Furthermore, we propose a novel safety layer (SL) that intervenes when the trajectory generated by the RL-based approach is infeasible. In that event, the SL generates a sub-optimal but feasible trajectory, so that the scenario is not terminated for lack of a valid solution.

Autonomous Path Planning for Intercostal Robotic Ultrasound Imaging Using Reinforcement Learning

Authors:Yuan Bi, Cheng Qian, Zhicheng Zhang, Nassir Navab, Zhongliang Jiang
Date:2024-04-15 16:52:53

Ultrasound (US) has been widely used in daily clinical practice for screening internal organs and guiding interventions. However, due to the acoustic shadow cast by the subcutaneous rib cage, the US examination for thoracic application is still challenging. To fully cover and reconstruct the region of interest in US for diagnosis, an intercostal scanning path is necessary. To tackle this challenge, we present a reinforcement learning (RL) approach for planning scanning paths between ribs to monitor changes in lesions on internal organs, such as the liver and heart, which are covered by rib cages. Structured anatomical information of the human skeleton is crucial for planning these intercostal paths. To obtain such anatomical insight, an RL agent is trained in a virtual environment constructed using computed tomography (CT) templates with randomly initialized tumors of various shapes and locations. In addition, task-specific state representation and reward functions are introduced to ensure the convergence of the training process while minimizing the effects of acoustic attenuation and shadows during scanning. To validate the effectiveness of the proposed approach, experiments have been carried out on unseen CTs with randomly defined single or multiple scanning targets. The results demonstrate the efficiency of the proposed RL framework in planning non-shadowed US scanning trajectories in areas with limited acoustic access.

LLMSat: A Large Language Model-Based Goal-Oriented Agent for Autonomous Space Exploration

Authors:David Maranto
Date:2024-04-13 03:33:17

As spacecraft journey further from Earth with more complex missions, systems of greater autonomy and onboard intelligence are called for. Reducing reliance on human-based mission control becomes increasingly critical if we are to increase our rate of solar-system-wide exploration. Recent work has explored AI-based goal-oriented systems to increase the level of autonomy in mission execution. These systems make use of symbolic reasoning managers to make inferences from the state of a spacecraft and a handcrafted knowledge base, enabling autonomous generation of tasks and re-planning. Such systems have proven to be successful in controlled cases, but they are difficult to implement as they require human-crafted ontological models to allow the spacecraft to understand the world. Reinforcement learning has been applied to train robotic agents to pursue a goal. A new architecture for autonomy is called for. This work explores the application of Large Language Models (LLMs) as the high-level control system of a spacecraft. Using a systems engineering approach, this work presents the design and development of an agentic spacecraft controller by leveraging an LLM as a reasoning engine, to evaluate the utility of such an architecture in achieving higher levels of spacecraft autonomy. A series of deep space mission scenarios simulated within the popular game engine Kerbal Space Program (KSP) are used as case studies to evaluate the implementation against the requirements. It is shown the reasoning and planning abilities of present-day LLMs do not scale well as the complexity of a mission increases, but this can be alleviated with adequate prompting frameworks and strategic selection of the agent's level of authority over the host spacecraft. This research evaluates the potential of LLMs in augmenting autonomous decision-making systems for future robotic space applications.

WROOM: An Autonomous Driving Approach for Off-Road Navigation

Authors:Dvij Kalaria, Shreya Sharma, Sarthak Bhagat, Haoru Xue, John M. Dolan
Date:2024-04-12 23:55:59

Off-road navigation is a challenging problem both at the planning level to get a smooth trajectory and at the control level to avoid flipping over, hitting obstacles, or getting stuck at a rough patch. There have been several recent works using classical approaches involving depth map prediction followed by smooth trajectory planning and using a controller to track it. We design an end-to-end reinforcement learning (RL) system for an autonomous vehicle in off-road environments using a custom-designed simulator in the Unity game engine. We warm-start the agent by imitating a rule-based controller and utilize Proximal Policy Optimization (PPO) to improve the policy based on a reward that incorporates Control Barrier Functions (CBF), facilitating the agent's ability to generalize effectively to real-world scenarios. The training involves agents concurrently undergoing domain-randomized trials in various environments. We also propose a novel simulation environment to replicate off-road driving scenarios and deploy our proposed approach on a real buggy RC car. Videos and additional results: https://sites.google.com/view/wroom-utd/home

Prescribing Optimal Health-Aware Operation for Urban Air Mobility with Deep Reinforcement Learning

Authors:Mina Montazeri, Chetan Kulkarni, Olga Fink
Date:2024-04-12 14:26:48

Urban Air Mobility (UAM) aims to expand existing transportation networks in metropolitan areas by offering short flights either to transport passengers or cargo. Electric vertical takeoff and landing aircraft powered by lithium-ion battery packs are considered promising for such applications. Efficient mission planning is crucial, maximizing the number of flights per battery charge while ensuring completion even under unforeseen events. As batteries degrade, precise mission planning becomes challenging due to uncertainties in the end-of-discharge prediction. This often leads to adding safety margins, reducing the number or duration of potential flights on one battery charge. While predicting the end of discharge can support decision-making, it remains insufficient in case of unforeseen events, such as adverse weather conditions. This necessitates health-aware real-time control to address any unexpected events and extend the time until the end of charge while taking the current degradation state into account. This paper addresses the joint problem of mission planning and health-aware real-time control of operational parameters to prescriptively control the duration of one discharge cycle of the battery pack. We propose an algorithm that proactively prescribes operational parameters to extend the discharge cycle based on the battery's current health status while optimizing the mission. The proposed deep reinforcement learning algorithm facilitates operational parameter optimization and path planning while accounting for the degradation state, even in the presence of uncertainties. Evaluation of simulated flights of a NASA conceptual multirotor aircraft model, collected from Hardware-in-the-loop experiments, demonstrates the algorithm's near-optimal performance across various operational scenarios, allowing adaptation to changed environmental conditions.

A Review of Reward Functions for Reinforcement Learning in the context of Autonomous Driving

Authors:Ahmed Abouelazm, Jonas Michel, J. Marius Zoellner
Date:2024-04-12 08:32:54

Reinforcement learning has emerged as an important approach for autonomous driving. A reward function is used in reinforcement learning to establish the learned skill objectives and guide the agent toward the optimal policy. Since autonomous driving is a complex domain with partly conflicting objectives with varying degrees of priority, developing a suitable reward function represents a fundamental challenge. This paper aims to highlight the gap in such function design by assessing different proposed formulations in the literature and dividing individual objectives into Safety, Comfort, Progress, and Traffic Rules compliance categories. Additionally, the limitations of the reviewed reward functions are discussed, such as objectives aggregation and indifference to driving context. Furthermore, the reward categories are frequently inadequately formulated and lack standardization. This paper concludes by proposing future research that potentially addresses the observed shortcomings in rewards, including a reward validation framework and structured rewards that are context-aware and able to resolve conflicts.

HGFF: A Deep Reinforcement Learning Framework for Lifetime Maximization in Wireless Sensor Networks

Authors:Xiaoxu Han, Xin Mu, Jinghui Zhong
Date:2024-04-11 13:09:11

Planning the movement of the sink to maximize the lifetime in wireless sensor networks is an essential problem of great research challenge and practical value. Many existing mobile sink techniques based on mathematical programming or heuristics have demonstrated the feasibility of the task. Nevertheless, the huge computation consumption or the over-reliance on human knowledge can result in relatively low performance. In order to balance the need for high-quality solutions with the goal of minimizing inference time, we propose a new framework combining heterogeneous graph neural network with deep reinforcement learning to automatically construct the movement path of the sink. Modeling the wireless sensor networks as heterogeneous graphs, we utilize the graph neural network to learn representations of sites and sensors by aggregating features of neighbor nodes and extracting hierarchical graph features. Meanwhile, the multi-head attention mechanism is leveraged to allow the sites to attend to information from sensor nodes, which highly improves the expressive capacity of the learning model. Based on the node representations, a greedy policy is learned to append the next best site in the solution incrementally. We design ten types of static and dynamic maps to simulate different wireless sensor networks in the real world, and extensive experiments are conducted to evaluate and analyze our approach. The empirical results show that our approach consistently outperforms the existing methods on all types of maps.

Generative Probabilistic Planning for Optimizing Supply Chain Networks

Authors:Hyung-il Ahn, Santiago Olivar, Hershel Mehta, Young Chol Song
Date:2024-04-11 07:06:58

Supply chain networks in enterprises are typically composed of complex topological graphs involving various types of nodes and edges, accommodating numerous products with considerable demand and supply variability. However, as supply chain networks expand in size and complexity, traditional supply chain planning methods (e.g., those found in heuristic rule-based and operations research-based systems) tend to become locally optimal or lack computational scalability, resulting in substantial imbalances between supply and demand across nodes in the network. This paper introduces a novel Generative AI technique, which we call Generative Probabilistic Planning (GPP). GPP generates dynamic supply action plans that are globally optimized across all network nodes over the time horizon for changing objectives like maximizing profits or service levels, factoring in time-varying probabilistic demand, lead time, and production conditions. GPP leverages attention-based graph neural networks (GNN), offline deep reinforcement learning (Offline RL), and policy simulations to train generative policy models and create optimal plans through probabilistic simulations, effectively accounting for various uncertainties. Our experiments using historical data from a global consumer goods company with complex supply chain networks demonstrate that GPP accomplishes objective-adaptable, probabilistically resilient, and dynamic planning for supply chain networks, leading to significant improvements in performance and profitability for enterprises. Our work plays a pivotal role in shaping the trajectory of AI adoption within the supply chain domain.

Deep Reinforcement Learning for Mobile Robot Path Planning

Authors:Hao Liu, Yi Shen, Shuangjiang Yu, Zijun Gao, Tong Wu
Date:2024-04-10 12:38:38

Path planning is an important problem with applications in many areas, such as video games and robotics. This paper proposes a novel method to address the problem of Deep Reinforcement Learning (DRL) based path planning for a mobile robot. We design DRL-based algorithms, including reward functions and parameter optimization, to avoid time-consuming work in a 2D environment. We also design a two-way search hybrid A* algorithm to improve the quality of local path planning. We transferred the designed algorithm to a simple embedded environment to test the computational load of the algorithm when running on a mobile robot. Experiments show that when deployed on a robot platform, the DRL-based algorithm in this article can achieve better planning results and consume fewer computing resources.

Learning Heuristics for Transit Network Design and Improvement with Deep Reinforcement Learning

Authors:Andrew Holliday, Ahmed El-Geneidy, Gregory Dudek
Date:2024-04-08 22:40:57

Transit agencies world-wide face tightening budgets. To maintain quality of service while cutting costs, efficient transit network design is essential. But planning a network of public transit routes is a challenging optimization problem. The most successful approaches to date use metaheuristic algorithms to search through the space of possible transit networks by applying low-level heuristics that randomly alter routes in a network. The design of these low-level heuristics has a major impact on the quality of the result. In this paper we use deep reinforcement learning with graph neural nets to learn low-level heuristics for an evolutionary algorithm, instead of designing them manually. These learned heuristics improve the algorithm's results on benchmark synthetic cities with 70 nodes or more, and obtain state-of-the-art results when optimizing operating costs. They also improve upon a simulation of the real transit network in the city of Laval, Canada, by as much as 54% and 18% on two key metrics, and offer cost savings of up to 12% over the city's existing transit network.

Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models

Authors:Yutao Ouyang, Jinhan Li, Yunfei Li, Zhongyu Li, Chao Yu, Koushil Sreenath, Yi Wu
Date:2024-04-08 08:29:00

We present a large language model (LLM) based system to empower quadrupedal robots with problem-solving abilities for long-horizon tasks beyond short-term motions. Long-horizon tasks for quadrupeds are challenging since they require both a high-level understanding of the semantics of the problem for task planning and a broad range of locomotion and manipulation skills to interact with the environment. Our system builds a high-level reasoning layer with large language models, which generates hybrid discrete-continuous plans as robot code from task descriptions. It comprises multiple LLM agents: a semantic planner for sketching a plan, a parameter calculator for predicting arguments in the plan, and a code generator to convert the plan into executable robot code. At the low level, we adopt reinforcement learning to train a set of motion planning and control skills to unleash the flexibility of quadrupeds for rich environment interactions. Our system is tested on long-horizon tasks that are infeasible to complete with one single skill. Simulation and real-world experiments show that it successfully figures out multi-step strategies and demonstrates non-trivial behaviors, including building tools or notifying a human for help. Demos are available on our project page: https://sites.google.com/view/long-horizon-robot.

MeSA-DRL: Memory-Enhanced Deep Reinforcement Learning for Advanced Socially Aware Robot Navigation in Crowded Environments

Authors:Mannan Saeed Muhammad, Estrella Montero
Date:2024-04-08 05:10:35

Autonomous navigation capabilities play a critical role in service robots operating in environments where human interactions are pivotal, due to the dynamic and unpredictable nature of these environments. However, the variability in human behavior presents a substantial challenge for robots in predicting and anticipating movements, particularly in crowded scenarios. To address this issue, a memory-enabled deep reinforcement learning framework is proposed for autonomous robot navigation in diverse pedestrian scenarios. The proposed framework leverages long-term memory to retain essential information about the surroundings and model sequential dependencies effectively. The importance of human-robot interactions is also encoded to assign higher attention to these interactions. A global planning mechanism is incorporated into the memory-enabled architecture. Additionally, a multi-term reward system is designed to prioritize and encourage long-sighted robot behaviors by incorporating dynamic warning zones. Simultaneously, it promotes smooth trajectories and minimizes the time taken to reach the robot's desired goal. Extensive simulation experiments show that the suggested approach outperforms representative state-of-the-art methods, showcasing its navigation efficiency and safety in real-world scenarios.

On the Uniqueness of Solution for the Bellman Equation of LTL Objectives

Authors:Zetong Xuan, Alper Kamil Bozkurt, Miroslav Pajic, Yu Wang
Date:2024-04-07 21:06:52

Surrogate rewards for linear temporal logic (LTL) objectives are commonly utilized in planning problems for LTL objectives. In a widely adopted surrogate reward approach, two discount factors are used to ensure that the expected return approximates the satisfaction probability of the LTL objective. The expected return can then be estimated by methods using Bellman updates, such as reinforcement learning. However, the uniqueness of the solution to the Bellman equation with two discount factors has not been explicitly discussed. We demonstrate with an example that when one of the discount factors is set to one, as allowed in many previous works, the Bellman equation may have multiple solutions, leading to inaccurate evaluation of the expected return. We then propose a condition for the Bellman equation to have the expected return as the unique solution, requiring the solutions for states inside a rejecting bottom strongly connected component (BSCC) to be 0. We prove this condition is sufficient by showing that the solutions for the states with discounting can be separated from those for the states without discounting under this condition.
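
The non-uniqueness can be seen in a toy case that is not taken from the paper: a single state that loops back to itself with zero reward. Its Bellman equation is V = 0 + discount * V. With a discount of one, every value of V satisfies it; with a discount below one, only V = 0 does. The helper `bellman_residual` and the candidate values are illustrative assumptions.

```python
def bellman_residual(v, reward, discount):
    """Residual of V = reward + discount * V for one self-looping state."""
    return v - (reward + discount * v)

candidates = [0.0, 0.3, 1.0]

# Discount 1, reward 0: every candidate solves the equation.
solutions_undiscounted = [
    v for v in candidates if abs(bellman_residual(v, 0.0, 1.0)) < 1e-12
]

# Discount 0.99, reward 0: only V = 0 solves it.
solutions_discounted = [
    v for v in candidates if abs(bellman_residual(v, 0.0, 0.99)) < 1e-12
]
```

Fixing the value of such undiscounted, zero-reward states to 0, as the proposed condition does for rejecting BSCCs, removes the spurious solutions in this toy case.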

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

Authors:Yiqun Duan, Qiang Zhang, Renjing Xu
Date:2024-04-07 08:31:12

The utilization of Large Language Models (LLMs) within the realm of reinforcement learning, particularly as planners, has garnered a significant degree of attention in recent scholarly literature. However, a substantial proportion of existing research predominantly focuses on planning models for robotics that transmute the outputs derived from perception models into linguistic forms, thus adopting a 'pure-language' strategy. In this research, we propose a hybrid End-to-End learning framework for autonomous driving by combining basic driving imitation learning with LLMs based on multi-modality prompt tokens. Instead of simply converting perception results from a separately trained model into pure language input, our novelty lies in two aspects. 1) The end-to-end integration of visual and LiDAR sensory input into learnable multi-modality tokens, thereby intrinsically alleviating the description bias introduced by separately pre-trained perception models. 2) Instead of directly letting LLMs drive, this paper explores a hybrid setting of letting LLMs help the driving model correct mistakes and handle complicated scenarios. The results of our experiments suggest that the proposed methodology can attain driving scores of 49.21%, coupled with an impressive route completion rate of 91.34% in the offline evaluation conducted via CARLA. These performance metrics are comparable to the most advanced driving models.

Efficient Reinforcement Learning of Task Planners for Robotic Palletization through Iterative Action Masking Learning

Authors:Zheng Wu, Yichuan Li, Wei Zhan, Changliu Liu, Yun-Hui Liu, Masayoshi Tomizuka
Date:2024-04-07 01:13:07

The development of robotic systems for palletization in logistics scenarios is of paramount importance, addressing critical efficiency and precision demands in supply chain management. This paper investigates the application of Reinforcement Learning (RL) in enhancing task planning for such robotic systems. Confronted with the substantial challenge of a vast action space, which is a significant impediment to efficiently applying off-the-shelf RL methods, our study introduces a novel method of utilizing supervised learning to iteratively prune and manage the action space effectively. By reducing the complexity of the action space, our approach not only accelerates the learning phase but also ensures the effectiveness and reliability of the task planning in robotic palletization. The experimental results underscore the efficacy of this method, highlighting its potential in improving the performance of RL applications in complex and high-dimensional environments like logistics palletization.
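
Action masking in general can be sketched as a softmax restricted to the actions a mask marks as valid, so pruned actions receive zero probability. In this sketch the mask is simply given as input; the paper learns it iteratively with supervised learning, which is not reproduced here. The function name `masked_policy` is a hypothetical helper.

```python
import numpy as np

def masked_policy(logits, mask):
    """Softmax over logits, restricted to actions where mask is True."""
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be valid")
    # Invalid actions get -inf so that exp() maps them to exactly 0.
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max()  # subtract the max for numerical stability
    probs = np.exp(masked)
    return probs / probs.sum()
```

For example, `masked_policy([1.0, 2.0, 3.0], [True, False, True])` assigns zero probability to the second action and renormalizes over the remaining two.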

A proximal policy optimization based intelligent home solar management

Authors:Kode Creer, Imitiaz Parvez
Date:2024-04-05 04:34:43

In the smart grid, prosumers can sell unused electricity back to the power grid, assuming they own renewable energy sources and storage units. Maximizing their profits under a dynamic electricity market is a problem that requires intelligent planning. To address this, we propose a framework based on Proximal Policy Optimization (PPO) using recurrent rewards. By using the information about the rewards modeled effectively with PPO to maximize our objective, we were able to get over 30% improvement over the other naive algorithms in accumulating total profits. This shows promise in getting reinforcement learning algorithms to perform tasks required to plan their actions in complex domains like financial markets. We also introduce a novel method for embedding longs based on soliton waves that outperformed normal embedding in our use case with random floating point data augmentation.

Conversational Disease Diagnosis via External Planner-Controlled Large Language Models

Authors:Zhoujian Sun, Cheng Luo, Ziyi Liu, Zhengxing Huang
Date:2024-04-04 06:16:35

The development of large language models (LLMs) has brought unprecedented possibilities for artificial intelligence (AI) based medical diagnosis. However, the application perspective of LLMs in real diagnostic scenarios is still unclear because they are not adept at collecting patient data proactively. This study presents an LLM-based diagnostic system that enhances planning capabilities by emulating doctors. Our system involves two external planners to handle planning tasks. The first planner employs a reinforcement learning approach to formulate disease screening questions and conduct initial diagnoses. The second planner uses LLMs to parse medical guidelines and conduct differential diagnoses. By utilizing real patient electronic medical record data, we constructed simulated dialogues between virtual patients and doctors and evaluated the diagnostic abilities of our system. We demonstrated that our system obtained impressive performance in both disease screening and differential diagnosis tasks. This research represents a step towards more seamlessly integrating AI into clinical settings, potentially enhancing the accuracy and accessibility of medical diagnostics.

Model-based Reinforcement Learning for Parameterized Action Spaces

Authors:Renhao Zhang, Haotian Fu, Yilin Miao, George Konidaris
Date:2024-04-03 19:48:13

We propose a novel model-based reinforcement learning algorithm -- Dynamics Learning and predictive control with Parameterized Actions (DLPA) -- for Parameterized Action Markov Decision Processes (PAMDPs). The agent learns a parameterized-action-conditioned dynamics model and plans with a modified Model Predictive Path Integral control. We theoretically quantify the difference between the generated trajectory and the optimal trajectory during planning in terms of the value they achieve, through the lens of Lipschitz continuity. Our empirical results on several standard benchmarks show that our algorithm achieves better sample efficiency and asymptotic performance than state-of-the-art PAMDP methods.

Deep Reinforcement Learning for Traveling Purchaser Problems

Authors:Haofeng Yuan, Rongping Zhu, Wanlu Yang, Shiji Song, Keyou You, Wei Fan, C. L. Philip Chen
Date:2024-04-03 05:32:10

The traveling purchaser problem (TPP) is an important combinatorial optimization problem with broad applications. Due to the coupling between routing and purchasing, existing works on TPPs commonly address route construction and purchase planning simultaneously, which, however, leads to exact methods with high computational cost and heuristics with sophisticated design but limited performance. In sharp contrast, we propose a novel approach based on deep reinforcement learning (DRL), which addresses route construction and purchase planning separately, while evaluating and optimizing the solution from a global perspective. The key components of our approach include a bipartite graph representation for TPPs to capture the market-product relations, and a policy network that extracts information from the bipartite graph and uses it to sequentially construct the route. One significant benefit of our framework is that we can efficiently construct the route using the policy network, and once the route is determined, the associated purchasing plan can be easily derived through linear programming, while, leveraging DRL, we can train the policy network to optimize the global solution objective. Furthermore, by introducing a meta-learning strategy, the policy network can be trained stably on large-sized TPP instances, and generalize well across instances of varying sizes and distributions, even to much larger instances that are never seen during training. Experiments on various synthetic TPP instances and the TPPLIB benchmark demonstrate that our DRL-based approach can significantly outperform well-established TPP heuristics, reducing the optimality gap by 40%-90%, and also showing an advantage in runtime, especially on large-sized instances.
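The decomposition described in this abstract can be illustrated with a small sketch. It assumes the route (the set of visited markets) has already been fixed, and that the only purchasing constraints are per-market supply limits; under those assumptions the purchasing linear program separates by product and is solved exactly by buying from the cheapest visited markets first. The function name, data layout, and numbers are hypothetical, and the paper's actual LP formulation may include further constraints.

```python
def purchase_plan(route, demand, offers):
    """Cheapest purchasing plan for a fixed route (illustrative sketch).

    route  : list of visited market ids
    demand : {product: required quantity}
    offers : {market: {product: (unit_price, available_supply)}}
    Returns ({(market, product): quantity}, total_purchase_cost).
    """
    plan = {}
    total = 0.0
    for product, need in demand.items():
        # Visited markets that sell this product, cheapest first.
        candidates = sorted(
            (offers[m][product][0], m, offers[m][product][1])
            for m in route
            if product in offers.get(m, {})
        )
        remaining = need
        for price, market, supply in candidates:
            if remaining <= 0:
                break
            qty = min(supply, remaining)
            plan[(market, product)] = qty
            total += price * qty
            remaining -= qty
        if remaining > 0:
            raise ValueError(f"route cannot satisfy demand for {product!r}")
    return plan, total


# Hypothetical instance: two markets, one product.
offers = {"A": {"x": (2.0, 3)}, "B": {"x": (1.0, 2)}}
plan, cost = purchase_plan(["A", "B"], {"x": 4}, offers)
```

In this reading, a learned policy only has to decide the route; the purchasing cost returned here, added to the travel cost, would give the global objective used as a training signal.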

Deep Reinforcement Learning in Autonomous Car Path Planning and Control: A Survey

Authors:Yiyang Chen, Chao Ji, Yunrui Cai, Tong Yan, Bo Su
Date:2024-03-30 12:37:58

Combining data-driven applications with control systems plays a key role in recent Autonomous Car research. This thesis offers a structured review of the latest literature on Deep Reinforcement Learning (DRL) within the realm of autonomous vehicle Path Planning and Control. It collects a series of DRL methodologies and algorithms and their applications in the field, focusing notably on their roles in trajectory planning and dynamic control. In this review, we delve into the application outcomes of DRL technologies in this domain. By summarizing this literature, we highlight potential challenges, aiming to offer insights that might aid researchers engaged in related fields.

Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods

Authors:Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, Yun Li
Date:2024-03-30 08:28:08

With extensive pre-trained knowledge and high-level general capabilities, large language models (LLMs) emerge as a promising avenue to augment reinforcement learning (RL) in aspects such as multi-task learning, sample efficiency, and high-level task planning. In this survey, we provide a comprehensive review of the existing literature in LLM-enhanced RL and summarize its characteristics compared to conventional RL methods, aiming to clarify the research scope and directions for future studies. Utilizing the classical agent-environment interaction paradigm, we propose a structured taxonomy to systematically categorize LLMs' functionalities in RL, including four roles: information processor, reward designer, decision-maker, and generator. For each role, we summarize the methodologies, analyze the specific RL challenges that are mitigated, and provide insights into future directions. Lastly, a comparative analysis of each role, potential applications, prospective opportunities, and challenges of the LLM-enhanced RL are discussed. By proposing this taxonomy, we aim to provide a framework for researchers to effectively leverage LLMs in the RL field, potentially accelerating RL applications in complex applications such as robotics, autonomous driving, and energy systems.

Path planning of magnetic microswimmers in high-fidelity simulations of capillaries with deep reinforcement learning

Authors:Lucas Amoudruz, Sergey Litvinov, Petros Koumoutsakos
Date:2024-03-29 17:01:09

Biomedical applications such as targeted drug delivery, microsurgery or sensing rely on reaching precise areas within the body in a minimally invasive way. Artificial bacterial flagella (ABFs) have emerged as potential tools for this task by navigating through the circulatory system. While the control and swimming characteristics of ABFs are understood in simple scenarios, their behavior within the bloodstream remains unclear. We conduct simulations of ABFs evolving in the complex capillary networks found in the human retina. The ABF is robustly guided to a prescribed target by a reinforcement learning agent previously trained on a reduced order model.

EnCoMP: Enhanced Covert Maneuver Planning with Adaptive Threat-Aware Visibility Estimation using Offline Reinforcement Learning

Authors:Jumman Hossain, Abu-Zaher Faridee, Nirmalya Roy, Jade Freeman, Timothy Gregory, Theron T. Trout
Date:2024-03-29 07:03:10

Autonomous robots operating in complex environments face the critical challenge of identifying and utilizing environmental cover for covert navigation to minimize exposure to potential threats. We propose EnCoMP, an enhanced navigation framework that integrates offline reinforcement learning and our novel Adaptive Threat-Aware Visibility Estimation (ATAVE) algorithm to enable robots to navigate covertly and efficiently in diverse outdoor settings. ATAVE is a dynamic probabilistic threat modeling technique that we designed to continuously assess and mitigate potential threats in real-time, enhancing the robot's ability to navigate covertly by adapting to evolving environmental and threat conditions. Moreover, our approach generates high-fidelity multi-map representations, including cover maps, potential threat maps, height maps, and goal maps from LiDAR point clouds, providing a comprehensive understanding of the environment. These multi-maps offer detailed environmental insights, helping in strategic navigation decisions. The goal map encodes the relative distance and direction to the target location, guiding the robot's navigation. We train a Conservative Q-Learning (CQL) model on a large-scale dataset collected from real-world environments, learning a robust policy that maximizes cover utilization, minimizes threat exposure, and maintains efficient navigation. We demonstrate our method's capabilities on a physical Jackal robot, showing extensive experiments across diverse terrains. These experiments demonstrate EnCoMP's superior performance compared to state-of-the-art methods, achieving a 95% success rate, 85% cover utilization, and reducing threat exposure to 10.5%, while significantly outperforming baselines in navigation efficiency and robustness.

Bridging the Gap: Regularized Reinforcement Learning for Improved Classical Motion Planning with Safety Modules

Authors:Elias Goldsztejn, Ronen I. Brafman
Date:2024-03-27 12:55:16

Classical navigation planners can provide safe navigation, albeit often suboptimally and with hindered human norm compliance. ML-based, contemporary autonomous navigation algorithms can imitate more natural and human-compliant navigation, but usually require large and realistic datasets and do not always provide safety guarantees. We present an approach that leverages a classical algorithm to guide reinforcement learning. This greatly improves the results and convergence rate of the underlying RL algorithm and requires no human-expert demonstrations to jump-start the process. Additionally, we incorporate a practical fallback system that can switch back to a classical planner to ensure safety. The outcome is a sample-efficient ML approach for mobile navigation that builds on classical algorithms, improves them to ensure human compliance, and guarantees safety.

Multi-AGV Path Planning Method via Reinforcement Learning and Particle Filters

Authors:Shao Shuo
Date:2024-03-27 03:53:30

Thanks to its robust learning and search stabilities, the reinforcement learning (RL) algorithm has garnered increasingly significant attention and been extensively applied in Automated Guided Vehicle (AGV) path planning. However, RL-based planning algorithms have been discovered to suffer from the substantial variance of neural networks caused by environmental instability and significant fluctuations in system structure. These challenges manifest in slow convergence speed and low learning efficiency. To tackle this issue, this paper presents a novel multi-AGV path planning method named Particle Filters - Double Deep Q-Network (PF-DDQN) via leveraging Particle Filters (PF) and RL algorithm. Firstly, the proposed method leverages the imprecise weight values of the network as state values to formulate the state space equation. Subsequently, the DDQN model is optimized to acquire the optimal true weight values through the iterative fusion process of neural networks and PF in order to enhance the optimization efficiency of the proposed method. Lastly, the performance of the proposed method is validated by different numerical simulations. The simulation results demonstrate that the proposed method dominates the traditional DDQN algorithm in terms of path planning superiority and training time indicator by 92.62% and 76.88%, respectively. Therefore, the proposed method could be considered as a vital alternative in the field of multi-AGV path planning.

Imitating Cost-Constrained Behaviors in Reinforcement Learning

Authors:Qian Shao, Pradeep Varakantham, Shih-Fen Cheng
Date:2024-03-26 07:41:54

Complex planning and scheduling problems have long been solved using various optimization or heuristic approaches. In recent years, imitation learning that aims to learn from expert demonstrations has been proposed as a viable alternative to solving these problems. Generally speaking, imitation learning is designed to learn either the reward (or preference) model or directly the behavioral policy by observing the behavior of an expert. Existing work in imitation learning and inverse reinforcement learning has focused on imitation primarily in unconstrained settings (e.g., no limit on fuel consumed by the vehicle). However, in many real-world domains, the behavior of an expert is governed not only by reward (or preference) but also by constraints. For instance, decisions on self-driving delivery vehicles are dependent not only on the route preferences/rewards (depending on past demand data) but also on the fuel in the vehicle and the time available. In such problems, imitation learning is challenging as decisions are not only dictated by the reward model but are also dependent on a cost-constrained model. In this paper, we provide multiple methods that match expert distributions in the presence of trajectory cost constraints through (a) Lagrangian-based method; (b) Meta-gradients to find a good trade-off between expected return and minimizing constraint violation; and (c) Cost-violation-based alternating gradient. We empirically show that leading imitation learning approaches imitate cost-constrained behaviors poorly and our meta-gradient-based approach achieves the best performance.
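As a rough illustration of option (a), a Lagrangian-based treatment of a trajectory cost constraint usually alternates between improving the policy on a penalized objective and adjusting a non-negative multiplier by dual ascent. The sketch below shows only the multiplier update and the penalized objective; the function names, the learning rate, and the numbers are assumptions for illustration and are not taken from the paper.

```python
def lagrangian_step(lmbda, expected_cost, cost_limit, lr=0.1):
    """Dual ascent on the multiplier: grow it when the constraint is
    violated, shrink it otherwise, and keep it non-negative."""
    return max(0.0, lmbda + lr * (expected_cost - cost_limit))


def penalized_return(expected_return, expected_cost, cost_limit, lmbda):
    """Objective the policy would maximize for a fixed multiplier."""
    return expected_return - lmbda * (expected_cost - cost_limit)


# Hypothetical rollout statistics: cost 12 against a limit of 10.
lmbda = lagrangian_step(0.0, expected_cost=12.0, cost_limit=10.0)
objective = penalized_return(50.0, 12.0, 10.0, lmbda)
```

The multiplier rises while the imitating policy overspends its cost budget, which in turn makes costly behavior less attractive in the next policy update.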

Speeding Up Path Planning via Reinforcement Learning in MCTS for Automated Parking

Authors:Xinlong Zheng, Xiaozhou Zhang, Donghao Xu
Date:2024-03-25 22:21:23

In this paper, we address a method that integrates reinforcement learning into the Monte Carlo tree search to boost online path planning under fully observable environments for automated parking tasks. Sampling-based planning methods under high-dimensional space can be computationally expensive and time-consuming. State evaluation methods are useful by leveraging the prior knowledge into the search steps, making the process faster in a real-time system. Given the fact that automated parking tasks are often executed under complex environments, a solid but lightweight heuristic guidance is challenging to compose in a traditional analytical way. To overcome this limitation, we propose a reinforcement learning pipeline with a Monte Carlo tree search under the path planning framework. By iteratively learning the value of a state and the best action among samples from its previous cycle's outcomes, we are able to model a value estimator and a policy generator for given states. By doing that, we build up a balancing mechanism between exploration and exploitation, speeding up the path planning process while maintaining its quality without using human expert driver data.

Trajectory Planning of Robotic Manipulator in Dynamic Environment Exploiting DRL

Authors:Osama Ahmad, Zawar Hussain, Hammad Naeem
Date:2024-03-25 11:40:32

This study is about the implementation of a reinforcement learning algorithm in the trajectory planning of manipulators. We have a 7-DOF robotic arm to pick and place the randomly placed block at a random target point in an unknown environment. The obstacle is randomly moving, which creates a hurdle in picking the object. The objective of the robot is to avoid the obstacle and pick the block with constraints to a fixed timestamp. In this work, we have applied a deep deterministic policy gradient (DDPG) algorithm and compared the model's efficiency with dense and sparse rewards.
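The distinction between dense and sparse rewards mentioned above can be made concrete with a minimal sketch. It assumes a reach-style task where the reward depends only on the distance between the gripper and the target; the function names, the tolerance, and the exact reward shapes are illustrative assumptions rather than the reward functions used in the study.

```python
import math


def sparse_reward(gripper_pos, target_pos, tol=0.05):
    """1.0 only when the gripper is within `tol` of the target, else 0.0."""
    return 1.0 if math.dist(gripper_pos, target_pos) < tol else 0.0


def dense_reward(gripper_pos, target_pos):
    """Negative Euclidean distance: informative at every step."""
    return -math.dist(gripper_pos, target_pos)


# Hypothetical positions in metres.
far = (0.0, 0.0, 0.5)
near = (0.0, 0.0, 0.01)
target = (0.0, 0.0, 0.0)
rewards = [(sparse_reward(p, target), dense_reward(p, target)) for p in (far, near)]
```

A sparse signal gives the agent no gradient toward the goal until it first succeeds, whereas a dense signal guides every step but can bias the learned behavior toward the shaping term.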

Real-World Evaluation of two Cooperative Intersection Management Approaches

Authors:Marvin Klimke, Max Bastian Mertens, Benjamin Völz, Michael Buchholz
Date:2024-03-25 07:04:24

Cooperative maneuver planning promises to significantly improve traffic efficiency at unsignalized intersections by leveraging connected automated vehicles. Previous works on this topic have been mostly developed for completely automated traffic in a simple simulated environment. In contrast, our previously introduced planning approaches are specifically designed to handle real-world mixed traffic. The two methods are based on multi-scenario prediction and graph-based reinforcement learning, respectively. This is the first study to perform evaluations in a novel mixed traffic simulation framework as well as real-world drives with prototype connected automated vehicles in public traffic. The simulation features the same connected automated driving software stack as deployed on one of the automated vehicles. Our quantitative evaluations show that cooperative maneuver planning achieves a substantial reduction in crossing times and the number of stops. In a realistic environment with few automated vehicles, there are noticeable efficiency gains with only slightly increasing criticality metrics.

SRLM: Human-in-Loop Interactive Social Robot Navigation with Large Language Model and Deep Reinforcement Learning

Authors:Weizheng Wang, Ike Obi, Byung-Cheol Min
Date:2024-03-22 23:12:28

An interactive social robotic assistant must provide services in complex and crowded spaces while adapting its behavior based on real-time human language commands or feedback. In this paper, we propose a novel hybrid approach called Social Robot Planner (SRLM), which integrates Large Language Models (LLM) and Deep Reinforcement Learning (DRL) to navigate through human-filled public spaces and provide multiple social services. SRLM infers global planning from human-in-loop commands in real-time, and encodes social information into an LLM-based large navigation model (LNM) for low-level motion execution. Moreover, a DRL-based planner is designed to maintain benchmarking performance, which is blended with LNM by a large feedback model (LFM) to address the instability of current text and LLM-driven LNM. Finally, SRLM demonstrates outstanding performance in extensive experiments. More details about this work are available at: https://sites.google.com/view/navi-srlm

Planning with a Learned Policy Basis to Optimally Solve Complex Tasks

Authors:Guillermo Infante, David Kuric, Anders Jonsson, Vicenç Gómez, Herke van Hoof
Date:2024-03-22 15:51:39

Conventional reinforcement learning (RL) methods can successfully solve a wide range of sequential decision problems. However, learning policies that can generalize predictably across multiple tasks in a setting with non-Markovian reward specifications is a challenging problem. We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning. In contrast to other methods that combine (sub)policies via planning, our method asymptotically attains global optimality, even in stochastic environments.

Analysis of a Modular Autonomous Driving Architecture: The Top Submission to CARLA Leaderboard 2.0 Challenge

Authors:Weize Zhang, Mohammed Elmahgiubi, Kasra Rezaee, Behzad Khamidehi, Hamidreza Mirkhani, Fazel Arasteh, Chunlin Li, Muhammad Ahsan Kaleem, Eduardo R. Corral-Soto, Dhruv Sharma, Tongtong Cao
Date:2024-03-21 23:44:19

In this paper we present the architecture of the Kyber-E2E submission to the map track of CARLA Leaderboard 2.0 Autonomous Driving (AD) challenge 2023, which achieved first place. We employed a modular architecture for our solution consisting of five main components: sensing, localization, perception, tracking/prediction, and planning/control. Our solution leverages state-of-the-art language-assisted perception models to help our planner perform more reliably in highly challenging traffic scenarios. We use open-source driving datasets in conjunction with Inverse Reinforcement Learning (IRL) to enhance the performance of our motion planner. We provide insight into our design choices and trade-offs made to achieve this solution. We also explore the impact of each component in the overall performance of our solution, with the intent of providing a guideline where allocation of resources can have the greatest impact.

TEeVTOL: Balancing Energy and Time Efficiency in eVTOL Aircraft Path Planning Across City-Scale Wind Fields

Authors:Songyang Liu, Shuai Li, Haochen Li, Weizi Li, Jindong Tan
Date:2024-03-21 22:54:08

Electric vertical-takeoff and landing (eVTOL) aircraft, recognized for their maneuverability and flexibility, offer a promising alternative to our transportation system. However, the operational effectiveness of these aircraft faces many challenges, such as the delicate balance between energy and time efficiency, stemming from unpredictable environmental factors, including wind fields. Mathematical modeling-based approaches have been adopted to plan aircraft flight path in urban wind fields with the goal to save energy and time costs. While effective, they are limited in adapting to dynamic and complex environments. To optimize energy and time efficiency in eVTOL's flight through dynamic wind fields, we introduce a novel path planning method leveraging deep reinforcement learning. We assess our method with extensive experiments, comparing it to Dijkstra's algorithm -- the theoretically optimal approach for determining shortest paths in a weighted graph, where weights represent either energy or time cost. The results show that our method achieves a graceful balance between energy and time efficiency, closely resembling the theoretically optimal values for both objectives.
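The baseline named in this abstract, Dijkstra's algorithm on a graph whose edge weights are either energy or time cost, can be sketched as follows. The graph layout, node names, and cost values are hypothetical; how the authors discretize the wind field into a graph is not stated in the abstract.

```python
import heapq


def dijkstra(graph, source, target, weight_key):
    """Shortest path under one cost type ("energy" or "time").

    graph: {node: {neighbor: {"energy": float, "time": float}}}
    Returns (total_cost, path); (inf, []) if the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, costs in graph.get(u, {}).items():
            nd = d + costs[weight_key]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical airspace graph: one leg is energy-cheap, the other time-cheap.
graph = {
    "A": {"B": {"energy": 1.0, "time": 5.0}, "C": {"energy": 4.0, "time": 1.0}},
    "B": {"D": {"energy": 1.0, "time": 5.0}},
    "C": {"D": {"energy": 4.0, "time": 1.0}},
    "D": {},
}
best_energy = dijkstra(graph, "A", "D", "energy")
best_time = dijkstra(graph, "A", "D", "time")
```

Each call is optimal for one objective only, which is why the two results serve as per-objective reference values against which a single learned policy balancing both can be compared.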

Federated reinforcement learning for robot motion planning with zero-shot generalization

Authors:Zhenyuan Yuan, Siyuan Xu, Minghui Zhu
Date:2024-03-20 02:16:54

This paper considers the problem of learning a control policy for robot motion planning with zero-shot generalization, i.e., no data collection or policy adaptation is needed when the learned policy is deployed in new environments. We develop a federated reinforcement learning framework that enables collaborative learning of multiple learners and a central server, i.e., the Cloud, without sharing their raw data. In each iteration, each learner uploads its local control policy and the corresponding estimated normalized arrival time to the Cloud, which then computes the global optimum among the learners and broadcasts the optimal policy to the learners. Each learner then selects between its local control policy and that from the Cloud for the next iteration. The proposed framework leverages the derived zero-shot generalization guarantees on arrival time and safety. Theoretical guarantees on almost-sure convergence, almost consensus, Pareto improvement and optimality gap are also provided. Monte Carlo simulation is conducted to evaluate the proposed framework.
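One round of the upload, select, and broadcast loop described above might look like the sketch below. It makes two assumptions that the abstract does not confirm: that the "global optimum" is the upload with the smallest estimated normalized arrival time, and that a learner adopts the Cloud's policy only when that estimate beats its own. Policies are represented by placeholder strings, and all names are hypothetical.

```python
def cloud_select(uploads):
    """Pick the upload with the smallest estimated arrival time.

    uploads: {learner_id: (policy, estimated_arrival_time)}
    Only policies and scalar estimates are shared, never raw data.
    """
    best_id = min(uploads, key=lambda lid: uploads[lid][1])
    return uploads[best_id]


def learner_select(local, broadcast):
    """Assumed rule: keep whichever candidate has the lower estimate."""
    return broadcast if broadcast[1] < local[1] else local


def federated_round(uploads):
    """Run one iteration and return each learner's policy for the next one."""
    broadcast = cloud_select(uploads)
    return {lid: learner_select(local, broadcast) for lid, local in uploads.items()}


# Hypothetical uploads from three learners.
uploads = {"r1": ("pi_1", 0.8), "r2": ("pi_2", 0.5), "r3": ("pi_3", 0.9)}
next_policies = federated_round(uploads)
```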

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Authors:Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi
Date:2024-03-19 16:31:30

Recent advances in visual reasoning (VR), particularly with the aid of Large Vision-Language Models (VLMs), show promise but require access to large-scale datasets and face challenges such as high computational costs and limited generalization capabilities. Compositional visual reasoning approaches have emerged as effective strategies; however, they heavily rely on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering the effect of their decisions on the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner and reasoner modules utilize an LLM to generate instruction samples and executable code from the selected instruction, respectively, while the RL agent dynamically interacts with these modules, making high-level decisions on selection of the best instruction sample given information from the historical state stored through a feedback loop. This adaptable design enables HYDRA to adjust its actions based on previous feedback received during the reasoning process, leading to more reliable reasoning outputs and ultimately enhancing its overall effectiveness. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.

Equivariant Ensembles and Regularization for Reinforcement Learning in Map-based Path Planning

Authors:Mirco Theile, Hongpeng Cao, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
Date:2024-03-19 16:01:25

In reinforcement learning (RL), exploiting environmental symmetries can significantly enhance efficiency, robustness, and performance. However, ensuring that the deep RL policy and value networks are respectively equivariant and invariant to exploit these symmetries is a substantial challenge. Related works try to design networks that are equivariant and invariant by construction, limiting them to a very restricted library of components, which in turn hampers the expressiveness of the networks. This paper proposes a method to construct equivariant policies and invariant value functions without specialized neural network components, which we term equivariant ensembles. We further add a regularization term for adding inductive bias during training. In a map-based path planning case study, we show how equivariant ensembles and regularization benefit sample efficiency and performance.
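One common way to obtain an equivariant policy from an arbitrary network, which may or may not match the paper's exact construction, is to average the network's outputs over all symmetry transformations of the input while permuting the actions back accordingly. The sketch below does this for the four 90-degree rotations of a square grid map with four movement actions; the toy `base_policy` stands in for an unconstrained neural network and is purely hypothetical.

```python
import math

ACTIONS = ("up", "right", "down", "left")  # indices increase clockwise


def rotate_cw(grid):
    """Rotate a square grid (tuple of tuples) 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def base_policy(grid):
    """Toy, deliberately non-equivariant policy: softmax over ad hoc logits."""
    logits = [
        sum(
            cell * ((3 * i + 5 * j + 7 * a) % 11) / 10.0
            for i, row in enumerate(grid)
            for j, cell in enumerate(row)
        )
        for a in range(4)
    ]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_policy(grid):
    """Average the base policy over all four rotations of the map.

    Rotating the map k times clockwise turns action a into (a + k) % 4,
    so that entry is read back when accumulating the probability of a.
    """
    probs = [0.0] * 4
    rotated = grid
    for k in range(4):
        p = base_policy(rotated)
        for a in range(4):
            probs[a] += p[(a + k) % 4] / 4.0
        rotated = rotate_cw(rotated)
    return probs


grid = ((0, 1, 0), (0, 0, 2), (3, 0, 0))
p = ensemble_policy(grid)
q = ensemble_policy(rotate_cw(grid))
```

With this construction, rotating the map clockwise shifts the action distribution by one position, so `q[(a + 1) % 4]` matches `p[a]` up to floating-point error; averaging a value network over the same rotations would give an invariant value estimate in the same way.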

FootstepNet: an Efficient Actor-Critic Method for Fast On-line Bipedal Footstep Planning and Forecasting

Authors:Clément Gaspard, Grégoire Passault, Mélodie Daniel, Olivier Ly
Date:2024-03-19 09:48:18

Designing a humanoid locomotion controller is challenging and classically split up into sub-problems. Footstep planning is one of those, where the sequence of footsteps is defined. Even in simpler environments, finding a minimal sequence, or even a feasible sequence, yields a complex optimization problem. In the literature, this problem is usually addressed by search-based algorithms (e.g. variants of A*). However, such approaches are either computationally expensive or rely on hand-crafted tuning of several parameters. In this work, we first propose an efficient footstep planning method to navigate in local environments with obstacles, based on state-of-the-art Deep Reinforcement Learning (DRL) techniques, with very low computational requirements for on-line inference. Our approach is heuristic-free and relies on a continuous set of actions to generate feasible footsteps. In contrast, other methods necessitate the selection of a relevant discrete set of actions. Second, we propose a forecasting method that quickly estimates the number of footsteps required to reach different candidate local targets. This approach relies on inherent computations made by the actor-critic DRL architecture. We demonstrate the validity of our approach with simulation results, and by a deployment on a kid-size humanoid robot during the RoboCup 2023 competition.

Prior-dependent analysis of posterior sampling reinforcement learning with function approximation

Authors:Yingru Li, Zhi-Quan Luo
Date:2024-03-17 11:23:51

This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation; and refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL), presenting an upper bound of ${\mathcal{O}}(d\sqrt{H^3 T \log T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This signifies a methodological enhancement by optimizing the $\mathcal{O}(\sqrt{\log T})$ factor over the previous benchmark (Osband and Van Roy, 2014) specified to linear mixture MDPs. Our approach, leveraging a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses reliant on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.

PyroTrack: Belief-Based Deep Reinforcement Learning Path Planning for Aerial Wildfire Monitoring in Partially Observable Environments

Authors:Sahand Khoshdel, Qi Luo, Fatemeh Afghah
Date:2024-03-17 05:23:43

Motivated by agility, 3D mobility, and low-risk operation compared to human-operated management systems of autonomous unmanned aerial vehicles (UAVs), this work studies UAV-based active wildfire monitoring where a UAV detects fire incidents in remote areas and tracks the fire frontline. A UAV path planning solution is proposed considering realistic wildfire management missions, where a single low-altitude drone with limited power and flight time is available. Noting the limited field of view of commercial low-altitude UAVs, the problem is formulated as a partially observable Markov decision process (POMDP), in which wildfire progression outside the field of view causes inaccurate state representation that prevents the UAV from finding the optimal path to track the fire front in limited time. Common deep reinforcement learning (DRL)-based trajectory planning solutions require diverse drone-recorded wildfire data to generalize pre-trained models to real-time systems, which is not currently available at a diverse and standard scale. To narrow down the gap caused by partial observability in the space of possible policies, a belief-based state representation with broad, extensive simulated data is proposed where the beliefs (i.e., ignition probabilities of different grid areas) are updated using a Bayesian framework for the cells within the field of view. The performance of the proposed solution in terms of the ratio of detected fire cells and monitored ignited area (MIA) is evaluated in a complex fire scenario with multiple rapidly growing fire batches, indicating that the belief state representation outperforms the observation state representation both in fire coverage and the distance to fire frontline.
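The belief update described above can be sketched with Bayes' rule applied per grid cell. The sketch assumes a simple binary fire detector with a fixed detection rate and false-alarm rate; those two parameters, the function name, and the data layout are hypothetical, and the paper's observation model may differ. Cells outside the field of view keep their prior belief.

```python
def update_belief(belief, observations, p_detect=0.9, p_false=0.05):
    """Bayesian update of per-cell ignition probabilities.

    belief       : {cell: prior probability that the cell is burning}
    observations : {cell: True/False detector reading}, only for cells
                   currently inside the field of view
    p_detect     : assumed P(reading is True | cell is burning)
    p_false      : assumed P(reading is True | cell is not burning)
    """
    posterior = dict(belief)  # cells outside the field of view are unchanged
    for cell, saw_fire in observations.items():
        prior = belief[cell]
        if saw_fire:
            like_fire, like_clear = p_detect, p_false
        else:
            like_fire, like_clear = 1.0 - p_detect, 1.0 - p_false
        evidence = like_fire * prior + like_clear * (1.0 - prior)
        posterior[cell] = like_fire * prior / evidence
    return posterior


# Hypothetical 3-cell strip; only the first two cells are in view.
belief = {(0, 0): 0.5, (0, 1): 0.5, (0, 2): 0.5}
belief = update_belief(belief, {(0, 0): True, (0, 1): False})
```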

Distributed Multi-Objective Dynamic Offloading Scheduling for Air-Ground Cooperative MEC

Authors:Yang Huang, Miaomiao Dong, Yijie Mao, Wenqiang Liu, Zhen Gao
Date:2024-03-16 13:50:31

Utilizing unmanned aerial vehicles (UAVs) with edge servers to assist terrestrial mobile edge computing (MEC) has attracted tremendous attention. Nevertheless, state-of-the-art schemes based on deterministic optimizations or single-objective reinforcement learning (RL) cannot reduce the backlog of task bits and simultaneously improve energy efficiency in highly dynamic network environments, where the design problem amounts to a sequential decision-making problem. In order to address the aforementioned problems, as well as the curses of dimensionality introduced by the growing number of terrestrial users, this paper proposes a distributed multi-objective (MO) dynamic trajectory planning and offloading scheduling scheme, integrated with MORL and the kernel method. The design of n-step return is also applied to average fluctuations in the backlog. Numerical results reveal that the n-step return can benefit the proposed kernel-based approach, achieving significant improvement in the long-term average backlog performance, compared to the conventional 1-step return design. Due to such design and the kernel-based neural network, to which decision-making features can be continuously added, the kernel-based approach can outperform the approach based on fully-connected deep neural network, yielding improvement in energy consumption and the backlog performance, as well as a significant reduction in decision-making and online learning time.
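For reference, the standard n-step return that the abstract contrasts with the 1-step design sums n discounted rewards and then bootstraps from a value estimate: G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V(s_n). The sketch below implements this textbook definition for a single scalar reward; the function name and numbers are illustrative, and the paper's multi-objective variant is not reproduced here.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """n-step return, where n = len(rewards).

    rewards         : [r_0, ..., r_(n-1)] observed after the current state
    bootstrap_value : value estimate V(s_n) of the state reached after n steps
    gamma           : discount factor in [0, 1]
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_value


# Hypothetical trajectory segment.
three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=4.0, gamma=0.5)
one_step = n_step_return([1.0], bootstrap_value=4.0, gamma=0.5)
```

Using several observed rewards before bootstrapping averages out step-to-step fluctuations in the signal, at the price of more variance from the sampled trajectory.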

Deep Reinforcement Learning-based Large-scale Robot Exploration

Authors:Yuhong Cao, Rui Zhao, Yizhuo Wang, Bairan Xiang, Guillaume Sartoretti
Date:2024-03-16 06:56:32

In this work, we propose a deep reinforcement learning (DRL) based reactive planner to solve large-scale Lidar-based autonomous robot exploration problems in 2D action space. Our DRL-based planner allows the agent to reactively plan its exploration path by making implicit predictions about unknown areas, based on a learned estimation of the underlying transition model of the environment. To this end, our approach relies on learned attention mechanisms for their powerful ability to capture long-term dependencies at different spatial scales to reason about the robot's entire belief over known areas. Our approach relies on ground truth information (i.e., privileged learning) to guide the environment estimation during training, as well as on a graph rarefaction algorithm, which allows models trained in small-scale environments to scale to large-scale ones. Simulation results show that our model exhibits better exploration efficiency (12% in path length, 6% in makespan) and lower planning time (60%) than the state-of-the-art planners in a 130m x 100m benchmark scenario. We also validate our learned model on hardware.

Diffusion-Reinforcement Learning Hierarchical Motion Planning in Adversarial Multi-agent Games

Authors:Zixuan Wu, Sean Ye, Manisha Natarajan, Matthew C. Gombolay
Date:2024-03-16 03:53:55

Reinforcement Learning- (RL-)based motion planning has recently shown the potential to outperform traditional approaches from autonomous navigation to robot manipulation. In this work, we focus on a motion planning task for an evasive target in a partially observable multi-agent adversarial pursuit-evasion game (PEG). These pursuit-evasion problems are relevant to various applications, such as search and rescue operations and surveillance robots, where robots must effectively plan their actions to gather intelligence or accomplish mission tasks while avoiding detection or capture themselves. We propose a hierarchical architecture that integrates a high-level diffusion model to plan global paths responsive to environment data while a low-level RL algorithm reasons about evasive versus global path-following behavior. Our approach outperforms baselines by 51.2% by leveraging the diffusion model to guide the RL algorithm for more efficient exploration and improves explainability and predictability.

Horizon-Free Regret for Linear Markov Decision Processes

Authors:Zihan Zhang, Jason D. Lee, Yuxin Chen, Simon S. Du
Date:2024-03-15 23:50:58

A recent line of work showed that regret bounds in reinforcement learning (RL) can be (nearly) independent of the planning horizon, a.k.a. the horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency on the size of the transition model is allowed, such as tabular Markov Decision Process (MDP) and linear mixture MDP. We give the first horizon-free bound for the popular linear MDP setting where the size of the transition model can be exponentially large or even uncountable. In contrast to prior works which explicitly estimate the transition model and compute the inhomogeneous value functions at different time steps, we directly estimate the value functions and confidence sets. We obtain the horizon-free bound by: (1) maintaining multiple weighted least-squares estimators for the value functions; and (2) a structural lemma which shows the maximal total variation of the inhomogeneous value functions is bounded by a polynomial factor of the feature dimension.

Partially Observable Task and Motion Planning with Uncertainty and Risk Awareness

Authors:Aidan Curtis, George Matheos, Nishad Gothoskar, Vikash Mansinghka, Joshua Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Date:2024-03-15 16:42:14

Integrated task and motion planning (TAMP) has proven to be a valuable approach to generalizable long-horizon robotic manipulation and navigation problems. However, the typical TAMP problem formulation assumes full observability and deterministic action effects. These assumptions limit the ability of the planner to gather information and make decisions that are risk-aware. We propose a strategy for TAMP with Uncertainty and Risk Awareness (TAMPURA) that is capable of efficiently solving long-horizon planning problems with initial-state and action outcome uncertainty, including problems that require information gathering and avoiding undesirable and irreversible outcomes. Our planner reasons under uncertainty at both the abstract task level and continuous controller level. Given a set of closed-loop goal-conditioned controllers operating in the primitive action space and a description of their preconditions and potential capabilities, we learn a high-level abstraction that can be solved efficiently and then refined to continuous actions for execution. We demonstrate our approach on several robotics problems where uncertainty is a crucial factor and show that reasoning under uncertainty in these problems outperforms previously proposed determinized planning, direct search, and reinforcement learning strategies. Lastly, we demonstrate our planner on two real-world robotics problems using recent advancements in probabilistic perception.

Symbiotic Game and Foundation Models for Cyber Deception Operations in Strategic Cyber Warfare

Authors:Tao Li, Quanyan Zhu
Date:2024-03-14 20:17:57

We are currently facing unprecedented cyber warfare with the rapid evolution of tactics, increasing asymmetry of intelligence, and the growing accessibility of hacking tools. In this landscape, cyber deception emerges as a critical component of our defense strategy against increasingly sophisticated attacks. This chapter aims to highlight the pivotal role of game-theoretic models and foundation models (FMs) in analyzing, designing, and implementing cyber deception tactics. Game models (GMs) serve as a foundational framework for modeling diverse adversarial interactions, allowing us to encapsulate both adversarial knowledge and domain-specific insights. Meanwhile, FMs serve as the building blocks for creating tailored machine learning models suited to given applications. By leveraging the synergy between GMs and FMs, we can advance proactive and automated cyber defense mechanisms by not only securing our networks against attacks but also enhancing their resilience against well-planned operations. This chapter discusses the games at the tactical, operational, and strategic levels of warfare, delves into the symbiotic relationship between these methodologies, and explores relevant applications where such a framework can make a substantial impact in cybersecurity. The chapter discusses the promising direction of the multi-agent neurosymbolic conjectural learning (MANSCOL), which allows the defender to predict adversarial behaviors, design adaptive defensive deception tactics, and synthesize knowledge for the operational level synthesis and adaptation. FMs serve as pivotal tools across various functions for MANSCOL, including reinforcement learning, knowledge assimilation, formation of conjectures, and contextual representation. This chapter concludes with a discussion of the challenges associated with FMs and their application in the domain of cybersecurity.

Meta-operators for Enabling Parallel Planning Using Deep Reinforcement Learning

Authors:Ángel Aso-Mollar, Eva Onaindia
Date:2024-03-13 19:00:36

There is a growing interest in the application of Reinforcement Learning (RL) techniques to AI planning with the aim to come up with general policies. Typically, the mapping of the transition model of AI planning to the state transition system of a Markov Decision Process is established by assuming a one-to-one correspondence of the respective action spaces. In this paper, we introduce the concept of meta-operator as the result of simultaneously applying multiple planning operators, and we show that including meta-operators in the RL action space enables new planning perspectives to be addressed using RL, such as parallel planning. Our research aims to analyze the performance and complexity of including meta-operators in the RL process, concretely in domains where satisfactory outcomes have not been previously achieved using usual generalized planning models. The main objective of this article is thus to pave the way towards a redefinition of the RL action space in a manner that is more closely aligned with the planning perspective.
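
One way to picture a meta-operator is as a set of STRIPS-style operators applied in a single step. The sketch below assumes a simple set-based state and a conservative non-interference check. The `Operator` class, `interfere`, and `apply_meta` are hypothetical names, and the paper's actual definition may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    pre: frozenset      # facts that must hold before applying
    add: frozenset      # facts made true
    delete: frozenset   # facts made false

def interfere(a, b):
    """Two operators interfere if one deletes a fact the other needs or adds."""
    return bool(a.delete & (b.pre | b.add)) or bool(b.delete & (a.pre | a.add))

def apply_meta(state, ops):
    """Apply all operators at once; return None if the combination is invalid."""
    if any(not op.pre <= state for op in ops):
        return None
    for i, a in enumerate(ops):
        for b in ops[i + 1:]:
            if interfere(a, b):
                return None
    new_state = set(state)
    for op in ops:
        new_state -= op.delete
    for op in ops:
        new_state |= op.add
    return frozenset(new_state)

move_a = Operator("move_a", frozenset({"a_at_1"}), frozenset({"a_at_3"}), frozenset({"a_at_1"}))
move_b = Operator("move_b", frozenset({"b_at_2"}), frozenset({"b_at_4"}), frozenset({"b_at_2"}))
move_a_alt = Operator("move_a_alt", frozenset({"a_at_1"}), frozenset({"a_at_5"}), frozenset({"a_at_1"}))

start = frozenset({"a_at_1", "b_at_2"})
parallel = apply_meta(start, [move_a, move_b])
conflict = apply_meta(start, [move_a, move_a_alt])
```

Here the two independent moves combine into one parallel action, while the two conflicting moves of the same object are rejected.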

Digital Twin-assisted Reinforcement Learning for Resource-aware Microservice Offloading in Edge Computing

Authors:Xiangchun Chen, Jiannong Cao, Zhixuan Liang, Yuvraj Sahni, Mingjin Zhang
Date:2024-03-13 16:44:36

Collaborative edge computing (CEC) has emerged as a promising paradigm, enabling edge nodes to collaborate and execute microservices from end devices. Microservice offloading, a fundamentally important problem, decides when and where microservices are executed upon the arrival of services. However, the dynamic nature of the real-world CEC environment often leads to inefficient microservice offloading strategies, resulting in underutilized resources and network congestion. To address this challenge, we formulate an online joint microservice offloading and bandwidth allocation problem, JMOBA, to minimize the average completion time of services. In this paper, we introduce a novel microservice offloading algorithm, DTDRLMO, which leverages deep reinforcement learning (DRL) and digital twin technology. Specifically, we employ digital twin techniques to predict and adapt to changing edge node loads and network conditions of CEC in real-time. Furthermore, this approach enables the generation of an efficient offloading plan, selecting the most suitable edge node for each microservice. Simulation results on real-world and synthetic datasets demonstrate that DTDRLMO outperforms heuristic and learning-based methods in average service completion time.

SpaceOctopus: An Octopus-inspired Motion Planning Framework for Multi-arm Space Robot

Authors:Wenbo Zhao, Shengjie Wang, Yixuan Fan, Yang Gao, Tao Zhang
Date:2024-03-13 03:34:00

Space robots have played a critical role in autonomous maintenance and space junk removal. Multi-arm space robots can efficiently complete the target capture and base reorientation tasks due to their flexibility and the collaborative capabilities between the arms. However, the complex coupling properties arising from both the multiple arms and the free-floating base present challenges to the motion planning problems of multi-arm space robots. We observe that the octopus elegantly achieves similar goals when grabbing prey and escaping from danger. Inspired by the distributed control of octopuses' limbs, we develop a multi-level decentralized motion planning framework to manage the movement of different arms of space robots. This motion planning framework integrates naturally with the multi-agent reinforcement learning (MARL) paradigm. The results indicate that our method outperforms the previous method (centralized training). Leveraging the flexibility of the decentralized framework, we reassemble policies trained for different tasks, enabling the space robot to complete trajectory planning tasks while adjusting the base attitude without further learning. Furthermore, our experiments confirm the superior robustness of our method in the face of external disturbances, changing base masses, and even the failure of one arm.

Synchronized Dual-arm Rearrangement via Cooperative mTSP

Authors:Wenhao Li, Shishun Zhang, Sisi Dai, Hui Huang, Ruizhen Hu, Xiaohong Chen, Kai Xu
Date:2024-03-13 02:26:15

Synchronized dual-arm rearrangement is widely studied as a common scenario in industrial applications. It often faces scalability challenges due to the computational complexity of robotic arm rearrangement and the high-dimensional nature of dual-arm planning. To address these challenges, we formulated the problem as cooperative mTSP, a variant of mTSP where agents share cooperative costs, and utilized reinforcement learning for its solution. Our approach involved representing rearrangement tasks using a task state graph that captured spatial relationships and a cooperative cost matrix that provided details about action costs. Taking these representations as observations, we designed an attention-based network to effectively combine them and provide rational task scheduling. Furthermore, a cost predictor is also introduced to directly evaluate actions during both training and planning, significantly expediting the planning process. Our experimental results demonstrate that our approach outperforms existing methods in terms of both performance and planning efficiency.

Multi-Fidelity Reinforcement Learning for Time-Optimal Quadrotor Re-planning

Authors:Gilhyun Ryou, Geoffrey Wang, Sertac Karaman
Date:2024-03-13 00:30:09

High-speed online trajectory planning for UAVs poses a significant challenge due to the need for precise modeling of complex dynamics while also being constrained by computational limitations. This paper presents a multi-fidelity reinforcement learning method (MFRL) that aims to effectively create a realistic dynamics model and simultaneously train a planning policy that can be readily deployed in real-time applications. The proposed method involves the co-training of a planning policy and a reward estimator; the latter predicts the performance of the policy's output and is trained efficiently through multi-fidelity Bayesian optimization. This optimization approach models the correlation between different fidelity levels, thereby constructing a high-fidelity model based on a low-fidelity foundation, which enables the accurate development of the reward model with limited high-fidelity experiments. The framework is further extended to include real-world flight experiments in reinforcement learning training, allowing the reward model to precisely reflect real-world constraints and broadening the policy's applicability to real-world scenarios. We present rigorous evaluations by training and testing the planning policy in both simulated and real-world environments. The resulting trained policy not only generates faster and more reliable trajectories compared to the baseline snap minimization method, but it also achieves trajectory updates in 2 ms on average, while the baseline method takes several minutes.

Learning-Aided Control of Robotic Tether-Net with Maneuverable Nodes to Capture Large Space Debris

Authors:Achira Boonrath, Feng Liu, Elenora M. Botta, Souma Chowdhury
Date:2024-03-11 19:41:40

Maneuverable tether-net systems launched from an unmanned spacecraft offer a promising solution for the active removal of large space debris. Guaranteeing the successful capture of such space debris is dependent on the ability to reliably maneuver the tether-net system -- a flexible, many-DoF (thus complex) system -- for a wide range of launch scenarios. Here, scenarios are defined by the relative location of the debris with respect to the chaser spacecraft. This paper represents and solves this problem as a hierarchically decentralized implementation of robotic trajectory planning and control and demonstrates the effectiveness of the approach when applied to two different tether-net systems, with 4 and 8 maneuverable units (MUs), respectively. Reinforcement learning (policy gradient) is used to design the centralized trajectory planner that, based on the relative location of the target debris at the launch of the net, computes the final aiming positions of each MU, from which their trajectory can be derived. Each MU then seeks to follow its assigned trajectory by using a decentralized PID controller that outputs the MU's thrust vector and is informed by noisy sensor feedback (for realism) of its relative location. System performance is assessed in terms of capture success and overall fuel consumption by the MUs. Reward shaping and surrogate models are used to respectively guide and speed up the RL process. Simulation-based experiments show that this approach allows the successful capture of debris at fuel costs that are notably lower than nominal baselines, including in scenarios where the debris is significantly off-centered compared to the approaching chaser spacecraft.
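
The decentralized tracking step relies on a PID controller. The following is a minimal single-axis sketch of a textbook discrete PID loop. The gains, the class name, and the scalar error are illustrative assumptions; the paper's controller outputs a thrust vector and its tuning is not given here.

```python
class PID:
    """Textbook discrete PID controller for one axis (illustrative gains)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        # Accumulate the integral term and estimate the error derivative.
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=2.0, ki=1.0, kd=0.5)
first_command = pid.step(error=1.0, dt=0.1)
second_command = pid.step(error=0.5, dt=0.1)
```

One such controller per axis and per maneuverable unit would give a decentralized tracker in the spirit of the abstract, with each unit using only its own (possibly noisy) position error.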

Scalable Online Exploration via Coverability

Authors:Philip Amortila, Dylan J. Foster, Akshay Krishnamurthy
Date:2024-03-11 10:14:06

Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives -- policy optimization objectives that enable downstream maximization of any reward function -- as a conceptual framework to systematize the study of exploration. Within this framework, we introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata: 1. Intrinsic complexity control. $L_1$-Coverage is associated with a structural parameter, $L_1$-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs. 2. Efficient planning. For a known MDP, optimizing $L_1$-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches. 3. Efficient exploration. $L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability. Empirically, we find that $L_1$-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.

Distributional Successor Features Enable Zero-Shot Policy Optimization

Authors:Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta
Date:2024-03-10 22:27:21

Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code available at https://weirdlabuw.github.io/dispo/.
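
The statement that successor features reduce policy evaluation under new rewards to linear regression can be sketched as follows. The sketch assumes rewards are linear in known state features, r(s) = phi(s) · w. The function names and toy data are hypothetical, and this is the classical successor-feature computation rather than the DiSPO method itself.

```python
import numpy as np

def fit_reward_weights(features, rewards):
    """Least-squares fit of w such that features @ w approximates rewards."""
    w, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return w

def evaluate_policy(successor_features, w):
    """Value of each start state: V(s) = psi(s) . w, with no further rollouts."""
    return successor_features @ w

# Toy data: three states with 2-D features and rewards from a new task.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_task_rewards = phi @ np.array([2.0, -1.0])
w_hat = fit_reward_weights(phi, new_task_rewards)

# Successor features (discounted feature sums) of one policy from two start states.
psi = np.array([[1.0, 1.0], [3.0, 0.5]])
values = evaluate_policy(psi, w_hat)
```

Once psi has been learned for a policy, switching to a new reward only requires refitting w, which is what makes the evaluation transferable across tasks.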

Scaling Team Coordination on Graphs with Reinforcement Learning

Authors:Manshi Limbu, Zechen Hu, Xuan Wang, Daigo Shishika, Xuesu Xiao
Date:2024-03-09 04:13:58

This paper studies Reinforcement Learning (RL) techniques to enable team coordination behaviors in graph environments with support actions among teammates to reduce the costs of traversing certain risky edges in a centralized manner. While classical approaches can solve this non-standard multi-agent path planning problem by converting the original Environment Graph (EG) into a Joint State Graph (JSG) to implicitly incorporate the support actions, those methods do not scale well to large graphs and teams. To address this curse of dimensionality, we propose to use RL to enable agents to learn such graph traversal and teammate supporting behaviors in a data-driven manner. Specifically, through a new formulation of the team coordination on graphs with risky edges problem into Markov Decision Processes (MDPs) with a novel state and action space, we investigate how RL can solve it in two paradigms: First, we use RL for a team of agents to learn how to coordinate and reach the goal with minimal cost on a single EG. We show that RL efficiently solves problems with up to 20/4 or 25/3 nodes/agents, using a fraction of the time needed for JSG to solve such complex problems; Second, we learn a general RL policy for any $N$-node EGs to produce efficient supporting behaviors. We present extensive experiments and compare our RL approaches against their classical counterparts.

Image-Guided Autonomous Guidewire Navigation in Robot-Assisted Endovascular Interventions using Reinforcement Learning

Authors:Wentao Liu, Tong Tian, Weijin Xu, Bowen Liang, Qingsheng Lu, Xipeng Pan, Wenyi Zhao, Huihua Yang, Ruisheng Su
Date:2024-03-09 01:05:53

Autonomous robots in endovascular interventions possess the potential to navigate guidewires with safety and reliability, while reducing human error and shortening surgical time. However, current methods of guidewire navigation based on Reinforcement Learning (RL) depend on manual demonstration data or magnetic guidance. In this work, we propose an Image-guided Autonomous Guidewire Navigation (IAGN) method. Specifically, we introduce BDA-star, a path planning algorithm with boundary distance constraints, for the trajectory planning of guidewire navigation. We established an IAGN-RL environment where the observations are real-time guidewire feeding images highlighting the position of the guidewire tip and the planned path. We proposed a reward function based on the distances from both the guidewire tip to the planned path and the target to evaluate the agent's actions. Furthermore, in the policy network, we employ a pre-trained convolutional neural network to extract features, mitigating stability issues and slow convergence rates associated with direct learning from raw pixels. Experiments conducted on the aortic simulation IAGN platform demonstrated that the proposed method, targeting the left subclavian artery and the brachiocephalic artery, achieved a 100% guidewire navigation success rate, along with reduced movement and retraction distances and trajectories that tend toward the center of the vessels.
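
A distance-based reward of the kind described here might look like the sketch below. It assumes the planned path is given as a list of 2-D points and that the two distances are combined as a negative weighted sum. The function name, the weights, and the exact functional form are assumptions, not the paper's definition.

```python
import math

def guidewire_reward(tip, path, target, w_path=1.0, w_target=1.0):
    """Negative weighted sum of tip-to-path and tip-to-target distances.

    tip    : (x, y) position of the guidewire tip
    path   : list of (x, y) points on the planned path
    target : (x, y) position of the target
    """
    d_path = min(math.dist(tip, p) for p in path)   # distance to nearest path point
    d_target = math.dist(tip, target)
    return -(w_path * d_path + w_target * d_target)

reward = guidewire_reward(tip=(0.0, 0.0), path=[(0.0, 3.0), (4.0, 0.0)], target=(3.0, 4.0))
```

Under this form the reward increases (toward zero) as the tip stays close to the planned path and approaches the target.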

Will GPT-4 Run DOOM?

Authors:Adrian de Wynter
Date:2024-03-08 17:30:41

We show that GPT-4's reasoning and planning capabilities extend to the 1993 first-person shooter Doom. This large language model (LLM) is able to run and play the game with only a few instructions, plus a textual description--generated by the model itself from screenshots--about the state of the game being observed. We find that GPT-4 can play the game to a passable degree: it is able to manipulate doors, combat enemies, and perform pathing. More complex prompting strategies involving multiple model calls provide better results. While further work is required to enable the LLM to play the game as well as its classical, reinforcement learning-based counterparts, we note that GPT-4 required no training, leaning instead on its own reasoning and observational capabilities. We hope our work pushes the boundaries on intelligent, LLM-based agents in video games. We conclude by discussing the ethical implications of our work.

Learning Speed Adaptation for Flight in Clutter

Authors:Guangyu Zhao, Tianyue Wu, Yeke Chen, Fei Gao
Date:2024-03-07 15:30:54

Animals learn to adapt the speed of their movements to their capabilities and the environment they observe. Mobile robots should also demonstrate this ability to trade off aggressiveness and safety for efficiently accomplishing tasks. The aim of this work is to endow flight vehicles with the ability of speed adaptation in previously unknown and partially observable cluttered environments. We propose a hierarchical learning and planning framework where we utilize both well-established methods of model-based trajectory generation and trial-and-error that comprehensively learns a policy to dynamically configure the speed constraint. Technically, we use online reinforcement learning to obtain the deployable policy. The statistical results in simulation demonstrate the advantages of our method over the constant speed constraint baselines and an alternative method in terms of flight efficiency and safety. In particular, the policy exhibits perception awareness, which distinguishes it from alternative approaches. By deploying the policy to hardware, we verify that these advantages can be brought to the real world.

Generalizing Cooperative Eco-driving via Multi-residual Task Learning

Authors:Vindula Jayawardana, Sirui Li, Cathy Wu, Yashar Farid, Kentaro Oguchi
Date:2024-03-07 05:25:34

Conventional control, such as model-based control, is commonly utilized in autonomous driving due to its efficiency and reliability. However, real-world autonomous driving contends with a multitude of diverse traffic scenarios that are challenging for these planning algorithms. Model-free Deep Reinforcement Learning (DRL) presents a promising avenue in this direction, but learning DRL control policies that generalize to multiple traffic scenarios is still a challenge. To address this, we introduce Multi-residual Task Learning (MRTL), a generic learning framework based on multi-task learning that, for a set of task scenarios, decomposes the control into nominal components that are effectively solved by conventional control methods and residual terms which are solved using learning. We employ MRTL for fleet-level emission reduction in mixed traffic using autonomous vehicles as a means of system control. By analyzing the performance of MRTL across nearly 600 signalized intersections and 1200 traffic scenarios, we demonstrate that it emerges as a promising approach to synergize the strengths of DRL and conventional methods in generalizable control.
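
The nominal-plus-residual decomposition can be sketched for a single scalar control as follows. The function name, the clipping bounds, and the stand-in controllers are hypothetical. The multi-task training procedure of MRTL is not shown.

```python
def residual_action(state, nominal_controller, residual_policy, low=-1.0, high=1.0):
    """Combine a conventional controller with a learned residual, then clip.

    nominal_controller : callable returning the conventional control for a state
    residual_policy    : callable returning the learned correction for a state
    """
    raw = nominal_controller(state) + residual_policy(state)
    return max(low, min(high, raw))

# Stand-ins for a model-based controller and a learned residual policy.
within_limits = residual_action(0.0, lambda s: 0.5, lambda s: 0.2)
clipped = residual_action(0.0, lambda s: 0.9, lambda s: 0.4)
```

The learned part only has to correct the conventional controller where it falls short, rather than learn the full control law from scratch.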

Dexterous Legged Locomotion in Confined 3D Spaces with Reinforcement Learning

Authors:Zifan Xu, Amir Hossain Raj, Xuesu Xiao, Peter Stone
Date:2024-03-06 16:49:08

Recent advances in locomotion controllers utilizing deep reinforcement learning (RL) have yielded impressive results in terms of achieving rapid and robust locomotion across challenging terrain, such as rugged rocks, non-rigid ground, and slippery surfaces. However, while these controllers primarily address challenges underneath the robot, relatively little research has investigated legged mobility through confined 3D spaces, such as narrow tunnels or irregular voids, which impose all-around constraints. The cyclic gait patterns resulting from existing RL-based methods to learn parameterized locomotion skills characterized by motion parameters, such as velocity and body height, may not be adequate to navigate robots through challenging confined 3D spaces, requiring both agile 3D obstacle avoidance and robust legged locomotion. Instead, we propose to learn locomotion skills end-to-end from goal-oriented navigation in confined 3D spaces. To address the inefficiency of tracking distant navigation goals, we introduce a hierarchical locomotion controller that combines a classical planner tasked with planning waypoints to reach a faraway global goal location, and an RL-based policy trained to follow these waypoints by generating low-level motion commands. This approach allows the policy to explore its own locomotion skills within the entire solution space and facilitates smooth transitions between local goals, enabling long-term navigation towards distant goals. In simulation, our hierarchical approach succeeds at navigating through demanding confined 3D environments, outperforming both pure end-to-end learning approaches and parameterized locomotion skills. We further demonstrate the successful real-world deployment of our simulation-trained controller on a real robot.

Collision-Free Robot Navigation in Crowded Environments using Learning based Convex Model Predictive Control

Authors:Zhuanglei Wen, Mingze Dong, Xiai Chen
Date:2024-03-03 09:08:07

Navigating robots safely and efficiently in crowded and complex environments remains a significant challenge. Due to the dynamic and intricate nature of these settings, planning efficient and collision-free paths for robots to track is particularly difficult. In this paper, we uniquely bridge the robot's perception, decision-making and control processes by utilizing the convex obstacle-free region computed from 2D LiDAR data. The overall pipeline is threefold: (1) We propose a robot navigation framework that utilizes deep reinforcement learning (DRL), conceptualizing the observation as the convex obstacle-free region, a departure from general reliance on raw sensor inputs. (2) We design the action space, derived from the intersection of the robot's kinematic limits and the convex region, to enable efficient sampling of inherently collision-free reference points. These actions assist in guiding the robot to move towards the goal and interact with other obstacles during navigation. (3) We employ model predictive control (MPC) to track the trajectory formed by the reference points while satisfying constraints imposed by the convex obstacle-free region and the robot's kinodynamic limits. The effectiveness of the proposed improvements has been validated through two sets of ablation studies and a comparative experiment against the Timed Elastic Band (TEB), demonstrating improved navigation performance in crowded and complex environments.

RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Authors:Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia
Date:2024-02-29 16:07:22

Large Language Models (LLMs) have demonstrated proficiency in utilizing various tools by coding, yet they face limitations in handling intricate logic and precise control. In embodied tasks, high-level planning is amenable to direct coding, while low-level actions often necessitate task-specific refinement, such as Reinforcement Learning (RL). To seamlessly integrate both modalities, we introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent. The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks. This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline. Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it rapidly obtains diamonds within a single day on an RTX3090. Additionally, it achieves SOTA performance across all designated MineDojo tasks.

Large Language Models are Learnable Planners for Long-Term Recommendation

Authors:Wentao Shi, Xiangnan He, Yang Zhang, Chongming Gao, Xinyue Li, Jizhi Zhang, Qifan Wang, Fuli Feng
Date:2024-02-29 13:49:56

Planning for both immediate and long-term benefits becomes increasingly important in recommendation. Existing methods apply Reinforcement Learning (RL) to learn planning capacity by maximizing cumulative reward for long-term recommendation. However, the scarcity of recommendation data presents challenges such as instability and susceptibility to overfitting when training RL models from scratch, resulting in sub-optimal performance. In this light, we propose to leverage the remarkable planning capabilities over sparse data of Large Language Models (LLMs) for long-term recommendation. The key to achieving the target lies in formulating a guidance plan following principles of enhancing long-term engagement and grounding the plan to effective and executable actions in a personalized manner. To this end, we propose a Bi-level Learnable LLM Planner framework, which consists of a set of LLM instances and breaks down the learning process into macro-learning and micro-learning to learn macro-level guidance and micro-level personalized recommendation policies, respectively. Extensive experiments validate that the framework facilitates the planning ability of LLMs for long-term recommendation. Our code and data can be found at https://github.com/jizhi-zhang/BiLLP.

ARMCHAIR: integrated inverse reinforcement learning and model predictive control for human-robot collaboration

Authors:Angelo Caregnato-Neto, Luciano Cavalcante Siebert, Arkady Zgonnikov, Marcos Ricardo Omena de Albuquerque Maximo, Rubens Junqueira Magalhães Afonso
Date:2024-02-29 13:06:14

One of the key issues in human-robot collaboration is the development of computational models that allow robots to predict and adapt to human behavior. Much progress has been achieved in developing such models, as well as control techniques that address the autonomy problems of motion planning and decision-making in robotics. However, the integration of computational models of human behavior with such control techniques still poses a major challenge, resulting in a bottleneck for efficient collaborative human-robot teams. In this context, we present a novel architecture for human-robot collaboration: Adaptive Robot Motion for Collaboration with Humans using Adversarial Inverse Reinforcement learning (ARMCHAIR). Our solution leverages adversarial inverse reinforcement learning and model predictive control to compute optimal trajectories and decisions for a mobile multi-robot system that collaborates with a human in an exploration task. During the mission, ARMCHAIR operates without human intervention, autonomously identifying the necessity to support and acting accordingly. Our approach also explicitly addresses the network connectivity requirement of the human-robot team. Extensive simulation-based evaluations demonstrate that ARMCHAIR allows a group of robots to safely support a simulated human in an exploration scenario, preventing collisions and network disconnections, and improving the overall performance of the task.

Energy-Efficient UAV Swarm Assisted MEC with Dynamic Clustering and Scheduling

Authors:Jialiuyuan Li, Jiayuan Chen, Changyan Yi, Tong Zhang, Kun Zhu, Jun Cai
Date:2024-02-29 08:05:23

In this paper, the energy-efficient unmanned aerial vehicle (UAV) swarm assisted mobile edge computing (MEC) with dynamic clustering and scheduling is studied. In the considered system model, UAVs are divided into multiple swarms, with each swarm consisting of a leader UAV and several follower UAVs to provide computing services to end-users. Unlike existing work, we allow UAVs to dynamically cluster into different swarms, i.e., each follower UAV can change its leader based on time-varying spatial positions, updated application placement, etc. Meanwhile, UAVs are required to dynamically schedule their energy replenishment, application placement, trajectory planning and task delegation. With the aim of maximizing the long-term energy efficiency of the UAV swarm assisted MEC system, a joint optimization problem of dynamic clustering and scheduling is formulated. Taking into account the underlying cooperation and competition among intelligent UAVs, we further reformulate this optimization problem as a combination of a series of strongly coupled multi-agent stochastic games, and then propose a novel reinforcement learning-based UAV swarm dynamic coordination (RLDC) algorithm for obtaining the equilibrium. Simulations are conducted to evaluate the performance of the RLDC algorithm and demonstrate its superiority over counterparts.

Dr. Strategy: Model-Based Generalist Agents with Strategic Dreaming

Authors:Hany Hamed, Subin Kim, Dongyeong Kim, Jaesik Yoon, Sungjin Ahn
Date:2024-02-29 05:34:05

Model-based reinforcement learning (MBRL) has been a primary approach to ameliorating the sample efficiency issue as well as to building generalist agents. However, there has not been much effort toward enhancing the strategy of dreaming itself. This raises the question of whether and how an agent can "dream better" in a more structured and strategic way. In this paper, inspired by the observation from cognitive science suggesting that humans use a spatial divide-and-conquer strategy in planning, we propose a new MBRL agent, called Dr. Strategy, which is equipped with a novel Dreaming Strategy. The proposed agent realizes a divide-and-conquer-like strategy in dreaming. This is achieved by learning a set of latent landmarks and then utilizing these to learn a landmark-conditioned highway policy. With the highway policy, the agent can first learn in the dream to move to a landmark, and from there it tackles the exploration and achievement task in a more focused way. In experiments, we show that the proposed model outperforms prior pixel-based MBRL methods in various visually complex and partially observable navigation tasks.

Unifying F1TENTH Autonomous Racing: Survey, Methods and Benchmarks

Authors:Benjamin David Evans, Raphael Trumpp, Marco Caccamo, Felix Jahncke, Johannes Betz, Hendrik Willem Jordaan, Herman Arnold Engelbrecht
Date:2024-02-28 18:42:46

The F1TENTH autonomous driving platform, consisting of 1:10-scale remote-controlled cars, has evolved into a well-established education and research platform. The many publications and real-world competitions span many domains, from classical path planning to novel learning-based algorithms. Consequently, the field is wide and disjointed, hindering direct comparison of developed methods and making it difficult to assess the state-of-the-art. Therefore, we aim to unify the field by surveying current approaches, describing common methods, and providing benchmark results to facilitate clear comparisons and establish a baseline for future work. This research aims to survey past and current work with F1TENTH vehicles in the classical and learning categories and explain the different solution approaches. We describe particle filter localisation, trajectory optimisation and tracking, model predictive contouring control, follow-the-gap, and end-to-end reinforcement learning. We provide an open-source evaluation of benchmark methods and investigate overlooked factors of control frequency and localisation accuracy for classical methods as well as reward signal and training map for learning methods. The evaluation shows that the optimisation and tracking method achieves the fastest lap times, followed by the online planning approach. Finally, our work identifies and outlines the relevant research aspects to help motivate future work in the F1TENTH domain.

Human-Centric Aware UAV Trajectory Planning in Search and Rescue Missions Employing Multi-Objective Reinforcement Learning with AHP and Similarity-Based Experience Replay

Authors:Mahya Ramezani, Jose Luis Sanchez-Lopez
Date:2024-02-28 17:10:22

The integration of Unmanned Aerial Vehicles (UAVs) into Search and Rescue (SAR) missions presents a promising avenue for enhancing operational efficiency and effectiveness. However, the success of these missions is not solely dependent on the technical capabilities of the drones but also on their acceptance and interaction with humans on the ground. This paper explores the effect of human-centric factors in UAV trajectory planning for SAR missions. We introduce a novel approach based on reinforcement learning augmented with the Analytic Hierarchy Process and a novel similarity-based experience replay to optimize UAV trajectories, balancing operational objectives with human comfort and safety considerations. Additionally, through a comprehensive survey, we investigate the impact of gender cues and anthropomorphism in UAV design on public acceptance and trust, revealing significant implications for drone interaction strategies in SAR. Our contributions include (1) a reinforcement learning framework for UAV trajectory planning that dynamically integrates multi-objective considerations, (2) an analysis of human perceptions towards gendered and anthropomorphized drones in SAR contexts, and (3) the application of similarity-based experience replay for enhanced learning efficiency in complex SAR scenarios. The findings offer valuable insights into designing UAV systems that are not only technically proficient but also aligned with human-centric values.
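
As a rough illustration of how the Analytic Hierarchy Process can turn pairwise importance judgements into weights for a multi-objective reward, the sketch below approximates the principal eigenvector of a comparison matrix by power iteration and uses it to scalarize a reward vector. The three criteria (mission time, human comfort, safety), the matrix values, and the function names are hypothetical and are not taken from the paper.

```python
def ahp_weights(pairwise, iters=100):
    """Approximate AHP priority weights by power iteration.

    pairwise[i][j] states how much more important criterion i is than j,
    with pairwise[j][i] == 1 / pairwise[i][j] (a reciprocal matrix).
    """
    n = len(pairwise)
    w = [1.0 / n] * n
    for _ in range(iters):
        nw = [sum(pairwise[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(nw)
        w = [x / total for x in nw]  # renormalize so the weights sum to 1
    return w


def scalarize(rewards, weights):
    """Combine per-objective rewards into one scalar reward."""
    return sum(r * w for r, w in zip(rewards, weights))


# Hypothetical judgements for (mission time, human comfort, safety).
EXAMPLE_MATRIX = [
    [1.0, 2.0, 6.0],
    [0.5, 1.0, 3.0],
    [1.0 / 6.0, 1.0 / 3.0, 1.0],
]
EXAMPLE_WEIGHTS = ahp_weights(EXAMPLE_MATRIX)
```

The example matrix is perfectly consistent, so the weights are simply proportional to any one of its columns; real judgement matrices are usually inconsistent, and AHP then also reports a consistency ratio, which this sketch omits.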

Imitation-regularized Optimal Transport on Networks: Provable Robustness and Application to Logistics Planning

Authors:Koshi Oishi, Yota Hashizume, Tomohiko Jimbo, Hirotaka Kaji, Kenji Kashima
Date:2024-02-28 01:19:42

Network systems form the foundation of modern society, playing a critical role in various applications. However, these systems are at significant risk of being adversely affected by unforeseen circumstances, such as disasters. Considering this, there is a pressing need for research to enhance the robustness of network systems. Recently, in reinforcement learning, the relationship between acquiring robustness and regularizing entropy has been identified. Additionally, imitation learning is used within this framework to reflect experts' behavior. However, there are no comprehensive studies on the use of a similar imitation framework for optimal transport on networks. Therefore, in this study, imitation-regularized optimal transport (I-OT) on networks was investigated. It encodes prior knowledge on the network by imitating a given prior distribution. The I-OT solution demonstrated robustness in terms of the cost defined on the network. Moreover, we applied the I-OT to a logistics planning problem using real data. We also examined the imitation and a priori risk information scenarios to demonstrate the usefulness and implications of the proposed method.

Video as the New Language for Real-World Decision Making

Authors:Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, Dale Schuurmans
Date:2024-02-27 02:05:29

Both text and video data are abundant on the internet and support large-scale self-supervised learning through next token or frame prediction. However, they have not been equally leveraged: language models have had significant real-world impact, whereas video generation has remained largely limited to media entertainment. Yet video data captures important information about the physical world that is difficult to express in language. To address this gap, we discuss an under-appreciated opportunity to extend video generation to solve tasks in the real world. We observe how, akin to language, video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks. Moreover, we demonstrate how, like language models, video generation models can serve as planners, agents, compute engines, and environment simulators through techniques such as in-context learning, planning and reinforcement learning. We identify major impact opportunities in domains such as robotics, self-driving, and science, supported by recent work that demonstrates how such advanced capabilities in video generation are plausibly within reach. Lastly, we identify key challenges in video generation that impede progress. Addressing these challenges will enable video generation models to demonstrate unique value alongside language models in a wider array of AI applications.

Monitoring Fidelity of Online Reinforcement Learning Algorithms in Clinical Trials

Authors:Anna L. Trella, Kelly W. Zhang, Inbal Nahum-Shani, Vivek Shetty, Iris Yan, Finale Doshi-Velez, Susan A. Murphy
Date:2024-02-26 20:19:14

Online reinforcement learning (RL) algorithms offer great potential for personalizing treatment for participants in clinical trials. However, deploying an online, autonomous algorithm in the high-stakes healthcare setting makes quality control and data quality especially difficult to achieve. This paper proposes algorithm fidelity as a critical requirement for deploying online RL algorithms in clinical trials. It emphasizes the responsibility of the algorithm to (1) safeguard participants and (2) preserve the scientific utility of the data for post-trial analyses. We also present a framework for pre-deployment planning and real-time monitoring to help algorithm developers and clinical researchers ensure algorithm fidelity. To illustrate our framework's practical application, we present real-world examples from the Oralytics clinical trial. Since Spring 2023, this trial has successfully deployed an autonomous, online RL algorithm to personalize behavioral interventions for participants at risk for dental disease.

Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Authors:Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, Jakob Foerster
Date:2024-02-26 18:19:07

Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms. We identify that existing benchmarks used for research into open-ended learning fall into one of two categories. Either they are too slow for meaningful research to be performed without enormous computational resources, like Crafter, NetHack and Minecraft, or they are not complex enough to pose a significant challenge, like Minigrid and Procgen. To remedy this, we first present Craftax-Classic: a ground-up rewrite of Crafter in JAX that runs up to 250x faster than the Python-native original. A run of PPO using 1 billion environment interactions finishes in under an hour using only a single GPU and averages 90% of the optimal reward. To provide a more compelling challenge, we present the main Craftax benchmark, a significant extension of the Crafter mechanics with elements inspired by NetHack. Solving Craftax requires deep exploration, long-term planning and memory, as well as continual adaptation to novel situations as more of the world is discovered. We show that existing methods, including global and episodic exploration as well as unsupervised environment design, fail to make material progress on the benchmark. We believe that Craftax can for the first time allow researchers to experiment in a complex, open-ended environment with limited computational resources.

Think2Drive: Efficient Reinforcement Learning by Thinking in Latent World Model for Quasi-Realistic Autonomous Driving (in CARLA-v2)

Authors:Qifeng Li, Xiaosong Jia, Shaobo Wang, Junchi Yan
Date:2024-02-26 16:43:17

Real-world autonomous driving (AD), especially urban driving, involves many corner cases. The recently released AD simulator CARLA v2 adds 39 common events to the driving scene and provides a more quasi-realistic testbed than CARLA v1. It poses a new challenge to the community, and so far no literature has reported any success on the new scenarios in v2, as existing works mostly rely on specific rules for planning, which cannot cover the more complex cases in CARLA v2. In this work, we take the initiative of directly training a planner, in the hope of handling the corner cases flexibly and effectively, which we believe is also the future of AD. To the best of our knowledge, we develop the first model-based RL method for AD, named Think2Drive, with a world model that learns the transitions of the environment and then acts as a neural simulator to train the planner. This paradigm significantly boosts training efficiency due to the low-dimensional state space and parallel computing of tensors in the world model. As a result, Think2Drive is able to run at expert-level proficiency in CARLA v2 within 3 days of training on a single A6000 GPU, and to the best of our knowledge, no prior success (100% route completion) on CARLA v2 has been reported. We also propose CornerCase-Repository, a benchmark that supports the evaluation of driving models by scenarios. Additionally, we propose a new and balanced metric to evaluate performance by route completion, infraction number, and scenario density, so that the driving score gives more information about actual driving performance.

How Can LLM Guide RL? A Value-Based Approach

Authors:Shenao Zhang, Sirui Zheng, Shuqi Ke, Zhihan Liu, Wanxin Jin, Jianbo Yuan, Yingxiang Yang, Hongxia Yang, Zhaoran Wang
Date:2024-02-25 20:07:13

Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration. Additionally, we present a practical algorithm SLINVIT that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Our experiments across three interactive environments ALFWorld, InterCode, and BlocksWorld demonstrate that our method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.
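
To make the idea of "LLM guidance as a regularization factor in value-based RL" concrete, the sketch below runs tabular value iteration with a KL penalty toward a fixed prior policy. This is a generic KL-regularized value iteration, not the authors' LINVIT or SLINVIT; the prior is a hand-written table standing in for an LLM-suggested policy, and the toy MDP, the function name, and the parameter `lam` are assumptions made for illustration. The prior is assumed to be strictly positive for every action.

```python
import math


def kl_regularized_vi(P, R, prior, gamma=0.9, lam=1.0, iters=200):
    """Value iteration with a KL penalty (weight lam) toward a prior policy.

    P[s][a] is a list of (next_state, probability) pairs, R[s][a] is the
    reward, and prior[s][a] is the prior probability of action a in state s.
    """
    n = len(P)
    V = [0.0] * n

    def q_values(s, values):
        return [R[s][a] + gamma * sum(p * values[s2] for s2, p in P[s][a])
                for a in range(len(P[s]))]

    for _ in range(iters):
        new_V = []
        for s in range(n):
            q = q_values(s, V)
            m = max(q)  # subtract the max for numerical stability
            z = sum(prior[s][a] * math.exp((q[a] - m) / lam)
                    for a in range(len(q)))
            # Regularized backup: lam * log sum_a prior(a|s) * exp(Q(s,a) / lam)
            new_V.append(m + lam * math.log(z))
        V = new_V

    # Regularized policy: prior reweighted by exponentiated Q-values.
    policy = []
    for s in range(n):
        q = q_values(s, V)
        m = max(q)
        w = [prior[s][a] * math.exp((q[a] - m) / lam) for a in range(len(q))]
        total = sum(w)
        policy.append([x / total for x in w])
    return V, policy


# Toy MDP (hypothetical): in state 0, action 0 stays put with reward 0 and
# action 1 moves to the absorbing state 1 with reward 1.
TOY_P = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)], [(1, 1.0)]]]
TOY_R = [[0.0, 1.0], [0.0, 0.0]]
```

With a large `lam` the resulting policy stays close to the prior, and with a small `lam` it approaches the greedy policy, which mirrors the abstract's point that a prior close to the ideal policy leaves less for the learner to correct.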

When in Doubt, Think Slow: Iterative Reasoning with Latent Imagination

Authors:Martin Benfeghoul, Umais Zahid, Qinghai Guo, Zafeirios Fountas
Date:2024-02-23 12:27:48

In an unfamiliar setting, a model-based reinforcement learning agent can be limited by the accuracy of its world model. In this work, we present a novel, training-free approach to improving the performance of such agents separately from planning and learning. We do so by applying iterative inference at decision-time, to fine-tune the inferred agent states based on the coherence of future state representations. Our approach achieves a consistent improvement in both reconstruction accuracy and task performance when applied to visual 3D navigation tasks. We go on to show that considering more future states further improves the performance of the agent in partially-observable environments, but not in a fully-observable one. Finally, we demonstrate that agents with less training pre-evaluation benefit most from our approach.

Large Language Model-based Human-Agent Collaboration for Complex Task Solving

Authors:Xueyang Feng, Zhi-Yuan Chen, Yujia Qin, Yankai Lin, Xu Chen, Zhiyuan Liu, Ji-Rong Wen
Date:2024-02-20 11:03:36

In recent developments within the research community, the integration of Large Language Models (LLMs) in creating fully autonomous agents has garnered significant interest. Despite this, LLM-based agents frequently demonstrate notable shortcomings in adjusting to dynamic environments and fully grasping human needs. In this work, we introduce the problem of LLM-based human-agent collaboration for complex task-solving, exploring their synergistic potential. In addition, we propose a Reinforcement Learning-based Human-Agent Collaboration method, ReHAC. This approach includes a policy model designed to determine the most opportune stages for human intervention within the task-solving process. We construct a human-agent collaboration dataset to train this policy model in an offline reinforcement learning environment. Our validation tests confirm the model's effectiveness. The results demonstrate that the synergistic efforts of humans and LLM-based agents significantly improve performance in complex tasks, primarily through well-planned, limited human intervention. Datasets and code are available at: https://github.com/XueyangFeng/ReHAC.

Dynamic planning in hierarchical active inference

Authors:Matteo Priorelli, Ivilin Peev Stoianov
Date:2024-02-18 17:32:53

By dynamic planning, we refer to the ability of the human brain to infer and impose motor trajectories related to cognitive decisions. A recent paradigm, active inference, brings fundamental insights into the adaptation of biological organisms, constantly striving to minimize prediction errors to restrict themselves to life-compatible states. Over the past years, many studies have shown how human and animal behaviors could be explained in terms of active inference - either as discrete decision-making or continuous motor control - inspiring innovative solutions in robotics and artificial intelligence. Still, the literature lacks a comprehensive outlook on effectively planning realistic actions in changing environments. Setting ourselves the goal of modeling complex tasks such as tool use, we delve into the topic of dynamic planning in active inference, keeping in mind two crucial aspects of biological behavior: the capacity to understand and exploit affordances for object manipulation, and to learn the hierarchical interactions between the self and the environment, including other agents. We start from a simple unit and gradually describe more advanced structures, comparing recently proposed design choices and providing basic examples. This study distances itself from traditional views centered on neural networks and reinforcement learning, and points toward a yet unexplored direction in active inference: hybrid representations in hierarchical models.

Programmatic Reinforcement Learning: Navigating Gridworlds

Authors:Guruprerana Shabadi, Nathanaël Fijalkow, Théo Matricon
Date:2024-02-18 17:02:39

The field of reinforcement learning (RL) is concerned with algorithms for learning optimal policies in unknown stochastic environments. Programmatic RL studies representations of policies as programs, that is, representations involving higher-order constructs such as control loops. Despite attracting a lot of attention at the intersection of the machine learning and formal methods communities, very little is known on the theoretical front about programmatic RL: what are good classes of programmatic policies? How large are optimal programmatic policies? How can we learn them? The goal of this paper is to give first answers to these questions, initiating a theoretical study of programmatic RL. Considering a class of gridworld environments, we define a class of programmatic policies. Our main contributions are to place upper bounds on the size of optimal programmatic policies, and to construct an algorithm for synthesizing them. These theoretical findings are complemented by a prototype implementation of the algorithm.

Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning

Authors:Zihao Li, Boyi Liu, Zhuoran Yang, Zhaoran Wang, Mengdi Wang
Date:2024-02-16 16:35:18

We study the Constrained Convex Markov Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure, subject to a convex constraint. Designing algorithms for a constrained convex MDP faces several challenges, including (1) handling the large state space, (2) managing the exploration/exploitation tradeoff, and (3) solving the constrained optimization where the objective and the constraint are both nonlinear functions of the visitation measure. In this work, we present a model-based algorithm, Variational Primal-Dual Policy Optimization (VPDPO), in which Lagrangian and Fenchel duality are implemented to reformulate the original constrained problem into an unconstrained primal-dual optimization. Moreover, the primal variables are updated by model-based value iteration following the principle of Optimism in the Face of Uncertainty (OFU), while the dual variables are updated by gradient ascent. Furthermore, by embedding the visitation measure into a finite-dimensional space, we can handle large state spaces by incorporating function approximation. Two notable examples are (1) Kernelized Nonlinear Regulators and (2) Low-rank MDPs. We prove that with an optimistic planning oracle, our algorithm achieves sublinear regret and constraint violation in both cases and can attain the globally optimal policy of the original constrained problem.

Enhancing Courier Scheduling in Crowdsourced Last-Mile Delivery through Dynamic Shift Extensions: A Deep Reinforcement Learning Approach

Authors:Zead Saleh, Ahmad Al Hanbali, Ahmad Baubaid
Date:2024-02-15 14:15:51

Crowdsourced delivery platforms face complex scheduling challenges to match couriers and customer orders. We consider two types of crowdsourced couriers, namely, committed and occasional couriers, each with different compensation schemes. Crowdsourced delivery platforms usually schedule committed courier shifts based on predicted demand. Therefore, platforms may devise an offline schedule for committed couriers before the planning period. However, due to the unpredictability of demand, there are instances where it becomes necessary to make online adjustments to the offline schedule. In this study, we focus on the problem of dynamically adjusting the offline schedule through shift extensions for committed couriers. This problem is modeled as a sequential decision process. The objective is to maximize platform profit by determining the shift extensions of couriers and the assignments of requests to couriers. To solve the model, a Deep Q-Network (DQN) learning approach is developed. Comparing this model with the baseline policy where no extensions are allowed demonstrates the benefits that platforms can gain from allowing shift extensions in terms of higher reward, lower lost-order costs, and fewer lost requests. Additionally, sensitivity analysis showed that the total extension compensation increases in a nonlinear manner with the arrival rate of requests, and in a linear manner with the arrival rate of occasional couriers. On the compensation sensitivity, the results showed that the normal scenario exhibited the highest average number of shift extensions and, consequently, the lowest average number of lost requests. These findings serve as evidence of the successful learning of such dynamics by the DQN algorithm.

Optimal Task Assignment and Path Planning using Conflict-Based Search with Precedence and Temporal Constraints

Authors:Yu Quan Chong, Jiaoyang Li, Katia Sycara
Date:2024-02-13 20:07:58

The Multi-Agent Path Finding (MAPF) problem entails finding collision-free paths for a set of agents, guiding them from their start to goal locations. However, MAPF does not account for several practical task-related constraints. For example, agents may need to perform actions at goal locations with specific execution times, adhering to predetermined orders and timeframes. Moreover, goal assignments may not be predefined for agents, and the optimization objective may lack an explicit definition. To incorporate task assignment, path planning, and a user-defined objective into a coherent framework, this paper examines the Task Assignment and Path Finding with Precedence and Temporal Constraints (TAPF-PTC) problem. We augment Conflict-Based Search (CBS) to simultaneously generate task assignments and collision-free paths that adhere to precedence and temporal constraints, maximizing an objective quantified by the return from a user-defined reward function in reinforcement learning (RL). Experimentally, we demonstrate that our algorithm, CBS-TA-PTC, can solve highly challenging bomb-defusing tasks with precedence and temporal constraints efficiently relative to MARL and adapted Target Assignment and Path Finding (TAPF) methods.

Provable Traffic Rule Compliance in Safe Reinforcement Learning on the Open Sea

Authors:Hanna Krasowski, Matthias Althoff
Date:2024-02-13 14:59:19

For safe operation, autonomous vehicles have to obey traffic rules that are set forth in legal documents formulated in natural language. Temporal logic is a suitable concept to formalize such traffic rules. Still, temporal logic rules often result in constraints that are hard to solve using optimization-based motion planners. Reinforcement learning (RL) is a promising method to find motion plans for autonomous vehicles. However, vanilla RL algorithms are based on random exploration and do not automatically comply with traffic rules. Our approach accomplishes guaranteed rule-compliance by integrating temporal logic specifications into RL. Specifically, we consider the application of vessels on the open sea, which must adhere to the Convention on the International Regulations for Preventing Collisions at Sea (COLREGS). To efficiently synthesize rule-compliant actions, we combine predicates based on set-based prediction with a statechart representing our formalized rules and their priorities. Action masking then restricts the RL agent to this set of verified rule-compliant actions. In numerical evaluations on critical maritime traffic situations, our agent always complies with the formalized legal rules and never collides while achieving a high goal-reaching rate during training and deployment. In contrast, vanilla and traffic rule-informed RL agents frequently violate traffic rules and collide even after training.
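
The action-masking step described above can be sketched as follows: given a boolean mask marking which actions have been verified as rule-compliant, the agent samples or selects only among those. How the mask is computed (the set-based prediction and statechart in the paper) is not reproduced here; the mask is simply taken as input, and the function names are hypothetical.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax restricted to allowed actions; masked actions get probability 0."""
    if not any(allowed):
        raise ValueError("at least one action must be allowed")
    m = max(l for l, ok in zip(logits, allowed) if ok)
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def masked_greedy(q_values, allowed):
    """Index of the highest-valued action among the allowed ones."""
    candidates = [i for i, ok in enumerate(allowed) if ok]
    if not candidates:
        raise ValueError("at least one action must be allowed")
    return max(candidates, key=lambda i: q_values[i])
```

Because masked actions receive exactly zero probability, the agent cannot pick them during exploration or deployment, which is what lets compliance hold by construction rather than being learned from penalties.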

Conservative and Risk-Aware Offline Multi-Agent Reinforcement Learning

Authors:Eslam Eldeeb, Houssem Sifaou, Osvaldo Simeone, Mohammad Shehab, Hirley Alves
Date:2024-02-13 12:49:22

Reinforcement learning (RL) has been widely adopted for controlling and optimizing complex engineering systems such as next-generation wireless networks. An important challenge in adopting RL is the need for direct access to the physical environment. This limitation is particularly severe in multi-agent systems, for which conventional multi-agent reinforcement learning (MARL) requires a large number of coordinated online interactions with the environment during training. When only offline data is available, a direct application of online MARL schemes would generally fail due to the epistemic uncertainty entailed by the lack of exploration during training. In this work, we propose an offline MARL scheme that integrates distributional RL and conservative Q-learning to address the environment's inherent aleatoric uncertainty and the epistemic uncertainty arising from the use of offline data. We explore both independent and joint learning strategies. The proposed MARL scheme, referred to as multi-agent conservative quantile regression, addresses general risk-sensitive design criteria and is applied to the trajectory planning problem in drone networks, showcasing its advantages.
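
The distributional, risk-sensitive part of such a scheme can be illustrated with two small building blocks: the quantile-regression (pinball) loss used to fit return quantiles, and a lower-tail risk measure computed from those quantiles. Conditional value-at-risk (CVaR) is used here as one example of a risk criterion; that choice, the function names, and the equal weighting of quantiles are assumptions of this sketch, and the conservative Q-learning penalty is not shown.

```python
import math


def pinball_loss(pred, target, tau):
    """Quantile-regression loss for one prediction at quantile level tau."""
    u = target - pred
    return u * (tau - (1.0 if u < 0 else 0.0))


def cvar_from_quantiles(quantiles, alpha):
    """Mean of the lowest alpha-fraction of equally weighted return quantiles."""
    k = max(1, math.ceil(alpha * len(quantiles)))
    lowest = sorted(quantiles)[:k]
    return sum(lowest) / k
```

Setting `alpha` to 1 recovers the ordinary mean of the quantiles (a risk-neutral criterion), while smaller values focus the agent on its worst outcomes.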

Transition Constrained Bayesian Optimization via Markov Decision Processes

Authors:Jose Pablo Folch, Calvin Tsay, Robert M Lee, Behrang Shafei, Weronika Ormaniec, Andreas Krause, Mark van der Wilk, Ruth Misener, Mojmír Mutný
Date:2024-02-13 12:11:40

Bayesian optimization is a methodology to optimize black-box functions. Traditionally, it focuses on the setting where you can arbitrarily query the search space. However, many real-life problems do not offer this flexibility; in particular, the search space of the next query may depend on previous ones. Example challenges arise in the physical sciences in the form of local movement constraints, required monotonicity in certain variables, and transitions influencing the accuracy of measurements. Altogether, such transition constraints necessitate a form of planning. This work extends classical Bayesian optimization via the framework of Markov Decision Processes. We iteratively solve a tractable linearization of our utility function using reinforcement learning to obtain a policy that plans ahead for the entire horizon. This is a parallel to the optimization of an acquisition function in policy space. The resulting policy is potentially history-dependent and non-Markovian. We showcase applications in chemical reactor optimization, informative path planning, machine calibration, and other synthetic examples.

SPO: Sequential Monte Carlo Policy Optimisation

Authors:Matthew V Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre
Date:2024-02-12 10:32:47

Leveraging planning during learning and decision-making is central to the long-term development of intelligent agents. Recent works have successfully combined tree-based search methods and self-play learning mechanisms to this end. However, these methods typically face scaling challenges due to the sequential nature of their search. While practical engineering solutions can partly overcome this, they often result in a negative impact on performance. In this paper, we introduce SPO: Sequential Monte Carlo Policy Optimisation, a model-based reinforcement learning algorithm grounded within the Expectation Maximisation (EM) framework. We show that SPO provides robust policy improvement and efficient scaling properties. The sample-based search makes it directly applicable to both discrete and continuous action spaces without modifications. We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. Furthermore, the parallel nature of SPO's search enables effective utilisation of hardware accelerators, yielding favourable scaling laws.

Stitching Sub-Trajectories with Conditional Diffusion Model for Goal-Conditioned Offline RL

Authors:Sungyoon Kim, Yunseon Choi, Daiki E. Matsunaga, Kee-Eung Kim
Date:2024-02-11 15:23:13

Offline Goal-Conditioned Reinforcement Learning (Offline GCRL) is an important problem in RL that focuses on acquiring diverse goal-oriented skills solely from pre-collected behavior datasets. In this setting, the reward feedback is typically absent except when the goal is achieved, which makes it difficult to learn policies, especially from a finite dataset of suboptimal behaviors. In addition, realistic scenarios involve long-horizon planning, which necessitates the extraction of useful skills within sub-trajectories. Recently, the conditional diffusion model has been shown to be a promising approach to generate high-quality long-horizon plans for RL. However, the practicality of such methods in the goal-conditioned setting is still limited by a number of technical assumptions they make. In this paper, we propose SSD (Sub-trajectory Stitching with Diffusion), a model-based offline GCRL method that leverages the conditional diffusion model to address these limitations. In summary, we use a diffusion model that generates future plans conditioned on the target goal and value, with the target value estimated from the goal-relabeled offline dataset. We report state-of-the-art performance in the standard benchmark set of GCRL tasks, and demonstrate the capability to successfully stitch the segments of suboptimal trajectories in the offline data to generate high-quality plans.

Divide and Conquer: Provably Unveiling the Pareto Front with Multi-Objective Reinforcement Learning

Authors:Willem Röpke, Mathieu Reymond, Patrick Mannion, Diederik M. Roijers, Ann Nowé, Roxana Rădulescu
Date:2024-02-11 12:35:13

An important challenge in multi-objective reinforcement learning is obtaining a Pareto front of policies to attain optimal performance under different preferences. We introduce Iterated Pareto Referent Optimisation (IPRO), which decomposes finding the Pareto front into a sequence of constrained single-objective problems. This enables us to guarantee convergence while providing an upper bound on the distance to undiscovered Pareto optimal solutions at each step. We evaluate IPRO using utility-based metrics and its hypervolume and find that it matches or outperforms methods that require additional assumptions. By leveraging problem-specific single-objective solvers, our approach also holds promise for applications beyond multi-objective reinforcement learning, such as planning and pathfinding.
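
To make the notion of a Pareto front concrete, here is a minimal sketch that filters a list of multi-objective return vectors down to the non-dominated ones. It is not IPRO, which the abstract describes as a sequence of constrained single-objective problems; the function name and the assumption that every objective is maximised are illustrative.

```python
def pareto_front(points):
    """Return the points that no other point dominates (maximisation)."""

    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among the return vectors `(1, 5)`, `(2, 4)`, `(3, 3)` and `(2, 2)`, only the last is dominated (by `(3, 3)`), so the first three form the front.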

Deceptive Path Planning via Reinforcement Learning with Graph Neural Networks

Authors:Michael Y. Fatemi, Wesley A. Suttle, Brian M. Sadler
Date:2024-02-09 17:07:31

Deceptive path planning (DPP) is the problem of designing a path that hides its true goal from an outside observer. Existing methods for DPP rely on unrealistic assumptions, such as global state observability and perfect model knowledge, and are typically problem-specific, meaning that even minor changes to a previously solved problem can force expensive computation of an entirely new solution. Given these drawbacks, such methods do not generalize to unseen problem instances, lack scalability to realistic problem sizes, and preclude both on-the-fly tunability of deception levels and real-time adaptivity to changing environments. In this paper, we propose a reinforcement learning (RL)-based scheme for training policies to perform DPP over arbitrary weighted graphs that overcomes these issues. The core of our approach is the introduction of a local perception model for the agent, a new state space representation distilling the key components of the DPP problem, the use of graph neural network-based policies to facilitate generalization and scaling, and the introduction of new deception bonuses that translate the deception objectives of classical methods to the RL setting. Through extensive experimentation we show that, without additional fine-tuning, at test time the resulting policies successfully generalize, scale, enjoy tunable levels of deception, and adapt in real-time to changes in the environment.

Dynamic Q-planning for Online UAV Path Planning in Unknown and Complex Environments

Authors:Lidia Gianne Souza da Rocha, Kenny Anderson Queiroz Caldas, Marco Henrique Terra, Fabio Ramos, Kelen Cristiane Teixeira Vivaldini
Date:2024-02-09 10:19:48

Unmanned Aerial Vehicles need an online path planning capability to complete high-risk missions safely in unknown and complex environments. However, many algorithms reported in the literature may not return reliable trajectories to solve online problems in these scenarios. The Q-Learning algorithm, a reinforcement learning technique, can generate trajectories in real time and has demonstrated fast and reliable results. This technique, however, has the disadvantage that the number of iterations must be defined in advance. If this value is poorly chosen, the algorithm either takes a long time or fails to return an optimal trajectory. Therefore, we propose a method to dynamically choose the number of iterations to obtain the best performance of Q-Learning. The proposed method is compared to the Q-Learning algorithm with a fixed number of iterations, A*, Rapidly-Exploring Random Tree, and Particle Swarm Optimization. As a result, the proposed Q-learning algorithm demonstrates the efficacy and reliability of online path planning with a dynamic number of iterations to carry out online missions in unknown and complex environments.
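
The idea of replacing a fixed iteration budget with a dynamic one can be sketched with tabular Q-learning on a small grid. The stopping rule below (the largest Q-update stays under `tol` for `patience` consecutive episodes) is an assumption made for illustration; the abstract does not state the criterion the authors use, and the grid, rewards, and all names are hypothetical.

```python
import random

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(grid, state, action, goal):
    """Deterministic grid move; bumping into a wall or obstacle costs -1."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
    if not inside or grid[r][c] == 1:
        return state, -1.0
    return (r, c), (1.0 if (r, c) == goal else -0.01)


def q_learning_dynamic(grid, start, goal, alpha=0.5, gamma=0.9, eps=0.2,
                       tol=1e-3, patience=5, max_episodes=5000, seed=0):
    """Tabular Q-learning that stops once its updates become small.

    Assumed stopping rule: the largest update in an episode stays below
    `tol` for `patience` consecutive episodes.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    q = {s: [0.0] * 4 for s in cells}
    calm = 0
    for episode in range(1, max_episodes + 1):
        state, biggest = start, 0.0
        for _ in range(4 * len(cells)):  # cap on steps per episode
            if state == goal:
                break
            if rng.random() < eps:  # explore
                action = rng.randrange(4)
            else:  # exploit the current estimate
                action = max(range(4), key=lambda a: q[state][a])
            nxt, reward = step(grid, state, action, goal)
            future = 0.0 if nxt == goal else gamma * max(q[nxt])
            delta = alpha * (reward + future - q[state][action])
            q[state][action] += delta
            biggest = max(biggest, abs(delta))
            state = nxt
        calm = calm + 1 if biggest < tol else 0
        if calm >= patience:
            return q, episode  # stopped early: updates have settled
    return q, max_episodes


def greedy_path(grid, q, start, goal):
    """Follow the greedy policy from start for at most one step per cell."""
    path, state = [start], start
    for _ in range(len(q)):
        if state == goal:
            break
        best = max(range(4), key=lambda a: q[state][a])
        state, _reward = step(grid, state, best, goal)
        path.append(state)
    return path
```

Calling `q_learning_dynamic(grid, (0, 0), (2, 2))` on a 3x3 grid returns both the Q-table and the number of episodes actually used, so the budget adapts to the map instead of being fixed up front.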

Scaling Intelligent Agents in Combat Simulations for Wargaming

Authors:Scotty Black, Christian Darken
Date:2024-02-08 21:57:10

Remaining competitive in future conflicts with technologically-advanced competitors requires us to accelerate our research and development in artificial intelligence (AI) for wargaming. More importantly, leveraging machine learning for intelligent combat behavior development will be key to one day achieving superhuman performance in this domain--elevating the quality and accelerating the speed of our decisions in future wars. Although deep reinforcement learning (RL) continues to show promising results in intelligent agent behavior development in games, it has yet to perform at or above the human level in the long-horizon, complex tasks typically found in combat modeling and simulation. Capitalizing on the proven potential of RL and recent successes of hierarchical reinforcement learning (HRL), our research is investigating and extending the use of HRL to create intelligent agents capable of performing effectively in these large and complex simulation environments. Our ultimate goal is to develop an agent capable of superhuman performance that could then serve as an AI advisor to military planners and decision-makers. This paper covers our ongoing approach and the first three of our five research areas aimed at managing the exponential growth of computations that have thus far limited the use of AI in combat simulations: (1) developing an HRL training framework and agent architecture for combat units; (2) developing a multi-model framework for agent decision-making; (3) developing dimension-invariant observation abstractions of the state space to manage the exponential growth of computations; (4) developing an intrinsic rewards engine to enable long-term planning; and (5) implementing this framework into a higher-fidelity combat simulation.

Exploration Without Maps via Zero-Shot Out-of-Distribution Deep Reinforcement Learning

Authors:Shathushan Sivashangaran, Apoorva Khairnar, Azim Eskandarian
Date:2024-02-07 18:17:54

Operation of Autonomous Mobile Robots (AMRs) of all forms, including wheeled ground vehicles, quadrupeds and humanoids, in dynamically changing GPS-denied environments without a priori maps, exclusively using onboard sensors, is an unsolved problem that has the potential to transform the economy and vastly improve humanity's capabilities with improvements to agriculture, manufacturing, disaster response, military and space exploration. Conventional AMR automation approaches are modularized into perception, motion planning and control, which is computationally inefficient and requires explicit feature extraction and engineering that inhibits generalization and deployment at scale. Few works have focused on real-world end-to-end approaches that directly map sensor inputs to control outputs. Supervised Deep Learning (DL) requires a large amount of well-curated training data, which is time-consuming and labor-intensive to collect and label, while Deep Reinforcement Learning (DRL) suffers from sample inefficiency and from the difficulty of bridging the simulation-to-reality gap. This paper presents a novel method to efficiently train DRL for robust end-to-end AMR exploration, in a constrained environment at physical limits in simulation, transferred zero-shot to the real world. The representation learned in a compact parameter space with 2 fully connected layers with 64 nodes each is demonstrated to exhibit emergent behavior for out-of-distribution generalization to navigation in new environments that include unstructured terrain without maps, and dynamic obstacle avoidance. The learned policy outperforms conventional navigation algorithms while consuming a fraction of the computation resources, enabling execution on a range of AMR forms with varying embedded computer payloads.

Deep Reinforcement Learning with Dynamic Graphs for Adaptive Informative Path Planning

Authors:Apoorva Vashisth, Julius Rückin, Federico Magistri, Cyrill Stachniss, Marija Popović
Date:2024-02-07 14:24:41

Autonomous robots are often employed for data collection due to their efficiency and low labour costs. A key task in robotic data acquisition is planning paths through an initially unknown environment to collect observations given platform-specific resource constraints, such as limited battery life. Adaptive online path planning in 3D environments is challenging due to the large set of valid actions and the presence of unknown occlusions. To address these issues, we propose a novel deep reinforcement learning approach for adaptively replanning robot paths to map targets of interest in unknown 3D environments. A key aspect of our approach is a dynamically constructed graph that restricts planning actions local to the robot, allowing us to react to newly discovered static obstacles and targets of interest. For replanning, we propose a new reward function that balances between exploring the unknown environment and exploiting online-discovered targets of interest. Our experiments show that our method enables more efficient target discovery compared to state-of-the-art learning and non-learning baselines. We also showcase our approach for orchard monitoring using an unmanned aerial vehicle in a photorealistic simulator. We open-source our code and model at: https://github.com/dmar-bonn/ipp-rl-3d.

Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference

Authors:Deqian Kong, Dehong Xu, Minglu Zhao, Bo Pang, Jianwen Xie, Andrew Lizarraga, Yuhao Huang, Sirui Xie, Ying Nian Wu
Date:2024-02-07 08:18:09

In tasks aiming for long-term returns, planning becomes essential. We study generative modeling for planning with datasets repurposed from offline reinforcement learning. Specifically, we identify temporal consistency in the absence of step-wise rewards as one key technical challenge. We introduce the Latent Plan Transformer (LPT), a novel model that leverages a latent variable to connect a Transformer-based trajectory generator and the final return. LPT can be learned with maximum likelihood estimation on trajectory-return pairs. In learning, posterior sampling of the latent variable naturally integrates sub-trajectories to form a consistent abstraction despite the finite context. At test time, the latent variable is inferred from an expected return before policy execution, realizing the idea of planning as inference. Our experiments demonstrate that LPT can discover improved decisions from sub-optimal trajectories, achieving competitive performance across several benchmarks, including Gym-Mujoco, Franka Kitchen, Maze2D, and Connect Four. It exhibits capabilities in nuanced credit assignments, trajectory stitching, and adaptation to environmental contingencies. These results validate that latent variable inference can be a strong alternative to step-wise reward prompting.

Reinforcement Learning with Ensemble Model Predictive Safety Certification

Authors:Sven Gronauer, Tom Haider, Felippe Schmoeller da Roza, Klaus Diepold
Date:2024-02-06 17:42:39

Reinforcement learning algorithms need exploration to learn. However, unsupervised exploration prevents the deployment of such algorithms on safety-critical tasks and limits real-world deployment. In this paper, we propose a new algorithm called Ensemble Model Predictive Safety Certification that combines model-based deep reinforcement learning with tube-based model predictive control to correct the actions taken by a learning agent, keeping safety constraint violations at a minimum through planning. Our approach aims to reduce the amount of prior knowledge about the actual system by requiring only offline data generated by a safe controller. Our results show that we can achieve significantly fewer constraint violations than comparable reinforcement learning methods.

Contrastive Diffuser: Planning Towards High Return States via Contrastive Learning

Authors:Yixiang Shan, Zhengbang Zhu, Ting Long, Qifan Liang, Yi Chang, Weinan Zhang, Liang Yin
Date:2024-02-05 07:12:02

The performance of offline reinforcement learning (RL) is sensitive to the proportion of high-return trajectories in the offline dataset. However, in many simulation environments and real-world scenarios, there are large ratios of low-return trajectories rather than high-return trajectories, which makes learning an efficient policy challenging. In this paper, we propose a method called Contrastive Diffuser (CDiffuser) to make full use of low-return trajectories and improve the performance of offline RL algorithms. Specifically, CDiffuser groups the states of trajectories in the offline dataset into high-return states and low-return states and treats them as positive and negative samples correspondingly. Then, it designs a contrastive mechanism to pull the trajectory of an agent toward high-return states and push it away from low-return states. Through the contrastive mechanism, trajectories with low returns can serve as negative examples for policy learning, guiding the agent to avoid areas associated with low returns and achieve better performance. Experiments on 14 commonly used D4RL benchmarks demonstrate the effectiveness of our proposed method. Our code is publicly available at https://anonymous.4open.science/r/CDiffuser.

Integrating DeepRL with Robust Low-Level Control in Robotic Manipulators for Non-Repetitive Reaching Tasks

Authors:Mehdi Heydari Shahna, Seyed Adel Alizadeh Kolagar, Jouni Mattila
Date:2024-02-04 15:54:03

In robotics, contemporary strategies are learning-based, characterized by a complex black-box nature and a lack of interpretability, which may pose challenges in ensuring stability and safety. To address these issues, we propose integrating a collision-free trajectory planner based on deep reinforcement learning (DRL) with a novel auto-tuning low-level control strategy, all while actively engaging in the learning phase through interactions with the environment. This approach circumvents the control performance and complexities associated with computations while addressing nonrepetitive reaching tasks in the presence of obstacles. First, a model-free DRL agent is employed to plan velocity-bounded motion for a manipulator with 'n' degrees of freedom (DoF), ensuring collision avoidance for the end-effector through joint-level reasoning. The generated reference motion is then input into a robust subsystem-based adaptive controller, which produces the necessary torques, while the cuckoo search optimization (CSO) algorithm enhances control gains to minimize the stabilization and tracking error in the steady state. This approach guarantees robustness and uniform exponential convergence in an unfamiliar environment, despite the presence of uncertainties and disturbances. Theoretical assertions are validated through the presentation of simulation outcomes.

The RL/LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement Learning and Large Language Models

Authors:Moschoula Pternea, Prerna Singh, Abir Chakraborty, Yagna Oruganti, Mirco Milletari, Sayli Bapat, Kebei Jiang
Date:2024-02-02 20:01:15

In this work, we review research studies that combine Reinforcement Learning (RL) and Large Language Models (LLMs), two areas that owe their momentum to the development of deep neural networks. We propose a novel taxonomy of three main classes based on the way that the two model types interact with each other. The first class, RL4LLM, includes studies where RL is leveraged to improve the performance of LLMs on tasks related to Natural Language Processing. RL4LLM is divided into two sub-categories depending on whether RL is used to directly fine-tune an existing LLM or to improve the prompt of the LLM. In the second class, LLM4RL, an LLM assists the training of an RL model that performs a task that is not inherently related to natural language. We further break down LLM4RL based on the component of the RL training framework that the LLM assists or replaces, namely reward shaping, goal generation, and policy function. Finally, in the third class, RL+LLM, an LLM and an RL agent are embedded in a common planning framework without either of them contributing to training or fine-tuning of the other. We further branch this class to distinguish between studies with and without natural language feedback. We use this taxonomy to explore the motivations behind the synergy of LLMs and RL and explain the reasons for its success, while pinpointing potential shortcomings and areas where further research is needed, as well as alternative methodologies that serve the same goal.

A Reinforcement Learning-Boosted Motion Planning Framework: Comprehensive Generalization Performance in Autonomous Driving

Authors:Rainer Trauth, Alexander Hobmeier, Johannes Betz
Date:2024-02-02 14:54:38

This study introduces a novel approach to autonomous motion planning, informing an analytical algorithm with a reinforcement learning (RL) agent within a Frenet coordinate system. The combination directly addresses the challenges of adaptability and safety in autonomous driving. Motion planning algorithms are essential for navigating dynamic and complex scenarios. Traditional methods, however, lack the flexibility required for unpredictable environments, whereas machine learning techniques, particularly reinforcement learning (RL), offer adaptability but suffer from instability and a lack of explainability. Our unique solution synergizes the predictability and stability of traditional motion planning algorithms with the dynamic adaptability of RL, resulting in a system that efficiently manages complex situations and adapts to changing environmental conditions. Evaluation of our integrated approach shows a significant reduction in collisions, improved risk management, and improved goal success rates across multiple scenarios. The code used in this research is publicly available as open-source software and can be accessed at the following link: https://github.com/TUM-AVS/Frenetix-RL.

COA-GPT: Generative Pre-trained Transformers for Accelerated Course of Action Development in Military Operations

Authors:Vinicius G. Goecks, Nicholas Waytowich
Date:2024-02-01 21:51:09

The development of Courses of Action (COAs) in military operations is traditionally a time-consuming and intricate process. Addressing this challenge, this study introduces COA-GPT, a novel algorithm employing Large Language Models (LLMs) for rapid and efficient generation of valid COAs. COA-GPT incorporates military doctrine and domain expertise into LLMs through in-context learning, allowing commanders to input mission information - in both text and image formats - and receive strategically aligned COAs for review and approval. Uniquely, COA-GPT not only accelerates COA development, producing initial COAs within seconds, but also facilitates real-time refinement based on commander feedback. This work evaluates COA-GPT in a military-relevant scenario within a militarized version of the StarCraft II game, comparing its performance against state-of-the-art reinforcement learning algorithms. Our results demonstrate COA-GPT's superiority in generating strategically sound COAs more swiftly, with added benefits of enhanced adaptability and alignment with commander intentions. COA-GPT's capability to rapidly adapt and update COAs during missions presents a transformative potential for military planning, particularly in addressing planning discrepancies and capitalizing on emergent windows of opportunity.

SLIM: Skill Learning with Multiple Critics

Authors:David Emukpere, Bingbing Wu, Julien Perez, Jean-Michel Renders
Date:2024-02-01 18:07:33

Self-supervised skill learning aims to acquire useful behaviors that leverage the underlying dynamics of the environment. Latent variable models, based on mutual information maximization, have been successful in this task but still struggle in the context of robotic manipulation. As it requires impacting a possibly large set of degrees of freedom composing the environment, mutual information maximization fails alone in producing useful and safe manipulation behaviors. Furthermore, tackling this by augmenting skill discovery rewards with additional rewards through a naive combination might fail to produce desired behaviors. To address this limitation, we introduce SLIM, a multi-critic learning approach for skill discovery with a particular focus on robotic manipulation. Our main insight is that utilizing multiple critics in an actor-critic framework to gracefully combine multiple reward functions leads to a significant improvement in latent-variable skill discovery for robotic manipulation while overcoming possible interference occurring among rewards which hinders convergence to useful skills. Furthermore, in the context of tabletop manipulation, we demonstrate the applicability of our novel skill discovery approach to acquire safe and efficient motor primitives in a hierarchical reinforcement learning fashion and leverage them through planning, significantly surpassing baseline approaches for skill discovery.

Control in Stochastic Environment with Delays: A Model-based Reinforcement Learning Approach

Authors:Zhiyuan Yao, Ionut Florescu, Chihoon Lee
Date:2024-02-01 03:53:56

In this paper, we introduce a new reinforcement learning method for control problems in environments with delayed feedback. Specifically, our method employs stochastic planning, whereas previous methods used deterministic planning. This allows us to embed risk preference in the policy optimization problem. We show that this formulation can recover the optimal policy for problems with deterministic transitions. We contrast our policy with two prior methods from the literature. We apply the methodology to simple tasks to understand its features. Then, we compare the performance of the methods in controlling multiple Atari games.

Attention Graph for Multi-Robot Social Navigation with Deep Reinforcement Learning

Authors:Erwan Escudie, Laetitia Matignon, Jacques Saraydaryan
Date:2024-01-31 15:24:13

Learning robot navigation strategies among pedestrians is crucial for domain-based applications. Combining perception, planning and prediction allows us to model the interactions between robots and pedestrians, resulting in impressive outcomes, especially with recent approaches based on deep reinforcement learning (RL). However, these works do not consider multi-robot scenarios. In this paper, we present MultiSoc, a new method for learning multi-agent socially aware navigation strategies using RL. Inspired by recent works on multi-agent deep RL, our method leverages a graph-based representation of agent interactions, combining the positions and fields of view of entities (pedestrians and agents). Each agent uses a model based on two Graph Neural Networks combined with attention mechanisms. First, an edge-selector produces a sparse graph; then a crowd coordinator applies node attention to produce a graph representing the influence of each entity on the others. This is incorporated into a model-free RL framework to learn multi-agent policies. We evaluate our approach in simulation and provide a series of experiments under various conditions (number of agents / pedestrians). Empirical results show that our method learns faster than social navigation deep RL mono-agent techniques, and enables efficient multi-agent implicit coordination in challenging crowd navigation with multiple heterogeneous humans. Furthermore, by incorporating customizable meta-parameters, we can adjust the neighborhood density taken into account by our navigation strategy.

Simplifying Latent Dynamics with Softly State-Invariant World Models

Authors:Tankred Saanum, Peter Dayan, Eric Schulz
Date:2024-01-31 13:52:11

To solve control problems via model-based reasoning or planning, an agent needs to know how its actions affect the state of the world. The actions an agent has at its disposal often change the state of the environment in systematic ways. However, existing techniques for world modelling do not guarantee that the effects of actions are represented in such systematic ways. We introduce the Parsimonious Latent Space Model (PLSM), a world model that regularizes the latent dynamics to make the effect of the agent's actions more predictable. Our approach minimizes the mutual information between latent states and the change that an action produces in the agent's latent state, in turn minimizing the dependence the state has on the dynamics. This makes the world model softly state-invariant. We combine PLSM with different model classes used for i) future latent state prediction, ii) planning, and iii) model-free reinforcement learning. We find that our regularization improves accuracy, generalization, and performance in downstream tasks, highlighting the importance of systematic treatment of actions in world models.

CORE: Towards Scalable and Efficient Causal Discovery with Reinforcement Learning

Authors:Andreas W. M. Sauter, Nicolò Botteghi, Erman Acar, Aske Plaat
Date:2024-01-30 12:57:52

Causal discovery is the challenging task of inferring causal structure from data. Motivated by Pearl's Causal Hierarchy (PCH), which tells us that passive observations alone are not enough to distinguish correlation from causation, there has been a recent push to incorporate interventions into machine learning research. Reinforcement learning provides a convenient framework for such an active approach to learning. This paper presents CORE, a deep reinforcement learning-based approach for causal discovery and intervention planning. CORE learns to sequentially reconstruct causal graphs from data while learning to perform informative interventions. Our results demonstrate that CORE generalizes to unseen graphs and efficiently uncovers causal structures. Furthermore, CORE scales to larger graphs with up to 10 variables and outperforms existing approaches in structure estimation accuracy and sample efficiency. All relevant code and supplementary material can be found at https://github.com/sa-and/CORE

R×R: Rapid eXploration for Reinforcement Learning via Sampling-based Reset Distributions and Imitation Pre-training

Authors:Gagan Khandate, Tristan L. Saidi, Siqi Shang, Eric T. Chang, Yang Liu, Seth Dennis, Johnson Adams, Matei Ciocarlie
Date:2024-01-27 19:19:06

We present a method for enabling Reinforcement Learning of motor control policies for complex skills such as dexterous manipulation. We posit that a key difficulty for training such policies is the difficulty of exploring the problem state space, as the accessible and useful regions of this space form a complex structure along manifolds of the original high-dimensional state space. This work presents a method to enable and support exploration with Sampling-based Planning. We use a generally applicable non-holonomic Rapidly-exploring Random Trees algorithm and present multiple methods to use the resulting structure to bootstrap model-free Reinforcement Learning. Our method is effective at learning various challenging dexterous motor control skills of higher difficulty than previously shown. In particular, we achieve dexterous in-hand manipulation of complex objects while simultaneously securing the object without the use of passive support surfaces. These policies also transfer effectively to real robots. A number of example videos can also be found on the project website: https://sbrl.cs.columbia.edu

Reinforcement Learning Interventions on Boundedly Rational Human Agents in Frictionful Tasks

Authors:Eura Nofshin, Siddharth Swaroop, Weiwei Pan, Susan Murphy, Finale Doshi-Velez
Date:2024-01-26 14:59:48

Many important behavior changes are frictionful; they require individuals to expend effort over a long period with little immediate gratification. Here, an artificial intelligence (AI) agent can provide personalized interventions to help individuals stick to their goals. In these settings, the AI agent must personalize rapidly (before the individual disengages) and interpretably, to help us understand the behavioral interventions. In this paper, we introduce Behavior Model Reinforcement Learning (BMRL), a framework in which an AI agent intervenes on the parameters of a Markov Decision Process (MDP) belonging to a boundedly rational human agent. Our formulation of the human decision-maker as a planning agent allows us to attribute undesirable human policies (ones that do not lead to the goal) to their maladapted MDP parameters, such as an extremely low discount factor. Furthermore, we propose a class of tractable human models that captures fundamental behaviors in frictionful tasks. Introducing a notion of MDP equivalence specific to BMRL, we theoretically and empirically show that AI planning with our human models can lead to helpful policies on a wide range of more complex, ground-truth humans.
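
The claim that a very low discount factor can explain why a planning agent abandons a frictionful goal can be checked with a small worked example. The chain model below is a hypothetical toy, not the human model proposed in the paper: the agent pays `effort_cost` on each of `n_steps` steps, receives `goal_reward` on the last one, and can instead quit immediately for a return of 0.

```python
def value_of_persisting(n_steps, effort_cost, goal_reward, gamma):
    """Discounted return of completing every step (quitting is worth 0)."""
    discounted_effort = sum(effort_cost * gamma ** t for t in range(n_steps))
    discounted_goal = goal_reward * gamma ** (n_steps - 1)
    return discounted_goal - discounted_effort


def persists(n_steps, effort_cost, goal_reward, gamma):
    """A planning agent keeps going only if persisting beats quitting."""
    return value_of_persisting(n_steps, effort_cost, goal_reward, gamma) > 0.0
```

With 10 steps, unit effort cost and a goal reward of 20, an agent with `gamma = 0.99` values persisting at roughly 8.7 and continues, whereas an agent with `gamma = 0.5` values it at roughly -1.96 and quits, even though the task and rewards are identical.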

Traffic Learning and Proactive UAV Trajectory Planning for Data Uplink in Markovian IoT Models

Authors:Eslam Eldeeb, Mohammad Shehab, Hirley Alves
Date:2024-01-24 21:57:55

The age of information (AoI) is used to measure the freshness of the data. In IoT networks, the traditional resource management schemes rely on a message exchange between the devices and the base station (BS) before communication, which causes high AoI, high energy consumption, and low reliability. Unmanned aerial vehicles (UAVs) as flying BSs have many advantages in minimizing the AoI, saving energy, and improving throughput. In this paper, we present a novel learning-based framework that estimates the traffic arrival of IoT devices based on Markovian events. The learning proceeds to optimize the trajectory of multiple UAVs and their scheduling policy. First, the BS predicts the future traffic of the devices. We compare two traffic predictors: the forward algorithm (FA) and the long short-term memory (LSTM). Afterward, we propose a deep reinforcement learning (DRL) approach to find the optimal policy of each UAV. Finally, we manipulate the optimum reward function for the proposed DRL approach. Simulation results show that the proposed algorithm outperforms the random-walk (RW) baseline model regarding the AoI, scheduling accuracy, and transmission power.
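
As a minimal sketch of how a forward-algorithm traffic predictor might look, the code below filters a hidden Markov model over a device's observed slots and then predicts the probability of a packet in the next slot. The two-state (idle/active) model, the matrices, and the function names are assumptions for illustration; the abstract does not give the authors' traffic model.

```python
def _propagate(belief, transition):
    """One step of the hidden Markov chain: belief' = belief @ transition."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def forward_filter(observations, initial, transition, emission):
    """Forward algorithm: P(hidden state | all observations so far)."""
    belief = list(initial)
    for t, obs in enumerate(observations):
        if t > 0:
            belief = _propagate(belief, transition)
        # Weight each state by how well it explains the observation, then normalise.
        belief = [b * emission[j][obs] for j, b in enumerate(belief)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief


def predict_arrival(belief, transition, emission, packet=1):
    """Probability that the next slot carries a packet."""
    nxt = _propagate(belief, transition)
    return sum(p * emission[j][packet] for j, p in enumerate(nxt))


# Hypothetical model: state 0 = idle, state 1 = active; obs 0 = silence, 1 = packet.
INITIAL = [0.5, 0.5]
TRANSITION = [[0.9, 0.1], [0.3, 0.7]]
EMISSION = [[0.95, 0.05], [0.2, 0.8]]
```

A scheduler could call `predict_arrival(forward_filter(history, INITIAL, TRANSITION, EMISSION), TRANSITION, EMISSION)` for each device and prioritise those with the highest predicted arrival probability.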

Large language model empowered participatory urban planning

Authors:Zhilun Zhou, Yuming Lin, Yong Li
Date:2024-01-24 10:50:01

Participatory urban planning is the mainstream of modern urban planning and involves the active engagement of different stakeholders. However, the traditional participatory paradigm encounters challenges in time and manpower, while the generative planning tools fail to provide adjustable and inclusive solutions. This research introduces an innovative urban planning approach integrating Large Language Models (LLMs) within the participatory process. The framework, based on the crafted LLM agent, consists of role-play, collaborative generation, and feedback iteration, solving a community-level land-use task catering to 1000 distinct interests. Empirical experiments in diverse urban communities exhibit LLM's adaptability and effectiveness across varied planning scenarios. The results were evaluated on four metrics, surpassing human experts in satisfaction and inclusion, and rivaling state-of-the-art reinforcement learning methods in service and ecology. Further analysis shows the advantage of LLM agents in providing adjustable and inclusive solutions with natural language reasoning and strong scalability. While implementing the recent advancements in emulating human behavior for planning, this work envisions both planners and citizens benefiting from low-cost, efficient LLM agents, which is crucial for enhancing participation and realizing participatory urban planning.

Motion Hologram: Jointly optimized hologram generation and motion planning for photorealistic and speckle-free 3D displays via reinforcement learning

Authors:Zhenxing Dong, Yuye Ling, Yan Li, Yikai Su
Date:2024-01-23 07:43:11

Holography is capable of rendering three-dimensional scenes with full-depth control, and delivering transformative experiences across numerous domains, including virtual and augmented reality, education, and communication. However, traditional holography presents 3D scenes with unnatural defocus and severe speckles due to the limited space bandwidth product of the spatial light modulator (SLM). Here, we introduce Motion Hologram, a novel holographic technique to accurately portray photorealistic and speckle-free 3D scenes, by leveraging a single hologram and learnable motion trajectory, which are jointly optimized within the deep reinforcement learning framework. Specifically, we experimentally demonstrated the proposed technique could achieve a 4-5 dB PSNR improvement of focal stacks in comparison with traditional holography and could successfully depict speckle-free, high-fidelity, and full-color 3D displays using only a commercial SLM for the first time. We believe the proposed method promises a new form of holographic displays that will offer immersive viewing experiences for audiences.

Towards Socially and Morally Aware RL agent: Reward Design With LLM

Authors:Zhaoyue Wang
Date:2024-01-23 03:00:03

When we design and deploy a Reinforcement Learning (RL) agent, the reward function motivates the agent to achieve an objective. An incorrect or incomplete specification of the objective can result in behavior that does not align with human values: the agent may fail to adhere to social and moral norms that are ambiguous and context dependent, and may cause undesired outcomes such as negative side effects and unsafe exploration. Previous work has manually defined reward functions to avoid negative side effects, used human oversight for safe exploration, or used foundation models as planning tools. This work studies the ability to leverage Large Language Models' (LLMs') understanding of morality and social norms in safe-exploration-augmented RL methods. It evaluates the language model's results against human feedback and demonstrates the language model's capability to provide direct reward signals.

Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management

Authors:M. Saifullah, K. G. Papakonstantinou, C. P. Andriotis, S. M. Stoffels
Date:2024-01-23 02:52:36

We present a multi-agent Deep Reinforcement Learning (DRL) framework for managing large transportation infrastructure systems over their life-cycle. Life-cycle management of such engineering systems is a computationally intensive task, requiring appropriate sequential inspection and maintenance decisions able to reduce long-term risks and costs, while dealing with different uncertainties and constraints that lie in high-dimensional spaces. To date, static age- or condition-based maintenance methods and risk-based or periodic inspection plans have mostly addressed this class of optimization problems. However, optimality, scalability, and uncertainty limitations are often manifested under such approaches. The optimization problem in this work is cast in the framework of constrained Partially Observable Markov Decision Processes (POMDPs), which provides a comprehensive mathematical basis for stochastic sequential decision settings with observation uncertainties, risk considerations, and limited resources. To address significantly large state and action spaces, a Deep Decentralized Multi-agent Actor-Critic (DDMAC) DRL method with Centralized Training and Decentralized Execution (CTDE), termed DDMAC-CTDE, is developed. The performance strengths of the DDMAC-CTDE method are demonstrated in a generally representative and realistic example application of an existing transportation network in Virginia, USA. The network includes several bridge and pavement components with nonstationary degradation, agency-imposed constraints, and traffic delay and risk considerations. Compared to traditional management policies for transportation networks, the proposed DDMAC-CTDE method vastly outperforms its counterparts. Overall, the proposed algorithmic framework provides near-optimal solutions for transportation infrastructure management under real-world constraints and complexities.

Multi-Agent Dynamic Relational Reasoning for Social Robot Navigation

Authors:Jiachen Li, Chuanbo Hua, Jianpeng Yao, Hengbo Ma, Jinkyoo Park, Victoria Dax, Mykel J. Kochenderfer
Date:2024-01-22 18:58:22

Social robot navigation can be helpful in various contexts of daily life but requires safe human-robot interactions and efficient trajectory planning. While modeling pairwise relations has been widely studied in multi-agent interacting systems, the ability to capture larger-scale group-wise activities is limited. In this paper, we propose a systematic relational reasoning approach with explicit inference of the underlying dynamically evolving relational structures, and we demonstrate its effectiveness for multi-agent trajectory prediction and social robot navigation. In addition to the edges between pairs of nodes (i.e., agents), we propose to infer hyperedges that adaptively connect multiple nodes to enable group-wise reasoning in an unsupervised manner. Our approach infers dynamically evolving relation graphs and hypergraphs to capture the evolution of relations, which the trajectory predictor employs to generate future states. Meanwhile, we propose to regularize the sharpness and sparsity of the learned relations and the smoothness of the relation evolution, which proves to enhance training stability and model performance. The proposed approach is validated on synthetic crowd simulations and real-world benchmark datasets. Experiments demonstrate that the approach infers reasonable relations and achieves state-of-the-art prediction performance. In addition, we present a deep reinforcement learning (DRL) framework for social robot navigation, which incorporates relational reasoning and trajectory prediction systematically. In a group-based crowd simulation, our method outperforms the strongest baseline by a significant margin in terms of safety, efficiency, and social compliance in dense, interactive scenarios. We also demonstrate the practical applicability of our method with real-world robot experiments. The code and videos can be found at https://relational-reasoning-nav.github.io/.

Adaptive Motion Planning for Multi-fingered Functional Grasp via Force Feedback

Authors:Dongying Tian, Xiangbo Lin, Yi Sun
Date:2024-01-22 14:28:00

Enabling multi-fingered robots to grasp and manipulate objects with human-like dexterity is especially challenging during the dynamic, continuous hand-object interactions. Closed-loop feedback control is essential for dexterous hands to dynamically finetune hand poses when performing precise functional grasps. This work proposes an adaptive motion planning method based on deep reinforcement learning to adjust grasping poses according to real-time feedback from joint torques from pre-grasp to goal grasp. We find the multi-joint torques of the dexterous hand can sense object positions through contacts and collisions, enabling real-time adjustment of grasps to generate varying grasping trajectories for objects in different positions. In our experiments, the performance gap with and without force feedback reveals the important role of force feedback in adaptive manipulation. Our approach utilizing force feedback preliminarily exhibits human-like flexibility, adaptability, and precision.

Obstacle-Aware Navigation of Soft Growing Robots via Deep Reinforcement Learning

Authors:Haitham El-Hussieny, Ibrahim Hameed
Date:2024-01-20 10:35:35

Soft growing robots are a type of robot designed to move and adapt to their environment in a way similar to how plants grow and move, with potential applications in navigating tight spaces, dangerous terrain, and hard-to-reach areas. This research explores the application of a deep reinforcement Q-learning algorithm for facilitating the navigation of soft growing robots in cluttered environments. The proposed algorithm utilizes the flexibility of the soft robot to adapt and incorporate the interaction between the robot and the environment into the decision-making process. Results from simulations show that the proposed algorithm improves the soft robot's ability to navigate effectively and efficiently in confined spaces. This study presents a promising approach to addressing the challenges faced by growing robots in particular and soft robots in general in planning obstacle-aware paths in real-world scenarios.

Meta Reinforcement Learning for Strategic IoT Deployments Coverage in Disaster-Response UAV Swarms

Authors:Marwan Dhuheir, Aiman Erbad, Ala Al-Fuqaha
Date:2024-01-20 05:05:39

In the past decade, Unmanned Aerial Vehicles (UAVs) have attracted the attention of researchers in academia and industry for their potential use in critical emergency applications, such as providing wireless services to ground users and collecting data from areas affected by disasters, due to their advantages in terms of maneuverability and movement flexibility. The UAVs' limited resources, energy budget, and strict mission completion time have posed challenges in adopting UAVs for these applications. Our system model considers a UAV swarm that navigates an area collecting data from ground IoT devices, focusing on providing better service for strategic locations and allowing UAVs to join and leave the swarm (e.g., for recharging) in a dynamic way. In this work, we introduce an optimization model that aims to minimize the total energy consumption and provide optimal path planning for the UAVs under the constraints of minimum completion time and transmit power. The formulated optimization is NP-hard, making it unsuitable for real-time decision making. Therefore, we introduce a light-weight meta-reinforcement learning solution that can also cope with sudden changes in the environment through fast convergence. We conduct extensive simulations and compare our approach to three state-of-the-art learning models. Our simulation results show that our approach outperforms the three state-of-the-art algorithms in providing coverage to strategic locations with fast convergence.

Robotic Test Tube Rearrangement Using Combined Reinforcement Learning and Motion Planning

Authors:Hao Chen, Weiwei Wan, Masaki Matsushita, Takeyuki Kotaka, Kensuke Harada
Date:2024-01-18 07:42:51

A combined task-level reinforcement learning and motion planning framework is proposed in this paper to address a multi-class in-rack test tube rearrangement problem. At the task level, the framework uses reinforcement learning to infer a sequence of swap actions while ignoring robotic motion details. At the motion level, the framework accepts the swapping action sequences inferred by task-level agents and plans the detailed robotic pick-and-place motion. The task and motion-level planning form a closed loop with the help of a condition set maintained for each rack slot, which allows the framework to perform replanning and effectively find solutions in the presence of low-level failures. Particularly for reinforcement learning, the framework leverages a distributed deep Q-learning structure with the Dueling Double Deep Q Network (D3QN) to acquire near-optimal policies and uses an A*-based post-processing technique to amplify the collected training data. The D3QN and distributed learning help increase training efficiency. The post-processing helps complete unfinished action sequences and remove redundancy, thus making the training data more effective. We carry out both simulations and real-world studies to understand the performance of the proposed framework. The results verify the performance of the RL and post-processing and show that the closed-loop combination improves robustness. The framework is ready to incorporate various sensory feedback. The real-world studies also demonstrated the incorporation.
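
For readers unfamiliar with D3QN, the two ingredients its name refers to can be written in a few lines. The sketch below shows the standard dueling aggregation and the standard double-DQN target using plain NumPy arrays in place of network outputs. It is a generic illustration, not the authors' implementation, and the function names and example numbers are assumptions.

```python
import numpy as np

def dueling_q(value, advantage):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    value:     shape (batch, 1)
    advantage: shape (batch, n_actions)
    """
    return value + advantage - advantage.mean(axis=1, keepdims=True)

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double-DQN target: the online network picks the next action,
    the target network evaluates it."""
    best = q_online_next.argmax(axis=1)
    bootstrap = q_target_next[np.arange(len(best)), best]
    return reward + gamma * (1.0 - done) * bootstrap

# Toy numbers standing in for network outputs.
q = dueling_q(np.array([[1.0]]), np.array([[1.0, 2.0, 3.0]]))
y = double_dqn_target(
    reward=np.array([1.0]),
    done=np.array([0.0]),
    q_online_next=np.array([[0.0, 1.0, 2.0]]),
    q_target_next=np.array([[5.0, 6.0, 7.0]]),
    gamma=0.5,
)
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, and decoupling action selection from evaluation reduces the overestimation bias of the plain Q-learning target.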

LLMs for Relational Reasoning: How Far are We?

Authors:Zhiming Li, Yushi Cao, Xiufeng Xu, Junzhe Jiang, Xu Liu, Yon Shin Teo, Shang-wei Lin, Yang Liu
Date:2024-01-17 08:22:52

Large language models (LLMs) have revolutionized many areas (e.g., natural language processing and software engineering) by achieving state-of-the-art performance on extensive downstream tasks. Aiming to achieve robust and general artificial intelligence, there has been a surge of interest in investigating the reasoning ability of the LLMs. Because the textual and numerical reasoning benchmarks adopted by previous works are rather shallow and simple, it is hard to conclude that the LLMs possess strong reasoning ability by merely achieving positive results on these benchmarks. Recent efforts have demonstrated that the LLMs are poor at solving sequential decision-making problems that require common-sense planning by evaluating their performance on the reinforcement learning benchmarks. In this work, we conduct an in-depth assessment of several state-of-the-art LLMs' reasoning ability based on the inductive logic programming (ILP) benchmark, which is broadly recognized as a representative and challenging measurement for evaluating logic program induction/synthesis systems as it requires inducing strict cause-effect logic to achieve robust deduction on independent and identically distributed (IID) and out-of-distribution (OOD) test samples. Our evaluations illustrate that compared with the neural program induction systems which are much smaller in model size, the state-of-the-art LLMs are much poorer in terms of reasoning ability by achieving much lower performance and generalization using either natural language prompting or truth-value matrix prompting.

Go-Explore for Residential Energy Management

Authors:Junlin Lu, Patrick Mannion, Karl Mason
Date:2024-01-15 14:26:44

Reinforcement learning is commonly applied in residential energy management, particularly for optimizing energy costs. However, RL agents often face challenges when dealing with deceptive and sparse rewards in the energy control domain, especially with stochastic rewards. In such situations, thorough exploration becomes crucial for learning an optimal policy. Unfortunately, the exploration mechanism can be misled by deceptive reward signals, making thorough exploration difficult. Go-Explore is a family of algorithms which combines planning methods and reinforcement learning methods to achieve efficient exploration. We use the Go-Explore algorithm to solve the cost-saving task in residential energy management problems and achieve an improvement of up to 19.84% compared to well-known reinforcement learning algorithms.

Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Case Study

Authors:Shangding Gu
Date:2024-01-12 14:35:57

Large Language Models (LLMs) have demonstrated remarkable capabilities for reinforcement learning (RL) models, such as planning and reasoning capabilities. However, open problems remain in the collaboration between LLMs and RL models. In this study, we employ a teacher-student learning framework to tackle these problems, specifically by offering feedback for LLMs using RL models and providing high-level information for RL models with LLMs in a cooperative multi-agent setting. Within this framework, the LLM acts as a teacher, while the RL model acts as a student. The two agents cooperatively assist each other through a process of recursive help, such as "I help you help I help." The LLM agent supplies abstract information to the RL agent, enabling efficient exploration and policy improvement. In turn, the RL agent offers feedback to the LLM agent, providing valuable, real-time information that helps generate more useful tokens. This bi-directional feedback loop promotes optimization, exploration, and mutual improvement for both agents, enabling them to accomplish increasingly challenging tasks. We propose a practical algorithm to address the problem and conduct empirical experiments to evaluate the effectiveness of our method.

Spatial-Aware Deep Reinforcement Learning for the Traveling Officer Problem

Authors:Niklas Strauß, Matthias Schubert
Date:2024-01-11 15:16:20

The traveling officer problem (TOP) is a challenging stochastic optimization task. In this problem, a parking officer is guided through a city equipped with parking sensors to fine as many parking offenders as possible. A major challenge in TOP is the dynamic nature of parking offenses, which randomly appear and disappear after some time, regardless of whether they have been fined. Thus, solutions need to dynamically adjust to currently fineable parking offenses while also planning ahead to increase the likelihood that the officer arrives while an offense is still taking place. Though various solutions exist, these methods often struggle to take the implications of actions on the ability to fine future parking violations into account. This paper proposes SATOP, a novel spatial-aware deep reinforcement learning approach for TOP. Our novel state encoder creates a representation of each action, leveraging the spatial relationships between parking spots, the agent, and the action. Furthermore, we propose a novel message-passing module for learning future inter-action correlations in the given environment. Thus, the agent can estimate the potential to fine further parking violations after executing an action. We evaluate our method using an environment based on real-world data from Melbourne. Our results show that SATOP consistently outperforms state-of-the-art TOP agents and is able to fine up to 22% more parking offenses.

Graph Learning-based Fleet Scheduling for Urban Air Mobility under Operational Constraints, Varying Demand & Uncertainties

Authors:Steve Paul, Jhoel Witter, Souma Chowdhury
Date:2024-01-09 23:46:22

This paper develops a graph reinforcement learning approach to online planning of the schedule and destinations of electric aircraft that comprise an urban air mobility (UAM) fleet operating across multiple vertiports. This fleet scheduling problem is formulated to consider time-varying demand, constraints related to vertiport capacity, aircraft capacity and airspace safety guidelines, uncertainties related to take-off delay, weather-induced route closures, and unanticipated aircraft downtime. Collectively, such a formulation presents greater complexity, and potentially increased realism, than in existing UAM fleet planning implementations. To address these complexities, a new policy architecture is constructed, primary components of which include: graph capsule conv-nets for encoding vertiport and aircraft-fleet states both abstracted as graphs; transformer layers encoding time series information on demand and passenger fare; and a Multi-head Attention-based decoder that uses the encoded information to compute the probability of selecting each available destination for an aircraft. Trained with Proximal Policy Optimization, this policy architecture shows significantly better performance in terms of daily averaged profits on unseen test scenarios involving 8 vertiports and 40 aircraft, when compared to a random baseline and genetic algorithm-derived optimal solutions, while being nearly 1000 times faster in execution than the latter.

Learn Once Plan Arbitrarily (LOPA): Attention-Enhanced Deep Reinforcement Learning Method for Global Path Planning

Authors:Guoming Huang, Mingxin Hou, Xiaofang Yuan, Shuqiao Huang, Yaonan Wang
Date:2024-01-08 02:27:14

Deep reinforcement learning (DRL) methods have recently shown promise in path planning tasks. However, when dealing with global planning tasks, these methods face serious challenges such as poor convergence and generalization. To this end, we propose an attention-enhanced DRL method called LOPA (Learn Once Plan Arbitrarily) in this paper. Firstly, we analyze the reasons for these problems from the perspective of the DRL observation, revealing that the traditional design causes DRL to be distracted by irrelevant map information. Secondly, we develop LOPA, which utilizes a novel attention-enhanced mechanism to attain an improved attention capability towards the key information of the observation. Such a mechanism is realized in two steps: (1) an attention model is built to transform the DRL observation into two dynamic views, local and global, guiding LOPA to focus on the key information on the given maps; (2) a dual-channel network is constructed to process these two views and integrate them to attain an improved reasoning capability. LOPA is validated via multi-objective global path planning experiments. The results suggest that LOPA has improved convergence and generalization performance as well as great path planning efficiency.

NovelGym: A Flexible Ecosystem for Hybrid Planning and Learning Agents Designed for Open Worlds

Authors:Shivam Goel, Yichen Wei, Panagiotis Lymperopoulos, Klara Chura, Matthias Scheutz, Jivko Sinapov
Date:2024-01-07 17:13:28

As AI agents leave the lab and venture into the real world as autonomous vehicles, delivery robots, and cooking robots, it is increasingly necessary to design and comprehensively evaluate algorithms that tackle the "open-world". To this end, we introduce NovelGym, a flexible and adaptable ecosystem designed to simulate gridworld environments, serving as a robust platform for benchmarking reinforcement learning (RL) and hybrid planning and learning agents in open-world contexts. The modular architecture of NovelGym facilitates rapid creation and modification of task environments, including multi-agent scenarios, with multiple environment transformations, thus providing a dynamic testbed for researchers to develop open-world AI agents.

Decision Making in Non-Stationary Environments with Policy-Augmented Search

Authors:Ava Pettet, Yunuo Zhang, Baiting Luo, Kyle Wray, Hendrik Baier, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
Date:2024-01-06 11:51:50

Sequential decision-making under uncertainty is present in many important problems. Two popular approaches for tackling such problems are reinforcement learning and online search (e.g., Monte Carlo tree search). While the former learns a policy by interacting with the environment (typically done before execution), the latter uses a generative model of the environment to sample promising action trajectories at decision time. Decision-making is particularly challenging in non-stationary environments, where the environment in which an agent operates can change over time. Both approaches have shortcomings in such settings -- on the one hand, policies learned before execution become stale when the environment changes and relearning takes both time and computational effort. Online search, on the other hand, can return sub-optimal actions when there are limitations on allowed runtime. In this paper, we introduce Policy-Augmented Monte Carlo tree search (PA-MCTS), which combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment. We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action and also bound the error accrued while following PA-MCTS as a policy. We compare and contrast our approach with AlphaZero, another hybrid planning approach, and Deep Q Learning on several OpenAI Gym environments. Through extensive experiments, we show that under non-stationary settings with limited time constraints, PA-MCTS outperforms these baselines.
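
One simple way to combine stale action-value estimates with fresh search results is a convex combination. The sketch below assumes that form: the weight `alpha`, the function name, and the example numbers are illustrative assumptions, since the abstract states only that the two sources are combined.

```python
import numpy as np

def select_action(q_stale, search_returns, alpha):
    """Pick the action maximizing a weighted mix of two estimates.

    q_stale:        action values from a policy trained before the environment changed
    search_returns: per-action return estimates from online search on the current model
    alpha:          assumed mixing weight in [0, 1]; 1.0 trusts the stale policy only
    """
    scores = alpha * np.asarray(q_stale) + (1.0 - alpha) * np.asarray(search_returns)
    return int(np.argmax(scores))

# The stale policy prefers action 0; the search on the updated model prefers action 1.
q_stale = [1.0, 0.0]
search_returns = [0.0, 1.0]
trust_policy = select_action(q_stale, search_returns, alpha=0.9)
trust_search = select_action(q_stale, search_returns, alpha=0.1)
```

Under this assumption, the mixing weight controls how strongly the agent relies on the out-of-date policy when the search budget is too small for the search estimates to be reliable on their own.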

Adaptive Discounting of Training Time Attacks

Authors:Ridhima Bector, Abhay Aradhya, Chai Quek, Zinovi Rabinovich
Date:2024-01-05 06:03:14

Among the most insidious attacks on Reinforcement Learning (RL) solutions are training-time attacks (TTAs) that create loopholes and backdoors in the learned behaviour. Not limited to a simple disruption, constructive TTAs (C-TTAs) are now available, where the attacker forces a specific, target behaviour upon a training RL agent (victim). However, even state-of-the-art C-TTAs focus on target behaviours that could be naturally adopted by the victim if not for a particular feature of the environment dynamics, which C-TTAs exploit. In this work, we show that a C-TTA is possible even when the target behaviour is un-adoptable due to both environment dynamics as well as non-optimality with respect to the victim objective(s). To find efficient attacks in this context, we develop a specialised flavour of the DDPG algorithm, which we term gammaDDPG, that learns this stronger version of C-TTA. gammaDDPG dynamically alters the attack policy planning horizon based on the victim's current behaviour. This improves effort distribution throughout the attack timeline and reduces the effect of uncertainty the attacker has about the victim. To demonstrate the features of our method and better relate the results to prior research, we borrow a 3D grid domain from a state-of-the-art C-TTA for our experiments. Code is available at "bit.ly/github-rb-gDDPG".

Simple Hierarchical Planning with Diffusion

Authors:Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, Sungjin Ahn
Date:2024-01-05 05:28:40

Diffusion-based generative methods have proven effective in modeling trajectories with offline datasets. However, they often face computational challenges and can falter in generalization, especially in capturing temporal abstractions for long-horizon tasks. To overcome this, we introduce the Hierarchical Diffuser, a simple, fast, yet surprisingly effective planning method combining the advantages of hierarchical and diffusion-based planning. Our model adopts a "jumpy" planning strategy at the higher level, which allows it to have a larger receptive field but at a lower computational cost -- a crucial factor for diffusion-based planning methods, as we have empirically verified. Additionally, the jumpy sub-goals guide our low-level planner, facilitating a fine-tuning stage and further improving our approach's effectiveness. We conducted empirical evaluations on standard offline reinforcement learning benchmarks, demonstrating our method's superior performance and efficiency in terms of training and planning speed compared to the non-hierarchical Diffuser as well as other hierarchical planning methods. Moreover, we explore our model's generalization capability, particularly on how our method improves generalization capabilities on compositional out-of-distribution tasks.

A comprehensive survey of research towards AI-enabled unmanned aerial systems in pre-, active-, and post-wildfire management

Authors:Sayed Pedram Haeri Boroujeni, Abolfazl Razi, Sahand Khoshdel, Fatemeh Afghah, Janice L. Coen, Leo ONeill, Peter Z. Fule, Adam Watts, Nick-Marios T. Kokolakis, Kyriakos G. Vamvoudakis
Date:2024-01-04 05:09:35

Wildfires have emerged as one of the most destructive natural disasters worldwide, causing catastrophic losses in both human lives and forest wildlife. Recently, the use of Artificial Intelligence (AI) in wildfires, propelled by the integration of Unmanned Aerial Vehicles (UAVs) and deep learning models, has created an unprecedented momentum to implement and develop more effective wildfire management. Although some of the existing survey papers have explored various learning-based approaches, a comprehensive review emphasizing the application of AI-enabled UAV systems and their subsequent impact on multi-stage wildfire management is notably lacking. This survey aims to bridge these gaps by offering a systematic review of the recent state-of-the-art technologies, highlighting the advancements of UAV systems and AI models from pre-fire, through the active-fire stage, to post-fire management. To this aim, we provide an extensive analysis of the existing remote sensing systems with a particular focus on the UAV advancements, device specifications, and sensor technologies relevant to wildfire management. We also examine the pre-fire and post-fire management approaches, including fuel monitoring, prevention strategies, as well as evacuation planning, damage assessment, and operation strategies. Additionally, we review and summarize a wide range of computer vision techniques in active-fire management, with an emphasis on Machine Learning (ML), Reinforcement Learning (RL), and Deep Learning (DL) algorithms for wildfire classification, segmentation, detection, and monitoring tasks. Ultimately, we underscore the substantial advancement in wildfire modeling through the integration of cutting-edge AI techniques and UAV-based data, providing novel insights and enhanced predictive capabilities to understand dynamic wildfire behavior.

Optimizing UAV-UGV Coalition Operations: A Hybrid Clustering and Multi-Agent Reinforcement Learning Approach for Path Planning in Obstructed Environment

Authors:Shamyo Brotee, Farhan Kabir, Md. Abdur Razzaque, Palash Roy, Md. Mamun-Or-Rashid, Md. Rafiul Hassan, Mohammad Mehedi Hassan
Date:2024-01-03 01:09:56

One of the most critical applications undertaken by coalitions of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is reaching predefined targets by following the most time-efficient routes while avoiding collisions. Unfortunately, UAVs are hampered by limited battery life, and UGVs face challenges in reachability due to obstacles and elevation variations. Existing literature primarily focuses on one-to-one coalitions, which constrains the efficiency of reaching targets. In this work, we introduce a novel approach for a UAV-UGV coalition with a variable number of vehicles, employing a modified mean-shift clustering algorithm to segment targets into multiple zones. Each vehicle utilizes Multi-agent Deep Deterministic Policy Gradient (MADDPG) and Multi-agent Proximal Policy Optimization (MAPPO), two advanced reinforcement learning algorithms, to form an effective coalition for navigating obstructed environments without collisions. This approach of assigning targets to various circular zones, based on density and range, significantly reduces the time required to reach these targets. Moreover, introducing variability in the number of UAVs and UGVs in a coalition enhances task efficiency by enabling simultaneous multi-target engagement. The results of our experimental evaluation demonstrate that our proposed method substantially surpasses current state-of-the-art techniques, nearly doubling efficiency in terms of target navigation time and task completion rate.
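
The zone-segmentation step can be pictured with ordinary mean-shift clustering. The sketch below implements the standard flat-kernel version, not the modified variant used in the paper, and the target coordinates, `bandwidth`, and function name are assumptions for illustration.

```python
import numpy as np

def mean_shift(points, bandwidth, iters=50):
    """Flat-kernel mean shift: move each point's mode to the mean of the
    data points within `bandwidth`, then merge modes that end up close together."""
    points = np.asarray(points, dtype=float)
    modes = points.copy()
    for _ in range(iters):
        for i in range(len(modes)):
            dist = np.linalg.norm(points - modes[i], axis=1)
            modes[i] = points[dist <= bandwidth].mean(axis=0)

    centers = []
    labels = np.empty(len(points), dtype=int)
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = k          # mode coincides with an existing center
                break
        else:
            centers.append(m.copy())   # first mode seen in a new zone
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# Hypothetical target positions forming two dense groups.
targets = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, labels = mean_shift(targets, bandwidth=1.0)
```

Each resulting center and bandwidth define a circular zone, so a group of vehicles could be assigned per zone, though the actual assignment rule in the paper is not described in the abstract.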

Data Assimilation in Chaotic Systems Using Deep Reinforcement Learning

Authors:Mohamad Abed El Rahman Hammoud, Naila Raboudi, Edriss S. Titi, Omar Knio, Ibrahim Hoteit
Date:2024-01-01 06:53:36

Data assimilation (DA) plays a pivotal role in diverse applications, ranging from climate predictions and weather forecasts to trajectory planning for autonomous vehicles. A prime example is the widely used ensemble Kalman filter (EnKF), which relies on linear updates to minimize variance among the ensemble of forecast states. Recent advancements have seen the emergence of deep learning approaches in this domain, primarily within a supervised learning framework. However, the adaptability of such models to untrained scenarios remains a challenge. In this study, we introduce a novel DA strategy that utilizes reinforcement learning (RL) to apply state corrections using full or partial observations of the state variables. Our investigation focuses on demonstrating this approach to the chaotic Lorenz '63 system, where the agent's objective is to minimize the root-mean-squared error between the observations and corresponding forecast states. Consequently, the agent develops a correction strategy, enhancing model forecasts based on available system state observations. Our strategy employs a stochastic action policy, enabling a Monte Carlo-based DA framework that relies on randomly sampling the policy to generate an ensemble of assimilated realizations. Results demonstrate that the developed RL algorithm performs favorably when compared to the EnKF. Additionally, we illustrate the agent's capability to assimilate non-Gaussian data, addressing a significant limitation of the EnKF.
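
As a rough illustration of the setting described in this abstract, the sketch below integrates the Lorenz '63 system with a fourth-order Runge-Kutta step and scores a state correction by the RMSE against full or partial observations. The classic parameter values (sigma = 10, rho = 28, beta = 8/3) are the standard ones for this system; the integrator choice, the additive form of the correction, and the use of negative RMSE as the reward are assumptions, not details taken from the paper.

```python
import math

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0  # classic Lorenz '63 parameters


def lorenz63(state):
    """Right-hand side of the Lorenz '63 equations."""
    x, y, z = state
    return (SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z)


def rk4_step(state, dt):
    """Advance the state by one fourth-order Runge-Kutta step."""
    def add(a, b, scale):
        return tuple(ai + scale * bi for ai, bi in zip(a, b))

    k1 = lorenz63(state)
    k2 = lorenz63(add(state, k1, dt / 2))
    k3 = lorenz63(add(state, k2, dt / 2))
    k4 = lorenz63(add(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def rmse(forecast, observation, observed_idx=(0, 1, 2)):
    """RMSE over the observed components only (partial observations allowed)."""
    sq = [(forecast[i] - observation[i]) ** 2 for i in observed_idx]
    return math.sqrt(sum(sq) / len(sq))


def reward(forecast, correction, observation, observed_idx=(0, 1, 2)):
    """Assumed reward: negative RMSE after applying an additive correction."""
    corrected = tuple(f + c for f, c in zip(forecast, correction))
    return -rmse(corrected, observation, observed_idx)


# One forecast step from a hypothetical initial state, then a scored correction.
forecast = rk4_step((1.0, 1.0, 1.0), dt=0.01)
score = reward(forecast, (0.0, 0.0, 0.0), (1.0, 1.3, 1.0), observed_idx=(0, 1))
```

Sampling many corrections from a stochastic policy and applying each one would yield an ensemble of assimilated states, in the spirit of the Monte Carlo framework the abstract mentions.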

Explicit-Implicit Subgoal Planning for Long-Horizon Tasks with Sparse Reward

Authors:Fangyuan Wang, Anqing Duan, Peng Zhou, Shengzeng Huo, Guodong Guo, Chenguang Yang, David Navarro-Alarcon
Date:2023-12-25 01:21:34

The challenges inherent in long-horizon tasks in robotics persist due to the typically inefficient exploration and sparse rewards in traditional reinforcement learning approaches. To address these challenges, we have developed a novel algorithm, termed Explicit-Implicit Subgoal Planning (EISP), designed to tackle long-horizon tasks through a divide-and-conquer approach. We utilize two primary criteria, feasibility and optimality, to ensure the quality of the generated subgoals. EISP consists of three components: a hybrid subgoal generator, a hindsight sampler, and a value selector. The hybrid subgoal generator uses an explicit model to infer subgoals and an implicit model to predict the final goal. This design is inspired by human thinking, which infers subgoals from the current state and the final goal, and reasons about the final goal conditioned on the current state and the given subgoals. Additionally, the hindsight sampler selects valid subgoals from an offline dataset to enhance the feasibility of the generated subgoals, while the value selector utilizes the value function in reinforcement learning to filter the optimal subgoals from subgoal candidates. To validate our method, we conduct four long-horizon tasks in both simulation and the real world. The obtained quantitative and qualitative data indicate that our approach achieves promising performance compared to other baseline methods. These experimental results can be seen on the website https://sites.google.com/view/vaesi.

An investigation of belief-free DRL and MCTS for inspection and maintenance planning

Authors:Daniel Koutas, Elizabeth Bismut, Daniel Straub
Date:2023-12-22 16:53:02

We propose a novel Deep Reinforcement Learning (DRL) architecture for sequential decision processes under uncertainty, as encountered in inspection and maintenance (I&M) planning. Unlike other DRL algorithms for I&M planning, the proposed +RQN architecture dispenses with computing the belief state and directly handles erroneous observations instead. We apply the algorithm to a basic I&M planning problem for a one-component system subject to deterioration. In addition, we investigate the performance of Monte Carlo tree search for the I&M problem and compare it to the +RQN. The comparison includes a statistical analysis of the two methods' resulting policies, as well as their visualization in the belief space.

RFRL Gym: A Reinforcement Learning Testbed for Cognitive Radio Applications

Authors:Daniel Rosen, Illa Rochez, Caleb McIrvin, Joshua Lee, Kevin D'Alessandro, Max Wiecek, Nhan Hoang, Ramzy Saffarini, Sam Philips, Vanessa Jones, Will Ivey, Zavier Harris-Smart, Zavion Harris-Smart, Zayden Chin, Amos Johnson, Alyse M. Jones, William C. Headley
Date:2023-12-20 15:00:10

Radio Frequency Reinforcement Learning (RFRL) is anticipated to be a widely applicable technology in the next generation of wireless communication systems, particularly 6G and next-gen military communications. Given this, our research is focused on developing a tool to promote the development of RFRL techniques that leverage spectrum sensing. In particular, the tool was designed to address two cognitive radio applications, specifically dynamic spectrum access and jamming. In order to train and test reinforcement learning (RL) algorithms for these applications, a simulation environment is necessary to simulate the conditions that an agent will encounter within the Radio Frequency (RF) spectrum. In this paper, such an environment has been developed, herein referred to as the RFRL Gym. Through the RFRL Gym, users can design their own scenarios to model what an RL agent may encounter within the RF spectrum as well as experiment with different spectrum sensing techniques. Additionally, the RFRL Gym is a subclass of OpenAI gym, enabling the use of third-party ML/RL libraries. We plan to open-source this codebase to enable other researchers to utilize the RFRL Gym to test their own scenarios and RL algorithms, ultimately leading to the advancement of RL research in the wireless communications domain. This paper describes in further detail the components of the Gym, results from example scenarios, and plans for future additions. Index Terms: machine learning, reinforcement learning, wireless communications, dynamic spectrum access, OpenAI gym

Stable Relay Learning Optimization Approach for Fast Power System Production Cost Minimization Simulation

Authors:Zishan Guo, Qinran Hu, Tao Qian, Xin Fang, Renjie Hu, Zaijun Wu
Date:2023-12-19 06:39:52

Production cost minimization (PCM) simulation is commonly employed for assessing the operational efficiency, economic viability, and reliability, providing valuable insights for power system planning and operations. However, solving a PCM problem is time-consuming, consisting of numerous binary variables for simulation horizon extending over months and years. This hinders rapid assessment of modern energy systems with diverse planning requirements. Existing methods for accelerating PCM tend to sacrifice accuracy for speed. In this paper, we propose a stable relay learning optimization (s-RLO) approach within the Branch and Bound (B&B) algorithm. The proposed approach offers rapid and stable performance, and ensures optimal solutions. The two-stage s-RLO involves an imitation learning (IL) phase for accurate policy initialization and a reinforcement learning (RL) phase for time-efficient fine-tuning. When implemented on the popular SCIP solver, s-RLO returns the optimal solution up to 2 times faster than the default relpscost rule and 1.4 times faster than IL, or exhibits a smaller gap at the predefined time limit. The proposed approach shows stable performance, reducing fluctuations by approximately 50% compared with IL. The efficacy of the proposed s-RLO approach is supported by numerical results.

Spatial Deep Learning for Site-Specific Movement Optimization of Aerial Base Stations

Authors:Jiangbin Lyu, Xu Chen, Jiefeng Zhang, Liqun Fu
Date:2023-12-16 15:52:13

Unmanned aerial vehicles (UAVs) can be utilized as aerial base stations (ABSs) to provide wireless connectivity for ground users (GUs) in various emergency scenarios. However, maximizing the coverage rate of $M$ GUs by jointly placing $N$ ABSs with limited coverage range is an NP-hard problem with exponential complexity in $M$ and $N$. The problem is further complicated when the coverage range becomes irregular due to site-specific blockages (e.g., buildings) on the air-ground channel, and/or when the GUs are moving. To address the above challenges, we study a multi-ABS movement optimization problem to maximize the average coverage rate of mobile GUs in a site-specific environment. The Spatial Deep Learning with Multi-dimensional Archive of Phenotypic Elites (SDL-ME) algorithm is proposed to tackle this challenging problem by 1) partitioning the complicated ABS movement problem into ABS placement sub-problems, each spanning a finite time horizon; 2) using an encoder-decoder deep neural network (DNN) as the emulator to capture the spatial correlation of ABSs/GUs, thereby reducing the cost of interaction with the actual environment; 3) employing the emulator to speed up a quality-diversity search for the optimal placement solution; and 4) proposing a planning-exploration-serving scheme for multi-ABS movement coordination. Numerical results demonstrate that the proposed approach significantly outperforms the benchmark Deep Reinforcement Learning (DRL)-based method and two other baselines in terms of average coverage rate, training time and/or sample efficiency. Moreover, with one-time training, our proposed method can be applied in scenarios where the number of ABSs/GUs dynamically changes on site and/or with different/varying GU speeds, and is thus more robust and flexible than conventional DRL-based methods.

Leveraging User Simulation to Develop and Evaluate Conversational Information Access Agents

Authors:Nolwenn Bernard
Date:2023-12-13 10:40:15

We observe a change in the way users access information, that is, the rise of conversational information access (CIA) agents. However, the automatic evaluation of these agents remains an open challenge. Moreover, the training of CIA agents is cumbersome as it mostly relies on conversational corpora, expert knowledge, and reinforcement learning. User simulation has been identified as a promising solution to tackle automatic evaluation and has been previously used in reinforcement learning. In this research, we investigate how user simulation can be leveraged in the context of CIA. We organize the work in three parts. We begin with the identification of requirements for user simulators for training and evaluating CIA agents and compare existing types of simulator regarding these. Then, we plan to combine these different types of simulators into a new hybrid simulator. Finally, we aim to extend simulators to handle more complex information seeking scenarios.

Sequential Planning in Large Partially Observable Environments guided by LLMs

Authors:Swarna Kamal Paul
Date:2023-12-12 15:36:59

Sequential planning in large state and action spaces quickly becomes intractable due to the combinatorial explosion of the search space. Heuristic methods such as Monte Carlo tree search, though effective for large state spaces, struggle when the action space is large. Pure reinforcement learning methods, relying only on reward signals, need a prohibitively large number of interactions with the environment to devise a viable plan. If the state space, observations, and actions can be represented in natural language, then Large Language Models (LLMs) can be used to generate action plans. Recently, several such goal-directed agents, like Reflexion, CLIN, and SayCan, were able to surpass the performance of other state-of-the-art methods with minimal or no task-specific training. But they still struggle with exploration and get stuck in local optima. Their planning capabilities are limited by the limited reasoning capability of the foundational LLMs on text data. We propose a hybrid agent, "neoplanner", that synergizes state space search with queries to a foundational LLM to get the best action plan. The reward signals are quantitatively used to drive the search. A balance of exploration and exploitation is maintained by maximizing upper confidence bounds of values of states. In places where random exploration is needed, the LLM is queried to generate an action plan. Learnings from each trial are stored as entity relationships in text format. Those are used in future queries to the LLM for continual improvement. Experiments in the Scienceworld environment reveal a 124% improvement over the current best method in terms of average reward gained across multiple tasks.
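
The idea of balancing exploration and exploitation by maximizing upper confidence bounds of state values can be sketched as follows. The abstract does not specify the exact bound, so the UCB1-style formula, the exploration constant `c`, and the names `ucb_score` and `select_state` are assumptions for illustration only.

```python
import math


def ucb_score(mean_value, visits, total_visits, c=math.sqrt(2)):
    """UCB1-style score: value estimate plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited states are always tried first
    return mean_value + c * math.sqrt(math.log(total_visits) / visits)


def select_state(stats, c=math.sqrt(2)):
    """Pick the state with the highest upper confidence bound.

    `stats` maps a state identifier to (mean_value, visit_count).
    """
    total = sum(n for _, n in stats.values())
    return max(stats, key=lambda s: ucb_score(stats[s][0], stats[s][1], total, c))


# Hypothetical statistics: a well-explored state and a rarely visited one.
stats = {"open_door": (0.5, 10), "pick_up_beaker": (0.4, 1)}
chosen = select_state(stats)
```

With these numbers the rarely visited state wins because its exploration bonus outweighs its slightly lower value estimate; setting `c=0` reduces the rule to pure exploitation.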

Learning from Interaction: User Interface Adaptation using Reinforcement Learning

Authors:Daniel Gaspar-Figueiredo
Date:2023-12-12 12:29:18

The continuous adaptation of software systems to meet the evolving needs of users is very important for enhancing user experience (UX). User interface (UI) adaptation, which involves adjusting the layout, navigation, and content presentation based on user preferences and contextual conditions, plays an important role in achieving this goal. However, suggesting the right adaptation at the right time and in the right place, so that it is valuable for the end-user, remains a challenge. To tackle this challenge, machine learning approaches could be used. In particular, we are using Reinforcement Learning (RL) due to its ability to learn from interaction with the users. In this approach, feedback is very important, and the use of physiological data could be beneficial for obtaining objective insights into how users are reacting to the different adaptations. Thus, in this PhD thesis, we propose an RL-based UI adaptation framework that uses physiological data. The framework aims to learn from user interactions and make informed adaptations to improve UX. To this end, our research aims to answer the following questions: Does the use of an RL-based approach improve UX? How effective is RL in guiding UI adaptation? Can physiological data support UI adaptation for enhancing UX? The evaluation plan involves conducting user studies to answer these questions. The empirical evaluation will provide a strong empirical foundation for building, evaluating, and improving the proposed adaptation framework. The expected contributions of this research include the development of a novel framework for intelligent Adaptive UIs, insights into the effectiveness of RL algorithms in guiding UI adaptation, the integration of physiological data as objective measures of UX, and empirical validation of the proposed framework's impact on UX.

Building Open-Ended Embodied Agent via Language-Policy Bidirectional Adaptation

Authors:Shaopeng Zhai, Jie Wang, Tianyi Zhang, Fuxian Huang, Qi Zhang, Ming Zhou, Jing Hou, Yu Qiao, Yu Liu
Date:2023-12-12 11:06:07

Building embodied agents by integrating Large Language Models (LLMs) and Reinforcement Learning (RL) has revolutionized human-AI interaction: researchers can now leverage language instructions to plan decision-making for open-ended tasks. However, existing research faces challenges in meeting the requirement of open-endedness. Existing approaches typically train the LLM or the RL model to adapt to a fixed counterpart, which limits exploration of novel skills and hinders the efficacy of human-AI interaction. To this end, we present OpenPAL, a co-training framework comprising two stages: (1) fine-tuning a pre-trained LLM to translate human instructions into goals for planning, and goal-conditioned training of a policy for decision-making; (2) co-training to align the LLM and policy, achieving instruction open-endedness. We conducted experiments using Contra, an open-ended FPS game, demonstrating that an agent trained with OpenPAL not only comprehends arbitrary instructions but also exhibits efficient execution. These results suggest that OpenPAL holds the potential to construct open-ended embodied agents in practical scenarios.

Partial End-to-end Reinforcement Learning for Robustness Against Modelling Error in Autonomous Racing

Authors:Andrew Murdoch, Johannes Cornelius Schoeman, Hendrik Willem Jordaan
Date:2023-12-11 14:27:10

In this paper, we address the issue of increasing the performance of reinforcement learning (RL) solutions for autonomous racing cars when navigating under conditions where practical vehicle modelling errors (commonly known as model mismatches) are present. To address this challenge, we propose a partial end-to-end algorithm that decouples the planning and control tasks. Within this framework, an RL agent generates a trajectory comprising a path and velocity, which is subsequently tracked using a pure pursuit steering controller and a proportional velocity controller, respectively. In contrast, many current learning-based (i.e., reinforcement and imitation learning) algorithms utilise an end-to-end approach whereby a deep neural network directly maps from sensor data to control commands. By leveraging the robustness of a classical controller, our partial end-to-end driving algorithm exhibits better robustness towards model mismatches than standard end-to-end algorithms.
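
The tracking layer named in this abstract can be sketched with the textbook forms of the two controllers. This is a generic illustration under a bicycle-model assumption, not the authors' implementation; the function names, the `wheelbase` and `kp` values, and the choice of a single lookahead point are hypothetical.

```python
import math


def pure_pursuit_steering(pose, target, wheelbase):
    """Textbook pure pursuit steering angle for a bicycle model.

    `pose` is (x, y, yaw) of the vehicle and `target` is the (x, y)
    lookahead point on the planned path.
    """
    x, y, yaw = pose
    dx, dy = target[0] - x, target[1] - y
    lookahead = math.hypot(dx, dy)
    if lookahead == 0.0:
        return 0.0  # already at the target point
    # Angle between the vehicle heading and the line to the target.
    alpha = math.atan2(dy, dx) - yaw
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)


def proportional_velocity(v_ref, v, kp):
    """Proportional controller: acceleration command from the speed error."""
    return kp * (v_ref - v)


# Hypothetical values: target ahead and to the left, car slower than planned.
steer = pure_pursuit_steering((0.0, 0.0, 0.0), (4.0, 3.0), wheelbase=0.33)
accel = proportional_velocity(v_ref=5.0, v=3.0, kp=0.5)
```

In the partial end-to-end scheme described above, the RL agent would supply the path (from which `target` is taken) and the reference speed `v_ref`, while these classical controllers produce the low-level commands.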

FOSS: A Self-Learned Doctor for Query Optimizer

Authors:Kai Zhong, Luming Sun, Tao Ji, Cuiping Li, Hong Chen
Date:2023-12-11 13:05:51

Various works have utilized deep learning to address the query optimization problem in database systems. They either learn to construct plans from scratch in a bottom-up manner or steer the plan generation behavior of a traditional optimizer using hints. While these methods have achieved some success, they suffer from either low training efficiency or a limited plan search space. To address these challenges, we introduce FOSS, a novel framework for query optimization based on deep reinforcement learning. FOSS initiates optimization from the original plan generated by a traditional optimizer and incrementally refines suboptimal nodes of the plan through a sequence of actions. Additionally, we devise an asymmetric advantage model to evaluate the advantage between two plans. We integrate it with a traditional optimizer to form a simulated environment. Leveraging this simulated environment, FOSS can bootstrap itself to rapidly generate a large amount of high-quality simulated experiences. FOSS then learns from these experiences to improve its optimization capability. We evaluate the performance of FOSS on Join Order Benchmark, TPC-DS, and Stack Overflow. The experimental results demonstrate that FOSS outperforms the state-of-the-art methods in terms of latency performance. Compared to PostgreSQL, FOSS achieves speedup ranging from 1.15x to 8.33x in total latency across different benchmarks.

Robust and Decentralized Reinforcement Learning for UAV Path Planning in IoT Networks

Authors:Xueyuan Wang, M. Cenk Gursoy
Date:2023-12-11 09:47:41

Unmanned aerial vehicle (UAV)-based networks and Internet of Things (IoT) are being considered as integral components of current and next-generation wireless networks. In particular, UAVs can provide IoT devices with seamless connectivity and high coverage and this can be accomplished with effective UAV path planning. In this article, we study robust and decentralized UAV path planning for data collection in IoT networks in the presence of other noncooperative UAVs and adversarial jamming attacks. We address three different practical scenarios, including single UAV path planning, UAV swarm path planning, and single UAV path planning in the presence of an intelligent mobile UAV jammer. We advocate a reinforcement learning framework for UAV path planning in these three scenarios under practical constraints. The simulation results demonstrate that with learning-based path planning, the UAVs can complete their missions with high success rates and data collection rates. In addition, the UAVs can adapt and execute different trajectories as a defensive measure against the intelligent jammer.

Resilient Path Planning for UAVs in Data Collection under Adversarial Attacks

Authors:Xueyuan Wang, M. Cenk Gursoy
Date:2023-12-11 09:28:28

In this paper, we investigate jamming-resilient UAV path planning strategies for data collection in Internet of Things (IoT) networks, in which the typical UAV can learn the optimal trajectory to elude jamming attacks. Specifically, the typical UAV is required to collect data from multiple distributed IoT nodes under collision avoidance, mission completion deadline, and kinematic constraints in the presence of jamming attacks. We first design a fixed ground jammer with continuous and periodic jamming attack strategies to jam the link between the typical UAV and IoT nodes. Defensive strategies involving a reinforcement learning (RL) based virtual jammer and the adoption of higher SINR thresholds are proposed to counteract such attacks. Secondly, we design an intelligent UAV jammer, which utilizes the RL algorithm to choose actions based on its observation. Then, an intelligent UAV anti-jamming strategy is constructed to deal with such attacks, and the optimal trajectory of the typical UAV is obtained via dueling double deep Q-network (D3QN). Simulation results show that both non-intelligent and intelligent jamming attacks have significant influence on the UAV's performance, and the proposed defense strategies can recover the performance close to that in no-jammer scenarios.
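
The two ingredients behind the D3QN acronym can be shown with plain lists standing in for network outputs. This is a generic sketch of the standard dueling aggregation and double Q-learning target, not the authors' network; the function names and the numeric values are hypothetical.

```python
def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, done, q_online_next, q_target_next):
    """Double DQN target.

    The online network picks the next action; the target network
    evaluates it, which reduces overestimation of Q-values.
    """
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


# Hypothetical outputs for a state with three candidate UAV actions.
q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
target = double_dqn_target(1.0, 0.5, False, [0.1, 0.9], [4.0, 2.0])
```

Note that the target uses the target network's value for the action chosen by the online network (here index 1), even though the target network itself rates index 0 higher.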

Graph-based Prediction and Planning Policy Network (GP3Net) for scalable self-driving in dynamic environments using Deep Reinforcement Learning

Authors:Jayabrata Chowdhury, Venkataramanan Shivaraman, Suresh Sundaram, P B Sujit
Date:2023-12-10 06:04:45

Recent advancements in motion planning for Autonomous Vehicles (AVs) show great promise in using expert driver behaviors in non-stationary driving environments. However, learning only from expert drivers requires greater generalizability to recover from domain shifts and near-failure scenarios caused by the dynamic behavior of traffic participants and weather conditions. A deep Graph-based Prediction and Planning Policy Network (GP3Net) framework is proposed for non-stationary environments that encodes the interactions between traffic participants with contextual information and provides safe maneuver decisions for the AV. A spatio-temporal graph models the interactions between traffic participants for predicting the future trajectories of those participants. The predicted trajectories are utilized to generate a future occupancy map around the AV with uncertainties embedded to anticipate the evolving non-stationary driving environments. Then the contextual information and future occupancy maps are input to the policy network of the GP3Net framework, which is trained using the Proximal Policy Optimization (PPO) algorithm. The performance of the proposed GP3Net is evaluated on standard CARLA benchmarking scenarios with domain shifts of traffic patterns (urban, highway, and mixed). The results show that GP3Net outperforms previous state-of-the-art imitation learning-based planning models for different towns. Further, in unseen new weather conditions, GP3Net completes the desired route with fewer traffic infractions. Finally, the results emphasize the advantage of including the prediction module to enhance safety measures in non-stationary environments.

An Autonomous Driving Model Integrated with BEV-V2X Perception, Fusion Prediction of Motion and Occupancy, and Driving Planning, in Complex Traffic Intersections

Authors:Fukang Li, Wenlin Ou, Kunpeng Gao, Yuwen Pang, Yifei Li, Henry Fan
Date:2023-12-08 15:36:08

The comprehensiveness of vehicle-to-everything (V2X) recognition enriches and holistically shapes the global Bird's-Eye-View (BEV) perception, incorporating rich semantics and integrating driving scene information, thereby providing features for vehicle state prediction, decision-making, and driving planning. Utilizing V2X message sets to form a BEV map proves to be an effective perception method for connected and automated vehicles (CAVs). Specifically, Map Msg. (MAP), Signal Phase And Timing (SPAT), and Roadside Information (RSI) contribute to the achievement of road connectivity, synchronized traffic signal navigation, and obstacle warning. Moreover, harnessing time-sequential Basic Safety Msg. (BSM) data from multiple vehicles allows for real-time perception and future state prediction. Therefore, this paper develops a comprehensive autonomous driving model that relies on BEV-V2X perception, Interacting Multiple Model Unscented Kalman Filter (IMM-UKF)-based fusion prediction, and deep reinforcement learning (DRL)-based decision making and planning. We integrated them into a DRL environment to develop an optimal set of unified driving behaviors that encompass obstacle avoidance, lane changes, overtaking, turning maneuvers, and synchronized traffic signal navigation. Consequently, a complex traffic intersection scenario was simulated, and the well-trained model was applied for driving planning. The observed driving behavior closely resembled that of an experienced driver, exhibiting anticipatory actions and revealing notable operational highlights of the driving policy.

Development and Assessment of Autonomous Vehicles in Both Fully Automated and Mixed Traffic Conditions

Authors:Ahmed Abdelrahman
Date:2023-12-08 02:40:11

Autonomous Vehicle (AV) technology is advancing rapidly, promising a significant shift in road transportation safety and potentially resolving various complex transportation issues. With the increasing deployment of AVs by various companies, questions emerge about how AVs interact with each other and with human drivers, especially when AVs are prevalent on the roads. Ensuring cooperative interaction between AVs and between AVs and human drivers is critical, though there are concerns about possible negative competitive behaviors. This paper presents a multi-stage approach, starting with the development of a single AV and progressing to connected AVs, incorporating a sharing-and-caring V2V communication strategy to enhance mutual coordination. A survey is conducted to validate the driving performance of the AV and will be utilized for a mixed traffic case study, which focuses on how human drivers will react to the AV driving alongside them on the same road. Results show that, using deep reinforcement learning, the AV acquired driving behavior that reached human driving performance. The adoption of sharing-and-caring based V2V communication within AV networks enhances their driving behavior, aids in more effective action planning, and promotes collaborative behavior amongst the AVs. The survey shows that safety in mixed traffic cannot be guaranteed, as we cannot control human ego-driven actions if drivers decide to compete with the AV. Consequently, this paper advocates for enhanced research into the safe incorporation of AVs on public roads.

Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

Authors:Jiayi Huang, Han Zhong, Liwei Wang, Lin F. Yang
Date:2023-12-07 17:35:34

To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, that achieves a regret bound that is both horizon-free and instance-dependent; it is horizon-free since it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed sharp, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is computationally efficient with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.

Learning to sample in Cartesian MRI

Authors:Thomas Sanchez
Date:2023-12-07 14:38:07

Despite its exceptional soft tissue contrast, Magnetic Resonance Imaging (MRI) faces the challenge of long scanning times compared to other modalities like X-ray radiography. Shortening scanning times is crucial in clinical settings, as it increases patient comfort, decreases examination costs and improves throughput. Recent advances in compressed sensing (CS) and deep learning allow accelerated MRI acquisition by reconstructing high-quality images from undersampled data. While reconstruction algorithms have received most of the focus, designing acquisition trajectories to optimize reconstruction quality remains an open question. This thesis explores two approaches to address this gap in the context of Cartesian MRI. First, we propose two algorithms, lazy LBCS and stochastic LBCS, that significantly improve upon Gözcü et al.'s greedy learning-based CS (LBCS) approach. These algorithms scale to large, clinically relevant scenarios like multi-coil 3D MR and dynamic MRI, previously inaccessible to LBCS. Additionally, we demonstrate that generative adversarial networks (GANs) can serve as a natural criterion for adaptive sampling by leveraging variance in the measurement domain to guide acquisition. Second, we delve into the underlying structures or assumptions that enable mask design algorithms to perform well in practice. Our experiments reveal that state-of-the-art deep reinforcement learning (RL) approaches, while capable of adaptation and long-horizon planning, offer only marginal improvements over stochastic LBCS, which is neither adaptive nor does long-term planning. Altogether, our findings suggest that stochastic LBCS and similar methods represent promising alternatives to deep RL. They shine in particular by their scalability and computational efficiency and could be key in the deployment of optimized acquisition trajectories in Cartesian MRI.

Diffused Task-Agnostic Milestone Planner

Authors:Mineui Hong, Minjae Kang, Songhwai Oh
Date:2023-12-06 10:09:22

Addressing decision-making problems using sequence modeling to predict future trajectories has shown promising results in recent years. In this paper, we take a step further to leverage the sequence predictive method in wider areas such as long-term planning, vision-based control, and multi-task decision-making. To this end, we propose a method that utilizes a diffusion-based generative sequence model to plan a series of milestones in a latent space and has an agent follow the milestones to accomplish a given task. The proposed method can learn control-relevant, low-dimensional latent representations of milestones, which makes it possible to efficiently perform long-term planning and vision-based control. Furthermore, our approach exploits the generation flexibility of the diffusion model, which makes it possible to plan diverse trajectories for multi-task decision-making. We demonstrate the proposed method across offline reinforcement learning (RL) benchmarks and a visual manipulation environment. The results show that our approach outperforms offline RL methods in solving long-horizon, sparse-reward tasks and multi-task problems, while also achieving state-of-the-art performance on the most challenging vision-based manipulation benchmark.

Hierarchical RL-Guided Large-scale Navigation of a Snake Robot

Authors:Shuo Jiang, Adarsh Salagame, Alireza Ramezani, Lawson Wong
Date:2023-12-06 01:44:58

Classical snake robot control leverages mimicking snake-like gaits tuned for specific environments. However, to operate adaptively in unstructured environments, gait generation must be dynamically scheduled. In this work, we present a four-layer hierarchical control scheme to enable the snake robot to navigate freely in large-scale environments. The proposed model decomposes navigation into global planning, local planning, gait generation, and gait tracking. Using reinforcement learning (RL) and a central pattern generator (CPG), our method learns to navigate in complex mazes within hours and can be directly deployed to arbitrary new environments in a zero-shot fashion. We use the high-fidelity model of Northeastern's slithering robot COBRA to test the effectiveness of the proposed hierarchical control approach.

MASP: Scalable GNN-based Planning for Multi-Agent Navigation

Authors:Xinyi Yang, Xinting Yang, Chao Yu, Jiayu Chen, Wenbo Ding, Huazhong Yang, Yu Wang
Date:2023-12-05 06:05:04

We investigate multi-agent navigation tasks, where multiple agents need to reach initially unassigned goals in a limited time. Classical planning-based methods suffer from expensive computational overhead at each step and offer limited expressiveness for complex cooperation strategies. In contrast, reinforcement learning (RL) has recently become a popular approach for addressing this issue. However, RL struggles with low data efficiency and cooperation when directly exploring (nearly) optimal policies in a large exploration space, especially with an increased number of agents (e.g., 10+ agents) or in complex environments (e.g., 3-D simulators). In this paper, we propose the Multi-Agent Scalable Graph-based Planner (MASP), a goal-conditioned hierarchical planner for navigation tasks with a substantial number of agents in the decentralized setting. MASP employs a hierarchical framework to reduce space complexity by decomposing a large exploration space into multiple goal-conditioned subspaces, where a high-level policy assigns agents goals, and a low-level policy navigates agents toward designated goals. For agent cooperation and the adaptation to varying team sizes, we model agents and goals as graphs to better capture their relationship. The high-level policy, the Goal Matcher, leverages a graph-based Self-Encoder and Cross-Encoder to optimize goal assignment by updating the agent and the goal graphs. The low-level policy, the Coordinated Action Executor, introduces the Group Information Fusion to facilitate group division and extract agent relationships across groups, enhancing training efficiency for agent cooperation. The results demonstrate that MASP outperforms RL and planning-based baselines in task efficiency.
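
To make the goal-assignment sub-problem concrete, here is a minimal sketch of a classical brute-force baseline that assigns each agent one goal so that the total straight-line distance is minimal. It is not MASP's learned Goal Matcher. The function name is hypothetical, and the factorial cost of the search illustrates the per-step overhead of classical planning that the abstract mentions.

```python
import math
from itertools import permutations


def assign_goals(agents, goals):
    """Return (assignment, cost), where assignment[i] is the goal index of agent i.

    Exhaustive search over all one-to-one assignments, minimizing the summed
    Euclidean distance. Runs in O(n!) time, so it only suits very small teams.
    """
    if len(agents) != len(goals):
        raise ValueError("expected the same number of agents and goals")
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(len(goals))):
        cost = sum(math.dist(agents[i], goals[g]) for i, g in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


if __name__ == "__main__":
    agents = [(0.0, 0.0), (10.0, 0.0)]
    goals = [(10.0, 1.0), (0.0, 1.0)]
    print(assign_goals(agents, goals))
```

In the example, each agent is matched to the goal directly above it, giving a total distance of 2.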

RL-Based Cargo-UAV Trajectory Planning and Cell Association for Minimum Handoffs, Disconnectivity, and Energy Consumption

Authors:Nesrine Cherif, Wael Jaafar, Halim Yanikomeroglu, Abbas Yongacoglu
Date:2023-12-05 04:06:09

Unmanned aerial vehicle (UAV) is a promising technology for last-mile cargo delivery. However, the limited on-board battery capacity, cellular unreliability, and frequent handoffs in the airspace are the main obstacles to unleash its full potential. Given that existing cellular networks were primarily designed to service ground users, re-utilizing the same architecture for highly mobile aerial users, e.g., cargo-UAVs, is deemed challenging. Indeed, to ensure a safe delivery using cargo-UAVs, it is crucial to utilize the available energy efficiently, while guaranteeing reliable connectivity for command-and-control and avoiding frequent handoff. To achieve this goal, we propose a novel approach for joint cargo-UAV trajectory planning and cell association. Specifically, we formulate the cargo-UAV mission as a multi-objective problem aiming to 1) minimize energy consumption, 2) reduce handoff events, and 3) guarantee cellular reliability along the trajectory. We leverage reinforcement learning (RL) to jointly optimize the cargo-UAV's trajectory and cell association. Simulation results demonstrate a performance improvement of our proposed method, in terms of handoffs, disconnectivity, and energy consumption, compared to benchmarks.

Autonomous and Adaptive Role Selection for Multi-robot Collaborative Area Search Based on Deep Reinforcement Learning

Authors:Lina Zhu, Jiyu Cheng, Hao Zhang, Zhichao Cui, Wei Zhang, Yuehu Liu
Date:2023-12-04 09:10:44

For multi-robot collaborative area search, we propose a unified approach that simultaneously maps the environment to sense more targets (exploration) while searching for and locating the targets (coverage). Specifically, we implement a hierarchical multi-agent reinforcement learning algorithm to decouple task planning from task execution. The role concept is integrated into the upper-level task planning for role selection, which enables robots to learn the role based on the state status from the upper view. In addition, an intelligent role-switching mechanism enables the role selection module to function between two timesteps, promoting both exploration and coverage interchangeably. The primitive policy then learns how to plan based on the assigned roles and local observations for sub-task execution. Experiments show the scalability and generalization of our method compared with state-of-the-art approaches in scenes with varying complexity and numbers of robots.

Towards Goal-oriented Intelligent Tutoring Systems in Online Education

Authors:Yang Deng, Zifeng Ren, An Zhang, Wenqiang Lei, Tat-Seng Chua
Date:2023-12-03 12:37:16

Interactive Intelligent Tutoring Systems (ITSs) enhance traditional ITSs by promoting effective learning through interactions and problem resolution in online education. Yet, proactive engagement, prioritizing resource optimization with planning and assessment capabilities, is often overlooked in current ITS designs. In this work, we investigate a new task, named Goal-oriented Intelligent Tutoring Systems (GITS), which aims to enable the student's mastery of a designated concept by strategically planning a customized sequence of exercises and assessment. To address the problem of goal-oriented policy learning in GITS, we propose a novel graph-based reinforcement learning framework, named Planning-Assessment-Interaction (PAI). Specifically, we first leverage cognitive structure information to improve state representation learning and action selection for planning the next action, which can be either to tutor an exercise or to assess the target concept. Further, we use a dynamically updated cognitive diagnosis model to simulate student responses to exercises and concepts. Three benchmark datasets across different subjects are constructed for enabling offline academic research on GITS. Experimental results demonstrate the effectiveness and efficiency of PAI and extensive analyses of various types of students are conducted to showcase the challenges in this task.

Self Generated Wargame AI: Double Layer Agent Task Planning Based on Large Language Model

Authors:Y. Sun, J. Zhao, C. Yu, W. Wang, X. Zhou
Date:2023-12-02 09:45:45

Large language models, represented by ChatGPT, have had a disruptive impact on the field of artificial intelligence, but they mainly focus on natural language processing, speech recognition, machine learning and natural language understanding. This paper applies the large language model to the field of intelligent decision-making, places the large language model at the decision-making center, and constructs an agent architecture with the large language model as the core. Based on this, it further proposes a two-layer agent task planning scheme that issues and executes decision commands through natural language interaction, and carries out simulation verification in a wargame simulation environment. The game confrontation simulation experiment finds that the intelligent decision-making ability of the large language model is significantly stronger than that of the commonly used reinforcement learning AI and rule AI, and that its intelligence, understandability and generalization are all better. Experiments also show that the intelligence of the large language model is closely related to the prompt. This work extends the large language model from previous human-computer interaction to the field of intelligent decision-making, which has important reference value and significance for the development of intelligent decision-making.

A Survey of Progress on Cooperative Multi-agent Reinforcement Learning in Open Environment

Authors:Lei Yuan, Ziqian Zhang, Lihe Li, Cong Guan, Yang Yu
Date:2023-12-02 08:04:31

Multi-agent Reinforcement Learning (MARL) has gained wide attention in recent years and has made progress in various fields. Specifically, cooperative MARL focuses on training a team of agents to cooperatively achieve tasks that are difficult for a single agent to handle. It has shown great potential in applications such as path planning, autonomous driving, active voltage control, and dynamic algorithm configuration. One of the research focuses in the field of cooperative MARL is how to improve the coordination efficiency of the system, while research work has mainly been conducted in simple, static, and closed environment settings. To promote the application of artificial intelligence in the real world, some research has begun to explore multi-agent coordination in open environments. These works have made progress in exploring and researching environments where important factors might change. However, the field still lacks a comprehensive review of this research direction. In this paper, starting from the concept of reinforcement learning, we subsequently introduce multi-agent systems (MAS), cooperative MARL, typical methods, and test environments. Then, we summarize the research work of cooperative MARL from closed to open environments, extract multiple research directions, and introduce typical works. Finally, we summarize the strengths and weaknesses of the current research, and look forward to the future development direction and research problems in cooperative MARL in open environments.

Optimal Attack and Defense for Reinforcement Learning

Authors:Jeremy McMahan, Young Wu, Xiaojin Zhu, Qiaomin Xie
Date:2023-11-30 21:21:47

To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.

Maximum Entropy Model Correction in Reinforcement Learning

Authors:Amin Rakhsha, Mete Kemertas, Mohammad Ghavamzadeh, Amir-massoud Farahmand
Date:2023-11-29 18:00:41

We propose and theoretically analyze an approach for planning with an approximate model in reinforcement learning that can reduce the adverse impact of model error. If the model is accurate enough, it also accelerates the convergence to the true value function. One of its key components is the MaxEnt Model Correction (MoCo) procedure that corrects the model's next-state distributions based on a Maximum Entropy density estimation formulation. Based on MoCo, we introduce the Model Correcting Value Iteration (MoCoVI) algorithm, and its sample-based variant MoCoDyna. We show that MoCoVI and MoCoDyna's convergence can be much faster than that of conventional model-free algorithms. Unlike traditional model-based algorithms, MoCoVI and MoCoDyna effectively utilize an approximate model and still converge to the correct value function.

Deep Reinforcement Learning Graphs: Feedback Motion Planning via Neural Lyapunov Verification

Authors:Armin Ghanbarzadeh, Esmaeil Najafi
Date:2023-11-29 12:31:06

Recent advancements in model-free deep reinforcement learning have enabled efficient agent training. However, challenges arise when determining the region of attraction for these controllers, especially if the region does not fully cover the desired area. This paper addresses this issue by introducing a feedback motion control algorithm that utilizes data-driven techniques and neural networks. The algorithm constructs a graph of connected reinforcement-learning based controllers, each with its own defined region of attraction. This incremental approach effectively covers a bounded region of interest, creating a trajectory of interconnected nodes that guide the system from an initial state to the goal. Two approaches are presented for connecting nodes within the algorithm. The first is a tree-structured method, facilitating "point-to-point" control by constructing a tree connecting the initial state to the goal state. The second is a graph-structured method, enabling "space-to-space" control by building a graph within a bounded region. This approach allows for control from arbitrary initial and goal states. The proposed method's performance is evaluated on a first-order dynamic system, considering scenarios both with and without obstacles. The results demonstrate the effectiveness of the proposed algorithm in achieving the desired control objectives.

Goal-conditioned Offline Planning from Curious Exploration

Authors:Marco Bagatella, Georg Martius
Date:2023-11-28 17:48:18

Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate exploratory trajectories, as well as a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. In order to mitigate their occurrence, we propose to combine model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, obtaining significant improvements in zero-shot goal-reaching performance across diverse simulated environments.

Reinforcement Learning for Wildfire Mitigation in Simulated Disaster Environments

Authors:Alexander Tapley, Marissa Dotter, Michael Doyle, Aidan Fennelly, Dhanuj Gandikota, Savanna Smith, Michael Threet, Tim Welsh
Date:2023-11-27 15:37:05

Climate change has resulted in a year over year increase in adverse weather and weather conditions which contribute to increasingly severe fire seasons. Without effective mitigation, these fires pose a threat to life, property, ecology, cultural heritage, and critical infrastructure. To better prepare for and react to the increasing threat of wildfires, more accurate fire modelers and mitigation responses are necessary. In this paper, we introduce SimFire, a versatile wildland fire projection simulator designed to generate realistic wildfire scenarios, and SimHarness, a modular agent-based machine learning wrapper capable of automatically generating land management strategies within SimFire to reduce the overall damage to the area. Together, this publicly available system allows researchers and practitioners the ability to emulate and assess the effectiveness of firefighter interventions and formulate strategic plans that prioritize value preservation and resource allocation optimization. The repositories are available for download at https://github.com/mitrefireline.

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Authors:Heyang Zhao, Jiafan He, Quanquan Gu
Date:2023-11-26 08:31:57

The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.

How to ensure a safe control strategy? Towards a SRL for urban transit autonomous operation

Authors:Zicong Zhao
Date:2023-11-24 13:11:07

Deep reinforcement learning has gradually shown its latent decision-making ability in urban rail transit autonomous operation. However, reinforcement learning can guarantee safety neither during learning nor during execution, and this remains one of the major obstacles to its practical application. Given this drawback, applying reinforcement learning in the safety-critical autonomous operation domain remains challenging without generating a safe control command sequence that avoids overspeed operations. Therefore, an SSA-DRL framework is proposed in this paper for the safe intelligent control of autonomously operated urban rail transit trains. The proposed framework combines linear temporal logic, reinforcement learning and Monte Carlo tree search, and consists of four main modules: a post-posed shield, a searching tree module, a DRL framework and an additional actor. Furthermore, the output of the framework can satisfy the speed and schedule constraints while optimizing the operation process. Finally, the proposed SSA-DRL framework for decision-making in urban rail transit autonomous operation is evaluated in sixteen different sections, and its effectiveness is demonstrated through an ablation experiment and comparison with the scheduled operation plan.
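
The idea of a post-posed shield (checking an action after the agent proposes it and overriding it if it would cause overspeed) can be sketched as follows. This is a hand-written one-step speed check under assumed point-mass dynamics, not the linear-temporal-logic shield of the paper. The function name, the time step, and the braking limit are all hypothetical.

```python
def shield(speed, proposed_accel, speed_limit, dt=1.0, max_brake=-1.0):
    """Return an acceleration that respects the speed limit one step ahead.

    Assumes speed_next = speed + accel * dt. The agent's proposal is kept when
    it is safe. Otherwise it is replaced by the largest acceleration that
    keeps speed_next at or below the limit, floored at the braking capability
    max_brake (all values are illustrative).
    """
    predicted = speed + proposed_accel * dt
    if predicted <= speed_limit:
        return proposed_accel  # proposal is safe, pass it through unchanged
    safe_accel = (speed_limit - speed) / dt
    return max(safe_accel, max_brake)


if __name__ == "__main__":
    print(shield(19.0, 0.5, 20.0))  # safe proposal is kept
    print(shield(19.5, 1.0, 20.0))  # clipped to reach exactly the limit
    print(shield(25.0, 0.0, 20.0))  # already overspeeding: brake as hard as allowed
```

Because the check looks only one step ahead, it cannot by itself guarantee that the train can always brake in time, which is one reason a real shield needs a richer safety specification.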

Offline Skill Generalization via Task and Motion Planning

Authors:Shin Watanabe, Geir Horn, Jim Tørresen, Kai Olav Ellefsen
Date:2023-11-24 08:06:55

This paper presents a novel approach to generalizing robot manipulation skills by combining a sampling-based task-and-motion planner with an offline reinforcement learning algorithm. Starting with a small library of scripted primitive skills (e.g. Push) and object-centric symbolic predicates (e.g. On(block, plate)), the planner autonomously generates a demonstration dataset of manipulation skills in the context of a long-horizon task. An offline reinforcement learning algorithm then extracts a policy from the dataset without further interactions with the environment and replaces the scripted skill in the existing library. Refining the skill library improves the robustness of the planner, which in turn facilitates data collection for more complex manipulation skills. We validate our approach in simulation, on a block-pushing task. We show that the proposed method requires less training data than conventional reinforcement learning methods. Furthermore, interaction with the environment is collision-free because of the use of planner demonstrations, making the approach more amenable to persistent robot learning in the real world.

Guided Flows for Generative Modeling and Decision Making

Authors:Qinqing Zheng, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, Ricky T. Q. Chen
Date:2023-11-22 15:07:59

Classifier-free guidance is a key component for enhancing the performance of conditional generative models across diverse tasks. While it has previously demonstrated remarkable improvements in sample quality, it has been employed exclusively for diffusion models. In this paper, we integrate classifier-free guidance into Flow Matching (FM) models, an alternative simulation-free approach that trains Continuous Normalizing Flows (CNFs) based on regressing vector fields. We explore the usage of Guided Flows for a variety of downstream applications. We show that Guided Flows significantly improves the sample quality in conditional image generation and zero-shot text-to-speech synthesis, boasting state-of-the-art performance. Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, showcasing a 10x speedup in computation compared to diffusion models while maintaining comparable performance.
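
The abstract does not state the guidance formula. The sketch below assumes the usual classifier-free guidance blend, applied here to velocity fields: the guided velocity is (1 - w) times the unconditional velocity plus w times the conditional one, integrated with a plain Euler solver. The toy constant fields and all names are hypothetical stand-ins for trained networks.

```python
def guided_velocity(u_uncond, u_cond, w):
    """Blend two velocity vectors: (1 - w) * u_uncond + w * u_cond.

    w = 0 recovers the unconditional field, w = 1 the conditional field,
    and w > 1 extrapolates past the conditional field.
    """
    return [(1.0 - w) * a + w * b for a, b in zip(u_uncond, u_cond)]


def euler_sample(x0, uncond_field, cond_field, w, steps=100):
    """Integrate dx/dt = guided velocity from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = guided_velocity(uncond_field(x, t), cond_field(x, t), w)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy one-dimensional fields standing in for learned models (assumption).
    def uncond(x, t):
        return [0.0]

    def cond(x, t):
        return [1.0]

    print(euler_sample([0.0], uncond, cond, w=2.0))
```

With these toy fields, a guidance weight of 2 moves the sample twice as far as the conditional field alone would.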

Neural Approximate Dynamic Programming for the Ultra-fast Order Dispatching Problem

Authors:Arash Dehghan, Mucahit Cevik, Merve Bodur
Date:2023-11-21 20:23:58

Same-Day Delivery (SDD) services aim to maximize the fulfillment of online orders while minimizing delivery delays but are beset by operational uncertainties such as those in order volumes and courier planning. Our work aims to enhance the operational efficiency of SDD by focusing on the ultra-fast Order Dispatching Problem (ODP), which involves matching and dispatching orders to couriers within a centralized warehouse setting, and completing the delivery within a strict timeline (e.g., within minutes). We introduce important extensions to ultra-fast ODP such as order batching and explicit courier assignments to provide a more realistic representation of dispatching operations and improve delivery efficiency. As a solution method, we primarily focus on NeurADP, a methodology that combines Approximate Dynamic Programming (ADP) and Deep Reinforcement Learning (DRL), and our work constitutes the first application of NeurADP outside of the ride-pool matching problem. NeurADP is particularly suitable for ultra-fast ODP as it addresses complex one-to-many matching and routing intricacies through a neural network-based VFA that captures high-dimensional problem dynamics without requiring manual feature engineering as in generic ADP methods. We test our proposed approach using four distinct realistic datasets tailored for ODP and compare the performance of NeurADP against myopic and DRL baselines by also making use of non-trivial bounds to assess the quality of the policies. Our numerical results indicate that the inclusion of order batching and courier queues enhances the efficiency of delivery operations and that NeurADP significantly outperforms other methods. Detailed sensitivity analysis with important parameters confirms the robustness of NeurADP under different scenarios, including variations in courier numbers, spatial setup, vehicle capacity, and permitted delay time.

Provable Representation with Efficient Planning for Partial Observable Reinforcement Learning

Authors:Hongming Zhang, Tongzheng Ren, Chenjun Xiao, Dale Schuurmans, Bo Dai
Date:2023-11-20 23:56:58

In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows for partial observability to be accounted for in learning, exploration and planning, but presents significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis for justifying the statistical efficiency of the proposed algorithm, and also empirically demonstrate the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning towards more practical applications.
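
For context on why partial observability is costly, the sketch below shows the textbook discrete POMDP belief update (a Bayes filter), which tracks a distribution over hidden states instead of treating the latest observation as the state. It is background illustration, not the representation-based algorithm proposed in the paper. The array layout and names are assumptions.

```python
def belief_update(belief, action, observation, T, O):
    """Posterior over hidden states after taking `action` and seeing `observation`.

    belief[s]    : prior probability of state s
    T[s][a][s2]  : probability of moving from s to s2 under action a
    O[a][s2][o]  : probability of observing o after reaching s2 via action a
    Implements b'(s2) proportional to O[a][s2][o] * sum_s T[s][a][s2] * b(s).
    """
    n = len(belief)
    unnormalized = [
        O[action][s2][observation]
        * sum(T[s][action][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


if __name__ == "__main__":
    # Two hidden states, one action that leaves the state unchanged,
    # and a noisy sensor that reports the true state 80% of the time.
    T = [[[1.0, 0.0]], [[0.0, 1.0]]]
    O = [[[0.8, 0.2], [0.2, 0.8]]]
    print(belief_update([0.5, 0.5], action=0, observation=0, T=T, O=O))
```

Each update costs time quadratic in the number of states, and the belief itself lives in a continuous space, which hints at the computational challenges the abstract refers to.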

Provably Efficient CVaR RL in Low-rank MDPs

Authors:Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen Sun, Jason D. Lee
Date:2023-11-20 17:44:40

We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior theoretical work studying risk-sensitive RL focuses on the tabular Markov Decision Processes (MDPs) setting. To extend CVaR RL to settings where the state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the underlying transition kernel admits a low-rank decomposition, but unlike prior linear models, low-rank MDPs do not assume the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of representations. Computationally, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find the near-optimal policy in polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.
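
As a small illustration of the objective only, the sketch below estimates CVaR from sampled episode returns as the mean of the worst $\tau$-fraction of outcomes. The lower-tail convention for returns and the simple empirical estimator are assumptions made for this example. This is not the paper's UCB or LSVI algorithm, and the function name is hypothetical.

```python
import math


def empirical_cvar(returns, tau):
    """Mean of the worst ceil(tau * n) returns (lower-tail CVaR estimate).

    tau = 1 gives the ordinary mean. Smaller tau focuses on worse outcomes.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    ordered = sorted(returns)
    k = max(1, math.ceil(tau * len(ordered)))
    return sum(ordered[:k]) / k


if __name__ == "__main__":
    sample_returns = [1.0, 2.0, 3.0, 4.0]
    print(empirical_cvar(sample_returns, 0.5))  # average of the two worst returns
    print(empirical_cvar(sample_returns, 1.0))  # plain average
```

Maximizing this quantity rather than the mean favors policies whose bad episodes are less bad, at the price of needing more samples as $\tau$ shrinks, consistent with the $1/\tau^2$ factor in the stated bound.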

Tactile Active Inference Reinforcement Learning for Efficient Robotic Manipulation Skill Acquisition

Authors:Zihao Liu, Xing Liu, Yizhai Zhang, Zhengxiong Liu, Panfeng Huang
Date:2023-11-19 10:19:22

Robotic manipulation holds the potential to replace humans in the execution of tedious or dangerous tasks. However, control-based approaches are not suitable due to the difficulty of formally describing open-world manipulation in reality, and the inefficiency of existing learning methods. Thus, applying manipulation in a wide range of scenarios presents significant challenges. In this study, we propose a novel method for skill learning in robotic manipulation called Tactile Active Inference Reinforcement Learning (Tactile-AIRL), aimed at achieving efficient training. To enhance the performance of reinforcement learning (RL), we introduce active inference, which integrates model-based techniques and intrinsic curiosity into the RL process. This integration improves the algorithm's training efficiency and adaptability to sparse rewards. Additionally, we utilize a vision-based tactile sensor to provide detailed perception for manipulation tasks. Finally, we employ a model-based approach to imagine and plan appropriate actions through free energy minimization. Simulation results demonstrate that our method achieves high training efficiency in non-prehensile object-pushing tasks. It enables agents to excel in both dense and sparse reward tasks with just a few interaction episodes, surpassing the SAC baseline. Furthermore, we conduct physical experiments on a gripper screwing task using our method, which showcase the algorithm's rapid learning capability and its potential for practical applications.

Safety Aware Autonomous Path Planning Using Model Predictive Reinforcement Learning for Inland Waterways

Authors:Astrid Vanneste, Simon Vanneste, Olivier Vasseur, Robin Janssens, Mattias Billast, Ali Anwar, Kevin Mets, Tom De Schepper, Siegfried Mercelis, Peter Hellinckx
Date:2023-11-16 13:12:58

In recent years, interest in autonomous shipping in urban waterways has increased significantly due to the trend of keeping cars and trucks out of city centers. Classical approaches such as Frenet frame based planning and potential field navigation often require tuning of many configuration parameters and sometimes even require a different configuration depending on the situation. In this paper, we propose a novel path planning approach based on reinforcement learning called Model Predictive Reinforcement Learning (MPRL). MPRL calculates a series of waypoints for the vessel to follow. The environment is represented as an occupancy grid map, allowing us to deal with any shape of waterway and any number and shape of obstacles. We demonstrate our approach on two scenarios and compare the resulting path with path planning using a Frenet frame and path planning based on a proximal policy optimization (PPO) agent. Our results show that MPRL outperforms both baselines in both test scenarios. The PPO based approach was not able to reach the goal in either scenario while the Frenet frame approach failed in the scenario consisting of a corner with obstacles. MPRL was able to safely (collision free) navigate to the goal in both of the test scenarios.
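
To illustrate the occupancy-grid representation, the following sketch checks whether a sequence of waypoints stays in free cells. The grid encoding (1 for occupied), the coarse straight-line sampling between waypoints, and the function names are assumptions for this example, not details from the paper.

```python
def cells_on_segment(a, b):
    """Grid cells sampled along the straight segment from cell a to cell b."""
    (r0, c0), (r1, c1) = a, b
    steps = max(abs(r1 - r0), abs(c1 - c0))
    if steps == 0:
        return [a]
    return [
        (round(r0 + (r1 - r0) * k / steps), round(c0 + (c1 - c0) * k / steps))
        for k in range(steps + 1)
    ]


def path_is_collision_free(grid, waypoints):
    """True if every sampled cell is inside the grid and unoccupied.

    grid[row][col] == 1 marks an occupied cell and 0 a free one.
    waypoints is a list of (row, col) cells visited in order.
    """
    rows, cols = len(grid), len(grid[0])
    cells = [waypoints[0]] if waypoints else []
    for a, b in zip(waypoints, waypoints[1:]):
        cells.extend(cells_on_segment(a, b))
    return all(
        0 <= r < rows and 0 <= c < cols and grid[r][c] == 0
        for r, c in cells
    )


if __name__ == "__main__":
    grid = [
        [0, 0, 0],
        [0, 1, 0],
        [0, 0, 0],
    ]
    print(path_is_collision_free(grid, [(0, 0), (2, 2)]))          # cuts through the obstacle
    print(path_is_collision_free(grid, [(0, 0), (0, 2), (2, 2)]))  # goes around it
```

Because the map is just a grid of cells, the same check works for any waterway shape or obstacle layout, which is the flexibility the abstract attributes to this representation.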

On Convex Optimal Value Functions For POSGs

Authors:Rafael F. Cunha, Jacopo Castellini, Johan Peralez, Jilles S. Dibangoye
Date:2023-11-15 23:48:21

Multi-agent planning and reinforcement learning can be challenging when agents cannot see the state of the world or communicate with each other due to communication costs, latency, or noise. Partially Observable Stochastic Games (POSGs) provide a mathematical framework for modelling such scenarios. This paper aims to improve the efficiency of planning and reinforcement learning algorithms for POSGs by identifying the underlying structure of optimal state-value functions. The approach involves reformulating the original game from the perspective of a trusted third party who plans on behalf of the agents simultaneously. From this viewpoint, the original POSGs can be viewed as Markov games where states are occupancy states, i.e., posterior probability distributions over the hidden states of the world and the stream of actions and observations that agents have experienced so far. This study mainly proves that the optimal state-value function is a convex function of occupancy states expressed on an appropriate basis in all zero-sum, common-payoff, and Stackelberg POSGs.

Flexible and Adaptive Manufacturing by Complementing Knowledge Representation, Reasoning and Planning with Reinforcement Learning

Authors:Matthias Mayr, Faseeh Ahmad, Volker Krueger
Date:2023-11-15 20:28:27

This paper describes a novel approach to adaptive manufacturing in the context of small batch production and customization. It focuses on integrating task-level planning and reasoning with reinforcement learning (RL) in the SkiROS2 skill-based robot control platform. This integration enhances the efficiency and adaptability of robotic systems in manufacturing, enabling them to adjust to task variations and learn from interaction data. The paper highlights the architecture of SkiROS2, particularly its world model, skill libraries, and task management. It demonstrates how combining RL with robotic manipulators can learn and improve the execution of industrial tasks. It advocates a multi-objective learning model that eases the learning problem design. The approach can incorporate user priors or previous experiences to accelerate learning and increase safety. Spotlight video: https://youtu.be/H5PmZl2rRbs?si=8wmZ-gbwuSJRxe3S&t=1422 SkiROS2 code: https://github.com/RVMI/skiros2 SkiROS2 talk at ROSCon: https://vimeo.com/879001825/2a0e9d5412 SkiREIL code: https://github.com/matthias-mayr/SkiREIL

Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework

Authors:Weiqin Zu, Wenbin Song, Ruiqing Chen, Ze Guo, Fanglei Sun, Zheng Tian, Wei Pan, Jun Wang
Date:2023-11-14 15:29:52

The socially-aware navigation system has evolved to adeptly avoid various obstacles while performing multiple tasks, such as point-to-point navigation, human-following, and -guiding. However, a prominent gap persists: in Human-Robot Interaction (HRI), the procedure of communicating commands to robots demands intricate mathematical formulations. Furthermore, the transition between tasks does not quite possess the intuitive control and user-centric interactivity that one would desire. In this work, we propose an LLM-driven interactive multimodal multitask robot navigation framework, termed LIM2N, to solve the above new challenge in the navigation field. We achieve this by first introducing a multimodal interaction framework where language and hand-drawn inputs can serve as navigation constraints and control objectives. Next, a reinforcement learning agent is built to handle multiple tasks with the received information. Crucially, LIM2N creates smooth cooperation among the reasoning of multimodal input, multitask planning, and adaptation and processing of the intelligent sensing modules in the complicated system. Extensive experiments are conducted in both simulation and the real world demonstrating that LIM2N has superior user needs understanding, alongside an enhanced interactive experience.

LLM Augmented Hierarchical Agents

Authors:Bharat Prakash, Tim Oates, Tinoosh Mohsenin
Date:2023-11-09 18:54:28

Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning). Humans can generate and execute plans with temporally-extended actions and quickly learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have this same ability. Recently, LLMs have been shown to encode a tremendous amount of knowledge about the world and to perform impressive in-context learning and reasoning. However, using LLMs to solve real world problems is hard because they are not grounded in the current task. In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. Rather than relying completely on LLMs, we use them to guide a high-level policy, making learning significantly more sample efficient. This approach is evaluated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks. We show that agents trained using our approach outperform other baseline methods and, once trained, don't need access to LLMs during deployment.

Anytime-Constrained Reinforcement Learning

Authors:Jeremy McMahan, Xiaojin Zhu
Date:2023-11-09 16:51:26

We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an arbitrarily accurate approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or the absolute budget. Given our hardness results, our approximation guarantees are the best possible under worst-case analysis.
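
The cost-augmentation idea mentioned above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a tiny finite-horizon tabular cMDP with deterministic, non-negative integer costs, and it simply runs dynamic programming over pairs (state, cumulative cost), skipping any action that would push the cumulative cost above the budget. The function name `solve_anytime_constrained`, the data layout, and the toy problem are all hypothetical.

```python
import math
from functools import lru_cache

def solve_anytime_constrained(P, R, C, horizon, budget, start):
    """Optimal value of a finite-horizon cMDP under an anytime budget.

    P[s][a] is a list of (next_state, probability) pairs, R[s][a] is the
    reward and C[s][a] a non-negative integer cost (assumed deterministic).
    Returns -inf if no policy can stay within the budget almost surely.
    """

    @lru_cache(maxsize=None)
    def value(t, s, spent):
        # Augmented state: (time step, environment state, cost spent so far).
        if t == horizon:
            return 0.0
        best = -math.inf
        for a in range(len(P[s])):
            new_spent = spent + C[s][a]
            if new_spent > budget:
                continue  # this action would violate the budget right now
            q = R[s][a]
            for s_next, prob in P[s][a]:
                if prob > 0.0:
                    q += prob * value(t + 1, s_next, new_spent)
            best = max(best, q)
        return best

    return value(0, start, 0)

# Hypothetical toy problem: one state, "rest" (reward 0, cost 0)
# versus "work" (reward 1, cost 1).
P = [[[(0, 1.0)], [(0, 1.0)]]]
R = [[0.0, 1.0]]
C = [[0, 1]]
print(solve_anytime_constrained(P, R, C, horizon=5, budget=2, start=0))
```

With a budget of 2 the agent can afford "work" only twice in five steps. The number of augmented states grows with the number of distinct cumulative costs, which is one way to see why the precision of the costs matters for tractability.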

Social Motion Prediction with Cognitive Hierarchies

Authors:Wentao Zhu, Jason Qin, Yuke Lou, Hang Ye, Xiaoxuan Ma, Hai Ci, Yizhou Wang
Date:2023-11-08 14:51:17

Humans exhibit a remarkable capacity for anticipating the actions of others and planning their own actions accordingly. In this study, we strive to replicate this ability by addressing the social motion prediction problem. We introduce a new benchmark, a novel formulation, and a cognition-inspired framework. We present Wusi, a 3D multi-person motion dataset under the context of team sports, which features intense and strategic human interactions and diverse pose distributions. By reformulating the problem from a multi-agent reinforcement learning perspective, we incorporate behavioral cloning and generative adversarial imitation learning to boost learning efficiency and generalization. Furthermore, we take into account the cognitive aspects of the human social action planning process and develop a cognitive hierarchy framework to predict strategic human social interactions. We conduct comprehensive experiments to validate the effectiveness of our proposed dataset and approach. Code and data are available at https://walter0807.github.io/Social-CH/.

Force-Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing

Authors:Zhanyi Sun, Yufei Wang, David Held, Zackory Erickson
Date:2023-11-07 23:39:43

Robot-assisted dressing could profoundly enhance the quality of life of adults with physical disabilities. To achieve this, a robot can benefit from both visual and force sensing. The former enables the robot to ascertain human body pose and garment deformations, while the latter helps maintain safety and comfort during the dressing process. In this paper, we introduce a new technique that leverages both vision and force modalities for this assistive task. Our approach first trains a vision-based dressing policy using reinforcement learning in simulation with varying body sizes, poses, and types of garments. We then learn a force dynamics model for action planning to ensure safety. Due to limitations of simulating accurate force data when deformable garments interact with the human body, we learn a force dynamics model directly from real-world data. Our proposed method combines the vision-based policy, trained in simulation, with the force dynamics model, learned in the real world, by solving a constrained optimization problem to infer actions that facilitate the dressing process without applying excessive force on the person. We evaluate our system in simulation and in a real-world human study with 10 participants across 240 dressing trials, showing it greatly outperforms prior baselines. Video demonstrations are available on our project website (https://sites.google.com/view/dressing-fcvp).

Interactive Semantic Map Representation for Skill-based Visual Object Navigation

Authors:Tatiana Zemskova, Aleksei Staroverov, Kirill Muravyev, Dmitry Yudin, Aleksandr Panov
Date:2023-11-07 16:30:12

Visual object navigation using learning methods is one of the key tasks in mobile robotics. This paper introduces a new representation of a scene semantic map formed during the embodied agent interaction with the indoor environment. It is based on a neural network method that adjusts the weights of the segmentation model with backpropagation of the predicted fusion loss values during inference on a regular (backward) or delayed (forward) image sequence. We have implemented this representation into a full-fledged navigation approach called SkillTron, which can select robot skills from end-to-end policies based on reinforcement learning and classic map-based planning methods. The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation. We conducted intensive experiments with the proposed approach in the Habitat environment, which showed a significant superiority in navigation quality metrics compared to state-of-the-art approaches. The developed code and used custom datasets are publicly available at github.com/AIRI-Institute/skill-fusion.

Hypothesis Network Planned Exploration for Rapid Meta-Reinforcement Learning Adaptation

Authors:Maxwell Joseph Jacobson, Yexiang Xue
Date:2023-11-07 03:53:52

Meta Reinforcement Learning (Meta RL) trains agents that adapt to fast-changing environments and tasks. Current strategies often lose adaptation efficiency due to the passive nature of model exploration, causing delayed understanding of new transition dynamics. As a result, particularly fast-evolving tasks become impossible to solve. We propose a novel approach, Hypothesis Network Planned Exploration (HyPE), that integrates an active and planned exploration process via the hypothesis network to optimize adaptation speed. HyPE uses a generative hypothesis network to form potential models of state transition dynamics, then eliminates incorrect models through strategically devised experiments. Evaluated on a symbolic version of the Alchemy game, HyPE outpaces baseline methods in adaptation speed and model accuracy, validating its potential in enhancing reinforcement learning adaptation in rapidly evolving settings.

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

Authors:Wenke Xia, Dong Wang, Xincheng Pang, Zhigang Wang, Bin Zhao, Di Hu, Xuelong Li
Date:2023-11-06 03:26:41

Generalizable articulated object manipulation is essential for home-assistant robots. Recent efforts focus on imitation learning from demonstrations or reinforcement learning in simulation. However, due to the prohibitive costs of real-world data collection and precise object simulation, it remains challenging for these works to achieve broad adaptability across diverse articulated objects. Recently, many works have tried to utilize the strong in-context learning ability of Large Language Models (LLMs) to achieve generalizable robotic manipulation, but most of this research focuses on high-level task planning, sidelining low-level robotic control. In this work, building on the idea that the kinematic structure of the object determines how we can manipulate it, we propose a kinematic-aware prompting framework that prompts LLMs with kinematic knowledge of objects to generate low-level motion trajectory waypoints, supporting various object manipulation. To effectively prompt LLMs with the kinematic structure of different objects, we design a unified kinematic knowledge parser, which represents various articulated objects as a unified textual description containing kinematic joints and contact location. Building upon this unified description, a kinematic-aware planner model is proposed to generate precise 3D manipulation waypoints via a designed kinematic-aware chain-of-thoughts prompting method. Our evaluation spanned 48 instances across 16 distinct categories, revealing that our framework not only outperforms traditional methods on 8 seen categories but also shows a powerful zero-shot capability for 8 unseen articulated object categories. Moreover, the real-world experiments on 7 different object categories prove our framework's adaptability in practical scenarios. Code is released at https://github.com/GeWu-Lab/LLM_articulated_object_manipulation/tree/main.

RDE: A Hybrid Policy Framework for Multi-Agent Path Finding Problem

Authors:Jianqi Gao, Yanjie Li, Xiaoqing Yang, Mingshan Tan
Date:2023-11-03 05:52:40

Multi-agent path finding (MAPF) is an abstract model for the navigation of multiple robots in warehouse automation, where multiple robots plan collision-free paths from the start to goal positions. Reinforcement learning (RL) has been employed to develop partially observable distributed MAPF policies that can be scaled to any number of agents. However, RL-based MAPF policies often get agents stuck in deadlock due to warehouse automation's dense and structured obstacles. This paper proposes a novel hybrid MAPF policy, RDE, based on switching among the RL-based MAPF policy, the Distance heat map (DHM)-based policy and the Escape policy. The RL-based policy is used for coordination among agents. In contrast, when no other agents are in the agent's field of view, it can get the next action by querying the DHM. The escape policy that randomly selects valid actions can help agents escape the deadlock. We conduct simulations on warehouse-like structured grid maps using state-of-the-art RL-based MAPF policies (DHC and DCC), which show that RDE can significantly improve their performance.
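
The switching logic described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: it assumes a 4-connected grid, builds the distance heat map with breadth-first search from the goal, takes the RL policy as an arbitrary callable, and expects the caller to supply the deadlock flag. All function names and the toy grid are hypothetical.

```python
import random
from collections import deque

MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # 4-connected (row, col) moves

def distance_heat_map(grid, goal):
    """BFS distance from every reachable free cell to `goal` (1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def valid_moves(grid, pos):
    """Moves from `pos` that stay on the map and avoid obstacles."""
    rows, cols = len(grid), len(grid[0])
    return [(dr, dc) for dr, dc in MOVES
            if 0 <= pos[0] + dr < rows and 0 <= pos[1] + dc < cols
            and grid[pos[0] + dr][pos[1] + dc] == 0]

def hybrid_action(grid, pos, dhm, others_in_view, deadlocked, rl_policy, rng=random):
    """Pick a move by switching among escape, RL, and DHM policies."""
    moves = valid_moves(grid, pos)
    if not moves:
        return (0, 0)               # nowhere to go: stay in place
    if deadlocked:
        return rng.choice(moves)    # escape policy: random valid action
    if others_in_view:
        return rl_policy(pos)       # RL policy handles coordination
    # No other agent in view: follow the heat map downhill toward the goal.
    return min(moves, key=lambda m: dhm.get((pos[0] + m[0], pos[1] + m[1]),
                                            float("inf")))

# Hypothetical 3x3 map with one obstacle in the centre.
grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
dhm = distance_heat_map(grid, goal=(2, 2))
```

In this sketch the heat map is computed once per goal, so querying it at run time is a dictionary lookup rather than a policy evaluation.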

RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation

Authors:Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, Chuang Gan
Date:2023-11-02 17:59:21

We present RoboGen, a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation. RoboGen leverages the latest advancements in foundation and generative models. Instead of directly using or adapting these models to produce policies or low-level actions, we advocate for a generative scheme, which uses these models to automatically generate diversified tasks, scenes, and training supervisions, thereby scaling up robotic skill learning with minimal human supervision. Our approach equips a robotic agent with a self-guided propose-generate-learn cycle: the agent first proposes interesting tasks and skills to develop, and then generates corresponding simulation environments by populating pertinent objects and assets with proper spatial configurations. Afterwards, the agent decomposes the proposed high-level task into sub-tasks, selects the optimal learning approach (reinforcement learning, motion planning, or trajectory optimization), generates required training supervision, and then learns policies to acquire the proposed skill. Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer them to the field of robotics. Our fully generative pipeline can be queried repeatedly, producing an endless stream of skill demonstrations associated with diverse tasks and environments.

DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing

Authors:Vint Lee, Pieter Abbeel, Youngwoon Lee
Date:2023-11-02 17:57:38

Model-based reinforcement learning (MBRL) has gained much attention for its ability to learn complex behaviors in a sample-efficient way: planning actions by generating imaginary trajectories with predicted rewards. Despite its success, we found that, surprisingly, reward prediction is often a bottleneck of MBRL, especially for sparse rewards that are challenging (or even ambiguous) to predict. Motivated by the intuition that humans can learn from rough reward estimates, we propose a simple yet effective reward smoothing approach, DreamSmooth, which learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep. We empirically show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks both in sample efficiency and final performance without losing performance on common benchmarks, such as DeepMind Control Suite and Atari benchmarks.
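
To make "temporally-smoothed reward" concrete, here is a minimal sketch of one way an episode's reward sequence could be smoothed before it is used as a prediction target. The abstract does not specify the kernel, so the Gaussian kernel, the `sigma` and `radius` parameters, and the boundary clamping are assumptions made for illustration; the function name is hypothetical.

```python
import math

def gaussian_smooth_rewards(rewards, sigma=1.0, radius=3):
    """Spread each reward over neighbouring timesteps with a Gaussian kernel.

    Weights are normalised and out-of-range timesteps are clamped to the
    episode boundaries, so the total reward of the episode is preserved.
    """
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(weights)
    weights = [w / total for w in weights]

    horizon = len(rewards)
    smoothed = [0.0] * horizon
    for t, r in enumerate(rewards):
        if r == 0:
            continue  # nothing to spread for zero rewards
        for k, w in zip(offsets, weights):
            j = min(max(t + k, 0), horizon - 1)  # clamp to the episode
            smoothed[j] += w * r
    return smoothed

# Hypothetical sparse-reward episode: a single reward at timestep 4.
sparse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
smooth = gaussian_smooth_rewards(sparse, sigma=1.0, radius=3)
```

The smoothed target is nonzero on several timesteps around the original reward, so a reward model that is off by a step or two is penalised less than it would be against the exact sparse signal.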

A Multi-Agent Reinforcement Learning Framework for Evaluating the U.S. Ending the HIV Epidemic Plan

Authors:Dinesh Sharma, Ankit Shah, Chaitra Gopalappa
Date:2023-11-01 21:19:35

Human immunodeficiency virus (HIV) is a major public health concern in the United States, with about 1.2 million people living with HIV and 35,000 newly infected each year. There are considerable geographical disparities in HIV burden and care access across the U.S. The 2019 Ending the HIV Epidemic (EHE) initiative aims to reduce new infections by 90% by 2030, by improving coverage of diagnoses, treatment, and prevention interventions and prioritizing jurisdictions with high HIV prevalence. Identifying optimal scale-up of intervention combinations will help inform resource allocation. Existing HIV decision analytic models either evaluate specific cities or the overall national population, thus overlooking jurisdictional interactions or differences. In this paper, we propose a multi-agent reinforcement learning (MARL) model, that enables jurisdiction-specific decision analyses but in an environment with cross-jurisdictional epidemiological interactions. In experimental analyses, conducted on jurisdictions within California and Florida, optimal policies from MARL were significantly different than those generated from single-agent RL, highlighting the influence of jurisdictional variations and interactions. By using comprehensive modeling of HIV and formulations of state space, action space, and reward functions, this work helps demonstrate the strengths and applicability of MARL for informing public health policies, and provides a framework for expanding to the national-level to inform the EHE.

Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents

Authors:Yang Deng, Wenxuan Zhang, Wai Lam, See-Kiong Ng, Tat-Seng Chua
Date:2023-11-01 03:20:16

Proactive dialogues serve as a practical yet challenging dialogue problem in the era of large language models (LLMs), where the dialogue policy planning is the key to improving the proactivity of LLMs. Most existing studies enable the dialogue policy planning of LLMs using various prompting schemes or iteratively enhance this capability in handling the given case with verbal AI feedback. However, these approaches are either bounded by the policy planning capability of the frozen LLMs or hard to be transferred to new cases. In this work, we introduce a new dialogue policy planning paradigm to strategize LLMs for proactive dialogue problems with a tunable language model plug-in as a plug-and-play dialogue policy planner, named PPDPP. Specifically, we develop a novel training framework to facilitate supervised fine-tuning over available human-annotated data as well as reinforcement learning from goal-oriented AI feedback with dynamic interaction data collected by the LLM-based self-play simulation. In this manner, the LLM-powered dialogue agent can not only be generalized to different cases after the training, but also be applicable to different applications by just substituting the learned plug-in. In addition, we propose to evaluate the policy planning capability of dialogue systems under the interactive setting. Experimental results demonstrate that PPDPP consistently and substantially outperforms existing approaches on three different proactive dialogue applications, including negotiation, emotional support, and tutoring dialogues.

Active Neural Topological Mapping for Multi-Agent Exploration

Authors:Xinyi Yang, Yuxiang Yang, Chao Yu, Jiayu Chen, Jingchen Yu, Haibing Ren, Huazhong Yang, Yu Wang
Date:2023-11-01 03:06:14

This paper investigates the multi-agent cooperative exploration problem, which requires multiple agents to explore an unseen environment via sensory signals in a limited time. A popular approach to exploration tasks is to combine active mapping with planning. Metric maps capture the details of the spatial representation, but incur high communication traffic and may vary significantly between scenarios, resulting in inferior generalization. Topological maps are a promising alternative as they consist only of nodes and edges with abstract but essential information and are less influenced by the scene structures. However, most existing topology-based exploration tasks utilize classical methods for planning, which are time-consuming and sub-optimal due to their handcrafted design. Deep reinforcement learning (DRL) has shown great potential for learning (near) optimal policies through fast end-to-end inference. In this paper, we propose Multi-Agent Neural Topological Mapping (MANTM) to improve exploration efficiency and generalization for multi-agent exploration tasks. MANTM mainly comprises a Topological Mapper and a novel RL-based Hierarchical Topological Planner (HTP). The Topological Mapper employs a visual encoder and distance-based heuristics to construct a graph containing main nodes and their corresponding ghost nodes. The HTP leverages graph neural networks to capture correlations between agents and graph nodes in a coarse-to-fine manner for effective global goal selection. Extensive experiments conducted in a physically-realistic simulator, Habitat, demonstrate that MANTM reduces the steps by at least 26.40% over planning-based baselines and by at least 7.63% over RL-based competitors in unseen scenarios.

Safe multi-agent motion planning under uncertainty for drones using filtered reinforcement learning

Authors:Sleiman Safaoui, Abraham P. Vinod, Ankush Chakrabarty, Rien Quirynen, Nobuyuki Yoshikawa, Stefano Di Cairano
Date:2023-10-31 18:09:26

We consider the problem of safe multi-agent motion planning for drones in uncertain, cluttered workspaces. For this problem, we present a tractable motion planner that builds upon the strengths of reinforcement learning and constrained-control-based trajectory planning. First, we use single-agent reinforcement learning to learn motion plans from data that reach the target but may not be collision-free. Next, we use a convex optimization, chance constraints, and set-based methods for constrained control to ensure safety, despite the uncertainty in the workspace, agent motion, and sensing. The proposed approach can handle state and control constraints on the agents, and enforce collision avoidance among themselves and with static obstacles in the workspace with high probability. The proposed approach yields a safe, real-time implementable, multi-agent motion planner that is simpler to train than methods based solely on learning. Numerical simulations and experiments show the efficacy of the approach.

Beyond Average Return in Markov Decision Processes

Authors:Alexandre Marthe, Aurélien Garivier, Claire Vernade
Date:2023-10-31 08:36:41

What are the functionals of the reward that can be computed and optimized exactly in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic Programming (DP) can only handle these operations efficiently for certain classes of statistics. We summarize the characterization of these classes for policy evaluation, and give a new answer for the planning problem. Interestingly, we prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL). DistRL permits, however, to evaluate other functionals approximately. We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations. These results contribute to advancing the theory of Markov Decision Processes by examining overall characteristics of the return, and particularly risk-conscious strategies.

GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

Authors:Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana
Date:2023-10-30 21:19:52

Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for fine-tuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.

Remember what you did so you know what to do next

Authors:Manuel R. Ciosici, Alex Hedges, Yash Kankanampati, Justin Martin, Marjorie Freedman, Ralph Weischedel
Date:2023-10-30 19:29:00

We explore using a moderately sized large language model (GPT-J 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit (Wang et al., 2022) compared to reinforcement learning. Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement-learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6B-parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when it incorporates GPT-3.5 Turbo, which has 29 times more parameters than GPT-J.

AdapINT: A Flexible and Adaptive In-Band Network Telemetry System Based on Deep Reinforcement Learning

Authors:Penghui Zhang, Hua Zhang, Yibo Pi, Zijian Cao, Jingyu Wang, Jianxin Liao
Date:2023-10-30 08:02:35

In-band Network Telemetry (INT) has emerged as a promising network measurement technology. However, existing network telemetry systems lack the flexibility to meet diverse telemetry requirements and are also difficult to adapt to dynamic network environments. In this paper, we propose AdapINT, a versatile and adaptive in-band network telemetry framework assisted by dual-timescale probes, including long-period auxiliary probes (APs) and short-period dynamic probes (DPs). Technically, the APs collect basic network status information, which is used for the path planning of DPs. To achieve full network coverage, we propose an auxiliary probes path deployment (APPD) algorithm based on the Depth-First-Search (DFS). The DPs collect specific network information for telemetry tasks. To ensure that the DPs can meet diverse telemetry requirements and adapt to dynamic network environments, we apply the deep reinforcement learning (DRL) technique and transfer learning method to design the dynamic probes path deployment (DPPD) algorithm. The evaluation results show that AdapINT can redesign the telemetry system according to telemetry requirements and network environments. AdapINT can reduce telemetry latency by 75% in online games and video conferencing scenarios. For overhead-aware networks, AdapINT can reduce control overheads by 34% in cloud computing services.
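
As a rough illustration of how depth-first search can yield a probe path with full network coverage, the sketch below walks a network graph from a start node and records every hop, including the hops taken when backtracking. This is only a generic DFS walk under assumed inputs (an adjacency-list dictionary for a connected network); it is not the APPD algorithm itself, and the function name and toy topology are hypothetical.

```python
def dfs_probe_path(adj, start):
    """Return a hop-by-hop walk from `start` that visits every reachable node.

    `adj` maps each node to a list of neighbours. Backtracking hops are
    recorded explicitly, so consecutive entries are always adjacent nodes.
    """
    visited = {start}
    path = [start]

    def visit(u):
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                path.append(v)   # forward hop u -> v
                visit(v)
                path.append(u)   # backtrack hop v -> u

    visit(start)
    return path

# Hypothetical 4-switch topology.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
probe_path = dfs_probe_path(adj, start=0)
print(probe_path)
```

The recursive form is fine for small topologies; a large network would call for an explicit stack to avoid Python's recursion limit.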

Spacecraft Autonomous Decision-Planning for Collision Avoidance: a Reinforcement Learning Approach

Authors:Nicolas Bourriez, Adrien Loizeau, Adam F. Abdin
Date:2023-10-29 10:15:33

The space environment around the Earth is becoming increasingly populated by both active spacecraft and space debris. To avoid potential collision events, significant improvements in Space Situational Awareness (SSA) activities and Collision Avoidance (CA) technologies are allowing the tracking and maneuvering of spacecraft with increasing accuracy and reliability. However, these procedures still largely involve a high level of human intervention to make the necessary decisions. For an increasingly complex space environment, this decision-making strategy is not likely to be sustainable. Therefore, it is important to successfully introduce higher levels of automation for key Space Traffic Management (STM) processes to ensure the level of reliability needed for navigating a large number of spacecraft. These processes range from collision risk detection to the identification of the appropriate action to take and the execution of avoidance maneuvers. This work proposes an implementation of autonomous CA decision-making capabilities on spacecraft based on Reinforcement Learning (RL) techniques. A novel methodology based on a Partially Observable Markov Decision Process (POMDP) framework is developed to train the Artificial Intelligence (AI) system on board the spacecraft, considering epistemic and aleatory uncertainties. The proposed framework considers imperfect monitoring information about the status of the debris in orbit and allows the AI system to effectively learn stochastic policies to perform accurate Collision Avoidance Maneuvers (CAMs). The objective is to successfully delegate the decision-making process for autonomously implementing a CAM to the spacecraft without human intervention. This approach would allow for a faster response in the decision-making process and for highly decentralized operations.

Learning Extrinsic Dexterity with Parameterized Manipulation Primitives

Authors:Shih-Min Yang, Martin Magnusson, Johannes A. Stork, Todor Stoyanov
Date:2023-10-26 21:28:23

Many practically relevant robot grasping problems feature a target object for which all grasps are occluded, e.g., by the environment. Single-shot grasp planning invariably fails in such scenarios. Instead, it is necessary to first manipulate the object into a configuration that affords a grasp. We solve this problem by learning a sequence of actions that utilize the environment to change the object's pose. Concretely, we employ hierarchical reinforcement learning to combine a sequence of learned parameterized manipulation primitives. By learning the low-level manipulation policies, our approach can control the object's state through exploiting interactions between the object, the gripper, and the environment. Designing such a complex behavior analytically would be infeasible under uncontrolled conditions, as an analytic approach requires accurate physical modeling of the interaction and contact dynamics. In contrast, we learn a hierarchical policy model that operates directly on depth perception data, without the need for object detection, pose estimation, or manual design of controllers. We evaluate our approach on picking box-shaped objects of various weight, shape, and friction properties from a constrained table-top workspace. Our method transfers to a real robot and is able to successfully complete the object picking task in 98% of experimental trials. Supplementary information and videos can be found at https://shihminyang.github.io/ED-PMP/.

Optimal Robotic Assembly Sequence Planning: A Sequential Decision-Making Approach

Authors:Kartik Nagpal, Negar Mehr
Date:2023-10-26 03:01:14

The optimal robot assembly planning problem is challenging due to the necessity of finding the optimal solution amongst an exponentially vast number of possible plans, all while satisfying a selection of constraints. Traditionally, robotic assembly planning problems have been solved using heuristics, but these methods are specific to a given objective structure or set of problem parameters. In this paper, we propose a novel approach to robotic assembly planning that poses assembly sequencing as a sequential decision making problem, enabling us to harness methods that far outperform the state-of-the-art. We formulate the problem as a Markov Decision Process (MDP) and utilize Dynamic Programming (DP) to find optimal assembly policies for moderately sized structures. We further expand our framework to exploit the deterministic nature of assembly planning and introduce a class of optimal Graph Exploration Assembly Planners (GEAPs). For larger structures, we show how Reinforcement Learning (RL) enables us to learn policies that generate high reward assembly sequences. We evaluate our approach on a variety of robotic assembly problems, such as the assembly of the Hubble Space Telescope, the International Space Station, and the James Webb Space Telescope. We further showcase how our DP, GEAP, and RL implementations are capable of finding optimal solutions under a variety of different objective functions and how our formulation allows us to translate precedence constraints to branch pruning and thus further improve performance. We have published our code at https://github.com/labicon/ORASP-Code.
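
The formulation above can be illustrated with a small sketch, which is not taken from the authors' code. It treats assembly sequencing as a deterministic MDP whose state is the set of parts already installed (stored as a bitmask) and whose actions install one more part, then uses memoized dynamic programming to find a minimum-cost sequence. Precedence constraints prune branches before they are expanded. The function name, the cost function, and the three-part toy problem are hypothetical.

```python
from functools import lru_cache

def optimal_assembly(n_parts, precedence, step_cost):
    """Return (minimum total cost, sequence of parts) for a full assembly.

    precedence[p] lists the parts that must be installed before part p.
    step_cost(mask, p) is the cost of installing p when the installed set
    is `mask`. The sequence is None if the constraints are unsatisfiable.
    """
    full = (1 << n_parts) - 1

    @lru_cache(maxsize=None)
    def best(mask):
        if mask == full:
            return 0.0, ()
        best_cost, best_seq = float("inf"), None
        for part in range(n_parts):
            if mask & (1 << part):
                continue  # already installed
            if any(not mask & (1 << req) for req in precedence.get(part, ())):
                continue  # precedence violated: prune this branch
            tail_cost, tail_seq = best(mask | (1 << part))
            cost = step_cost(mask, part) + tail_cost
            if cost < best_cost:
                best_cost, best_seq = cost, (part,) + tail_seq
        return best_cost, best_seq

    return best(0)

# Hypothetical toy problem: part 2 must follow part 0, and installing a
# part costs its weight times its position in the sequence.
weights = [3, 1, 2]
precedence = {2: [0]}

def step_cost(mask, part):
    return weights[part] * (bin(mask).count("1") + 1)

result = optimal_assembly(3, precedence, step_cost)
print(result)
```

The table has one entry per subset of parts, so this exact approach is only practical for moderately sized structures, which matches the abstract's motivation for moving to RL on larger ones.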

Conditionally Combining Robot Skills using Large Language Models

Authors:K. R. Zentner, Ryan Julian, Brian Ichter, Gaurav S. Sukhatme
Date:2023-10-25 21:46:34

This paper combines two contributions. First, we introduce an extension of the Meta-World benchmark, which we call "Language-World," which allows a large language model to operate in a simulated robotic environment using semi-structured natural language queries and scripted skills described using natural language. By using the same set of tasks as Meta-World, Language-World results can be easily compared to Meta-World results, allowing for a point of comparison between recent methods using Large Language Models (LLMs) and those using Deep Reinforcement Learning. Second, we introduce a method we call Plan Conditioned Behavioral Cloning (PCBC), that allows finetuning the behavior of high-level plans using end-to-end demonstrations. Using Language-World, we show that PCBC is able to achieve strong performance in a variety of few-shot regimes, often achieving task generalization with as little as a single demonstration. We have made Language-World available as open-source software at https://github.com/krzentner/language-world/.

AI Agent as Urban Planner: Steering Stakeholder Dynamics in Urban Planning via Consensus-based Multi-Agent Reinforcement Learning

Authors:Kejiang Qian, Lingjun Mao, Xin Liang, Yimin Ding, Jin Gao, Xinran Wei, Ziyi Guo, Jiajie Li
Date:2023-10-25 17:04:11

In urban planning, land use readjustment plays a pivotal role in aligning land use configurations with the current demands for sustainable urban development. However, present-day urban planning practices face two main issues. Firstly, land use decisions are predominantly dependent on human experts. Besides, while resident engagement in urban planning can promote urban sustainability and livability, it is challenging to reconcile the diverse interests of stakeholders. To address these challenges, we introduce a Consensus-based Multi-Agent Reinforcement Learning framework for real-world land use readjustment. This framework serves participatory urban planning, allowing diverse intelligent agents as stakeholder representatives to vote for preferred land use types. Within this framework, we propose a novel consensus mechanism in reward design to optimize land utilization through collective decision making. To abstract the structure of the complex urban system, the geographic information of cities is transformed into a spatial graph structure and then processed by graph neural networks. Comprehensive experiments on both traditional top-down planning and participatory planning methods from real-world communities indicate that our computational framework enhances global benefits and accommodates diverse interests, leading to improved satisfaction across different demographic groups. By integrating Multi-Agent Reinforcement Learning, our framework ensures that participatory urban planning decisions are more dynamic and adaptive to evolving community needs and provides a robust platform for automating complex real-world urban planning processes.

Multi-Agent Reinforcement Learning-Based UAV Pathfinding for Obstacle Avoidance in Stochastic Environment

Authors:Qizhen Wu, Kexin Liu, Lei Chen, Jinhu Lü
Date:2023-10-25 14:21:22

Traditional methods plan feasible paths for multiple agents in a stochastic environment. However, these methods must re-iterate whenever the environment changes, which leads to high computational complexity, especially for decentralized agents without a centralized planner. Although reinforcement learning provides a plausible solution because it generalizes across different environments, it struggles with the enormous number of agent-environment interactions required in training. Here, we propose a novel centralized training with decentralized execution method based on multi-agent reinforcement learning, which is improved based on the idea of model predictive control. In our approach, agents communicate only with the centralized planner to make decentralized decisions online in the stochastic environment. Furthermore, considering the communication constraint with the centralized planner, each agent plans feasible paths through the extended observation, which combines information on neighboring agents based on the distance-weighted mean field approach. Inspired by the rolling optimization approach of model predictive control, we conduct multi-step value convergence in multi-agent reinforcement learning to enhance the training efficiency, which reduces the expensive interactions in convergence. Experimental results from comparison, ablation, and real-robot studies validate the effectiveness and generalization performance of our method.
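
As an illustration of a distance-weighted mean field, the sketch below builds an extended observation by appending a weighted average of neighbor states to the agent's own state. The exponential kernel, the `bandwidth` parameter, and the function name `extended_observation` are assumptions made for this example; the abstract does not specify the weighting actually used.

```python
import math

def extended_observation(own_state, neighbor_states, bandwidth=1.0):
    """Concatenate own state with a distance-weighted mean of neighbor states.

    Assumption: weights decay exponentially with Euclidean distance, so
    nearby agents dominate the summary. Other kernels would work equally well.
    """
    dim = len(own_state)
    if not neighbor_states:
        # No neighbors in range: pad with zeros to keep a fixed-size input.
        return list(own_state) + [0.0] * dim
    weights = [math.exp(-math.dist(own_state, s) / bandwidth)
               for s in neighbor_states]
    total = sum(weights)
    mean_field = [
        sum(w * s[k] for w, s in zip(weights, neighbor_states)) / total
        for k in range(dim)
    ]
    return list(own_state) + mean_field

# The nearby neighbor at x=1 dominates the far one at x=10.
obs = extended_observation((0.0, 0.0), [(1.0, 0.0), (10.0, 0.0)])
print(obs)
```

Because the summary has a fixed size regardless of how many neighbors are present, the same policy network can be used as the number of agents changes.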

Reinforcement learning based local path planning for mobile robot

Authors:Mehmet Gok, Mehmet Tekerek, Hamza Aydemir
Date:2023-10-24 18:26:25

Different methods are used for a mobile robot to go to a specific target location. These methods work in different ways for online and offline scenarios. In the offline scenario, an environment map is created once, and fixed path planning is made on this map to reach the target. Path planning algorithms such as A* and RRT (Rapidly-Exploring Random Tree) are examples of offline methods. The most obvious drawback here is the need to re-plan the path when conditions on the loaded map change. On the other hand, in the online scenario, the robot moves dynamically to a given target without using a map, relying on the perceived data coming from the sensors. Approaches such as SFM (Social Force Model) are used in online systems. However, these methods suffer from the requirement of a lot of dynamic sensing data. Thus, it can be said that the need for re-planning and mapping in offline systems and various system design requirements in online systems are central concerns of autonomous mobile robot research. Recently, deep neural network powered Q-Learning methods have been used as an emerging solution to the aforementioned problems in mobile robot navigation. In this study, machine learning algorithms with deep Q-Learning (DQN) and Deep DQN architectures are evaluated for the solution of the problems presented above to realize path planning of an autonomous mobile robot to avoid obstacles.

Finetuning Offline World Models in the Real World

Authors:Yunhai Feng, Nicklas Hansen, Ziyan Xiong, Chandramouli Rajagopalan, Xiaolong Wang
Date:2023-10-24 17:46:12

Reinforcement Learning (RL) is notoriously data-inefficient, which makes training on a real robot difficult. While model-based RL algorithms (world models) improve data-efficiency to some extent, they still require hours or days of interaction to learn skills. Recently, offline RL has been proposed as a framework for training RL policies on pre-existing datasets without any online interaction. However, constraining an algorithm to a fixed dataset induces a state-action distribution shift between training and inference, and limits its applicability to new tasks. In this work, we seek to get the best of both worlds: we consider the problem of pretraining a world model with offline data collected on a real robot, and then finetuning the model on online data collected by planning with the learned model. To mitigate extrapolation errors during online interaction, we propose to regularize the planner at test-time by balancing estimated returns and (epistemic) model uncertainty. We evaluate our method on a variety of visuo-motor control tasks in simulation and on a real robot, and find that our method enables few-shot finetuning to seen and unseen tasks even when offline data is limited. Videos, code, and data are available at https://yunhaifeng.com/FOWM.
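
One common way to balance estimated returns against epistemic uncertainty is to score each candidate plan by the mean of an ensemble of return estimates minus a penalty on their disagreement. The sketch below assumes that form; it is an illustration rather than the authors' planner, and the names `regularized_score`, `select_plan`, and `penalty` are hypothetical.

```python
import statistics

def regularized_score(return_estimates, penalty=1.0):
    """Mean ensemble return minus a penalty on ensemble disagreement.

    Assumption: the spread of an ensemble's estimates stands in for
    epistemic uncertainty about a candidate plan.
    """
    mean = statistics.mean(return_estimates)
    spread = statistics.pstdev(return_estimates)
    return mean - penalty * spread

def select_plan(candidates, penalty=1.0):
    """Pick the candidate whose uncertainty-penalized return is highest.

    candidates: dict mapping a plan name to a list of return estimates,
                one per ensemble member.
    """
    return max(candidates,
               key=lambda name: regularized_score(candidates[name], penalty))

candidates = {
    "cautious": [1.0, 1.1, 0.9],   # modest return, members agree
    "risky": [3.0, -1.0, 2.5],     # higher mean, members disagree
}
print(select_plan(candidates, penalty=1.0))  # prefers the well-understood plan
print(select_plan(candidates, penalty=0.0))  # pure return maximization
```

With the penalty set to zero the planner reduces to greedy return maximization, which is the regime where extrapolation errors on out-of-distribution actions would go unchecked.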

A Review of Reinforcement Learning for Natural Language Processing, and Applications in Healthcare

Authors:Ying Liu, Haozhu Wang, Huixue Zhou, Mingchen Li, Yu Hou, Sicheng Zhou, Fang Wang, Rama Hoetzlein, Rui Zhang
Date:2023-10-23 20:26:15

Reinforcement learning (RL) has emerged as a powerful approach for tackling complex medical decision-making problems such as treatment planning, personalized medicine, and optimizing the scheduling of surgeries and appointments. It has gained significant attention in the field of Natural Language Processing (NLP) due to its ability to learn optimal strategies for tasks such as dialogue systems, machine translation, and question-answering. This paper presents a review of the RL techniques in NLP, highlighting key advancements, challenges, and applications in healthcare. The review begins by visualizing a roadmap of machine learning and its applications in healthcare. It then explores the integration of RL with NLP tasks. We examine dialogue systems where RL enables the learning of conversational strategies, RL-based machine translation models, question-answering systems, text summarization, and information extraction. Additionally, ethical considerations and biases in RL-NLP systems are addressed.

Diversify Question Generation with Retrieval-Augmented Style Transfer

Authors:Qi Gou, Zehua Xia, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li, Nguyen Cam-Tu
Date:2023-10-23 02:27:31

Given a textual passage and an answer, humans are able to ask questions with various expressions, but this ability is still challenging for most question generation (QG) systems. Existing solutions mainly focus on the internal knowledge within the given passage or the semantic word space for diverse content planning. These methods, however, have not considered the potential of external knowledge for expression diversity. To bridge this gap, we propose RAST, a framework for Retrieval-Augmented Style Transfer, where the objective is to utilize the style of diverse templates for question generation. For training RAST, we develop a novel Reinforcement Learning (RL) based approach that maximizes a weighted combination of diversity reward and consistency reward. Here, the consistency reward is computed by a Question-Answering (QA) model, whereas the diversity reward measures how much the final output mimics the retrieved template. Experimental results show that our method outperforms previous diversity-driven baselines on diversity while being comparable in terms of consistency scores. Our code is available at https://github.com/gouqi666/RAST.

Detrive: Imitation Learning with Transformer Detection for End-to-End Autonomous Driving

Authors:Daoming Chen, Ning Wang, Feng Chen, Tony Pipe
Date:2023-10-22 08:07:56

This paper proposes a novel Transformer-based end-to-end autonomous driving model named Detrive. This model addresses the inability of past end-to-end models to detect the position and size of traffic participants. Detrive uses an end-to-end Transformer-based detection model as its perception module; a multi-layer perceptron as its feature fusion network; a recurrent neural network with gated recurrent units for path planning; and two controllers for the vehicle's forward speed and turning angle. The model is trained with an online imitation learning method. In order to obtain a better training set, a reinforcement learning agent that can directly obtain a ground truth bird's-eye view map from the Carla simulator as a perceptual output is used as the teacher for the imitation learning. The trained model is tested on Carla's autonomous driving benchmark. The results show that the Transformer detector based end-to-end model has obvious advantages in dynamic obstacle avoidance compared with the traditional classifier based end-to-end model.

Plan-Guided Reinforcement Learning for Whole-Body Manipulation

Authors:Mengchao Zhang, Jose Barreiros, Aykut Ozgun Onol
Date:2023-10-18 18:57:42

Synthesizing complex whole-body manipulation behaviors has fundamental challenges due to the rapidly growing combinatorics inherent to contact interaction planning. While model-based methods have shown promising results in solving long-horizon manipulation tasks, they often work under strict assumptions, such as known model parameters, oracular observation of the environment state, and simplified dynamics, resulting in plans that cannot easily transfer to hardware. Learning-based approaches, such as imitation learning (IL) and reinforcement learning (RL), have been shown to be robust when operating over in-distribution states; however, they need heavy human supervision. Specifically, model-free RL requires a tedious reward-shaping process. IL methods, on the other hand, rely on human demonstrations that involve advanced teleoperation methods. In this work, we propose a plan-guided reinforcement learning (PGRL) framework to combine the advantages of model-based planning and reinforcement learning. Our method requires minimal human supervision because it relies on plans generated by model-based planners to guide the exploration in RL. In exchange, RL derives a more robust policy thanks to domain randomization. We test this approach on a whole-body manipulation task on Punyo, an upper-body humanoid robot with compliant, air-filled arm coverings, to pivot and lift a large box. Our preliminary results indicate that the proposed methodology is promising to address challenges that remain difficult for either model- or learning-based strategies alone.

Neural Packing: from Visual Sensing to Reinforcement Learning

Authors:Juzhan Xu, Minglun Gong, Hao Zhang, Hui Huang, Ruizhen Hu
Date:2023-10-17 02:42:54

We present a novel learning framework to solve the transport-and-packing (TAP) problem in 3D. It constitutes a full solution pipeline from partial observations of input objects via RGBD sensing and recognition to final box placement, via robotic motion planning, to arrive at a compact packing in a target container. The technical core of our method is a neural network for TAP, trained via reinforcement learning (RL), to solve the NP-hard combinatorial optimization problem. Our network simultaneously selects an object to pack and determines the final packing location, based on a judicious encoding of the continuously evolving states of partially observed source objects and available spaces in the target container, using separate encoders both enabled with attention mechanisms. The encoded feature vectors are employed to compute the matching scores and feasibility masks of different pairings of box selection and available space configuration for packing strategy optimization. Extensive experiments, including ablation studies and physical packing execution by a real robot (Universal Robot UR5e), are conducted to evaluate our method in terms of its design choices, scalability, generalizability, and comparisons to baselines, including the most recent RL-based TAP solution. We also contribute the first benchmark for TAP which covers a variety of input settings and difficulty levels.
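
The role of matching scores and feasibility masks can be illustrated with a masked argmax: among all (object, space) pairings, only feasible ones compete, and the highest-scoring one is chosen. This is a minimal sketch of that selection step only; the network that produces the scores is not modeled, and `choose_placement` and the toy matrices are hypothetical.

```python
def choose_placement(scores, feasible):
    """Return the (object index, space index) with the best feasible score.

    scores[i][j]:   matching score for packing object i into space j
                    (in the paper these come from learned encoders;
                    here they are plain numbers).
    feasible[i][j]: True if object i can physically go into space j.
    Returns None when no pairing is feasible.
    """
    best = None
    for i, row in enumerate(scores):
        for j, score in enumerate(row):
            if not feasible[i][j]:
                continue  # masked out: never selected, whatever its score
            if best is None or score > best[0]:
                best = (score, i, j)
    return None if best is None else (best[1], best[2])

scores = [[0.9, 0.2],
          [0.4, 0.7]]
feasible = [[False, True],
            [True, True]]
print(choose_placement(scores, feasible))
```

Note that the top raw score (0.9) is ignored because its pairing is infeasible, which is the point of applying the mask before selection.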

Reaching the Limit in Autonomous Racing: Optimal Control versus Reinforcement Learning

Authors:Yunlong Song, Angel Romero, Matthias Mueller, Vladlen Koltun, Davide Scaramuzza
Date:2023-10-17 02:40:27

A central question in robotics is how to design a control system for an agile mobile robot. This paper studies this question systematically, focusing on a challenging setting: autonomous drone racing. We show that a neural network controller trained with reinforcement learning (RL) outperformed optimal control (OC) methods in this setting. We then investigated which fundamental factors have contributed to the success of RL or have limited OC. Our study indicates that the fundamental advantage of RL over OC is not that it optimizes its objective better but that it optimizes a better objective. OC decomposes the problem into planning and control with an explicit intermediate representation, such as a trajectory, that serves as an interface. This decomposition limits the range of behaviors that can be expressed by the controller, leading to inferior control performance when facing unmodeled effects. In contrast, RL can directly optimize a task-level objective and can leverage domain randomization to cope with model uncertainty, allowing the discovery of more robust control responses. Our findings allowed us to push an agile drone to its maximum performance, achieving a peak acceleration greater than 12 times the gravitational acceleration and a peak velocity of 108 kilometers per hour. Our policy achieved superhuman control within minutes of training on a standard workstation. This work presents a milestone in agile robotics and sheds light on the role of RL and OC in robot control.

Unlocking Metasurface Practicality for B5G Networks: AI-assisted RIS Planning

Authors:Guillermo Encinas-Lago, Antonio Albanese, Vincenzo Sciancalepore, Marco Di Renzo, Xavier Costa-Pérez
Date:2023-10-16 12:14:42

The advent of reconfigurable intelligent surfaces (RISs) brings along significant improvements for wireless technology on the verge of beyond-fifth-generation networks (B5G). The proven flexibility in influencing the propagation environment opens up the possibility of programmatically altering the wireless channel to the advantage of network designers, enabling the exploitation of higher-frequency bands for superior throughput while overcoming the challenging electromagnetic (EM) propagation properties at these frequency bands. However, RISs are not magic bullets. Their employment comes with significant complexity, requiring ad-hoc deployments and management operations to come to fruition. In this paper, we tackle the open problem of bringing RISs to the field, focusing on areas with little or no coverage. In fact, we present a first-of-its-kind deep reinforcement learning (DRL) solution, dubbed D-RISA, which trains a DRL agent and, in turn, obtains an optimal RIS deployment. We validate our framework in the indoor scenario of the Rennes railway station in France, assessing the performance of our algorithm against state-of-the-art (SOA) approaches. Our benchmarks showcase better coverage, i.e., 10-dB increase in minimum signal-to-noise ratio (SNR), at lower computational time (up to -25 percent) while improving scalability towards denser network deployments.

Theory of Mind for Multi-Agent Collaboration via Large Language Models

Authors:Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, Katia Sycara
Date:2023-10-16 07:51:19

While Large Language Models (LLMs) have demonstrated impressive accomplishments in both reasoning and planning, their abilities in multi-agent collaborations remain largely unexplored. This study evaluates LLM-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks, comparing their performance with Multi-Agent Reinforcement Learning (MARL) and planning-based baselines. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents. Our results reveal limitations in LLM-based agents' planning optimization due to systematic failures in managing long-horizon contexts and hallucination about the task state. We explore the use of explicit belief state representations to mitigate these issues, finding that it enhances task performance and the accuracy of ToM inferences for LLM-based agents.

Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels

Authors:Thomas Jiralerspong, Flemming Kondrup, Doina Precup, Khimya Khetarpal
Date:2023-10-16 01:13:26

The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and thus enables sample-efficient learning. This becomes particularly beneficial in complex environments with high-dimensional state spaces such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transition dynamics at an abstract level and training a world model on such transitions. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method not only builds world models with longer horizons but also plans with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain.

AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents

Authors:Jake Grigsby, Linxi Fan, Yuke Zhu
Date:2023-10-15 22:20:39

We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is scalable and applicable to a wide range of problems, and we demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a multi-goal hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments.
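
For readers unfamiliar with hindsight relabeling, the sketch below shows the generic single-goal idea: a trajectory that missed its intended goal is rewritten as if the state it actually reached had been the goal, so sparse-reward data still carries a learning signal. This is not AMAGO's multi-goal scheme; the function name, tuple layout, and `goal_reached` predicate are hypothetical.

```python
def relabel_with_hindsight(trajectory, goal_reached):
    """Relabel a trajectory using its final achieved state as the goal.

    trajectory:   list of (state, action, next_state) tuples.
    goal_reached: predicate (state, goal) -> bool defining sparse success.
    Returns a list of (state, action, next_state, goal, reward) tuples.
    """
    achieved_goal = trajectory[-1][2]  # where the agent actually ended up
    relabeled = []
    for state, action, next_state in trajectory:
        # Sparse reward: 1 only on steps that land on the relabeled goal.
        reward = 1.0 if goal_reached(next_state, achieved_goal) else 0.0
        relabeled.append((state, action, next_state, achieved_goal, reward))
    return relabeled

# A two-step walk on a number line that ends at state 2.
walk = [(0, "right", 1), (1, "right", 2)]
relabeled = relabel_with_hindsight(walk, lambda s, g: s == g)
print(relabeled)
```

Because relabeling changes the goal after the fact, the resulting data is off-policy with respect to the relabeled goal, which is one reason such schemes pair naturally with off-policy learning.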

Reinforcement Learning for Reduced-order Models of Legged Robots

Authors:Yu-Ming Chen, Hien Bui, Michael Posa
Date:2023-10-15 16:13:16

Model-based approaches for planning and control for bipedal locomotion have a long history of success. They can provide stability and safety guarantees while being effective in accomplishing many locomotion tasks. Model-free reinforcement learning, on the other hand, has gained much popularity in recent years due to computational advancements. It can achieve high performance in specific tasks, but it lacks physical interpretability and flexibility in re-purposing the policy for a different set of tasks. For instance, we can initially train a neural network (NN) policy using velocity commands as inputs. However, to handle new task commands like desired hand or footstep locations at a desired walking velocity, we must retrain a new NN policy. In this work, we attempt to bridge the gap between these two bodies of work on a bipedal platform. We formulate a model-based reinforcement learning problem to learn a reduced-order model (ROM) within a model predictive control (MPC) framework. Results show a 49% improvement in viable task region size and a 21% reduction in motor torque cost. All videos and code are available at https://sites.google.com/view/ymchen/research/rl-for-roms.

Enhancing Task Performance of Learned Simplified Models via Reinforcement Learning

Authors:Hien Bui, Michael Posa
Date:2023-10-15 03:01:14

In contact-rich tasks, the hybrid, multi-modal nature of contact dynamics poses great challenges in model representation, planning, and control. Recent efforts have attempted to address these challenges via data-driven methods, learning dynamical models in combination with model predictive control. Those methods, while effective, rely solely on minimizing forward prediction errors in the hope of achieving better task performance with MPC controllers. This weak correlation can result in data inefficiency as well as limitations to overall performance. In response, we propose a novel strategy: using a policy gradient algorithm to find a simplified dynamics model that explicitly maximizes task performance. Specifically, we parameterize the stochastic policy as the perturbed output of the MPC controller, so that the learned model representation is directly associated with the policy and hence with task performance. We apply the proposed method to contact-rich tasks where a three-fingered robotic hand manipulates previously unknown objects. Our method significantly enhances task success rate by up to 15% in manipulating diverse objects compared to the existing method while sustaining data efficiency. Our method can solve some tasks with success rates of 70% or higher using under 30 minutes of data. All videos and code are available at https://sites.google.com/view/lcs-rl.

Adaptive Online Replanning with Diffusion Models

Authors:Siyuan Zhou, Yilun Du, Shun Zhang, Mengdi Xu, Yikang Shen, Wei Xiao, Dit-Yan Yeung, Chuang Gan
Date:2023-10-14 17:52:04

Diffusion models have risen as a promising approach to data-driven planning, and have demonstrated impressive robotic control, reinforcement learning, and video planning performance. Given an effective planner, an important question to consider is replanning -- when given plans should be regenerated due to both action execution error and external environment changes. Direct plan execution, without replanning, is problematic as errors from individual actions rapidly accumulate and environments are partially observable and stochastic. Simultaneously, replanning at each timestep incurs a substantial computational cost, and may prevent successful task execution, as different generated plans prevent consistent progress to any particular goal. In this paper, we explore how we may effectively replan with diffusion models. We propose a principled approach to determine when to replan, based on the diffusion model's estimated likelihood of existing generated plans. We further present an approach to replan existing trajectories to ensure that new plans follow the same goal state as the original trajectory, which may efficiently bootstrap off previously generated plans. We illustrate how a combination of our proposed additions significantly improves the performance of diffusion planners leading to 38% gains over past diffusion planning approaches on Maze2D, and further enables the handling of stochastic and long-horizon robotic control tasks. Videos can be found on the anonymous website: https://vis-www.cs.umass.edu/replandiffuser/.

LgTS: Dynamic Task Sampling using LLM-generated sub-goals for Reinforcement Learning Agents

Authors:Yash Shukla, Wenchang Gao, Vasanth Sarathy, Alvaro Velasquez, Robert Wright, Jivko Sinapov
Date:2023-10-14 00:07:03

Recent advancements in the reasoning abilities of Large Language Models (LLMs) have promoted their usage in problems that require high-level planning for robots and artificial agents. However, current techniques that utilize LLMs for such planning tasks make certain key assumptions such as access to datasets that permit finetuning, meticulously engineered prompts that only provide relevant and essential information to the LLM, and most importantly, a deterministic approach to allow execution of the LLM responses either in the form of existing policies or plan operators. In this work, we propose LgTS (LLM-guided Teacher-Student learning), a novel approach that explores the planning abilities of LLMs to provide a graphical representation of the sub-goals to a reinforcement learning (RL) agent that does not have access to the transition dynamics of the environment. The RL agent uses a Teacher-Student learning algorithm to learn a set of successful policies for reaching the goal state from the start state while simultaneously minimizing the number of environmental interactions. Unlike previous methods that utilize LLMs, our approach does not assume access to a proprietary or a fine-tuned LLM, nor does it require pre-trained policies that achieve the sub-goals proposed by the LLM. Through experiments on a gridworld-based DoorKey domain and a search-and-rescue inspired domain, we show that generating a graphical structure of sub-goals helps in learning policies for the LLM-proposed sub-goals and the Teacher-Student learning algorithm minimizes the number of environment interactions when the transition dynamics are unknown.

LLaMA Rider: Spurring Large Language Models to Explore the Open World

Authors:Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu
Date:2023-10-13 07:47:44

Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments, and try to align the LLMs' knowledge with the world conditions. Nonetheless, the capacity of LLMs to continuously acquire environmental knowledge and adapt in an open world remains uncertain. In this paper, we propose an approach to spur LLMs to explore the open world, gather experiences, and learn to improve their task-solving capabilities. In this approach, a multi-round feedback-revision mechanism is utilized to encourage LLMs to actively select appropriate revision actions guided by feedback information from the environment. This facilitates exploration and enhances the model's performance. Besides, we integrate sub-task relabeling to assist LLMs in maintaining consistency in sub-task planning and help the model learn the combinatorial nature between tasks, enabling it to complete a wider range of tasks through training based on the acquired exploration experiences. By evaluation in Minecraft, an open-ended sandbox world, we demonstrate that our approach LLaMA-Rider enhances the efficiency of the LLM in exploring the environment, and effectively improves the LLM's ability to accomplish more tasks through fine-tuning with merely 1.3k instances of collected data, showing minimal training costs compared to the baseline using reinforcement learning.

Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research

Authors:Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, John D. Co-Reyes, Rishabh Agarwal, Rebecca Roelofs, Yao Lu, Nico Montali, Paul Mougin, Zoey Yang, Brandyn White, Aleksandra Faust, Rowan McAllister, Dragomir Anguelov, Benjamin Sapp
Date:2023-10-12 20:49:15

Simulation is an essential tool to develop and benchmark autonomous vehicle planning software in a safe and cost-effective manner. However, realistic simulation requires accurate modeling of nuanced and complex multi-agent interactive behaviors. To address these challenges, we introduce Waymax, a new data-driven simulator for autonomous driving in multi-agent scenes, designed for large-scale simulation and testing. Waymax uses publicly-released, real-world driving data (e.g., the Waymo Open Motion Dataset) to initialize or play back a diverse set of multi-agent simulated scenarios. It runs entirely on hardware accelerators such as TPUs/GPUs and supports in-graph simulation for training, making it suitable for modern large-scale, distributed machine learning workflows. To support online training and evaluation, Waymax includes several learned and hard-coded behavior models that allow for realistic interaction within simulation. To supplement Waymax, we benchmark a suite of popular imitation and reinforcement learning algorithms with ablation studies on different design decisions, where we highlight the effectiveness of routes as guidance for planning agents and the ability of RL to overfit against simulated agents.

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Authors:Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
Date:2023-10-12 17:59:58

Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. When integrated into an embodied agent, existing embodied VLM works either output detailed action sequences at the manipulation level or only provide plans at an abstract level, leaving a gap between high-level planning and real-world manipulation. To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation. Octopus is designed to 1) proficiently comprehend an agent's visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code. To facilitate Octopus model development, we introduce OctoVerse: a suite of environments tailored for benchmarking vision-based code generators on a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games such as Grand Theft Auto (GTA) and Minecraft. To train Octopus, we leverage GPT-4 to control an explorative agent that generates training data, i.e., action blueprints and corresponding executable code. We also collect feedback that enables an enhanced training scheme called Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we demonstrate Octopus's functionality and present compelling results, showing that the proposed RLEF refines the agent's decision-making. By open-sourcing our simulation environments, dataset, and model architecture, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community.

GARL: Genetic Algorithm-Augmented Reinforcement Learning to Detect Violations in Marker-Based Autonomous Landing Systems

Authors:Linfeng Liang, Yao Deng, Kye Morton, Valtteri Kallinen, Alice James, Avishkar Seth, Endrowednes Kuantama, Subhas Mukhopadhyay, Richard Han, Xi Zheng
Date:2023-10-11 10:54:01

Automated Uncrewed Aerial Vehicle (UAV) landing is crucial for autonomous UAV services such as monitoring, surveying, and package delivery. It involves detecting landing targets, perceiving obstacles, planning collision-free paths, and controlling UAV movements for safe landing. Failures can lead to significant losses, necessitating rigorous simulation-based testing for safety. Traditional offline testing methods, limited to static environments and predefined trajectories, may miss violation cases caused by dynamic objects like people and animals. Conversely, online testing methods require extensive training time, which is impractical with limited budgets. To address these issues, we introduce GARL, a framework combining a genetic algorithm (GA) and reinforcement learning (RL) for efficient generation of diverse and real landing system failures within a practical budget. GARL employs GA for exploring various environment setups offline, reducing the complexity of RL's online testing in simulating challenging landing scenarios. Our approach outperforms existing methods by up to 18.35% in violation rate and 58% in diversity metric. We validate most discovered violation types with real-world UAV tests, pioneering the integration of offline and online testing strategies for autonomous systems. This method opens new research directions for online testing, with our code and supplementary material available at https://github.com/lfeng0722/drone_testing/.

Interactive Interior Design Recommendation via Coarse-to-fine Multimodal Reinforcement Learning

Authors:He Zhang, Ying Sun, Weiyu Guo, Yafei Liu, Haonan Lu, Xiaodong Lin, Hui Xiong
Date:2023-10-11 08:18:51

Personalized interior decoration design often incurs high labor costs. Recent efforts in developing intelligent interior design systems have focused on generating decoration designs from textual requirements while neglecting the problem of how to mine homeowners' hidden preferences and choose a proper initial design. To fill this gap, we propose an Interactive Interior Design Recommendation System (IIDRS) based on reinforcement learning (RL). IIDRS aims to find an ideal plan by interacting with the user, who provides feedback on the gap between the recommended plan and their ideal one. To improve decision-making efficiency and effectiveness in large decoration spaces, we propose a Decoration Recommendation Coarse-to-Fine Policy Network (DecorRCFN). Additionally, to enhance generalization in online scenarios, we propose an object-aware feedback generation method that augments model training with diversified and dynamic textual feedback. Extensive experiments on a real-world dataset demonstrate that our method outperforms traditional methods by a large margin in terms of recommendation accuracy. Further user studies demonstrate that our method reaches higher real-world user satisfaction than baseline methods.

COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL

Authors:Xiyao Wang, Ruijie Zheng, Yanchao Sun, Ruonan Jia, Wichayaporn Wongkamjan, Huazhe Xu, Furong Huang
Date:2023-10-11 06:10:07

Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning and real environment exploration using the current policy for dynamics model learning. However, due to the complexity of real-world environments, the learned dynamics model is inevitably imperfect, and its prediction error can mislead policy learning and result in sub-optimal solutions. In this paper, we propose COPlanner, a planning-driven framework for model-based methods that addresses the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. COPlanner leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real environment exploration when choosing actions. Consequently, COPlanner can avoid model-uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model-uncertain regions to actively reduce model error through optimistic real environment exploration. COPlanner is a plug-and-play framework that can be applied to any Dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods are significantly improved when combined with COPlanner.
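
The penalty/bonus use of uncertainty described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' UP-MPC implementation: it assumes a single planning step, assumes uncertainty is measured as the variance of an ensemble's scalar predictions, and the names `ensemble_uncertainty`, `select_action`, `alpha`, and `beta` are hypothetical.

```python
from statistics import pvariance


def ensemble_uncertainty(predictions):
    """Assumed uncertainty measure: population variance of ensemble predictions."""
    return pvariance(predictions)


def select_action(candidates, mode, alpha=1.0, beta=1.0):
    """Pick the candidate action with the best uncertainty-adjusted score.

    candidates: list of (action, predicted_reward, ensemble_predictions)
    mode: "rollout" subtracts uncertainty (conservative model rollouts),
          "explore" adds uncertainty (optimistic real-environment exploration).
    alpha, beta: hypothetical penalty and bonus coefficients.
    """
    if mode not in ("rollout", "explore"):
        raise ValueError("mode must be 'rollout' or 'explore'")

    def score(candidate):
        _, reward, predictions = candidate
        uncertainty = ensemble_uncertainty(predictions)
        if mode == "rollout":
            return reward - alpha * uncertainty
        return reward + beta * uncertainty

    return max(candidates, key=score)[0]


# Two illustrative candidates: one well understood, one with high disagreement.
candidates = [
    ("safe", 1.0, [0.9, 1.0, 1.1]),
    ("risky", 1.2, [0.0, 1.2, 2.4]),
]
rollout_choice = select_action(candidates, "rollout")
explore_choice = select_action(candidates, "explore")
```

With these illustrative numbers the same uncertainty estimate steers the two phases in opposite directions: the rollout phase avoids the action the ensemble disagrees on, while the exploration phase seeks it out.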

EARL: Eye-on-Hand Reinforcement Learner for Dynamic Grasping with Active Pose Estimation

Authors:Baichuan Huang, Jingjin Yu, Siddarth Jain
Date:2023-10-10 16:23:34

In this paper, we explore the dynamic grasping of moving objects through active pose tracking and reinforcement learning for hand-eye coordination systems. Most existing vision-based robotic grasping methods implicitly assume target objects are stationary or moving predictably. Performing grasping of unpredictably moving objects presents a unique set of challenges. For example, a pre-computed robust grasp can become unreachable or unstable as the target object moves, and motion planning must also be adaptive. In this work, we present a new approach, Eye-on-hAnd Reinforcement Learner (EARL), for enabling coupled Eye-on-Hand (EoH) robotic manipulation systems to perform real-time active pose tracking and dynamic grasping of novel objects without explicit motion prediction. EARL readily addresses many thorny issues in automated hand-eye coordination, including fast-tracking of 6D object pose from vision, learning control policy for a robotic arm to track a moving object while keeping the object in the camera's field of view, and performing dynamic grasping. We demonstrate the effectiveness of our approach in extensive experiments validated on multiple commercial robotic arms in both simulations and complex real-world tasks.

Deep reinforcement learning uncovers processes for separating azeotropic mixtures without prior knowledge

Authors:Quirin Göttl, Jonathan Pirnay, Jakob Burger, Dominik G. Grimm
Date:2023-10-10 08:36:21

Process synthesis in chemical engineering is a complex planning problem due to vast search spaces, continuous parameters and the need for generalization. Deep reinforcement learning agents, trained without prior knowledge, have been shown to outperform humans in various complex planning problems in recent years. Existing work on reinforcement learning for flowsheet synthesis shows promising concepts, but focuses on narrow problems in a single chemical system, limiting its practicality. We present a general deep reinforcement learning approach for flowsheet synthesis. We demonstrate the adaptability of a single agent to the general task of separating binary azeotropic mixtures. Without prior knowledge, it learns to craft near-optimal flowsheets for multiple chemical systems, considering different feed compositions and conceptual approaches. On average, the agent can separate more than 99% of the involved materials into pure components, while autonomously learning fundamental process engineering paradigms. This highlights the agent's planning flexibility, an encouraging step toward true generality.

Human-Robot Gym: Benchmarking Reinforcement Learning in Human-Robot Collaboration

Authors:Jakob Thumm, Felix Trost, Matthias Althoff
Date:2023-10-09 23:34:09

Deep reinforcement learning (RL) has shown promising results in robot motion planning with first attempts in human-robot collaboration (HRC). However, a fair comparison of RL approaches in HRC under the constraint of guaranteed safety is yet to be made. We, therefore, present human-robot gym, a benchmark suite for safe RL in HRC. Our benchmark suite provides eight challenging, realistic HRC tasks in a modular simulation framework. Most importantly, human-robot gym includes a safety shield that provably guarantees human safety. We are, thereby, the first to provide a benchmark suite to train RL agents that adhere to the safety specifications of real-world HRC. This bridges a critical gap between theoretic RL research and its real-world deployment. Our evaluation of six tasks led to three key results: (a) the diverse nature of the tasks offered by human-robot gym creates a challenging benchmark for state-of-the-art RL methods, (b) incorporating expert knowledge in RL training in the form of an action-based reward can outperform the expert, and (c) our agents negligibly overfit to training data.

Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

Authors:Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Amos Storkey
Date:2023-10-09 13:47:05

Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, and that UCB methods are myopic and leave it unclear which learned component's ensemble to use for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.

Terrain-Aware Quadrupedal Locomotion via Reinforcement Learning

Authors:Haojie Shi, Qingxu Zhu, Lei Han, Wanchao Chi, Tingguang Li, Max Q. -H. Meng
Date:2023-10-07 03:20:26

In nature, legged animals have developed the ability to adapt to challenging terrains through perception, allowing them to plan safe body and foot trajectories in advance, which leads to safe and energy-efficient locomotion. Inspired by this observation, we present a novel approach to train a Deep Neural Network (DNN) policy that integrates proprioceptive and exteroceptive states with a parameterized trajectory generator for quadruped robots to traverse rough terrains. Our key idea is to use a DNN policy that can modify the parameters of the trajectory generator, such as foot height and frequency, to adapt to different terrains. To encourage the robot to step on safe regions and to reduce energy consumption, we propose a foot terrain reward and a lifting foot height reward, respectively. By incorporating these rewards, our method can learn a safer and more efficient terrain-aware locomotion policy that can move a quadruped robot flexibly in any direction. To evaluate the effectiveness of our approach, we conduct simulation experiments on challenging terrains, including stairs, stepping stones, and poles. The simulation results demonstrate that our approach can successfully direct the robot to traverse such tough terrains in any direction. Furthermore, we validate our method on a real legged robot, which learns to traverse stepping stones with gaps over 25.5 cm.

Deep Model Predictive Optimization

Authors:Jacob Sacks, Rwik Rana, Kevin Huang, Alex Spitzer, Guanya Shi, Byron Boots
Date:2023-10-06 21:11:52

A major challenge in robotics is to design robust policies which enable complex and agile behaviors in the real world. On one end of the spectrum, we have model-free reinforcement learning (MFRL), which is incredibly flexible and general but often results in brittle policies. In contrast, model predictive control (MPC) continually re-plans at each time step to remain robust to perturbations and model inaccuracies. However, despite its real-world successes, MPC often under-performs the optimal strategy. This is due to model quality, myopic behavior from short planning horizons, and approximations due to computational constraints. And even with a perfect model and enough compute, MPC can get stuck in bad local optima, depending heavily on the quality of the optimization algorithm. To this end, we propose Deep Model Predictive Optimization (DMPO), which learns the inner-loop of an MPC optimization algorithm directly via experience, specifically tailored to the needs of the control problem. We evaluate DMPO on a real quadrotor agile trajectory tracking task, on which it improves performance over a baseline MPC algorithm for a given computational budget. It can outperform the best MPC algorithm by up to 27% with fewer samples and an end-to-end policy trained with MFRL by 19%. Moreover, because DMPO requires fewer samples, it can also achieve these benefits with 4.3X less memory. When we subject the quadrotor to turbulent wind fields with an attached drag plate, DMPO can adapt zero-shot while still outperforming all baselines. Additional results can be found at https://tinyurl.com/mr2ywmnw.

Deep Learning Based Active Spatial Channel Gain Prediction Using a Swarm of Unmanned Aerial Vehicles

Authors:Enes Krijestorac, Danijela Cabric
Date:2023-10-06 19:19:29

Prediction of wireless channel gain (CG) across space is a necessary tool for many important wireless network design problems. In this paper, we develop prediction methods that use environment-specific features, namely building maps and CG measurements, to achieve high prediction accuracy. We assume that measurements are collected using a swarm of coordinated unmanned aerial vehicles (UAVs). We develop novel active prediction approaches which consist of both methods for UAV path planning for optimal measurement collection and methods for prediction of CG across space based on the collected measurements. We propose two active prediction approaches based on deep learning (DL) and Kriging interpolation. The first approach does not rely on the location of the transmitter and utilizes 3D maps to compensate for the lack of it. We utilize DL to incorporate 3D maps into prediction and reinforcement learning for optimal path planning for the UAVs based on DL prediction. The second active prediction approach is based on Kriging interpolation, which requires known transmitter location and cannot utilize 3D maps. We train and evaluate the two proposed approaches in a ray-tracing-based channel simulator. Using simulations, we demonstrate the importance of active prediction compared to prediction based on randomly collected measurements of channel gain. Furthermore, we show that using DL and 3D maps, we can achieve high prediction accuracy even without knowing the transmitter location. We also demonstrate the importance of coordinated path planning for active prediction when using multiple UAVs compared to UAVs collecting measurements independently in a greedy manner.

Amortized Network Intervention to Steer the Excitatory Point Processes

Authors:Zitao Song, Wendi Ren, Shuang Li
Date:2023-10-06 11:17:28

Excitatory point processes (i.e., event flows) occurring over dynamic graphs (i.e., evolving topologies) provide a fine-grained model to capture how discrete events may spread over time and space. How to effectively steer the event flows by modifying the dynamic graph structures presents an interesting problem, motivated by applications ranging from curbing the spread of infectious diseases through strategically locking down cities to mitigating traffic congestion via traffic light optimization. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortized Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Each task is solved by H-step lookahead model-based reinforcement learning, where neural ODEs are introduced to model the dynamics of the excitatory point processes. Instead of simulating rollouts from the dynamics model, we derive an analytical mean-field approximation for the event flows given the dynamics, making the online planning more efficiently solvable. We empirically illustrate that this ANI approach substantially enhances policy learning for unseen dynamics and exhibits promising outcomes in steering event flows through network intervention using synthetic and real COVID datasets.

LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models

Authors:Saaket Agashe, Yue Fan, Anthony Reyna, Xin Eric Wang
Date:2023-10-05 21:18:15

The emergent reasoning and Theory of Mind (ToM) abilities demonstrated by Large Language Models (LLMs) make them promising candidates for developing coordination agents. In this study, we introduce a new LLM-Coordination Benchmark aimed at a detailed analysis of LLMs within the context of Pure Coordination Games, where participating agents need to cooperate for the most gain. This benchmark evaluates LLMs through two distinct tasks: (1) Agentic Coordination, where LLMs act as proactive participants for cooperation in 4 pure coordination games; (2) Coordination Question Answering (QA), where LLMs are prompted to answer 198 multiple-choice questions from the 4 games for evaluation of three key reasoning abilities: Environment Comprehension, ToM Reasoning, and Joint Planning. Furthermore, to enable LLMs for multi-agent coordination, we introduce a Cognitive Architecture for Coordination (CAC) framework that can easily integrate different LLMs as plug-and-play modules for pure coordination games. Our findings indicate that LLM agents equipped with GPT-4-turbo achieve comparable performance to state-of-the-art reinforcement learning methods in games that require commonsense actions based on the environment. Besides, zero-shot coordination experiments reveal that, unlike RL methods, LLM agents are robust to new unseen partners. However, results on Coordination QA show substantial room for improvement in the Theory of Mind reasoning and joint planning abilities of LLMs. The analysis also sheds light on how the ability of LLMs to understand their environment and their partner's beliefs and intentions plays a part in their ability to plan for coordination. Our code is available at https://github.com/eric-ai-lab/llm_coordination.

HandMeThat: Human-Robot Communication in Physical and Social Environments

Authors:Yanming Wan, Jiayuan Mao, Joshua B. Tenenbaum
Date:2023-10-05 16:14:46

We introduce HandMeThat, a benchmark for a holistic evaluation of instruction understanding and following in physical and social environments. While previous datasets primarily focused on language grounding and planning, HandMeThat considers the resolution of human instructions with ambiguities based on the physical (object states and relations) and social (human actions and goals) information. HandMeThat contains 10,000 episodes of human-robot interactions. In each episode, the robot first observes a trajectory of human actions towards her internal goal. Next, the robot receives a human instruction and should take actions to accomplish the subgoal set through the instruction. In this paper, we present a textual interface for our benchmark, where the robot interacts with a virtual environment through textual commands. We evaluate several baseline models on HandMeThat, and show that both offline and online reinforcement learning algorithms perform poorly on HandMeThat, suggesting significant room for future work on physical and social human-robot communications and interactions.

Towards a Unified Framework for Sequential Decision Making

Authors:Carlos Núñez-Molina, Pablo Mesejo, Juan Fernández-Olivares
Date:2023-10-03 16:01:06

In recent years, the integration of Automated Planning (AP) and Reinforcement Learning (RL) has seen a surge of interest. To perform this integration, a general framework for Sequential Decision Making (SDM) would prove immensely useful, as it would help us understand how AP and RL fit together. In this preliminary work, we attempt to provide such a framework, suitable for any method ranging from Classical Planning to Deep RL, by drawing on concepts from Probability Theory and Bayesian inference. We formulate an SDM task as a set of training and test Markov Decision Processes (MDPs), to account for generalization. We provide a general algorithm for SDM which we hypothesize every SDM method is based on. According to it, every SDM algorithm can be seen as a procedure that iteratively improves its solution estimate by leveraging the task knowledge available. Finally, we derive a set of formulas and algorithms for calculating interesting properties of SDM tasks and methods, which make possible their empirical evaluation and comparison.

AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model

Authors:Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Tangjie Lv, Changjie Fan, Zhipeng Hu
Date:2023-10-03 13:53:08

Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RL from Human Feedback (RLHF) to quantify human preferences, covering abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customizing, covering mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish the multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we proceed to train an attribute-conditioned diffusion model, which serves as a planner with the attribute strength model as a director for preference aligning at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. More visualization videos are released on https://aligndiff.github.io/.

Mini-BEHAVIOR: A Procedurally Generated Benchmark for Long-horizon Decision-Making in Embodied AI

Authors:Emily Jin, Jiaheng Hu, Zhuoyi Huang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Roberto Martín-Martín
Date:2023-10-03 06:41:18

We present Mini-BEHAVIOR, a novel benchmark for embodied AI that challenges agents to use reasoning and decision-making skills to solve complex activities that resemble everyday human challenges. The Mini-BEHAVIOR environment is a fast, realistic Gridworld environment that offers the benefits of rapid prototyping and ease of use while preserving a symbolic level of physical realism and complexity found in complex embodied AI benchmarks. We introduce key features such as procedural generation, to enable the creation of countless task variations and support open-ended learning. Mini-BEHAVIOR provides implementations of various household tasks from the original BEHAVIOR benchmark, along with starter code for data collection and reinforcement learning agent training. In essence, Mini-BEHAVIOR offers a fast, open-ended benchmark for evaluating decision-making and planning solutions in embodied AI. It serves as a user-friendly entry point for research, simplifying the evaluation and development of solutions while advancing the field of embodied AI. Code is publicly available at https://github.com/StanfordVL/mini_behavior.

Iterative Option Discovery for Planning, by Planning

Authors:Kenny Young, Richard S. Sutton
Date:2023-10-02 19:03:30

Discovering useful temporal abstractions, in the form of options, is widely thought to be key to applying reinforcement learning and planning to increasingly complex domains. Building on the empirical success of the Expert Iteration approach to policy learning used in AlphaZero, we propose Option Iteration, an analogous approach to option discovery. Rather than learning a single strong policy that is trained to match the search results everywhere, Option Iteration learns a set of option policies trained such that for each state encountered, at least one policy in the set matches the search results for some horizon into the future. Intuitively, this may be significantly easier as it allows the algorithm to hedge its bets compared to learning a single globally strong policy, which may have complex dependencies on the details of the current state. Having learned such a set of locally strong policies, we can use them to guide the search algorithm resulting in a virtuous cycle where better options lead to better search results which allows for training of better options. We demonstrate experimentally that planning using options learned with Option Iteration leads to a significant benefit in challenging planning environments compared to an analogous planning algorithm operating in the space of primitive actions and learning a single rollout policy with Expert Iteration.
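
The "at least one policy in the set matches the search results" objective described in this abstract can be sketched as a minimum over per-option losses. The Python below is a simplified illustration under stated assumptions, not the paper's training procedure: it considers a single state, ignores the multi-step horizon, and uses cross-entropy as the matching loss. The names `cross_entropy` and `option_set_loss` are hypothetical.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target action distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def option_set_loss(search_policy, option_policies):
    """Loss for one state: only the best-matching option is penalized.

    search_policy: action distribution produced by search at this state.
    option_policies: one action distribution per option policy.
    Returns (loss, index_of_best_option).
    """
    losses = [cross_entropy(search_policy, pi) for pi in option_policies]
    best = min(range(len(losses)), key=losses.__getitem__)
    return losses[best], best


# Illustrative values: the second option is closer to the search result.
loss, best = option_set_loss([0.9, 0.1], [[0.5, 0.5], [0.8, 0.2]])
```

Taking the minimum rather than the sum is what lets the set hedge its bets: an option that is a poor match at this state incurs no penalty as long as another option matches well.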

Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning

Authors:Qiwei Di, Heyang Zhao, Jiafan He, Quanquan Gu
Date:2023-10-02 17:42:01

Offline reinforcement learning (RL), where the agent aims to learn the optimal policy based on the data collected by a behavior policy, has attracted increasing attention in recent years. While offline RL with linear function approximation has been extensively studied with optimal results achieved under certain assumptions, many works shift their interest to offline RL with non-linear function approximation. However, few works on offline RL with non-linear function approximation have instance-dependent regret guarantees. In this paper, we propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI), for offline RL with non-linear function approximation. Our algorithmic design comprises three innovative components: (1) a variance-based weighted regression scheme that can be applied to a wide range of function classes, (2) a subroutine for variance estimation, and (3) a planning phase that utilizes a pessimistic value iteration approach. Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation. Our work extends the previous instance-dependent results within simpler function classes, such as linear and differentiable functions, to a more general framework.

Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Authors:Yizhe Zhang, Jiarui Lu, Navdeep Jaitly
Date:2023-10-02 16:55:37

Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning and planning over multiple conversational turns. However, directly measuring this can be challenging. In this paper, we offer a surrogate problem which assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This entity-deducing game can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model is capable of imitating a stronger model and generalizing to data or domains, using only the demonstrations from a stronger model. We finally propose to use Reinforcement Learning to enhance the reasoning and planning capacity of Vicuna models through episodes of game playing, which leads to significant performance improvement. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.

Learn to Follow: Decentralized Lifelong Multi-agent Pathfinding via Planning and Learning

Authors:Alexey Skrynnik, Anton Andreychuk, Maria Nesterova, Konstantin Yakovlev, Aleksandr Panov
Date:2023-10-02 13:51:32

The Multi-agent Pathfinding (MAPF) problem generally asks to find a set of conflict-free paths for a set of agents confined to a graph and is typically solved in a centralized fashion. Conversely, in this work, we investigate the decentralized MAPF setting, in which the central controller that possesses all the information on the agents' locations and goals is absent and the agents have to sequentially decide on their actions on their own without having access to the full state of the environment. We focus on the practically important lifelong variant of MAPF, which involves continuously assigning new goals to the agents upon arrival at the previous ones. To address this complex problem, we propose a method that integrates two complementary approaches: planning with heuristic search and reinforcement learning through policy optimization. Planning is utilized to construct and re-plan individual paths. We enhance our planning algorithm with a dedicated technique tailored to avoid congestion and increase the throughput of the system. We employ reinforcement learning to discover the collision avoidance policies that effectively guide the agents along the paths. The policy is implemented as a neural network and is effectively trained without any reward-shaping or external guidance. We evaluate our method on a wide range of setups, comparing it to the state-of-the-art solvers. The results show that our method consistently outperforms the learnable competitors, showing higher throughput and better ability to generalize to maps that were unseen at the training stage. Moreover, our solver outperforms a rule-based one in terms of throughput and is an order of magnitude faster than a state-of-the-art search-based solver.

BeBOP -- Combining Reactive Planning and Bayesian Optimization to Solve Robotic Manipulation Tasks

Authors:Jonathan Styrud, Matthias Mayr, Erik Hellsten, Volker Krueger, Christian Smith
Date:2023-10-02 08:23:55

Robotic systems for manipulation tasks are increasingly expected to be easy to configure for new tasks. While in the past, robot programs were often written statically and tuned manually, the current, faster transition times call for robust, modular and interpretable solutions that also allow a robotic system to learn how to perform a task. We propose the method Behavior-based Bayesian Optimization and Planning (BeBOP) that combines two approaches for generating behavior trees: we build the structure using a reactive planner and learn specific parameters with Bayesian optimization. The method is evaluated on a set of robotic manipulation benchmarks and is shown to outperform state-of-the-art reinforcement learning algorithms by being up to 46 times faster while simultaneously being less dependent on reward shaping. We also propose a modification to the uncertainty estimate for the random forest surrogate models that drastically improves the results.

Uncertainty-aware hybrid paradigm of nonlinear MPC and model-based RL for offroad navigation: Exploration of transformers in the predictive model

Authors:Faraz Lotfi, Khalil Virji, Farnoosh Faraji, Lucas Berry, Andrew Holliday, David Meger, Gregory Dudek
Date:2023-10-01 18:47:02

In this paper, we investigate a hybrid scheme that combines nonlinear model predictive control (MPC) and model-based reinforcement learning (RL) for navigation planning of an autonomous model car across offroad, unstructured terrains without relying on predefined maps. Our innovative approach takes inspiration from BADGR, an LSTM-based network that primarily concentrates on environment modeling, but distinguishes itself by substituting LSTM modules with transformers to greatly elevate the performance of our model. Addressing uncertainty within the system, we train an ensemble of predictive models and estimate the mutual information between model weights and outputs, facilitating dynamic horizon planning through the introduction of variable speeds. Further enhancing our methodology, we incorporate a nonlinear MPC controller that accounts for the intricacies of the vehicle's model and states. The model-based RL facet produces steering angles and quantifies inherent uncertainty. At the same time, the nonlinear MPC suggests optimal throttle settings, striking a balance between goal attainment speed and managing model uncertainty influenced by velocity. In the conducted studies, our approach excels over the existing baseline by consistently achieving higher metric values in predicting future events and seamlessly integrating the vehicle's kinematic model for enhanced decision-making. The code and the evaluation data are available at https://github.com/FARAZLOTFI/offroad_autonomous_navigation/.

Efficient Planning with Latent Diffusion

Authors:Wenhao Li
Date:2023-09-30 08:50:49

Temporal abstraction and efficient planning pose significant challenges in offline reinforcement learning, mainly when dealing with domains that involve temporally extended tasks and delayed sparse rewards. Existing methods typically plan in the raw action space and can be inefficient and inflexible. Latent action spaces offer a more flexible paradigm, capturing only possible actions within the behavior policy support and decoupling the temporal structure between planning and modeling. However, current latent-action-based methods are limited to discrete spaces and require expensive planning. This paper presents a unified framework for continuous latent action space representation learning and planning by leveraging latent, score-based diffusion models. We establish the theoretical equivalence between planning in the latent action space and energy-guided sampling with a pretrained diffusion model and incorporate a novel sequence-level exact sampling method. Our proposed method, $\texttt{LatentDiffuser}$, demonstrates competitive performance on low-dimensional locomotion control tasks and surpasses existing methods in higher-dimensional tasks.

Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning

Authors:Mingde Zhao, Safa Alver, Harm van Seijen, Romain Laroche, Doina Precup, Yoshua Bengio
Date:2023-09-30 02:25:18

Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods.

Improving Planning with Large Language Models: A Modular Agentic Architecture

Authors:Taylor Webb, Shanka Subhra Mondal, Ida Momennejad
Date:2023-09-30 00:10:14

Large language models (LLMs) demonstrate impressive performance on a wide variety of tasks, but they often struggle with tasks that require multi-step reasoning or goal-directed planning. Both cognitive neuroscience and reinforcement learning (RL) have proposed a number of interacting functional components that together implement search and evaluation in multi-step decision making. These components include conflict monitoring, state prediction, state evaluation, task decomposition, and orchestration. To improve planning with LLMs, we propose an agentic architecture, the Modular Agentic Planner (MAP), in which planning is accomplished via the recurrent interaction of the specialized modules mentioned above, each implemented using an LLM. MAP improves planning through the interaction of specialized modules that break down a larger problem into multiple brief automated calls to the LLM. We evaluate MAP on three challenging planning tasks -- graph traversal, Tower of Hanoi, and the PlanBench benchmark -- as well as an NLP task requiring multi-step reasoning (strategyQA). We find that MAP yields significant improvements over both standard LLM methods (zero-shot prompting, in-context learning) and competitive baselines (chain-of-thought, multi-agent debate, and tree-of-thought), can be effectively combined with smaller and more cost-efficient LLMs (Llama3-70B), and displays superior transfer across tasks. These results suggest the benefit of a modular and multi-agent approach to planning with LLMs.

DREAM: Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems

Authors:Dipam Patel, Phu Pham, Kshitij Tiwari, Aniket Bera
Date:2023-09-29 17:43:41

Resource-constrained robots often suffer from energy inefficiencies, underutilized computational abilities due to inadequate task allocation, and a lack of robustness in dynamic environments, all of which strongly affect their performance. This paper introduces DREAM (Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems), a comprehensive framework that optimizes the allocation of resources for efficient exploration. It advances beyond conventional heuristic-based task planning. The framework incorporates Operational Range Estimation using Reinforcement Learning to perform exploration and obstacle avoidance in unfamiliar terrains. DREAM further introduces an Energy Consumption Model for goal allocation, thereby ensuring mission completion under constrained resources using a Graph Neural Network. This approach also ensures that the entire Multi-Robot System can survive for an extended period of time for further missions, compared to the conventional approach of randomly allocating goals, which compromises one or more agents. Our approach adapts to prioritizing agents in real-time, showcasing remarkable resilience against dynamic environments. This robust solution was evaluated in various simulated environments, demonstrating adaptability and applicability across diverse scenarios. We observed a substantial improvement of about 25% over the baseline method, leading the way for future research in resource-constrained robotics.

Social Navigation in Crowded Environments with Model Predictive Control and Deep Learning-Based Human Trajectory Prediction

Authors:Viet-Anh Le, Behdad Chalaki, Vaishnav Tadiparthi, Hossein Nourkhiz Mahjoub, Jovin D'sa, Ehsan Moradi-Pari
Date:2023-09-28 20:31:59

Crowd navigation has received increasing attention from researchers over the last few decades, resulting in the emergence of numerous approaches aimed at addressing this problem. Our proposed approach couples agent motion prediction and planning to avoid the freezing robot problem while simultaneously capturing multi-agent social interactions by utilizing a state-of-the-art trajectory prediction model, i.e., the social long short-term memory model (Social-LSTM). Leveraging the output of Social-LSTM for the prediction of future trajectories of pedestrians at each time-step given the robot's possible actions, our framework computes the optimal control action using Model Predictive Control (MPC) for the robot to navigate among pedestrians. We demonstrate the effectiveness of our proposed approach in multiple scenarios of simulated crowd navigation and compare it against several state-of-the-art reinforcement learning-based methods.

Qwen Technical Report

Authors:Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu
Date:2023-09-28 17:07:49

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

Uncertainty-Aware Decision Transformer for Stochastic Driving Environments

Authors:Zenan Li, Fan Nie, Qiao Sun, Fang Da, Hang Zhao
Date:2023-09-28 12:44:51

Offline Reinforcement Learning (RL) enables policy learning without active interactions, making it especially appealing for self-driving tasks. Recent successes of Transformers inspire casting offline RL as sequence modeling, which, however, fails in stochastic environments with incorrect assumptions that identical actions can consistently achieve the same goal. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer (UNREST) for planning in stochastic driving environments without introducing additional transition or complex generative models. Specifically, UNREST estimates uncertainties by conditional mutual information between transitions and returns. Discovering 'uncertainty accumulation' and 'temporal locality' properties of driving environments, we replace the global returns in decision transformers with truncated returns less affected by environments to learn from actual outcomes of actions rather than environment transitions. We also dynamically evaluate uncertainty at inference for cautious planning. Extensive experiments demonstrate UNREST's superior performance in various driving scenarios and the power of our uncertainty estimation strategy.
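
The contrast between global and truncated returns described above can be illustrated with a minimal sketch. It assumes a fixed truncation horizon (the hypothetical parameter `horizon`); the abstract indicates that UNREST derives its truncation from uncertainty estimates, which is not reproduced here. The function names are illustrative and do not come from the paper.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Global return-to-go: sum of all rewards from step t to the episode end."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def truncated_returns(rewards: List[float], horizon: int) -> List[float]:
    """Sum of rewards over the window [t, t + horizon), clipped at the episode end.

    `horizon` is a fixed, hypothetical parameter used only for illustration.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    return [sum(rewards[t:t + horizon]) for t in range(len(rewards))]


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0, 4.0]
    print(returns_to_go(rewards))
    print(truncated_returns(rewards, horizon=2))
```

With a short horizon, the conditioning signal at step t depends only on nearby rewards, so it is less influenced by stochastic events far in the future.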

Efficiency Separation between RL Methods: Model-Free, Model-Based and Goal-Conditioned

Authors:Brieuc Pinon, Raphaël Jungers, Jean-Charles Delvenne
Date:2023-09-28 09:38:27

We prove a fundamental limitation on the efficiency of a wide class of Reinforcement Learning (RL) algorithms. This limitation applies to model-free RL methods as well as a broad range of model-based methods, such as planning with tree search. Under an abstract definition of this class, we provide a family of RL problems for which these methods suffer a lower bound exponential in the horizon for their interactions with the environment to find an optimal behavior. However, there exists a method, not tailored to this specific family of problems, which can efficiently solve the problems in the family. In contrast, our limitation does not apply to several types of methods proposed in the literature, for instance, goal-conditioned methods or other algorithms that construct an inverse dynamics model.

Learning to Terminate in Object Navigation

Authors:Yuhang Song, Anh Nguyen, Chun-Yi Lee
Date:2023-09-28 04:32:08

This paper tackles the critical challenge of object navigation in autonomous navigation systems, particularly focusing on the problem of target approach and episode termination in environments with long optimal episode length in Deep Reinforcement Learning (DRL) based methods. While effective in environment exploration and object localization, conventional DRL methods often struggle with optimal path planning and termination recognition due to a lack of depth information. To overcome these limitations, we propose a novel approach, namely the Depth-Inference Termination Agent (DITA), which incorporates a supervised model called the Judge Model to implicitly infer object-wise depth and decide termination jointly with reinforcement learning. We train our judge model along with reinforcement learning in parallel and supervise the former efficiently by reward signal. Our evaluation shows that the method achieves superior performance: a 9.3% gain in success rate over our baseline method across all room types and a 51.2% improvement in long-episode environments, while maintaining slightly better Success Weighted by Path Length (SPL). Code, resources, and visualizations are available at: https://github.com/HuskyKingdom/DITA_acml2023
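
For readers unfamiliar with the SPL metric mentioned above, the following sketch computes it under the commonly used definition: the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i indicates success, l_i is the shortest-path length, and p_i is the length of the path actually taken. That the paper uses exactly this definition is an assumption; the function and argument names are illustrative.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes."""
    if not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, l, p in zip(successes, shortest, taken):
        if l <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # A successful episode scores 1 only if it followed a shortest path.
            total += l / max(p, l)
    return total / len(successes)


if __name__ == "__main__":
    # Three episodes: optimal success, success with a 2x detour, failure.
    print(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0]))
```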

Data-Driven Latent Space Representation for Robust Bipedal Locomotion Learning

Authors:Guillermo A. Castillo, Bowen Weng, Wei Zhang, Ayonga Hereid
Date:2023-09-27 15:51:18

This paper presents a novel framework for learning robust bipedal walking by combining a data-driven state representation with a Reinforcement Learning (RL) based locomotion policy. The framework utilizes an autoencoder to learn a low-dimensional latent space that captures the complex dynamics of bipedal locomotion from existing locomotion data. This reduced dimensional state representation is then used as states for training a robust RL-based gait policy, eliminating the need for heuristic state selections or the use of template models for gait planning. The results demonstrate that the learned latent variables are disentangled and directly correspond to different gaits or speeds, such as moving forward, backward, or walking in place. Compared to traditional template model-based approaches, our framework exhibits superior performance and robustness in simulation. The trained policy effectively tracks a wide range of walking speeds and demonstrates good generalization capabilities to unseen scenarios.

DTC: Deep Tracking Control

Authors:Fabian Jenelten, Junzhe He, Farbod Farshidian, Marco Hutter
Date:2023-09-27 07:57:37

Legged locomotion is a complex control problem that requires both accuracy and robustness to cope with real-world challenges. Legged systems have traditionally been controlled using trajectory optimization with inverse dynamics. Such hierarchical model-based methods are appealing due to intuitive cost function tuning, accurate planning, generalization, and most importantly, the insightful understanding gained from more than one decade of extensive research. However, model mismatch and violation of assumptions are common sources of faulty operation. Simulation-based reinforcement learning, on the other hand, results in locomotion policies with unprecedented robustness and recovery skills. Yet, all learning algorithms struggle with sparse rewards emerging from environments where valid footholds are rare, such as gaps or stepping stones. In this work, we propose a hybrid control architecture that combines the advantages of both worlds to simultaneously achieve greater robustness, foot-placement accuracy, and terrain generalization. Our approach utilizes a model-based planner to roll out a reference motion during training. A deep neural network policy is trained in simulation, aiming to track the optimized footholds. We evaluate the accuracy of our locomotion pipeline on sparse terrains, where pure data-driven methods are prone to fail. Furthermore, we demonstrate superior robustness in the presence of slippery or deformable ground when compared to model-based counterparts. Finally, we show that our proposed tracking controller generalizes across different trajectory optimization methods not seen during training. In conclusion, our work unites the predictive capabilities and optimality guarantees of online planning with the inherent robustness attributed to offline learning.

Towards High Efficient Long-horizon Planning with Expert-guided Motion-encoding Tree Search

Authors:Tong Zhou, Erli Lyu, Jiaole Wang, Guangdu Cen, Ziqi Zha, Senmao Qi, Max Q. -H. Meng
Date:2023-09-26 17:19:40

Autonomous driving holds promise for increased safety, optimized traffic management, and a new level of convenience in transportation. While model-based reinforcement learning approaches such as MuZero enable long-term planning, the exponential increase in the number of search nodes as the tree goes deeper significantly affects the search efficiency. To deal with this problem, in this paper we propose the expert-guided motion-encoding tree search (EMTS) algorithm. EMTS extends the MuZero algorithm by representing possible motions with a comprehensive motion primitives latent space and incorporating expert policies to improve the search efficiency. The comprehensive motion primitives latent space enables EMTS to sample arbitrary trajectories instead of raw actions, reducing the depth of the search tree. The incorporation of expert policies guides the search and training phases of the EMTS algorithm to enable early convergence. In the experiment section, the EMTS algorithm is compared with four other algorithms in three challenging scenarios. The experimental results verify the effectiveness and search efficiency of the proposed EMTS algorithm.

Hierarchical Reinforcement Learning Based on Planning Operators

Authors:Jing Zhang, Emmanuel Dean, Karinne Ramirez-Amaro
Date:2023-09-25 15:54:32

Long-horizon manipulation tasks such as stacking represent a longstanding challenge in the field of robotic manipulation, particularly when using reinforcement learning (RL) methods, which often struggle to learn the correct sequence of actions for achieving these complex goals. To learn this sequence, symbolic planning methods offer a good solution based on high-level reasoning; however, planners often fall short in addressing the low-level control specificity needed for precise execution. This paper introduces a novel framework that integrates symbolic planning with hierarchical RL through the cooperation of high-level operators and low-level policies. Our contribution integrates planning operators (e.g., preconditions and effects) as part of the hierarchical RL algorithm based on the Scheduled Auxiliary Control (SAC-X) method. We developed a dual-purpose high-level operator, which can be used both in holistic planning and as independent, reusable policies. Our approach offers a flexible solution for long-horizon tasks, e.g., stacking a cube. The experimental results show that our proposed method achieved an average success rate of 97.2% for learning and executing the whole stack sequence, and the following success rates for learning independent policies: reach (98.9%), lift (99.7%), and stack (85%), among others. The training time is also reduced by 68% when using our proposed approach.
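
To make the notion of planning operators with preconditions and effects concrete, here is a minimal STRIPS-style sketch with a breadth-first planner. The predicates (`hand_empty`, `at_cube`, `holding`, `stacked`) and the operator definitions are hypothetical and are not taken from the paper; in the paper each operator is additionally tied to a learned low-level policy, which this sketch does not model.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str] = frozenset()

    def applicable(self, state: FrozenSet[str]) -> bool:
        # An operator can fire only when all its preconditions hold.
        return self.preconditions <= state

    def apply(self, state: FrozenSet[str]) -> FrozenSet[str]:
        # Remove deleted facts first, then add the new ones.
        return (state - self.del_effects) | self.add_effects


def plan(start: FrozenSet[str], goal: FrozenSet[str],
         operators: List[Operator]) -> Optional[List[str]]:
    """Breadth-first forward search; returns a shortest operator sequence or None."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for op in operators:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [op.name]))
    return None


# Hypothetical operators for a cube-stacking task.
OPERATORS = [
    Operator("reach", frozenset({"hand_empty"}), frozenset({"at_cube"})),
    Operator("lift", frozenset({"at_cube", "hand_empty"}),
             frozenset({"holding"}), frozenset({"hand_empty"})),
    Operator("stack", frozenset({"holding"}),
             frozenset({"stacked", "hand_empty"}), frozenset({"holding"})),
]

if __name__ == "__main__":
    print(plan(frozenset({"hand_empty"}), frozenset({"stacked"}), OPERATORS))
```

The symbolic layer only decides which operator comes next; executing each step precisely is left to a low-level controller.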

Designing and evaluating an online reinforcement learning agent for physical exercise recommendations in N-of-1 trials

Authors:Dominik Meier, Ipek Ensari, Stefan Konigorski
Date:2023-09-25 14:08:21

Personalized adaptive interventions offer the opportunity to increase patient benefits; however, there are challenges in their planning and implementation. Once implemented, it is an important question whether personalized adaptive interventions are indeed clinically more effective compared to a fixed gold standard intervention. In this paper, we present an innovative N-of-1 trial study design testing whether implementing a personalized intervention by an online reinforcement learning agent is feasible and effective. Throughout, we use a new study on physical exercise recommendations to reduce pain in endometriosis for illustration. We describe the design of a contextual bandit recommendation agent and evaluate the agent in simulation studies. The results show that, first, implementing a personalized intervention by an online reinforcement learning agent is feasible. Second, such adaptive interventions have the potential to improve patients' benefits even if only a few observations are available. As one challenge, they add complexity to the design and implementation process. In order to quantify the expected benefit, data from previous interventional studies is required. We expect our approach to be transferable to other interventions and clinical interventions.

RL-I2IT: Image-to-Image Translation with Deep Reinforcement Learning

Authors:Xin Wang, Ziwei Luo, Jing Hu, Chengming Feng, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Xin Li, Siwei Lyu
Date:2023-09-24 15:40:40

Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image. Considering that it is challenging to handle high dimensional continuous state and action spaces in the conventional RL framework, we introduce meta policy with a new concept Plan to the standard Actor-Critic model, which is of a lower dimension than the original image and can facilitate the actor to generate a tractable high dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.

Boosting Offline Reinforcement Learning for Autonomous Driving with Hierarchical Latent Skills

Authors:Zenan Li, Fan Nie, Qiao Sun, Fang Da, Hang Zhao
Date:2023-09-24 11:51:17

Learning-based vehicle planning is receiving increasing attention with the emergence of diverse driving simulators and large-scale driving datasets. While offline reinforcement learning (RL) is well suited for these safety-critical tasks, it still struggles to plan over extended periods. In this work, we present a skill-based framework that enhances offline RL to overcome the long-horizon vehicle planning challenge. Specifically, we design a variational autoencoder (VAE) to learn skills from offline demonstrations. To mitigate posterior collapse of common VAEs, we introduce a two-branch sequence encoder to capture both discrete options and continuous variations of the complex driving skills. The final policy treats learned skills as actions and can be trained by any off-the-shelf offline RL algorithms. This facilitates a shift in focus from per-step actions to temporally extended skills, thereby enabling long-term reasoning into the future. Extensive results on CARLA prove that our model consistently outperforms strong baselines at both training and new scenarios. Additional visualizations and experiments demonstrate the interpretability and transferability of extracted skills.

Guided Cooperation in Hierarchical Reinforcement Learning via Model-based Rollout

Authors:Haoran Wang, Zeshen Tang, Leya Yang, Yaoru Sun, Fang Wang, Siyu Zhang, Yeming Chen
Date:2023-09-24 00:13:16

Goal-conditioned hierarchical reinforcement learning (HRL) presents a promising approach for enabling effective exploration in complex, long-horizon reinforcement learning (RL) tasks through temporal abstraction. Empirically, heightened inter-level communication and coordination can induce more stable and robust policy improvement in hierarchical systems. Yet, most existing goal-conditioned HRL algorithms have primarily focused on the subgoal discovery, neglecting inter-level cooperation. Here, we propose a goal-conditioned HRL framework named Guided Cooperation via Model-based Rollout (GCMR), aiming to bridge inter-layer information synchronization and cooperation by exploiting forward dynamics. Firstly, the GCMR mitigates the state-transition error within off-policy correction via model-based rollout, thereby enhancing sample efficiency. Secondly, to prevent disruption by the unseen subgoals and states, lower-level Q-function gradients are constrained using a gradient penalty with a model-inferred upper bound, leading to a more stable behavioral policy conducive to effective exploration. Thirdly, we propose a one-step rollout-based planning, using higher-level critics to guide the lower-level policy. Specifically, we estimate the value of future states of the lower-level policy using the higher-level critic function, thereby transmitting global task information downwards to avoid local pitfalls. These three critical components in GCMR are expected to facilitate inter-level cooperation significantly. Experimental results demonstrate that incorporating the proposed GCMR framework with a disentangled variant of HIGL, namely ACLG, yields more stable and robust policy improvement compared to various baselines and significantly outperforms previous state-of-the-art algorithms.

Robust Perception-Informed Navigation using PAC-NMPC with a Learned Value Function

Authors:Adam Polevoy, Mark Gonzales, Marin Kobilarov, Joseph Moore
Date:2023-09-22 20:17:11

Nonlinear model predictive control (NMPC) is typically restricted to short, finite horizons to limit the computational burden of online optimization. As a result, global planning frameworks are frequently necessary to avoid local minima when using NMPC for navigation in complex environments. By contrast, reinforcement learning (RL) can generate policies that minimize the expected cost over an infinite-horizon and can often avoid local minima, even when operating only on current sensor measurements. However, these learned policies are usually unable to provide performance guarantees (e.g., on collision avoidance), especially when outside of the training distribution. In this paper, we augment Probably Approximately Correct NMPC (PAC-NMPC), a sampling-based stochastic NMPC algorithm capable of providing statistical guarantees of performance and safety, with an approximate perception-dependent value function trained via RL. We demonstrate in simulation that our algorithm can improve the long-term behavior of PAC-NMPC while outperforming other approaches with regards to safety for both planar car dynamics and more complex, high-dimensional fixed-wing aerial vehicle dynamics. We also demonstrate that, even when our value function is trained in simulation, our algorithm can successfully achieve statistically safe navigation on hardware using a 1/10th scale rally car in cluttered real-world environments using only current sensor information.

Trip Planning for Autonomous Vehicles with Wireless Data Transfer Needs Using Reinforcement Learning

Authors:Yousef AlSaqabi, Bhaskar Krishnamachari
Date:2023-09-21 23:19:16

With recent advancements in the field of communications and the Internet of Things, vehicles are becoming more aware of their environment and are evolving towards full autonomy. Vehicular communication opens up the possibility for vehicle-to-infrastructure interaction, where vehicles could share information with components such as cameras, traffic lights, and signage that support a country's road system. As a result, vehicles are becoming more than just a means of transportation; they are collecting, processing, and transmitting massive amounts of data used to make driving safer and more convenient. With 5G cellular networks and beyond, there is going to be more data bandwidth available on our roads, but it may be heterogeneous because of limitations like line of sight, infrastructure, and heterogeneous traffic on the road. This paper addresses the problem of route planning for autonomous vehicles in urban areas accounting for both driving time and data transfer needs. We propose a novel reinforcement learning solution that prioritizes high bandwidth roads to meet a vehicle's data transfer requirement, while also minimizing driving time. We compare this approach to traffic-unaware and bandwidth-unaware baselines to show how much better it performs under heterogeneous traffic. This solution could be used as a starting point to understand what good policies look like, which could potentially yield faster, more efficient heuristics in the future.

Optimizing V2V Unicast Communication Transmission with Reinforcement Learning and Vehicle Clustering

Authors:Yu Wang
Date:2023-09-21 13:17:50

Efficient routing algorithms based on vehicular ad hoc networks (VANETs) play an important role in emerging intelligent transportation systems. The highly dynamic topology of these networks poses a number of challenges for wireless communication services. In this paper, we propose a protocol based on reinforcement learning and vehicle node clustering, called Qucts, to solve vehicle-to-fixed-destination or V2V messaging problems. It aims to improve message delivery rates with minimal hops and latency, while also taking link stability into account. The protocol is divided into three levels. First, the vehicles are clustered, and each cluster head broadcasts its own coordinates and speed to attract more cluster members. When a cluster member receives a broadcast message from another cluster head, the cluster head generates a list of surrounding clusters and finds the best cluster toward the destination as the next cluster during message passing. Second, the protocol constructs a Q-value table based on the state after clustering, which is used in the selection of messaging clusters. Finally, we introduce parameters that express the stability of a vehicle within the cluster, which are used for communication node selection. This protocol hierarchy makes Qucts both an offline and an online solution. To distinguish unstable nodes within a cluster, each road is coded, and some vehicles, for example ride-hailing cars and public buses, have planned routes. The overlap of a vehicle's planned path with those of other planned-path vehicles in the cluster is compared, and vehicles with low overlap are labeled as unstable nodes. For vehicles without a planned path, the path overlap rate is set to the mean value. Comparing Qucts with existing routing protocols through simulation, we find that the proposed Qucts scheme provides large improvements in both data delivery rate and end-to-end delay reduction.
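
The path-overlap idea for flagging unstable nodes can be sketched as follows. The abstract does not define the overlap rate or the cutoff, so both are assumptions here: overlap is taken as the mean fraction of a vehicle's road codes shared with each other planned route in the cluster, and `threshold` is a hypothetical parameter. Vehicles without a planned route receive the mean rate, as the abstract states.

```python
from typing import Dict, List, Optional, Set


def overlap_rate(route: Set[str], others: List[Set[str]]) -> float:
    """Mean fraction of this route's road codes shared with each other route."""
    if not route or not others:
        return 0.0
    return sum(len(route & other) / len(route) for other in others) / len(others)


def label_unstable(routes: Dict[str, Optional[Set[str]]],
                   threshold: float) -> Dict[str, bool]:
    """Map each vehicle id to True if it is considered unstable in the cluster."""
    planned = {v: r for v, r in routes.items() if r}
    rates = {
        v: overlap_rate(r, [o for u, o in planned.items() if u != v])
        for v, r in planned.items()
    }
    mean_rate = sum(rates.values()) / len(rates) if rates else 0.0
    for v in routes:
        # Vehicles without a planned route fall back to the cluster mean.
        rates.setdefault(v, mean_rate)
    return {v: rates[v] < threshold for v in routes}


if __name__ == "__main__":
    cluster = {
        "bus": {"r1", "r2", "r3"},
        "taxi": {"r1", "r2", "r4"},
        "van": {"r8", "r9"},
        "car": None,  # no planned route
    }
    print(label_unstable(cluster, threshold=0.2))
```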

Prompt, Plan, Perform: LLM-based Humanoid Control via Quantized Imitation Learning

Authors:Jingkai Sun, Qiang Zhang, Yiqun Duan, Xiaoyang Jiang, Chong Cheng, Renjing Xu
Date:2023-09-20 14:42:01

In recent years, reinforcement learning and imitation learning have shown great potential for controlling humanoid robots' motion. However, these methods typically create simulation environments and rewards for specific tasks, resulting in the requirements of multiple policies and limited capabilities for tackling complex and unknown tasks. To overcome these issues, we present a novel approach that combines adversarial imitation learning with large language models (LLMs). This innovative method enables the agent to learn reusable skills with a single policy and solve zero-shot tasks under the guidance of LLMs. In particular, we utilize the LLM as a strategic planner for applying previously learned skills to novel tasks through the comprehension of task-specific prompts. This empowers the robot to perform the specified actions in a sequence. To improve our model, we incorporate codebook-based vector quantization, allowing the agent to generate suitable actions in response to unseen textual commands from LLMs. Furthermore, we design general reward functions that consider the distinct motion features of humanoid robots, ensuring the agent imitates the motion data while maintaining goal orientation without additional guiding direction approaches or policies. To the best of our knowledge, this is the first framework that controls humanoid robots using a single learning policy network and LLM as a planner. Extensive experiments demonstrate that our method exhibits efficient and adaptive ability in complicated motion tasks.

Optimizing Crowd-Aware Multi-Agent Path Finding through Local Communication with Graph Neural Networks

Authors:Phu Pham, Aniket Bera
Date:2023-09-19 03:02:43

Multi-Agent Path Finding (MAPF) in crowded environments presents a challenging problem in motion planning, aiming to find collision-free paths for all agents in the system. MAPF finds a wide range of applications in various domains, including aerial swarms, autonomous warehouse robotics, and self-driving vehicles. Current approaches to MAPF generally fall into two main categories: centralized and decentralized planning. Centralized planning suffers from the curse of dimensionality when the number of agents or states increases and thus does not scale well in large and complex environments. On the other hand, decentralized planning enables agents to engage in real-time path planning within a partially observable environment, demonstrating implicit coordination. However, decentralized methods suffer from slow convergence and performance degradation in dense environments. In this paper, we introduce CRAMP, a novel crowd-aware decentralized reinforcement learning approach to address this problem by enabling efficient local communication among agents via Graph Neural Networks (GNNs), facilitating situational awareness and decision-making capabilities in congested environments. We test CRAMP on simulated environments and demonstrate that our method outperforms the state-of-the-art decentralized methods for MAPF on various metrics. CRAMP improves solution quality by up to 59%, measured in makespan and collision count, and improves success rate by up to 35% compared with previous methods.

OptiRoute: A Heuristic-assisted Deep Reinforcement Learning Framework for UAV-UGV Collaborative Route Planning

Authors:Md Safwan Mondal, Subramanian Ramasamy, Pranav Bhounsule
Date:2023-09-18 17:01:17

Unmanned aerial vehicles (UAVs) are capable of surveying expansive areas, but their operational range is constrained by limited battery capacity. The deployment of mobile recharging stations using unmanned ground vehicles (UGVs) significantly extends the endurance and effectiveness of UAVs. However, optimizing the routes of both UAVs and UGVs, known as the UAV-UGV cooperative routing problem, poses substantial challenges, particularly with respect to the selection of recharging locations. In this paper, we leverage reinforcement learning (RL) to identify optimal recharging locations while employing constraint programming to determine cooperative routes for the UAV and UGV. Our proposed framework is then benchmarked against a baseline solution that employs Genetic Algorithms (GA) to select rendezvous points. Our findings reveal that RL surpasses GA in terms of reducing overall mission time, minimizing UAV-UGV idle time, and mitigating energy consumption for both the UAV and UGV. These results underscore the efficacy of incorporating heuristics to assist RL, a method we refer to as heuristics-assisted RL, in generating high-quality solutions for intricate routing problems.

A Spiking Binary Neuron -- Detector of Causal Links

Authors:Mikhail Kiselev, Denis Larionov, Andrey Urusov
Date:2023-09-15 15:34:17

Causal relationship recognition is a fundamental operation in neural networks aimed at learning behavior, action planning, and inferring external world dynamics. This operation is particularly crucial for reinforcement learning (RL). In the context of spiking neural networks (SNNs), events are represented as spikes emitted by network neurons or input nodes. Detecting causal relationships within these events is essential for effective RL implementation. This research paper presents a novel approach to realize causal relationship recognition using a simple spiking binary neuron. The proposed method leverages specially designed synaptic plasticity rules, which are both straightforward and efficient. Notably, our approach accounts for the temporal aspects of detected causal links and accommodates the representation of spiking signals as single spikes or tight spike sequences (bursts), as observed in biological brains. Furthermore, this study places a strong emphasis on the hardware-friendliness of the proposed models, ensuring their efficient implementation on modern and future neuroprocessors. Compared with precise machine learning techniques, such as decision tree algorithms and convolutional neural networks, our neuron demonstrates satisfactory accuracy despite its simplicity. In conclusion, we introduce a multi-neuron structure capable of operating in more complex environments with enhanced accuracy, making it a promising candidate for the advancement of RL applications in SNNs.

Optimal Mobility and Communication Strategy to Maximize the Value of Information in IoT Networks

Authors:Zijing Wang, Mihai-Alin Badiu, Justin P. Coon
Date:2023-09-15 10:32:32

The Internet of Things (IoT) is an emerging next-generation technology in the fourth industrial revolution. In industrial IoT networks, sensing devices are largely deployed to monitor various types of physical processes. They are required to transmit the collected data in a timely manner to support real-time monitoring, control and automation. The timeliness of information is very important in such systems. Recently, an information-theoretic metric named the "value of information" (VoI) has been proposed to measure the usefulness of information. In this work, we consider an industrial IoT network with a set of heterogeneous sensing devices and an intelligent mobile entity. The concept of the value of information is applied to study a joint path planning and user scheduling optimisation problem. We aim to maximise the network-level VoI under mobility and communication constraints. We formulate this problem as a Markov decision process (MDP), and an efficient algorithm based on reinforcement learning is proposed to solve this problem. Through numerical results, we show that the proposed method is able to capture the usefulness of data from both time and space dimensions. By exploiting the correlation property of the data source, the proposed method is suitable for applications in resource-limited networks.

Self-Refined Large Language Model as Automated Reward Function Designer for Deep Reinforcement Learning in Robotics

Authors:Jiayang Song, Zhehua Zhou, Jiawei Liu, Chunrong Fang, Zhan Shu, Lei Ma
Date:2023-09-13 02:56:56

Although Deep Reinforcement Learning (DRL) has achieved notable success in numerous robotic applications, designing a high-performing reward function remains a challenging task that often requires substantial manual input. Recently, Large Language Models (LLMs) have been extensively adopted to address tasks demanding in-depth common-sense knowledge, such as reasoning and planning. Because reward function design is also inherently linked to such knowledge, LLMs offer promising potential in this context. Motivated by this, we propose in this work a novel LLM framework with a self-refinement mechanism for automated reward function design. The framework commences with the LLM formulating an initial reward function based on natural language inputs. Then, the performance of the reward function is assessed, and the results are presented back to the LLM to guide its self-refinement process. We examine the performance of our proposed framework through a variety of continuous robotic control tasks across three diverse robotic systems. The results indicate that our LLM-designed reward functions are able to rival or even surpass manually designed reward functions, highlighting the efficacy and applicability of our approach.

Emergent Communication in Multi-Agent Reinforcement Learning for Future Wireless Networks

Authors:Marwa Chafii, Salmane Naoumi, Reda Alami, Ebtesam Almazrouei, Mehdi Bennis, Merouane Debbah
Date:2023-09-12 07:40:53

In different wireless network scenarios, multiple network entities need to cooperate in order to achieve a common task with minimum delay and energy consumption. Future wireless networks mandate exchanging high dimensional data in dynamic and uncertain environments; therefore, implementing communication control tasks becomes challenging and highly complex. Multi-agent reinforcement learning with emergent communication (EC-MARL) is a promising solution to address high dimensional continuous control problems with partially observable states in a cooperative fashion where agents build an emergent communication protocol to solve complex tasks. This paper articulates the importance of EC-MARL within the context of future 6G wireless networks, which imbues autonomous decision-making capabilities into network entities to solve complex tasks such as autonomous driving, robot navigation, flying base stations network planning, and smart city applications. An overview of EC-MARL algorithms and their design criteria is provided, together with use cases and research opportunities on this emerging topic.

Career Path Recommendations for Long-term Income Maximization: A Reinforcement Learning Approach

Authors:Spyros Avlonitis, Dor Lavi, Masoud Mansoury, David Graus
Date:2023-09-11 11:42:28

This study explores the potential of reinforcement learning algorithms to enhance career planning processes. Leveraging data from Randstad The Netherlands, the study simulates the Dutch job market and develops strategies to optimize employees' long-term income. By formulating career planning as a Markov Decision Process (MDP) and utilizing machine learning algorithms such as Sarsa, Q-Learning, and A2C, we learn optimal policies that recommend career paths with high-income occupations and industries. The results demonstrate significant improvements in employees' income trajectories, with RL models, particularly Q-Learning and Sarsa, achieving an average increase of 5% compared to observed career paths. The study acknowledges limitations, including narrow job filtering, simplifications in the environment formulation, and assumptions regarding employment continuity and zero application costs. Future research can explore additional objectives beyond income optimization and address these limitations to further enhance career planning processes.
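
As a reading aid, the idea of casting career planning as an MDP and solving it with Q-Learning can be sketched on a toy problem. Everything below is an assumption made for illustration: the three job levels, the income values, the deterministic transitions, and the hyperparameters are hypothetical and are not the Randstad-based job-market simulation used in the study.

```python
import random

INCOME = [1.0, 2.0, 3.0]   # hypothetical income per job level (state)
STAY, MOVE_UP = 0, 1       # hypothetical actions

def step(state, action):
    """Deterministic toy transition: moving up reaches the next level, capped at the top."""
    next_state = min(state + 1, len(INCOME) - 1) if action == MOVE_UP else state
    return next_state, INCOME[next_state]

def q_learning(episodes=2000, horizon=20, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy (off-policy)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in INCOME]
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            action = rng.choice((STAY, MOVE_UP))
            next_state, reward = step(state, action)
            # Standard Q-learning target: reward plus discounted best next value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
# Greedy policy: the action with the highest learned value in each state.
policy = [max((STAY, MOVE_UP), key=lambda a: q_table[s][a]) for s in range(len(INCOME))]
```

In this toy setting the greedy policy moves up from the two lower levels, because the discounted long-term income of the higher level outweighs staying put.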

Signal Temporal Logic Neural Predictive Control

Authors:Yue Meng, Chuchu Fan
Date:2023-09-10 20:31:25

Ensuring safety and meeting temporal specifications are critical challenges for long-term robotic tasks. Signal temporal logic (STL) has been widely used to systematically and rigorously specify these requirements. However, traditional methods of finding the control policy under those STL requirements are computationally complex and do not scale to high-dimensional systems or systems with complex nonlinear dynamics. Reinforcement learning (RL) methods can learn the policy to satisfy the STL specifications via hand-crafted or STL-inspired rewards, but might encounter unexpected behaviors due to ambiguity and sparsity in the reward. In this paper, we propose a method to directly learn a neural network controller to satisfy the requirements specified in STL. Our controller learns to roll out trajectories to maximize the STL robustness score in training. In testing, similar to Model Predictive Control (MPC), the learned controller predicts a trajectory within a planning horizon to ensure the satisfaction of the STL requirement in deployment. A backup policy is designed to ensure safety when our controller fails. Our approach can adapt to various initial conditions and environmental parameters. We conduct experiments on six tasks, where our method with the backup policy outperforms the classical methods (MPC, STL-solver), model-free and model-based RL methods in STL satisfaction rate, especially on tasks with complex STL specifications while being 10X-100X faster than the classical methods.
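
For readers unfamiliar with the STL robustness score mentioned above, the following minimal sketch shows the standard quantitative semantics for two simple formulas over a discrete-time scalar trace. The function names and the sample trace are hypothetical. The exact min/max form shown here is the textbook definition; the abstract does not state which formulation, exact or smoothed, is used for training.

```python
def always_greater(signal, threshold):
    """Robustness of G (x > threshold): the worst-case margin over the trace."""
    return min(x - threshold for x in signal)

def eventually_greater(signal, threshold):
    """Robustness of F (x > threshold): the best-case margin over the trace."""
    return max(x - threshold for x in signal)

trace = [0.5, 1.2, 2.0, 0.8]  # hypothetical scalar trajectory

rho_always = always_greater(trace, 0.0)          # positive: x stays above 0 throughout
rho_eventually = eventually_greater(trace, 1.5)  # positive: x exceeds 1.5 at some step
rho_violated = always_greater(trace, 1.0)        # negative: x drops below 1.0 at some step
```

A positive score means the formula is satisfied with that margin and a negative score means it is violated, which is why maximizing robustness pushes trajectories toward satisfying the specification.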

AVARS -- Alleviating Unexpected Urban Road Traffic Congestion using UAVs

Authors:Jiaying Guo, Michael R. Jones, Soufiene Djahel, Shen Wang
Date:2023-09-10 09:40:20

Reducing unexpected urban traffic congestion caused by en-route events (e.g., road closures, car crashes, etc.) often requires fast and accurate reactions to choose the best-fit traffic signals. Traditional traffic light control systems, such as SCATS and SCOOT, are not efficient as their traffic data provided by induction loops has a low update frequency (i.e., longer than 1 minute). Moreover, the traffic light signal plans used by these systems are selected from a limited set of candidate plans pre-programmed prior to unexpected events' occurrence. Recent research demonstrates that camera-based traffic light systems controlled by deep reinforcement learning (DRL) algorithms are more effective in reducing traffic congestion, in which the cameras can provide high-frequency high-resolution traffic data. However, these systems are costly to deploy in big cities due to the excessive potential upgrades required to road infrastructure. In this paper, we argue that Unmanned Aerial Vehicles (UAVs) can play a crucial role in dealing with unexpected traffic congestion because UAVs with onboard cameras can be economically deployed when and where unexpected congestion occurs. Then, we propose a system called "AVARS" that explores the potential of using UAVs to reduce unexpected urban traffic congestion using DRL-based traffic light signal control. This approach is validated on a widely used open-source traffic simulator with practical UAV settings, including its traffic monitoring ranges and battery lifetime. Our simulation results show that AVARS can effectively recover the unexpected traffic congestion in Dublin, Ireland, back to its original un-congested level within the typical battery life duration of a UAV.

Verifiable Reinforcement Learning Systems via Compositionality

Authors:Cyrus Neary, Aryaman Singh Samyal, Christos Verginis, Murat Cubuktepe, Ufuk Topcu
Date:2023-09-09 17:11:44

We propose a framework for verifiable and compositional reinforcement learning (RL) in which a collection of RL subsystems, each of which learns to accomplish a separate subtask, are composed to achieve an overall task. The framework consists of a high-level model, represented as a parametric Markov decision process, which is used to plan and analyze compositions of subsystems, and of the collection of low-level subsystems themselves. The subsystems are implemented as deep RL agents operating under partial observability. By defining interfaces between the subsystems, the framework enables automatic decompositions of task specifications, e.g., reach a target set of states with a probability of at least 0.95, into individual subtask specifications, i.e. achieve the subsystem's exit conditions with at least some minimum probability, given that its entry conditions are met. This in turn allows for the independent training and testing of the subsystems. We present theoretical results guaranteeing that if each subsystem learns a policy satisfying its subtask specification, then their composition is guaranteed to satisfy the overall task specification. Conversely, if the subtask specifications cannot all be satisfied by the learned policies, we present a method, formulated as the problem of finding an optimal set of parameters in the high-level model, to automatically update the subtask specifications to account for the observed shortcomings. The result is an iterative procedure for defining subtask specifications, and for training the subsystems to meet them. Experimental results demonstrate the presented framework's novel capabilities in environments with both full and partial observability, discrete and continuous state and action spaces, as well as deterministic and stochastic dynamics.

Hybrid of representation learning and reinforcement learning for dynamic and complex robotic motion planning

Authors:Chengmin Zhou, Xin Lu, Jiapeng Dai, Bingding Huang, Xiaoxu Liu, Pasi Fränti
Date:2023-09-07 15:00:49

Motion planning is the soul of robot decision making. Classical planning algorithms like graph search and reaction-based algorithms face challenges in cases of dense and dynamic obstacles. Deep learning algorithms generate suboptimal one-step predictions that cause many collisions. Reinforcement learning algorithms generate optimal or near-optimal time-sequential predictions. However, they suffer from slow convergence, suboptimal converged results, and overfitting. This paper introduces a hybrid algorithm for robotic motion planning: long short-term memory (LSTM) pooling and skip connection for attention-based discrete soft actor critic (LSA-DSAC). First, a graph network (relational graph) and an attention network (attention weight) interpret the environmental state for the learning of the discrete soft actor critic algorithm. A difference analysis of these two representation methods shows that the attention network has greater expressive power than the graph network in our task. However, attention-based DSAC faces the overfitting problem in training. Second, the skip connection method is integrated into attention-based DSAC to mitigate overfitting and improve convergence speed. Third, LSTM pooling replaces the sum operator over the attention weights and eliminates overfitting at the cost of slightly slower convergence in early-stage training. Experiments show that LSA-DSAC outperforms the state-of-the-art in training and most evaluations. The physical robot is also implemented and tested in the real world.

Chat Failures and Troubles: Reasons and Solutions

Authors:Manal Helal, Patrick Holthaus, Gabriella Lakatos, Farshid Amirabdollahian
Date:2023-09-07 13:36:03

This paper examines some common problems in Human-Robot Interaction (HRI) causing failures and troubles in Chat. A given use case's design decisions start with choosing a suitable robot and a suitable chatting model, followed by identifying common problems that cause failures, identifying potential solutions, and planning continuous improvement. In conclusion, it is recommended to use a closed-loop control algorithm that guides the use of pre-trained Artificial Intelligence (AI) models and provides vocabulary filtering, to re-train batched models on new datasets, to learn online from data streams, and/or to use reinforcement learning models to self-update the trained models and reduce errors.

Learning to Recharge: UAV Coverage Path Planning through Deep Reinforcement Learning

Authors:Mirco Theile, Harald Bayerlein, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
Date:2023-09-06 16:55:11

Coverage path planning (CPP) is a critical problem in robotics, where the goal is to find an efficient path that covers every point in an area of interest. This work addresses the power-constrained CPP problem with recharge for battery-limited unmanned aerial vehicles (UAVs). In this problem, a notable challenge emerges from integrating recharge journeys into the overall coverage strategy, highlighting the intricate task of making strategic, long-term decisions. We propose a novel proximal policy optimization (PPO)-based deep reinforcement learning (DRL) approach with map-based observations, utilizing action masking and discount factor scheduling to optimize coverage trajectories over the entire mission horizon. We further provide the agent with a position history to handle emergent state loops caused by the recharge capability. Our approach outperforms a baseline heuristic, generalizes to different target zones and maps, with limited generalization to unseen maps. We offer valuable insights into DRL algorithm design for long-horizon problems and provide a publicly available software framework for the CPP problem.
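
The action masking mentioned in this abstract is commonly realized by removing invalid actions from the policy's distribution before sampling. The sketch below shows that generic technique. The function name, the example logits, and the validity mask are hypothetical, and the abstract does not specify how the authors implement masking inside their PPO agent.

```python
import math

def masked_softmax(logits, mask):
    """Softmax restricted to valid actions; masked-out actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Replace the logits of invalid actions with -inf so exp() maps them to 0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: the second action is forbidden, e.g. it would leave the map.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
```

Because the invalid action receives exactly zero probability, the agent can never sample it, while the remaining probabilities are renormalized over the valid actions.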

Multi-log grasping using reinforcement learning and virtual visual servoing

Authors:Erik Wallin, Viktor Wiberg, Martin Servin
Date:2023-09-06 13:37:52

We explore multi-log grasping using reinforcement learning and virtual visual servoing for automated forwarding in a simulated environment. Automation of forest processes is a major challenge, and many techniques regarding robot control pose different challenges due to the unstructured and harsh outdoor environment. Grasping multiple logs involves various problems of dynamics and path planning, where understanding the interaction between the grapple, logs, terrain, and obstacles requires visual information. To address these challenges, we separate image segmentation from crane control and utilise a virtual camera to provide an image stream from reconstructed 3D data. We use Cartesian control to simplify domain transfer to real-world applications. Since log piles are static, visual servoing using a 3D reconstruction of the pile and its surroundings is equivalent to using real camera data until the point of grasping. This relaxes the limits on computational resources and time for the challenge of image segmentation and allows for collecting data in situations where the log piles are not occluded. The disadvantage is the lack of information during grasping. We demonstrate that this problem is manageable and present an agent that is 95% successful in picking one or several logs from challenging piles of 2--5 logs.

Near-continuous time Reinforcement Learning for continuous state-action spaces

Authors:Lorenzo Croissant, Marc Abeille, Bruno Bouchard
Date:2023-09-06 08:01:17

We consider the Reinforcement Learning problem of controlling an unknown dynamical system to maximise the long-term average reward along a single trajectory. Most of the literature considers system interactions that occur in discrete time and discrete state-action spaces. Although this standpoint is suitable for games, it is often inadequate for mechanical or digital systems in which interactions occur at a high frequency, if not in continuous time, and whose state spaces are large if not inherently continuous. Perhaps the only exception is the Linear Quadratic framework for which results exist both in discrete and continuous time. However, its ability to handle continuous states comes with the drawback of a rigid dynamic and reward structure. This work aims to overcome these shortcomings by modelling interaction times with a Poisson clock of frequency $\varepsilon^{-1}$, which captures arbitrary time scales: from discrete ($\varepsilon=1$) to continuous time ($\varepsilon\downarrow0$). In addition, we consider a generic reward function and model the state dynamics according to a jump process with an arbitrary transition kernel on $\mathbb{R}^d$. We show that the celebrated optimism protocol applies when the sub-tasks (learning and planning) can be performed effectively. We tackle learning within the eluder dimension framework and propose an approximate planning method based on a diffusive limit approximation of the jump process. Overall, our algorithm enjoys a regret of order $\tilde{\mathcal{O}}(\varepsilon^{1/2} T+\sqrt{T})$. As the frequency of interactions blows up, the approximation error $\varepsilon^{1/2} T$ vanishes, showing that $\tilde{\mathcal{O}}(\sqrt{T})$ is attainable in near-continuous time.
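
The closing claim can be checked with one line of arithmetic on the stated bound. This is only a reading of the bound as given in the abstract for a fixed horizon $T > 0$, not a reproduction of the authors' proof.

```latex
\[
\varepsilon^{1/2}\, T \le \sqrt{T}
\;\Longleftrightarrow\;
\varepsilon^{1/2} \le T^{-1/2}
\;\Longleftrightarrow\;
\varepsilon \le \frac{1}{T},
\qquad\text{so}\qquad
\tilde{\mathcal{O}}\!\left(\varepsilon^{1/2}\, T + \sqrt{T}\right)
= \tilde{\mathcal{O}}\!\left(\sqrt{T}\right)
\quad\text{whenever } \varepsilon \le \frac{1}{T}.
\]
```

In words, once the mean time between interactions $\varepsilon$ is at most $1/T$, the approximation term is dominated by the $\sqrt{T}$ term.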

Pre- and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer

Authors:Minchan Kim, Junhyek Han, Jaehyung Kim, Beomjoon Kim
Date:2023-09-06 06:17:53

We present a system for non-prehensile manipulation tasks that require a significant number of contact mode transitions and the use of environmental contacts to successfully manipulate an object to a target location. Our method is based on deep reinforcement learning which, unlike state-of-the-art planning algorithms, does not require a priori knowledge of the physical parameters of the object or environment such as friction coefficients or centers of mass. The planning time is reduced to the simple feed-forward prediction time on a neural network. We propose a computational structure, action space design, and curriculum learning scheme that facilitate efficient exploration and sim-to-real transfer. In challenging real-world non-prehensile manipulation tasks, we show that our method can generalize over different objects, and succeed even for novel objects not seen during training. Project website: https://sites.google.com/view/nonprenehsile-decomposition

Reinforcement Learning of Action and Query Policies with LTL Instructions under Uncertain Event Detector

Authors:Wataru Hatanaka, Ryota Yamashina, Takamitsu Matsubara
Date:2023-09-06 05:12:07

Reinforcement learning (RL) with linear temporal logic (LTL) objectives can allow robots to carry out symbolic event plans in unknown environments. Most existing methods assume that the event detector can accurately map environmental states to symbolic events; however, uncertainty is inevitable for real-world event detectors. Such uncertainty in an event detector generates multiple branching possibilities on LTL instructions, confusing action decisions. Moreover, the queries to the uncertain event detector, necessary for the task's progress, may increase the uncertainty further. To cope with those issues, we propose an RL framework, Learning Action and Query over Belief LTL (LAQBL), to learn an agent that can consider the diversity of LTL instructions due to uncertain event detection while avoiding task failure due to the unnecessary event-detection query. Our framework simultaneously learns 1) an embedding of belief LTL, which is multiple branching possibilities on LTL instructions using a graph neural network, 2) an action policy, and 3) a query policy which decides whether or not to query for the event detector. Simulations in a 2D grid world and image-input robotic inspection environments show that our method successfully learns actions to follow LTL instructions even with uncertain event detectors.

RLSynC: Offline-Online Reinforcement Learning for Synthon Completion

Authors:Frazier N. Baker, Ziqi Chen, Daniel Adu-Ampratwum, Xia Ning
Date:2023-09-06 02:40:33

Retrosynthesis is the process of determining the set of reactant molecules that can react to form a desired product. Semi-template-based retrosynthesis methods, which imitate the reverse logic of synthesis reactions, first predict the reaction centers in the products, and then complete the resulting synthons back into reactants. We develop a new offline-online reinforcement learning method RLSynC for synthon completion in semi-template-based methods. RLSynC assigns one agent to each synthon, all of which complete the synthons by conducting actions step by step in a synchronized fashion. RLSynC learns the policy from both offline training episodes and online interactions, which allows RLSynC to explore new reaction spaces. RLSynC uses a standalone forward synthesis model to evaluate the likelihood of the predicted reactants in synthesizing a product, and thus guides the action search. Our results demonstrate that RLSynC can outperform state-of-the-art synthon completion methods with improvements as high as 14.9%, highlighting its potential in synthesis planning.

Neurosymbolic Reinforcement Learning and Planning: A Survey

Authors:K. Acharya, W. Raza, C. M. J. M. Dourado Jr, A. Velasquez, H. Song
Date:2023-09-02 23:41:35

The area of Neurosymbolic Artificial Intelligence (Neurosymbolic AI) is rapidly developing and has become a popular research topic, encompassing sub-fields such as Neurosymbolic Deep Learning (Neurosymbolic DL) and Neurosymbolic Reinforcement Learning (Neurosymbolic RL). Compared to traditional learning methods, Neurosymbolic AI offers significant advantages by simplifying complexity and providing transparency and explainability. Reinforcement Learning (RL), a long-standing Artificial Intelligence (AI) concept that mimics human behavior using rewards and punishment, is a fundamental component of Neurosymbolic RL, a recent integration of the two fields that has yielded promising results. The aim of this paper is to contribute to the emerging field of Neurosymbolic RL by conducting a literature survey. Our evaluation focuses on the three components that constitute Neurosymbolic RL: neural, symbolic, and RL. We categorize works based on the role played by the neural and symbolic parts in RL, into three taxonomies: Learning for Reasoning, Reasoning for Learning, and Learning-Reasoning. These categories are further divided into sub-categories based on their applications. Furthermore, we analyze the RL components of each research work, including the state space, action space, policy module, and RL algorithm. Additionally, we identify research opportunities and challenges in various applications within this dynamic field.

A reinforcement learning based construction material supply strategy using robotic crane and computer vision for building reconstruction after an earthquake

Authors:Yifei Xiao, T. Y. Yang, Xiao Pan, Fan Xie, Zhongwei Chen
Date:2023-08-30 19:13:23

After an earthquake, it is particularly important to provide the necessary resources on site because a large amount of infrastructure needs to be repaired or newly constructed. Due to the complex construction environment after the disaster, there are potential safety hazards for human laborers working in this environment. With the advancement of robotic technology and artificial intelligence (AI) algorithms, smart robotic technology is a potential solution for providing construction resources after an earthquake. In this paper, a robotic crane with advanced AI algorithms is proposed to provide resources for infrastructure reconstruction after an earthquake. Proximal policy optimization (PPO), a reinforcement learning (RL) algorithm, is implemented for 3D lift path planning when transporting the construction materials. The state and reward function are designed in detail for RL model training. Two models are trained through a loading task in different environments using the PPO algorithm, one considering the influence of obstacles and the other not considering obstacles. Then, the two trained models are compared and evaluated through an unloading task and a loading task in simulation environments. For each task, two different cases are considered. In the first, there is no obstacle between the initial position where the construction material is lifted and the target position; in the second, there are obstacles between the initial position and the target position. The results show that the model that considers obstacles during training can generate proper actions for the robotic crane to execute so that the crane can automatically transport the construction materials to the desired location with swing suppression, short time consumption and collision avoidance.

EnsembleFollower: A Hybrid Car-Following Framework Based On Reinforcement Learning and Hierarchical Planning

Authors:Xu Han, Xianda Chen, Meixin Zhu, Pinlong Cai, Jianshan Zhou, Xiaowen Chu
Date:2023-08-30 12:55:02

Car-following models have made significant contributions to our understanding of longitudinal driving behavior. However, they often exhibit limited accuracy and flexibility, as they cannot fully capture the complexity inherent in car-following processes, or may falter in unseen scenarios due to their reliance on confined driving skills present in training data. It is worth noting that each car-following model possesses its own strengths and weaknesses depending on specific driving scenarios. Therefore, we propose EnsembleFollower, a hierarchical planning framework for achieving advanced human-like car-following. The EnsembleFollower framework involves a high-level Reinforcement Learning-based agent responsible for judiciously managing multiple low-level car-following models according to the current state, either by selecting an appropriate low-level model to perform an action or by allocating different weights across all low-level components. Moreover, we propose a jerk-constrained kinematic model for more convincing car-following simulations. We evaluate the proposed method based on real-world driving data from the HighD dataset. The experimental results illustrate that EnsembleFollower yields improved accuracy of human-like behavior and achieves effectiveness in combining hybrid models, demonstrating that our proposed framework can handle diverse car-following conditions by leveraging the strengths of various low-level models.
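
As a rough illustration of the two-level idea described above (not the authors' implementation), the sketch below blends the accelerations proposed by several low-level car-following models using weights that a high-level agent would supply, then applies a simple jerk limit in the kinematic update. The model signature, the function names, and the jerk-clipping rule are all assumptions made for this example; selecting a single model corresponds to a one-hot weight vector.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical low-level model signature (an assumption for this sketch):
# (gap_m, ego_speed_mps, lead_speed_mps) -> commanded acceleration in m/s^2
LowLevelModel = Callable[[float, float, float], float]


def blend_accelerations(
    models: Sequence[LowLevelModel],
    weights: Sequence[float],
    gap: float,
    ego_speed: float,
    lead_speed: float,
) -> float:
    """Weighted average of the accelerations proposed by each low-level model."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    blended = sum(w * m(gap, ego_speed, lead_speed) for m, w in zip(models, weights))
    return blended / total


def jerk_limited_step(
    speed: float,
    accel_prev: float,
    accel_cmd: float,
    dt: float,
    max_jerk: float,
) -> Tuple[float, float]:
    """Advance speed by one step while bounding the change in acceleration.

    The acceleration may change by at most max_jerk * dt per step
    (an assumed, simplified form of a jerk constraint).
    """
    max_delta = max_jerk * dt
    delta = max(-max_delta, min(max_delta, accel_cmd - accel_prev))
    accel = accel_prev + delta
    new_speed = max(0.0, speed + accel * dt)  # speed cannot go negative
    return new_speed, accel
```

In a full system the weights would come from the trained high-level policy at every step; here they are plain inputs so the blending and the jerk limit can be inspected in isolation.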

Learning the References of Online Model Predictive Control for Urban Self-Driving

Authors:Yubin Wang, Zengqi Peng, Yusen Xie, Yulin Li, Hakim Ghazzai, Jun Ma
Date:2023-08-30 07:23:37

In this work, we propose a novel learning-based model predictive control (MPC) framework for motion planning and control of urban self-driving. In this framework, instantaneous references and cost functions of online MPC are learned from raw sensor data without relying on any oracle or predicted states of traffic. Moreover, driving safety conditions are latently encoded via the introduction of a learnable instantaneous reference vector. In particular, we implement a deep reinforcement learning (DRL) framework for policy search, where practical and lightweight raw observations are processed to reason about the traffic and provide the online MPC with instantaneous references. The proposed approach is validated in a high-fidelity simulator, where our development manifests remarkable adaptiveness to complex and dynamic traffic. Furthermore, sim-to-real deployments are also conducted to evaluate the generalizability of the proposed framework in various real-world applications. Also, we provide the open-source code and video demonstrations at the project website: https://latent-mpc.github.io/.

Distributed multi-agent target search and tracking with Gaussian process and reinforcement learning

Authors:Jigang Kim, Dohyun Jang, H. Jin Kim
Date:2023-08-29 01:53:14

Deploying multiple robots for target search and tracking has many practical applications, yet the challenge of planning over unknown or partially known targets remains difficult to address. With recent advances in deep learning, intelligent control techniques such as reinforcement learning have enabled agents to learn autonomously from environment interactions with little to no prior knowledge. Such methods can address the exploration-exploitation tradeoff of planning over unknown targets in a data-driven manner, eliminating the reliance on heuristics typical of traditional approaches and streamlining the decision-making pipeline with end-to-end training. In this paper, we propose a multi-agent reinforcement learning technique with target map building based on distributed Gaussian process. We leverage the distributed Gaussian process to encode belief over the target locations and efficiently plan over unknown targets. We evaluate the performance and transferability of the trained policy in simulation and demonstrate the method on a swarm of micro unmanned aerial vehicles with hardware experiments.

Improving Generalization in Reinforcement Learning Training Regimes for Social Robot Navigation

Authors:Adam Sigal, Hsiu-Chin Lin, AJung Moon
Date:2023-08-29 00:00:18

In order for autonomous mobile robots to navigate in human spaces, they must abide by our social norms. Reinforcement learning (RL) has emerged as an effective method to train sequential decision-making policies that are able to respect these norms. However, a large portion of existing work in the field conducts both RL training and testing in simplistic environments. This limits the generalization potential of these models to unseen environments, and the meaningfulness of their reported results. We propose a method to improve the generalization performance of RL social navigation methods using curriculum learning. By employing multiple environment types and by modeling pedestrians using multiple dynamics models, we are able to progressively diversify and escalate difficulty in training. Our results show that the use of curriculum learning in training can be used to achieve better generalization performance than previous training methods. We also show that results presented in many existing state-of-the-art RL social navigation works do not evaluate their methods outside of their training environments, and thus do not reflect their policies' failure to adequately generalize to out-of-distribution scenarios. In response, we validate our training approach on larger and more crowded testing environments than those used in training, allowing for more meaningful measurements of model performance.

On Reward Structures of Markov Decision Processes

Authors:Falcon Z. Dai
Date:2023-08-28 22:29:16

A Markov decision process can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning as evidenced by their presence in the Bellman equations. In our inquiry into various kinds of "costs" associated with reinforcement learning inspired by the demands in robotic applications, rewards are central to understanding the structure of a Markov decision process and reward-centric notions can elucidate important concepts in reinforcement learning. Specifically, we study the sample complexity of policy evaluation and develop a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\frac{\tau_s}{n}})$ for estimating a single state value. Under the online regret minimization setting, we refine the transition-based MDP constant, diameter, into a reward-based constant, maximum expected hitting cost, and with it, provide a theoretical explanation for how a well-known technique, potential-based reward shaping, could accelerate learning with expert knowledge. In an attempt to study safe reinforcement learning, we model hazardous environments with irrecoverability and propose a quantitative notion of safe learning via reset efficiency. In this setting, we modify a classic algorithm to account for resets, achieving promising preliminary numerical results. Lastly, for MDPs with multiple reward functions, we develop a planning algorithm that computationally efficiently finds Pareto-optimal stochastic policies.
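
For readers unfamiliar with potential-based reward shaping, the minimal sketch below shows the standard form of the technique: a shaping term gamma * Phi(s') - Phi(s) is added to the environment reward. The function names and the list-of-potentials interface are assumptions for illustration; the sketch does not reproduce the paper's analysis. The second function shows why the shaping terms telescope along a trajectory, so their discounted sum depends only on the first and last potentials.

```python
from typing import Sequence


def shaped_reward(reward: float, potential_s: float, potential_next: float, gamma: float) -> float:
    """Environment reward plus the potential-based shaping term."""
    return reward + gamma * potential_next - potential_s


def discounted_shaping_sum(potentials: Sequence[float], gamma: float) -> float:
    """Discounted sum of shaping terms along a trajectory s_0, ..., s_T.

    potentials[t] is Phi(s_t). The sum telescopes to
    gamma**T * Phi(s_T) - Phi(s_0).
    """
    total = 0.0
    for t in range(len(potentials) - 1):
        total += (gamma ** t) * (gamma * potentials[t + 1] - potentials[t])
    return total
```

Because the accumulated shaping depends only on the endpoints, the potential function can encode expert knowledge about which states are promising without changing which policies are optimal.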

Traffic Light Control with Reinforcement Learning

Authors:Taoyu Pan
Date:2023-08-28 04:29:49

Traffic light control is important for reducing congestion in urban mobility systems. This paper proposes a real-time traffic light control method using deep Q learning. Our approach incorporates a reward function considering queue lengths, delays, travel time, and throughput. The model dynamically decides phase changes based on current traffic conditions. The training of the deep Q network involves an offline stage from pre-generated data with fixed schedules and an online stage using real-time traffic data. A deep Q network structure with a "phase gate" component is used to simplify the model's learning task under different phases. A "memory palace" mechanism is used to address sample imbalance during the training process. We validate our approach using both synthetic and real-world traffic flow data on a road intersection in Hangzhou, China. Results demonstrate significant performance improvements of the proposed method in reducing vehicle waiting time (57.1% to 100%), queue lengths (40.9% to 100%), and total travel time (16.8% to 68.0%) compared to traditional fixed signal plans.
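
One simple way to combine the four quantities named above into a scalar reward is a weighted sum in which congestion measures are penalized and throughput is rewarded. The sketch below is only an assumed form: the abstract does not give the actual formula, and the field names, units, and unit default weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntersectionStats:
    """Per-step measurements at one intersection (hypothetical fields and units)."""
    queue_length: float  # vehicles currently waiting
    delay: float         # accumulated delay in seconds
    travel_time: float   # travel time in seconds
    throughput: float    # vehicles that cleared the intersection this step


def traffic_reward(
    stats: IntersectionStats,
    w_queue: float = 1.0,
    w_delay: float = 1.0,
    w_travel: float = 1.0,
    w_throughput: float = 1.0,
) -> float:
    """Weighted sum: penalize queue, delay, and travel time; reward throughput."""
    return (
        -w_queue * stats.queue_length
        - w_delay * stats.delay
        - w_travel * stats.travel_time
        + w_throughput * stats.throughput
    )
```

In practice the weights would need tuning, since the four terms have different units and scales.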

End-to-end Autonomous Driving using Deep Learning: A Systematic Review

Authors:Apoorv Singh
Date:2023-08-27 17:43:58

End-to-end autonomous driving is a fully differentiable machine learning system that takes raw sensor input data and other metadata as prior information and directly outputs the ego vehicle's control signals or planned trajectories. This paper attempts to systematically review all recent Machine Learning-based techniques to perform this end-to-end task, including, but not limited to, object detection, semantic scene understanding, object tracking, trajectory predictions, trajectory planning, vehicle control, social behavior, and communications. This paper focuses on recent fully differentiable end-to-end reinforcement learning and deep learning-based techniques. Our paper also builds taxonomies of the significant approaches by sub-grouping them and showcasing their research trends. Finally, this survey highlights the open challenges and points out possible future directions to enlighten further research on the topic.

Reinforcement Learning-based Optimal Control and Software Rejuvenation for Safe and Efficient UAV Navigation

Authors:Angela Chen, Konstantinos Mitsopoulos, Raffaele Romagnoli
Date:2023-08-27 15:38:15

Unmanned autonomous vehicles (UAVs) rely on effective path planning and tracking control to accomplish complex tasks in various domains. Reinforcement Learning (RL) methods are becoming increasingly popular in control applications, as they can learn from data and deal with unmodelled dynamics. Cyber-physical systems (CPSs), such as UAVs, integrate sensing, network communication, control, and computation to solve challenging problems. In this context, Software Rejuvenation (SR) is a protection mechanism that refreshes the control software to mitigate cyber-attacks, but it can affect the tracking controller's performance due to discrepancies between the control software and the physical system state. Traditional approaches to mitigate this effect are conservative, hindering the overall system performance. In this paper, we propose a novel approach that incorporates Deep Reinforcement Learning (Deep RL) into SR to design a safe and high-performing tracking controller. Our approach optimizes safety and performance, and we demonstrate its effectiveness during UAV simulations. We compare our approach with traditional methods and show that it improves the system's performance while maintaining safety constraints.

Actuator Trajectory Planning for UAVs with Overhead Manipulator using Reinforcement Learning

Authors:Hazim Alzorgan, Abolfazl Razi, Ata Jahangir Moshayedi
Date:2023-08-24 15:06:23

In this paper, we investigate the operation of an aerial manipulator system, namely an Unmanned Aerial Vehicle (UAV) equipped with a controllable arm with two degrees of freedom to carry out actuation tasks on the fly. Our solution is based on employing a Q-learning method to control the trajectory of the tip of the arm, also called end-effector. More specifically, we develop a motion planning model based on Time To Collision (TTC), which enables a quadrotor UAV to navigate around obstacles while ensuring the manipulator's reachability. Additionally, we utilize a model-based Q-learning model to independently track and control the desired trajectory of the manipulator's end-effector, given an arbitrary baseline trajectory for the UAV platform. Such a combination enables a variety of actuation tasks such as high-altitude welding, structural monitoring and repair, battery replacement, gutter cleaning, skyscraper cleaning, and power line maintenance in hard-to-reach and risky environments while retaining compatibility with flight control firmware. Our RL-based control mechanism results in a robust control strategy that can handle uncertainties in the motion of the UAV, offering promising performance. Specifically, our method achieves 92% accuracy in terms of average displacement error (i.e. the mean distance between the target and obtained trajectory points) using Q-learning with 15,000 episodes.
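
The two building blocks mentioned above can be sketched generically. The code below computes a basic time to collision (distance divided by closing speed) and performs one standard tabular Q-learning update. It is not the authors' model-based variant; the state and action encodings, function names, and default hyperparameters are assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple


def time_to_collision(distance: float, closing_speed: float) -> float:
    """Seconds until contact at the current closing speed.

    Returns infinity when the obstacle is not getting closer.
    """
    if closing_speed <= 0:
        return math.inf
    return distance / closing_speed


QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(
    q: QTable,
    state: Hashable,
    action: Hashable,
    reward: float,
    next_state: Hashable,
    actions: Sequence[Hashable],
    alpha: float = 0.1,
    gamma: float = 0.99,
) -> float:
    """One tabular Q-learning step; returns the updated Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]
```

A planner could, for example, treat a small TTC as a signal to steer away from an obstacle, while the Q-table is trained separately to track the end-effector trajectory.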

Out of the Cage: How Stochastic Parrots Win in Cyber Security Environments

Authors:Maria Rigaki, Ondřej Lukáš, Carlos A. Catania, Sebastian Garcia
Date:2023-08-23 12:11:27

Large Language Models (LLMs) have gained widespread popularity across diverse domains involving text generation, summarization, and various natural language processing tasks. Despite their inherent limitations, LLM-based designs have shown promising capabilities in planning and navigating open-world scenarios. This paper introduces a novel application of pre-trained LLMs as agents within cybersecurity network environments, focusing on their utility for sequential decision-making processes. We present an approach wherein pre-trained LLMs are leveraged as attacking agents in two reinforcement learning environments. Our proposed agents demonstrate similar or better performance against state-of-the-art agents trained for thousands of episodes in most scenarios and configurations. In addition, the best LLM agents perform similarly to human testers of the environment without any additional training process. This design highlights the potential of LLMs to efficiently address complex decision-making tasks within cybersecurity. Furthermore, we introduce a new network security environment named NetSecGame. The environment is designed to eventually support complex multi-agent scenarios within the network security domain. The proposed environment mimics real network attacks and is designed to be highly modular and adaptable for various scenarios.

A Review on Objective-Driven Artificial Intelligence

Authors:Apoorv Singh
Date:2023-08-20 02:07:42

While advancing rapidly, Artificial Intelligence still falls short of human intelligence in several key aspects due to inherent limitations in current AI technologies and our understanding of cognition. Humans have an innate ability to understand context, nuances, and subtle cues in communication, which allows us to comprehend jokes, sarcasm, and metaphors. Machines struggle to interpret such contextual information accurately. Humans possess a vast repository of common-sense knowledge that helps us make logical inferences and predictions about the world. Machines lack this innate understanding and often struggle with making sense of situations that humans find trivial. In this article, we review the prospective Machine Intelligence candidates, a review from Prof. Yann LeCun, and other work that can help close this gap between human and machine intelligence. Specifically, we talk about what's lacking with the current AI techniques such as supervised learning, reinforcement learning, self-supervised learning, etc. Then we show how Hierarchical planning-based approaches can help us close that gap and deep-dive into energy-based, latent-variable methods and Joint embedding predictive architecture methods.

Intelligent Communication Planning for Constrained Environmental IoT Sensing with Reinforcement Learning

Authors:Yi Hu, Jinhang Zuo, Bob Iannucci, Carlee Joe-Wong
Date:2023-08-19 22:59:09

Internet of Things (IoT) technologies have enabled numerous data-driven mobile applications and have the potential to significantly improve environmental monitoring and hazard warnings through the deployment of a network of IoT sensors. However, these IoT devices are often power-constrained and utilize wireless communication schemes with limited bandwidth. Such power constraints limit the amount of information each device can share across the network, while bandwidth limitations hinder sensors' coordination of their transmissions. In this work, we formulate the communication planning problem of IoT sensors that track the state of the environment. We seek to optimize sensors' decisions in collecting environmental data under stringent resource constraints. We propose a multi-agent reinforcement learning (MARL) method to find the optimal communication policies for each sensor that maximize the tracking accuracy subject to the power and bandwidth limitations. MARL learns and exploits the spatial-temporal correlation of the environmental data at each sensor's location to reduce the redundant reports from the sensors. Experiments on wildfire spread with LoRa wireless network simulators show that our MARL method can learn to balance the need to collect enough data to predict wildfire spread with unknown bandwidth limitations.

Partially Observable Multi-Agent Reinforcement Learning with Information Sharing

Authors:Xiangyu Liu, Kaiqing Zhang
Date:2023-08-16 23:42:03

We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential information-sharing among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communications. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-efficient single-agent RL with partial observations, for efficiently solving POSGs. Inspired by the inefficiency of planning in the ground-truth model, we then propose to further approximate the shared common information to construct an approximate model of the POSG, in which planning an approximate equilibrium (in terms of solving the original POSG) can be quasi-efficient, i.e., of quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm that is both statistically and computationally quasi-efficient. Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the team-optimal solution in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a much more challenging goal. We establish concrete computational and sample complexities under several common structural assumptions of the model. We hope our study could open up the possibilities of leveraging and even designing different information structures, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.

AI planning in the imagination: High-level planning on learned abstract search spaces

Authors:Carlos Martin, Tuomas Sandholm
Date:2023-08-16 22:47:16

Search and planning algorithms have been a cornerstone of artificial intelligence since the field's inception. Giving reinforcement learning agents the ability to plan during execution time has resulted in significant performance improvements in various domains. However, in real-world environments, the model with respect to which the agent plans has been constrained to be grounded in the real environment itself, as opposed to a more abstract model which allows for planning over compound actions and behaviors. We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training, which is completely decoupled from the real environment. Unlike prior approaches, this enables the agent to perform high-level planning at arbitrary timescales and reason in terms of compound or temporally-extended actions, which can be useful in environments where large numbers of base-level micro-actions are needed to perform relevant macro-actions. In addition, our method is more general than comparable prior methods because it seamlessly handles settings with continuous action spaces, combinatorial action spaces, and partial observability. We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman. Experimentally, it outperforms comparable prior methods without assuming access to an environment simulator at execution time.

Planning to Learn: A Novel Algorithm for Active Learning during Model-Based Planning

Authors:Rowan Hodson, Bruce Bassett, Charel van Hoof, Benjamin Rosman, Mark Solms, Jonathan P. Shock, Ryan Smith
Date:2023-08-15 20:39:23

Active Inference is a recent framework for modeling planning under uncertainty. Empirical and theoretical work has now begun to evaluate the strengths and weaknesses of this approach and how it might be improved. A recent extension - the sophisticated inference (SI) algorithm - improves performance on multi-step planning problems through recursive decision tree search. However, little work to date has been done to compare SI to other established planning algorithms. SI was also developed with a focus on inference as opposed to learning. The present paper has two aims. First, we compare performance of SI to Bayesian reinforcement learning (RL) schemes designed to solve similar problems. Second, we present an extension of SI - sophisticated learning (SL) - that more fully incorporates active learning during planning. SL maintains beliefs about how model parameters would change under the future observations expected under each policy. This allows a form of counterfactual retrospective inference in which the agent considers what could be learned from current or past observations given different future observations. To accomplish these aims, we make use of a novel, biologically inspired environment designed to highlight the problem structure for which SL offers a unique solution. Here, an agent must continually search for available (but changing) resources in the presence of competing affordances for information gain. Our simulations show that SL outperforms all other algorithms in this context - most notably, Bayes-adaptive RL and upper confidence bound algorithms, which aim to solve multi-step planning problems using similar principles (i.e., directed exploration and counterfactual reasoning). These results provide added support for the utility of Active Inference in solving this class of biologically-relevant problems and offer added tools for testing hypotheses about human cognition.

Routing Recovery for UAV Networks with Deliberate Attacks: A Reinforcement Learning based Approach

Authors:Sijie He, Ziye Jia, Chao Dong, Wei Wang, Yilu Cao, Yang Yang, Qihui Wu
Date:2023-08-14 07:11:55

The unmanned aerial vehicle (UAV) network has become popular in recent years due to its various applications. In the UAV network, routing is significantly affected by the distributed network topology, leading to the issue that UAVs are vulnerable to deliberate damage. Hence, this paper focuses on the routing plan and recovery for UAV networks under attack. In detail, a deliberate attack model based on the importance of nodes is designed to represent enemy attacks. Then, a node importance ranking mechanism is presented, considering the degree of nodes and link importance. However, it is intractable to handle the routing problem by traditional methods for UAV networks, since link connections change with the UAV availability. Hence, an intelligent algorithm based on reinforcement learning is proposed to recover the routing path when UAVs are attacked. Simulations are conducted, and numerical results verify that the proposed mechanism performs better than other referred methods.

CoverNav: Cover Following Navigation Planning in Unstructured Outdoor Environment with Deep Reinforcement Learning

Authors:Jumman Hossain, Abu-Zaher Faridee, Nirmalya Roy, Anjan Basak, Derrik E. Asher
Date:2023-08-12 15:19:49

Autonomous navigation in offroad environments has been extensively studied in the robotics field. However, navigation in covert situations where an autonomous vehicle needs to remain hidden from outside observers remains an underexplored area. In this paper, we propose a novel Deep Reinforcement Learning (DRL) based algorithm, called CoverNav, for identifying covert and navigable trajectories with minimal cost in offroad terrains and jungle environments in the presence of observers. CoverNav focuses on unmanned ground vehicles seeking shelter and taking cover while safely navigating to a predefined destination. Our proposed DRL method computes a local cost map that helps distinguish which path will grant the maximal covertness while maintaining a low cost trajectory using an elevation map generated from 3D point cloud data, the robot's pose, and directed goal information. CoverNav helps the robot agent learn low-elevation terrain using a reward function that penalizes it proportionately when it experiences high elevation. If an observer is spotted, CoverNav enables the robot to select natural obstacles (e.g., rocks, houses, disabled vehicles, trees, etc.) and use them as shelters to hide behind. We evaluate CoverNav using the Unity simulation environment and show that it guarantees dynamically feasible velocities in the terrain when fed with an elevation map generated by another DRL based navigation algorithm. Additionally, we evaluate CoverNav's effectiveness in achieving a maximum goal distance of 12 meters and its success rate in different elevation scenarios with and without cover objects. We observe competitive performance comparable to state of the art (SOTA) methods without compromising accuracy.

Bayesian Inverse Transition Learning for Offline Settings

Authors:Leo Benac, Sonali Parbhoo, Finale Doshi-Velez
Date:2023-08-09 17:08:29

Offline reinforcement learning is commonly used for sequential decision-making in domains such as healthcare and education, where the rewards are known and the transition dynamics $T$ must be estimated on the basis of batch data. A key challenge for all tasks is how to learn a reliable estimate of the transition dynamics $T$ that produces near-optimal policies that are safe enough so that they never take actions that are far away from the best action with respect to their value functions and informative enough so that they communicate the uncertainties they have. Using data from an expert, we propose a new constraint-based approach that captures our desiderata for reliably learning a posterior distribution of the transition dynamics $T$ that is free from gradients. Our results demonstrate that by using our constraints, we learn a high-performing policy, while considerably reducing the policy's variance over different datasets. We also explain how combining uncertainty estimation with these constraints can help us infer a partial ranking of actions that produce higher returns, and help us infer safer and more informative policies for planning.

AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning

Authors:Michaël Mathieu, Sherjil Ozair, Srivatsan Srinivasan, Caglar Gulcehre, Shangtong Zhang, Ray Jiang, Tom Le Paine, Richard Powell, Konrad Żołna, Julian Schrittwieser, David Choi, Petko Georgiev, Daniel Toyama, Aja Huang, Roman Ring, Igor Babuschkin, Timo Ewalds, Mahyar Bordbar, Sarah Henderson, Sergio Gómez Colmenarejo, Aäron van den Oord, Wojciech Marian Czarnecki, Nando de Freitas, Oriol Vinyals
Date:2023-08-07 12:21:37

StarCraft II is one of the most challenging simulated reinforcement learning environments; it is partially observable, stochastic, multi-agent, and mastering StarCraft II requires strategic planning over long time horizons with real-time low-level execution. It also has an active professional competitive scene. StarCraft II is uniquely suited for advancing offline RL algorithms, both because of its challenging nature and because Blizzard has released a massive dataset of millions of StarCraft II games played by human players. This paper leverages that and establishes a benchmark, called AlphaStar Unplugged, introducing unprecedented challenges for offline reinforcement learning. We define a dataset (a subset of Blizzard's release), tools standardizing an API for machine learning methods, and an evaluation protocol. We also present baseline agents, including behavior cloning, offline variants of actor-critic and MuZero. We improve the state of the art of agents using only offline data, and we achieve a 90% win rate against the previously published AlphaStar behavior cloning agent.

Learning-based Near-optimal Motion Planning for Intelligent Vehicles with Uncertain Dynamics

Authors:Yang Lu, Xinglong Zhang, Xin Xu, Weijia Yao
Date:2023-08-07 03:06:12

Motion planning has been an important research topic in achieving safe and flexible maneuvers for intelligent vehicles. However, it remains challenging to realize efficient and optimal planning in the presence of uncertain model dynamics. In this paper, a sparse kernel-based reinforcement learning (RL) algorithm with Gaussian Process (GP) Regression (called GP-SKRL) is proposed to achieve online adaption and near-optimal motion planning performance. In this algorithm, we design an efficient sparse GP regression method to learn the uncertain dynamics. Based on the updated model, a sparse kernel-based policy iteration algorithm with an exponential barrier function is designed to learn the near-optimal planning policies with the capability to avoid dynamic obstacles. Thereby, batch-mode GP-SKRL with online adaption capability can estimate the changing system dynamics. The converged RL policies are then deployed on vehicles efficiently under a safety-aware module. As a result, the produced driving actions are safe and less conservative, and the planning performance has been noticeably improved. Extensive simulation results show that GP-SKRL outperforms several advanced motion planning methods in terms of average cumulative cost, trajectory length, and task completion time. In particular, experiments on a Hongqi E-HS3 vehicle demonstrate that superior GP-SKRL provides a practical planning solution.

Job Shop Scheduling via Deep Reinforcement Learning: a Sequence to Sequence approach

Authors:Giovanni Bonetta, Davide Zago, Rossella Cancelliere, Andrea Grosso
Date:2023-08-03 14:52:17

Job scheduling is a well-known Combinatorial Optimization problem with endless applications. Well-planned schedules bring many benefits in the context of automated systems: among others, they limit production costs and waste. Nevertheless, the NP-hardness of this problem makes it essential to use heuristics whose design is difficult, requires specialized knowledge and often produces methods tailored to the specific task. This paper presents an original end-to-end Deep Reinforcement Learning approach to scheduling that automatically learns dispatching rules. Our technique is inspired by natural language encoder-decoder models for sequence processing and has never been used, to the best of our knowledge, for scheduling purposes. We applied and tested our method on some benchmark instances of the Job Shop Problem, but this technique is general enough to be potentially used to tackle other optimal job scheduling tasks with minimal intervention. Results demonstrate that we outperform many classical approaches exploiting priority dispatching rules and show competitive results against state-of-the-art Deep Reinforcement Learning ones.

Learning whom to trust in navigation: dynamically switching between classical and neural planning

Authors:Sombit Dey, Assem Sadek, Gianluca Monaci, Boris Chidlovskii, Christian Wolf
Date:2023-07-31 14:29:26

Navigation of terrestrial robots is typically addressed either with localization and mapping (SLAM) followed by classical planning on the dynamically created maps, or by machine learning (ML), often through end-to-end training with reinforcement learning (RL) or imitation learning (IL). Recently, modular designs have achieved promising results, and hybrid algorithms that combine ML with classical planning have been proposed. Existing methods implement these combinations with hand-crafted functions, which cannot fully exploit the complementary nature of the policies and the complex regularities between scene structure and planning performance. Our work builds on the hypothesis that the strengths and weaknesses of neural planners and classical planners follow some regularities, which can be learned from training data, in particular from interactions. This is grounded in the assumption that both trained planners and the mapping algorithms underlying classical planning are subject to failure cases depending on the semantics of the scene and that this dependence is learnable: for instance, certain areas, objects or scene structures can be reconstructed more easily than others. We propose a hierarchical method composed of a high-level planner dynamically switching between a classical and a neural planner. We fully train all neural policies in simulation and evaluate the method in both simulation and real experiments with a LoCoBot robot, showing significant gains in performance, in particular in the real environment. We also qualitatively conjecture on the nature of data regularities exploited by the high-level planner.

Robust Unmanned Surface Vehicle Navigation with Distributional Reinforcement Learning

Authors:Xi Lin, John McConnell, Brendan Englot
Date:2023-07-30 14:15:27

Autonomous navigation of Unmanned Surface Vehicles (USV) in marine environments with current flows is challenging, and few prior works have addressed the sensor-based navigation problem in such environments under no prior knowledge of the current flow and obstacles. We propose a Distributional Reinforcement Learning (RL) based local path planner that learns return distributions which capture the uncertainty of action outcomes, and an adaptive algorithm that automatically tunes the level of sensitivity to the risk in the environment. The proposed planner achieves a more stable learning performance and converges to safer policies than a traditional RL based planner. Computational experiments demonstrate that, compared to a traditional RL based planner and classical local planning methods such as Artificial Potential Fields and the Bug Algorithm, the proposed planner is robust against environmental flows, and is able to plan trajectories that are superior in safety, time and energy consumption.

Using Implicit Behavior Cloning and Dynamic Movement Primitive to Facilitate Reinforcement Learning for Robot Motion Planning

Authors:Zengjie Zhang, Jayden Hong, Amir Soufi Enayati, Homayoun Najjaran
Date:2023-07-29 19:46:09

Reinforcement learning (RL) for motion planning of multi-degree-of-freedom robots still suffers from low efficiency in terms of slow training speed and poor generalizability. In this paper, we propose a novel RL-based robot motion planning framework that uses implicit behavior cloning (IBC) and dynamic movement primitive (DMP) to improve the training speed and generalizability of an off-policy RL agent. IBC utilizes human demonstration data to leverage the training speed of RL, and DMP serves as a heuristic model that transfers motion planning into a simpler planning space. To support this, we also create a human demonstration dataset using a pick-and-place experiment that can be used for similar studies. Comparison studies in simulation reveal the advantage of the proposed method over the conventional RL agents with faster training speed and higher scores. A real-robot experiment indicates the applicability of the proposed method to a simple assembly task. Our work provides a novel perspective on using motion primitives and human demonstration to leverage the performance of RL for robot applications.

Human-Like Implicit Intention Expression for Autonomous Driving Motion Planning: A Method Based on Learning Human Intention Priors

Authors:Jiaqi Liu, Xiao Qi, Ying Ni, Jian Sun, Peng Hang
Date:2023-07-29 10:16:28

One of the key factors determining whether autonomous vehicles (AVs) can be seamlessly integrated into existing traffic systems is their ability to interact smoothly and efficiently with human drivers and communicate their intentions. While many studies have focused on enhancing AVs' human-like interaction and communication capabilities at the behavioral decision-making level, a significant gap remains between the actual motion trajectories of AVs and the psychological expectations of human drivers. This discrepancy can seriously affect the safety and efficiency of AV-HV (Autonomous Vehicle-Human Vehicle) interactions. To address these challenges, we propose a motion planning method for AVs that incorporates implicit intention expression. First, we construct a trajectory space constraint based on human implicit intention priors, compressing and pruning the trajectory space to generate candidate motion trajectories that consider intention expression. We then apply maximum entropy inverse reinforcement learning to learn and estimate human trajectory preferences, constructing a reward function that represents the cognitive characteristics of drivers. Finally, using a Boltzmann distribution, we establish a probabilistic distribution of candidate trajectories based on the reward obtained, selecting human-like trajectory actions. We validated our approach on a real trajectory dataset and compared it with several baseline methods. The results demonstrate that our method excels in human-likeness, intention expression capability, and computational efficiency.
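
The final selection step described above can be illustrated with a minimal sketch, assuming each candidate trajectory already has a scalar reward. The function name, the temperature parameter, and the reward values are hypothetical and are not taken from the paper; the sketch only shows how a Boltzmann distribution turns rewards into selection probabilities.

```python
import math

def boltzmann_probabilities(rewards, temperature=1.0):
    """Return probabilities proportional to exp(reward / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum reward before exponentiating for numerical stability.
    peak = max(rewards)
    weights = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical rewards for three candidate trajectories (illustrative only).
candidate_rewards = [1.0, 2.0, 0.5]
probs = boltzmann_probabilities(candidate_rewards)

# Index of the most probable candidate under the distribution.
best = max(range(len(probs)), key=probs.__getitem__)
```

A lower temperature concentrates probability on the highest-reward candidate, while a higher temperature spreads it more evenly across candidates.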

Coordination of Bounded Rational Drones through Informed Prior Policy

Authors:Durgakant Pushp, Junhong Xu, Lantao Liu
Date:2023-07-28 20:41:17

Biological agents, such as humans and animals, are capable of making decisions out of a very large number of choices in a limited time. They can do so because they use their prior knowledge to find a solution that is not necessarily optimal but good enough for the given task. In this work, we study the motion coordination of multiple drones under the above-mentioned paradigm, Bounded Rationality (BR), to achieve cooperative motion planning tasks. Specifically, we design a prior policy that provides useful goal-directed navigation heuristics in familiar environments and is adaptive in unfamiliar ones via Reinforcement Learning augmented with an environment-dependent exploration noise. Integrating this prior policy in the game-theoretic bounded rationality framework allows agents to quickly make decisions in a group considering other agents' computational constraints. Our investigation assures that agents with a well-informed prior policy increase the efficiency of the collective decision-making capability of the group. We have conducted rigorous experiments in simulation and in the real world to demonstrate that the ability of informed agents to navigate to the goal safely can guide the group to coordinate efficiently under the BR framework.

Thinker: Learning to Plan and Act

Authors:Stephen Chung, Ivan Anokhin, David Krueger
Date:2023-07-27 16:40:14

We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments.

Reinforced Potential Field for Multi-Robot Motion Planning in Cluttered Environments

Authors:Dengyu Zhang, Xinyu Zhang, Zheng Zhang, Bo Zhu, Qingrui Zhang
Date:2023-07-26 11:13:14

Motion planning is challenging for multiple robots in cluttered environments without communication, especially in view of real-time efficiency, motion safety, distributed computation, and trajectory optimality. In this paper, a reinforced potential field method is developed for distributed multi-robot motion planning, which is a synthesized design of reinforcement learning and artificial potential fields. An observation embedding with a self-attention mechanism is presented to model the robot-robot and robot-environment interactions. A soft wall-following rule is developed to improve the trajectory smoothness. Our method belongs to reactive planning, but environment properties are implicitly encoded. The number of robots in our method can be scaled up arbitrarily. The performance improvement over a vanilla APF and RL method has been demonstrated via numerical simulations. Experiments are also performed using quadrotors to further illustrate the competence of our method.

Settling the Sample Complexity of Online Reinforcement Learning

Authors:Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du
Date:2023-07-25 15:42:11

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by Zhang et al. (2020), achieves a regret on the order of (modulo log factors) $\min\{\sqrt{SAH^3K},\, HK\}$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
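
To make the two regimes of the stated bound concrete, the following sketch evaluates the order of the regret and of the PAC sample complexity for hypothetical problem sizes. Constants and log factors are ignored, so the numbers indicate scaling only and are not values reported in the paper. The two terms of the regret bound coincide when K = SAH, which follows from squaring both sides.

```python
import math

def regret_order(S, A, H, K):
    """Order of the regret bound min(sqrt(S*A*H^3*K), H*K), ignoring log factors."""
    return min(math.sqrt(S * A * H**3 * K), H * K)

def pac_episodes_order(S, A, H, eps):
    """Order of the PAC sample complexity S*A*H^3 / eps^2, ignoring log factors."""
    return S * A * H**3 / eps**2

# Hypothetical problem sizes (illustrative only): S*A*H = 200.
S, A, H = 10, 4, 5

small_k = regret_order(S, A, H, K=100)      # K < S*A*H: the H*K term is smaller
crossover = regret_order(S, A, H, K=200)    # K = S*A*H: both terms are equal
large_k = regret_order(S, A, H, K=10_000)   # K > S*A*H: the square-root term is smaller
episodes = pac_episodes_order(S, A, H, eps=0.5)
```

For K below SAH the trivial HK term is the smaller one, and above it the square-root term takes over.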

On Solving the Rubik's Cube with Domain-Independent Planners Using Standard Representations

Authors:Bharath Muppasani, Vishal Pallagani, Biplav Srivastava, Forest Agostinelli
Date:2023-07-25 14:52:23

Rubik's Cube (RC) is a well-known and computationally challenging puzzle that has motivated AI researchers to explore efficient alternative representations and problem-solving methods. The ideal situation for planning here is that a problem be solved optimally and efficiently represented in a standard notation using a general-purpose solver and heuristics. The fastest solver today for RC is DeepCubeA with a custom representation, and another approach is with Scorpion planner with State-Action-Space+ (SAS+) representation. In this paper, we present the first RC representation in the popular PDDL language so that the domain becomes more accessible to PDDL planners, competitions, and knowledge engineering tools, and is more human-readable. We then bridge across existing approaches and compare performance. We find that in one comparable experiment, DeepCubeA (trained with 12 RC actions) solves all problems with varying complexities, albeit only 78.5% are optimal plans. For the same problem set, Scorpion with SAS+ representation and pattern database heuristics solves 61.50% problems optimally, while FastDownward with PDDL representation and FF heuristic solves 56.50% problems, out of which 79.64% of the plans generated were optimal. Our study provides valuable insights into the trade-offs between representational choice and plan optimality that can help researchers design future strategies for challenging domains combining general-purpose solving methods (planning, reinforcement learning), heuristics, and representations (standard or custom).
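
The three optimality figures above are quoted on different bases, so a short calculation helps compare them. The sketch below reads the FastDownward figure as 79.64% of the 56.50% of problems it solved and expresses it as a share of the whole problem set. That reading of the abstract is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the abstract, expressed as fractions.
deepcubea_optimal = 0.785         # solves all problems; 78.5% of plans optimal
scorpion_optimal = 0.615          # 61.50% of problems solved optimally
fd_solved = 0.565                 # FastDownward solves 56.50% of problems
fd_optimal_given_solved = 0.7964  # 79.64% of those plans are optimal

# Share of the whole problem set that FastDownward solves optimally.
fd_optimal = fd_solved * fd_optimal_given_solved

print(f"{fd_optimal:.1%}")  # prints 45.0%
```

On this reading, the shares of problems solved optimally are roughly 78.5% for DeepCubeA, 61.5% for Scorpion, and 45.0% for FastDownward.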

Submodular Reinforcement Learning

Authors:Manish Prajapat, Mojmír Mutný, Melanie N. Zeilinger, Andreas Krause
Date:2023-07-25 09:46:02

In reinforcement learning (RL), rewards of states are typically considered additive, and following the Markov assumption, they are independent of states visited previously. In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously. To tackle this, we propose submodular RL (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions which capture diminishing returns. Unfortunately, in general, even in tabular settings, we show that the resulting optimization problem is hard to approximate. On the other hand, motivated by the success of greedy algorithms in classical submodular optimization, we propose SubPO, a simple policy gradient-based algorithm for SubRL that handles non-additive rewards by greedily maximizing marginal gains. Indeed, under some assumptions on the underlying Markov Decision Process (MDP), SubPO recovers optimal constant factor approximations of submodular bandits. Moreover, we derive a natural policy gradient approach for locally optimizing SubRL instances even in large state and action spaces. We showcase the versatility of our approach by applying SubPO to several applications, such as biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization. Our results demonstrate sample efficiency, as well as scalability to high-dimensional state-action spaces.
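
The notion of diminishing returns can be illustrated with a small coverage example. This is not SubPO, which is a policy gradient algorithm. It is only a sketch of a submodular reward and its marginal gain, using hypothetical states that each cover a set of grid cells.

```python
def coverage(visited):
    """Number of distinct cells covered by the visited states (a submodular set function)."""
    return len(set().union(*visited))

def marginal_gain(history, new_state):
    """Extra coverage obtained by visiting new_state after the given history."""
    return coverage(history + [new_state]) - coverage(history)

# Hypothetical states, each described by the set of cells it covers.
state_a = frozenset({1, 2, 3})
state_b = frozenset({3, 4})
state_c = frozenset({2, 3, 4})

# The same state is worth less the more similar states were visited before it.
gain_first = marginal_gain([], state_c)
gain_after_a = marginal_gain([state_a], state_c)
gain_after_ab = marginal_gain([state_a, state_b], state_c)
```

An additive reward would assign state_c the same value in all three cases, whereas the coverage reward shrinks as the history grows.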

2-Level Reinforcement Learning for Ships on Inland Waterways: Path Planning and Following

Authors:Martin Waltz, Niklas Paulig, Ostap Okhrin
Date:2023-07-25 08:42:59

This paper proposes a realistic modularized framework for controlling autonomous surface vehicles (ASVs) on inland waterways (IWs) based on deep reinforcement learning (DRL). The framework improves operational safety and comprises two levels: a high-level local path planning (LPP) unit and a low-level path following (PF) unit, each consisting of a DRL agent. The LPP agent is responsible for planning a path under consideration of dynamic vessels, closing a gap in the current research landscape. In addition, the LPP agent adequately considers traffic rules and the geometry of the waterway. We thereby introduce a novel application of a spatial-temporal recurrent neural network architecture to continuous action spaces. The LPP agent outperforms a state-of-the-art artificial potential field (APF) method by increasing the minimum distance to other vessels by 65% on average. The PF agent performs low-level actuator control while accounting for shallow water influences and the environmental forces winds, waves, and currents. Compared with a proportional-integral-derivative (PID) controller, the PF agent yields only 61% of the mean cross-track error (MCTE) while significantly reducing control effort (CE) in terms of the required absolute rudder angle. Lastly, both agents are jointly validated in simulation, employing the lower Elbe in northern Germany as an example case and using real automatic identification system (AIS) trajectories to model the behavior of other ships.

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

Authors:Chuming Li, Ruonan Jia, Jie Liu, Yinmin Zhang, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang
Date:2023-07-24 16:52:31

Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm -- Model-based Planning Distilled to Policy (MPDP) -- that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.

Emergence of Adaptive Circadian Rhythms in Deep Reinforcement Learning

Authors:Aqeel Labash, Florian Fletzer, Daniel Majoral, Raul Vicente
Date:2023-07-22 18:47:18

Adapting to regularities of the environment is critical for biological organisms to anticipate events and plan. A prominent example is the circadian rhythm corresponding to the internalization by organisms of the 24-hour period of the Earth's rotation. In this work, we study the emergence of circadian-like rhythms in deep reinforcement learning agents. In particular, we deployed agents in an environment with a reliable periodic variation while solving a foraging task. We systematically characterize the agent's behavior during learning and demonstrate the emergence of a rhythm that is endogenous and entrainable. Interestingly, the internal rhythm adapts to shifts in the phase of the environmental signal without any re-training. Furthermore, we show via bifurcation and phase response curve analyses how artificial neurons develop dynamics to support the internalization of the environmental rhythm. From a dynamical systems view, we demonstrate that the adaptation proceeds by the emergence of a stable periodic orbit in the neuron dynamics with a phase response that allows an optimal phase synchronisation between the agent's dynamics and the environmental rhythm.

Using Reinforcement Learning for the Three-Dimensional Loading Capacitated Vehicle Routing Problem

Authors:Stefan Schoepf, Stephen Mak, Julian Senoner, Liming Xu, Netland Torbjörn, Alexandra Brintrup
Date:2023-07-22 18:05:28

Heavy goods vehicles are vital backbones of the supply chain delivery system but also contribute significantly to carbon emissions with only 60% loading efficiency in the United Kingdom. Collaborative vehicle routing has been proposed as a solution to increase efficiency, but challenges remain to make this a possibility. One key challenge is the efficient computation of viable solutions for co-loading and routing. Current operations research methods suffer from non-linear scaling with increasing problem size and are therefore bound to limited geographic areas to compute results in time for day-to-day operations. This only allows for local optima in routing and leaves global optimisation potential untouched. We develop a reinforcement learning model to solve the three-dimensional loading capacitated vehicle routing problem in approximately linear time. While this problem has been studied extensively in operations research, no publications on solving it with reinforcement learning exist. We demonstrate the favourable scaling of our reinforcement learning model and benchmark our routing performance against state-of-the-art methods. The model performs within an average gap of 3.83% to 8.10% compared to established methods. Our model not only represents a promising first step towards large-scale logistics optimisation with reinforcement learning but also lays the foundation for this research stream. GitHub: https://github.com/if-loops/3L-CVRP

Selective Perception: Optimizing State Descriptions with Reinforcement Learning for Language Model Actors

Authors:Kolby Nottingham, Yasaman Razeghi, Kyungmin Kim, JB Lanier, Pierre Baldi, Roy Fox, Sameer Singh
Date:2023-07-21 22:02:50

Large language models (LLMs) are being applied as actors for sequential decision making tasks in domains such as robotics and games, utilizing their general world knowledge and planning abilities. However, previous work does little to explore what environment state information is provided to LLM actors via language. Exhaustively describing high-dimensional states can impair performance and raise inference costs for LLM actors. Previous LLM actors avoid the issue by relying on hand-engineered, task-specific protocols to determine which features to communicate about a state and which to leave out. In this work, we propose Brief Language INputs for DEcision-making Responses (BLINDER), a method for automatically selecting concise state descriptions by learning a value function for task-conditioned state descriptions. We evaluate BLINDER on the challenging video game NetHack and a robotic manipulation task. Our method improves task success rate, reduces input size and compute costs, and generalizes between LLM actors.

JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning

Authors:Kaiwen Wang, Junxiong Wang, Yueying Li, Nathan Kallus, Immanuel Trummer, Wen Sun
Date:2023-07-21 17:00:06

Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost and it is the core NP-hard combinatorial optimization problem of query optimization. In this paper, we present JoinGym, a lightweight and easy-to-use query optimization environment for reinforcement learning (RL) that captures both the left-deep and bushy variants of the JOS problem. Compared to existing query optimization environments, the key advantages of JoinGym are usability and significantly higher throughput, which we accomplish by simulating query executions entirely offline. Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset. We release a novel cardinality dataset for 3300 SQL queries based on real IMDb workloads which may be of independent interest, e.g., for cardinality estimation. Finally, we extensively benchmark four RL algorithms and find that their cost distributions are heavy-tailed, which motivates future work in risk-sensitive RL. In sum, JoinGym enables users to rapidly prototype RL algorithms on realistic database problems without needing to set up and run live systems.
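
The idea of costing a plan by table lookup can be sketched as follows. This does not reproduce JoinGym's API or its exact cost function. The cardinality table, the table names, and the choice to sum intermediate result sizes are assumptions made purely for illustration of a left-deep plan.

```python
from itertools import permutations

# Hypothetical pre-computed cardinalities, keyed by the set of tables joined so far.
CARDINALITIES = {
    frozenset({"a", "b"}): 1000,
    frozenset({"a", "c"}): 50,
    frozenset({"b", "c"}): 200,
    frozenset({"a", "b", "c"}): 120,
}

def left_deep_cost(order, cardinalities):
    """Sum the looked-up sizes of every intermediate result of a left-deep plan."""
    joined = {order[0]}
    cost = 0
    for table in order[1:]:
        joined.add(table)
        # No query is executed: the intermediate size comes from the lookup table.
        cost += cardinalities[frozenset(joined)]
    return cost

# Exhaustive search is feasible only for tiny examples such as this one.
best_order = min(
    permutations(["a", "b", "c"]),
    key=lambda order: left_deep_cost(order, CARDINALITIES),
)
```

Because every step is a dictionary lookup, many candidate orders can be scored quickly, which is the property an RL agent benefits from during training.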

An Analysis of Multi-Agent Reinforcement Learning for Decentralized Inventory Control Systems

Authors:Marwan Mousa, Damien van de Berg, Niki Kotecha, Ehecatl Antonio del Rio-Chanona, Max Mowbray
Date:2023-07-21 08:52:08

Most solutions to the inventory management problem assume a centralization of information that is incompatible with organisational constraints in real supply chain networks. The inventory management problem is a well-known planning problem in operations research, concerned with finding the optimal re-order policy for nodes in a supply chain. While many centralized solutions to the problem exist, they are not applicable to real-world supply chains made up of independent entities. The problem can however be naturally decomposed into sub-problems, each associated with an independent entity, turning it into a multi-agent system. Therefore, a decentralized data-driven solution to inventory management problems using multi-agent reinforcement learning is proposed where each entity is controlled by an agent. Three multi-agent variations of the proximal policy optimization algorithm are investigated through simulations of different supply chain networks and levels of uncertainty. The centralized training decentralized execution framework is deployed, which relies on offline centralization during simulation-based policy identification, but enables decentralization when the policies are deployed online to the real system. Results show that using multi-agent proximal policy optimization with a centralized critic leads to performance very close to that of a centralized data-driven solution and outperforms a distributed model-based solution in most cases while respecting the information constraints of the system.

Goal-Conditioned Reinforcement Learning with Disentanglement-based Reachability Planning

Authors:Zhifeng Qian, Mingyu You, Hongjun Zhou, Xuanhui Xu, Bin He
Date:2023-07-20 13:08:14

Goal-Conditioned Reinforcement Learning (GCRL) can enable agents to spontaneously set diverse goals to learn a set of skills. Despite the excellent works proposed in various fields, reaching distant goals in temporally extended tasks remains a challenge for GCRL. Current works tackled this problem by leveraging planning algorithms to plan intermediate subgoals to augment GCRL. Their methods need two crucial requirements: (i) a state representation space to search valid subgoals, and (ii) a distance function to measure the reachability of subgoals. However, they struggle to scale to high-dimensional state space due to their non-compact representations. Moreover, they cannot collect high-quality training data through standard GC policies, which results in an inaccurate distance function. Both affect the efficiency and performance of planning and policy learning. In the paper, we propose a goal-conditioned RL algorithm combined with Disentanglement-based Reachability Planning (REPlan) to solve temporally extended tasks. In REPlan, a Disentangled Representation Module (DRM) is proposed to learn compact representations which disentangle robot poses and object positions from high-dimensional observations in a self-supervised manner. A simple REachability discrimination Module (REM) is also designed to determine the temporal distance of subgoals. Moreover, REM computes intrinsic bonuses to encourage the collection of novel states for training. We evaluate our REPlan in three vision-based simulation tasks and one real-world task. The experiments demonstrate that our REPlan significantly outperforms the prior state-of-the-art methods in solving temporally extended tasks.

A Decision Making Framework for Recommended Maintenance of Road Segments

Authors:Haoyu Sun, Yan Yan
Date:2023-07-19 15:55:25

Due to limited budgets allocated for road maintenance projects in various countries, road management departments face difficulties in making scientific maintenance decisions. This paper aims to provide road management departments with more scientific decision tools and evidence. The framework proposed in this paper mainly has the following four innovative points: 1) Predicting pavement performance deterioration levels of road sections as decision basis rather than accurately predicting specific indicator values; 2) Determining maintenance route priorities based on multiple factors; 3) Making maintenance plan decisions by establishing deep reinforcement learning models to formulate predictive strategies based on past maintenance performance evaluations, while considering both technical and management indicators; 4) Determining repair section priorities according to actual and suggested repair effects. By resolving these four issues, the framework can make intelligent decisions regarding optimal maintenance plans and sections, taking into account limited funds and historical maintenance management experiences.

Towards A Unified Agent with Foundation Models

Authors:Norman Di Palo, Arunkumar Byravan, Leonard Hasenclever, Markus Wulfmeier, Nicolas Heess, Martin Riedmiller
Date:2023-07-18 22:37:30

Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour, in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment, where a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.

Machine-directed gravitational-wave counterpart discovery

Authors:Niharika Sravan, Matthew J. Graham, Michael W. Coughlin, Tomas Ahumada, Shreya Anand
Date:2023-07-18 12:48:13

Joint observations in electromagnetic and gravitational waves shed light on the physics of objects and surrounding environments with extreme gravity that are otherwise unreachable via siloed observations in each messenger. However, such detections remain challenging due to the rapid and faint nature of counterparts. Protocols for discovery and inference still rely on human experts manually inspecting survey alert streams and intuiting optimal usage of limited follow-up resources. Strategizing an optimal follow-up program requires adaptive sequential decision-making given evolving light curve data that (i) maximizes a global objective despite incomplete information and (ii) is robust to stochasticity introduced by detectors/observing conditions. Reinforcement learning (RL) approaches allow agents to implicitly learn the physics/detector dynamics and the behavior policy that maximize a designated objective through experience. To demonstrate the utility of such an approach for the kilonova follow-up problem, we train a toy RL agent for the goal of maximizing follow-up photometry for the true kilonova among several contaminant transient light curves. In a simulated environment where the agent learns online, it achieves 3x higher accuracy compared to a random strategy. However, it is surpassed by human agents by up to a factor of 2. This is likely because our hypothesis function (Q that is linear in state-action features) is an insufficient representation of the optimal behavior policy. More complex agents could perform at par or surpass human experts. Agents like these could pave the way for machine-directed software infrastructure to efficiently respond to next generation detectors, for conducting science inference and optimally planning expensive follow-up observations, scalably and with demonstrable performance guarantees.

REX: Rapid Exploration and eXploitation for AI Agents

Authors:Rithesh Murthy, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Le Xue, Weiran Yao, Yihao Feng, Zeyuan Chen, Akash Gokul, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese
Date:2023-07-18 04:26:33

In this paper, we propose an enhanced approach for Rapid Exploration and eXploitation for AI Agents called REX. Existing AutoGPT-style techniques have inherent limitations, such as a heavy reliance on precise descriptions for decision-making, and the lack of a systematic approach to leverage try-and-fail procedures akin to traditional Reinforcement Learning (RL). REX introduces an additional layer of rewards and integrates concepts similar to Upper Confidence Bound (UCB) scores, leading to more robust and efficient AI agent performance. This approach has the advantage of enabling the utilization of offline behaviors from logs and allowing seamless integration with existing foundation models while it does not require any model fine-tuning. Through comparative analysis with existing methods such as Chain-of-Thoughts (CoT) and Reasoning viA Planning (RAP), REX-based methods demonstrate comparable performance and, in certain cases, even surpass the results achieved by these existing techniques. Notably, REX-based methods exhibit remarkable reductions in execution time, enhancing their practical applicability across a diverse set of scenarios.
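
For readers unfamiliar with Upper Confidence Bound scores, the following is a minimal sketch of the textbook UCB1 rule. It is not REX itself, whose scoring is only described as "similar to" UCB; the function names and the `stats` layout are assumptions made for this example.

```python
import math


def ucb_score(mean_reward, n_visits, n_total, c=math.sqrt(2)):
    """UCB1-style score: empirical mean plus an exploration bonus."""
    if n_visits == 0:
        return math.inf  # unvisited options are tried first
    return mean_reward + c * math.sqrt(math.log(n_total) / n_visits)


def select(stats):
    """stats maps option -> (sum_of_rewards, n_visits); returns the top-scoring option."""
    n_total = sum(n for _, n in stats.values())

    def score(option):
        total, n = stats[option]
        mean = total / n if n else 0.0
        return ucb_score(mean, n, n_total)

    return max(stats, key=score)
```

The bonus term shrinks as an option is visited more often, so the rule gradually shifts from exploration to exploitation.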

Can Euclidean Symmetry be Leveraged in Reinforcement Learning and Planning?

Authors:Linfeng Zhao, Owen Howell, Jung Yeon Park, Xupeng Zhu, Robin Walters, Lawson L. S. Wong
Date:2023-07-17 04:01:48

In robotic tasks, changes in reference frames typically do not influence the underlying physical properties of the system, a property known as the invariance of physical laws. These changes, which preserve distance, encompass isometric transformations such as translations, rotations, and reflections, collectively known as the Euclidean group. In this work, we delve into the design of improved learning algorithms for reinforcement learning and planning tasks that possess Euclidean group symmetry. We put forth a theory that unifies prior work on discrete and continuous symmetry in reinforcement learning, planning, and optimal control. On the algorithmic side, we further extend 2D path planning with value-based planning to continuous MDPs and propose a pipeline for constructing equivariant sampling-based planning algorithms. Our work is substantiated with empirical evidence and illustrated through examples that explain the benefits of equivariance to Euclidean symmetry in tackling natural control problems.

POMDP inference and robust solution via deep reinforcement learning: An application to railway optimal maintenance

Authors:Giacomo Arcieri, Cyprien Hoelzl, Oliver Schwery, Daniel Straub, Konstantinos G. Papakonstantinou, Eleni Chatzi
Date:2023-07-16 15:44:58

Partially Observable Markov Decision Processes (POMDPs) can model complex sequential decision-making problems under stochastic and uncertain environments. A main reason hindering their broad adoption in real-world applications is the lack of availability of a suitable POMDP model or a simulator thereof. Available solution algorithms, such as Reinforcement Learning (RL), require the knowledge of the transition dynamics and the observation generating process, which are often unknown and non-trivial to infer. In this work, we propose a combined framework for inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model, which is conditioned on actions, in order to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL techniques with the parameter distributions incorporated into the solution via domain randomization, in order to develop solutions that are robust to model uncertainty. As a further contribution, we compare the use of transformers and long short-term memory networks, which constitute model-free RL solutions, with a model-based/model-free hybrid approach. We apply these methods to the real-world problem of optimal maintenance planning for railway assets.

Bayesian inference for data-efficient, explainable, and safe robotic motion planning: A review

Authors:Chengmin Zhou, Chao Wang, Haseeb Hassan, Himat Shah, Bingding Huang, Pasi Fränti
Date:2023-07-16 12:29:27

Bayesian inference has many advantages in robotic motion planning from four perspectives: uncertainty quantification of the policy, safety (risk-aware) and optimality guarantees of robot motions, data efficiency in reinforcement learning training, and reduction of the sim2real gap when the robot is applied to real-world tasks. However, the application of Bayesian inference in robotic motion planning lags behind the comprehensive theory of Bayesian inference. Further, there are no comprehensive reviews that summarize the progress of Bayesian inference to give researchers a systematic understanding of its use in robotic motion planning. This paper first provides the probabilistic theories of Bayesian inference, which are the preliminaries of Bayesian inference for complex cases. Second, Bayesian estimation is presented to estimate the posterior of policies or of unknown functions that are used to compute the policy. Third, the classical model-based Bayesian RL and model-free Bayesian RL algorithms for robotic motion planning are summarized, and these algorithms in complex cases are also analyzed. Fourth, Bayesian inference in inverse RL is analyzed as a way to infer reward functions in a data-efficient manner. Fifth, we systematically present the hybridization of Bayesian inference and RL, which is a promising direction to improve the convergence of RL for better motion planning. Sixth, given Bayesian inference, we present interpretable and safe robotic motion planning, which has recently become a hot research topic. Finally, all algorithms reviewed in this paper are summarized analytically as knowledge graphs, and the future of Bayesian inference for robotic motion planning is also discussed, to pave the way for data-efficient, explainable, and safe robotic motion planning strategies for practical applications.

SafeDreamer: Safe Reinforcement Learning with World Models

Authors:Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, Yaodong Yang
Date:2023-07-14 06:00:08

The deployment of Reinforcement Learning (RL) in real-world applications is constrained by its failure to satisfy safety criteria. Existing Safe Reinforcement Learning (SafeRL) methods, which rely on cost functions to enforce safety, often fail to achieve zero-cost performance in complex scenarios, especially vision-only tasks. These limitations are primarily due to model inaccuracies and inadequate sample efficiency. The integration of the world model has proven effective in mitigating these shortcomings. In this work, we introduce SafeDreamer, a novel algorithm incorporating Lagrangian-based methods into world model planning processes within the superior Dreamer framework. Our method achieves nearly zero-cost performance on various tasks, spanning low-dimensional and vision-only input, within the Safety-Gymnasium benchmark, showcasing its efficacy in balancing performance and safety in RL tasks. Further details can be found in the code repository: https://github.com/PKU-Alignment/SafeDreamer.
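
The general idea behind Lagrangian-based safe RL can be sketched as a dual-ascent update on a cost multiplier. This is a generic illustration, not SafeDreamer's implementation; the function names, learning rate, and cost limit are hypothetical.

```python
def update_multiplier(lmbda, episode_cost, cost_limit, lr=0.05):
    """Dual ascent: raise lambda when cost exceeds the limit, lower it otherwise.

    The multiplier is clipped at zero so the penalty never rewards incurring cost.
    """
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))


def penalized_return(episode_reward, episode_cost, lmbda):
    """Objective a policy would maximize for a fixed multiplier."""
    return episode_reward - lmbda * episode_cost
```

When the measured cost stays above the limit, the multiplier keeps growing and the penalized objective increasingly favors low-cost behavior.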

Layered controller synthesis for dynamic multi-agent systems

Authors:Emily Clement, Nicolas Perrin-Gilbert, Philipp Schlehuber-Caissier
Date:2023-07-13 13:56:27

In this paper we present a layered approach for the multi-agent control problem, decomposed into three stages, each building upon the results of the previous one. First, a high-level plan for a coarse abstraction of the system is computed, relying on parametric timed automata augmented with stopwatches, as they allow efficient modeling of the simplified dynamics of such systems. In the second stage, the high-level plan, which mainly handles the combinatorial aspects of the problem, is refined through an SMT formulation into a more dynamically accurate solution. These stages are collectively referred to as the SWA-SMT solver. They are correct by construction but lack a crucial feature: they cannot be executed in real time. To overcome this, we use SWA-SMT solutions as the initial training dataset for our last stage, which aims at obtaining a neural network control policy. We use reinforcement learning to train the policy, and show that the initial dataset is crucial for the overall success of the method.

On the Effective Horizon of Inverse Reinforcement Learning

Authors:Yiqing Xu, Finale Doshi-Velez, David Hsu
Date:2023-07-13 03:06:36

Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning, over a given time horizon, to compute an approximately optimal policy for a hypothesized reward function; they then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimates and the computational efficiency of IRL algorithms. Interestingly, an *effective time horizon* shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class and mitigates overfitting with limited data. This analysis provides a guide for the principled choice of the effective horizon for IRL. It also prompts us to re-examine the classic IRL formulation: it is more natural to learn jointly the reward and the effective horizon rather than the reward alone with a given horizon. To validate our findings, we implement a cross-validation extension and the experimental results support the theoretical analysis. The project page and code are publicly available.

Bag of Views: An Appearance-based Approach to Next-Best-View Planning for 3D Reconstruction

Authors:Sara Hatami Gazani, Matthew Tucsok, Iraj Mantegh, Homayoun Najjaran
Date:2023-07-11 22:56:55

UAV-based intelligent data acquisition for 3D reconstruction and monitoring of infrastructure has experienced an increasing surge of interest due to recent advancements in image processing and deep learning-based techniques. View planning is an essential part of this task that dictates the information capture strategy and heavily impacts the quality of the 3D model generated from the captured data. Recent methods have used prior knowledge or partial reconstruction of the target to accomplish view planning for active reconstruction; the former approach poses a challenge for complex or newly identified targets while the latter is computationally expensive. In this work, we present Bag-of-Views (BoV), a fully appearance-based model used to assign utility to the captured views for both offline dataset refinement and online next-best-view (NBV) planning applications targeting the task of 3D reconstruction. With this contribution, we also developed the View Planning Toolbox (VPT), a lightweight package for training and testing machine learning-based view planning frameworks, custom view dataset generation of arbitrary 3D scenes, and 3D reconstruction. Through experiments which pair a BoV-based reinforcement learning model with VPT, we demonstrate the efficacy of our model in reducing the number of required views for high-quality reconstructions in dataset refinement and NBV planning.

Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning

Authors:Guy Azran, Mohamad H. Danesh, Stefano V. Albrecht, Sarah Keren
Date:2023-07-11 12:28:05

Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RMs), state machine abstractions that induce subtasks based on the current task's rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Empirical results show that our representations improve sample efficiency and few-shot transfer in a variety of domains.

Control as Probabilistic Inference as an Emergent Communication Mechanism in Multi-Agent Reinforcement Learning

Authors:Tomoaki Nakamura, Akira Taniguchi, Tadahiro Taniguchi
Date:2023-07-11 03:53:46

This paper proposes a generative probabilistic model integrating emergent communication and multi-agent reinforcement learning. The agents plan their actions by probabilistic inference, called control as inference, and communicate using messages that are latent variables estimated from the planned actions. Through these messages, each agent can send information about its actions and learn about the actions of another agent. The agents therefore change their actions according to the estimated messages to achieve cooperative tasks. This inference of messages can be considered as communication, and the procedure can be formulated as the Metropolis-Hastings naming game. Through experiments in the grid world environment, we show that the proposed probabilistic generative model (PGM) can infer meaningful messages to achieve the cooperative task.

Meta-Policy Learning over Plan Ensembles for Robust Articulated Object Manipulation

Authors:Constantinos Chamzas, Caelan Garrett, Balakumar Sundaralingam, Lydia E. Kavraki, Dieter Fox
Date:2023-07-08 20:06:18

Recent work has shown that complex manipulation skills, such as pushing or pouring, can be learned through state-of-the-art learning-based techniques, such as Reinforcement Learning (RL). However, these methods often have high sample complexity, are susceptible to domain changes, and produce unsafe motions that a robot should not perform. On the other hand, purely geometric model-based planning can produce complex behaviors that satisfy all the geometric constraints of the robot but might not be dynamically feasible for a given environment. In this work, we leverage a geometric model-based planner to build a mixture of path-policies on which a task-specific meta-policy can be learned to complete the task. In our results, we demonstrate that a successful meta-policy can be learned to push a door, while requiring little data and being robust to model uncertainty of the environment. We tested our method on a 7-DOF Franka-Emika Robot pushing a cabinet door in simulation.

Reinforcement and Deep Reinforcement Learning-based Solutions for Machine Maintenance Planning, Scheduling Policies, and Optimization

Authors:Oluwaseyi Ogunfowora, Homayoun Najjaran
Date:2023-07-07 22:47:29

Systems and machines undergo various failure modes that result in machine health degradation, so maintenance actions are required to restore them to a state where they can perform their expected functions. Since maintenance tasks are inevitable, maintenance planning is essential to ensure the smooth operations of the production system and other industries at large. Maintenance planning is a decision-making problem that aims at developing optimum maintenance policies and plans that help reduce maintenance costs, extend asset life, maximize asset availability, and ultimately ensure workplace safety. Reinforcement learning is a data-driven decision-making approach that has been increasingly applied to develop dynamic maintenance plans while leveraging the continuous information from condition monitoring of the system and machine states. By leveraging the condition monitoring data of systems and machines with reinforcement learning, smart maintenance planners can be developed, which is a precursor to achieving a smart factory. This paper presents a literature review on the applications of reinforcement and deep reinforcement learning for maintenance planning and optimization problems. To capture the common ideas without losing touch with the uniqueness of each publication, taxonomies used to categorize the systems were developed, and reviewed publications were highlighted, classified, and summarized based on these taxonomies. Adopted methodologies, findings, and well-defined interpretations of the reviewed studies were summarized in graphical and tabular representations to maximize the utility of the work for both researchers and practitioners. This work also highlights the research gaps, key insights from the literature, and areas for future work.

Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning

Authors:Seungyong Moon, Junyoung Yeom, Bumsoo Park, Hyun Oh Song
Date:2023-07-07 09:47:15

Discovering achievements with a hierarchical structure in procedurally generated environments presents a significant challenge. This requires an agent to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods have been built upon model-based or hierarchical approaches, with the belief that an explicit module for long-term planning would be advantageous for learning hierarchical dependencies. However, these methods demand an excessive number of environment interactions or large model sizes, limiting their practicality. In this work, we demonstrate that proximal policy optimization (PPO), a simple yet versatile model-free algorithm, outperforms previous methods when optimized with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, albeit with limited confidence. Based on this observation, we introduce a novel contrastive learning method, called achievement distillation, which strengthens the agent's ability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment in a sample-efficient manner while utilizing fewer model parameters.

Learning Constrained Corner Node Trajectories of a Tether Net System for Space Debris Capture

Authors:Feng Liu, Achira Boonrath, Prajit KrisshnaKumar, Elenora M. Botta, Souma Chowdhury
Date:2023-07-06 15:28:34

Earth's orbit is becoming increasingly crowded with debris that poses significant safety risks to the operation of existing and new spacecraft and satellites. The active tether-net system, which consists of a flexible net with maneuverable corner nodes launched from a small autonomous spacecraft, is a promising solution for capturing and disposing of such space debris. The requirement of autonomous operation and the need to generalize over scenarios with debris at different rotational rates make the capture process significantly challenging. The space debris could rotate about multiple axes, which, along with sensing/estimation and actuation uncertainties, calls for a robust, generalizable approach to guiding the net launch and flight - one that can guarantee robust capture. This paper proposes a decentralized actuation system combined with reinforcement learning for planning and controlling this tether-net system. In this new system, four microsatellites with cold gas type thrusters act as the corner nodes of the net and can thus help control or correct the flight of the net after launch. The microsatellites pull the net to complete the task of approaching and capturing the space debris. The proposed method uses an RL framework that integrates proximal policy optimization to find the optimal solution based on the dynamics simulation of the net and the microsatellites performed in Vortex Studio. The RL framework finds the optimal trajectory that is both fuel-efficient and ensures a desired level of capture quality.

Sequential Neural Barriers for Scalable Dynamic Obstacle Avoidance

Authors:Hongzhan Yu, Chiaki Hirayama, Chenning Yu, Sylvia Herbert, Sicun Gao
Date:2023-07-06 14:24:17

There are two major challenges for scaling up robot navigation around dynamic obstacles: the complex interaction dynamics of the obstacles can be hard to model analytically, and the complexity of planning and control grows exponentially in the number of obstacles. Data-driven and learning-based methods are thus particularly valuable in this context. However, data-driven methods are sensitive to distribution drift, making it hard to train and generalize learned models across different obstacle densities. We propose a novel method for compositional learning of Sequential Neural Control Barrier models (SNCBFs) to achieve scalability. Our approach exploits an important observation: the spatial interaction patterns of multiple dynamic obstacles can be decomposed and predicted through temporal sequences of states for each obstacle. Through decomposition, we can generalize control policies trained only with a small number of obstacles, to environments where the obstacle density can be 100x higher. We demonstrate the benefits of the proposed methods in improving dynamic collision avoidance in comparison with existing methods including potential fields, end-to-end reinforcement learning, and model-predictive control. We also perform hardware experiments and show the practical effectiveness of the approach in the supplementary video.

SACHA: Soft Actor-Critic with Heuristic-Based Attention for Partially Observable Multi-Agent Path Finding

Authors:Qiushi Lin, Hang Ma
Date:2023-07-05 23:36:33

Multi-Agent Path Finding (MAPF) is a crucial component for many large-scale robotic systems, where agents must plan their collision-free paths to their given goal positions. Recently, multi-agent reinforcement learning has been introduced to solve the partially observable variant of MAPF by learning a decentralized single-agent policy in a centralized fashion based on each agent's partial observation. However, existing learning-based methods are ineffective in achieving complex multi-agent cooperation, especially in congested environments, due to the non-stationarity of this setting. To tackle this challenge, we propose a multi-agent actor-critic method called Soft Actor-Critic with Heuristic-Based Attention (SACHA), which employs novel heuristic-based attention mechanisms for both the actors and critics to encourage cooperation among agents. SACHA learns a neural network for each agent to selectively pay attention to the shortest path heuristic guidance from multiple agents within its field of view, thereby allowing for more scalable learning of cooperation. SACHA also extends the existing multi-agent actor-critic framework by introducing a novel critic centered on each agent to approximate $Q$-values. Compared to existing methods that use a fully observable critic, our agent-centered multi-agent actor-critic method results in more impartial credit assignment and better generalizability of the learned policy to MAPF instances with varying numbers of agents and types of environments. We also implement SACHA(C), which embeds a communication module in the agent's policy network to enable information exchange among agents. We evaluate both SACHA and SACHA(C) on a variety of MAPF instances and demonstrate decent improvements over several state-of-the-art learning-based MAPF methods with respect to success rate and solution quality.

Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning

Authors:Tyler Kastner, Murat A. Erdogdu, Amir-massoud Farahmand
Date:2023-07-04 13:23:21

We consider the problem of learning models for risk-sensitive reinforcement learning. We theoretically demonstrate that proper value equivalence, a method of learning models which can be used to plan optimally in the risk-neutral setting, is not sufficient to plan optimally in the risk-sensitive setting. We leverage distributional reinforcement learning to introduce two new notions of model equivalence, one which is general and can be used to plan for any risk measure, but is intractable; and a practical variation which allows one to choose which risk measures they may plan optimally for. We demonstrate how our framework can be used to augment any model-free risk-sensitive algorithm, and provide both tabular and large-scale experiments to demonstrate its ability.
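
To see why matching expected values alone may not suffice for risk-sensitive planning, consider a small sketch using conditional value at risk (CVaR) as the risk measure. This is an illustration only, not the paper's construction; the empirical estimator and the sample returns are assumptions chosen for the example.

```python
import math


def mean(returns):
    """Risk-neutral value: the average sampled return."""
    return sum(returns) / len(returns)


def cvar(returns, alpha):
    """Empirical lower-tail CVaR: mean of the worst ceil(alpha * n) sampled returns."""
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Two hypothetical return distributions with the same mean but different tails.
safe = [1.0, 1.0, 1.0, 1.0]
risky = [-2.0, 0.0, 2.0, 4.0]
```

Here `mean(safe)` and `mean(risky)` agree, so a model that only preserves expected values cannot tell the two apart, while `cvar(safe, 0.25)` and `cvar(risky, 0.25)` differ.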

Landmark Guided Active Exploration with State-specific Balance Coefficient

Authors:Fei Cui, Jiaojiao Fang, Mengke Yang, Guizhong Liu
Date:2023-06-30 08:54:47

Goal-conditioned hierarchical reinforcement learning (GCHRL) decomposes long-horizon tasks into sub-tasks through a hierarchical framework and it has demonstrated promising results across a variety of domains. However, the high-level policy's action space is often excessively large, presenting a significant challenge to effective exploration and resulting in potentially inefficient training. In this paper, we design a measure of prospect for sub-goals by planning in the goal space based on the goal-conditioned value function. Building upon the measure of prospect, we propose a landmark-guided exploration strategy by integrating the measures of prospect and novelty which aims to guide the agent to explore efficiently and improve sample efficiency. In order to dynamically consider the impact of prospect and novelty on exploration, we introduce a state-specific balance coefficient to balance the significance of prospect and novelty. The experimental results demonstrate that our proposed exploration strategy significantly outperforms the baseline methods across multiple tasks.

SkiROS2: A skill-based Robot Control Platform for ROS

Authors:Matthias Mayr, Francesco Rovida, Volker Krueger
Date:2023-06-29 15:25:51

The need for autonomous robot systems in both the service and the industrial domain is larger than ever. In the latter, the transition to small batches or even "batch size 1" in production created a need for robot control system architectures that can provide the required flexibility. Such architectures must not only have a sufficient knowledge integration framework; they must also support autonomous mission execution and allow for interchangeability and interoperability between different tasks and robot systems. We introduce SkiROS2, a skill-based robot control platform on top of ROS. SkiROS2 proposes a layered, hybrid control structure for automated task planning and reactive execution, supported by a knowledge base for reasoning about the world state and entities. The scheduling formulation builds on the extended behavior tree model that merges task-level planning and execution. This allows for a high degree of modularity and a fast reaction to changes in the environment. The skill formulation based on pre-, hold- and post-conditions makes it possible to organize robot programs and to compose diverse skills ranging from perception to low-level control and the incorporation of external tools. We relate SkiROS2 to the field and outline three example use cases that cover task planning, reasoning, multisensory input, integration in a manufacturing execution system and reinforcement learning.

Learning Coverage Paths in Unknown Environments with Deep Reinforcement Learning

Authors:Arvi Jonnarth, Jie Zhao, Michael Felsberg
Date:2023-06-29 14:32:06

Coverage path planning (CPP) is the problem of finding a path that covers the entire free space of a confined area, with applications ranging from robotic lawn mowing to search-and-rescue. When the environment is unknown, the path needs to be planned online while mapping the environment, which cannot be addressed by offline planning methods that do not allow for a flexible path space. We investigate how suitable reinforcement learning is for this challenging problem, and analyze the involved components required to efficiently learn coverage paths, such as action space, input feature representation, neural network architecture, and reward function. We propose a computationally feasible egocentric map representation based on frontiers, and a novel reward term based on total variation to promote complete coverage. Through extensive experiments, we show that our approach surpasses the performance of both previous RL-based approaches and highly specialized methods across multiple CPP variations.

Action and Trajectory Planning for Urban Autonomous Driving with Hierarchical Reinforcement Learning

Authors:Xinyang Lu, Flint Xiaofeng Fan, Tianying Wang
Date:2023-06-28 07:11:02

Reinforcement Learning (RL) has made promising progress in planning and decision-making for Autonomous Vehicles (AVs) in simple driving scenarios. However, existing RL algorithms for AVs fail to learn critical driving skills in complex urban scenarios. First, urban driving scenarios require AVs to handle multiple driving tasks of which conventional RL algorithms are incapable. Second, the presence of other vehicles in urban scenarios results in a dynamically changing environment, which challenges RL algorithms to plan the action and trajectory of the AV. In this work, we propose an action and trajectory planner using a Hierarchical Reinforcement Learning method (atHRL), which models the agent behavior in a hierarchical model by using lidar and bird's-eye view perception. The proposed atHRL method learns to make decisions about the agent's future trajectory and computes target waypoints under continuous settings based on a hierarchical DDPG algorithm. The waypoints planned by the atHRL model are then sent to a low-level controller to generate the steering and throttle commands required for the vehicle maneuver. We empirically verify the efficacy of atHRL through extensive experiments in the CARLA simulator, in complex urban driving scenarios that combine multiple tasks in the presence of other vehicles. The experimental results suggest a significant performance improvement compared to the state-of-the-art RL methods.

Trajectory Generation, Control, and Safety with Denoising Diffusion Probabilistic Models

Authors:Nicolò Botteghi, Federico Califano, Mannes Poel, Christoph Brune
Date:2023-06-27 14:36:40

We present a framework for safety-critical optimal control of physical systems based on denoising diffusion probabilistic models (DDPMs). The technology of control barrier functions (CBFs), encoding desired safety constraints, is used in combination with DDPMs to plan actions by iteratively denoising trajectories through a CBF-based guided sampling procedure. At the same time, the generated trajectories are also guided to maximize a future cumulative reward representing a specific task to be optimally executed. The proposed scheme can be seen as an offline and model-based reinforcement learning algorithm resembling in its functionalities a model-predictive control optimization scheme with receding horizon in which the selected actions lead to optimal and safe trajectories.

Improvise, Adapt, Overcome: Dynamic Resiliency Against Unknown Attack Vectors in Microgrid Cybersecurity Games

Authors:Suman Rath, Tapadhir Das, Shamik Sengupta
Date:2023-06-26 22:57:27

Cyber-physical microgrids are vulnerable to rootkit attacks that manipulate system dynamics to create instabilities in the network. Rootkits tend to hide their access level within microgrid system components to launch sudden attacks that prey on the slow response time of defenders to manipulate system trajectory. This problem can be formulated as a multi-stage, non-cooperative, zero-sum game with the attacker and the defender modeled as opposing players. To solve the game, this paper proposes a deep reinforcement learning-based strategy that dynamically identifies rootkit access levels and isolates incoming manipulations by incorporating changes in the defense plan. A major advantage of the proposed strategy is its ability to establish resiliency without altering the physical transmission/distribution network topology, thereby diminishing potential instability issues. The paper also presents several simulation results and case studies to demonstrate the operating mechanism and robustness of the proposed strategy.

InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback

Authors:John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao
Date:2023-06-26 17:59:50

Humans write code in a fundamentally interactive manner and rely on constant execution feedback to correct errors, resolve ambiguities, and decompose tasks. While LLMs have recently exhibited promising coding capabilities, current coding benchmarks mostly consider a static instruction-to-code sequence transduction process, which has the potential for error propagation and a disconnect between the generated code and its final execution environment. To address this gap, we introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning (RL) environment, with code as actions and execution feedback as observations. Our framework is language and platform agnostic, uses self-contained Docker environments to provide safe and reproducible execution, and is compatible out-of-the-box with traditional seq2seq coding methods, while enabling the development of new methods for interactive code generation. We use InterCode to create three interactive code environments with Bash, SQL, and Python as action spaces, leveraging data from the static NL2Bash, Spider, and MBPP datasets. We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies such as ReAct and Plan & Solve. Our results showcase the benefits of interactive code generation and demonstrate that InterCode can serve as a challenging benchmark for advancing code understanding and generation capabilities. InterCode is designed to be easily extensible and can even be used to create new tasks such as Capture the Flag, a popular coding puzzle that is inherently multi-step and involves multiple programming languages. Project site with code and data: https://intercode-benchmark.github.io

Sim-to-Real Surgical Robot Learning and Autonomous Planning for Internal Tissue Points Manipulation using Reinforcement Learning

Authors:Yafei Ou, Mahdi Tavakoli
Date:2023-06-25 01:06:57

Indirect simultaneous positioning (ISP), where internal tissue points are placed at desired locations indirectly through the manipulation of boundary points, is a type of subtask frequently performed in robotic surgeries. Although challenging due to complex tissue dynamics, automating the task can potentially reduce the workload of surgeons. This paper presents a sim-to-real framework for learning to automate the task without interacting with a real environment, and for planning preoperatively to find the grasping points that minimize local tissue deformation. A control policy is learned using deep reinforcement learning (DRL) in an FEM-based simulation environment and transferred to real-world situations. Grasping points are planned in the simulator by applying Bayesian optimization (BO) to the trained policy. Inconsistent simulation performance is overcome by formulating the problem as a state-augmented Markov decision process (MDP). Experimental results show that the learned policy places the internal tissue points accurately, and that the planned grasping points yield small tissue deformation among the trials. The proposed learning and planning scheme is able to automate internal tissue point manipulation in surgeries and has the potential to be generalized to complex surgical scenarios.

Fighting Uncertainty with Gradients: Offline Reinforcement Learning via Diffusion Score Matching

Authors:H. J. Terry Suh, Glen Chou, Hongkai Dai, Lujie Yang, Abhishek Gupta, Russ Tedrake
Date:2023-06-24 23:40:58

Gradient-based methods enable efficient search capabilities in high dimensions. However, in order to apply them effectively in offline optimization paradigms such as offline Reinforcement Learning (RL) or Imitation Learning (IL), we require a more careful consideration of how uncertainty estimation interplays with first-order methods that attempt to minimize them. We study smoothed distance to data as an uncertainty metric, and claim that it has two beneficial properties: (i) it allows gradient-based methods that attempt to minimize uncertainty to drive iterates to data as smoothing is annealed, and (ii) it facilitates analysis of model bias with Lipschitz constants. As distance to data can be expensive to compute online, we consider settings where we need to amortize this computation. Instead of learning the distance, however, we propose to learn its gradients directly as an oracle for first-order optimizers. We show these gradients can be efficiently learned with score-matching techniques by leveraging the equivalence between distance to data and data likelihood. Using this insight, we propose Score-Guided Planning (SGP), a planning algorithm for offline RL that utilizes score-matching to enable first-order planning in high-dimensional problems, where zeroth-order methods were unable to scale, and ensembles were unable to overcome local minima. Website: https://sites.google.com/view/score-guided-planning/home
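
The link between smoothed distance to data and data likelihood can be illustrated with a small sketch. It assumes a Gaussian-smoothed empirical distribution whose score (gradient of the log density) is computed analytically, whereas the paper learns the score with score matching. The names `smoothed_score` and `ascend`, the fixed `sigma`, and the toy data are illustrative assumptions, and no annealing is performed.

```python
import math

def smoothed_score(x, data, sigma):
    """Score (gradient of log density) of a Gaussian-smoothed empirical distribution at x."""
    # Squared distance from x to every data point.
    sq = [sum((xi - di) ** 2 for xi, di in zip(x, d)) for d in data]
    # Softmax-style weights, shifted by the minimum for numerical stability.
    m = min(sq)
    w = [math.exp(-(s - m) / (2 * sigma ** 2)) for s in sq]
    total = sum(w)
    # The score is the weighted mean direction towards the data, scaled by 1 / sigma^2.
    return [
        sum(wi * (d[k] - x[k]) for wi, d in zip(w, data)) / (total * sigma ** 2)
        for k in range(len(x))
    ]

def ascend(x, data, sigma, step, iters):
    """Plain gradient ascent on the smoothed log density (hypothetical helper)."""
    for _ in range(iters):
        g = smoothed_score(x, data, sigma)
        x = [xi + step * gi for xi, gi in zip(x, g)]
    return x

# Toy dataset (assumed): two well-separated points.
data = [[0.0, 0.0], [10.0, 10.0]]
x_final = ascend([1.0, 1.0], data, sigma=1.0, step=0.5, iters=50)
```

Starting near the first data point, the iterate is pulled onto it, which is the "drive iterates to data" behavior described above, here shown at a single fixed smoothing level.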

Offline Skill Graph (OSG): A Framework for Learning and Planning using Offline Reinforcement Learning Skills

Authors:Ben-ya Halevy, Yehudit Aperstein, Dotan Di Castro
Date:2023-06-23 17:35:02

Reinforcement Learning has received wide interest due to its success in competitive games. Yet its adoption in everyday applications (e.g., industrial, home, and healthcare settings) is limited. In this paper, we address this limitation by presenting a framework for planning over offline skills and solving complex tasks in real-world environments. Our framework comprises three modules that together enable the agent to learn from previously collected data and generalize over it to solve long-horizon tasks. We demonstrate our approach by testing it on a robotic arm that is required to solve complex tasks.

Energy Optimization for HVAC Systems in Multi-VAV Open Offices: A Deep Reinforcement Learning Approach

Authors:Hao Wang, Xiwen Chen, Natan Vital, Edward Duffy, Abolfazl Razi
Date:2023-06-23 07:27:31

With more than 32% of the global energy used by commercial and residential buildings, there is an urgent need to revisit traditional approaches to Building Energy Management (BEM). With HVAC systems accounting for about 40% of the total energy cost in the commercial sector, we propose a low-complexity DRL-based model with multi-input multi-output architecture for the HVAC energy optimization of open-plan offices, which uses only a handful of controllable and accessible factors. The efficacy of our solution is evaluated through extensive analysis of the overall energy consumption and thermal comfort levels compared to a baseline system based on the existing HVAC schedule in a real building. This comparison shows that our method achieves 37% savings in energy consumption with minimum violation (<1%) of the desired temperature range during work hours. It takes only a total of 40 minutes for 5 epochs (about 7.75 minutes per epoch) to train a network with superior performance and covering diverse conditions for its low-complexity architecture; therefore, it easily adapts to changes in the building setups, weather conditions, occupancy rate, etc. Moreover, by enforcing smoothness on the control strategy, we suppress the frequent and unpleasant on/off transitions on HVAC units to avoid occupant discomfort and potential damage to the system. The generalizability of our model is verified by applying it to different building models and under various weather conditions.

MP3: Movement Primitive-Based (Re-)Planning Policy

Authors:Fabian Otto, Hongyi Zhou, Onur Celik, Ge Li, Rudolf Lioutikov, Gerhard Neumann
Date:2023-06-22 08:11:32

We introduce a novel deep reinforcement learning (RL) approach called Movement Primitive-based Planning Policy (MP3). By integrating movement primitives (MPs) into the deep RL framework, MP3 enables the generation of smooth trajectories throughout the whole learning process while effectively learning from sparse and non-Markovian rewards. Additionally, MP3 maintains the capability to adapt to changes in the environment during execution. Although many early successes in robot RL have been achieved by combining RL with MPs, these approaches are often limited to learning single stroke-based motions, lacking the ability to adapt to task variations or adjust motions during execution. Building upon our previous work, which introduced an episode-based RL method for the non-linear adaptation of MP parameters to different task variations, this paper extends the approach to incorporating replanning strategies. This allows adaptation of the MP parameters throughout motion execution, addressing the lack of online motion adaptation in stochastic domains requiring feedback. We compared our approach against state-of-the-art deep RL and RL with MPs methods. The results demonstrated improved performance in sophisticated, sparse reward settings and in domains requiring replanning.

Optimistic Active Exploration of Dynamical Systems

Authors:Bhavya Sukhija, Lenart Treven, Cansu Sancaktar, Sebastian Blaes, Stelian Coros, Andreas Krause
Date:2023-06-21 16:26:59

Reinforcement learning algorithms commonly seek to optimize policies for solving one particular task. How should we explore an unknown dynamical system such that the estimated model globally approximates the dynamics and allows us to solve multiple downstream tasks in a zero-shot manner? In this paper, we address this challenge by developing an algorithm -- OPAX -- for active exploration. OPAX uses well-calibrated probabilistic models to quantify the epistemic uncertainty about the unknown dynamics. It optimistically -- w.r.t. plausible dynamics -- maximizes the information gain between the unknown dynamics and state observations. We show how the resulting optimization problem can be reduced to an optimal control problem that can be solved at each episode using standard approaches. We analyze our algorithm for general models, and, in the case of Gaussian process dynamics, we give a first-of-its-kind sample complexity bound and show that the epistemic uncertainty converges to zero. In our experiments, we compare OPAX with other heuristic active exploration approaches on several environments. Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.

Efficient Dynamics Modeling in Interactive Environments with Koopman Theory

Authors:Arnab Kumar Mondal, Siba Smarak Panigrahi, Sai Rajeswar, Kaleem Siddiqi, Siamak Ravanbakhsh
Date:2023-06-20 23:38:24

The accurate modeling of dynamics in interactive environments is critical for successful long-range prediction. Such a capability could advance Reinforcement Learning (RL) and Planning algorithms, but achieving it is challenging. Inaccuracies in model estimates can compound, resulting in increased errors over long horizons. We approach this problem from the lens of Koopman theory, where the nonlinear dynamics of the environment can be linearized in a high-dimensional latent space. This allows us to efficiently parallelize the sequential problem of long-range prediction using convolution while accounting for the agent's action at every time step. Our approach also enables stability analysis and better control over gradients through time. Taken together, these advantages result in significant improvement over the existing approaches, both in the efficiency and the accuracy of modeling dynamics over extended horizons. We also show that this model can be easily incorporated into dynamics modeling for model-based planning and model-free RL and report promising experimental results.
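
To make the convolution remark concrete, the sketch below assumes latent dynamics of the form z_{t+1} = lam * z_t + B a_t with a diagonal (elementwise) matrix `lam`; the diagonal form, the function names, and the toy numbers are assumptions for illustration, not details taken from the paper. Unrolling the recursion gives z_t = lam^t z_0 + sum_k lam^(t-1-k) B a_k, so each horizon step can be computed independently of the others.

```python
def rollout_sequential(lam, B, z0, actions):
    """Step z_{t+1} = lam * z_t + B a_t one step at a time (lam is a diagonal)."""
    z, out = list(z0), []
    for a in actions:
        u = [sum(b * aj for b, aj in zip(row, a)) for row in B]  # B @ a
        z = [l * zi + ui for l, zi, ui in zip(lam, z, u)]
        out.append(z)
    return out

def rollout_convolution(lam, B, z0, actions):
    """Closed form z_t = lam^t z0 + sum_k lam^(t-1-k) (B a_k); no dependence on z_{t-1}."""
    inputs = [[sum(b * aj for b, aj in zip(row, a)) for row in B] for a in actions]
    out = []
    for t in range(1, len(actions) + 1):
        z_t = [
            (l ** t) * z0_i + sum((l ** (t - 1 - k)) * inputs[k][i] for k in range(t))
            for i, (l, z0_i) in enumerate(zip(lam, z0))
        ]
        out.append(z_t)
    return out

# Toy latent system (assumed values).
lam = [0.9, -0.5]
B = [[1.0, 0.0], [0.5, 2.0]]
z0 = [1.0, -1.0]
actions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
seq = rollout_sequential(lam, B, z0, actions)
conv = rollout_convolution(lam, B, z0, actions)
```

Both rollouts produce the same latent trajectory; only the second form exposes the per-step independence that a convolution (or any parallel reduction) can exploit.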

IMP-MARL: a Suite of Environments for Large-scale Infrastructure Management Planning via MARL

Authors:Pascal Leroy, Pablo G. Morato, Jonathan Pisane, Athanasios Kolios, Damien Ernst
Date:2023-06-20 14:12:29

We introduce IMP-MARL, an open-source suite of multi-agent reinforcement learning (MARL) environments for large-scale Infrastructure Management Planning (IMP), offering a platform for benchmarking the scalability of cooperative MARL methods in real-world engineering applications. In IMP, a multi-component engineering system is subject to a risk of failure due to its components' damage condition. Specifically, each agent plans inspections and repairs for a specific system component, aiming to minimise maintenance costs while cooperating to minimise system failure risk. With IMP-MARL, we release several environments including one related to offshore wind structural systems, in an effort to meet today's needs to improve management strategies to support sustainable and reliable energy systems. Supported by IMP practical engineering environments featuring up to 100 agents, we conduct a benchmark campaign, where the scalability and performance of state-of-the-art cooperative MARL methods are compared against expert-based heuristic policies. The results reveal that centralised training with decentralised execution methods scale better with the number of agents than fully centralised or decentralised RL approaches, while also outperforming expert-based heuristic policies in most IMP environments. Based on our findings, we additionally outline remaining cooperation and scalability challenges that future MARL methods should still address. Through IMP-MARL, we encourage the implementation of new environments and the further development of MARL methods.

The Unintended Consequences of Discount Regularization: Improving Regularization in Certainty Equivalence Reinforcement Learning

Authors:Sarah Rathnam, Sonali Parbhoo, Weiwei Pan, Susan A. Murphy, Finale Doshi-Velez
Date:2023-06-20 00:16:22

Discount regularization, using a shorter planning horizon when calculating the optimal policy, is a popular choice to restrict planning to a less complex set of policies when estimating an MDP from sparse or noisy data (Jiang et al., 2015). It is commonly understood that discount regularization functions by de-emphasizing or ignoring delayed effects. In this paper, we reveal an alternate view of discount regularization that exposes unintended consequences. We demonstrate that planning under a lower discount factor produces an identical optimal policy to planning using any prior on the transition matrix that has the same distribution for all states and actions. In fact, it functions like a prior with stronger regularization on state-action pairs with more transition data. This leads to poor performance when the transition matrix is estimated from data sets with uneven amounts of data across state-action pairs. Our equivalence theorem leads to an explicit formula to set regularization parameters locally for individual state-action pairs rather than globally. We demonstrate the failures of discount regularization and how we remedy them using our state-action-specific method across simple empirical examples as well as a medical cancer simulator.
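
The stated equivalence can be checked numerically on a small random MDP. The sketch assumes one particular parameterization: the prior enters as a mixture T_mixed = (1 - eps) * T + eps * prior with the same `prior` distribution for every state-action pair, and the lower discount factor is gamma * (1 - eps). These choices, the helper names, and the random MDP are illustrative assumptions; the paper's exact formulation may differ.

```python
import random

def q_value(T, R, V, gamma, s, a):
    """One-step lookahead value of action a in state s."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(T[s][a], V))

def plan(T, R, gamma, iters=1000):
    """Value iteration; returns the value function and the greedy policy."""
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s
    for _ in range(iters):
        V = [max(q_value(T, R, V, gamma, s, a) for a in range(n_a)) for s in range(n_s)]
    policy = [max(range(n_a), key=lambda a: q_value(T, R, V, gamma, s, a)) for s in range(n_s)]
    return V, policy

rng = random.Random(0)
n_s, n_a = 4, 3

def random_dist(n):
    w = [rng.random() + 0.1 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

T = [[random_dist(n_s) for _ in range(n_a)] for _ in range(n_s)]
R = [[rng.random() for _ in range(n_a)] for _ in range(n_s)]
prior = random_dist(n_s)  # same next-state distribution for every (s, a)
gamma, eps = 0.95, 0.3

# Mix every transition row with the shared prior.
T_mixed = [[[(1 - eps) * p + eps * q for p, q in zip(T[s][a], prior)]
            for a in range(n_a)] for s in range(n_s)]

V_low, pi_low = plan(T, R, gamma * (1 - eps))  # lower discount, original model
V_mix, pi_mix = plan(T_mixed, R, gamma)        # original discount, mixed model
shifts = [vm - vl for vm, vl in zip(V_mix, V_low)]
```

Under this parameterization the prior contributes the same constant to every action value, so the two value functions differ only by a state-independent shift and the greedy policies coincide.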

Sim-to-real transfer of active suspension control using deep reinforcement learning

Authors:Viktor Wiberg, Erik Wallin, Arvid Fälldin, Tobias Semberg, Morgan Rossander, Eddie Wadbro, Martin Servin
Date:2023-06-19 21:31:05

We explore sim-to-real transfer of deep reinforcement learning controllers for a heavy vehicle with active suspensions designed for traversing rough terrain. While related research primarily focuses on lightweight robots with electric motors and fast actuation, this study uses a forestry vehicle with a complex hydraulic driveline and slow actuation. We simulate the vehicle using multibody dynamics and apply system identification to find an appropriate set of simulation parameters. We then train policies in simulation using various techniques to mitigate the sim-to-real gap, including domain randomization, action delays, and a reward penalty to encourage smooth control. In reality, the policies trained with action delays and a penalty for erratic actions perform nearly at the same level as in simulation. In experiments on level ground, the motion trajectories closely overlap when turning to either side, as well as in a route tracking scenario. When faced with a ramp that requires active use of the suspensions, the simulated and real motions are in close alignment. This shows that the actuator model together with system identification yields a sufficiently accurate model of the actuators. We observe that policies trained without the additional action penalty exhibit fast switching or bang-bang control. These present smooth motions and high performance in simulation but transfer poorly to reality. We find that policies make marginal use of the local height map for perception, showing no indications of predictive planning. However, the strong transfer capabilities entail that further development concerning perception and performance can be largely confined to simulation.
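
Two of the techniques named above, action delays and a penalty on erratic actions, can be sketched in a few lines. The class `DelayedActionBuffer`, the function `smoothness_penalty`, the squared-difference form of the penalty, and its weight are hypothetical choices for illustration; the abstract does not specify how the authors implement either mechanism.

```python
from collections import deque

class DelayedActionBuffer:
    """Return the action issued `delay` steps earlier (hypothetical helper)."""

    def __init__(self, delay, initial_action):
        # Pre-fill so the first `delay` steps apply the initial action.
        self.queue = deque([initial_action] * delay)

    def step(self, action):
        self.queue.append(action)
        return self.queue.popleft()

def smoothness_penalty(prev_action, action, weight):
    """Negative reward term proportional to the squared change in action (assumed form)."""
    return -weight * sum((a - p) ** 2 for a, p in zip(action, prev_action))

# Usage: a two-step delay on a one-dimensional action.
buf = DelayedActionBuffer(delay=2, initial_action=[0.0])
applied = [buf.step([1.0]), buf.step([2.0]), buf.step([3.0])]
penalty = smoothness_penalty([0.0, 0.0], [1.0, -1.0], weight=0.5)
```

During training, the simulator would receive the delayed action from `step`, and the penalty would be added to the task reward to discourage bang-bang switching.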

PTDRL: Parameter Tuning using Deep Reinforcement Learning

Authors:Elias Goldsztejn, Tal Feiner, Ronen Brafman
Date:2023-06-19 10:36:53

A variety of autonomous navigation algorithms exist that allow robots to move around in a safe and fast manner. However, many of these algorithms require parameter re-tuning when facing new environments. In this paper, we propose PTDRL, a parameter-tuning strategy that adaptively selects from a fixed set of parameters those that maximize the expected reward for a given navigation system. Our learning strategy can be used for different environments, different platforms, and different user preferences. Specifically, we attend to the problem of social navigation in indoor spaces, using a classical motion planning algorithm as our navigation system and training its parameters to optimize its behavior. Experimental results show that PTDRL can outperform other online parameter-tuning strategies.

Hierarchical Planning and Control for Box Loco-Manipulation

Authors:Zhaoming Xie, Jonathan Tseng, Sebastian Starke, Michiel van de Panne, C. Karen Liu
Date:2023-06-15 22:23:27

Humans perform everyday tasks using a combination of locomotion and manipulation skills. Building a system that can handle both skills is essential to creating virtual humans. We present a physically-simulated human capable of solving box rearrangement tasks, which requires a combination of both skills. We propose a hierarchical control architecture, where each level solves the task at a different level of abstraction, and the result is a physics-based simulated virtual human capable of rearranging boxes in a cluttered environment. The control architecture integrates a planner, diffusion models, and physics-based motion imitation of sparse motion clips using deep reinforcement learning. Boxes can vary in size, weight, shape, and placement height. Code and trained control policies are provided.

Joint Path planning and Power Allocation of a Cellular-Connected UAV using Apprenticeship Learning via Deep Inverse Reinforcement Learning

Authors:Alireza Shamsoshoara, Fatemeh Lotfi, Sajad Mousavi, Fatemeh Afghah, Ismail Guvenc
Date:2023-06-15 20:50:05

This paper investigates an interference-aware joint path planning and power allocation mechanism for a cellular-connected unmanned aerial vehicle (UAV) in a sparse suburban environment. The UAV's goal is to fly from an initial point and reach a destination point by moving along the cells to guarantee the required quality of service (QoS). In particular, the UAV aims to maximize its uplink throughput and minimize the level of interference to the ground user equipment (UEs) connected to the neighboring cellular base stations (BSs), considering the shortest path and flight resource limitation. Expert knowledge is used to experience the scenario and define the desired behavior for training the agent (i.e., the UAV). To solve the problem, an apprenticeship learning method is utilized via inverse reinforcement learning (IRL) based on both Q-learning and deep reinforcement learning (DRL). The performance of this method is compared to a learning-from-demonstration technique called behavioral cloning (BC), which uses a supervised learning approach. Simulation and numerical results show that the proposed approach can achieve expert-level performance. We also demonstrate that, unlike the BC technique, the performance of our proposed approach does not degrade in unseen situations.

Simplified Temporal Consistency Reinforcement Learning

Authors:Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, Joni Pajarinen
Date:2023-06-15 19:37:43

Reinforcement learning is able to solve complex sequential decision-making tasks but is currently limited by sample efficiency and required computation. To improve sample efficiency, recent work focuses on model-based RL, which interleaves model learning with planning. Recent methods further utilize policy learning, value estimation, and self-supervised learning as auxiliary objectives. In this paper we show that, surprisingly, a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. This applies when using pure planning with a dynamics model conditioned on the representation, but also when utilizing the representation as policy and value function features in model-free RL. In experiments, our approach learns an accurate dynamics model to solve challenging high-dimensional locomotion tasks with online planners while being 4.1 times faster to train compared to ensemble-based methods. With model-free RL without planning, especially on high-dimensional tasks, such as the DeepMind Control Suite Humanoid and Dog tasks, our approach outperforms model-free methods by a large margin and matches model-based methods' sample efficiency while training 2.4 times faster.

Predictive Maneuver Planning with Deep Reinforcement Learning (PMP-DRL) for comfortable and safe autonomous driving

Authors:Jayabrata Chowdhury, Vishruth Veerendranath, Suresh Sundaram, Narasimhan Sundararajan
Date:2023-06-15 11:27:30

This paper presents a Predictive Maneuver Planning with Deep Reinforcement Learning (PMP-DRL) model for maneuver planning. Traditional rule-based maneuver planning approaches often fall short in handling the variability of real-world driving scenarios. By learning from its experience, a Reinforcement Learning (RL)-based driving agent can adapt to changing driving conditions and improve its performance over time. Our proposed approach combines a predictive model and an RL agent to plan for comfortable and safe maneuvers. The predictive model is trained using historical driving data to predict the future positions of other surrounding vehicles. The surrounding vehicles' past and predicted future positions are embedded in context-aware grid maps. At the same time, the RL agent learns to make maneuvers based on this spatio-temporal context information. Performance evaluation of PMP-DRL has been carried out using simulated environments generated from the publicly available NGSIM US101 and I80 datasets. The training sequence shows continuous improvement in driving experience and indicates that the proposed PMP-DRL can learn the trade-off between safety and comfort. The decisions generated by a recent imitation learning-based model are compared with those of the proposed PMP-DRL in unseen scenarios. The results clearly show that PMP-DRL can handle complex real-world scenarios and make more comfortable and safer maneuver decisions than rule-based and imitative models.

Real-Time Network-Level Traffic Signal Control: An Explicit Multiagent Coordination Method

Authors:Wanyuan Wang, Tianchi Qiao, Jinming Ma, Jiahui Jin, Zhibin Li, Weiwei Wu, Yichuan Jian
Date:2023-06-15 04:08:09

Efficient traffic signal control (TSC) has been one of the most useful ways for reducing urban road congestion. The key challenges of TSC include 1) the need for real-time signal decisions, 2) the complexity of traffic dynamics, and 3) network-level coordination. Recent efforts that apply reinforcement learning (RL) methods can query policies by mapping the traffic state to the signal decision in real time; however, they are inadequate for unexpected traffic flows. By observing real traffic information, online planning methods can compute the signal decisions in a responsive manner. We propose an explicit multiagent coordination (EMC)-based online planning method that can satisfy adaptive, real-time, and network-level TSC. By multiagent, we model each intersection as an autonomous agent, and the coordination efficiency is modeled by a cost (i.e., congestion index) function between neighboring intersections. By network-level coordination, each agent exchanges messages about the cost function with its neighbors in a fully decentralized manner. By real-time, the message-passing procedure can be interrupted at any time when the real-time limit is reached, and agents select the optimal signal decisions according to the current messages. Moreover, we prove that our EMC method can guarantee network stability by borrowing ideas from the transportation domain. Finally, we test our EMC method on both synthetic and real road network datasets. Experimental results are encouraging: compared to RL and conventional transportation baselines, our EMC method performs reasonably well in terms of adapting to real-time traffic dynamics, minimizing vehicle travel time, and scaling to city-scale road networks.

Decentralized Social Navigation with Non-Cooperative Robots via Bi-Level Optimization

Authors:Rohan Chandra, Rahul Menon, Zayne Sprague, Arya Anantula, Joydeep Biswas
Date:2023-06-15 02:18:21

This paper presents a fully decentralized approach for realtime non-cooperative multi-robot navigation in social mini-games, such as navigating through a narrow doorway or negotiating right of way at a corridor intersection. Our contribution is a new realtime bi-level optimization algorithm, in which the top-level optimization consists of computing a fair and collision-free ordering followed by the bottom-level optimization which plans optimal trajectories conditioned on the ordering. We show that, given such a priority order, we can impose simple kinodynamic constraints on each robot that are sufficient for it to plan collision-free trajectories with minimal deviation from their preferred velocities, similar to how humans navigate in these scenarios. We successfully deploy the proposed algorithm in the real world using F1/10 robots, a Clearpath Jackal, and a Boston Dynamics Spot as well as in simulation using the SocialGym 2.0 multi-agent social navigation simulator, in the doorway and corridor intersection scenarios. We compare with state-of-the-art social navigation methods using multi-agent reinforcement learning, collision avoidance algorithms, and crowd simulation models. We show that (i) classical navigation performs 44% better than the state-of-the-art learning-based social navigation algorithms, (ii) without a scheduling protocol, our approach results in collisions in social mini-games, (iii) our approach yields 2× and 5× fewer velocity changes than CADRL in doorways and intersections, and finally (iv) bi-level navigation in doorways at a flow rate of 2.8-3.3 (ms)^{-1} is comparable to flow rate in human navigation at a flow rate of 4 (ms)^{-1}.

Deep Generative Models for Decision-Making and Control

Authors:Michael Janner
Date:2023-06-15 01:54:30

Deep model-based reinforcement learning methods offer a conceptually simple approach to the decision-making and control problem: use learning for the purpose of estimating an approximate dynamics model, and offload the rest of the work to classical trajectory optimization. However, this combination has a number of empirical shortcomings, limiting the usefulness of model-based methods in practice. The dual purpose of this thesis is to study the reasons for these shortcomings and to propose solutions for the uncovered problems. Along the way, we highlight how inference techniques from the contemporary generative modeling toolbox, including beam search, classifier-guided sampling, and image inpainting, can be reinterpreted as viable planning strategies for reinforcement learning problems.

Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning

Authors:Evan Zheran Liu, Sahaana Suri, Tong Mu, Allan Zhou, Chelsea Finn
Date:2023-06-14 09:48:48

Whereas machine learning models typically learn language by directly training on language tasks (e.g., next-word prediction), language emerges in human children as a byproduct of solving non-language tasks (e.g., acquiring food). Motivated by this observation, we ask: can embodied reinforcement learning (RL) agents also indirectly learn language from non-language tasks? Learning to associate language with its meaning requires a dynamic environment with varied language. Therefore, we investigate this question in a multi-task environment with language that varies across the different tasks. Specifically, we design an office navigation environment, where the agent's goal is to find a particular office, and office locations differ in different buildings (i.e., tasks). Each building includes a floor plan with a simple language description of the goal office's location, which can be visually read as an RGB image when visited. We find RL agents indeed are able to indirectly learn language. Agents trained with current meta-RL algorithms successfully generalize to reading floor plans with held-out layouts and language phrases, and quickly navigate to the correct office, despite receiving no direct language supervision.

Hierarchical Task Network Planning for Facilitating Cooperative Multi-Agent Reinforcement Learning

Authors:Xuechen Mu, Hankz Hankui Zhuo, Chen Chen, Kai Zhang, Chao Yu, Jianye Hao
Date:2023-06-14 08:51:43

Exploring sparse reward multi-agent reinforcement learning (MARL) environments with traps in a collaborative manner is a complex task. Agents typically fail to reach the goal state and fall into traps, which affects the overall performance of the system. To overcome this issue, we present SOMARL, a framework that uses prior knowledge to reduce the exploration space and assist learning. In SOMARL, agents are treated as part of the MARL environment, and symbolic knowledge is embedded using a tree structure to build a knowledge hierarchy. The framework has a two-layer hierarchical structure, comprising a hybrid module with a Hierarchical Task Network (HTN) planner and a meta-controller at the higher level, and a MARL-based interactive module at the lower level. The HTN module and meta-controller use Hierarchical Domain Definition Language (HDDL) and the option framework to formalize symbolic knowledge and obtain domain knowledge and a symbolic option set, respectively. Moreover, the HTN module leverages domain knowledge to guide low-level agent exploration by assisting the meta-controller in selecting symbolic options. The meta-controller further computes intrinsic rewards of symbolic options to limit exploration behavior and adjust HTN planning solutions as needed. We evaluate SOMARL on two benchmarks, FindTreasure and MoveBox, and report significantly superior performance over state-of-the-art MARL and subgoal-based baselines in these MARL environments.

Pruning the Way to Reliable Policies: A Multi-Objective Deep Q-Learning Approach to Critical Care

Authors:Ali Shirali, Alexander Schubert, Ahmed Alaa
Date:2023-06-13 18:02:57

Medical treatments often involve a sequence of decisions, each informed by previous outcomes. This process closely aligns with reinforcement learning (RL), a framework for optimizing sequential decisions to maximize cumulative rewards under unknown dynamics. While RL shows promise for creating data-driven treatment plans, its application in medical contexts is challenging due to the frequent need to use sparse rewards, primarily defined based on mortality outcomes. This sparsity can reduce the stability of offline estimates, posing a significant hurdle in fully utilizing RL for medical decision-making. We introduce a deep Q-learning approach to obtain more reliable critical care policies by integrating relevant but noisy frequently measured biomarker signals into the reward specification without compromising the optimization of the main outcome. Our method prunes the action space based on all available rewards before training a final model on the sparse main reward. This approach minimizes potential distortions of the main objective while extracting valuable information from intermediate signals to guide learning. We evaluate our method in off-policy and offline settings using simulated environments and real health records from intensive care units. Our empirical results demonstrate that our method outperforms common offline RL methods such as conservative Q-learning and batch-constrained deep Q-learning. By disentangling sparse rewards and frequently measured reward proxies through action pruning, our work represents a step towards developing reliable policies that effectively harness the wealth of available information in data-intensive critical care environments.
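
A minimal sketch of pruning before final training follows. The abstract does not state the pruning rule, so the sketch assumes a simple Pareto criterion: an action is kept only if no other action is at least as good on every reward signal and strictly better on one. The function names and the toy Q-values are hypothetical.

```python
def dominates(u, v):
    """True if u is at least as good as v on every reward and strictly better on one."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))

def pareto_prune(q_values):
    """q_values[a] holds one estimated return per reward signal; keep non-dominated actions."""
    return [
        a for a, qa in enumerate(q_values)
        if not any(dominates(qb, qa) for b, qb in enumerate(q_values) if b != a)
    ]

def greedy_on_main(q_main, allowed):
    """Pick the best remaining action according to the sparse main reward only."""
    return max(allowed, key=lambda a: q_main[a])

# Toy per-action estimates for two auxiliary reward signals (assumed numbers).
q_aux = [(1.0, 0.2), (0.8, 0.9), (0.5, 0.1), (0.9, 0.9)]
allowed = pareto_prune(q_aux)

# Toy estimates for the main outcome; action 2 looks best but was pruned.
q_main = [0.1, 0.7, 0.9, 0.4]
choice = greedy_on_main(q_main, allowed)
```

The example shows the intended division of labor: the auxiliary signals only restrict the action set, and the final choice is driven by the main reward alone.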

Multi-Robot Motion Planning: A Learning-Based Artificial Potential Field Solution

Authors:Dengyu Zhang, Guobin Zhu, Qingrui Zhang
Date:2023-06-13 09:36:38

Motion planning is a crucial aspect of robot autonomy as it involves identifying a feasible motion path to a destination while taking into consideration various constraints, such as input, safety, and performance constraints, without violating either system or environment boundaries. This becomes particularly challenging when multiple robots run without communication, which compromises their real-time efficiency, safety, and performance. In this paper, we present a learning-based potential field algorithm that incorporates deep reinforcement learning into an artificial potential field (APF). Specifically, we introduce an observation embedding mechanism that pre-processes dynamic information about the environment and develop a soft wall-following rule to improve trajectory smoothness. Our method, while belonging to reactive planning, implicitly encodes environmental properties. Additionally, our approach can scale up to any number of robots and has demonstrated superior performance compared to APF and RL through numerical simulations. Finally, experiments are conducted to highlight the effectiveness of our proposed method.
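
For readers unfamiliar with the baseline that this work builds on, a classical artificial potential field can be sketched as follows. This is the textbook formulation (quadratic attraction to the goal, repulsion from obstacles within an influence radius), not the learning-based variant the authors propose; the function name `apf_force` and the gains `k_att`, `k_rep`, and `d0` are illustrative assumptions.

```python
import math


def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Net 2D force from a classical artificial potential field.

    Attractive potential: 0.5 * k_att * |pos - goal|^2.
    Repulsive potential:  0.5 * k_rep * (1/d - 1/d0)^2 for 0 < d < d0,
    where d is the distance to an obstacle and d0 its influence radius.
    The returned force is the negative gradient of the summed potentials.
    """
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < d0:
            # Magnitude of the repulsive force, directed away from the obstacle.
            mag = k_rep * (1.0 / d - 1.0 / d0) / d**2
            fx += mag * dx / d
            fy += mag * dy / d
    return fx, fy


# A robot at the origin is pulled toward the goal and pushed away from a nearby obstacle.
print(apf_force((0.0, 0.0), (3.0, 4.0), [(1.0, 0.0)]))
```

A reactive planner of this kind simply follows the force direction at each step, which is why it can get stuck in local minima or produce jerky paths near obstacles.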

Reinforcement Learning in Robotic Motion Planning by Combined Experience-based Planning and Self-Imitation Learning

Authors:Sha Luo, Lambert Schomaker
Date:2023-06-11 19:47:46

High-quality and representative data is essential for both Imitation Learning (IL)- and Reinforcement Learning (RL)-based motion planning tasks. For real robots, it is challenging to collect enough qualified data either as demonstrations for IL or experiences for RL due to safety considerations in environments with obstacles. We target this challenge by proposing the self-imitation learning by planning plus (SILP+) algorithm, which efficiently embeds experience-based planning into the learning architecture to mitigate the data-collection problem. The planner generates demonstrations based on successfully visited states from the current RL policy, and the policy improves by learning from these demonstrations. In this way, we relieve the demand for human expert operators to collect demonstrations required by IL and improve the RL performance as well. Various experimental results show that SILP+ achieves better training efficiency and a higher and more stable success rate in complex motion planning tasks compared to several other methods. Extensive tests on physical robots illustrate the effectiveness of SILP+ in a physical setting.

Decision Stacks: Flexible Reinforcement Learning via Modular Generative Models

Authors:Siyan Zhao, Aditya Grover
Date:2023-06-09 20:52:16

Reinforcement learning presents an attractive paradigm to reason about several distinct aspects of sequential decision making, such as specifying complex goals, planning future observations and actions, and critiquing their utilities. However, the combined integration of these capabilities poses competing algorithmic challenges in retaining maximal expressivity while allowing for flexibility in modeling choices for efficient learning and inference. We present Decision Stacks, a generative framework that decomposes goal-conditioned policy agents into 3 generative modules. These modules simulate the temporal evolution of observations, rewards, and actions via independent generative models that can be learned in parallel via teacher forcing. Our framework guarantees both expressivity and flexibility in designing individual modules to account for key factors such as architectural bias, optimization objective and dynamics, transferability across domains, and inference speed. Our empirical results demonstrate the effectiveness of Decision Stacks for offline policy optimization for several MDP and POMDP environments, outperforming existing methods and enabling flexible generative decision making.

iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning

Authors:Xiyang Wu, Rohan Chandra, Tianrui Guan, Amrit Singh Bedi, Dinesh Manocha
Date:2023-06-09 20:12:02

Navigating safely and efficiently in dense and heterogeneous traffic scenarios is challenging for autonomous vehicles (AVs) due to their inability to infer the behaviors or intentions of nearby drivers. In this work, we introduce a distributed multi-agent reinforcement learning (MARL) algorithm that can predict trajectories and intents in dense and heterogeneous traffic scenarios. Our approach for intent-aware planning, iPLAN, allows agents to infer nearby drivers' intents solely from their local observations. We model two distinct incentives for agents' strategies: Behavioral Incentive for high-level decision-making based on their driving behavior or personality and Instant Incentive for motion planning for collision avoidance based on the current traffic state. Our approach enables agents to infer their opponents' behavior incentives and integrate this inferred information into their decision-making and motion-planning processes. We perform experiments on two simulation environments, Non-Cooperative Navigation and Heterogeneous Highway. In Heterogeneous Highway, results show that, compared with centralized training decentralized execution (CTDE) MARL baselines such as QMIX and MAPPO, our method yields a 4.3% and 38.4% higher episodic reward in mild and chaotic traffic, with 48.1% higher success rate and 80.6% longer survival time in chaotic traffic. We also compare with a decentralized training decentralized execution (DTDE) baseline IPPO and demonstrate a higher episodic reward of 12.7% and 6.3% in mild traffic and chaotic traffic, 25.3% higher success rate, and 13.7% longer survival time.

Embodied Executable Policy Learning with Language-based Scene Summarization

Authors:Jielin Qiu, Mengdi Xu, William Han, Seungwhan Moon, Ding Zhao
Date:2023-06-09 06:34:09

Large language models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs heavily relies on domain-specific templated text data, which may be infeasible in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve with non-expert interactions with environments. In this work, we introduce a novel learning paradigm that generates robots' executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the connecting bridge between both domains. Our proposed paradigm stands apart from previous works, which utilized either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle text summarization of the scene, eliminating the need for human involvement in the learning loop, which makes it more practical for real-world robot learning tasks. Our proposed paradigm consists of two modules: the SUM module, which interprets the environment using visual observations and produces a text summary of the scene, and the APM module, which generates executable action policies based on the natural language descriptions provided by the SUM module. We demonstrate that our proposed method can employ two fine-tuning strategies, including imitation learning and reinforcement learning approaches, to adapt to the target test tasks effectively. We conduct extensive experiments involving various SUM/APM model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm.

Dual policy as self-model for planning

Authors:Jaesung Yoo, Fernanda de la Torre, Guangyu Robert Yang
Date:2023-06-07 13:58:45

Planning is a data-efficient decision-making strategy where an agent selects candidate actions by exploring possible future states. To simulate future states when there is a high-dimensional action space, the knowledge of one's decision making strategy must be used to limit the number of actions to be explored. We refer to the model used to simulate one's decisions as the agent's self-model. While self-models are implicitly used widely in conjunction with world models to plan actions, it remains unclear how self-models should be designed. Inspired by current reinforcement learning approaches and neuroscience, we explore the benefits and limitations of using a distilled policy network as the self-model. In such dual-policy agents, a model-free policy and a distilled policy are used for model-free actions and planned actions, respectively. Our results on an ecologically relevant, parametric environment indicate that using a distilled policy network as the self-model stabilizes training, has faster inference than using the model-free policy, promotes better exploration, and could learn a comprehensive understanding of its own behaviors, at the cost of distilling a new network apart from the model-free policy.

Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach

Authors:Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, Bin Liu
Date:2023-06-06 11:49:09

Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets. Recent studies have demonstrated that LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions. However, interactions with LLMs can be time-consuming. In many practical scenarios, LLMs require a significant amount of storage space and can only be deployed on remote cloud servers. Additionally, using commercial LLMs can be costly since they may charge based on usage frequency. In this paper, we explore how to enable intelligent cost-effective interactions between a downstream task-oriented agent and an LLM. We find that this problem can be naturally formulated by a Markov decision process (MDP), and propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query LLMs for high-level instructions to accomplish a target task. On one side, When2Ask discourages unnecessary redundant interactions, while on the other side, it enables the agent to identify and follow useful instructions from the LLM. This enables the agent to halt an ongoing plan and transition to a more suitable one based on new environmental observations. Experiments on MiniGrid and Habitat environments that entail planning sub-goals demonstrate that When2Ask learns to solve target tasks with only a few necessary interactions with the LLM, significantly reducing interaction costs in testing environments compared with baseline methods. Our code is available at: https://github.com/ZJLAB-AMMI/LLM4RL.

Risk-Aware Reward Shaping of Reinforcement Learning Agents for Autonomous Driving

Authors:Lin-Chi Wu, Zengjie Zhang, Sofie Haesaert, Zhiqiang Ma, Zhiyong Sun
Date:2023-06-05 20:10:36

Reinforcement learning (RL) is an effective approach to motion planning in autonomous driving, where an optimal driving policy can be automatically learned using the interaction data with the environment. Nevertheless, the reward function for an RL agent, which is significant to its performance, is challenging to determine. Conventional work mainly focuses on rewarding safe driving states but does not incorporate the awareness of risky driving behaviors of the vehicles. In this paper, we investigate how to use risk-aware reward shaping to improve the training and test performance of RL agents in autonomous driving. Based on the essential requirements that prescribe the safety specifications for general autonomous driving in practice, we propose additional reshaped reward terms that encourage exploration and penalize risky driving behaviors. A simulation study in OpenAI Gym indicates the advantage of risk-aware reward shaping for various RL agents. Also, we point out that proximal policy optimization (PPO) is likely to be the best RL method that works with risk-aware reward shaping.

Model-aided Federated Reinforcement Learning for Multi-UAV Trajectory Planning in IoT Networks

Authors:Jichao Chen, Omid Esrafilian, Harald Bayerlein, David Gesbert, Marco Caccamo
Date:2023-06-03 07:16:17

Deploying teams of unmanned aerial vehicles (UAVs) to harvest data from distributed Internet of Things (IoT) devices requires efficient trajectory planning and coordination algorithms. Multi-agent reinforcement learning (MARL) has emerged as a solution, but requires extensive and costly real-world training data. To tackle this challenge, we propose a novel model-aided federated MARL algorithm to coordinate multiple UAVs on a data harvesting mission with only limited knowledge about the environment. The proposed algorithm alternates between building an environment simulation model from real-world measurements, specifically learning the radio channel characteristics and estimating unknown IoT device positions, and federated QMIX training in the simulated environment. Each UAV agent trains a local QMIX model in its simulated environment and continuously consolidates it through federated learning with other agents, accelerating the learning process. A performance comparison with standard MARL algorithms demonstrates that our proposed model-aided FedQMIX algorithm reduces the need for real-world training experiences by around three orders of magnitude while attaining similar data collection performance.

Multi-Robot Path Planning Combining Heuristics and Multi-Agent Reinforcement Learning

Authors:Shaoming Peng
Date:2023-06-02 05:07:37

Multi-robot path finding in dynamic environments is a highly challenging classic problem. In the movement process, robots need to avoid collisions with other moving robots while minimizing their travel distance. Previous methods for this problem either continuously replan paths using heuristic search methods to avoid conflicts or choose appropriate collision avoidance strategies based on learning approaches. The former may result in long travel distances due to frequent replanning, while the latter may have low learning efficiency due to low sample exploration and utilization, causing high training costs for the model. To address these issues, we propose a path planning method, MAPPOHR, which combines heuristic search, empirical rules, and multi-agent reinforcement learning. The method consists of two layers: a real-time planner based on the multi-agent reinforcement learning algorithm, MAPPO, which embeds empirical rules in the action output layer and reward functions, and a heuristic search planner used to create a global guiding path. During movement, the heuristic search planner replans new paths based on the instructions of the real-time planner. We tested our method in 10 different conflict scenarios. The experiments show that the planning performance of MAPPOHR is better than that of existing learning and heuristic methods. Due to the utilization of empirical knowledge and heuristic search, the learning efficiency of MAPPOHR is higher than that of existing learning methods.

Efficient Reinforcement Learning with Impaired Observability: Learning to Act with Delayed and Missing State Observations

Authors:Minshuo Chen, Jie Meng, Yu Bai, Yinyu Ye, H. Vincent Poor, Mengdi Wang
Date:2023-06-02 02:46:39

In real-world reinforcement learning (RL) systems, various forms of impaired observability can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We present algorithms and establish near-optimal regret upper and lower bounds, of the form $\tilde{\mathcal{O}}(\sqrt{{\rm poly}(H) SAK})$, for RL in the delayed and missing observation settings. Here $S$ and $A$ are the sizes of state and action spaces, $H$ is the time horizon and $K$ is the number of episodes. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability. Numerical results are provided to support our theory.

IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control

Authors:Rohan Chitnis, Yingchen Xu, Bobak Hashemi, Lucas Lehnert, Urun Dogan, Zheqing Zhu, Olivier Delalleau
Date:2023-06-01 16:24:40

Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40.

BitE : Accelerating Learned Query Optimization in a Mixed-Workload Environment

Authors:Yuri Kim, Yewon Choi, Yujung Gil, Sanghee Lee, Heesik Shin, Jaehyok Chong
Date:2023-06-01 16:05:33

Despite the many efforts to apply deep reinforcement learning to query optimization in recent years, there remains room for improvement as query optimizers are complex entities that require hand-designed tuning of workloads and datasets. Recent research presents learned query optimization results mostly in bulks of single workloads, which focus on picking up the unique traits of the specific workload. This proves to be problematic in scenarios where the different characteristics of multiple workloads and datasets are to be mixed and learned together. Hence, in this paper, we propose BitE, a novel ensemble learning model using database statistics and metadata to tune a learned query optimizer for enhancing performance. Along the way, we introduce multiple revisions to solve several challenges: we extend the search space for the optimal Abstract SQL Plan (represented as a JSON object called ASP) by expanding hintsets, we steer the model away from the default plans that may be biased by configuring the experience with all unique plans of queries, and we deviate from the traditional loss functions and choose an alternative method to cope with underestimation and overestimation of reward. Our model achieves 19.6% more improved queries and 15.8% fewer regressed queries compared to the existing traditional methods whilst using a comparable level of resources.

Safe Offline Reinforcement Learning with Real-Time Budget Constraints

Authors:Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang
Date:2023-06-01 12:19:32

Aiming at promoting the safe real-world deployment of Reinforcement Learning (RL), research on safe RL has made significant progress in recent years. However, most existing works in the literature still focus on the online setting where risky violations of the safety budget are likely to be incurred during training. Besides, in many real-world applications, the learned policy is required to respond to dynamically determined safety budgets (i.e., constraint threshold) in real time. In this paper, we target the above real-time budget constraint problem under the offline setting, and propose Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that models this problem from the perspective of trajectory distribution and solves it through diffusion model planning. Theoretically, we prove an error bound of the estimation on the episodic reward and cost under the offline setting and thus provide a performance guarantee for TREBI. Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.

Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

Authors:Shengran Hu, Jeff Clune
Date:2023-06-01 03:43:41

Language is often considered a key aspect of human thinking, providing us with exceptional abilities to generalize, explore, plan, replan, and adapt to new situations. However, Reinforcement Learning (RL) agents are far from human-level performance in any of these abilities. We hypothesize one reason for such cognitive deficiencies is that they lack the benefits of thinking in language and that we can improve AI agents by training them to think like humans do. We introduce a novel Imitation Learning framework, Thought Cloning, where the idea is to not just clone the behaviors of human demonstrators, but also the thoughts humans have as they perform these behaviors. While we expect Thought Cloning to truly shine at scale on internet-sized datasets of humans thinking out loud while acting (e.g. online videos with transcripts), here we conduct experiments in a domain where the thinking and action data are synthetically generated. Results reveal that Thought Cloning learns much faster than Behavioral Cloning and its performance advantage grows the further out of distribution test tasks are, highlighting its ability to better handle novel situations. Thought Cloning also provides important benefits for AI Safety and Interpretability, and makes it easier to debug and improve AI. Because we can observe the agent's thoughts, we can (1) more easily diagnose why things are going wrong, making it easier to fix the problem, (2) steer the agent by correcting its thinking, or (3) prevent it from doing unsafe things it plans to do. Overall, by training agents how to think as well as behave, Thought Cloning creates safer, more powerful agents.

BetaZero: Belief-State Planning for Long-Horizon POMDPs using Learned Approximations

Authors:Robert J. Moss, Anthony Corso, Jef Caers, Mykel J. Kochenderfer
Date:2023-05-31 23:47:31

Real-world planning problems, including autonomous driving and sustainable energy applications like carbon storage and resource exploration, have recently been modeled as partially observable Markov decision processes (POMDPs) and solved using approximate methods. To solve high-dimensional POMDPs in practice, state-of-the-art methods use online planning with problem-specific heuristics to reduce planning horizons and make the problems tractable. Algorithms that learn approximations to replace heuristics have recently found success in large-scale fully observable domains. The key insight is the combination of online Monte Carlo tree search with offline neural network approximations of the optimal policy and value function. In this work, we bring this insight to partially observable domains and propose BetaZero, a belief-state planning algorithm for high-dimensional POMDPs. BetaZero learns offline approximations that replace heuristics to enable online decision making in long-horizon problems. We address several challenges inherent in large-scale partially observable domains; namely challenges of transitioning in stochastic environments, prioritizing action branching with a limited search budget, and representing beliefs as input to the network. To formalize the use of all limited search information, we train against a novel $Q$-weighted visit counts policy. We test BetaZero on various well-established POMDP benchmarks found in the literature and a real-world problem of critical mineral exploration. Experiments show that BetaZero outperforms state-of-the-art POMDP solvers on a variety of tasks.

MetaDiffuser: Diffusion Model as Conditional Planner for Offline Meta-RL

Authors:Fei Ni, Jianye Hao, Yao Mu, Yifu Yuan, Yan Zheng, Bin Wang, Zhixuan Liang
Date:2023-05-31 15:01:38

Recently, the diffusion model has emerged as a promising backbone for the sequence modeling paradigm in offline reinforcement learning (RL). However, these works mostly lack the generalization ability across tasks with reward or dynamics change. To tackle this challenge, in this paper we propose a task-oriented conditioned diffusion planner for offline meta-RL (MetaDiffuser), which considers the generalization problem as a conditional trajectory generation task with contextual representation. The key is to learn a context-conditioned diffusion model which can generate task-oriented trajectories for planning across diverse tasks. To enhance the dynamics consistency of the generated trajectories while encouraging trajectories to achieve high returns, we further design a dual-guided module in the sampling process of the diffusion model. The proposed framework is robust to the quality of collected warm-start data from the testing task and flexible enough to incorporate different task representation methods. The experiment results on MuJoCo benchmarks show that MetaDiffuser outperforms other strong offline meta-RL baselines, demonstrating the outstanding conditional generation ability of the diffusion architecture.

Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration

Authors:Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang
Date:2023-05-29 17:25:26

In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration. However, in order to cope with general function approximators, most of them involve impractical algorithmic components to incentivize exploration, such as optimization within data-dependent level-sets or complicated sampling procedures. To address this challenge, we propose an easy-to-implement RL framework called Maximize to Explore (MEX), which only needs to optimize unconstrainedly a single objective that integrates the estimation and planning components while balancing exploration and exploitation automatically. Theoretically, we prove that MEX achieves a sublinear regret with general function approximations for Markov decision processes (MDP) and is further extendable to two-player zero-sum Markov games (MG). Meanwhile, we adapt deep RL baselines to design practical versions of MEX, in both model-free and model-based manners, which can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards. Compared with existing sample-efficient online RL algorithms with general function approximations, MEX achieves similar sample efficiency while enjoying a lower computational cost and is more compatible with modern deep RL methods.

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Authors:Haque Ishfaq, Qingfeng Lan, Pan Xu, A. Rupam Mahmood, Doina Precup, Anima Anandkumar, Kamyar Azizzadenesheli
Date:2023-05-29 17:11:28

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
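
The "noisy gradient descent" update mentioned in this abstract is the standard (unadjusted) Langevin Monte Carlo step, theta <- theta - step * grad(theta) + sqrt(2 * step / inv_temp) * noise. The sketch below applies it to a one-dimensional toy target rather than a Q function, so it only illustrates the sampling mechanism; the function name `langevin_samples`, the step size, and the toy target are assumptions chosen for illustration, not the authors' algorithm.

```python
import math
import random


def langevin_samples(grad, theta0, step, inv_temp, n_steps, rng):
    """Run unadjusted Langevin dynamics and return every visited iterate.

    Each update is a gradient descent step on the negative log-density
    plus Gaussian noise scaled by sqrt(2 * step / inv_temp).
    """
    theta = theta0
    noise_scale = math.sqrt(2.0 * step / inv_temp)
    iterates = []
    for _ in range(n_steps):
        theta = theta - step * grad(theta) + noise_scale * rng.gauss(0.0, 1.0)
        iterates.append(theta)
    return iterates


def toy_grad(theta):
    """Gradient of (theta - 2)^2 / 2, the negative log-density of N(2, 1)."""
    return theta - 2.0


samples = langevin_samples(toy_grad, 0.0, 0.05, 1.0, 20000, random.Random(0))
kept = samples[2000:]  # discard early iterates as burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print(mean, var)
```

With a small step size the iterates behave approximately like draws from the target distribution, so the empirical mean and variance should land near 2 and 1. No Gaussian approximation of the target is fitted at any point; only its gradient is needed.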

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning

Authors:Haoran He, Chenjia Bai, Kang Xu, Zhuoran Yang, Weinan Zhang, Dong Wang, Bin Zhao, Xuelong Li
Date:2023-05-29 05:20:38

Diffusion models have demonstrated highly-expressive generative capabilities in vision and NLP. Recent studies in reinforcement learning (RL) have shown that diffusion models are also powerful in modeling complex policies or trajectories in offline datasets. However, these works have been limited to single-task settings where a generalist agent capable of addressing multi-task predicaments is absent. In this paper, we aim to investigate the effectiveness of a single diffusion model in modeling large-scale multi-task offline data, which can be challenging due to diverse and multimodal data distribution. Specifically, we propose Multi-Task Diffusion Model (MTDiff), a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis in multi-task offline settings. MTDiff leverages vast amounts of knowledge available in multi-task data and performs implicit knowledge sharing among tasks. For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D. For data synthesis, MTDiff generates high-quality data for testing tasks given a single demonstration as a prompt, which enhances the low-quality datasets for even unseen tasks.

Potential-based Credit Assignment for Cooperative RL-based Testing of Autonomous Vehicles

Authors:Utku Ayvaz, Chih-Hong Cheng, Hao Shen
Date:2023-05-28 06:41:06

While autonomous vehicles (AVs) may perform remarkably well in generic real-life cases, their irrational action in some unforeseen cases leads to critical safety concerns. This paper introduces the concept of collaborative reinforcement learning (RL) to generate challenging test cases for AV planning and decision-making modules. One of the critical challenges for collaborative RL is the credit assignment problem, where a proper assignment of rewards to multiple agents interacting in the traffic scenario, considering all parameters and timing, turns out to be non-trivial. In order to address this challenge, we propose a novel potential-based reward-shaping approach inspired by counterfactual analysis for solving the credit-assignment problem. The evaluation in a simulated environment demonstrates the superiority of our proposed approach against other methods using local and global rewards.
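
As background for the term used in this abstract, generic potential-based reward shaping adds the term gamma * phi(s') - phi(s) to the environment reward. The sketch below shows only this standard form and its telescoping property; the counterfactual-inspired potential the authors propose is not described in the abstract, so the potential values used here are arbitrary illustrative numbers and the function names are hypothetical.

```python
def shaped_reward(reward, phi_s, phi_next, gamma):
    """Environment reward plus the potential-based shaping term."""
    return reward + gamma * phi_next - phi_s


def discounted_shaping_sum(potentials, gamma):
    """Discounted sum of shaping terms along one trajectory of potentials."""
    total = 0.0
    for t in range(len(potentials) - 1):
        total += gamma**t * (gamma * potentials[t + 1] - potentials[t])
    return total


gamma = 0.9
potentials = [1.0, 3.0, 2.0, 0.0]  # phi(s_0), ..., phi(s_3), illustrative values

# The intermediate potentials cancel, leaving gamma^T * phi(s_T) - phi(s_0).
total = discounted_shaping_sum(potentials, gamma)
expected = gamma**3 * potentials[-1] - potentials[0]
print(total, expected)
```

Because the shaping terms telescope, their discounted sum depends only on the first and last states of a trajectory, which is why shaping of this form changes the learning signal without changing which policies are optimal.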

Self-Supervised Reinforcement Learning that Transfers using Random Features

Authors:Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, Abhishek Gupta
Date:2023-05-26 20:37:06

Model-free reinforcement learning algorithms have exhibited great potential in solving single-task sequential decision-making problems with high-dimensional observations and long horizons, but are known to be hard to generalize across tasks. Model-based RL, on the other hand, learns task-agnostic models of the world that naturally enable transfer across different reward functions, but struggles to scale to complex environments due to the compounding error. To get the best of both worlds, we propose a self-supervised reinforcement learning method that enables the transfer of behaviors across tasks with different rewards, while circumventing the challenges of model-based RL. In particular, we show self-supervised pre-training of model-free reinforcement learning with a number of random features as rewards allows implicit modeling of long-horizon environment dynamics. Then, planning techniques like model-predictive control using these implicit models enable fast adaptation to problems with new reward functions. Our method is self-supervised in that it can be trained on offline datasets without reward labels, but can then be quickly deployed on new tasks. We validate that our proposed method enables transfer across tasks on a variety of manipulation and locomotion domains in simulation, opening the door to generalist decision-making agents.

Formal Modelling for Multi-Robot Systems Under Uncertainty

Authors:Charlie Street, Masoumeh Mansouri, Bruno Lacerda
Date:2023-05-26 15:23:35

Purpose of Review: To effectively synthesise and analyse multi-robot behaviour, we require formal task-level models which accurately capture multi-robot execution. In this paper, we review modelling formalisms for multi-robot systems under uncertainty, and discuss how they can be used for planning, reinforcement learning, model checking, and simulation. Recent Findings: Recent work has investigated models which more accurately capture multi-robot execution by considering different forms of uncertainty, such as temporal uncertainty and partial observability, and modelling the effects of robot interactions on action execution. Other strands of work have presented approaches for reducing the size of multi-robot models to admit more efficient solution methods. This can be achieved by decoupling the robots under independence assumptions, or reasoning over higher level macro actions. Summary: Existing multi-robot models demonstrate a trade off between accurately capturing robot dependencies and uncertainty, and being small enough to tractably solve real world problems. Therefore, future research should exploit realistic assumptions over multi-robot behaviour to develop smaller models which retain accurate representations of uncertainty and robot interactions; and exploit the structure of multi-robot problems, such as factored state spaces, to develop scalable solution methods.

Reward-Machine-Guided, Self-Paced Reinforcement Learning

Authors:Cevahir Koprulu, Ufuk Topcu
Date:2023-05-25 22:13:37

Self-paced reinforcement learning (RL) aims to improve the data efficiency of learning by automatically creating sequences, namely curricula, of probability distributions over contexts. However, existing techniques for self-paced RL fail in long-horizon planning tasks that involve temporally extended behaviors. We hypothesize that taking advantage of prior knowledge about the underlying task structure can improve the effectiveness of self-paced RL. We develop a self-paced RL algorithm guided by reward machines, i.e., a type of finite-state machine that encodes the underlying task structure. The algorithm integrates reward machines in 1) the update of the policy and value functions obtained by any RL algorithm of choice, and 2) the update of the automated curriculum that generates context distributions. Our empirical results evidence that the proposed algorithm achieves optimal behavior reliably even in cases in which existing baselines cannot make any meaningful progress. It also decreases the curriculum length and reduces the variance in the curriculum generation process by up to one-fourth and four orders of magnitude, respectively.
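
As an illustration of what "a type of finite-state machine that encodes the underlying task structure" can look like, here is a minimal sketch of a reward machine. The class name, the event labels, and the "get the key, then open the door" task are assumptions made for this example; the sketch does not cover the paper's policy or curriculum updates.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RewardMachine:
    """Finite-state machine that emits rewards on labelled events."""
    initial_state: str
    # (machine state, event label) -> (next machine state, reward)
    transitions: Dict[Tuple[str, str], Tuple[str, float]]
    state: str = field(init=False)

    def __post_init__(self) -> None:
        self.state = self.initial_state

    def reset(self) -> None:
        """Return to the initial machine state at the start of an episode."""
        self.state = self.initial_state

    def step(self, event: str) -> float:
        """Advance on an event; unknown events keep the state and pay 0."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward


# Hypothetical two-stage task: pick up a key, then open a door.
key_then_door = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "key"): ("u1", 0.0),
        ("u1", "door"): ("done", 1.0),
    },
)
```

Because the machine state records which stage of the task has been completed, an agent can condition its policy on it, and reaching the door before the key yields no reward.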

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

Authors:Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, Jifeng Dai
Date:2023-05-25 17:59:49

The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, the current research landscape predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of Reinforcement Learning (RL) based controllers used in existing methods. To tackle these challenges, we introduce Ghost in the Minecraft (GITM), a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents (GCAs) in Minecraft. These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions. We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute. The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate on the "ObtainDiamond" task, demonstrating superior robustness compared to traditional RL-based controllers. Notably, our agent is the first to procure all items in the Minecraft Overworld technology tree, demonstrating its extensive capabilities. GITM does not need any GPU for training; a single CPU node with 32 CPU cores is enough. This research shows the potential of LLMs in developing capable agents for handling long-horizon, complex tasks and adapting to uncertainties in open-world environments. See the project website at https://github.com/OpenGVLab/GITM.

C-MCTS: Safe Planning with Monte Carlo Tree Search

Authors:Dinesh Parthasarathy, Georgios Kontes, Axel Plinge, Christopher Mutschler
Date:2023-05-25 16:08:30

The Constrained Markov Decision Process (CMDP) formulation makes it possible to solve safety-critical decision making tasks that are subject to constraints. While CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches perform conservatively with respect to costs as they avoid constraint violations by using Monte Carlo cost estimates that suffer from high variance. We propose Constrained MCTS (C-MCTS), which estimates cost using a safety critic that is trained with Temporal Difference learning in an offline phase prior to agent deployment. The critic limits exploration by pruning unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards than previous work. As a nice byproduct, the planner is more efficient w.r.t. planning steps. Most importantly, under model mismatch between the planner and the real world, C-MCTS is less susceptible to cost violations than previous work.

Provable Offline Preference-Based Reinforcement Learning

Authors:Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun
Date:2023-05-24 07:11:26

In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.
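
Step (1), maximum likelihood estimation of an implicit reward from trajectory preferences, can be sketched as follows. The sketch assumes a Bradley-Terry preference model, P(first preferred) = sigmoid(r(first) - r(second)), and a reward that is linear in summed per-step features; both are illustrative assumptions that the abstract does not state, and the names below are hypothetical. The distributionally robust planning of step (2) is not shown.

```python
import math
from typing import List, Sequence, Tuple

# A trajectory is a sequence of per-step feature vectors (assumed format).
Trajectory = Sequence[Sequence[float]]


def trajectory_reward(theta: Sequence[float], traj: Trajectory) -> float:
    """Assumed reward model: sum over steps of the linear score theta . x."""
    return sum(sum(w * x for w, x in zip(theta, step)) for step in traj)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_nll(theta: Sequence[float],
                   pairs: List[Tuple[Trajectory, Trajectory, int]]) -> float:
    """Mean negative log-likelihood under the Bradley-Terry model.

    Each item is (first, second, label); label is 1 if the first
    trajectory was preferred and 0 if the second was preferred.
    """
    total = 0.0
    for first, second, label in pairs:
        diff = trajectory_reward(theta, first) - trajectory_reward(theta, second)
        margin = diff if label == 1 else -diff
        total += softplus(-margin)  # equals -log(sigmoid(margin))
    return total / len(pairs)
```

Minimizing `preference_nll` over `theta` with any gradient-based or derivative-free optimizer gives the MLE for this assumed model; a parameter vector that ranks the preferred trajectory higher yields a lower loss.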

KARNet: Kalman Filter Augmented Recurrent Neural Network for Learning World Models in Autonomous Driving Tasks

Authors:Hemanth Manjunatha, Andrey Pak, Dimitar Filev, Panagiotis Tsiotras
Date:2023-05-24 02:27:34

Autonomous driving has received a great deal of attention in the automotive industry and is often seen as the future of transportation. The development of autonomous driving technology has been greatly accelerated by the growth of end-to-end machine learning techniques that have been successfully used for perception, planning, and control tasks. An important aspect of autonomous driving planning is knowing how the environment evolves in the immediate future and taking appropriate actions. An autonomous driving system should effectively use the information collected from the various sensors to form an abstract representation of the world to maintain situational awareness. For this purpose, deep learning models can be used to learn compact latent representations from a stream of incoming data. However, most deep learning models are trained end-to-end and do not incorporate any prior knowledge (e.g., from physics) of the vehicle in the architecture. In this direction, many works have explored physics-infused neural network (PINN) architectures to infuse physics models during training. Inspired by this observation, we present a Kalman filter augmented recurrent neural network architecture to learn the latent representation of the traffic flow using front camera images only. We demonstrate the efficacy of the proposed model in both imitation and reinforcement learning settings using both simulated and real-world datasets. The results show that incorporating an explicit model of the vehicle (states estimated using Kalman filtering) in the end-to-end learning significantly increases performance.
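
For readers unfamiliar with the Kalman filtering used to estimate vehicle states, a minimal scalar predict/update step looks roughly like this. It is a textbook one-dimensional filter with an assumed linear model (state transition `a`, process noise `q`, measurement noise `r`), not the KARNet architecture, and all names and default values are illustrative.

```python
from typing import List, Sequence, Tuple


def kalman_step(x: float, p: float, z: float,
                q: float = 1e-2, r: float = 1.0,
                a: float = 1.0) -> Tuple[float, float]:
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    z:    new measurement of the state
    """
    # Predict: propagate the estimate and grow its uncertainty.
    x_pred = a * x
    p_pred = a * p * a + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


def kalman_filter(measurements: Sequence[float], x0: float = 0.0,
                  p0: float = 1.0, q: float = 1e-2,
                  r: float = 1.0) -> List[float]:
    """Run the filter over a sequence and return the state estimates."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        x, p = kalman_step(x, p, z, q=q, r=r)
        estimates.append(x)
    return estimates
```

The gain weights the measurement more heavily when the predicted variance is large relative to the measurement noise, which is the mechanism that lets an explicit state estimate complement a learned latent representation.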

Deep Reinforcement Learning-based Multi-objective Path Planning on the Off-road Terrain Environment for Ground Vehicles

Authors:Shuqiao Huang, Xiru Wu, Guoming Huang
Date:2023-05-23 07:53:35

Due to the vastly different energy consumption between up-slope and down-slope, a path with the shortest length on a complex off-road terrain environment (2.5D map) is not always the path with the least energy consumption. For any energy-sensitive vehicle, realizing a good trade-off between distance and energy consumption in 2.5D path planning is significantly meaningful. In this paper, we propose a deep reinforcement learning-based 2.5D multi-objective path planning method (DMOP). The DMOP can efficiently find the desired path in three steps: (1) Transform the high-resolution 2.5D map into a small-size map. (2) Use a trained deep Q network (DQN) to find the desired path on the small-size map. (3) Build the planned path to the original high-resolution map using a path-enhanced method. In addition, the hybrid exploration strategy and reward shaping theory are applied to train the DQN. The reward function is constructed with the information of terrain, distance, and border. Simulation results show that the proposed method can finish the multi-objective 2.5D path planning task with significantly high efficiency. With similar planned paths, the proposed method is more than 100 times faster than the A* method and 30 times faster than the H3DM method. Also, simulation proves that the method has powerful reasoning capability that enables it to perform arbitrary untrained planning tasks.

M-EMBER: Tackling Long-Horizon Mobile Manipulation via Factorized Domain Transfer

Authors:Bohan Wu, Roberto Martin-Martin, Li Fei-Fei
Date:2023-05-23 00:53:30

In this paper, we propose a method to create visuomotor mobile manipulation solutions for long-horizon activities. We propose to leverage the recent advances in simulation to train visual solutions for mobile manipulation. While previous works have shown success applying this procedure to autonomous visual navigation and stationary manipulation, applying it to long-horizon visuomotor mobile manipulation is still an open challenge that demands both perceptual and compositional generalization of multiple skills. In this work, we develop Mobile-EMBER, or M-EMBER, a factorized method that decomposes a long-horizon mobile manipulation activity into a repertoire of primitive visual skills, reinforcement-learns each skill, and composes these skills to a long-horizon mobile manipulation activity. On a mobile manipulation robot, we find that M-EMBER completes a long-horizon mobile manipulation activity, cleaning_kitchen, achieving a 53% success rate. This requires successfully planning and executing five factorized, learned visual skills.

Know your Enemy: Investigating Monte-Carlo Tree Search with Opponent Models in Pommerman

Authors:Jannis Weil, Johannes Czech, Tobias Meuser, Kristian Kersting
Date:2023-05-22 16:39:20

In combination with Reinforcement Learning, Monte-Carlo Tree Search has been shown to outperform human grandmasters in games such as Chess, Shogi and Go with little to no prior domain knowledge. However, most classical use cases only feature up to two players. Scaling the search to an arbitrary number of players presents a computational challenge, especially if decisions have to be planned over a longer time horizon. In this work, we investigate techniques that transform general-sum multiplayer games into single-player and two-player games that consider other agents to act according to given opponent models. For our evaluation, we focus on the challenging Pommerman environment which involves partial observability, a long time horizon and sparse rewards. In combination with our search methods, we investigate the phenomena of opponent modeling using heuristics and self-play. Overall, we demonstrate the effectiveness of our multiplayer search variants both in a supervised learning and reinforcement learning setting.

Road Planning for Slums via Deep Reinforcement Learning

Authors:Yu Zheng, Hongyuan Su, Jingtao Ding, Depeng Jin, Yong Li
Date:2023-05-22 14:18:28

Millions of slum dwellers suffer from poor accessibility to urban services due to inadequate road infrastructure within slums, and road planning for slums is critical to the sustainable development of cities. Existing re-blocking or heuristic methods are either time-consuming and fail to generalize to different slums, or yield sub-optimal road plans in terms of accessibility and construction costs. In this paper, we present a deep reinforcement learning based approach to automatically lay out roads for slums. We propose a generic graph model to capture the topological structure of a slum, and devise a novel graph neural network to select locations for the planned roads. Through masked policy optimization, our model can generate road plans that connect places in a slum at minimal construction costs. Extensive experiments on real-world slums in different countries verify the effectiveness of our model, which can significantly improve accessibility by 14.3% against existing baseline methods. Further investigations on transferring across different tasks demonstrate that our model can master road planning skills in simple scenarios and adapt them to much more complicated ones, indicating the potential of applying our model in real-world slum upgrading. The code and data are available at https://github.com/tsinghua-fib-lab/road-planning-for-slums.

FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation

Authors:Minho Heo, Youngwoon Lee, Doohyun Lee, Joseph J. Lim
Date:2023-05-22 08:29:00

Reinforcement learning (RL), imitation learning (IL), and task and motion planning (TAMP) have demonstrated impressive performance across various robotic manipulation tasks. However, these approaches have been limited to learning simple behaviors in current real-world manipulation benchmarks, such as pushing or pick-and-place. To enable more complex, long-horizon behaviors of an autonomous robot, we propose to focus on real-world furniture assembly, a complex, long-horizon robot manipulation task that requires addressing many current robotic manipulation challenges to solve. We present FurnitureBench, a reproducible real-world furniture assembly benchmark aimed at providing a low barrier for entry and being easily reproducible, so that researchers across the world can reliably test their algorithms and compare them against prior work. For ease of use, we provide 200+ hours of pre-collected data (5000+ demonstrations), 3D printable furniture models, a robotic environment setup guide, and systematic task initialization. Furthermore, we provide FurnitureSim, a fast and realistic simulator of FurnitureBench. We benchmark the performance of offline RL and IL algorithms on our assembly tasks and demonstrate the need to improve such algorithms to be able to solve our tasks in the real world, providing ample opportunities for future research.

Learn to Flap: Foil Non-parametric Path Planning via Deep Reinforcement Learning

Authors:Z. P. Wang, R. J. Lin, Z. Y. Zhao, P. M. Guo, N. Yang, D. X. Fan
Date:2023-05-22 03:50:16

To optimize flapping foil performance, the application of deep reinforcement learning (DRL) on controlling foil non-parametric motion is conducted in the present study. Traditional control techniques and simplified motions cannot fully model nonlinear, unsteady and high-dimensional foil-vortex interactions. A DRL-training framework based on Proximal Policy Optimization and Transformer architecture is proposed. The policy is initialized from the sinusoidal expert display. We first demonstrate the effectiveness of the proposed DRL-training framework which can optimize foil motion while enhancing foil generated thrust. By adjusting reward setting and action threshold, the DRL-optimized foil trajectories can gain further enhancement compared to sinusoidal motion. Via flow analysis of wake morphology and instantaneous pressure distributions, it is found that the DRL-optimized foil can adaptively adjust the phases between motion and shedding vortices to improve hydrodynamic performance. Our results give a hint for solving complex fluid manipulation problems through DRL method.

Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Authors:Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur
Date:2023-05-19 17:44:34

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.

Massively Scalable Inverse Reinforcement Learning in Google Maps

Authors:Matt Barnes, Matthew Abueg, Oliver F. Lange, Matt Deeds, Jason Trader, Denali Molitor, Markus Wulfmeier, Shawn O'Banion
Date:2023-05-18 20:14:28

Inverse reinforcement learning (IRL) offers a powerful and general framework for learning humans' latent preferences in route recommendation, yet no approach has successfully addressed planetary-scale problems with hundreds of millions of states and demonstration trajectories. In this paper, we introduce scaling techniques based on graph compression, spatial parallelization, and improved initialization conditions inspired by a connection to eigenvector algorithms. We revisit classic IRL methods in the routing context, and make the key observation that there exists a trade-off between the use of cheap, deterministic planners and expensive yet robust stochastic policies. This insight is leveraged in Receding Horizon Inverse Planning (RHIP), a new generalization of classic IRL algorithms that provides fine-grained control over performance trade-offs via its planning horizon. Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published study of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies.

LIMA: Less Is More for Alignment

Authors:Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy
Date:2023-05-18 17:45:22

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning

Authors:Wenhao Li, Dan Qiao, Baoxiang Wang, Xiangfeng Wang, Bo Jin, Hongyuan Zha
Date:2023-05-18 10:37:54

The difficulty of appropriately assigning credit is particularly heightened in cooperative MARL with sparse reward, due to the concurrent time and structural scales involved. Automatic subgoal generation (ASG) has recently emerged as a viable MARL approach inspired by utilizing subgoals in intrinsically motivated reinforcement learning. However, end-to-end learning of complex task planning from sparse rewards without prior knowledge undoubtedly requires massive training samples. Moreover, the diversity-promoting nature of existing ASG methods can lead to the "over-representation" of subgoals, generating numerous spurious subgoals of limited relevance to the actual task reward and thus decreasing the sample efficiency of the algorithm. To address this problem and inspired by the disentangled representation learning, we propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA), that prompts pretrained language models with chain-of-thought that can suggest potential goals, provide suitable goal decomposition and subgoal allocation as well as self-reflection-based replanning. Additionally, SAMA incorporates language-grounded RL to train each agent's subgoal-conditioned policy. SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods, as evidenced by its performance on two challenging sparse-reward tasks, Overcooked and MiniRTS.

Curriculum Learning in Job Shop Scheduling using Reinforcement Learning

Authors:Constantin Waubert de Puiseau, Hasan Tercan, Tobias Meisen
Date:2023-05-17 13:15:27

Solving job shop scheduling problems (JSSPs) with a fixed strategy, such as a priority dispatching rule, may yield satisfactory results for several problem instances but, nevertheless, insufficient results for others. From this single-strategy perspective, finding a near optimal solution to a specific JSSP varies in difficulty even if the machine setup remains the same. A recent intensively researched and promising method to deal with difficulty variability is Deep Reinforcement Learning (DRL), which dynamically adjusts an agent's planning strategy in response to difficult instances not only during training, but also when applied to new situations. In this paper, we further improve DRL as an underlying method by actively incorporating the variability of difficulty within the same problem size into the design of the learning process. We base our approach on a state-of-the-art methodology that solves JSSP by means of DRL and graph neural network embeddings. Our work supplements the training routine of the agent by a curriculum learning strategy that ranks the problem instances shown during training by a new metric of problem instance difficulty. Our results show that certain curricula lead to significantly better performances of the DRL solutions. Agents trained on these curricula beat the top performance of those trained on randomly distributed training data, reaching 3.2% shorter average makespans.
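
The idea of ranking training instances by a difficulty metric can be sketched in a few lines. The paper's actual metric and the curricula it found effective are not given in the abstract, so `difficulty` is left as a caller-supplied function, and the helper names and the `easy_first` flag are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def build_curriculum(instances: Sequence[T],
                     difficulty: Callable[[T], float],
                     easy_first: bool = True) -> List[T]:
    """Order problem instances by a caller-supplied difficulty score."""
    return sorted(instances, key=difficulty, reverse=not easy_first)


def batches(ordered: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive training batches that follow the curriculum order."""
    for start in range(0, len(ordered), batch_size):
        yield list(ordered[start:start + batch_size])
```

A training loop would then iterate over `batches(build_curriculum(instances, difficulty), batch_size)` instead of sampling instances uniformly at random; whether an easy-first or hard-first ordering helps is an empirical question.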

On the Difficulty of Intersection Checking with Polynomial Zonotopes

Authors:Yushen Huang, Ertai Luo, Stanley Bak, Yifan Sun
Date:2023-05-17 02:21:24

Polynomial zonotopes, a non-convex set representation, have a wide range of applications from real-time motion planning and control in robotics, to reachability analysis of nonlinear systems and safety shielding in reinforcement learning. Despite this widespread use, a frequently overlooked difficulty associated with polynomial zonotopes is intersection checking. Determining whether the reachable set, represented as a polynomial zonotope, intersects an unsafe set is not straightforward. In fact, we show that this fundamental operation is NP-hard, even for a simple class of polynomial zonotopes. The standard method for intersection checking with polynomial zonotopes is a two-part algorithm that overapproximates a polynomial zonotope with a regular zonotope and then, if the overapproximation error is deemed too large, splits the set and recursively tries again. Beyond the possible need for a large number of splits, we identify two sources of concern related to this algorithm: (1) overapproximating a polynomial zonotope with a zonotope has unbounded error, and (2) after splitting a polynomial zonotope, the overapproximation error can actually increase. Taken together, this implies there may be a possibility that the algorithm does not always terminate. We perform a rigorous analysis of the method and detail necessary conditions for the union of overapproximations to provably converge to the original polynomial zonotope.

RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Authors:Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, Niket Tandon
Date:2023-05-15 17:57:16

Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.

Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Authors:Kaixuan Ji, Qingyue Zhao, Jiafan He, Weitong Zhang, Quanquan Gu
Date:2023-05-15 05:37:32

Recent studies have shown that episodic reinforcement learning (RL) is no harder than bandits when the total reward is bounded by $1$, and proved regret bounds that have a polylogarithmic dependence on the planning horizon $H$. However, it remains an open question whether such results can be carried over to adversarial RL, where the reward is adversarially chosen at each episode. In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen reward, our algorithm employs (1) a variance-uncertainty-aware weighted least square estimator for the transition kernel; and (2) an occupancy measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an $\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces. We also provide hardness results and regret lower bounds to justify the near optimality of our algorithm and the unavoidability of $\log|\mathcal{S}|$ and $\log|\mathcal{A}|$ in the regret bound.

Towards Theoretical Understanding of Data-Driven Policy Refinement

Authors:Ali Baheri
Date:2023-05-11 13:36:21

This paper presents an approach for data-driven policy refinement in reinforcement learning, specifically designed for safety-critical applications. Our methodology leverages the strengths of data-driven optimization and reinforcement learning to enhance policy safety and optimality through iterative refinement. Our principal contribution lies in the mathematical formulation of this data-driven policy refinement concept. This framework systematically improves reinforcement learning policies by learning from counterexamples identified during data-driven verification. Furthermore, we present a series of theorems elucidating key theoretical properties of our approach, including convergence, robustness bounds, generalization error, and resilience to model mismatch. These results not only validate the effectiveness of our methodology but also contribute to a deeper understanding of its behavior in different environments and scenarios.

Optimizing Memory Mapping Using Deep Reinforcement Learning

Authors:Pengming Wang, Mikita Sazanovich, Berkin Ilbeyi, Phitchaya Mangpo Phothilimthana, Manish Purohit, Han Yang Tay, Ngân Vũ, Miaosen Wang, Cosmin Paduraru, Edouard Leurent, Anton Zhernov, Po-Sen Huang, Julian Schrittwieser, Thomas Hubert, Robert Tung, Paula Kurylowicz, Kieran Milan, Oriol Vinyals, Daniel J. Mankowitz
Date:2023-05-11 11:55:16

Resource scheduling and allocation is a critical component of many high impact systems ranging from congestion control to cloud computing. Finding more optimal solutions to these problems often has significant impact on resource and time savings, reducing device wear-and-tear, and even potentially improving carbon emissions. In this paper, we focus on a specific instance of a scheduling problem, namely the memory mapping problem that occurs during compilation of machine learning programs: That is, mapping tensors to different memory layers to optimize execution time. We introduce an approach for solving the memory mapping problem using Reinforcement Learning. RL is a solution paradigm well-suited for sequential decision making problems that are amenable to planning, and combinatorial search spaces with high-dimensional data inputs. We formulate the problem as a single-player game, which we call the mallocGame, such that high-reward trajectories of the game correspond to efficient memory mappings on the target hardware. We also introduce a Reinforcement Learning agent, mallocMuZero, and show that it is capable of playing this game to discover new and improved memory mapping solutions that lead to faster execution times on real ML workloads on ML accelerators. We compare the performance of mallocMuZero to the default solver used by the Accelerated Linear Algebra (XLA) compiler on a benchmark of realistic ML workloads. In addition, we show that mallocMuZero is capable of improving the execution time of the recently published AlphaTensor matrix multiplication model.

An Option-Dependent Analysis of Regret Minimization Algorithms in Finite-Horizon Semi-Markov Decision Processes

Authors:Gianluca Drappo, Alberto Maria Metelli, Marcello Restelli
Date:2023-05-10 15:00:05

A large variety of real-world Reinforcement Learning (RL) tasks is characterized by a complex and heterogeneous structure that makes end-to-end (or flat) approaches hardly applicable or even infeasible. Hierarchical Reinforcement Learning (HRL) provides general solutions to address these problems thanks to a convenient multi-level decomposition of the tasks, making their solution accessible. Although often used in practice, few works provide theoretical guarantees to justify this outcome effectively. Thus, it is not yet clear when to prefer such approaches compared to standard flat ones. In this work, we provide an option-dependent upper bound to the regret suffered by regret minimization algorithms in finite-horizon problems. We illustrate that the performance improvement derives from the planning horizon reduction induced by the temporal abstraction enforced by the hierarchical structure. Then, focusing on a sub-setting of HRL approaches, the options framework, we highlight how the average duration of the available options affects the planning horizon and, consequently, the regret itself. Finally, we relax the assumption of having pre-trained options to show how in particular situations, learning hierarchically from scratch could be preferable to using a standard approach.

Active Semantic Localization with Graph Neural Embedding

Authors:Mitsuki Yoshida, Kanji Tanaka, Ryogo Yamamoto, Daiki Iwata
Date:2023-05-10 13:45:42

Semantic localization, i.e., robot self-localization with semantic image modality, is critical in recently emerging embodied AI applications (e.g., point-goal navigation, object-goal navigation, vision language navigation) and topological mapping applications (e.g., graph neural SLAM, ego-centric topological map). However, most existing works on semantic localization focus on passive vision tasks without viewpoint planning, or rely on additional rich modalities (e.g., depth measurements). Thus, the problem is largely unsolved. In this work, we explore a lightweight, entirely CPU-based, domain-adaptive semantic localization framework, called graph neural localizer. Our approach is inspired by two recently emerging technologies: (1) the scene graph, which combines the viewpoint- and appearance-invariance of local and global features; and (2) the graph neural network, which enables direct learning/recognition of graph data (i.e., non-vector data). Specifically, a graph convolutional neural network is first trained as a scene graph classifier for passive vision, and then its knowledge is transferred to a reinforcement-learning planner for active vision. Experiments on two scenarios, self-supervised learning and unsupervised domain adaptation, using a photo-realistic Habitat simulator validate the effectiveness of the proposed method.

Safe Deep RL for Intraoperative Planning of Pedicle Screw Placement

Authors:Yunke Ao, Hooman Esfandiari, Fabio Carrillo, Yarden As, Mazda Farshad, Benjamin F. Grewe, Andreas Krause, Philipp Fuernstahl
Date:2023-05-09 11:42:53

Spinal fusion surgery requires highly accurate implantation of pedicle screw implants, which must be conducted in critical proximity to vital structures with a limited view of anatomy. Robotic surgery systems have been proposed to improve placement accuracy; however, state-of-the-art systems suffer from the limitations of open-loop approaches, as they follow traditional concepts of preoperative planning and intraoperative registration, without real-time recalculation of the surgical plan. In this paper, we propose an intraoperative planning approach for robotic spine surgery that leverages real-time observation for drill path planning based on Safe Deep Reinforcement Learning (DRL). The main contributions of our method are (1) the capability to guarantee safe actions by introducing an uncertainty-aware distance-based safety filter; and (2) the ability to compensate for incomplete intraoperative anatomical information by encoding a-priori knowledge about anatomical structures with a network pre-trained on high-fidelity anatomical models. Planning quality was assessed by quantitative comparison with the gold standard (GS) drill planning. In experiments with 5 models derived from real magnetic resonance imaging (MRI) data, our approach was capable of achieving 90% bone penetration with respect to the GS while satisfying safety requirements, even under observation and motion uncertainty. To the best of our knowledge, our approach is the first safe DRL approach focusing on orthopedic surgeries.

Sense, Imagine, Act: Multimodal Perception Improves Model-Based Reinforcement Learning for Head-to-Head Autonomous Racing

Authors:Elena Shrestha, Chetan Reddy, Hanxi Wan, Yulun Zhuang, Ram Vasudevan
Date:2023-05-08 14:49:02

Model-based reinforcement learning (MBRL) techniques have recently yielded promising results for real-world autonomous racing using high-dimensional observations. MBRL agents, such as Dreamer, solve long-horizon tasks by building a world model and planning actions by latent imagination. This approach involves explicitly learning a model of the system dynamics and using it to learn the optimal policy for continuous control over multiple timesteps. As a result, MBRL agents may converge to sub-optimal policies if the world model is inaccurate. To improve state estimation for autonomous racing, this paper proposes a self-supervised sensor fusion technique that combines egocentric LiDAR and RGB camera observations collected from the F1TENTH Gym. The zero-shot performance of MBRL agents is empirically evaluated on unseen tracks and against a dynamic obstacle. This paper illustrates that multimodal perception improves robustness of the world model without requiring additional training data. The resulting multimodal Dreamer agent safely avoided collisions and won the most races compared to other tested baselines in zero-shot head-to-head autonomous racing.

Train a Real-world Local Path Planner in One Hour via Partially Decoupled Reinforcement Learning and Vectorized Diversity

Authors:Jinghao Xin, Jinwoo Kim, Zhi Li, Ning Li
Date:2023-05-07 03:39:31

Deep Reinforcement Learning (DRL) has exhibited efficacy in resolving the Local Path Planning (LPP) problem. However, such application in the real world is immensely limited due to the deficient training efficiency and generalization capability of DRL. To alleviate these two issues, a solution named Color is proposed, which consists of an Actor-Sharer-Learner (ASL) training framework and a mobile robot-oriented simulator Sparrow. Specifically, the ASL intends to improve the training efficiency of DRL algorithms. It employs a Vectorized Data Collection (VDC) mode to expedite data acquisition, decouples the data collection from model optimization by multithreading, and partially connects the two procedures by harnessing a Time Feedback Mechanism (TFM) to evade data underuse or overuse. Meanwhile, the Sparrow simulator utilizes a 2D grid-based world, simplified kinematics, and conversion-free data flow to achieve a lightweight design. The lightness facilitates vectorized diversity, allowing diversified simulation setups across extensive copies of the vectorized environments, resulting in a notable enhancement in the generalization capability of the DRL algorithm being trained. Comprehensive experiments, comprising 57 DRL benchmark environments, 32 simulated and 36 real-world LPP scenarios, have been conducted to corroborate the superiority of our method in terms of efficiency and generalization. The code and the video of this paper are accessible at https://github.com/XinJingHao/Color.

AI-based Radio and Computing Resource Allocation and Path Planning in NOMA NTNs: AoI Minimization under CSI Uncertainty

Authors:Maryam Ansarifard, Nader Mokari, Mohammadreza Javan, Hamid Saeedi, Eduard A. Jorswieck
Date:2023-05-01 11:52:15

In this paper, we develop a hierarchical aerial computing framework composed of a high altitude platform (HAP) and unmanned aerial vehicles (UAVs) to compute the fully offloaded tasks of terrestrial mobile users, which are connected through uplink non-orthogonal multiple access (UL-NOMA). To better assess the freshness of information in computation-intensive applications, the age of information (AoI) criterion is considered. In particular, the problem is formulated to minimize the average AoI of users with elastic tasks by adjusting the UAVs' trajectories and the resource allocation on both UAVs and HAP, subject to channel state information (CSI) uncertainty and multiple resource constraints of the UAVs and HAP. To solve this non-convex optimization problem, two methods, multi-agent deep deterministic policy gradient (MADDPG) and federated reinforcement learning (FRL), are proposed to design the UAVs' trajectories and obtain channel, power, and CPU allocations. It is shown that task scheduling significantly reduces the average AoI, and that this improvement is more pronounced for larger task sizes. In contrast, power allocation is shown to have only a marginal effect on the average AoI compared to using full transmission power for all users. Compared with traditional transmission schemes, the simulation results show that our scheduling scheme results in a substantial improvement in average AoI.

Joint Learning of Policy with Unknown Temporal Constraints for Safe Reinforcement Learning

Authors:Lunet Yifru, Ali Baheri
Date:2023-04-30 21:15:07

In many real-world applications, safety constraints for reinforcement learning (RL) algorithms are either unknown or not explicitly defined. We propose a framework that concurrently learns safety constraints and optimal RL policies in such environments, supported by theoretical guarantees. Our approach merges a logically-constrained RL algorithm with an evolutionary algorithm to synthesize signal temporal logic (STL) specifications. The framework is underpinned by theorems that establish the convergence of our joint learning process and provide error bounds between the discovered policy and the true optimal policy. We showcased our framework in grid-world environments, successfully identifying both acceptable safety constraints and RL policies while demonstrating the effectiveness of our theorems in practice.

Model-free Motion Planning of Autonomous Agents for Complex Tasks in Partially Observable Environments

Authors:Junchao Li, Mingyu Cai, Zhen Kan, Shaoping Xiao
Date:2023-04-30 19:57:39

Motion planning of autonomous agents in partially known environments with incomplete information is a challenging problem, particularly for complex tasks. This paper proposes a model-free reinforcement learning approach to address this problem. We formulate motion planning as a probabilistic-labeled partially observable Markov decision process (PL-POMDP) problem and use linear temporal logic (LTL) to express the complex task. The LTL formula is then converted to a limit-deterministic generalized Büchi automaton (LDGBA). The problem is redefined as finding an optimal policy on the product of PL-POMDP with LDGBA based on model-checking techniques to satisfy the complex task. We implement deep Q learning with long short-term memory (LSTM) to process the observation history and task recognition. Our contributions include the proposed method, the utilization of LTL and LDGBA, and the LSTM-enhanced deep Q learning. We demonstrate the applicability of the proposed method by conducting simulations in various environments, including grid worlds, a virtual office, and a multi-agent warehouse. The simulation results demonstrate that our proposed method effectively addresses environment, action, and observation uncertainties. This indicates its potential for real-world applications, including the control of unmanned aerial vehicles (UAVs).

Posterior Sampling for Deep Reinforcement Learning

Authors:Remo Sasso, Michelangelo Conserva, Paulo Rauber
Date:2023-04-30 13:23:50

Despite remarkable successes, deep reinforcement learning algorithms remain sample inefficient: they require an enormous amount of trial and error to find good policies. Model-based algorithms promise sample efficiency by building an environment model that can be used for planning. Posterior Sampling for Reinforcement Learning is such a model-based algorithm that has attracted significant interest due to its performance in the tabular setting. This paper introduces Posterior Sampling for Deep Reinforcement Learning (PSDRL), the first truly scalable approximation of Posterior Sampling for Reinforcement Learning that retains its model-based essence. PSDRL combines efficient uncertainty quantification over latent state space models with a specially tailored continual planning algorithm based on value-function approximation. Extensive experiments on the Atari benchmark show that PSDRL significantly outperforms previous state-of-the-art attempts at scaling up posterior sampling while being competitive with a state-of-the-art (model-based) reinforcement learning method, both in sample efficiency and computational efficiency.

Model Extraction Attacks Against Reinforcement Learning Based Controllers

Authors:Momina Sajid, Yanning Shen, Yasser Shoukry
Date:2023-04-25 18:48:42

We introduce the problem of model-extraction attacks in cyber-physical systems in which an attacker attempts to estimate (or extract) the feedback controller of the system. Extracting (or estimating) the controller provides an unmatched edge to attackers since it allows them to predict the future control actions of the system and plan their attack accordingly. Hence, it is important to understand the ability of the attackers to perform such an attack. In this paper, we focus on the setting when a Deep Neural Network (DNN) controller is trained using Reinforcement Learning (RL) algorithms and is used to control a stochastic system. We play the role of the attacker that aims to estimate such an unknown DNN controller, and we propose a two-phase algorithm. In the first phase, also called the offline phase, the attacker uses side-channel information about the RL-reward function and the system dynamics to identify a set of candidate estimates of the unknown DNN. In the second phase, also called the online phase, the attacker observes the behavior of the unknown DNN and uses these observations to shortlist the set of final policy estimates. We provide theoretical analysis of the error between the unknown DNN and the estimated one. We also provide numerical results showing the effectiveness of the proposed algorithm.

A optimization framework for herbal prescription planning based on deep reinforcement learning

Authors:Kuo Yang, Zecong Yu, Xin Su, Xiong He, Ning Wang, Qiguang Zheng, Feidie Yu, Zhuang Liu, Tiancai Wen, Xuezhong Zhou
Date:2023-04-25 13:55:02

Treatment planning for chronic diseases is a critical task in medical artificial intelligence, particularly in traditional Chinese medicine (TCM). However, generating optimized sequential treatment strategies for patients with chronic diseases in different clinical encounters remains a challenging issue that requires further exploration. In this study, we proposed a TCM herbal prescription planning framework based on deep reinforcement learning for chronic disease treatment (PrescDRL). PrescDRL is a sequential herbal prescription optimization model that focuses on long-term effectiveness rather than achieving maximum reward at every step, thereby ensuring better patient outcomes. We constructed a high-quality benchmark dataset for sequential diagnosis and treatment of diabetes and evaluated PrescDRL against this benchmark. Our results showed that PrescDRL achieved a higher curative effect, with the single-step reward improving by 117% and 153% compared to doctors. Furthermore, PrescDRL outperformed the benchmark in prescription prediction, with precision improving by 40.5% and recall improving by 63%. Overall, our study demonstrates the potential of using artificial intelligence to improve clinical intelligent diagnosis and treatment in TCM.

When to Replan? An Adaptive Replanning Strategy for Autonomous Navigation using Deep Reinforcement Learning

Authors:Kohei Honda, Ryo Yonetani, Mai Nishimura, Tadashi Kozuno
Date:2023-04-24 12:39:36

The hierarchy of global and local planners is one of the most commonly utilized system designs in autonomous robot navigation. While the global planner generates a reference path from the current to goal locations based on the pre-built map, the local planner produces a kinodynamic trajectory to follow the reference path while avoiding perceived obstacles. To account for unforeseen or dynamic obstacles not present on the pre-built map, "when to replan" the reference path is critical for the success of safe and efficient navigation. However, determining the ideal timing to execute replanning in such partially unknown environments still remains an open question. In this work, we first conduct an extensive simulation experiment to compare several common replanning strategies and confirm that effective strategies are highly dependent on the environment as well as the global and local planners. Based on this insight, we then derive a new adaptive replanning strategy based on deep reinforcement learning, which can learn from experience to decide appropriate replanning timings in the given environment and planning setups. Our experimental results show that the proposed replanner can perform on par or even better than the current best-performing strategies in multiple situations regarding navigation robustness and efficiency.

A Review of Symbolic, Subsymbolic and Hybrid Methods for Sequential Decision Making

Authors:Carlos Núñez-Molina, Pablo Mesejo, Juan Fernández-Olivares
Date:2023-04-20 18:22:30

In the field of Sequential Decision Making (SDM), two paradigms have historically vied for supremacy: Automated Planning (AP) and Reinforcement Learning (RL). In the spirit of reconciliation, this article reviews AP, RL and hybrid methods (e.g., novel learn to plan techniques) for solving Sequential Decision Processes (SDPs), focusing on their knowledge representation: symbolic, subsymbolic, or a combination. Additionally, it also covers methods for learning the SDP structure. Finally, we compare the advantages and drawbacks of the existing methods and conclude that neurosymbolic AI poses a promising approach for SDM, since it combines AP and RL with a hybrid knowledge representation.

Topological Guided Actor-Critic Modular Learning of Continuous Systems with Temporal Objectives

Authors:Lening Li, Zhentian Qian
Date:2023-04-20 01:36:05

This work investigates the formal policy synthesis of continuous-state stochastic dynamic systems given high-level specifications in linear temporal logic. To learn an optimal policy that maximizes the satisfaction probability, we take a product between a dynamic system and the translated automaton to construct a product system on which we solve an optimal planning problem. Since this product system has a hybrid product state space that results in reward sparsity, we introduce a generalized optimal backup order, in reverse to the topological order, to guide the value backups and accelerate the learning process. We provide the optimality proof for using the generalized optimal backup order in this optimal planning problem. Further, this paper presents an actor-critic reinforcement learning algorithm when topological order applies. This algorithm leverages advanced mathematical techniques and enjoys the property of hyperparameter self-tuning. We provide proof of the optimality and convergence of our proposed reinforcement learning algorithm. We use neural networks to approximate the value function and policy function for hybrid product state space. Furthermore, we observe that assigning integer numbers to automaton states can rank the value or policy function approximated by neural networks. To break the ordinal relationship, we use an individual neural network for each automaton state's value (policy) function, termed modular learning. We conduct two experiments. First, to show the efficacy of our reinforcement learning algorithm, we compare it with baselines on a classic control task, CartPole. Second, we demonstrate the empirical performance of our formal policy synthesis framework on motion planning of a Dubins car with a temporal specification.

Robust Route Planning with Distributional Reinforcement Learning in a Stochastic Road Network Environment

Authors:Xi Lin, Paul Szenher, John D. Martin, Brendan Englot
Date:2023-04-19 22:12:12

Route planning is essential to mobile robot navigation problems. In recent years, deep reinforcement learning (DRL) has been applied to learning optimal planning policies in stochastic environments without prior knowledge. However, existing works focus on learning policies that maximize the expected return, the performance of which can vary greatly when the level of stochasticity in the environment is high. In this work, we propose a distributional reinforcement learning based framework that learns return distributions which explicitly reflect environmental stochasticity. Policies based on the second-order stochastic dominance (SSD) relation can be used to make adjustable route decisions according to user preference on performance robustness. Our proposed method is evaluated in a simulated road network environment, and experimental results show that our method is able to plan the shortest routes that minimize stochasticity in travel time when robustness is preferred, while other state-of-the-art DRL methods are agnostic to environmental stochasticity.
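As a rough illustration of the second-order stochastic dominance (SSD) relation mentioned in the abstract above, and not the authors' implementation, the sketch below compares two sets of sampled returns. It assumes equally weighted samples of the same size, in which case one set dominates the other exactly when every partial sum of its sorted samples is at least the corresponding partial sum of the other. The function name `ssd_dominates` and the sample data are hypothetical.

```python
from itertools import accumulate


def ssd_dominates(returns_a, returns_b, tol=1e-12):
    """Return True if sample set A second-order stochastically dominates B.

    Assumption of this sketch: both inputs are equally weighted samples
    of the same size.
    """
    if len(returns_a) != len(returns_b):
        raise ValueError("this sketch assumes equal sample counts")
    # Partial sums of the sorted samples, worst outcomes first.
    cum_a = list(accumulate(sorted(returns_a)))
    cum_b = list(accumulate(sorted(returns_b)))
    return all(a >= b - tol for a, b in zip(cum_a, cum_b))


# Hypothetical returns (negative travel times) for two routes with equal mean.
steady_route = [-10.0, -10.0, -10.0, -10.0]
risky_route = [-4.0, -8.0, -12.0, -16.0]

steady_preferred = ssd_dominates(steady_route, risky_route)
risky_preferred = ssd_dominates(risky_route, steady_route)
```

Both routes have the same mean return here, so an expected-return criterion cannot separate them, while the SSD check favors the route with less spread.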

Integrated Ray-Tracing and Coverage Planning Control using Reinforcement Learning

Authors:Savvas Papaioannou, Panayiotis Kolios, Theocharis Theocharides, Christos G. Panayiotou, Marios M. Polycarpou
Date:2023-04-19 13:06:55

In this work we propose a coverage planning control approach which allows a mobile agent, equipped with a controllable sensor (i.e., a camera) with limited sensing domain (i.e., finite sensing range and angle of view), to cover the surface area of an object of interest. The proposed approach integrates ray-tracing into the coverage planning process, thus allowing the agent to identify which parts of the scene are visible at any point in time. The problem of integrated ray-tracing and coverage planning control is first formulated as a constrained optimal control problem (OCP), which aims at determining the agent's optimal control inputs over a finite planning horizon, that minimize the coverage time. Efficiently solving the resulting OCP is however very challenging due to non-convex and non-linear visibility constraints. To overcome this limitation, the problem is converted into a Markov decision process (MDP) which is then solved using reinforcement learning. In particular, we show that a controller which follows an optimal control law can be learned using off-policy temporal-difference control (i.e., Q-learning). Extensive numerical experiments demonstrate the effectiveness of the proposed approach for various configurations of the agent and the object of interest.
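To make the off-policy temporal-difference control step concrete, here is a minimal tabular Q-learning sketch on a toy chain MDP. It is not the coverage planning problem from the abstract above: the chain environment, the step cost of -1 (a stand-in for minimizing time), and all names and hyperparameters are assumptions chosen for illustration.

```python
import random


def train_chain_q_learning(n_states=5, episodes=2000, alpha=0.5, gamma=1.0,
                           epsilon=0.1, max_steps=200, seed=0):
    """Tabular Q-learning on a toy chain; the goal is the right-most state."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[s][a] with a = 0 (move left) or a = 1 (move right).
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Epsilon-greedy behaviour policy (ties go to "right").
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            step = 1 if action == 1 else -1
            next_state = min(max(state + step, 0), goal)
            reward = -1.0  # each step costs one unit of time
            done = next_state == goal
            # Off-policy target: bootstrap from the greedy value of next_state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train_chain_q_learning()
# Greedy action in each non-terminal state: 0 = left, 1 = right.
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

The update is off-policy because the target uses the maximum over next-state values regardless of which action the epsilon-greedy behaviour policy takes next. On this toy chain the greedy policy should move right in every non-terminal state.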

Torque-based Deep Reinforcement Learning for Task-and-Robot Agnostic Learning on Bipedal Robots Using Sim-to-Real Transfer

Authors:Donghyeon Kim, Glen Berseth, Mathew Schwartz, Jaeheung Park
Date:2023-04-19 06:00:40

In this paper, we review the question of which action space is best suited for controlling a real biped robot in combination with Sim2Real training. Position control has been popular as it has been shown to be more sample efficient and intuitive to combine with other planning algorithms. However, for position control, gain tuning is required to achieve the best possible policy performance. We show that instead, using a torque-based action space enables task-and-robot agnostic learning with less parameter tuning and mitigates the sim-to-reality gap by taking advantage of torque control's inherent compliance. Also, we accelerate the torque-based-policy training process by pre-training the policy to remain upright by compensating for gravity. The paper showcases the first successful sim-to-real transfer of a torque-based deep reinforcement learning policy on a real human-sized biped robot. The video is available at https://youtu.be/CR6pTS39VRE.

Deep Explainable Relational Reinforcement Learning: A Neuro-Symbolic Approach

Authors:Rishi Hazra, Luc De Raedt
Date:2023-04-17 15:11:40

Despite numerous successes in Deep Reinforcement Learning (DRL), the learned policies are not interpretable. Moreover, since DRL does not exploit symbolic relational representations, it has difficulties in coping with structural changes in its environment (such as increasing the number of objects). Relational Reinforcement Learning, on the other hand, inherits the relational representations from symbolic planning to learn reusable policies. However, it has so far been unable to scale up and exploit the power of deep neural networks. We propose Deep Explainable Relational Reinforcement Learning (DERRL), a framework that exploits the best of both the neural and symbolic worlds. By resorting to a neuro-symbolic approach, DERRL combines relational representations and constraints from symbolic planning with deep learning to extract interpretable policies. These policies are in the form of logical rules that explain how each decision (or action) is arrived at. Through several experiments, in setups like the Countdown Game, Blocks World, Gridworld, and Traffic, we show that the policies learned by DERRL can be applied to different configurations and contexts, hence generalizing to environmental modifications.

Integration of Reinforcement Learning Based Behavior Planning With Sampling Based Motion Planning for Automated Driving

Authors:Marvin Klimke, Benjamin Völz, Michael Buchholz
Date:2023-04-17 13:49:55

Reinforcement learning has received high research interest for developing planning approaches in automated driving. Most prior works consider the end-to-end planning task that yields direct control commands and rarely deploy their algorithm to real vehicles. In this work, we propose a method to employ a trained deep reinforcement learning policy for dedicated high-level behavior planning. By populating an abstract objective interface, established motion planning algorithms can be leveraged, which derive smooth and drivable trajectories. Given the current environment model, we propose to use a built-in simulator to predict the traffic scene for a given horizon into the future. The behavior of automated vehicles in mixed traffic is determined by querying the learned policy. To the best of our knowledge, this work is the first to apply deep reinforcement learning in this manner, and as such lacks a state-of-the-art benchmark. Thus, we validate the proposed approach by comparing an idealistic single-shot plan with cyclic replanning through the learned policy. Experiments with a real testing vehicle on proving grounds demonstrate the potential of our approach to shrink the simulation-to-real-world gap of deep reinforcement learning based planning approaches. Additional simulative analyses reveal that more complex multi-agent maneuvers can be managed by employing the cyclic replanning approach.

Deep reinforcement learning applied to an assembly sequence planning problem with user preferences

Authors:Miguel Neves, Pedro Neto
Date:2023-04-13 14:25:15

Deep reinforcement learning (DRL) has demonstrated its potential in solving complex manufacturing decision-making problems, especially in a context where the system learns over time with actual operation in the absence of training data. One interesting and challenging application for such methods is the assembly sequence planning (ASP) problem. In this paper, we propose an approach to the implementation of DRL methods in ASP. The proposed approach introduces parametric actions into the RL environment to improve training time and sample efficiency, and uses two different reward signals: (1) the user's preferences and (2) the total assembly time duration. The user's preferences signal addresses the difficulties and non-ergonomic properties of the assembly faced by the human, and the total assembly time signal enforces the optimization of the assembly. Three of the most powerful deep RL methods were studied, Advantage Actor-Critic (A2C), Deep Q-Learning (DQN), and Rainbow, in two different scenarios: a stochastic and a deterministic one. Finally, the performance of the DRL algorithms was compared to tabular Q-Learning's performance. After 10,000 episodes, the system achieved near-optimal behaviour with tabular Q-Learning, A2C, and Rainbow. For more complex scenarios, however, tabular Q-Learning is expected to underperform in comparison to the other two algorithms. The results support the potential for the application of deep reinforcement learning in assembly sequence planning problems with human interaction.

Human-Robot Skill Transfer with Enhanced Compliance via Dynamic Movement Primitives

Authors:Jayden Hong, Zengjie Zhang, Amir M. Soufi Enayati, Homayoun Najjaran
Date:2023-04-12 08:48:28

Finding an efficient way to adapt robot trajectory is a priority to improve overall performance of robots. One approach for trajectory planning is through transferring human-like skills to robots by Learning from Demonstrations (LfD). The human demonstration is considered the target motion to mimic. However, human motion is typically optimal for human embodiment but not for robots because of the differences between human biomechanics and robot dynamics. The Dynamic Movement Primitives (DMP) framework is a viable solution for this limitation of LfD, but it requires tuning the second-order dynamics in the formulation. Our contribution is introducing a systematic method to extract the dynamic features from human demonstration to auto-tune the parameters in the DMP framework. In addition to its use with LfD, another utility of the proposed method is that it can readily be used in conjunction with Reinforcement Learning (RL) for robot training. In this way, the extracted features facilitate the transfer of human skills by allowing the robot to explore the possible trajectories more efficiently and increasing robot compliance significantly. We introduced a methodology to extract the dynamic features from multiple trajectories based on the optimization of human-likeness and similarity in the parametric space. Our method was implemented into an actual human-robot setup to extract human dynamic features and used to regenerate the robot trajectories following both LfD and RL with DMP. It resulted in a stable performance of the robot, maintaining a high degree of human-likeness based on accumulated distance error as good as the best heuristic tuning.

Emergent autonomous scientific research capabilities of large language models

Authors:Daniil A. Boiko, Robert MacKnight, Gabe Gomes
Date:2023-04-11 16:50:17

Transformer-based large language models are rapidly advancing in the field of machine learning research, with applications spanning natural language, biology, chemistry, and computer programming. Extreme scaling and reinforcement learning from human feedback have significantly improved the quality of generated text, enabling these models to perform various tasks and reason about their choices. In this paper, we present an Intelligent Agent system that combines multiple large language models for autonomous design, planning, and execution of scientific experiments. We showcase the Agent's scientific research capabilities with three distinct examples, with the most complex being the successful performance of catalyzed cross-coupling reactions. Finally, we discuss the safety implications of such systems and propose measures to prevent their misuse.

Automaton-Guided Curriculum Generation for Reinforcement Learning Agents

Authors:Yash Shukla, Abhishek Kulkarni, Robert Wright, Alvaro Velasquez, Jivko Sinapov
Date:2023-04-11 15:14:31

Despite advances in Reinforcement Learning, many sequential decision making tasks remain prohibitively expensive and impractical to learn. Recently, approaches that automatically generate reward functions from logical task specifications have been proposed to mitigate this issue; however, they scale poorly on long-horizon tasks (i.e., tasks where the agent needs to perform a series of correct actions to reach the goal state, considering future transitions while choosing an action). Employing a curriculum (a sequence of increasingly complex tasks) further improves the learning speed of the agent by sequencing intermediate tasks suited to the learning capacity of the agent. However, generating curricula from the logical specification still remains an unsolved problem. To this end, we propose AGCL, Automaton-guided Curriculum Learning, a novel method for automatically generating curricula for the target task in the form of Directed Acyclic Graphs (DAGs). AGCL encodes the specification in the form of a deterministic finite automaton (DFA), and then uses the DFA along with the Object-Oriented MDP (OOMDP) representation to generate a curriculum as a DAG, where the vertices correspond to tasks, and edges correspond to the direction of knowledge transfer. Experiments in gridworld and physics-based simulated robotics domains show that the curricula produced by AGCL achieve improved time-to-threshold performance on a complex sequential decision-making problem relative to state-of-the-art curriculum learning (e.g., teacher-student, self-play) and automaton-guided reinforcement learning baselines (e.g., Q-Learning for Reward Machines). Further, we demonstrate that AGCL performs well even in the presence of noise in the task's OOMDP description, and also when distractor objects are present that are not modeled in the logical specification of the task's objectives.
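
To make the idea of encoding a task specification as a DFA more concrete, here is a minimal sketch of an automaton for a made-up "get the key, then open the door" task. The class name, state labels, and event symbols are hypothetical, and the sketch does not reproduce AGCL's curriculum generation; it only shows how a DFA can track progress through an ordered specification.

```python
class TaskDFA:
    """Tiny deterministic finite automaton over task-relevant events."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # maps (state, symbol) -> next state
        self.start = start
        self.accepting = accepting

    def step(self, state, symbol):
        # Events without a listed transition leave the state unchanged.
        return self.transitions.get((state, symbol), state)

    def run(self, symbols):
        state = self.start
        for symbol in symbols:
            state = self.step(state, symbol)
        return state

    def accepts(self, symbols):
        return self.run(symbols) in self.accepting


# Hypothetical specification: the key must be collected before the door opens.
dfa = TaskDFA(
    transitions={("q0", "got_key"): "q1", ("q1", "opened_door"): "q2"},
    start="q0",
    accepting={"q2"},
)
```

Each non-accepting state (here `q0` and `q1`) marks a natural intermediate sub-goal, which is the kind of structure a curriculum could be built around.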

Feudal Graph Reinforcement Learning

Authors:Tommaso Marzi, Arshjot Khehra, Andrea Cini, Cesare Alippi
Date:2023-04-11 09:51:13

Graph-based representations and message-passing modular policies constitute prominent approaches to tackling composable control problems in reinforcement learning (RL). However, as shown by recent graph deep learning literature, such local message-passing operators can create information bottlenecks and hinder global coordination. The issue becomes more serious in tasks requiring high-level planning. In this work, we propose a novel methodology, named Feudal Graph Reinforcement Learning (FGRL), that addresses such challenges by relying on hierarchical RL and a pyramidal message-passing architecture. In particular, FGRL defines a hierarchy of policies where high-level commands are propagated from the top of the hierarchy down through a layered graph structure. The bottom layers mimic the morphology of the physical system, while the upper layers correspond to higher-order sub-modules. The resulting agents are then characterized by a committee of policies where actions at a certain level set goals for the level below, thus implementing a hierarchical decision-making structure that can naturally implement task decomposition. We evaluate the proposed framework on a graph clustering problem and MuJoCo locomotion tasks; simulation results show that FGRL compares favorably against relevant baselines. Furthermore, an in-depth analysis of the command propagation mechanism provides evidence that the introduced message-passing scheme favors learning hierarchical decision-making policies.

Optimal Interpretability-Performance Trade-off of Classification Trees with Black-Box Reinforcement Learning

Authors:Hector Kohler, Riad Akrour, Philippe Preux
Date:2023-04-11 09:43:23

Interpretability of AI models allows for user safety checks to build trust in these models. In particular, decision trees (DTs) provide a global view of the learned model and clearly outline the role of the features that are critical to classify a given data point. However, interpretability is hindered if the DT is too large. To learn compact trees, a Reinforcement Learning (RL) framework has been recently proposed to explore the space of DTs. A given supervised classification task is modeled as a Markov decision problem (MDP) and then augmented with additional actions that gather information about the features, equivalent to building a DT. By appropriately penalizing these actions, the RL agent learns to optimally trade off size and performance of a DT. However, to do so, this RL agent has to solve a partially observable MDP. The main contribution of this paper is to prove that it is sufficient to solve a fully observable problem to learn a DT optimizing the interpretability-performance trade-off. As such, any planning or RL algorithm can be used. We demonstrate the effectiveness of this approach on a set of classical supervised classification datasets and compare our approach with other interpretability-performance optimizing methods.

RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning

Authors:Kevin Zakka, Philipp Wu, Laura Smith, Nimrod Gileadi, Taylor Howell, Xue Bin Peng, Sumeet Singh, Yuval Tassa, Pete Florence, Andy Zeng, Pieter Abbeel
Date:2023-04-09 03:53:05

Replicating human-like dexterity in robot hands represents one of the largest open problems in robotics. Reinforcement learning is a promising approach that has achieved impressive progress in the last few years; however, the class of problems it has typically addressed corresponds to a rather narrow definition of dexterity as compared to human capabilities. To address this gap, we investigate piano-playing, a skill that challenges even the human limits of dexterity, as a means to test high-dimensional control, and which requires high spatial and temporal precision, and complex finger coordination and planning. We introduce RoboPianist, a system that enables simulated anthropomorphic hands to learn an extensive repertoire of 150 piano pieces where traditional model-based optimization struggles. We additionally introduce an open-sourced environment, benchmark of tasks, interpretable evaluation metrics, and open challenges for future study. Our website featuring videos, code, and datasets is available at https://kzakka.com/robopianist/

A Reinforcement Learning-assisted Genetic Programming Algorithm for Team Formation Problem Considering Person-Job Matching

Authors:Yangyang Guo, Hao Wang, Lei He, Witold Pedrycz, P. N. Suganthan, Yanjie Song
Date:2023-04-08 14:32:12

An efficient team is essential for the company to successfully complete new projects. To solve the team formation problem considering person-job matching (TFP-PJM), a 0-1 integer programming model is constructed, which considers both person-job matching and team members' willingness to communicate on team efficiency, with the person-job matching score calculated using intuitionistic fuzzy numbers. Then, a reinforcement learning-assisted genetic programming algorithm (RL-GP) is proposed to enhance the quality of solutions. The RL-GP adopts the ensemble population strategies. Before the population evolution at each generation, the agent selects one from four population search modes according to the information obtained, thus realizing a sound balance of exploration and exploitation. In addition, surrogate models are used in the algorithm to evaluate the formation plans generated by individuals, which speeds up the algorithm learning process. Afterward, a series of comparison experiments are conducted to verify the overall performance of RL-GP and the effectiveness of the improved strategies within the algorithm. The hyper-heuristic rules obtained through efficient learning can be utilized as decision-making aids when forming project teams. This study reveals the advantages of reinforcement learning methods, ensemble strategies, and the surrogate model applied to the GP framework. The diversity and intelligent selection of search patterns along with fast adaptation evaluation, are distinct features that enable RL-GP to be deployed in real-world enterprise environments.

Evolving Reinforcement Learning Environment to Minimize Learner's Achievable Reward: An Application on Hardening Active Directory Systems

Authors:Diksha Goel, Aneta Neumann, Frank Neumann, Hung Nguyen, Mingyu Guo
Date:2023-04-08 12:39:40

We study a Stackelberg game between one attacker and one defender in a configurable environment. The defender picks a specific environment configuration. The attacker observes the configuration and attacks via a Reinforcement Learning (RL) policy trained against the observed environment. The defender's goal is to find the environment with minimum achievable reward for the attacker. We apply Evolutionary Diversity Optimization (EDO) to generate a diverse population of environments for training. Environments with clearly high rewards are killed off and replaced by new offspring to avoid wasting training time. Diversity not only improves training quality but also fits well with our RL scenario: RL agents tend to improve gradually, so a slightly worse environment earlier on may become better later. We demonstrate the effectiveness of our approach by focusing on a specific application, Active Directory (AD). AD is the default security management system for Windows domain networks. An AD environment describes an attack graph, where nodes represent computers/accounts/etc., and edges represent accesses. The attacker aims to find the best attack path to reach the highest-privilege node. The defender can change the graph by removing a limited number of edges (revoking accesses). Our approach generates better defensive plans than the existing approach and scales better.
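
The attack-graph setting described above can be illustrated with a toy example: a defender with a small edge-removal budget tries to cut every path from an entry node to the highest-privilege node. The sketch below uses exhaustive search and plain reachability in place of the paper's EDO defender and RL attacker, so it only illustrates the problem setup; the graph and node names are invented.

```python
from itertools import combinations


def reachable(edges, source, target):
    """Depth-first search over directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def blocking_edge_sets(edges, source, target, budget):
    """All edge sets of size `budget` whose removal disconnects source from target."""
    result = []
    for removed in combinations(edges, budget):
        remaining = [e for e in edges if e not in removed]
        if not reachable(remaining, source, target):
            result.append(set(removed))
    return result


# Invented attack graph: two independent routes from a user account to admin.
edges = [("user", "pc1"), ("user", "pc2"), ("pc1", "admin"), ("pc2", "admin")]
one_edge_plans = blocking_edge_sets(edges, "user", "admin", budget=1)
two_edge_plans = blocking_edge_sets(edges, "user", "admin", budget=2)
```

Exhaustive enumeration grows combinatorially with the graph size and budget, which hints at why a search heuristic is needed on realistic graphs.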

Decision-Focused Model-based Reinforcement Learning for Reward Transfer

Authors:Abhishek Sharma, Sonali Parbhoo, Omer Gottesman, Finale Doshi-Velez
Date:2023-04-06 20:47:09

Model-based reinforcement learning (MBRL) provides a way to learn a transition model of the environment, which can then be used to plan personalized policies for different patient cohorts and to understand the dynamics involved in the decision-making process. However, standard MBRL algorithms are either sensitive to changes in the reward function or achieve suboptimal performance on the task when the transition model is restricted. Motivated by the need to use simple and interpretable models in critical domains such as healthcare, we propose a novel robust decision-focused (RDF) algorithm that learns a transition model that achieves high returns while being robust to changes in the reward function. We demonstrate our RDF algorithm can be used with several model classes and planning algorithms. We also provide theoretical and empirical evidence, on a variety of simulators and real patient data, that RDF can learn simple yet effective models that can be used to plan personalized policies.

Finite Time Lyapunov Exponent Analysis of Model Predictive Control and Reinforcement Learning

Authors:Kartik Krishna, Steven L. Brunton, Zhuoyuan Song
Date:2023-04-06 18:43:48

Finite-time Lyapunov exponents (FTLEs) provide a powerful approach to compute time-varying analogs of invariant manifolds in unsteady fluid flow fields. These manifolds are useful to visualize the transport mechanisms of passive tracers advecting with the flow. However, many vehicles and mobile sensors are not passive, but are instead actuated according to some intelligent trajectory planning or control law; for example, model predictive control and reinforcement learning are often used to design energy-efficient trajectories in a dynamically changing background flow. In this work, we investigate the use of FTLE on such controlled agents to gain insight into optimal transport routes for navigation in known unsteady flows. We find that these controlled FTLE (cFTLE) coherent structures separate the flow field into different regions with similar costs of transport to the goal location. These separatrices are functions of the planning algorithm's hyper-parameters, such as the optimization time horizon and the cost of actuation. Computing the invariant sets and manifolds of active agent dynamics in dynamic flow fields is useful in the context of robust motion control, hyperparameter tuning, and determining safe and collision-free trajectories for autonomous systems. Moreover, these cFTLE structures provide insight into effective deployment locations for mobile agents with actuation and energy constraints to traverse the ocean or atmosphere.
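
For readers unfamiliar with the quantity, the standard textbook definition of the FTLE is sketched below. The notation (flow map $\Phi$, integration time $T$, Cauchy-Green tensor $C$) is the conventional one and is assumed here; the abstract itself does not fix any symbols, and the controlled variant (cFTLE) would replace the passive flow map with that of the actuated agent.

```latex
% Flow map: position at time t_0 + T of a trajectory that starts at x_0 at time t_0
\Phi_{t_0}^{t_0+T} : x_0 \mapsto x(t_0 + T; t_0, x_0)

% Right Cauchy-Green deformation tensor of the flow map
C(x_0) = \left( \nabla \Phi_{t_0}^{t_0+T}(x_0) \right)^{\top} \nabla \Phi_{t_0}^{t_0+T}(x_0)

% Finite-time Lyapunov exponent: growth rate of the largest stretching direction
\sigma_{t_0}^{T}(x_0) = \frac{1}{|T|} \ln \sqrt{\lambda_{\max}\big( C(x_0) \big)}
```

Ridges of the field $\sigma_{t_0}^{T}$ are what is typically visualized as the time-varying analogs of invariant manifolds.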

PyFlyt -- UAV Simulation Environments for Reinforcement Learning Research

Authors:Jun Jet Tai, Jim Wong, Mauro Innocente, Nadjim Horri, James Brusey, Swee King Phang
Date:2023-04-03 19:12:20

Unmanned aerial vehicles (UAVs) have numerous applications, but their efficient and optimal flight can be a challenge. Reinforcement Learning (RL) has emerged as a promising approach to address this challenge, yet there is no standardized library for testing and benchmarking RL algorithms on UAVs. In this paper, we introduce PyFlyt, a platform built on the Bullet physics engine with native Gymnasium API support. PyFlyt provides modular implementations of simple components, such as motors and lifting surfaces, allowing for the implementation of UAVs of arbitrary configurations. Additionally, PyFlyt includes various task definitions and multiple reward function settings for each vehicle type. We demonstrate the effectiveness of PyFlyt by training various RL agents for two UAV models: quadrotor and fixed-wing. Our findings highlight the effectiveness of RL in UAV control and planning, and further show that it is possible to train agents in sparse reward settings for UAVs. PyFlyt fills a gap in existing literature by providing a flexible and standardised platform for testing RL algorithms on UAVs. We believe that this will inspire more standardised research in this direction.

Combinatorial Optimization enriched Machine Learning to solve the Dynamic Vehicle Routing Problem with Time Windows

Authors:Léo Baty, Kai Jungel, Patrick S. Klein, Axel Parmentier, Maximilian Schiffer
Date:2023-04-03 08:23:09

With the rise of e-commerce and increasing customer requirements, logistics service providers face a new complexity in their daily planning, mainly due to the need to handle same-day deliveries efficiently. Existing multi-stage stochastic optimization approaches that can solve the underlying dynamic vehicle routing problem are either computationally too expensive for an application in online settings, or -- in the case of reinforcement learning -- struggle to perform well on high-dimensional combinatorial problems. To mitigate these drawbacks, we propose a novel machine learning pipeline that incorporates a combinatorial optimization layer. We apply this general pipeline to a dynamic vehicle routing problem with dispatching waves, which was recently promoted in the EURO Meets NeurIPS Vehicle Routing Competition at NeurIPS 2022. Our methodology ranked first in this competition, outperforming all other approaches in solving the proposed dynamic vehicle routing problem. With this work, we provide a comprehensive numerical study that further highlights the efficacy and benefits of the proposed pipeline beyond the results achieved in the competition, e.g., by showcasing the robustness of the encoded policy against unseen instances and scenarios.

Leveraging Predictive Models for Adaptive Sampling of Spatiotemporal Fluid Processes

Authors:Sandeep Manjanna, Tom Z. Jiahao, M. Ani Hsieh
Date:2023-04-03 05:55:27

Persistent monitoring of a spatiotemporal fluid process requires data sampling and predictive modeling of the process being monitored. In this paper we present the PASST algorithm: Predictive-model based Adaptive Sampling of a Spatio-Temporal process. PASST is an adaptive robotic sampling algorithm that leverages predictive models to efficiently and persistently monitor a fluid process in a given region of interest. Our algorithm makes use of the predictions from a learned prediction model to plan a path for an autonomous vehicle to adaptively and efficiently survey the region of interest. In turn, the sampled data is used to obtain better predictions by giving an updated initial state to the predictive model. For the predictive model, we use Knowledge-based Neural Ordinary Differential Equations to train models of fluid processes. These models are orders of magnitude smaller in size and run much faster than fluid data obtained from direct numerical simulations of the partial differential equations that describe the fluid processes or other comparable computational fluids models. For path planning, we use reinforcement learning based planning algorithms that use the field predictions as reward functions. We evaluate our adaptive sampling path planning algorithm on both numerically simulated fluid data and real-world nowcast ocean flow data to show that we can sample the spatiotemporal field in the given region of interest for long time horizons. We also evaluate the PASST algorithm's generalization ability to sample from fluid processes that are not in the training repertoire of the learned models.

Risk-Sensitive and Robust Model-Based Reinforcement Learning and Planning

Authors:Marc Rigter
Date:2023-04-02 16:44:14

Many sequential decision-making problems that are currently automated, such as those in manufacturing or recommender systems, operate in an environment where there is either little uncertainty, or zero risk of catastrophe. As companies and researchers attempt to deploy autonomous systems in less constrained environments, it is increasingly important that we endow sequential decision-making algorithms with the ability to reason about uncertainty and risk. In this thesis, we will address both planning and reinforcement learning (RL) approaches to sequential decision-making. In the planning setting, it is assumed that a model of the environment is provided, and a policy is optimised within that model. Reinforcement learning relies upon extensive random exploration, and therefore usually requires a simulator in which to perform training. In many real-world domains, it is impossible to construct a perfectly accurate model or simulator. Therefore, the performance of any policy is inevitably uncertain due to the incomplete knowledge about the environment. Furthermore, in stochastic domains, the outcome of any given run is also uncertain due to the inherent randomness of the environment. These two sources of uncertainty are usually classified as epistemic, and aleatoric uncertainty, respectively. The over-arching goal of this thesis is to contribute to developing algorithms that mitigate both sources of uncertainty in sequential decision-making problems. We make a number of contributions towards this goal, with a focus on model-based algorithms...

Adaptive formation motion planning and control of autonomous underwater vehicles using deep reinforcement learning

Authors:Behnaz Hadi, Alireza Khosravi, Pouria Sarhadi
Date:2023-04-01 04:58:55

Creating safe paths in unknown and uncertain environments is a challenging aspect of leader-follower formation control. In this architecture, the leader moves toward the target by taking optimal actions, and followers should also avoid obstacles while maintaining their desired formation shape. Most of the studies in this field have inspected formation control and obstacle avoidance separately. The present study proposes a new approach based on deep reinforcement learning (DRL) for end-to-end motion planning and control of under-actuated autonomous underwater vehicles (AUVs). The aim is to design optimal adaptive distributed controllers based on an actor-critic structure for AUV formation motion planning. This is accomplished by controlling the speed and heading of the AUVs. For obstacle avoidance, two approaches have been deployed. In the first approach, the goal is to design control policies for the leader and followers such that each learns its own collision-free path. Moreover, the followers adhere to an overall formation maintenance policy. In the second approach, only the leader learns the control policy, and it safely leads the whole group towards the target. Here, the control policy of the followers is to maintain the predetermined distance and angle. The robustness of the proposed method is shown under realistically perturbed circumstances, in the presence of ocean currents, communication delays, and sensing errors. The efficiency of the algorithms has been evaluated and approved using a number of computer-based simulations.

Q-Learning based system for path planning with unmanned aerial vehicles swarms in obstacle environments

Authors:Alejandro Puente-Castro, Daniel Rivero, Eurico Pedrosa, Artur Pereira, Nuno Lau, Enrique Fernandez-Blanco
Date:2023-03-30 18:37:34

Path Planning methods for autonomous control of Unmanned Aerial Vehicle (UAV) swarms are on the rise because of all the advantages they bring. There are more and more scenarios where autonomous control of multiple UAVs is required. Most of these scenarios present a large number of obstacles, such as power lines or trees. If all UAVs can be operated autonomously, personnel expenses can be decreased. In addition, if their flight paths are optimal, energy consumption is reduced. This ensures that more battery time is left for other operations. In this paper, a Reinforcement Learning based system is proposed for solving this problem in environments with obstacles by making use of Q-Learning. This method allows a model, in this particular case an Artificial Neural Network, to self-adjust by learning from its mistakes and achievements. Regardless of the size of the map or the number of UAVs in the swarm, the goal of these paths is to ensure complete coverage of an area with fixed obstacles for tasks such as field prospecting. Setting goals or having any prior information aside from the provided map is not required. For experimentation, five maps of different sizes with different obstacles were used. The experiments were performed with different numbers of UAVs. For the calculation of the results, the number of actions taken by all UAVs to complete the task in each experiment is taken into account. The lower the number of actions, the shorter the path and the lower the energy consumption. The results are satisfactory, showing that the system obtains solutions in fewer movements the more UAVs there are. For a better presentation, these results have been compared to another state-of-the-art approach.

Learning Human-to-Robot Handovers from Point Clouds

Authors:Sammy Christen, Wei Yang, Claudia Pérez-D'Arpino, Otmar Hilliges, Dieter Fox, Yu-Wei Chao
Date:2023-03-30 17:58:36

We propose the first framework to learn control policies for vision-based human-to-robot handovers, a critical task for human-robot interaction. While research in Embodied AI has made significant progress in training robot agents in simulated environments, interacting with humans remains challenging due to the difficulties of simulating humans. Fortunately, recent research has developed realistic simulated environments for human-to-robot handovers. Leveraging this result, we introduce a method that is trained with a human-in-the-loop via a two-stage teacher-student framework that uses motion and grasp planning, reinforcement learning, and self-supervision. We show significant performance gains over baselines on a simulation benchmark, sim-to-sim transfer and sim-to-real transfer.

Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks

Authors:Haoqi Yuan, Chi Zhang, Hongcheng Wang, Feiyang Xie, Penglin Cai, Hao Dong, Zongqing Lu
Date:2023-03-29 09:45:50

We study building multi-task agents in open-world environments. Without human demonstrations, learning to accomplish long-horizon tasks in a large open-world environment with reinforcement learning (RL) is extremely inefficient. To tackle this challenge, we convert the multi-task learning problem into learning basic skills and planning over the skills. Using the popular open-world game Minecraft as the testbed, we propose three types of fine-grained basic skills, and use RL with intrinsic rewards to acquire skills. A novel Finding-skill that performs exploration to find diverse items provides better initialization for other skills, improving the sample efficiency for skill learning. In skill planning, we leverage the prior knowledge in Large Language Models to find the relationships between skills and build a skill graph. When the agent is solving a task, our skill search algorithm walks on the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing more than 10 skills. Our method outperforms baselines by a large margin and is the most sample-efficient demonstration-free RL method to solve Minecraft Tech Tree tasks. The project's website and code can be found at https://sites.google.com/view/plan4mc.

Planning with Sequence Models through Iterative Energy Minimization

Authors:Hongyi Chen, Yilun Du, Yiye Chen, Joshua Tenenbaum, Patricio A. Vela
Date:2023-03-28 17:53:22

Recent works have shown that sequence modeling can be effectively used to train reinforcement learning (RL) policies. However, the success of applying existing sequence models to planning, in which we wish to obtain a trajectory of actions to reach some goal, is less straightforward. The typical autoregressive generation procedures of sequence models preclude sequential refinement of earlier steps, which limits the effectiveness of a predicted plan. In this paper, we suggest an approach towards integrating planning with sequence models based on the idea of iterative energy minimization, and illustrate how such a procedure leads to improved RL performance across different tasks. We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy. We illustrate how this procedure enables improved performance over recent approaches across BabyAI and Atari environments. We further demonstrate unique benefits of our iterative optimization procedure, involving new task generalization, test-time constraints adaptation, and the ability to compose plans together. Project website: https://hychen-naza.github.io/projects/LEAP
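
The contrast between one-pass autoregressive generation and iterative refinement can be sketched with a toy example. Below, a hand-written energy function scores a whole action sequence, and the plan is refined by repeatedly re-choosing one step at a time given all the others. The paper instead learns the energy implicitly with a masked language model; the energy function, action set, and greedy coordinate-wise update here are assumptions chosen only to keep the sketch small.

```python
def energy(plan, goal):
    """Toy energy: distance from the goal after executing the plan, plus action cost."""
    position = sum(plan)
    return abs(goal - position) + 0.1 * sum(abs(a) for a in plan)


def refine(plan, goal, actions=(-1, 0, 1), sweeps=5):
    """Iteratively lower the energy by revisiting every step of the plan."""
    plan = list(plan)
    for _ in range(sweeps):
        for i in range(len(plan)):
            # Re-choose step i with all other steps held fixed; earlier steps
            # can be revised on later sweeps, unlike one-pass generation.
            plan[i] = min(
                actions,
                key=lambda a: energy(plan[:i] + [a] + plan[i + 1:], goal),
            )
    return plan


initial_plan = [0, 0, 0, 0]
refined_plan = refine(initial_plan, goal=3)
```

Because every position is revisited on each sweep, a poor early choice is not locked in, which is the property the abstract attributes to iterative energy minimization.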

DexDeform: Dexterous Deformable Object Manipulation with Human Demonstrations and Differentiable Physics

Authors:Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B. Tenenbaum, Chuang Gan
Date:2023-03-27 17:59:49

In this work, we aim to learn dexterous manipulation of deformable objects using multi-fingered hands. Reinforcement learning approaches for dexterous rigid object manipulation would struggle in this setting due to the complexity of physics interaction with deformable objects. At the same time, previous trajectory optimization approaches with differentiable physics for deformable manipulation would suffer from local optima caused by the explosion of contact modes from hand-object interactions. To address these challenges, we propose DexDeform, a principled framework that abstracts dexterous manipulation skills from human demonstration and refines the learned skills with differentiable physics. Concretely, we first collect a small set of human demonstrations using teleoperation. We then train a skill model using demonstrations for planning over action abstractions in imagination. To explore the goal space, we further apply augmentations to the existing deformable shapes in demonstrations and use a gradient optimizer to refine the actions planned by the skill model. Finally, we adopt the refined trajectories as new demonstrations for finetuning the skill model. To evaluate the effectiveness of our approach, we introduce a suite of six challenging dexterous deformable object manipulation tasks. Compared with baselines, DexDeform is able to better explore and generalize across novel goals unseen in the initial human demonstrations.

Bi-Manual Block Assembly via Sim-to-Real Reinforcement Learning

Authors:Satoshi Kataoka, Youngseog Chung, Seyed Kamyar Seyed Ghasemipour, Pannag Sanketi, Shixiang Shane Gu, Igor Mordatch
Date:2023-03-27 01:25:24

Most successes in robotic manipulation have been restricted to single-arm gripper robots, whose low dexterity limits the range of solvable tasks to pick-and-place, insertion, and object rearrangement. More complex tasks such as assembly require dual and multi-arm platforms, but entail a suite of unique challenges such as bi-arm coordination and collision avoidance, robust grasping, and long-horizon planning. In this work we investigate the feasibility of training deep reinforcement learning (RL) policies in simulation and transferring them to the real world (Sim2Real) as a generic methodology for obtaining performant controllers for real-world bi-manual robotic manipulation tasks. As a testbed for bi-manual manipulation, we develop the U-Shape Magnetic Block Assembly Task, wherein two robots with parallel grippers must connect 3 magnetic blocks to form a U-shape. Without manually designed controllers or human demonstrations, we demonstrate that with careful Sim2Real considerations, our policies trained with RL in simulation enable two xArm6 robots to solve the U-shape assembly task with a success rate of above 90% in simulation, and 50% on real hardware without any additional real-world fine-tuning. Through careful ablations, we highlight how each component of the system is critical for such simple and successful policy learning and transfer, including task specification, learning algorithm, direct joint-space control, behavior constraints, perception and actuation noises, action delays and action interpolation. Our results present a significant step forward for bi-arm capability on real hardware, and we hope our system can inspire future research on deep RL and Sim2Real transfer of bi-manual policies, drastically scaling up the capability of real-world robot manipulators.

Learning to Operate in Open Worlds by Adapting Planning Models

Authors:Wiktor Piotrowski, Roni Stern, Yoni Sher, Jacob Le, Matthew Klenk, Johan deKleer, Shiwali Mohan
Date:2023-03-24 21:04:16

Planning agents are ill-equipped to act in novel situations in which their domain model no longer accurately represents the world. We introduce an approach for such agents operating in open worlds that detects the presence of novelties and effectively adapts their domain models and consequent action selection. It uses observations of action execution and measures their divergence from what is expected, according to the environment model, to infer existence of a novelty. Then, it revises the model through a heuristics-guided search over model changes. We report empirical evaluations on the CartPole problem, a standard Reinforcement Learning (RL) benchmark. The results show that our approach can deal with a class of novelties very quickly and in an interpretable fashion.
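
The divergence check described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `detect_novelty`, the Euclidean distance measure, the averaging over a window of transitions, and the fixed threshold are all assumptions made for illustration.

```python
import math

def detect_novelty(model, transitions, threshold):
    """Flag a novelty when observed outcomes diverge from model predictions.

    model: callable (state, action) -> predicted next state (tuple of floats).
    transitions: list of (state, action, observed_next_state).
    threshold: hypothetical cutoff on the mean divergence.
    """
    # Divergence of each observed next state from the model's prediction
    divergences = [
        math.dist(model(s, a), s_next) for s, a, s_next in transitions
    ]
    mean_divergence = sum(divergences) / len(divergences)
    return mean_divergence > threshold, mean_divergence

# Toy one-dimensional model: the action is added to the state.
def toy_model(state, action):
    return (state[0] + action,)

# Observations from a changed world in which each action moves twice as far.
observed = [((0.0,), 1.0, (2.0,)), ((2.0,), 1.0, (4.0,))]
is_novel, score = detect_novelty(toy_model, observed, threshold=0.5)
```

In this toy case every prediction is off by exactly one unit, so the mean divergence exceeds the threshold and a novelty is flagged; a model-revision search would start from that signal.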

Boosting Reinforcement Learning and Planning with Demonstrations: A Survey

Authors:Tongzhou Mu, Hao Su
Date:2023-03-23 17:53:44

Although reinforcement learning has seen tremendous success recently, this kind of trial-and-error learning can be impractical or inefficient in complex environments. The use of demonstrations, on the other hand, enables agents to benefit from expert knowledge rather than having to discover the best action to take through exploration. In this survey, we discuss the advantages of using demonstrations in sequential decision making, various ways to apply demonstrations in learning-based decision making paradigms (for example, reinforcement learning and planning in the learned models), and how to collect the demonstrations in various scenarios. Additionally, we exemplify a practical pipeline for generating and utilizing demonstrations in the recently proposed ManiSkill robot learning benchmark.

Planning Goals for Exploration

Authors:Edward S. Hu, Richard Chang, Oleh Rybkin, Dinesh Jayaraman
Date:2023-03-23 02:51:50

Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose "Planning Exploratory Goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments including a multi-legged ant robot in a maze, and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command. Website: https://penn-pal-lab.github.io/peg/

EDGI: Equivariant Diffusion for Planning with Embodied Agents

Authors:Johann Brehmer, Joey Bose, Pim de Haan, Taco Cohen
Date:2023-03-22 09:19:39

Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries. Most algorithms for planning and model-based reinforcement learning (MBRL) do not take this rich geometric structure into account, leading to sample inefficiency and poor generalization. We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm for MBRL and planning that is equivariant with respect to the product of the spatial symmetry group $\mathrm{SE}(3)$, the discrete-time translation group $\mathbb{Z}$, and the object permutation group $S_n$. EDGI follows the Diffuser framework (Janner et al., 2022) in treating both learning a world model and planning in it as a conditional generative modeling problem, training a diffusion model on an offline trajectory dataset. We introduce a new $\mathrm{SE}(3) \times \mathbb{Z} \times S_n$-equivariant diffusion model that supports multiple representations. We integrate this model in a planning loop, where conditioning and classifier guidance let us softly break the symmetry for specific tasks as needed. On object manipulation and navigation tasks, EDGI is substantially more sample efficient and generalizes better across the symmetry group than non-equivariant models.

Deep Reinforcement Learning for Localizability-Enhanced Navigation in Dynamic Human Environments

Authors:Yuan Chen, Quecheng Qiu, Xiangyu Liu, Guangda Chen, Shunyi Yao, Jie Peng, Jianmin Ji, Yanyong Zhang
Date:2023-03-22 07:44:35

Reliable localization is crucial for autonomous robots to navigate efficiently and safely. Some navigation methods can plan paths with high localizability (which describes the capability of acquiring reliable localization). By following these paths, the robot can access sensor streams that let the localization algorithms produce more accurate location estimates. However, most of these methods require prior knowledge and struggle to adapt to unseen scenarios or dynamic changes. To overcome these limitations, we propose a novel approach for localizability-enhanced navigation via deep reinforcement learning in dynamic human environments. Our proposed planner automatically extracts geometric features from 2D laser data that are helpful for localization. The planner learns to assign different importance to the geometric features and encourages the robot to navigate through areas that are helpful for laser localization. To facilitate the learning of the planner, we suggest two techniques: (1) an augmented state representation that considers the dynamic changes and the confidence of the localization results, which provides more information and allows the robot to make better decisions, and (2) a reward metric that can offer both sparse and dense feedback on behaviors that affect localization accuracy. Our method exhibits significant improvements in lost rate and arrival rate when tested in previously unseen environments.

Adaptive Road Configurations for Improved Autonomous Vehicle-Pedestrian Interactions using Reinforcement Learning

Authors:Qiming Ye, Yuxiang Feng, Jose Javier Escribano Macias, Marc Stettler, Panagiotis Angeloudis
Date:2023-03-22 03:42:39

The deployment of Autonomous Vehicles (AVs) poses considerable challenges and unique opportunities for the design and management of future urban road infrastructure. In light of this disruptive transformation, the Right-Of-Way (ROW) composition of road space has the potential to be renewed. Design approaches and intelligent control models have been proposed to address this problem, but we lack an operational framework that can dynamically generate ROW plans for AVs and pedestrians in response to real-time demand. Based on microscopic traffic simulation, this study explores Reinforcement Learning (RL) methods for evolving ROW compositions. We implement a centralised paradigm and a distributive learning paradigm to separately perform the dynamic control on several road network configurations. Experimental results indicate that the algorithms have the potential to improve traffic flow efficiency and allocate more space for pedestrians. Furthermore, the distributive learning algorithm outperforms its centralised counterpart regarding computational cost (49.55%), benchmark rewards (25.35%), best cumulative rewards (24.58%), optimal actions (13.49%) and rate of convergence. This novel road management technique could potentially contribute to the flow-adaptive and active mobility-friendly streets in the AVs era.

A Hierarchical Hybrid Learning Framework for Multi-agent Trajectory Prediction

Authors:Yujun Jiao, Mingze Miao, Zhishuai Yin, Chunyuan Lei, Xu Zhu, Linzhen Nie, Bo Tao
Date:2023-03-22 02:47:42

Accurate and robust trajectory prediction of neighboring agents is critical for autonomous vehicles traversing in complex scenes. Most methods proposed in recent years are deep learning-based due to their strength in encoding complex interactions. However, implausible predictions are often generated since they rely heavily on past observations and cannot effectively capture the transient and contingency interactions from sparse samples. In this paper, we propose a hierarchical hybrid framework of deep learning (DL) and reinforcement learning (RL) for multi-agent trajectory prediction, to cope with the challenge of predicting motions shaped by multi-scale interactions. In the DL stage, the traffic scene is divided into multiple intermediate-scale heterogeneous graphs based on which Transformer-style GNNs are adopted to encode heterogeneous interactions at intermediate and global levels. In the RL stage, we divide the traffic scene into local sub-scenes utilizing the key future points predicted in the DL stage. To emulate the motion planning procedure so as to produce trajectory predictions, a Transformer-based Proximal Policy Optimization (PPO) incorporated with a vehicle kinematics model is devised to plan motions under the dominant influence of microscopic interactions. A multi-objective reward is designed to balance between agent-centric accuracy and scene-wise compatibility. Experimental results show that our proposal matches the state of the art on the Argoverse forecasting benchmark. The visualized results also reveal that the hierarchical learning framework captures the multi-scale interactions and improves the feasibility and compliance of the predicted trajectories.

Imitating Graph-Based Planning with Goal-Conditioned Policies

Authors:Junsu Kim, Younggyo Seo, Sungsoo Ahn, Kyunghwan Son, Jinwoo Shin
Date:2023-03-20 14:51:10

Recently, graph-based planning algorithms have gained much attention for solving goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals to reach the target-goal, and the agents learn to execute subgoal-conditioned policies. However, the sample-efficiency of such RL schemes still remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition here is that to reach a target-goal, an agent should pass through a subgoal, so target-goal- and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that only utilize graph-based planning in an execution phase, our method transfers knowledge from a planner along with a graph into policy learning. We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods under various long-horizon control tasks.

Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs

Authors:Yuan Cheng, Ruiquan Huang, Jing Yang, Yingbin Liang
Date:2023-03-20 04:39:39

In reward-free reinforcement learning (RL), an agent explores the environment first without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper we focus on reward-free RL under low-rank MDP models, in which both the representation and linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from being satisfactory. In this work, we provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity significantly improving the previous results. Such a sample complexity matches our lower bound in the dependence on $\epsilon$, as well as on $K$ in the large $d$ regime, where $d$ and $K$ respectively denote the representation dimension and action space cardinality. Finally, we provide a planning algorithm (without further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under the same setting.

Deceptive Reinforcement Learning in Model-Free Domains

Authors:Alan Lewis, Tim Miller
Date:2023-03-20 02:47:40

This paper investigates deceptive reinforcement learning for privacy preservation in model-free and continuous action space domains. In reinforcement learning, the reward function defines the agent's objective. In adversarial scenarios, an agent may need to both maximise rewards and keep its reward function private from observers. Recent research presented the ambiguity model (AM), which selects actions that are ambiguous over a set of possible reward functions, via pre-trained $Q$-functions. Despite promising results in model-based domains, our investigation shows that AM is ineffective in model-free domains due to misdirected state space exploration. It is also inefficient to train and inapplicable in continuous action space domains. We propose the deceptive exploration ambiguity model (DEAM), which learns using the deceptive policy during training, leading to targeted exploration of the state space. DEAM is also applicable in continuous action spaces. We evaluate DEAM in discrete and continuous action space path planning environments. DEAM achieves similar performance to an optimal model-based version of AM and outperforms a model-free version of AM in terms of path cost, deceptiveness and training efficiency. These results extend to the continuous domain.

Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs

Authors:Junkai Zhang, Weitong Zhang, Quanquan Gu
Date:2023-03-17 17:53:28

We study reward-free reinforcement learning (RL) with linear function approximation, where the agent works in two phases: (1) in the exploration phase, the agent interacts with the environment but cannot access the reward; and (2) in the planning phase, the agent is given a reward function and is expected to find a near-optimal policy based on samples collected in the exploration phase. The sample complexities of existing reward-free algorithms have a polynomial dependence on the planning horizon, which makes them intractable for long planning horizon RL problems. In this paper, we propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs), where the transition probability can be parameterized as a linear combination of known feature mappings. At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward and a high-order moment estimator for the aleatoric and epistemic uncertainties. When the total reward is bounded by $1$, we show that our algorithm only needs to explore $\tilde O( d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy, where $d$ is the dimension of the feature mapping. The sample complexity of our algorithm only has a polylogarithmic dependence on the planning horizon and therefore is "horizon-free". In addition, we provide an $\Omega(d^2\varepsilon^{-2})$ sample complexity lower bound, which matches the sample complexity of our algorithm up to logarithmic factors, suggesting that our algorithm is optimal.

Efficient Learning of High Level Plans from Play

Authors:Núria Armengol Urpí, Marco Bagatella, Otmar Hilliges, Georg Martius, Stelian Coros
Date:2023-03-16 20:09:47

Real-world robotic manipulation tasks remain an elusive challenge, since they involve both fine-grained environment interaction and the ability to plan for long-horizon goals. Although deep reinforcement learning (RL) methods have shown encouraging results when planning end-to-end in high-dimensional environments, they remain fundamentally limited by poor sample efficiency due to inefficient exploration, and by the complexity of credit assignment over long horizons. In this work, we present Efficient Learning of High-Level Plans from Play (ELF-P), a framework for robotic learning that bridges motion planning and deep RL to achieve long-horizon complex manipulation tasks. We leverage task-agnostic play data to learn a discrete behavioral prior over object-centric primitives, modeling their feasibility given the current context. We then design a high-level goal-conditioned policy which (1) uses primitives as building blocks to scaffold complex long-horizon tasks and (2) leverages the behavioral prior to accelerate learning. We demonstrate that ELF-P has significantly better sample efficiency than relevant baselines over multiple realistic manipulation tasks and learns policies that can be easily transferred to physical hardware.

Self-Inspection Method of Unmanned Aerial Vehicles in Power Plants Using Deep Q-Network Reinforcement Learning

Authors:Haoran Guan
Date:2023-03-16 00:58:50

For the purpose of inspecting power plants, autonomous robots can be built using reinforcement learning techniques. The method replicates the environment and employs a simple reinforcement learning (RL) algorithm. This strategy might be applied in several sectors, including the electricity generation sector. A pre-trained model with perception, planning, and action is suggested by the research. To address optimization problems, such as the Unmanned Aerial Vehicle (UAV) navigation problem, Deep Q-network (DQN), a reinforcement learning-based framework that DeepMind launched in 2015, incorporates both deep learning and Q-learning. To overcome problems with current procedures, the research proposes a power plant inspection system incorporating UAV autonomous navigation and DQN reinforcement learning. These training processes set reward functions with reference to states and consider both internal and external influencing factors, which distinguishes them from other reinforcement learning training techniques now in use. The key components of the reinforcement learning segment of the technique, for instance, introduce states such as the simulation of a wind field, the battery charge level of an unmanned aerial vehicle, the height the UAV reached, etc. The trained model makes it more likely that the inspection strategy will be applied in practice by enabling the UAV to move around on its own in difficult environments. The average score of the model converges to 9,000. The trained model allowed the UAV to make the fewest number of rotations necessary to go to the target point.

Efficient Planning of Multi-Robot Collective Transport using Graph Reinforcement Learning with Higher Order Topological Abstraction

Authors:Steve Paul, Wenyuan Li, Brian Smyth, Yuzhou Chen, Yulia Gel, Souma Chowdhury
Date:2023-03-15 20:58:20

Efficient multi-robot task allocation (MRTA) is fundamental to various time-sensitive applications such as disaster response, warehouse operations, and construction. This paper tackles a particular class of these problems that we call MRTA-collective transport or MRTA-CT -- here tasks present varying workloads and deadlines, and robots are subject to flight range, communication range, and payload constraints. For large instances of these problems involving 100s-1000s of tasks and 10s-100s of robots, traditional non-learning solvers are often time-inefficient, and emerging learning-based policies do not scale well to larger-sized problems without costly retraining. To address this gap, we use a recently proposed encoder-decoder graph neural network involving Capsule networks and a multi-head attention mechanism, and innovatively add topological descriptors (TD) as new features to improve transferability to unseen problems of similar and larger size. Persistent homology is used to derive the TD, and proximal policy optimization is used to train our TD-augmented graph neural network. The resulting policy model compares favorably to state-of-the-art non-learning baselines while being much faster. The benefit of using TD is readily evident when scaling to test problems of size larger than those used in training.

On the Benefits of Leveraging Structural Information in Planning Over the Learned Model

Authors:Jiajun Shen, Kananart Kuwaranancharoen, Raid Ayoub, Pietro Mercati, Shreyas Sundaram
Date:2023-03-15 18:18:01

Model-based Reinforcement Learning (RL) integrates learning and planning and has received increasing attention in recent years. However, learning the model can incur a significant cost (in terms of sample complexity), due to the need to obtain a sufficient number of samples for each state-action pair. In this paper, we investigate the benefits of leveraging structural information about the system in terms of reducing sample complexity. Specifically, we consider the setting where the transition probability matrix is a known function of a number of structural parameters, whose values are initially unknown. We then consider the problem of estimating those parameters based on the interactions with the environment. We characterize the difference between the Q estimates and the optimal Q value as a function of the number of samples. Our analysis shows that there can be a significant saving in sample complexity by leveraging structural information about the model. We illustrate the findings by considering several problems including controlling a queuing system with heterogeneous servers, and seeking an optimal path in a stochastic windy gridworld.

Replay Buffer with Local Forgetting for Adapting to Local Environment Changes in Deep Model-Based Reinforcement Learning

Authors:Ali Rahimi-Kalahroudi, Janarthanan Rajendran, Ida Momennejad, Harm van Seijen, Sarath Chandar
Date:2023-03-15 15:21:26

One of the key behavioral characteristics used in neuroscience to determine whether the subject of study -- be it a rodent or a human -- exhibits model-based learning is effective adaptation to local changes in the environment, a particular form of adaptivity that is the focus of this work. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to local environment changes. An explanation for this mismatch is that MBRL methods are typically designed with sample-efficiency on a single task in mind and the requirements for effective adaptation are substantially higher, both in terms of the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout relevant parts of the state-space. This is challenging for deep-learning-based world models due to catastrophic forgetting. And while a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation due to maintaining stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing only samples from the buffer from the local neighbourhood of the newly observed samples, deep world models can be built that maintain their accuracy across the state-space, while also being able to effectively adapt to local changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, demonstrating that deep model-based methods can also adapt effectively to local changes in the environment.
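
The replay-buffer variation described in the abstract above (drop only the stored samples that lie near a newly observed one) can be sketched as follows. This is an illustrative sketch, not the authors' code: the class name `LocalForgettingBuffer`, the plain Euclidean distance on raw states, and the fixed `radius` are assumptions; the paper may define the local neighbourhood differently.

```python
import math

class LocalForgettingBuffer:
    """Replay buffer that forgets stored samples near each new sample."""

    def __init__(self, radius):
        self.radius = radius  # hypothetical neighbourhood size
        self.samples = []     # (state, action, reward, next_state) tuples

    def add(self, state, action, reward, next_state):
        # Remove stored transitions whose state lies within `radius`
        # of the new state; distant samples are kept untouched.
        self.samples = [
            s for s in self.samples
            if math.dist(s[0], state) > self.radius
        ]
        self.samples.append((state, action, reward, next_state))

    def __len__(self):
        return len(self.samples)

buf = LocalForgettingBuffer(radius=0.5)
buf.add((0.0, 0.0), 0, 1.0, (0.1, 0.0))   # old reward near the origin
buf.add((5.0, 5.0), 1, 0.0, (5.1, 5.0))   # far-away sample
buf.add((0.1, 0.1), 0, -1.0, (0.2, 0.1))  # new observation near the origin
rewards = [s[2] for s in buf.samples]
```

After the third call the stale sample near the origin is gone while the distant sample survives, which is the behaviour a first-in-first-out buffer cannot provide.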

Path Planning using Reinforcement Learning: A Policy Iteration Approach

Authors:Saumil Shivdikar, Jagannath Nirmal
Date:2023-03-13 23:44:40

With the impact of real-time processing being realized in the recent past, the need for efficient implementations of reinforcement learning algorithms has been on the rise. Despite the numerous advantages of the Bellman equations used in RL algorithms, they come with a large search space of design parameters. This research aims to shed light on the design space exploration associated with reinforcement learning parameters, specifically that of Policy Iteration. Given the large computational expenses of fine-tuning the parameters of reinforcement learning algorithms, we propose an auto-tuner-based ordinal regression approach to accelerate the process of exploring these parameters and, in return, accelerate convergence towards an optimal policy. Our approach provides 1.82x peak speedup with an average of 1.48x speedup over the previous state-of-the-art.
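
For context, a minimal sketch of textbook Policy Iteration on a deterministic toy MDP is given here. It is not the authors' implementation or auto-tuner; the function name, the three-state chain, and the parameters `gamma` and `tol` are assumptions, chosen only to show the kind of design parameters such an algorithm exposes.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Policy iteration for a deterministic MDP.

    P[s][a] is the next state and R[s][a] the reward for action a in state s.
    gamma (discount) and tol (evaluation tolerance) are tunable parameters.
    """
    n_states = len(P)
    policy = [0] * n_states
    V = [0.0] * n_states
    while True:
        # Policy evaluation: sweep until the value estimates stop changing
        while True:
            delta = 0.0
            for s in range(n_states):
                a = policy[s]
                v_new = R[s][a] + gamma * V[P[s][a]]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: switch only to strictly better actions
        stable = True
        for s in range(n_states):
            q = [R[s][a] + gamma * V[P[s][a]] for a in range(len(P[s]))]
            best = max(range(len(q)), key=q.__getitem__)
            if q[best] > q[policy[s]] + 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

# Three-state chain: action 0 stays, action 1 moves right.
# Moving from state 1 into the absorbing state 2 pays reward 1.
P = [[0, 1], [1, 2], [2, 2]]
R = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
policy, V = policy_iteration(P, R)
```

On this chain the resulting policy moves right from states 0 and 1; how quickly the loops terminate depends on `gamma` and `tol`, which is the sort of parameter sensitivity a tuner would explore.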

Towards Practical Multi-Robot Hybrid Tasks Allocation for Autonomous Cleaning

Authors:Yabin Wang, Xiaopeng Hong, Zhiheng Ma, Tiedong Ma, Baoxing Qin, Zhou Su
Date:2023-03-12 01:15:08

Task allocation plays a vital role in multi-robot autonomous cleaning systems, where multiple robots work together to clean a large area. However, most current studies mainly focus on deterministic, single-task allocation for cleaning robots, without considering hybrid tasks in uncertain working environments. Moreover, there is a lack of datasets and benchmarks for relevant research. In this paper, to address these problems, we formulate multi-robot hybrid-task allocation under the uncertain cleaning environment as a robust optimization problem. Firstly, we propose a novel robust mixed-integer linear programming model with practical constraints including the task order constraint for different tasks and the ability constraints of hybrid robots. Secondly, we establish a dataset of 100 instances made from floor plans, each of which has 2D manually-labeled images and a 3D model. Thirdly, we provide comprehensive results on the collected dataset using three traditional optimization approaches and a deep reinforcement learning-based solver. The evaluation results show that our solution meets the needs of multi-robot cleaning task allocation and the robust solver can protect the system from worst-case scenarios with little additional cost. The benchmark will be available at https://github.com/iamwangyabin/Multi-robot-Cleaning-Task-Allocation.

Spatio-Temporal Attention Network for Persistent Monitoring of Multiple Mobile Targets

Authors:Yizhuo Wang, Yutong Wang, Yuhong Cao, Guillaume Sartoretti
Date:2023-03-11 08:53:37

This work focuses on the persistent monitoring problem, where a set of targets moving based on an unknown model must be monitored by an autonomous mobile robot with a limited sensing range. To keep each target's position estimate as accurate as possible, the robot needs to adaptively plan its path to (re-)visit all the targets and update its belief from measurements collected along the way. In doing so, the main challenge is to strike a balance between exploitation, i.e., re-visiting previously-located targets, and exploration, i.e., finding new targets or re-acquiring lost ones. Encouraged by recent advances in deep reinforcement learning, we introduce an attention-based neural solution to the persistent monitoring problem, where the agent can learn the inter-dependencies between targets, i.e., their spatial and temporal correlations, conditioned on past measurements. This endows the agent with the ability to determine which target, time, and location to attend to across multiple scales, which we show also helps relax the usual limitations of a finite target set. We experimentally demonstrate that our method outperforms other baselines in terms of number of target visits and average estimation error in complex environments. Finally, we implement and validate our model in a drone-based simulation experiment to monitor mobile ground targets in a high-fidelity simulator.

Intent-based Deep Reinforcement Learning for Multi-agent Informative Path Planning

Authors:Tianze Yang, Yuhong Cao, Guillaume Sartoretti
Date:2023-03-09 15:50:36

In multi-agent informative path planning (MAIPP), agents must collectively construct a global belief map of an underlying distribution of interest (e.g., gas concentration, light intensity, or pollution levels) over a given domain, based on measurements taken along their trajectory. They must frequently replan their path to balance the exploration of new areas with the exploitation of known high-interest areas, to maximize information gain within a predefined budget. Traditional approaches rely on reactive path planning conditioned on other agents' predicted future actions. However, as the belief is continuously updated, the predicted actions may not match the executed actions, introducing noise and reducing performance. We propose a decentralized, deep reinforcement learning (DRL) approach using an attention-based neural network, where agents optimize long-term individual and cooperative objectives by sharing their intent, represented as a distribution of medium-/long-term future positions obtained from their own policy. Intent sharing enables agents to learn to claim or avoid broader areas, while the use of attention mechanisms allows them to identify useful portions of imperfect predictions, maximizing cooperation even based on imperfect information. Our experiments compare the performance of our approach, its variants, and high-quality baselines across various MAIPP scenarios. We finally demonstrate the effectiveness of our approach under limited communication ranges, towards deployments under realistic communication constraints.

Real-time scheduling of renewable power systems through planning-based reinforcement learning

Authors:Shaohuai Liu, Jinbo Liu, Weirui Ye, Nan Yang, Guanglun Zhang, Haiwang Zhong, Chongqing Kang, Qirong Jiang, Xuri Song, Fangchun Di, Yang Gao
Date:2023-03-09 12:19:20

The growing renewable energy sources have posed significant challenges to traditional power scheduling. It is difficult for operators to obtain accurate day-ahead forecasts of renewable generation, thereby requiring the future scheduling system to make real-time scheduling decisions aligning with ultra-short-term forecasts. Restricted by the computation speed, traditional optimization-based methods cannot solve this problem. Recent developments in reinforcement learning (RL) have demonstrated the potential to solve this challenge. However, the existing RL methods are inadequate in terms of constraint complexity, algorithm performance, and environment fidelity. We are the first to propose a systematic solution based on the state-of-the-art reinforcement learning algorithm and the real power grid environment. The proposed approach enables planning and finer time resolution adjustments of power generators, including unit commitment and economic dispatch, thus increasing the grid's ability to admit more renewable energy. The well-trained scheduling agent significantly reduces renewable curtailment and load shedding, which are issues arising from traditional scheduling's reliance on inaccurate day-ahead forecasts. High-frequency control decisions exploit the existing units' flexibility, reducing the power grid's dependence on hardware transformations and saving investment and operating costs, as demonstrated in experimental results. This research exhibits the potential of reinforcement learning in promoting low-carbon and intelligent power systems and represents a solid step toward sustainable electricity generation.

Learned Parameter Selection for Robotic Information Gathering

Authors:Christopher E. Denniston, Gautam Salhotra, Akseli Kangaslahti, David A. Caron, Gaurav S. Sukhatme
Date:2023-03-09 03:53:44

When robots are deployed in the field for environmental monitoring they typically execute pre-programmed motions, such as lawnmower paths, instead of adaptive methods, such as informative path planning. One reason for this is that adaptive methods are dependent on parameter choices that are both critical to set correctly and difficult for the non-specialist to choose. Here, we show how to automatically configure a planner for informative path planning by training a reinforcement learning agent to select planner parameters at each iteration of informative path planning. We demonstrate our method with 37 instances of 3 distinct environments, and compare it against pure (end-to-end) reinforcement learning techniques, as well as approaches that do not use a learned model to change the planner parameters. Our method shows a 9.53% mean improvement in the cumulative reward across diverse environments when compared to end-to-end learning based methods; we also demonstrate via a field experiment how it can be readily used to facilitate high performance deployment of an information gathering robot.

MCTS-GEB: Monte Carlo Tree Search is a Good E-graph Builder

Authors:Guoliang He, Zak Singh, Eiko Yoneki
Date:2023-03-08 15:19:27

Rewrite systems [6, 10, 12] have widely employed equality saturation [9], which is an optimisation methodology that uses a saturated e-graph to represent all possible sequences of rewrites simultaneously, and then extracts the optimal one. As such, optimal results can be achieved by avoiding the phase-ordering problem. However, we observe that when the e-graph is not saturated, it cannot represent all possible rewrite opportunities and therefore the phase-ordering problem is re-introduced during the construction phase of the e-graph. To address this problem, we propose MCTS-GEB, a domain-general rewrite system that applies reinforcement learning (RL) to e-graph construction. At its core, MCTS-GEB uses a Monte Carlo Tree Search (MCTS) [3] to efficiently plan for the optimal e-graph construction, and therefore it can effectively eliminate the phase-ordering problem at the construction phase and achieve better performance within a reasonable time. Evaluation in two different domains shows MCTS-GEB can outperform the state-of-the-art rewrite systems by up to 49x, while the optimisation can generally take less than an hour, indicating MCTS-GEB is a promising building block for the future generation of rewrite systems.

Deep Occupancy-Predictive Representations for Autonomous Driving

Authors:Eivind Meyer, Lars Frederik Peiss, Matthias Althoff
Date:2023-03-07 20:21:49

Manually specifying features that capture the diversity in traffic environments is impractical. Consequently, learning-based agents cannot realize their full potential as neural motion planners for autonomous vehicles. Instead, this work proposes to learn which features are task-relevant. Given its immediate relevance to motion planning, our proposed architecture encodes the probabilistic occupancy map as a proxy for obtaining pre-trained state representations. By leveraging a map-aware graph formulation of the environment, our agent-centric encoder generalizes to arbitrary road networks and traffic situations. We show that our approach significantly improves the downstream performance of a reinforcement learning agent operating in urban traffic environments.

Foundation Models for Decision Making: Problems, Methods, and Opportunities

Authors:Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, Dale Schuurmans
Date:2023-03-07 18:44:07

Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks. When such models are deployed in real world environments, they inevitably interface with other entities and agents. For example, language models are often used to interact with human beings through dialogue, and visual perception models are used to autonomously navigate neighborhood streets. In response to these developments, new paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning. These paradigms leverage the existence of ever-larger datasets curated for multimodal, multitask, and generalist interaction. Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems that can interact effectively across a diverse range of applications such as dialogue, autonomous driving, healthcare, education, and robotics. In this manuscript, we examine the scope of foundation models for decision making, and provide conceptual tools and technical background for understanding the problem space and exploring new research directions. We review recent approaches that ground foundation models in practical decision making applications through a variety of methods such as prompting, conditional generative modeling, planning, optimal control, and reinforcement learning, and discuss common challenges and open problems in the field.

TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction

Authors:Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, Luc Van Gool
Date:2023-03-07 18:28:41

Data-driven simulation has become a favorable way to train and test autonomous driving algorithms. The idea of replacing the actual environment with a learned simulator has also been explored in model-based reinforcement learning in the context of world models. In this work, we show data-driven traffic simulation can be formulated as a world model. We present TrafficBots, a multi-agent policy built upon motion prediction and end-to-end driving, and based on TrafficBots we obtain a world model tailored for the planning module of autonomous vehicles. Existing data-driven traffic simulators are lacking configurability and scalability. To generate configurable behaviors, for each agent we introduce a destination as navigational information, and a time-invariant latent personality that specifies the behavioral style. To improve the scalability, we present a new scheme of positional encoding for angles, allowing all agents to share the same vectorized context and the use of an architecture based on dot-product attention. As a result, we can simulate all traffic participants seen in dense urban scenarios. Experiments on the Waymo open motion dataset show TrafficBots can simulate realistic multi-agent behaviors and achieve good performance on the motion prediction task.

Deep Reinforcement Learning for Beam Angle Optimization of Intensity-Modulated Radiation Therapy

Authors:Peng Bao, Gong Wang, Ruijie Yang, Bin Dong
Date:2023-03-07 11:27:09

Objective: Intensity-modulated radiation therapy (IMRT) beam angle optimization (BAO) is a challenging combinatorial optimization problem that is NP-hard. In this study, we aim to develop a personalized BAO algorithm for IMRT that improves the quality of the final treatment. Methods: To improve the quality of IMRT treatment planning, we propose a deep reinforcement learning (DRL)-based approach for IMRT BAO. We consider the task as a sequential decision-making problem and formulate it as a Markov Decision Process. To facilitate the training process, a 3D-Unet is designed to predict the dose distribution for different numbers of beam angles, ranging from 1 to 9, to simulate the IMRT environment. By leveraging the simulation model, double deep-Q network (DDQN) and proximal policy optimization (PPO) are used to train agents to select the personalized beam angle sequentially within a few seconds. Results: The treatment plans with beam angles selected by DRL outperform those with clinically used evenly distributed beam angles. For DDQN, the overall average improvement of the CIs is 0.027, 0.032, and 0.03 for 5, 7, and 9 beam angles respectively. For PPO, the overall average improvement of CIs is 0.045, 0.051, and 0.025 for 5, 7, and 9 beam angles respectively. Conclusion: The proposed DRL-based beam angle selection strategy can generate personalized beam angles within a few seconds, and the resulting treatment plan is superior to that obtained using evenly distributed angles. Significance: A fast and automated personalized beam angle selection approach has been proposed for IMRT BAO.
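
For readers unfamiliar with the double deep-Q network (DDQN) mentioned above, the following is a minimal sketch of the standard Double DQN bootstrap target: the online network picks the next action and the target network scores it. The function name, the list-based Q-values, and the example numbers are illustrative assumptions; the abstract does not describe the authors' implementation.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN target for a single transition.

    next_q_online / next_q_target: Q-values of the next state under the online
    and target networks (plain lists here; hypothetical stand-ins for network outputs).
    """
    if done:
        # No bootstrapping past a terminal state.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]


# Hypothetical values: the online network prefers action 1,
# so the target network's estimate for action 1 is used.
example_target = ddqn_target(1.0, [0.2, 0.9], [0.5, 0.1], gamma=0.5)
```

Decoupling selection from evaluation in this way is what distinguishes Double DQN from plain DQN, which would take the maximum of the target network's values directly.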

Environment Transformer and Policy Optimization for Model-Based Offline Reinforcement Learning

Authors:Pengqin Wang, Meixin Zhu, Shaojie Shen
Date:2023-03-07 11:26:09

Interacting with the actual environment to acquire data is often costly and time-consuming in robotic tasks. Model-based offline reinforcement learning (RL) provides a feasible solution. On the one hand, it eliminates the requirements of interaction with the actual environment. On the other hand, it learns the transition dynamics and reward function from the offline datasets and generates simulated rollouts to accelerate training. Previous model-based offline RL methods adopt probabilistic ensemble neural networks (NN) to model aleatoric uncertainty and epistemic uncertainty. However, this results in an exponential increase in training time and computing resource requirements. Furthermore, these methods are easily disturbed by the accumulative errors of the environment dynamics models when simulating long-term rollouts. To solve the above problems, we propose an uncertainty-aware sequence modeling architecture called Environment Transformer. It models the probability distribution of the environment dynamics and reward function to capture aleatoric uncertainty and treats epistemic uncertainty as a learnable noise parameter. Benefiting from the accurate modeling of the transition dynamics and reward function, Environment Transformer can be combined with arbitrary planning, dynamic programming, or policy optimization algorithms for offline RL. In this case, we perform Conservative Q-Learning (CQL) to learn a conservative Q-function. Through simulation experiments, we demonstrate that our method achieves or exceeds state-of-the-art performance in widely studied offline RL benchmarks. Moreover, we show that Environment Transformer's simulated rollout quality, sample efficiency, and long-term rollout simulation capability are superior to those of previous model-based offline RL methods.

Sample-efficient Real-time Planning with Curiosity Cross-Entropy Method and Contrastive Learning

Authors:Mostafa Kotb, Cornelius Weber, Stefan Wermter
Date:2023-03-07 10:48:20

Model-based reinforcement learning (MBRL) with real-time planning has shown great potential in locomotion and manipulation control tasks. However, the existing planning methods, such as the Cross-Entropy Method (CEM), do not scale well to complex high-dimensional environments. One of the key reasons for underperformance is the lack of exploration, as these planning methods only aim to maximize the cumulative extrinsic reward over the planning horizon. Furthermore, planning inside the compact latent space in the absence of observations makes it challenging to use curiosity-based intrinsic motivation. We propose Curiosity CEM (CCEM), an improved version of the CEM algorithm for encouraging exploration via curiosity. Our proposed method maximizes the sum of state-action Q values over the planning horizon, in which these Q values estimate the future extrinsic and intrinsic reward, hence encouraging reaching novel observations. In addition, our model uses contrastive representation learning to efficiently learn latent representations. Experiments on image-based continuous control tasks from the DeepMind Control suite show that CCEM is by a large margin more sample-efficient than previous MBRL algorithms and compares favorably with the best model-free RL methods.
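
As background for the Cross-Entropy Method (CEM) that the abstract builds on, here is a minimal sketch of the baseline CEM planner, not of CCEM itself. It repeatedly samples action sequences from a diagonal Gaussian, keeps the highest-scoring ones, and refits the Gaussian to them. The scoring function `toy_score` and all parameter values are hypothetical; in an MBRL setting the score would come from rollouts in a learned model.

```python
import random
import statistics


def cem_plan(score_fn, horizon, iterations=20, population=64, elite_frac=0.125, seed=0):
    """Baseline CEM over a sequence of scalar actions (illustrative sketch)."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(population)
        ]
        # Keep the highest-scoring sequences (the "elites").
        candidates.sort(key=score_fn, reverse=True)
        elites = candidates[:n_elite]
        # Refit the per-step mean and standard deviation to the elites.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean


def toy_score(actions):
    # Hypothetical stand-in for cumulative reward: highest when every action is 0.5.
    return -sum((a - 0.5) ** 2 for a in actions)


plan = cem_plan(toy_score, horizon=5)
```

Because the score here depends only on extrinsic reward, the sampler has no incentive to explore, which is the limitation the abstract attributes to planners of this kind.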

Efficient Skill Acquisition for Complex Manipulation Tasks in Obstructed Environments

Authors:Jun Yamada, Jack Collins, Ingmar Posner
Date:2023-03-06 18:49:59

Data efficiency in robotic skill acquisition is crucial for operating robots in varied small-batch assembly settings. To operate in such environments, robots must have robust obstacle avoidance and versatile goal conditioning acquired from only a few simple demonstrations. Existing approaches, however, fall short of these requirements. Deep reinforcement learning (RL) enables a robot to learn complex manipulation tasks but is often limited to small task spaces in the real world due to sample inefficiency and safety concerns. Motion planning (MP) can generate collision-free paths in obstructed environments, but cannot solve complex manipulation tasks and requires goal states often specified by a user or object-specific pose estimator. In this work, we propose a system for efficient skill acquisition that leverages an object-centric generative model (OCGM) for versatile goal identification to specify a goal for MP combined with RL to solve complex manipulation tasks in obstructed environments. Specifically, OCGM enables one-shot target object identification and re-identification in new scenes, allowing MP to guide the robot to the target object while avoiding obstacles. This is combined with a skill transition network, which bridges the gap between terminal states of MP and feasible start states of a sample-efficient RL policy. The experiments demonstrate that our OCGM-based one-shot goal identification provides competitive accuracy to other baseline approaches and that our modular framework outperforms competitive baselines, including a state-of-the-art RL algorithm, by a significant margin for complex manipulation tasks in obstructed environments.

Viewpoint Push Planning for Mapping of Unknown Confined Spaces

Authors:Nils Dengler, Sicong Pan, Vamsi Kalagaturu, Rohit Menon, Murad Dawood, Maren Bennewitz
Date:2023-03-06 13:38:25

Viewpoint planning is an important task in any application where objects or scenes need to be viewed from different angles to achieve sufficient coverage. The mapping of confined spaces such as shelves is an especially challenging task since objects occlude each other and the scene can only be observed from the front, posing limitations on the possible viewpoints. In this paper, we propose a deep reinforcement learning framework that generates promising views aiming at reducing the map entropy. Additionally, the pipeline extends standard viewpoint planning by predicting adequate minimally invasive push actions to uncover occluded objects and increase the visible space. Using a 2.5D occupancy height map as state representation that can be efficiently updated, our system decides whether to plan a new viewpoint or perform a push. To learn feasible pushes, we use a neural network to sample push candidates on the map based on training data provided by human experts. As simulated and real-world experimental results with a robotic arm show, our system is able to significantly increase the mapped space compared to different baselines, while the executed push actions highly benefit the viewpoint planner with only minor changes to the object configuration.

Local Path Planning among Pushable Objects based on Reinforcement Learning

Authors:Linghong Yao, Valerio Modugno, Andromachi Maria Delfaki, Yuanchang Liu, Danail Stoyanov, Dimitrios Kanoulas
Date:2023-03-04 12:56:15

In this paper, we introduce a method to deal with the problem of robot local path planning among pushable objects -- an open problem in robotics. In particular, we achieve that by training multiple agents simultaneously in a physics-based simulation environment, utilizing an Advantage Actor-Critic algorithm coupled with a deep neural network. The developed online policy enables these agents to push obstacles in ways that are not limited to axial alignments, adapt to unforeseen changes in obstacle dynamics instantaneously, and effectively tackle local path planning in confined areas. We tested the method in various simulated environments to prove the adaptation effectiveness to various unseen scenarios in unfamiliar settings. Moreover, we have successfully applied this policy on an actual quadruped robot, confirming its capability to handle the unpredictability and noise associated with real-world sensors and the inherent uncertainties present in unexplored object pushing tasks.

Look-Ahead AC Optimal Power Flow: A Model-Informed Reinforcement Learning Approach

Authors:Xinyue Wang, Haiwang Zhong, Guanglun Zhang, Guangchun Ruan, Yiliu He, Zekuan Yu
Date:2023-03-04 03:15:51

With the increasing proportion of renewable energy on the generation side, it becomes more difficult to accurately predict the power generation and adapt to the large deviations between the optimal dispatch scheme and the day-ahead scheduling in the process of real-time dispatch. Therefore, it is necessary to conduct look-ahead dispatches to revise the operation plan according to the real-time status of the power grid and reliable ultra-short-term prediction. Application of traditional model-driven methods is often limited by the scale of the power system and cannot meet the computational time requirements of real-time dispatch. Data-driven methods can provide strong online decision-making support abilities when facing large-scale systems, but they are limited by the quantity and quality of the training dataset. This paper proposes a model-informed reinforcement learning approach for look-ahead AC optimal power flow. The reinforcement learning model is first formulated based on the domain knowledge of economic dispatch, and then the physics-informed neural network is constructed to enhance the reliability and efficiency. Finally, the case study based on the SG 126-bus system validates the accuracy and efficiency of the proposed approach.

Towards Safety Assured End-to-End Vision-Based Control for Autonomous Racing

Authors:Dvij Kalaria, Qin Lin, John M. Dolan
Date:2023-03-03 23:49:39

Autonomous car racing is a challenging task, as it requires precise applications of control while the vehicle is operating at cornering speeds. Traditional autonomous pipelines require accurate pre-mapping, localization, and planning which make the task computationally expensive and environment-dependent. Recent works propose use of imitation and reinforcement learning to train end-to-end deep neural networks and have shown promising results for high-speed racing. However, the end-to-end models may be dangerous to be deployed on real systems, as the neural networks are treated as black-box models devoid of any provable safety guarantees. In this work we propose a decoupled approach where an optimal end-to-end controller and a state prediction end-to-end model are learned together, and the predicted state of the vehicle is used to formulate a control barrier function for safeguarding the vehicle to stay within lane boundaries. We validate our algorithm both on a high-fidelity Carla driving simulator and a 1/10-scale RC car on a real track. The evaluation results suggest that using an explicit safety controller helps to learn the task safely with fewer iterations and makes it possible to safely navigate the vehicle on the track along the more challenging racing line.
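
To illustrate how a control barrier function can keep a vehicle within lane boundaries, here is a minimal one-dimensional sketch. It assumes a hand-written lateral model in which the control directly sets the lateral velocity, a barrier h(y) = w^2 - y^2 for lane half-width w, and the condition dh/dt + alpha * h >= 0. These modelling choices, the function name, and the parameter values are all assumptions for illustration; the paper instead builds its barrier from a learned state prediction model.

```python
def cbf_filter(y, u_nominal, half_width=1.0, alpha=1.0):
    """Minimally modify u_nominal so that dh/dt + alpha * h >= 0.

    y: lateral offset from the lane centre (assumed dynamics: dy/dt = u).
    h(y) = half_width**2 - y**2 is non-negative exactly when the car is in the lane.
    """
    h = half_width ** 2 - y ** 2
    if y == 0:
        # dh/dt = -2*y*u vanishes at the centre, so no control is restricted.
        return u_nominal
    # The condition -2*y*u + alpha*h >= 0 rearranges to 2*y*u <= alpha*h.
    bound = alpha * h / (2 * y)
    if y > 0:
        return min(u_nominal, bound)  # upper bound on motion towards the right edge
    return max(u_nominal, bound)      # lower bound on motion towards the left edge


# Hypothetical example: close to the right boundary, an aggressive command
# to move further right is clipped, while a command to move left is untouched.
clipped = cbf_filter(0.9, 5.0)
unchanged = cbf_filter(0.9, -1.0)
```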

Co-learning Planning and Control Policies Constrained by Differentiable Logic Specifications

Authors:Zikang Xiong, Daniel Lawson, Joe Eappen, Ahmed H. Qureshi, Suresh Jagannathan
Date:2023-03-02 15:24:24

Synthesizing planning and control policies in robotics is a fundamental task, further complicated by factors such as complex logic specifications and high-dimensional robot dynamics. This paper presents a novel reinforcement learning approach to solving high-dimensional robot navigation tasks with complex logic specifications by co-learning planning and control policies. Notably, this approach significantly reduces the sample complexity in training, allowing us to train high-quality policies with much fewer samples compared to existing reinforcement learning algorithms. In addition, our methodology streamlines complex specification extraction from map images and enables the efficient generation of long-horizon robot motion paths across different map layouts. Moreover, our approach also demonstrates capabilities for high-dimensional control and avoiding suboptimal policies via policy alignment. The efficacy of our approach is demonstrated through experiments involving simulated high-dimensional quadruped robot dynamics and a real-world differential drive robot (TurtleBot3) under different types of task specifications.

Multi-Start Team Orienteering Problem for UAS Mission Re-Planning with Data-Efficient Deep Reinforcement Learning

Authors:Dong Ho Lee, Jaemyung Ahn
Date:2023-03-02 15:15:56

In this paper, we study the Multi-Start Team Orienteering Problem (MSTOP), a mission re-planning problem where vehicles are initially located away from the depot and have different amounts of fuel. We assume the goal of the vehicles is to maximize the sum of collected profits under resource (e.g., time, fuel) consumption constraints. Such re-planning problems occur in a wide range of intelligent UAS applications where changes in the mission environment force the operation of multiple vehicles to change from the original plan. To solve this problem with deep reinforcement learning (RL), we develop a policy network with self-attention on each partial tour and encoder-decoder attention between the partial tour and the remaining nodes. We propose a modified REINFORCE algorithm where the greedy rollout baseline is replaced by a local mini-batch baseline based on multiple, possibly non-duplicate sample rollouts. By drawing multiple samples per training instance, we can learn faster and obtain a stable policy gradient estimator with significantly fewer instances. The proposed training algorithm outperforms the conventional greedy rollout baseline, even when combined with the maximum entropy objective.
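
The idea of replacing a greedy rollout baseline with one computed from several sampled rollouts of the same instance can be sketched as follows. This sketch assumes the baseline is simply the mean reward of the samples drawn for each instance; the abstract does not give the exact form of the authors' local mini-batch baseline, so treat this as one plausible variant. The function name and the example rewards are hypothetical.

```python
def shared_baseline_advantages(rewards_per_instance):
    """Per-sample advantages using the per-instance mean reward as baseline.

    rewards_per_instance: one list of sampled-rollout rewards per training instance.
    """
    advantages = []
    for rewards in rewards_per_instance:
        # Assumed baseline: the average reward over this instance's sampled rollouts.
        baseline = sum(rewards) / len(rewards)
        advantages.append([r - baseline for r in rewards])
    return advantages


# Hypothetical rewards for two instances with three sampled rollouts each.
adv = shared_baseline_advantages([[1.0, 2.0, 3.0], [10.0, 10.0, 13.0]])
```

In a REINFORCE update, each advantage would multiply the gradient of the log-probability of the corresponding sampled tour, so rollouts that beat their instance's average are reinforced and the rest are discouraged.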

Population-based Evaluation in Repeated Rock-Paper-Scissors as a Benchmark for Multiagent Reinforcement Learning

Authors:Marc Lanctot, John Schultz, Neil Burch, Max Olan Smith, Daniel Hennes, Thomas Anthony, Julien Perolat
Date:2023-03-02 15:06:52

Progress in fields of machine learning and adversarial planning has benefited significantly from benchmark domains, from checkers and the classic UCI data sets to Go and Diplomacy. In sequential decision-making, agent evaluation has largely been restricted to few interactions against experts, with the aim to reach some desired level of performance (e.g. beating a human professional player). We propose a benchmark for multiagent learning based on repeated play of the simple game Rock, Paper, Scissors along with a population of forty-three tournament entries, some of which are intentionally sub-optimal. We describe metrics to measure the quality of agents based both on average returns and exploitability. We then show that several RL, online learning, and language model approaches can learn good counter-strategies and generalize well, but ultimately lose to the top-performing bots, creating an opportunity for research in multiagent learning.
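
As a small illustration of the game-theoretic notion behind exploitability, the sketch below computes how much a best-responding opponent can win per round against a fixed mixed strategy in single-shot Rock, Paper, Scissors. This is a textbook simplification with assumed names; the benchmark's own metrics are defined over repeated play against a population of bots and are not reproduced here.

```python
# Row player's payoff for (row action, column action), ordered Rock, Paper, Scissors.
PAYOFF = [
    [0, -1, 1],   # Rock loses to Paper, beats Scissors
    [1, 0, -1],   # Paper beats Rock, loses to Scissors
    [-1, 1, 0],   # Scissors loses to Rock, beats Paper
]


def best_response_value(opponent_mix):
    """Expected per-round payoff of the best pure response to a mixed strategy."""
    return max(
        sum(PAYOFF[a][b] * p for b, p in enumerate(opponent_mix))
        for a in range(3)
    )


uniform_value = best_response_value([1 / 3, 1 / 3, 1 / 3])
always_rock_value = best_response_value([1.0, 0.0, 0.0])
```

The uniform strategy cannot be exploited in the one-shot game, whereas any predictable bias can, which is why a population that includes intentionally sub-optimal bots rewards agents that learn to detect and counter such biases.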

Multi-UAV Adaptive Path Planning Using Deep Reinforcement Learning

Authors:Jonas Westheider, Julius Rückin, Marija Popović
Date:2023-03-02 10:54:07

Efficient aerial data collection is important in many remote sensing applications. In large-scale monitoring scenarios, deploying a team of unmanned aerial vehicles (UAVs) offers improved spatial coverage and robustness against individual failures. However, a key challenge is cooperative path planning for the UAVs to efficiently achieve a joint mission goal. We propose a novel multi-agent informative path planning approach based on deep reinforcement learning for adaptive terrain monitoring scenarios using UAV teams. We introduce new network feature representations to effectively learn path planning in a 3D workspace. By leveraging a counterfactual baseline, our approach explicitly addresses credit assignment to learn cooperative behaviour. Our experimental evaluation shows improved planning performance, i.e. maps regions of interest more quickly, with respect to non-counterfactual variants. Results on synthetic and real-world data show that our approach has superior performance compared to state-of-the-art non-learning-based methods, while being transferable to varying team sizes and communication constraints.

The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms

Authors:Anirudh Vemula, Yuda Song, Aarti Singh, J. Andrew Bagnell, Sanjiban Choudhury
Date:2023-03-01 17:42:26

We propose a novel approach to addressing two fundamental challenges in Model-based Reinforcement Learning (MBRL): the computational expense of repeatedly finding a good policy in the learned model, and the objective mismatch between model fitting and policy computation. Our "lazy" method leverages a novel unified objective, Performance Difference via Advantage in Model, to capture the performance difference between the learned policy and expert policy under the true dynamics. This objective demonstrates that optimizing the expected policy advantage in the learned model under an exploration distribution is sufficient for policy computation, resulting in a significant boost in computational efficiency compared to traditional planning methods. Additionally, the unified objective uses a value moment matching term for model fitting, which is aligned with the model's usage during policy computation. We present two no-regret algorithms to optimize the proposed objective, and demonstrate their statistical and computational gains compared to existing MBRL methods through simulated benchmarks.

Multi-Arm Robot Task Planning for Fruit Harvesting Using Multi-Agent Reinforcement Learning

Authors:Tao Li, Feng Xie, Ya Xiong, Qingchun Feng
Date:2023-03-01 12:39:30

The emergence of harvesting robotics offers a promising solution to the issue of limited agricultural labor resources and the increasing demand for fruits. Despite notable advancements in the field of harvesting robotics, the utilization of such technology in orchards is still limited. The key challenge is to improve operational efficiency. Taking into account inner-arm conflicts, couplings of DoFs, and dynamic tasks, we propose a task planning strategy for a harvesting robot with four arms in this paper. The proposed method employs a Markov game framework to formulate the four-arm robotic harvesting task, which avoids the computational complexity of solving an NP-hard scheduling problem. Furthermore, a multi-agent reinforcement learning (MARL) structure with a fully centralized collaboration protocol is used to train a MARL-based task planning network. Several simulations and orchard experiments are conducted to validate the effectiveness of the proposed method for a multi-arm harvesting robot in comparison with the existing method.

Hierarchical Reinforcement Learning in Complex 3D Environments

Authors:Bernardo Avila Pires, Feryal Behbahani, Hubert Soyer, Kyriacos Nikiforou, Thomas Keck, Satinder Singh
Date:2023-02-28 09:56:36

Hierarchical Reinforcement Learning (HRL) agents have the potential to demonstrate appealing capabilities such as planning and exploration with abstraction, transfer, and skill reuse. Recent successes with HRL across different domains provide evidence that practical, effective HRL agents are possible, even if existing agents do not yet fully realize the potential of HRL. Despite these successes, visually complex partially observable 3D environments remained a challenge for HRL agents. We address this issue with Hierarchical Hybrid Offline-Online (H2O2), a hierarchical deep reinforcement learning agent that discovers and learns to use options from scratch using its own experience. We show that H2O2 is competitive with a strong non-hierarchical Muesli baseline in the DeepMind Hard Eight tasks and we shed new light on the problem of learning hierarchical agents in complex environments. Our empirical study of H2O2 reveals previously unnoticed practical challenges and brings new perspective to the current understanding of hierarchical agents in complex domains.

Exposure-Based Multi-Agent Inspection of a Tumbling Target Using Deep Reinforcement Learning

Authors:Joshua Aurand, Steven Cutlip, Henry Lei, Kendra Lang, Sean Phillips
Date:2023-02-27 22:54:01

As space becomes more congested, on-orbit inspection is an increasingly relevant activity, whether to observe a defunct satellite for planning repairs or to de-orbit it. However, the task of on-orbit inspection itself is challenging, typically requiring the careful coordination of multiple observer satellites. This is complicated by a highly nonlinear environment where the target may be unknown or moving unpredictably without time for continuous command and control from the ground. There is a need for autonomous, robust, decentralized solutions to the inspection task. To achieve this, we consider a hierarchical, learned approach for the decentralized planning of multi-agent inspection of a tumbling target. Our solution consists of two components: a viewpoint or high-level planner trained using deep reinforcement learning and a navigation planner handling point-to-point navigation between pre-specified viewpoints. We present a novel problem formulation and methodology that is suitable not only to reinforcement learning-derived robust policies, but extendable to unknown target geometries and higher fidelity information theoretic objectives received directly from sensor inputs. Operating under limited information, our trained multi-agent high-level policies successfully contextualize information within the global hierarchical environment and are correspondingly able to inspect over 90% of non-convex tumbling targets, even in the absence of additional agent attitude control.

Puppeteer and Marionette: Learning Anticipatory Quadrupedal Locomotion Based on Interactions of a Central Pattern Generator and Supraspinal Drive

Authors:Milad Shafiee, Guillaume Bellegarda, Auke Ijspeert
Date:2023-02-26 18:32:44

Quadruped animal locomotion emerges from the interactions between the spinal central pattern generator (CPG), sensory feedback, and supraspinal drive signals from the brain. Computational models of CPGs have been widely used for investigating the spinal cord contribution to animal locomotion control in computational neuroscience and in bio-inspired robotics. However, the contribution of supraspinal drive to anticipatory behavior, i.e. motor behavior that involves planning ahead of time (e.g. of footstep placements), is not yet properly understood. In particular, it is not clear whether the brain modulates CPG activity and/or directly modulates muscle activity (hence bypassing the CPG) for accurate foot placements. In this paper, we investigate the interaction of supraspinal drive and a CPG in an anticipatory locomotion scenario that involves stepping over gaps. By employing deep reinforcement learning (DRL), we train a neural network policy that replicates the supraspinal drive behavior. This policy can either modulate the CPG dynamics, or directly change actuation signals to bypass the CPG dynamics. Our results indicate that the direct supraspinal contribution to the actuation signal is a key component for a high gap crossing success rate. However, the CPG dynamics in the spinal cord are beneficial for gait smoothness and energy efficiency. Moreover, our investigation shows that sensing the front feet distances to the gap is the most important and sufficient sensory information for learning gap crossing. Our results support the biological hypothesis that cats and horses mainly control the front legs for obstacle avoidance, and that hind limbs follow an internal memory based on the front limbs' information. Our method enables the quadruped robot to cross gaps of up to 20 cm (50% of body-length) without any explicit dynamics modeling or Model Predictive Control (MPC).

Leveraging Jumpy Models for Planning and Fast Learning in Robotic Domains

Authors:Jingwei Zhang, Jost Tobias Springenberg, Arunkumar Byravan, Leonard Hasenclever, Abbas Abdolmaleki, Dushyant Rao, Nicolas Heess, Martin Riedmiller
Date:2023-02-24 13:26:03

In this paper we study the problem of learning multi-step dynamics prediction models (jumpy models) from unlabeled experience and their utility for fast inference of (high-level) plans in downstream tasks. In particular we propose to learn a jumpy model alongside a skill embedding space offline, from previously collected experience for which no labels or reward annotations are required. We then investigate several options of harnessing those learned components in combination with model-based planning or model-free reinforcement learning (RL) to speed up learning on downstream tasks. We conduct a set of experiments in the RGB-stacking environment, showing that planning with the learned skills and the associated model can enable zero-shot generalization to new tasks, and can further speed up training of policies via reinforcement learning. These experiments demonstrate that jumpy models which incorporate temporal abstraction can facilitate planning in long-horizon tasks in which standard dynamics models fail.

Provably Efficient Reinforcement Learning via Surprise Bound

Authors:Hanlin Zhu, Ruosong Wang, Jason D. Lee
Date:2023-02-22 20:21:25

Value function approximation is important in modern reinforcement learning (RL) problems, especially when the state space is (infinitely) large. Despite the importance and wide applicability of value function approximation, its theoretical understanding is still not as sophisticated as its empirical success, especially in the context of general function approximation. In this paper, we propose a provably efficient RL algorithm (both computationally and statistically) with general value function approximations. We show that if the value functions can be approximated by a function class that satisfies the Bellman-completeness assumption, our algorithm achieves an $\widetilde{O}(\text{poly}(\iota H)\sqrt{T})$ regret bound where $\iota$ is the product of the surprise bound and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes and $T = HK$ is the total number of steps the agent interacts with the environment. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting. Moreover, our algorithm only needs to solve $O(H\log K)$ empirical risk minimization (ERM) problems, which is far more efficient than previous algorithms that need to solve $\Omega(HK)$ ERM problems.

Modular Deep Learning

Authors:Jonas Pfeiffer, Sebastian Ruder, Ivan Vulić, Edoardo Maria Ponti
Date:2023-02-22 18:11:25

Transfer learning has recently become the dominant paradigm of machine learning. Pre-trained models fine-tuned for downstream tasks achieve better performance with fewer labelled examples. Nonetheless, it remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference and that generalise systematically to non-identically distributed tasks. Modular deep learning has emerged as a promising solution to these challenges. In this framework, units of computation are often implemented as autonomous parameter-efficient modules. Information is conditionally routed to a subset of modules and subsequently aggregated. These properties enable positive transfer and systematic generalisation by separating computation from routing and updating modules locally. We offer a survey of modular architectures, providing a unified view over several threads of research that evolved independently in the scientific literature. Moreover, we explore various additional purposes of modularity, including scaling language models, causal inference, programme induction, and planning in reinforcement learning. Finally, we report various concrete applications where modularity has been successfully deployed such as cross-lingual and cross-modal knowledge transfer. Talks and projects related to this survey are available at https://www.modulardeeplearning.com/.

A Supervisory Learning Control Framework for Autonomous & Real-time Task Planning for an Underactuated Cooperative Robotic task

Authors:Sander De Witte, Tom Lefebvre, Thijs Van Hauwermeiren, Guillaume Crevecoeur
Date:2023-02-22 16:59:06

We introduce a framework for cooperative manipulation, applied to an underactuated manipulation problem. Two stationary robotic manipulators are required to cooperate in order to reposition an object within their shared work space. Control of multi-agent systems for manipulation tasks cannot rely on individual control strategies with little to no communication between the agents that serve the common objective through swarming. Instead, a coordination strategy is required that queries subtasks to the individual agents. We formulate the problem in a Task And Motion Planning (TAMP) setting, while considering a decomposition strategy that allows us to treat the task and motion planning problems separately. We solve the supervisory planning problem offline using deep Reinforcement Learning techniques, resulting in a supervisory policy capable of coordinating the two manipulators into a successful execution of the pick-and-place task. Additionally, a benefit of solving the task planning problem offline is the possibility of real-time (re)planning, demonstrating robustness in the event of subtask execution failure or on-the-fly task changes. The framework achieved zero-shot deployment on the real setup with a success rate that is higher than 90%.

Learning Agile Flights through Narrow Gaps with Varying Angles using Onboard Sensing

Authors:Yuhan Xie, Minghao Lu, Rui Peng, Peng Lu
Date:2023-02-22 09:25:53

This paper addresses the problem of traversing through unknown, tilted, and narrow gaps for quadrotors using Deep Reinforcement Learning (DRL). Previous learning-based methods relied on accurate knowledge of the environment, including the gap's pose and size. In contrast, we integrate onboard sensing and detect the gap from a single onboard camera. The training problem is challenging for two reasons: a precise and robust whole-body planning and control policy is required for variable-tilted and narrow gaps, and an effective Sim2Real method is needed to successfully conduct real-world experiments. To this end, we propose a learning framework for agile gap traversal flight, which successfully trains the vehicle to traverse through the center of the gap at an approximate attitude to the gap with aggressive tilted angles. The policy trained only in a simulation environment can be transferred into different domains with fine-tuning while maintaining the success rate. Our proposed framework, which integrates onboard sensing and a neural network controller, achieves a success rate of 84.51% in real-world experiments, with gap orientations up to 60 degrees. To the best of our knowledge, this is the first paper that performs learning-based variable-tilted narrow gap traversal flight in the real world, without prior knowledge of the environment.

Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management

Authors:Dhawal Gupta, Yinlam Chow, Aza Tulepbergenov, Mohammad Ghavamzadeh, Craig Boutilier
Date:2023-02-21 18:02:20

Reinforcement learning (RL) has shown great promise for developing dialogue management (DM) agents that are non-myopic, conduct rich conversations, and maximize overall user satisfaction. Despite recent developments in RL and language models (LMs), using RL to power conversational chatbots remains challenging, in part because RL requires online exploration to learn effectively, whereas collecting novel human-bot interactions can be expensive and unsafe. This issue is exacerbated by the combinatorial action spaces facing these algorithms, as most LM agents generate responses at the word level. We develop a variety of RL algorithms, specialized to dialogue planning, that leverage recent Mixture-of-Expert Language Models (MoE-LMs) -- models that capture diverse semantics, generate utterances reflecting different intents, and are amenable to multi-turn DM. By exploiting MoE-LM structure, our methods significantly reduce the size of the action space and improve the efficacy of RL-based DM. We evaluate our methods in open-domain dialogue to demonstrate their effectiveness w.r.t. the diversity of intent in generated utterances and overall DM performance.

UAV Path Planning Employing MPC- Reinforcement Learning Method Considering Collision Avoidance

Authors:Mahya Ramezani, Hamed Habibi, Jose Luis Sanchez Lopez, Holger Voos
Date:2023-02-21 13:39:40

In this paper, we tackle the problem of Unmanned Aerial Vehicle (UAV) path planning in complex and uncertain environments by designing a Model Predictive Control (MPC), based on a Long-Short-Term Memory (LSTM) network integrated into the Deep Deterministic Policy Gradient (DDPG) algorithm. In the proposed solution, LSTM-MPC operates as a deterministic policy within the DDPG network, and it leverages a predicting pool to store predicted future states and actions for improved robustness and efficiency. The use of the predicting pool also enables the initialization of the critic network, leading to improved convergence speed and reduced failure rate compared to traditional reinforcement learning and deep reinforcement learning methods. The effectiveness of the proposed solution is evaluated by numerical simulations.

Understanding the effect of varying amounts of replay per step

Authors:Animesh Kumar Paul, Videh Raj Nema
Date:2023-02-20 20:54:11

Model-based reinforcement learning uses models to plan, where the predictions and policies of an agent can be improved by using more computation without additional data from the environment, thereby improving sample efficiency. However, learning accurate estimates of the model is hard. Consequently, the natural question is whether we can get benefits similar to those of planning with model-free methods. Experience replay is an essential component of many model-free algorithms, enabling sample-efficient learning and stability by providing a mechanism to store past experiences for further reuse in the gradient computation. Prior works have established connections between models and experience replay by planning with the latter. This involves increasing the number of times a mini-batch is sampled and used for updates at each step (amount of replay per step). We attempt to exploit this connection by doing a systematic study on the effect of varying amounts of replay per step in a well-known model-free algorithm: Deep Q-Network (DQN) in the Mountain Car environment. We empirically show that increasing replay improves DQN's sample efficiency, reduces the variation in its performance, and makes it more robust to changes in hyperparameters. Altogether, this takes a step toward a better algorithm for deployment.
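
To make "amount of replay per step" concrete, the following is a minimal sketch and not the authors' setup: it swaps DQN on Mountain Car for tabular Q-learning on a hypothetical 5-state chain with a uniformly random behaviour policy, so that the only knob is how many mini-batches are replayed per environment step. The function name run_q_learning and all constants are assumptions made for illustration.

```python
import random
from collections import deque


def run_q_learning(replay_per_step, steps=200, batch_size=4, seed=0):
    """Tabular Q-learning with experience replay on a toy 5-state chain.

    replay_per_step is the number of mini-batches sampled and applied after
    every environment step. Returns (Q, number_of_minibatch_updates).
    """
    rng = random.Random(seed)
    n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
    alpha, gamma = 0.1, 0.9
    Q = [[0.0] * n_actions for _ in range(n_states)]
    buffer = deque(maxlen=1000)           # replay buffer of past transitions
    updates = 0
    state = 0
    for _ in range(steps):
        # Uniformly random behaviour policy (Q-learning is off-policy).
        action = rng.randrange(n_actions)
        if action == 1:
            next_state = min(state + 1, n_states - 1)
        else:
            next_state = max(state - 1, 0)
        done = next_state == n_states - 1  # rightmost state ends the episode
        reward = 1.0 if done else 0.0
        buffer.append((state, action, reward, next_state, done))
        state = 0 if done else next_state

        # Replay: more iterations here means more computation per sample,
        # with no additional environment data.
        for _ in range(replay_per_step):
            batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(Q[s2])
                Q[s][a] += alpha * (target - Q[s][a])
            updates += 1
    return Q, updates


if __name__ == "__main__":
    for k in (1, 4, 16):
        Q, n_updates = run_q_learning(replay_per_step=k)
        print(k, n_updates, [round(max(row), 3) for row in Q])
```

Every run consumes the same 200 environment steps; only the number of value updates grows with replay_per_step, which is the trade-off between computation and data that the abstract studies.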

Interactive Video Corpus Moment Retrieval using Reinforcement Learning

Authors:Zhixin Ma, Chong-Wah Ngo
Date:2023-02-19 09:48:23

Known-item video search is effective with a human in the loop to interactively investigate the search result and refine the initial query. Nevertheless, when the first few pages of results are swamped with visually similar items, or the search target is hidden deep in the ranked list, finding the known-item target usually requires a long duration of browsing and result inspection. This paper tackles the problem by reinforcement learning, aiming to reach a search target within a few rounds of interaction by long-term learning from user feedback. Specifically, the system interactively plans a navigation path based on feedback and recommends a potential target that maximizes the long-term reward for user comment. We conduct experiments for the challenging task of video corpus moment retrieval (VCMR) to localize moments from a large video corpus. The experimental results on TVR and DiDeMo datasets verify that our proposed work is effective in retrieving the moments that are hidden deep inside the ranked lists of CONQUER and HERO, which are the state-of-the-art auto-search engines for VCMR.

Reinforcement Learning in the Wild with Maximum Likelihood-based Model Transfer

Authors:Hannes Eriksson, Debabrota Basu, Tommy Tram, Mina Alibeigi, Christos Dimitrakakis
Date:2023-02-18 09:47:34

In this paper, we study the problem of transferring the available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to it as the Model Transfer Reinforcement Learning (MTRL) problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous state actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in discrete and continuous settings. In the first stage, MLEMTRL uses a constrained Maximum Likelihood Estimation (MLE)-based approach to estimate the target MDP model using a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL both in realisable and non-realisable settings. We empirically demonstrate that MLEMTRL allows faster learning in new MDPs than learning from scratch and achieves near-optimal performance depending on the similarity of the available MDPs and the target MDP.
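
As a rough illustration of the first stage for discrete MDPs, the sketch below assumes that "constrained" means restricting the estimate to convex combinations of the known transition models; the abstract does not state the constraint set, so this is an assumption and not necessarily MLEMTRL itself. The helper name mle_mixture_weights is hypothetical, and the weights are fitted with the standard EM update for mixture proportions with fixed components.

```python
import numpy as np


def mle_mixture_weights(models, counts, iters=200):
    """Maximum-likelihood weights for a convex combination of known models.

    models: array of shape (m, S, A, S), known kernels P_i(s' | s, a).
    counts: array of shape (S, A, S), observed transition counts in the
            target MDP.
    Returns w on the probability simplex maximising
        sum_{s,a,s'} counts[s,a,s'] * log(sum_i w_i * P_i(s' | s, a)).
    Assumes every observed transition has positive probability under at
    least one known model.
    """
    models = np.asarray(models, dtype=float)
    counts = np.asarray(counts, dtype=float)
    m = models.shape[0]
    w = np.full(m, 1.0 / m)               # start from uniform weights
    total = counts.sum()
    for _ in range(iters):
        mix = np.tensordot(w, models, axes=1)      # current estimate, (S, A, S)
        ratio = np.divide(models, mix, out=np.zeros_like(models), where=mix > 0)
        # EM step: each weight becomes its average responsibility.
        w = w * (counts * ratio).sum(axis=(1, 2, 3)) / total
    return w


if __name__ == "__main__":
    # Two known models over 2 states and 1 action (toy numbers).
    P1 = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])
    P2 = np.array([[[0.1, 0.9]], [[0.8, 0.2]]])
    # Expected counts from a target that is a 0.7 / 0.3 blend of the two.
    counts = 1000 * (0.7 * P1 + 0.3 * P2)
    w = mle_mixture_weights(np.stack([P1, P2]), counts)
    estimated_model = np.tensordot(w, np.stack([P1, P2]), axes=1)
    print(w, estimated_model)
```

The estimated kernel would then be handed to a planner suited to the MDP class (for example value iteration in the discrete case), which corresponds to the second stage described above and is not shown here.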

Robot path planning using deep reinforcement learning

Authors:Miguel Quinones-Ramirez, Jorge Rios-Martinez, Victor Uc-Cetina
Date:2023-02-17 20:08:59

Autonomous navigation is challenging for mobile robots, especially in an unknown environment. Commonly, the robot requires multiple sensors to map the environment, locate itself, and make a plan to reach the target. However, reinforcement learning methods offer a map-free alternative for navigation tasks by learning the optimal actions to take. In this article, deep reinforcement learning agents are implemented using variants of the deep Q networks method, the D3QN and Rainbow algorithms, for both the obstacle avoidance and the goal-oriented navigation task. The agents are trained and evaluated in a simulated environment. Furthermore, an analysis of the changes in the behaviour and performance of the agents caused by modifications in the reward function is conducted.

Dynamic Grasping with a Learned Meta-Controller

Authors:Yinsen Jia, Jingxi Xu, Dinesh Jayaraman, Shuran Song
Date:2023-02-16 18:14:13

Grasping moving objects is a challenging task that requires multiple submodules such as object pose predictor, arm motion planner, etc. Each submodule operates under its own set of meta-parameters. Examples include how far the pose predictor should look into the future (i.e., look-ahead time) and the maximum amount of time the motion planner can spend planning a motion (i.e., time budget). Many previous works assign fixed values to these parameters; however, at different moments within a single episode of dynamic grasping, the optimal values should vary depending on the current scene. In this work, we propose a dynamic grasping pipeline with a meta-controller that controls the look-ahead time and time budget dynamically. We learn the meta-controller through reinforcement learning with a sparse reward. Our experiments show the meta-controller improves the grasping success rate (up to 28% in the most cluttered environment) and reduces grasping time, compared to the strongest baseline. Our meta-controller learns to reason about the reachable workspace and maintain the predicted pose within the reachable region. In addition, it assigns a small but sufficient time budget for the motion planner. Our method can handle different objects, trajectories, and obstacles. Despite being trained only with 3-6 random cuboidal obstacles, our meta-controller generalizes well to 7-9 obstacles and more realistic out-of-domain household setups with unseen obstacle shapes.

Reinforcement Learning Based Power Grid Day-Ahead Planning and AI-Assisted Control

Authors:Anton R. Fuxjäger, Kristian Kozak, Matthias Dorfer, Patrick M. Blies, Marcel Wasserer
Date:2023-02-15 13:38:40

The ongoing transition to renewable energy is increasing the share of fluctuating power sources like wind and solar, raising power grid volatility and making grid operation increasingly complex and costly. In our prior work, we have introduced a congestion management approach consisting of a redispatching optimizer combined with a machine learning-based topology optimization agent. Compared to a typical redispatching-only agent, it was able to keep a simulated grid in operation longer while at the same time reducing operational cost. Our approach also ranked 1st in the L2RPN 2022 competition initiated by RTE, Europe's largest grid operator. The aim of this paper is to bring this promising technology closer to the real world of power grid operation. We deploy RL-based agents in two settings resembling established workflows, AI-assisted day-ahead planning and realtime control, in an attempt to show the benefits and caveats of this new technology. We then analyse congestion, redispatching and switching profiles, and conduct an elementary sensitivity analysis that provides a glimpse of operation robustness. While there is still a long way to a real control room, we believe that this paper and the associated prototypes help to narrow the gap and pave the way for a safe deployment of RL agents in tomorrow's power grids.

TiZero: Mastering Multi-Agent Football with Curriculum Learning and Self-Play

Authors:Fanqi Lin, Shiyu Huang, Tim Pearce, Wenze Chen, Wei-Wei Tu
Date:2023-02-15 08:19:18

Multi-agent football poses an unsolved challenge in AI research. Existing work has focused on tackling simplified scenarios of the game, or else leveraging expert demonstrations. In this paper, we develop a multi-agent system to play the full 11 vs. 11 game mode, without demonstrations. This game mode contains aspects that present major challenges to modern reinforcement learning algorithms: multi-agent coordination, long-term planning, and non-transitivity. To address these challenges, we present TiZero, a self-evolving, multi-agent system that learns from scratch. TiZero introduces several innovations, including adaptive curriculum learning, a novel self-play strategy, and an objective that optimizes the policies of multiple agents jointly. Experimentally, it outperforms previous systems by a large margin on the Google Research Football environment, increasing win rates by over 30%. To demonstrate the generality of TiZero's innovations, they are assessed on several environments beyond football: Overcooked, Multi-agent Particle-Environment, Tic-Tac-Toe and Connect-Four.

Quantum algorithms applied to satellite mission planning for Earth observation

Authors:Serge Rainjonneau, Igor Tokarev, Sergei Iudin, Saaketh Rayaprolu, Karan Pinto, Daria Lemtiuzhnikova, Miras Koblan, Egor Barashov, Mo Kordzanganeh, Markus Pflitsch, Alexey Melnikov
Date:2023-02-14 16:49:25

Earth imaging satellites are a crucial part of our everyday lives that enable global tracking of industrial activities. Use cases span many applications, from weather forecasting to digital maps, carbon footprint tracking, and vegetation monitoring. However, there are limitations; satellites are difficult to manufacture, expensive to maintain, and tricky to launch into orbit. Therefore, satellites must be employed efficiently. This poses a challenge known as the satellite mission planning problem, which could be computationally prohibitive to solve on large scales. However, close-to-optimal algorithms, such as greedy reinforcement learning and optimization algorithms, can often provide satisfactory resolutions. This paper introduces a set of quantum algorithms to solve the mission planning problem and demonstrates an advantage over the classical algorithms implemented thus far. The problem is formulated as maximizing the number of high-priority tasks completed on real datasets containing thousands of tasks and multiple satellites. This work demonstrates that through solution-chaining and clustering, optimization and machine learning algorithms offer the greatest potential for optimal solutions. This paper notably illustrates that a hybridized quantum-enhanced reinforcement learning agent can achieve a completion percentage of 98.5% over high-priority tasks, significantly improving over the baseline greedy methods with a completion rate of 75.8%. The results presented in this work pave the way to quantum-enabled solutions in the space industry and, more generally, future mission planning problems across industries.

Towards Minimax Optimality of Model-based Robust Reinforcement Learning

Authors:Pierre Clavier, Erwan Le Pennec, Matthieu Geist
Date:2023-02-10 16:50:40

We study the sample complexity of obtaining an $\epsilon$-optimal policy in Robust discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, and it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much scarcer. For $sa$- (resp. $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of any planning algorithm (with high accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid\mid A \mid}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements of $\mid S \mid$ and $\mid S \mid\mid A \mid$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid }{\epsilon^2})$, recovering the lower-bound for the non-robust case for the first time and a robust lower-bound when the size of the uncertainty is small enough.

Improving Zero-Shot Coordination Performance Based on Policy Similarity

Authors:Lebin Yu, Yunbo Qiu, Quanming Yao, Xudong Zhang, Jian Wang
Date:2023-02-10 05:42:52

In recent years, multi-agent reinforcement learning has achieved remarkable performance in multi-agent planning and scheduling tasks. It typically follows the self-play setting, where agents are trained by playing with a fixed group of agents. However, in the face of zero-shot coordination, where an agent must coordinate with unseen partners, self-play agents may fail. Several methods have been proposed to handle this problem, but they either take a lot of time or lack generalizability. In this paper, we first reveal an important phenomenon: the zero-shot coordination performance is strongly linearly correlated with the similarity between an agent's training partner and testing partner. Inspired by it, we put forward a Similarity-Based Robust Training (SBRT) scheme that improves agents' zero-shot coordination performance by disturbing their partners' actions during training according to a pre-defined policy similarity value. To validate its effectiveness, we apply our scheme to three multi-agent reinforcement learning frameworks and achieve better performance compared with previous methods.

RIS-Assisted Jamming Rejection and Path Planning for UAV-Borne IoT Platform: A New Deep Reinforcement Learning Framework

Authors:Shuyan Hu, Xin Yuan, Wei Ni, Xin Wang, Abbas Jamalipour
Date:2023-02-10 00:36:56

This paper presents a new deep reinforcement learning (DRL)-based approach to the trajectory planning and jamming rejection of an unmanned aerial vehicle (UAV) for Internet-of-Things (IoT) applications. Jamming can prevent timely delivery of sensing data and reception of operation instructions. With the assistance of a reconfigurable intelligent surface (RIS), we propose to augment the radio environment, suppress jamming signals, and enhance the desired signals. The UAV is designed to learn its trajectory and the RIS configuration based solely on changes in its received data rate, using the latest deep deterministic policy gradient (DDPG) and twin delayed DDPG (TD3) models. Simulations show that the proposed DRL algorithms give the UAV strong resistance against jamming and that the TD3 algorithm exhibits faster and smoother convergence than the DDPG algorithm, and is better suited to larger RISs. This DRL-based approach eliminates the need for knowledge of the channels involving the RIS and jammer, thereby offering significant practical value.

What are the mechanisms underlying metacognitive learning?

Authors:Ruiqi He, Falk Lieder
Date:2023-02-09 18:49:10

How is it that humans can solve complex planning tasks so efficiently despite limited cognitive resources? One reason is their ability to know how to use their limited computational resources to make clever choices. We postulate that people learn this ability from trial and error (metacognitive reinforcement learning). Here, we systematize models of the underlying learning mechanisms and enhance them with more sophisticated additional mechanisms. We fit the resulting 86 models to human data collected in previous experiments where different phenomena of metacognitive learning were demonstrated and performed Bayesian model selection. Our results suggest that a gradient ascent through the space of cognitive strategies can explain most of the observed qualitative phenomena, and is therefore a promising candidate for explaining the mechanism underlying metacognitive learning.

Catch Planner: Catching High-Speed Targets in the Flight

Authors:Huan Yu, Pengqin Wang, Jin Wang, Jialin Ji, Zhi Zheng, Jie Tu, Guodong Lu, Jun Meng, Meixin Zhu, Shaojie Shen, Fei Gao
Date:2023-02-09 00:57:52

Catching high-speed targets in flight is a complex, highly dynamic task. In this paper, we propose Catch Planner, a planning-with-decision scheme for catching. For sequential decision making, we propose a policy search method based on deep reinforcement learning. In order to make catching adaptive and flexible, we propose a trajectory optimization method to jointly optimize the highly coupled catching time and terminal state while considering the dynamic feasibility and safety. We also propose a flexible constraint transcription method to catch targets at any reasonable attitude and terminal position bias. The proposed Catch Planner provides a new paradigm for the combination of learning and planning and is integrated on a quadrotor of our own design, where it runs at 100 Hz on the onboard computer. Extensive experiments are carried out in real and simulated scenes to verify the robustness of the proposed method and its extensibility when facing a variety of high-speed flying targets.

Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning

Authors:Volodymyr Tkachuk, Seyed Alireza Bakhtiari, Johannes Kirschner, Matej Jusup, Ilija Bogunovic, Csaba Szepesvári
Date:2023-02-08 23:42:49

A practical challenge in reinforcement learning is posed by combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows us to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.
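
To see why an additive decomposition helps, the sketch below assumes the joint value splits as Q(s, a) = sum_i Q_i(s, a_i), in which case the greedy joint action can be found by maximising each agent's term separately instead of enumerating all joint actions. This only illustrates the argmax step under that assumption and is not the paper's planning algorithm; both function names are hypothetical.

```python
import itertools

import numpy as np


def greedy_joint_action_additive(per_agent_q):
    """Greedy joint action when Q(s, a) = sum_i Q_i(s, a_i).

    per_agent_q: list of 1-D arrays; entry i holds Q_i(s, a_i) for each
    local action a_i of agent i. Cost is linear in the number of agents.
    """
    return tuple(int(np.argmax(q)) for q in per_agent_q)


def greedy_joint_action_bruteforce(per_agent_q):
    """Reference implementation: enumerate every joint action.

    Cost grows as the product of the local action-set sizes.
    """
    best_joint, best_value = None, -np.inf
    for joint in itertools.product(*(range(len(q)) for q in per_agent_q)):
        value = sum(q[a] for q, a in zip(per_agent_q, joint))
        if value > best_value:
            best_joint, best_value = joint, value
    return best_joint


if __name__ == "__main__":
    # Three agents with 3, 2 and 4 local actions (toy numbers).
    q_values = [
        np.array([0.1, 0.9, 0.3]),
        np.array([2.0, -1.0]),
        np.array([0.0, 0.5, 0.4, 0.45]),
    ]
    print(greedy_joint_action_additive(q_values))    # per-agent maximisation
    print(greedy_joint_action_bruteforce(q_values))  # full enumeration
```

With n agents and k local actions each, the additive version inspects n * k values while the enumeration inspects k ** n joint actions, which is the combinatorial blow-up the abstract refers to.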

Shared Information-Based Safe And Efficient Behavior Planning For Connected Autonomous Vehicles

Authors:Songyang Han, Shanglin Zhou, Lynn Pepin, Jiangwei Wang, Caiwen Ding, Fei Miao
Date:2023-02-08 20:31:41

The recent advancements in wireless technology enable connected autonomous vehicles (CAVs) to gather data via vehicle-to-vehicle (V2V) communication, such as processed LIDAR and camera data from other vehicles. In this work, we design an integrated information sharing and safe multi-agent reinforcement learning (MARL) framework for CAVs, to take advantage of the extra information when making decisions to improve traffic efficiency and safety. We first use weight pruned convolutional neural networks (CNN) to process the raw image and point cloud LIDAR data locally at each autonomous vehicle, and share CNN-output data with neighboring CAVs. We then design a safe actor-critic algorithm that utilizes both a vehicle's local observation and the information received via V2V communication to explore an efficient behavior planning policy with safety guarantees. Using the CARLA simulator for experiments, we show that our approach improves the CAV system's efficiency in terms of average velocity and comfort under different CAV ratios and different traffic densities. We also show that our approach avoids the execution of unsafe actions and always maintains a safe distance from other vehicles. We construct an obstacle-at-corner scenario to show that the shared vision can help CAVs to observe obstacles earlier and take action to avoid traffic jams.

Learning Interaction-aware Motion Prediction Model for Decision-making in Autonomous Driving

Authors:Zhiyu Huang, Haochen Liu, Jingda Wu, Wenhui Huang, Chen Lv
Date:2023-02-08 08:32:16

Predicting the behaviors of other road users is crucial to safe and intelligent decision-making for autonomous vehicles (AVs). However, most motion prediction models ignore the influence of the AV's actions and the planning module has to treat other agents as unalterable moving obstacles. To address this problem, this paper proposes an interaction-aware motion prediction model that is able to predict other agents' future trajectories according to the ego agent's future plan, i.e., their reactions to the ego's actions. Specifically, we employ Transformers to effectively encode the driving scene and incorporate the AV's plan in decoding the predicted trajectories. To train the model to accurately predict the reactions of other agents, we develop an online learning framework, where the ego agent explores the environment and collects other agents' reactions to itself. We validate the decision-making and learning framework in three highly interactive simulated driving scenarios. The results reveal that our decision-making method significantly outperforms the reinforcement learning methods in terms of data efficiency and performance. We also find that using the interaction-aware model can bring better performance than the non-interaction-aware model and the exploration process helps improve the success rate in testing.

Predictable MDP Abstraction for Unsupervised Model-Based RL

Authors:Seohong Park, Sergey Levine
Date:2023-02-08 07:37:51

A key component of model-based reinforcement learning (RL) is a dynamics model that predicts the outcomes of actions. Errors in this predictive model can degrade the performance of model-based controllers, and complex Markov decision processes (MDPs) can present exceptionally difficult prediction problems. To mitigate this issue, we propose predictable MDP abstraction (PMA): instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space that only permits predictable, easy-to-model actions, while covering the original state-action space as much as possible. As a result, model learning becomes easier and more accurate, which allows robust, stable model-based planning or model-based RL. This transformation is learned in an unsupervised manner, before any task is specified by the user. Downstream tasks can then be solved with model-based control in a zero-shot fashion, without additional environment interactions. We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches in a range of benchmark environments. Our code and videos are available at https://seohong.me/projects/pma/

ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines

Authors:Siyuan Chen, Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry
Date:2023-02-08 02:56:36

Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging as the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that result in suboptimal batching decisions. Further, batching puts strict restrictions on memory adjacency and can lead to high data movement costs. In this paper, we provide an approach for batching dynamic DNNs based on finite state machines, which enables the automatic discovery of batching policies specialized for each DNN via reinforcement learning. Moreover, we find that memory planning that is aware of the batching policy can save significant data movement overheads, which is automated by a PQ tree-based algorithm we introduce. Experimental results show that our framework speeds up state-of-the-art frameworks by on average 1.15x, 1.39x, and 2.45x for chain-based, tree-based, and lattice-based DNNs across CPU and GPU.

Holistic Deep-Reinforcement-Learning-based Training of Autonomous Navigation Systems

Authors:Linh Kästner, Marvin Meusel, Teham Bhuiyan, Jens Lambrecht
Date:2023-02-06 16:52:15

In recent years, Deep Reinforcement Learning has emerged as a promising approach for autonomous navigation of ground vehicles and has been utilized in various areas of navigation such as cruise control, lane changing, or obstacle avoidance. However, most research works either focus on providing an end-to-end solution, training the whole system using Deep Reinforcement Learning, or focus on one specific aspect such as local motion planning. This, however, comes with a number of problems such as catastrophic forgetting, inefficient navigation behavior, and non-optimal synchronization between different entities of the navigation stack. In this paper, we propose a holistic Deep Reinforcement Learning training approach in which the training procedure involves all entities of the navigation stack. This should enhance the synchronization between, and understanding of, all entities of the navigation stack and, as a result, improve navigational performance. We trained several agents with a number of different observation spaces to study the impact of different inputs on the navigation behavior of the agent. In extensive evaluations against multiple learning-based and classic model-based navigation approaches, our proposed agent outperformed the baselines in terms of efficiency and safety, attaining shorter path lengths, fewer roundabout paths, and fewer collisions.

Generating Dispatching Rules for the Interrupting Swap-Allowed Blocking Job Shop Problem Using Graph Neural Network and Reinforcement Learning

Authors:Vivian W. H. Wong, Sang Hun Kim, Junyoung Park, Jinkyoo Park, Kincho H. Law
Date:2023-02-05 23:35:21

The interrupting swap-allowed blocking job shop problem (ISBJSSP) is a complex scheduling problem that is able to model many manufacturing planning and logistics applications realistically by addressing both the lack of storage capacity and unforeseen production interruptions. Subject to random disruptions due to machine malfunction or maintenance, industry production settings often choose to adopt dispatching rules to enable adaptive, real-time re-scheduling, rather than traditional methods that require costly re-computation on the new configuration every time the problem condition changes dynamically. To generate dispatching rules for the ISBJSSP, we introduce a dynamic disjunctive graph formulation characterized by nodes and edges subject to continuous deletions and additions. This formulation enables the training of an adaptive scheduler utilizing graph neural networks and reinforcement learning. Furthermore, a simulator is developed to simulate interruption, swapping, and blocking in the ISBJSSP setting. Employing a set of reported benchmark instances, we conduct a detailed experimental study on ISBJSSP instances with a range of machine shutdown probabilities to show that the scheduling policies generated can outperform or are at least as competitive as existing dispatching rules with predetermined priority. This study shows that the ISBJSSP, which requires real-time adaptive solutions, can be scheduled efficiently with the proposed method when production interruptions occur with random machine shutdowns.

Developing Driving Strategies Efficiently: A Skill-Based Hierarchical Reinforcement Learning Approach

Authors:Yigit Gurses, Kaan Buyukdemirci, Yildiray Yildiz
Date:2023-02-04 15:09:51

Driving in dense traffic with human and autonomous drivers is a challenging task that requires high-level planning and reasoning. Human drivers can achieve this task comfortably, and there have been many efforts to model human driver strategies. These strategies can be used as inspiration for developing autonomous driving algorithms or to create high-fidelity simulators. Reinforcement learning is a common tool to model driver policies, but conventional training of these models can be computationally expensive and time-consuming. To address this issue, in this paper, we propose "skill-based" hierarchical driving strategies, where motion primitives, i.e., skills, are designed and used as high-level actions. This reduces the training time for applications that require multiple models with varying behavior. Simulation results in a merging scenario demonstrate that the proposed approach yields driver models that achieve higher performance with less training compared to baseline reinforcement learning methods.

Reinforcement Learning with History-Dependent Dynamic Contexts

Authors:Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutilier
Date:2023-02-04 01:58:21

We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.

AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners

Authors:Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, Ping Luo
Date:2023-02-03 17:28:59

Diffusion models have demonstrated their powerful generative capability in many tasks, with great potential to serve as a paradigm for offline reinforcement learning. However, the quality of the diffusion model is limited by the insufficient diversity of training data, which hinders the performance of planning and the generalizability to new tasks. This paper introduces AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model, and hence become a better planner, not only for seen tasks but also for unseen tasks. AdaptDiffuser enables the generation of rich synthetic expert data for goal-conditioned tasks using guidance from reward gradients. It then selects high-quality data via a discriminator to finetune the diffusion model, which improves the generalization ability to unseen tasks. Empirical experiments on two benchmark environments and two carefully designed unseen tasks in KUKA industrial robot arm and Maze2D environments demonstrate the effectiveness of AdaptDiffuser. For example, AdaptDiffuser not only outperforms the prior method Diffuser by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data. More visualization results and demo videos can be found on our project page.

Learning, Fast and Slow: A Goal-Directed Memory-Based Approach for Dynamic Environments

Authors:John Chong Min Tan, Mehul Motani
Date:2023-01-31 16:47:09

Model-based next state prediction and state value prediction are slow to converge. To address these challenges, we do the following: i) Instead of a neural network, we do model-based planning using a parallel memory retrieval system (which we term the slow mechanism); ii) Instead of learning state values, we guide the agent's actions using goal-directed exploration, by using a neural network to choose the next action given the current state and the goal state (which we term the fast mechanism). The goal-directed exploration is trained online using hippocampal replay of visited states and future imagined states every single time step, leading to fast and efficient training. Empirical studies show that our proposed method has a 92% solve rate across 100 episodes in a dynamically changing grid world, significantly outperforming state-of-the-art actor-critic mechanisms such as PPO (54%), TRPO (50%) and A2C (24%). Ablation studies demonstrate that both mechanisms are crucial. We posit that the future of Reinforcement Learning (RL) will be to model goals and sub-goals for various tasks, and to plan them out in a goal-directed memory-based approach.

Retrosynthetic Planning with Dual Value Networks

Authors:Guoqing Liu, Di Xue, Shufang Xie, Yingce Xia, Austin Tripp, Krzysztof Maziarz, Marwin Segler, Tao Qin, Zongzhang Zhang, Tie-Yan Liu
Date:2023-01-31 16:43:53

Retrosynthesis, which aims to find a route to synthesize a target molecule from commercially available starting materials, is a critical task in drug discovery and materials design. Recently, the combination of ML-based single-step reaction predictors with multi-step planners has led to promising results. However, the single-step predictors are mostly trained offline to optimize the single-step accuracy, without considering complete routes. Here, we leverage reinforcement learning (RL) to improve the single-step predictor, by using a tree-shaped MDP to optimize complete routes. Specifically, we propose a novel online training algorithm, called Planning with Dual Value Networks (PDVN), which alternates between the planning phase and updating phase. In PDVN, we construct two separate value networks to predict the synthesizability and cost of molecules, respectively. To maintain the single-step accuracy, we design a two-branch network structure for the single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm improves the search success rate of existing multi-step planners (e.g., increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the number of model calls by half while solving 99.47% of molecules for RetroGraph). Additionally, PDVN helps find shorter synthesis routes (e.g., reducing the average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for RetroGraph). Our code is available at https://github.com/DiXue98/PDVN.

Hierarchical Imitation Learning with Vector Quantized Models

Authors:Kalle Kujanpää, Joni Pajarinen, Alexander Ilin
Date:2023-01-30 15:04:39

The ability to plan actions on multiple levels of abstraction enables intelligent agents to solve complex tasks effectively. However, learning the models for both low and high-level planning from demonstrations has proven challenging, especially with higher-dimensional inputs. To address this issue, we propose to use reinforcement learning to identify subgoals in expert trajectories by associating the magnitude of the rewards with the predictability of low-level actions given the state and the chosen subgoal. We build a vector-quantized generative model for the identified subgoals to perform subgoal-level planning. In experiments, the algorithm excels at solving complex, long-horizon decision-making problems, outperforming the state of the art. Because of its ability to plan, our algorithm can find better trajectories than the ones in the training set.

Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

Authors:Deyao Zhu, Yuhui Wang, Jürgen Schmidhuber, Mohamed Elhoseiny
Date:2023-01-30 13:30:56

Offline RL methods have been shown to reduce the need for environment interaction by training agents using offline collected episodes. However, these methods typically require action information to be logged during data collection, which can be difficult or even impossible in some practical cases. In this paper, we investigate the potential of using action-free offline datasets to improve online reinforcement learning, and name this problem Reinforcement Learning with Action-Free Offline Pretraining (AFP-RL). We introduce Action-Free Guide (AF-Guide), a method that guides online training by extracting knowledge from action-free offline datasets. AF-Guide consists of two components: an Action-Free Decision Transformer (AFDT), which implements a variant of Upside-Down Reinforcement Learning and learns to plan the next states from the offline dataset, and a Guided Soft Actor-Critic (Guided SAC), which learns online with guidance from AFDT. Experimental results show that AF-Guide can improve sample efficiency and performance in online training thanks to the knowledge from the action-free offline dataset. Code is available at https://github.com/Vision-CAIR/AF-Guide.

Planning Multiple Epidemic Interventions with Reinforcement Learning

Authors:Anh Mai, Nikunj Gupta, Azza Abouzied, Dennis Shasha
Date:2023-01-30 11:51:24

Combating an epidemic entails finding a plan that describes when and how to apply different interventions, such as mask-wearing mandates, vaccinations, and school or workplace closures. An optimal plan will curb an epidemic with minimal loss of life, disease burden, and economic cost. Finding an optimal plan is an intractable computational problem in realistic settings. Policy-makers, however, would greatly benefit from tools that can efficiently search for plans that minimize disease and economic costs, especially when considering multiple possible interventions over a continuous and complex action space given a continuous and equally complex state space. We formulate this problem as a Markov decision process. Our formulation is unique in its ability to represent multiple continuous interventions over any disease model defined by ordinary differential equations. We illustrate how to effectively apply state-of-the-art actor-critic reinforcement learning algorithms (PPO and SAC) to search for plans that minimize overall costs. We empirically evaluate the learning performance of these algorithms and compare their performance to hand-crafted baselines that mimic plans constructed by policy-makers. Our method outperforms baselines. Our work confirms the viability of a computational approach to support policy-makers.

Automatic Intersection Management in Mixed Traffic Using Reinforcement Learning and Graph Neural Networks

Authors:Marvin Klimke, Benjamin Völz, Michael Buchholz
Date:2023-01-30 08:21:18

Connected automated driving has the potential to significantly improve urban traffic efficiency, e.g., by alleviating issues due to occlusion. Cooperative behavior planning can be employed to jointly optimize the motion of multiple vehicles. Most existing approaches to automatic intersection management, however, only consider fully automated traffic. In practice, mixed traffic, i.e., the simultaneous road usage by automated and human-driven vehicles, will be prevalent. The present work proposes to leverage reinforcement learning and a graph-based scene representation for cooperative multi-agent planning. We build upon our previous works that showed the applicability of such machine learning methods to fully automated traffic. The scene representation is extended for mixed traffic and considers uncertainty in the human drivers' intentions. In the simulation-based evaluation, we model measurement uncertainties through noise processes that are tuned using real-world data. The paper evaluates the proposed method against an enhanced first-in, first-out scheme, our baseline for mixed traffic management. With an increasing share of automated vehicles, the learned planner significantly increases the vehicle throughput and reduces the delay due to interaction. Non-automated vehicles benefit virtually alike.

Sample Efficient Deep Reinforcement Learning via Local Planning

Authors:Dong Yin, Sridhar Thiagarajan, Nevena Lazic, Nived Rajaraman, Botao Hao, Csaba Szepesvari
Date:2023-01-29 23:17:26

The focus of this work is sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to an observed state which has high uncertainty, instead of sampling according to the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically reduce the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we can achieve super-human performance on the notoriously hard Atari game, Montezuma's Revenge, with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
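
The reset rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `choose_start_state`, the `reset_prob` parameter, the choice of the single most uncertain state, and the inverse-visit-count uncertainty score in the usage lines are all assumptions made for illustration.

```python
import random
from typing import Callable, Hashable, Optional, Sequence


def choose_start_state(
    visited: Sequence[Hashable],
    uncertainty: Callable[[Hashable], float],
    reset_prob: float,
    rng: random.Random,
) -> Optional[Hashable]:
    """Pick where the next data-collection episode starts.

    Returns a previously observed state to reset the simulator to, or None,
    meaning "sample from the usual initial-state distribution".
    Assumption: we take the single most uncertain visited state; the abstract
    only says the chosen state "has high uncertainty".
    """
    if visited and rng.random() < reset_prob:
        return max(visited, key=uncertainty)
    return None


# Hypothetical usage: uncertainty is approximated by inverse visit counts.
visit_counts = {"room_a": 5, "room_b": 1, "room_c": 3}
start = choose_start_state(
    visited=list(visit_counts),
    uncertainty=lambda s: 1.0 / visit_counts[s],
    reset_prob=1.0,  # always reset, so the result is deterministic here
    rng=random.Random(0),
)
```

With `reset_prob=1.0` the sketch always returns the least-visited state; with `reset_prob=0.0` it always returns `None`, which recovers the standard online RL setting.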

Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling

Authors:Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, Roy Fox
Date:2023-01-28 02:04:07

Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world. However, if initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, that will be verified through world experience, to improve sample efficiency of RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.
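
The hypothesize-then-verify loop over an abstract world model can be sketched in Python as follows. This is an illustrative toy, not the DECKARD code: the class `AbstractWorldModel`, its methods, and the crafting recipes are hypothetical, and the LLM call of the Dream phase is replaced by a hard-coded dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class AbstractWorldModel:
    """Subgoal graph: each subgoal maps to the subgoals it depends on."""

    hypothesized: dict[str, set[str]] = field(default_factory=dict)  # Dream phase guesses
    verified: dict[str, set[str]] = field(default_factory=dict)      # Wake phase facts

    def prerequisites(self, subgoal: str) -> set[str]:
        # Verified knowledge overrides the hypothesis when both exist.
        if subgoal in self.verified:
            return self.verified[subgoal]
        return self.hypothesized.get(subgoal, set())

    def plan(self, goal: str) -> list[str]:
        """Order subgoals so every prerequisite comes before its dependent."""
        order: list[str] = []
        seen: set[str] = set()

        def visit(node: str) -> None:
            if node in seen:  # also guards against cycles in a noisy hypothesis
                return
            seen.add(node)
            for dep in sorted(self.prerequisites(node)):
                visit(dep)
            order.append(node)

        visit(goal)
        return order

    def record_experience(self, subgoal: str, observed: set[str]) -> None:
        """Wake phase: store what the environment actually required."""
        self.verified[subgoal] = set(observed)


# Dream phase stand-in: a hard-coded (and deliberately incomplete) hypothesis.
awm = AbstractWorldModel(
    hypothesized={"pickaxe": {"plank", "stick"}, "stick": {"plank"}, "plank": {"log"}}
)
dream_plan = awm.plan("pickaxe")

# Wake phase: experience shows one more prerequisite, so the model is corrected.
awm.record_experience("pickaxe", {"plank", "stick", "crafting_table"})
corrected_plan = awm.plan("pickaxe")
```

The design choice worth noting is that verified entries take precedence over hypothesized ones, so an error in the initial hypothesis is overwritten once experience contradicts it.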

ARiADNE: A Reinforcement learning approach using Attention-based Deep Networks for Exploration

Authors:Yuhong Cao, Tianxiang Hou, Yizhuo Wang, Xian Yi, Guillaume Sartoretti
Date:2023-01-27 07:49:59

In autonomous robot exploration tasks, a mobile robot needs to actively explore and map an unknown environment as fast as possible. Since the environment is being revealed during exploration, the robot needs to frequently re-plan its path online, as new information is acquired by onboard sensors and used to update its partial map. While state-of-the-art exploration planners are frontier- and sampling-based, encouraged by the recent development in deep reinforcement learning (DRL), we propose ARiADNE, an attention-based neural approach to obtain real-time, non-myopic path planning for autonomous exploration. ARiADNE is able to learn dependencies at multiple spatial scales between areas of the agent's partial map, and implicitly predict potential gains associated with exploring those areas. This allows the agent to sequence movement actions that balance the natural trade-off between exploitation/refinement of the map in known areas and exploration of new areas. We experimentally demonstrate that our method outperforms both learning and non-learning state-of-the-art baselines in terms of average trajectory length to complete exploration in hundreds of simplified 2D indoor scenarios. We further validate our approach in high-fidelity Robot Operating System (ROS) simulations, where we consider a real sensor model and a realistic low-level motion controller, toward deployment on real robots.

Learning to Generate All Feasible Actions

Authors:Mirco Theile, Daniele Bernardini, Raphael Trumpp, Cristina Piazza, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
Date:2023-01-26 23:15:51

Modern cyber-physical systems are becoming increasingly complex to model, thus motivating data-driven techniques such as reinforcement learning (RL) to find appropriate control agents. However, most systems are subject to hard constraints such as safety or operational bounds. Typically, to learn to satisfy these constraints, the agent must violate them systematically, which is computationally prohibitive in most systems. Recent efforts aim to utilize feasibility models that assess whether a proposed action is feasible to avoid applying the agent's infeasible action proposals to the system. However, these efforts focus on guaranteeing constraint satisfaction rather than the agent's learning efficiency. To improve the learning process, we introduce action mapping, a novel approach that divides the learning process into two steps: first learning feasibility, and subsequently learning the objective by mapping actions into the sets of feasible actions. This paper focuses on the feasibility part by learning to generate all feasible actions through self-supervised querying of the feasibility model. We train the agent by formulating the problem as a distribution matching problem and deriving gradient estimators for different divergences. Through an illustrative example, a robotic path planning scenario, and a robotic grasping simulation, we demonstrate the agent's proficiency in generating actions across disconnected feasible action sets. By addressing the feasibility step, this paper makes it possible to focus future work on the objective part of action mapping, paving the way for an RL framework that is both safe and efficient.

An Incremental Inverse Reinforcement Learning Approach for Motion Planning with Separated Path and Velocity Preferences

Authors:Armin Avaei, Linda van der Spaa, Luka Peternel, Jens Kober
Date:2023-01-25 11:26:10

Humans often demonstrate diverse behaviors due to their personal preferences, for instance, related to their individual execution style or personal margin for safety. In this paper, we consider the problem of integrating both path and velocity preferences into trajectory planning for robotic manipulators. We first learn reward functions that represent the user path and velocity preferences from kinesthetic demonstration. We then optimize the trajectory in two steps: first the path and then the velocity, to produce trajectories that adhere to both task requirements and user preferences. We design a set of parameterized features that capture the fundamental preferences in a pick-and-place type of object-transportation task, both in shape and timing of the motion. We demonstrate that our method is capable of generalizing such preferences to new scenarios. We implement our algorithm on a Franka Emika 7-DoF robot arm, and validate the functionality and flexibility of our approach in a user study. The results show that non-expert users are able to teach the robot their preferences with just a few iterations of feedback.

PushWorld: A benchmark for manipulation planning with tools and movable obstacles

Authors:Ken Kansky, Skanda Vaidyanath, Scott Swingle, Xinghua Lou, Miguel Lazaro-Gredilla, Dileep George
Date:2023-01-24 20:20:17

While recent advances in artificial intelligence have achieved human-level performance in environments like Starcraft and Go, many physical reasoning tasks remain challenging for modern algorithms. To date, few algorithms have been evaluated on physical tasks that involve manipulating objects when movable obstacles are present and when tools must be used to perform the manipulation. To promote research on such tasks, we introduce PushWorld, an environment with simplistic physics that requires manipulation planning with both movable obstacles and tools. We provide a benchmark of more than 200 PushWorld puzzles in PDDL and in an OpenAI Gym environment. We evaluate state-of-the-art classical planning and reinforcement learning algorithms on this benchmark, and we find that these baseline results are below human-level performance. We then provide a new classical planning heuristic that solves the most puzzles among the baselines, and although it is 40 times faster than the best baseline planner, it remains below human-level performance.

NeSIG: A Neuro-Symbolic Method for Learning to Generate Planning Problems

Authors:Carlos Núñez-Molina, Pablo Mesejo, Juan Fernández-Olivares
Date:2023-01-24 19:37:59

In the field of Automated Planning there is often the need for a set of planning problems from a particular domain, e.g., to be used as training data for Machine Learning or as benchmarks in planning competitions. In most cases, these problems are created either by hand or by a domain-specific generator, putting a burden on the human designers. In this paper we propose NeSIG, to the best of our knowledge the first domain-independent method for automatically generating planning problems that are valid, diverse and difficult to solve. We formulate problem generation as a Markov Decision Process and train two generative policies with Deep Reinforcement Learning to generate problems with the desired properties. We conduct experiments on three classical domains, comparing our approach against handcrafted, domain-specific instance generators and various ablations. Results show NeSIG is able to automatically generate valid and diverse problems of much greater difficulty (15.5 times more on geometric average) than domain-specific generators, while simultaneously reducing human effort when compared to them. Additionally, it can generalize to larger problems than those seen during training.

Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning

Authors:Safa Alver, Doina Precup
Date:2023-01-24 16:40:01

Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment, regardless of whether they are important in coming up with optimal decisions or not. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios and we propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-equivalent partial models". After providing a formal definition for these models, we provide theoretical results demonstrating the scalability advantages of performing planning with such models and then perform experiments to empirically illustrate our theoretical results. Then, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust planning in lifelong reinforcement learning scenarios.

Effective Baselines for Multiple Object Rearrangement Planning in Partially Observable Mapped Environments

Authors:Engin Tekin, Elaheh Barati, Nitin Kamra, Ruta Desai
Date:2023-01-24 08:03:34

Many real-world tasks, from house-cleaning to cooking, can be formulated as multi-object rearrangement problems -- where an agent needs to get specific objects into appropriate goal states. For such problems, we focus on the setting that assumes a pre-specified goal state, availability of perfect manipulation and object recognition capabilities, and a static map of the environment but unknown initial location of objects to be rearranged. Our goal is to enable home-assistive intelligent agents to efficiently plan for rearrangement under such partial observability. This requires efficient trade-offs between exploration of the environment and planning for rearrangement, which is challenging because of the long-horizon nature of the problem. To make progress on this problem, we first analyze the effects of various factors such as number of objects and receptacles, agent carrying capacity, environment layouts, etc. on exploration and planning for rearrangement using classical methods. We then investigate both monolithic and modular deep reinforcement learning (DRL) methods for planning in our setting. We find that monolithic DRL methods do not succeed at the long-horizon planning needed for multi-object rearrangement. Instead, modular greedy approaches surprisingly perform reasonably well and emerge as competitive baselines for planning with partial observability in multi-object rearrangement problems. We also show that our greedy modular agents are empirically optimal when the objects that need to be rearranged are uniformly distributed in the environment -- thereby contributing baselines with strong performance for future work on multi-object rearrangement planning in partially observable settings.

A deep reinforcement learning approach to assess the low-altitude airspace capacity for urban air mobility

Authors:Asal Mehditabrizi, Mahdi Samadzad, Sina Sabzekar
Date:2023-01-23 23:38:05

Urban air mobility is a new mode of transportation aiming to provide a fast and secure way of travel by utilizing the low-altitude airspace. This goal cannot be achieved without the implementation of new flight regulations which can assure safe and efficient allocation of flight paths to a large number of vertical takeoff/landing aerial vehicles. Such rules should also allow estimating the effective capacity of the low-altitude airspace for planning purposes. Path planning is a vital subject in urban air mobility which could enable a large number of UAVs to fly simultaneously in the airspace without facing the risk of collision. Since urban air mobility is a novel concept, authorities are still working on drafting new flight rules applicable to urban air mobility. In this study, an autonomous UAV path planning framework is proposed using a deep reinforcement learning approach and a deep deterministic policy gradient algorithm. The objective is to employ a self-trained UAV to reach its destination in the shortest possible time in any arbitrary environment by adjusting its acceleration. It should avoid collisions with any dynamic or static obstacles and avoid entering prior permission zones existing on its path. The reward function is the determinant factor in the training process. Thus, two different reward function compositions are compared and the chosen composition is deployed to train the UAV by coding the RL algorithm in Python. Finally, numerical simulations investigated the success rate of UAVs in different scenarios, providing an estimate of the effective airspace capacity.

Learning to View: Decision Transformers for Active Object Detection

Authors:Wenhao Ding, Nathalie Majcherczyk, Mohit Deshpande, Xuewei Qi, Ding Zhao, Rajasimman Madhivanan, Arnie Sen
Date:2023-01-23 17:00:48

Active perception describes a broad class of techniques that couple planning and perception systems to move the robot in a way to give the robot more information about the environment. In most robotic systems, perception is typically independent of motion planning. For example, traditional object detection is passive: it operates only on the images it receives. However, we have a chance to improve the results if we allow planning to consume detection signals and move the robot to collect views that maximize the quality of the results. In this paper, we use reinforcement learning (RL) methods to control the robot in order to obtain images that maximize the detection quality. Specifically, we propose using a Decision Transformer with online fine-tuning, which first optimizes the policy with a pre-collected expert dataset and then improves the learned policy by exploring better solutions in the environment. We evaluate the performance of proposed method on an interactive dataset collected from an indoor scenario simulator. Experimental results demonstrate that our method outperforms all baselines, including expert policy and pure offline RL methods. We also provide exhaustive analyses of the reward distribution and observation space.

Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning

Authors:Zifan Wu, Chao Yu, Chen Chen, Jianye Hao, Hankz Hankui Zhuo
Date:2023-01-20 10:17:22

In Model-based Reinforcement Learning (MBRL), model learning is critical since an inaccurate model can bias policy learning via generating misleading samples. However, learning an accurate model can be difficult since the policy is continually updated and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively after the samples were generated, rather than foreseeing the uncertainty before model trajectories fall into those highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, while being learned to minimize one-step prediction errors, the model is generally used to predict for multiple steps, leading to a mismatch between the objectives of model learning and model usage. To this end, we propose \emph{Plan To Predict} (P2P), an MBRL framework that treats the model rollout process as a sequential decision making problem by reversely considering the model as a decision maker and the current policy as the dynamics. In this way, the model can quickly adapt to the current policy and foresee the multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.

A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles

Authors:Ivan Masmitja, Mario Martin, Kakani Katija, Spartacus Gomariz, Joan Navarro
Date:2023-01-17 13:16:16

Underwater target localization using range-only and single-beacon (ROSB) techniques with autonomous vehicles has been used recently to overcome the limitations of more complex methods, such as long baseline and ultra-short baseline systems. Nonetheless, in ROSB target localization methods, the trajectory of the tracking vehicle near the localized target plays an important role in obtaining the best accuracy of the predicted target position. Here, we investigate a Reinforcement Learning (RL) approach to find the optimal path that an autonomous vehicle should follow in order to increase and optimize the overall accuracy of the predicted target localization, while reducing time and power consumption. To accomplish this objective, different experimental tests have been designed using state-of-the-art deep RL algorithms. Our study also compares the results obtained with the analytical Fisher information matrix approach used in previous studies. The results revealed that the policy learned by the RL agent outperforms trajectories based on these analytical solutions, e.g. the median predicted error at the beginning of the target's localisation is 17% lower. These findings suggest that using deep RL for localizing acoustic targets could be successfully applied to in-water applications that include tracking of acoustically tagged marine animals by autonomous underwater vehicles. This is envisioned as a first necessary step to validate the use of RL to tackle such problems, which could be used later on in more complex scenarios.

Deep-Reinforcement-Learning-based Path Planning for Industrial Robots using Distance Sensors as Observation

Authors:Teham Bhuiyan, Linh Kästner, Yifan Hu, Benno Kutschank, Jens Lambrecht
Date:2023-01-14 21:42:17

Industrial robots are widely used in various manufacturing environments due to their efficiency in doing repetitive tasks such as assembly or welding. A common problem for these applications is to reach a destination without colliding with obstacles or other robot arms. Commonly used sampling-based path planning approaches such as RRT require long computation times, especially in complex environments. Furthermore, the environment in which they are employed needs to be known beforehand. When utilizing the approaches in new environments, a tedious engineering effort in setting hyperparameters needs to be conducted, which is time- and cost-intensive. On the other hand, Deep Reinforcement Learning has shown remarkable results in dealing with unknown environments, generalizing new problem instances, and solving motion planning problems efficiently. On that account, this paper proposes a Deep-Reinforcement-Learning-based motion planner for robotic manipulators. We evaluated our model against state-of-the-art sampling-based planners in several experiments. The results show the superiority of our planner in terms of path length and execution time.

Long-distance migration with minimal energy consumption in a thermal turbulent environment

Authors:Ao Xu, Hua-Lin Wu, Heng-Dong Xi
Date:2023-01-12 04:46:41

We adopt the reinforcement learning algorithm to train the self-propelling agent migrating long-distance in a thermal turbulent environment. We choose the Rayleigh-B\'enard turbulent convection cell with an aspect ratio ($\Gamma$, which is defined as the ratio between cell length and cell height) of 2 as the training environment. Our results showed that, compared to a naive agent that moves straight from the origin to the destination, the smart agent can learn to utilize the carrier flow currents to save propelling energy. We then apply the optimal policy obtained from the $\Gamma=2$ cell and test the smart agent migrating in convection cells with $\Gamma$ up to 32. In a larger $\Gamma$ cell, the dominant flow modes of horizontally stacked rolls are less stable, and the energy contained in higher-order flow modes increases. We found that the optimized policy can be successfully extended to convection cells with a larger $\Gamma$. In addition, the ratio of propelling energy consumed by the smart agent to that of the naive agent decreases with the increase of $\Gamma$, indicating more propelling energy can be saved by the smart agent in a larger $\Gamma$ cell. We also evaluate the optimized policy when the agents are being released from the randomly chosen origin, which aims to test the robustness of the learning framework, and possible solutions to improve the success rate are suggested. This work has implications for long-distance migration problems, such as unmanned aerial vehicles patrolling in a turbulent convective environment, where planning energy-efficient trajectories can be beneficial to increase their endurance.

MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors

Authors:Chengzhi Wu, Kanran Zhou, Jan-Philipp Kaiser, Norbert Mitschke, Jan-Felix Klein, Julius Pfrommer, Jürgen Beyerer, Gisela Lanza, Michael Heizmann, Kai Furmans
Date:2023-01-11 18:03:24

To enable automatic disassembly of different product types with uncertain conditions and degrees of wear in remanufacturing, agile production systems that can adapt dynamically to changing requirements are needed. Machine learning algorithms can be employed due to their generalization capabilities of learning from various types and variants of products. However, in reality, datasets with a diversity of samples that can be used to train models are difficult to obtain in the initial period. This may cause bad performances when the system tries to adapt to new unseen input data in the future. In order to generate large datasets for different learning purposes, in our project, we present a Blender add-on named MotorFactory to generate customized mesh models of various motor instances. MotorFactory allows to create mesh models which, complemented with additional add-ons, can be further used to create synthetic RGB images, depth images, normal images, segmentation ground truth masks, and 3D point cloud datasets with point-wise semantic labels. The created synthetic datasets may be used for various tasks including motor type classification, object detection for decentralized material transfer tasks, part segmentation for disassembly and handling tasks, or even reinforcement learning-based robotics control or view-planning.

Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments

Authors:Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, Animesh Garg
Date:2023-01-10 20:19:17

We present Orbit, a unified and modular framework for robot learning powered by NVIDIA Isaac Sim. It offers a modular design to easily and efficiently create robotic environments with photo-realistic scenes and high-fidelity rigid and deformable body simulation. With Orbit, we provide a suite of benchmark tasks of varying difficulty -- from single-stage cabinet opening and cloth folding to multi-stage tasks such as room reorganization. To support working with diverse observations and action spaces, we include fixed-arm and mobile manipulators with different physically-based sensors and motion generators. Orbit allows training reinforcement learning policies and collecting large demonstration datasets from hand-crafted or expert solutions in a matter of minutes by leveraging GPU-based parallelization. In summary, we offer an open-sourced framework that readily comes with 16 robotic platforms, 4 sensor modalities, 10 motion generators, more than 20 benchmark tasks, and wrappers to 4 learning libraries. With this framework, we aim to support various research areas, including representation learning, reinforcement learning, imitation learning, and task and motion planning. We hope it helps establish interdisciplinary collaborations in these communities, and its modularity makes it easily extensible for more tasks and applications in the future.

Deep Reinforcement Learning for Autonomous Ground Vehicle Exploration Without A-Priori Maps

Authors:Shathushan Sivashangaran, Azim Eskandarian
Date:2023-01-10 15:38:59

Autonomous Ground Vehicles (AGVs) are essential tools for a wide range of applications stemming from their ability to operate in hazardous environments with minimal human operator input. Effective motion planning is paramount for successful operation of AGVs. Conventional motion planning algorithms are dependent on prior knowledge of environment characteristics and offer limited utility in information poor, dynamically altering environments such as areas where emergency hazards like fire and earthquake occur, and unexplored subterranean environments such as tunnels and lava tubes on Mars. We propose a Deep Reinforcement Learning (DRL) framework for intelligent AGV exploration without a-priori maps utilizing Actor-Critic DRL algorithms to learn policies in continuous and high-dimensional action spaces directly from raw sensor data. The DRL architecture comprises feedforward neural networks for the critic and actor representations in which the actor network strategizes linear and angular velocity control actions given current state inputs, that are evaluated by the critic network which learns and estimates Q-values to maximize an accumulated reward. Three off-policy DRL algorithms, DDPG, TD3 and SAC, are trained and compared in two environments of varying complexity, and further evaluated in a third with no prior training or knowledge of map characteristics. The agent is shown to learn optimal policies at the end of each training period to chart quick, collision-free exploration trajectories, and is extensible, capable of adapting to an unknown environment without changes to network architecture or hyperparameters. The best algorithm is further evaluated in a realistic 3D environment.

Multi-UAV Path Learning for Age and Power Optimization in IoT with UAV Battery Recharge

Authors:Eslam Eldeeb, Jean Michel de Souza Sant'Ana, Dian Echevarría Pérez, Mohammad Shehab, Nurul Huda Mahmood, Hirley Alves
Date:2023-01-09 15:21:41

In many emerging Internet of Things (IoT) applications, the freshness of the information is an important design criterion. Age of Information (AoI) quantifies the freshness of the received information or status update. This work considers a setup of deployed IoT devices in an IoT network; multiple unmanned aerial vehicles (UAVs) serve as mobile relay nodes between the sensors and the base station. We formulate an optimization problem to jointly plan the UAVs' trajectories, while minimizing the AoI of the received messages and the devices' energy consumption. The solution accounts for the UAVs' battery lifetime and flight time to recharging depots to ensure the UAVs' green operation. The complex optimization problem is efficiently solved using a deep reinforcement learning algorithm. In particular, we propose a deep Q-network, which works as a function approximator to estimate the state-action value function. The proposed scheme is quick to converge and results in a lower ergodic age and ergodic energy consumption when compared with benchmark algorithms such as greedy algorithm (GA), nearest neighbour (NN), and random-walk (RW).
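To make the AoI metric concrete, the following minimal Python sketch uses the standard definition: the age at time t is t minus the generation time of the freshest update received so far. The function name `age_of_information` and the `(generation_time, delivery_time)` tuple layout are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def age_of_information(t: float, updates: List[Tuple[float, float]]) -> float:
    """Return the Age of Information at time t.

    `updates` holds (generation_time, delivery_time) pairs. This layout is a
    hypothetical choice for the sketch, not the paper's data model.
    """
    # Only updates that have already been delivered by time t count.
    delivered = [generated for generated, arrived in updates if arrived <= t]
    if not delivered:
        raise ValueError("no update has been received by time t")
    # Age = current time minus the generation time of the freshest update.
    return t - max(delivered)


if __name__ == "__main__":
    # Three status updates relayed to the base station (illustrative values).
    updates = [(0.0, 1.0), (3.0, 4.5), (5.0, 6.0)]
    for t in (4.0, 5.0, 6.0):
        print(f"t={t}: AoI={age_of_information(t, updates)}")
```

The age grows linearly between deliveries and drops whenever a fresher update arrives, which is why trajectory planning that shortens relay delays also lowers the average age.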

Exploration in Model-based Reinforcement Learning with Randomized Reward

Authors:Lingxiao Wang, Ping Li
Date:2023-01-09 01:50:55

Model-based Reinforcement Learning (MBRL) has been widely adopted due to its sample efficiency. However, existing worst-case regret analysis typically requires optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical studies utilize ensembles of models, which achieve state-of-the-art performance on various testing environments. Such deviation between theory and empirical study leads us to question whether randomized model ensembles guarantee optimism, and hence the optimal worst-case regret. This paper partially answers this question from the perspective of reward randomization, a scarcely explored direction of exploration with MBRL. We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees a partial optimism, which further yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to generalized function approximation and identify conditions for reward randomization to attain provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our analysis establishes the first worst-case regret analysis on randomized MBRL with function approximation.

Mathematical Models and Reinforcement Learning based Evolutionary Algorithm Framework for Satellite Scheduling Problem

Authors:Yanjie Song
Date:2023-01-07 01:57:10

For complex combinatorial optimization problems, models and algorithms are at the heart of the solution. Many types of satellite mission planning problems are NP-hard and place high demands on the solution method. In this paper, two types of satellite scheduling problem models are introduced and a reinforcement learning based evolutionary algorithm framework is proposed.

Cost-Effective Two-Stage Network Slicing for Edge-Cloud Orchestrated Vehicular Networks

Authors:Wen Wu, Kaige Qu, Peng Yang, Ning Zhang, Xuemin Shen, Weihua Zhuang
Date:2022-12-31 06:03:14

In this paper, we study a network slicing problem for edge-cloud orchestrated vehicular networks, in which the edge and cloud servers are orchestrated to process computation tasks for reducing network slicing cost while satisfying the quality of service requirements. We propose a two-stage network slicing framework, which consists of 1) network planning stage in a large timescale to perform slice deployment, edge resource provisioning, and cloud resource provisioning, and 2) network operation stage in a small timescale to perform resource allocation and task dispatching. Particularly, we formulate the network slicing problem as a two-timescale stochastic optimization problem to minimize the network slicing cost. Since the problem is NP-hard due to coupled network planning and network operation stages, we develop a Two timescAle netWork Slicing (TAWS) algorithm by collaboratively integrating reinforcement learning (RL) and optimization methods, which can jointly make network planning and operation decisions. Specifically, by leveraging the timescale separation property of decisions, we decouple the problem into a large-timescale network planning subproblem and a small-timescale network operation subproblem. The former is solved by an RL method, and the latter is solved by an optimization method. Simulation results based on real-world vehicle traffic traces show that the TAWS can effectively reduce the network slicing cost as compared to the benchmark scheme.

Symbolic Visual Reinforcement Learning: A Scalable Framework with Object-Level Abstraction and Differentiable Expression Search

Authors:Wenqing Zheng, S P Sharan, Zhiwen Fan, Kevin Wang, Yihan Xi, Zhangyang Wang
Date:2022-12-30 17:50:54

Learning efficient and interpretable policies has been a challenging task in reinforcement learning (RL), particularly in the visual RL setting with complex scenes. While neural networks have achieved competitive performance, the resulting policies are often over-parameterized black boxes that are difficult to interpret and deploy efficiently. More recent symbolic RL frameworks have shown that high-level domain-specific programming logic can be designed to handle both policy learning and symbolic planning. However, these approaches rely on coded primitives with little feature learning, and when applied to high-dimensional visual scenes, they can suffer from scalability issues and perform poorly when images have complex object interactions. To address these challenges, we propose \textit{Differentiable Symbolic Expression Search} (DiffSES), a novel symbolic learning approach that discovers discrete symbolic policies using partially differentiable optimization. By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions, while also incorporating the strengths of neural networks for feature learning and optimization. Our experiments demonstrate that DiffSES is able to generate symbolic policies that are simpler and more scalable than state-of-the-art symbolic RL methods, with a reduced amount of symbolic prior knowledge.

Hybrid Deep Reinforcement Learning and Planning for Safe and Comfortable Automated Driving

Authors:Dikshant Gupta, Mathias Klusch
Date:2022-12-30 15:19:01

We present a novel hybrid learning method, HyLEAR, for solving the collision-free navigation problem for self-driving cars in POMDPs. HyLEAR leverages interposed learning to embed knowledge of a hybrid planner into a deep reinforcement learner so that it determines safe and comfortable driving policies faster. In particular, the hybrid planner combines pedestrian path prediction and risk-aware path planning with driving-behavior rule-based reasoning such that the driving policies also take into account, whenever possible, the ride comfort and a given set of driving-behavior rules. Our experimental performance analysis over the CARLA-CTS1 benchmark of critical traffic scenarios revealed that HyLEAR can significantly outperform the selected baselines in terms of safety and ride comfort.

POMRL: No-Regret Learning-to-Plan with Increasing Horizons

Authors:Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy
Date:2022-12-30 03:09:45

We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying structure across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within a task. We generalize this finding to meta-RL and study the dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically.
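As a rough illustration of how a discount factor relates to a planning horizon, the Python sketch below uses the standard effective-horizon relation 1/(1 - gamma). The schedule `discount_for_task`, its square-root growth, and the parameters `base_horizon` and `growth` are hypothetical choices made only for illustration; they are not the heuristics derived in the paper.

```python
import math


def effective_horizon(gamma: float) -> float:
    """Effective planning horizon 1 / (1 - gamma) for a discount factor gamma."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


def discount_for_task(task_index: int,
                      base_horizon: float = 5.0,
                      growth: float = 1.0) -> float:
    """Hypothetical schedule: the horizon grows slowly with the tasks seen.

    The square-root growth is an assumption of this sketch, chosen only so
    that the discount factor increases slowly as more tasks are observed.
    """
    if task_index < 0:
        raise ValueError("task_index must be non-negative")
    horizon = base_horizon + growth * math.sqrt(task_index)
    # Invert the effective-horizon relation to recover the discount factor.
    return 1.0 - 1.0 / horizon


if __name__ == "__main__":
    for k in (0, 1, 4, 16, 64):
        gamma = discount_for_task(k)
        print(f"task {k}: gamma={gamma:.4f}, horizon={effective_horizon(gamma):.2f}")
```

The intuition matches the abstract's claim: with few tasks the model estimate is poor, so a short horizon limits compounding model error, and the horizon can lengthen as more tasks improve the estimate.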

Bayesian Optimization Enhanced Deep Reinforcement Learning for Trajectory Planning and Network Formation in Multi-UAV Networks

Authors:Shimin Gong, Meng Wang, Bo Gu, Wenjie Zhang, Dinh Thai Hoang, Dusit Niyato
Date:2022-12-27 07:46:40

In this paper, we employ multiple UAVs coordinated by a base station (BS) to help the ground users (GUs) to offload their sensing data. Different UAVs can adapt their trajectories and network formation to expedite data transmissions via multi-hop relaying. The trajectory planning aims to collect all GUs' data, while the UAVs' network formation optimizes the multi-hop UAV network topology to minimize the energy consumption and transmission delay. The joint network formation and trajectory optimization is solved by a two-step iterative approach. Firstly, we devise the adaptive network formation scheme by using a heuristic algorithm to balance the UAVs' energy consumption and data queue size. Then, with the fixed network formation, the UAVs' trajectories are further optimized by using multi-agent deep reinforcement learning without knowing the GUs' traffic demands and spatial distribution. To improve the learning efficiency, we further employ Bayesian optimization to estimate the UAVs' flying decisions based on historical trajectory points. This helps avoid inefficient action explorations and improves the convergence rate in the model training. The simulation results reveal close spatial-temporal couplings between the UAVs' trajectory planning and network formation. Compared with several baselines, our solution can better exploit the UAVs' cooperation in data offloading, thus improving energy efficiency and delay performance.

Streaming Traffic Flow Prediction Based on Continuous Reinforcement Learning

Authors:Yanan Xiao, Minyu Liu, Zichen Zhang, Lu Jiang, Minghao Yin, Jianan Wang
Date:2022-12-24 16:34:10

Traffic flow prediction is an important part of smart transportation. The goal is to predict future traffic conditions based on historical data recorded by sensors and the traffic network. As the city continues to build, parts of the transportation network will be added or modified. How to accurately predict expanding and evolving long-term streaming networks is of great significance. To this end, we propose a new simulation-based criterion that considers teaching autonomous agents to mimic sensor patterns, planning their next visit based on the sensor's profile (e.g., traffic, speed, occupancy). The data recorded by the sensor is most accurate when the agent can perfectly simulate the sensor's activity pattern. We propose to formulate the problem as a continuous reinforcement learning task, where the agent is the next flow value predictor, the action is the next time-series flow value in the sensor, and the environment state is a dynamically fused representation of the sensor and transportation network. Actions taken by the agent change the environment, which in turn forces the agent's mode to update, while the agent further explores changes in the dynamic traffic network, which helps the agent predict its next visit more accurately. Therefore, we develop a strategy in which sensors and traffic networks update each other and incorporate temporal context to quantify state representations evolving over time.

Deep Reinforcement Learning for Trajectory Path Planning and Distributed Inference in Resource-Constrained UAV Swarms

Authors:Marwan Dhuheir, Emna Baccour, Aiman Erbad, Sinan Sabeeh Al-Obaidi, Mounir Hamdi
Date:2022-12-21 17:16:42

The deployment flexibility and maneuverability of Unmanned Aerial Vehicles (UAVs) have increased their adoption in various applications, such as wildfire tracking and border monitoring. In many critical applications, UAVs capture images and other sensory data and then send the captured data to remote servers for inference and data processing tasks. However, this approach is not always practical in real-time applications due to connection instability, limited bandwidth, and end-to-end latency. One promising solution is to divide the inference requests into multiple parts (layers or segments), with each part being executed in a different UAV based on the available resources. Furthermore, some applications require the UAVs to traverse certain areas and capture incidents; thus, planning their paths becomes critical, particularly to reduce the latency of the collaborative inference process. Specifically, planning the UAVs' trajectories can reduce the data transmission latency by communicating with nearby devices while mitigating the transmission interference. This work aims to design a model for distributed collaborative inference requests and path planning in a UAV swarm while respecting the resource constraints due to the computational load and memory usage of the inference requests. The model is formulated as an optimization problem and aims to minimize latency. The formulated problem is NP-hard, so finding the optimal solution is quite complex; thus, this paper introduces a real-time and dynamic solution for online applications using deep reinforcement learning. We conduct extensive simulations and compare our results to state-of-the-art studies, demonstrating that our model outperforms the competing models.

AdverSAR: Adversarial Search and Rescue via Multi-Agent Reinforcement Learning

Authors:Aowabin Rahman, Arnab Bhattacharya, Thiagarajan Ramachandran, Sayak Mukherjee, Himanshu Sharma, Ted Fujimoto, Samrat Chatterjee
Date:2022-12-20 08:13:29

Search and Rescue (SAR) missions in remote environments often employ autonomous multi-robot systems that learn, plan, and execute a combination of local single-robot control actions, group primitives, and global mission-oriented coordination and collaboration. Often, SAR coordination strategies are manually designed by human experts who can remotely control the multi-robot system and enable semi-autonomous operations. However, in remote environments where connectivity is limited and human intervention is often not possible, decentralized collaboration strategies are needed for fully-autonomous operations. Nevertheless, decentralized coordination may be ineffective in adversarial environments due to sensor noise, actuation faults, or manipulation of inter-agent communication data. In this paper, we propose an algorithmic approach based on adversarial multi-agent reinforcement learning (MARL) that allows robots to efficiently coordinate their strategies in the presence of adversarial inter-agent communications. In our setup, the objective of the multi-robot team is to discover targets strategically in an obstacle-strewn geographical area by minimizing the average time needed to find the targets. It is assumed that the robots have no prior knowledge of the target locations, and they can interact with only a subset of neighboring robots at any time. Based on the centralized training with decentralized execution (CTDE) paradigm in MARL, we utilize a hierarchical meta-learning framework to learn dynamic team-coordination modalities and discover emergent team behavior under complex cooperative-competitive scenarios. The effectiveness of our approach is demonstrated on a collection of prototype grid-world environments with different specifications of benign and adversarial agents, target locations, and agent rewards.

Near-optimal Policy Identification in Active Reinforcement Learning

Authors:Xiang Li, Viraj Mehta, Johannes Kirschner, Ian Char, Willie Neiswanger, Jeff Schneider, Andreas Krause, Ilija Bogunovic
Date:2022-12-19 14:46:57

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.

Planning Immediate Landmarks of Targets for Model-Free Skill Transfer across Agents

Authors:Minghuan Liu, Zhengbang Zhu, Menghui Zhu, Yuzheng Zhuang, Weinan Zhang, Jianye Hao
Date:2022-12-18 08:03:21

In reinforcement learning applications like robotics, agents usually need to deal with various input/output features when specified with different state/action spaces by their developers or physical restrictions. This indicates unnecessary re-training from scratch and considerable sample inefficiency, especially when agents follow similar solution steps to achieve tasks. In this paper, we aim to transfer similar high-level goal-transition knowledge to alleviate the challenge. Specifically, we propose PILoT, i.e., Planning Immediate Landmarks of Targets. PILoT utilizes the universal decoupled policy optimization to learn a goal-conditioned state planner; then, distills a goal-planner to plan immediate landmarks in a model-free style that can be shared among different agents. In our experiments, we show the power of PILoT on various transferring challenges, including few-shot transferring across action spaces and dynamics, from low-dimensional vector states to image inputs, from simple robot to complicated morphology; and we also illustrate a zero-shot transfer solution from a simple 2D navigation task to the harder Ant-Maze task.

Comparison of Model-Free and Model-Based Learning-Informed Planning for PointGoal Navigation

Authors:Yimeng Li, Arnab Debnath, Gregory J. Stein, Jana Kosecka
Date:2022-12-17 05:23:54

In recent years, several learning approaches to point goal navigation in previously unseen environments have been proposed. They vary in the representations of the environments, problem decomposition, and experimental evaluation. In this work, we compare the state-of-the-art Deep Reinforcement Learning based approaches with a Partially Observable Markov Decision Process (POMDP) formulation of the point goal navigation problem. We adapt the (POMDP) sub-goal framework proposed by [1] and modify the component that estimates frontier properties by using partial semantic maps of indoor scenes built from images' semantic segmentation. In addition to the well-known completeness of the model-based approach, we demonstrate that it is robust and efficient in that it leverages informative, learned properties of the frontiers compared to an optimistic frontier-based planner. We also demonstrate its data efficiency compared to the end-to-end deep reinforcement learning approaches. We compare our results against an optimistic planner, ANS and DD-PPO on the Matterport3D dataset using the Habitat Simulator. We show comparable, though slightly worse, performance than the SOTA DD-PPO approach, yet with far less data.

Conditional Predictive Behavior Planning with Inverse Reinforcement Learning for Human-like Autonomous Driving

Authors:Zhiyu Huang, Haochen Liu, Jingda Wu, Chen Lv
Date:2022-12-17 03:16:32

Making safe and human-like decisions is an essential capability of autonomous driving systems, and learning-based behavior planning presents a promising pathway toward achieving this objective. Distinguished from existing learning-based methods that directly output decisions, this work introduces a predictive behavior planning framework that learns to predict and evaluate from human driving data. This framework consists of three components: a behavior generation module that produces a diverse set of candidate behaviors in the form of trajectory proposals, a conditional motion prediction network that predicts future trajectories of other agents based on each proposal, and a scoring module that evaluates the candidate plans using maximum entropy inverse reinforcement learning (IRL). We validate the proposed framework on a large-scale real-world urban driving dataset through comprehensive experiments. The results show that the conditional prediction model can predict distinct and reasonable future trajectories given different trajectory proposals and the IRL-based scoring module can select plans that are close to human driving. The proposed framework outperforms other baseline methods in terms of similarity to human driving trajectories. Additionally, we find that the conditional prediction model improves both prediction and planning performance compared to the non-conditional model. Lastly, we note that learning the scoring module is crucial for aligning the evaluations with human drivers.
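
The scoring step described above can be illustrated with a minimal sketch. It assumes a linear reward over hand-picked trajectory features and a maximum-entropy (softmax) distribution over candidates; the feature names, weights, and the `score_plans` helper are hypothetical and not taken from the paper, where the reward would be learned from human driving data.

```python
import math


def score_plans(features, weights):
    """Return softmax probabilities over candidate plans under a linear reward."""
    rewards = [sum(w * f for w, f in zip(weights, feats)) for feats in features]
    max_r = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp(r - max_r) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical features per candidate plan: [progress, jerk, collision_risk].
candidates = [
    [1.0, 0.2, 0.0],
    [1.2, 0.8, 0.3],
    [0.6, 0.1, 0.0],
]
# Assumed weights for illustration; in IRL these are fitted to human data.
weights = [1.0, -0.5, -4.0]

probs = score_plans(candidates, weights)
# Pick the candidate with the highest probability (equivalently, highest reward).
best_index = max(range(len(probs)), key=probs.__getitem__)
```

Under these assumed weights the first candidate is selected, since it makes good progress without incurring the collision-risk penalty.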

Latent Variable Representation for Reinforcement Learning

Authors:Tongzheng Ren, Chenjun Xiao, Tianjun Zhang, Na Li, Zhaoran Wang, Sujay Sanghavi, Dale Schuurmans, Bo Dai
Date:2022-12-17 00:26:31

Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.

A Simple Decentralized Cross-Entropy Method

Authors:Zichen Zhang, Jun Jin, Martin Jagersand, Jun Luo, Dale Schuurmans
Date:2022-12-16 02:00:55

The Cross-Entropy Method (CEM) is commonly used for planning in model-based reinforcement learning (MBRL), where a centralized approach is typically utilized to update the sampling distribution based on only the top-$k$ operation's results on samples. In this paper, we show that such a centralized approach makes CEM vulnerable to local optima, thus impairing its sample efficiency. To tackle this issue, we propose Decentralized CEM (DecentCEM), a simple but effective improvement over classical CEM, by using an ensemble of CEM instances running independently from one another, and each performing a local improvement of its own sampling distribution. We provide both theoretical and empirical analysis to demonstrate the effectiveness of this simple decentralized approach. We empirically show that, compared to the classical centralized approach using either a single or even a mixture of Gaussian distributions, our DecentCEM finds the global optimum much more consistently and thus improves the sample efficiency. Furthermore, we plug our DecentCEM into the planning problem of MBRL, and evaluate our approach in several continuous control environments, with comparison to the state-of-the-art CEM-based MBRL approaches (PETS and POPLIN). Results show sample efficiency improvement by simply replacing the classical CEM module with our DecentCEM module, while only sacrificing a reasonable amount of computational cost. Lastly, we conduct ablation studies for more in-depth analysis. Code is available at https://github.com/vincentzhang/decentCEM
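
The idea of running independent CEM instances can be sketched on a toy one-dimensional problem. This is a minimal illustration, not the authors' implementation: the objective `bimodal`, the hyperparameters, and the final selection of the best instance by objective value are all assumptions made for the example.

```python
import math
import random


def cem_step(f, mean, std, rng, n_samples=20, top_k=5):
    """One CEM update of a 1-D Gaussian: sample, keep the top-k, refit."""
    samples = [rng.gauss(mean, std) for _ in range(n_samples)]
    elites = sorted(samples, key=f, reverse=True)[:top_k]
    new_mean = sum(elites) / top_k
    var = sum((x - new_mean) ** 2 for x in elites) / top_k
    # Floor the standard deviation so the Gaussian never fully collapses.
    return new_mean, max(math.sqrt(var), 1e-3)


def decentralized_cem(f, init_means, init_std=0.5, iters=30, seed=0):
    """Run independent CEM instances, then return the best final mean."""
    rng = random.Random(seed)
    instances = [(m, init_std) for m in init_means]
    for _ in range(iters):
        # Each instance refines only its own distribution; nothing is shared.
        instances = [cem_step(f, m, s, rng) for m, s in instances]
    return max((m for m, _ in instances), key=f)


def bimodal(x):
    """Toy objective: local optimum near x = -2, global optimum near x = 3."""
    return math.exp(-(x + 2) ** 2) + 2 * math.exp(-(x - 3) ** 2)


best = decentralized_cem(bimodal, init_means=[-3.0, 0.0, 3.0])
```

In this toy setting an instance started near the local optimum stays there, while the instance started near the global optimum refines it, so taking the best instance at the end recovers a solution close to x = 3.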

Reinforcement Learning for Agile Active Target Sensing with a UAV

Authors:Harsh Goel, Laura Jarin Lipschitz, Saurav Agarwal, Sandeep Manjanna, Vijay Kumar
Date:2022-12-16 01:01:17

Active target sensing is the task of discovering and classifying an unknown number of targets in an environment and is critical in search-and-rescue missions. This paper develops a deep reinforcement learning approach to plan informative trajectories that increase the likelihood for an uncrewed aerial vehicle (UAV) to discover missing targets. Our approach efficiently (1) explores the environment to discover new targets, (2) exploits its current belief of the target states and incorporates inaccurate sensor models for high-fidelity classification, and (3) generates dynamically feasible trajectories for an agile UAV by employing a motion primitive library. Extensive simulations on randomly generated environments show that our approach is more efficient in discovering and classifying targets than several other baselines. A unique characteristic of our approach, in contrast to heuristic informative path planning approaches, is that it is robust to varying amounts of deviations of the prior belief from the true target distribution, thereby alleviating the challenge of designing heuristics specific to the application conditions.

Hierarchical Strategies for Cooperative Multi-Agent Reinforcement Learning

Authors:Majd Ibrahim, Ammar Fayad
Date:2022-12-14 18:27:58

Adequate strategizing of agents' behaviors is essential to solving cooperative MARL problems. One intuitively beneficial yet uncommon method in this domain is predicting agents' future behaviors and planning accordingly. Leveraging this point, we propose a two-level hierarchical architecture that combines a novel information-theoretic objective with a trajectory prediction model to learn a strategy. To this end, we introduce a latent policy that learns two types of latent strategies: individual $z_A$, and relational $z_R$ using a modified Graph Attention Network module to extract interaction features. We encourage each agent to behave according to the strategy by conditioning its local $Q$ functions on $z_A$, and we further equip agents with a shared $Q$ function that conditions on $z_R$. Additionally, we introduce two regularizers to allow predicted trajectories to be accurate and rewarding. Empirical results on Google Research Football (GRF) and StarCraft (SC) II micromanagement tasks show that our method establishes a new state of the art, being, to the best of our knowledge, the first MARL algorithm to solve all super hard SC II scenarios as well as the GRF full game with a win rate higher than $95\%$, thus outperforming all existing methods. Videos and a brief overview of the methods and results are available at: https://sites.google.com/view/hier-strats-marl/home.

Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes

Authors:Jiafan He, Heyang Zhao, Dongruo Zhou, Quanquan Gu
Date:2022-12-12 18:58:59

We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.

Reinforcement Learning and Tree Search Methods for the Unit Commitment Problem

Authors:Patrick de Mars
Date:2022-12-12 16:03:31

The unit commitment (UC) problem, which determines operating schedules of generation units to meet demand, is a fundamental task in power systems operation. Existing UC methods using mixed-integer programming are not well-suited to highly stochastic systems. Approaches which more rigorously account for uncertainty could yield large reductions in operating costs by reducing spinning reserve requirements; operating power stations at higher efficiencies; and integrating greater volumes of variable renewables. A promising approach to solving the UC problem is reinforcement learning (RL), a methodology for optimal decision-making which has been used to conquer long-standing grand challenges in artificial intelligence. This thesis explores the application of RL to the UC problem and addresses challenges including robustness under uncertainty; generalisability across multiple problem instances; and scaling to larger power systems than previously studied. To tackle these issues, we develop guided tree search, a novel methodology combining model-free RL and model-based planning. The UC problem is formalised as a Markov decision process and we develop an open-source environment based on real data from Great Britain's power system to train RL agents. In problems of up to 100 generators, guided tree search is shown to be competitive with deterministic UC methods, reducing operating costs by up to 1.4\%. An advantage of RL is that the framework can be easily extended to incorporate considerations important to power systems operators such as robustness to generator failure, wind curtailment or carbon prices. When generator outages are considered, guided tree search saves over 2\% in operating costs as compared with methods using conventional $N-x$ reserve criteria.

MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations

Authors:Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, Aravind Rajeswaran
Date:2022-12-12 04:28:50

Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which form the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl

Optimal Planning of Hybrid Energy Storage Systems using Curtailed Renewable Energy through Deep Reinforcement Learning

Authors:Dongju Kang, Doeun Kang, Sumin Hwangbo, Haider Niaz, Won Bo Lee, J. Jay Liu, Jonggeol Na
Date:2022-12-12 02:24:50

Energy management systems (EMS) are becoming increasingly important in order to utilize the continuously growing curtailed renewable energy. Promising energy storage systems (ESS), such as batteries and green hydrogen, should be employed to maximize the efficiency of energy stakeholders. However, optimal decision-making, i.e., planning the leveraging between different strategies, is confronted with the complexity and uncertainties of large-scale problems. Here, we propose a sophisticated deep reinforcement learning (DRL) methodology with a policy-based algorithm to realize real-time optimal ESS planning under the curtailed renewable energy uncertainty. A quantitative performance comparison showed that the DRL agent outperforms the scenario-based stochastic optimization (SO) algorithm, even with a wide action and observation space. Owing to the uncertainty rejection capability of the DRL, we could confirm robust performance under a large uncertainty of the curtailed renewable energy, with maximized net profit and a stable system. Action-mapping was performed to visually assess the action taken by the DRL agent according to the state. The corresponding results confirmed that the DRL agent learns to act as a human expert would, suggesting reliable application of the proposed methodology.

AutoDRIVE: A Comprehensive, Flexible and Integrated Digital Twin Ecosystem for Enhancing Autonomous Driving Research and Education

Authors:Tanmay Vilas Samak, Chinmay Vilas Samak, Sivanathan Kandhasamy, Venkat Krovi, Ming Xie
Date:2022-12-10 08:16:05

Prototyping and validating hardware-software components, sub-systems and systems within the intelligent transportation system-of-systems framework requires a modular yet flexible and open-access ecosystem. This work presents our attempt towards developing such a comprehensive research and education ecosystem, called AutoDRIVE, for synergistically prototyping, simulating and deploying cyber-physical solutions pertaining to autonomous driving as well as smart city management. AutoDRIVE features both software as well as hardware-in-the-loop testing interfaces with openly accessible scaled vehicle and infrastructure components. The ecosystem is compatible with a variety of development frameworks, and supports both single and multi-agent paradigms through local as well as distributed computing. Most critically, AutoDRIVE is intended to be modularly expandable to explore emergent technologies, and this work highlights various complementary features and capabilities of the proposed ecosystem by demonstrating four such deployment use-cases: (i) autonomous parking using probabilistic robotics approach for mapping, localization, path planning and control; (ii) behavioral cloning using computer vision and deep imitation learning; (iii) intersection traversal using vehicle-to-vehicle communication and deep reinforcement learning; and (iv) smart city management using vehicle-to-infrastructure communication and internet-of-things.

Near-Optimal Differentially Private Reinforcement Learning

Authors:Dan Qiao, Yu-Xiang Wang
Date:2022-12-09 06:03:02

Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
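
The threshold on $\epsilon$ quoted above can be checked directly: it is the point at which the privacy-dependent term of the regret no longer exceeds the non-private term. A short worked derivation, using only the symbols from the abstract and ignoring constants and logarithmic factors:

```latex
\begin{align*}
\frac{S^2 A H^3}{\epsilon} \le \sqrt{S A H^2 T}
&\iff \epsilon \ge \frac{S^2 A H^3}{S^{1/2} A^{1/2} H \sqrt{T}} \\
&\iff \epsilon \ge \frac{S^{1.5} A^{0.5} H^2}{\sqrt{T}}.
\end{align*}
```

For any fixed $\epsilon$, the right-hand side tends to zero as $T \rightarrow \infty$, so the condition eventually holds and the regret is dominated by the $\sqrt{SAH^2T}$ term.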

PALMER: Perception-Action Loop with Memory for Long-Horizon Planning

Authors:Onur Beker, Mohammad Mohammadi, Amir Zamir
Date:2022-12-08 22:11:49

To achieve autonomy in a priori unknown real-world scenarios, agents should be able to: i) act from high-dimensional sensory observations (e.g., images), ii) learn from past experience to adapt and improve, and iii) be capable of long horizon planning. Classical planning algorithms (e.g. PRM, RRT) are proficient at handling long-horizon planning. Deep learning based methods in turn can provide the necessary representations to address the others, by modeling statistical contingencies between observations. In this direction, we introduce a general-purpose planning algorithm called PALMER that combines classical sampling-based planning algorithms with learning-based perceptual representations. For training these perceptual representations, we combine Q-learning with contrastive representation learning to create a latent space where the distance between the embeddings of two states captures how easily an optimal policy can traverse between them. For planning with these perceptual representations, we re-purpose classical sampling-based planning algorithms to retrieve previously observed trajectory segments from a replay buffer and restitch them into approximately optimal paths that connect any given pair of start and goal states. This creates a tight feedback loop between representation learning, memory, reinforcement learning, and sampling-based planning. The end result is an experiential framework for long-horizon planning that is significantly more robust and sample efficient compared to existing methods.

Model-based trajectory stitching for improved behavioural cloning and its applications

Authors:Charles A. Hepburn, Giovanni Montana
Date:2022-12-08 14:18:04

Behavioural cloning (BC) is a commonly used imitation learning method to infer a sequential decision-making policy from expert demonstrations. However, when the quality of the data is not optimal, the resulting behavioural policy also performs sub-optimally once deployed. Recently, there has been a surge in offline reinforcement learning methods that hold the promise to extract high-quality policies from sub-optimal historical data. A common approach is to perform regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to the underlying data. In this work, we investigate whether an offline approach to improving the quality of the existing data can lead to improved behavioural policies without any changes in the BC algorithm. The proposed data improvement approach - Trajectory Stitching (TS) - generates new trajectories (sequences of states and actions) by `stitching' pairs of states that were disconnected in the original data and generating their connecting new action. By construction, these new transitions are guaranteed to be highly plausible according to probabilistic models of the environment, and to improve a state-value function. We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy. Extensive experimental results show that significant performance gains can be achieved using TS over BC policies extracted from the original data. Furthermore, using the D4RL benchmarking suite, we demonstrate that state-of-the-art results are obtained by combining TS with two existing offline learning methodologies reliant on BC, model-based offline planning (MBOP) and policy constraint (TD3+BC).

Design and Planning of Flexible Mobile Micro-Grids Using Deep Reinforcement Learning

Authors:Cesare Caputo, Michel-Alexandre Cardin, Pudong Ge, Fei Teng, Anna Korre, Ehecatl Antonio del Rio Chanona
Date:2022-12-08 08:30:50

Ongoing risks from climate change have impacted the livelihood of global nomadic communities, and are likely to lead to increased migratory movements in coming years. As a result, mobility considerations are becoming increasingly important in energy systems planning, particularly to achieve energy access in developing countries. Advanced Plug and Play control strategies have been recently developed with such a decentralized framework in mind, more easily allowing for the interconnection of nomadic communities, both to each other and to the main grid. In light of the above, the design and planning strategy of a mobile multi-energy supply system for a nomadic community is investigated in this work. Motivated by the scale and dimensionality of the associated uncertainties, impacting all major design and decision variables over the 30-year planning horizon, Deep Reinforcement Learning (DRL) is implemented for the design and planning problem tackled. DRL based solutions are benchmarked against several rigid baseline design options to compare expected performance under uncertainty. The results on a case study for ger communities in Mongolia suggest that mobile nomadic energy systems can be both technically and economically feasible, particularly when considering flexibility, although the degree of spatial dispersion among households is an important limiting factor. Key economic, sustainability and resilience indicators such as Cost, Equivalent Emissions and Total Unmet Load are measured, suggesting potential improvements compared to available baselines of up to 25%, 67% and 76%, respectively. Finally, the decomposition of values of flexibility and plug and play operation is presented using a variation of real options theory, with important implications for both nomadic communities and policymakers focused on enabling their energy access.

Combining Planning, Reasoning and Reinforcement Learning to solve Industrial Robot Tasks

Authors:Matthias Mayr, Faseeh Ahmad, Konstantinos Chatzilygeroudis, Luigi Nardi, Volker Krueger
Date:2022-12-07 10:55:26

One of today's goals for industrial robot systems is to allow fast and easy provisioning for new tasks. Skill-based systems that use planning and knowledge representation have long been one possible answer to this. However, especially with contact-rich robot tasks that need careful parameter settings, such reasoning techniques can fall short if the required knowledge is not adequately modeled. We show an approach that provides a combination of task-level planning and reasoning with targeted learning of skill parameters for a task at hand. Starting from a task goal formulated in PDDL, the learnable parameters in the plan are identified and an operator can choose reward functions and parameters for the learning process. A tight integration with a knowledge framework makes it possible to form a prior for learning, and the use of multi-objective Bayesian optimization makes it easier to balance aspects such as safety and task performance that can often affect each other. We demonstrate the efficacy and versatility of our approach by learning skill parameters for two different contact-rich tasks and show their successful execution on a real 7-DOF KUKA-iiwa.

Scalable Planning and Learning Framework Development for Swarm-to-Swarm Engagement Problems

Authors:Umut Demir, A. Sadik Satir, Gulay Goktas Sever, Cansu Yikilmaz, Nazim Kemal Ure
Date:2022-12-06 12:08:09

Development of guidance, navigation and control frameworks/algorithms for swarms has attracted significant attention in recent years. That being said, planning swarm allocations/trajectories for engaging with enemy swarms remains a largely understudied problem. Although small-scale scenarios can be addressed with tools from differential game theory, existing approaches fail to scale for large-scale multi-agent pursuit evasion (PE) scenarios. In this work, we propose a reinforcement learning (RL) based framework to decompose large-scale swarm engagement problems into a number of independent multi-agent pursuit-evasion games. We simulate a variety of multi-agent PE scenarios, where finite time capture is guaranteed under certain conditions. The calculated PE statistics are provided as a reward signal to the high level allocation layer, which uses an RL algorithm to allocate controlled swarm units to eliminate enemy swarm units with maximum efficiency. We verify our approach in large-scale swarm-to-swarm engagement simulations.

Bi-Level Optimization Augmented with Conditional Variational Autoencoder for Autonomous Driving in Dense Traffic

Authors:Arun Kumar Singh, Jatan Shrestha, Nicola Albarella
Date:2022-12-05 12:56:42

Autonomous driving has a natural bi-level structure. The goal of the upper behavioural layer is to provide appropriate lane change, speeding up, and braking decisions to optimize a given driving task. However, this layer can only indirectly influence the driving efficiency through the lower-level trajectory planner, which takes in the behavioural inputs to produce motion commands. Existing sampling-based approaches do not fully exploit the strong coupling between the behavioural and planning layer. On the other hand, end-to-end Reinforcement Learning (RL) can learn a behavioural layer while incorporating feedback from the lower-level planner. However, purely data-driven approaches often fail in safety metrics in unseen environments. This paper presents a novel alternative; a parameterized bi-level optimization that jointly computes the optimal behavioural decisions and the resulting downstream trajectory. Our approach runs in real-time using a custom GPU-accelerated batch optimizer, and a Conditional Variational Autoencoder learnt warm-start strategy. Extensive simulations show that our approach outperforms state-of-the-art model predictive control and RL approaches in terms of collision rate while being competitive in driving efficiency.

E-MAPP: Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance

Authors:Can Chang, Ni Mu, Jiajun Wu, Ling Pan, Huazhe Xu
Date:2022-12-05 07:02:05

A critical challenge in multi-agent reinforcement learning (MARL) is for multiple agents to efficiently accomplish complex, long-horizon tasks. The agents often have difficulties in cooperating on common goals, dividing complex tasks, and planning through several stages to make progress. We propose to address these challenges by guiding agents with programs designed for parallelization, since programs as a representation contain rich structural and semantic information, and are widely used as abstractions for long-horizon tasks. Specifically, we introduce Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance (E-MAPP), a novel framework that leverages parallel programs to guide multiple agents to efficiently accomplish goals that require planning over $10+$ stages. E-MAPP integrates the structural information from a parallel program, promotes the cooperative behaviors grounded in program semantics, and improves the time efficiency via a task allocator. We conduct extensive experiments on a series of challenging, long-horizon cooperative tasks in the Overcooked environment. Results show that E-MAPP outperforms strong baselines in terms of the completion rate, time efficiency, and zero-shot generalization ability by a large margin.

Online Shielding for Reinforcement Learning

Authors:Bettina Könighofer, Julian Rudolf, Alexander Palmisano, Martin Tappler, Roderick Bloem
Date:2022-12-04 16:00:29

Despite the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next $k$ steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game SNAKE. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.
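
The per-action safety check described above can be sketched on a tiny hand-written MDP. This is an illustrative sketch only: the states, actions, and probabilities are hypothetical, the safety specification is reduced to "never enter an unsafe state", and the absolute threshold test is an assumption about how the blocking decision might be made.

```python
from functools import lru_cache

# Hypothetical toy MDP: TRANSITIONS[state][action] -> [(probability, next_state)].
TRANSITIONS = {
    "start": {
        "safe_path": [(1.0, "mid")],
        "risky_path": [(0.5, "mid"), (0.5, "crash")],
    },
    "mid": {
        "stay": [(1.0, "mid")],
        "gamble": [(0.9, "mid"), (0.1, "crash")],
    },
    "crash": {"stay": [(1.0, "crash")]},
}
UNSAFE = {"crash"}


@lru_cache(maxsize=None)
def max_safe_prob(state, k):
    """Maximal probability of avoiding UNSAFE for the next k steps from state."""
    if state in UNSAFE:
        return 0.0
    if k == 0:
        return 1.0
    return max(action_safe_prob(state, a, k) for a in TRANSITIONS[state])


def action_safe_prob(state, action, k):
    """Safety probability over k steps when taking `action` first and then
    acting as safely as possible afterwards."""
    return sum(p * max_safe_prob(nxt, k - 1) for p, nxt in TRANSITIONS[state][action])


def allowed_actions(state, k, threshold):
    """Actions the shield lets through: safety probability at least `threshold`."""
    return [a for a in TRANSITIONS[state] if action_safe_prob(state, a, k) >= threshold]
```

For example, with a horizon of `k = 3` and a threshold of `0.9`, `allowed_actions("start", 3, 0.9)` keeps `safe_path` and blocks `risky_path`, whose safety probability is only 0.5.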

XTENTH-CAR: A Proportionally Scaled Experimental Vehicle Platform for Connected Autonomy and All-Terrain Research

Authors:Shathushan Sivashangaran, Azim Eskandarian
Date:2022-12-03 21:00:41

Connected Autonomous Vehicles (CAVs) are key components of the Intelligent Transportation System (ITS), and all-terrain Autonomous Ground Vehicles (AGVs) are indispensable tools for a wide range of applications such as disaster response, automated mining, agriculture, military operations, search and rescue missions, and planetary exploration. Experimental validation is a requisite for CAV and AGV research, but requires a large, safe experimental environment when using full-size vehicles which is time-consuming and expensive. To address these challenges, we developed XTENTH-CAR (eXperimental one-TENTH scaled vehicle platform for Connected autonomy and All-terrain Research), an open-source, cost-effective proportionally one-tenth scaled experimental vehicle platform governed by the same physics as a full-size on-road vehicle. XTENTH-CAR is equipped with the best-in-class NVIDIA Jetson AGX Orin System on Module (SOM), stereo camera, 2D LiDAR and open-source Electronic Speed Controller (ESC) with drivers written for both versions of the Robot Operating System (ROS 1 & ROS 2) to facilitate experimental CAV and AGV perception, motion planning and control research, that incorporate state-of-the-art computationally expensive algorithms such as Deep Reinforcement Learning (DRL). XTENTH-CAR is designed for compact experimental environments, and aims to increase the accessibility of experimental CAV and AGV research with low upfront costs, and complete Autonomous Vehicle (AV) hardware and software architectures similar to the full-sized X-CAR experimental vehicle platform, enabling efficient cross-platform development between small-scale and full-scale vehicles.

A Hierarchical Approach for Strategic Motion Planning in Autonomous Racing

Authors:Rudolf Reiter, Jasper Hoffmann, Joschka Boedecker, Moritz Diehl
Date:2022-12-03 12:38:45

We present an approach for safe trajectory planning, where a strategic task related to autonomous racing is learned sample-efficiently within a simulation environment. A high-level policy, represented as a neural network, outputs a reward specification that is used within the cost function of a parametric nonlinear model predictive controller (NMPC). By including constraints and vehicle kinematics in the NLP, we are able to guarantee safe and feasible trajectories related to the used model. Compared to classical reinforcement learning (RL), our approach restricts the exploration to safe trajectories, starts with a good prior performance and yields full trajectories that can be passed to a lowest-level tracking controller. We do not address the lowest-level controller in this work and assume perfect tracking of feasible trajectories. We show the superior performance of our algorithm on simulated racing tasks that include high-level decision making. The vehicle learns to efficiently overtake slower vehicles and to avoid getting overtaken by blocking faster vehicles.

Multi-Agent Reinforcement Learning with Reward Delays

Authors:Yuyang Zhang, Runyu Zhang, Yuantao Gu, Na Li
Date:2022-12-02 20:50:48

This paper considers multi-agent reinforcement learning (MARL) where the rewards are received after delays and the delay time varies across agents and across time steps. Based on the V-learning framework, this paper proposes MARL algorithms that efficiently deal with reward delays. When the delays are finite, our algorithm reaches a coarse correlated equilibrium (CCE) with rate $\tilde{\mathcal{O}}(\frac{H^3\sqrt{S\mathcal{T}_K}}{K}+\frac{H^3\sqrt{SA}}{\sqrt{K}})$ where $K$ is the number of episodes, $H$ is the planning horizon, $S$ is the size of the state space, $A$ is the size of the largest action space, and $\mathcal{T}_K$ is the measure of total delay formally defined in the paper. Moreover, our algorithm is extended to cases with infinite delays through a reward skipping scheme. It achieves convergence rate similar to the finite delay case.

Discrete Control in Real-World Driving Environments using Deep Reinforcement Learning

Authors:Avinash Amballa, Advaith P., Pradip Sasmal, Sumohana Channappayya
Date:2022-11-29 04:24:03

Training self-driving cars is often challenging since they require a vast amount of labeled data in multiple real-world contexts, which is computationally and memory intensive. Researchers often resort to driving simulators to train the agent and transfer the knowledge to a real-world setting. Since simulators lack realistic behavior, these methods are quite inefficient. To address this issue, we introduce a framework (perception, planning, and control) in a real-world driving environment that transfers the real-world environments into gaming environments by setting up a reliable Markov Decision Process (MDP). We propose variations of existing Reinforcement Learning (RL) algorithms in a multi-agent setting to learn and execute the discrete control in real-world environments. Experiments show that the multi-agent setting outperforms the single-agent setting in all the scenarios. We also propose reliable initialization, data augmentation, and training techniques that enable the agents to learn and generalize to navigate in a real-world environment with minimal input video data, and with minimal training. Additionally, to show the efficacy of our proposed algorithm, we deploy our method in the virtual driving environment TORCS.

Multi-robot Social-aware Cooperative Planning in Pedestrian Environments Using Multi-agent Reinforcement Learning

Authors:Zichen He, Chunwei Song, Lu Dong
Date:2022-11-29 03:38:47

Safe and efficient co-planning of multiple robots in pedestrian participation environments is promising for applications. In this work, we propose a novel multi-robot social-aware efficient cooperative planner based on off-policy multi-agent reinforcement learning (MARL) under partial dimension-varying observation and imperfect perception conditions. We adopt a temporal-spatial graph (TSG)-based social encoder to better extract the importance of the social relation between each robot and the pedestrians in its field of view (FOV). Also, we introduce a K-step lookahead reward setting in the multi-robot RL framework to avoid aggressive, intrusive, short-sighted, and unnatural motion decisions generated by robots. Moreover, we improve the traditional centralized critic network with a multi-head global attention module to better aggregate local observation information among different robots to guide the process of individual policy update. Finally, multi-group experimental results verify the effectiveness of the proposed cooperative motion planner.

Continuous Neural Algorithmic Planners

Authors:Yu He, Petar Veličković, Pietro Liò, Andreea Deac
Date:2022-11-29 00:19:35

Neural algorithmic reasoning studies the problem of learning algorithms with neural networks, especially with graph architectures. A recent proposal, XLVIN, reaps the benefits of using a graph neural network that simulates the value iteration algorithm in deep reinforcement learning agents. It allows model-free planning without access to privileged information about the environment, which is usually unavailable. However, XLVIN only supports discrete action spaces, and is hence not trivially applicable to most tasks of real-world interest. We expand XLVIN to continuous action spaces by discretization, and evaluate several selective expansion policies to deal with the large planning graphs. Our proposal, CNAP, demonstrates how neural algorithmic reasoning can make a measurable impact in higher-dimensional continuous control settings, such as MuJoCo, bringing gains in low-data settings and outperforming model-free baselines.

Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning

Authors:Leo Ardon, Alberto Pozanco, Daniel Borrajo, Sumitra Ganesh
Date:2022-11-28 17:45:39

Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every possible state implies that the agent must understand, while also learning to maximize its reward, to ignore irrelevant actions such as inapplicable actions (i.e., actions that have no effect on the environment when performed in a given state). Knowing this information can help reduce the sample complexity of RL algorithms by masking the inapplicable actions from the policy distribution to only explore actions relevant to finding an optimal policy. While this technique has been formalized for quite some time within the Automated Planning community with the concept of precondition in the STRIPS language, RL algorithms have never formally taken advantage of this information to prune the search space to explore. This is typically done in an ad-hoc manner with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introduce this knowledge into the algorithm. We (i) standardize the way knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn the partial action model encapsulating the precondition of an action jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal to mask out irrelevant actions. Moreover, we demonstrate that thanks to the transferability of the knowledge acquired, it can be reused in other tasks and domains to make the learning process more efficient.
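
Masking inapplicable actions from the policy distribution can be sketched as a masked softmax. In this illustration the applicability mask is supplied by hand; in the paper it is learned jointly with the policy, which this sketch does not attempt. The function name and the example logits are hypothetical.

```python
import math
from typing import List


def masked_softmax(logits: List[float], applicable: List[bool]) -> List[float]:
    """Softmax over applicable actions only; masked actions get probability 0."""
    if len(logits) != len(applicable):
        raise ValueError("logits and mask must have the same length")
    if not any(applicable):
        raise ValueError("at least one action must be applicable")
    # Subtract the largest applicable logit for numerical stability.
    top = max(l for l, ok in zip(logits, applicable) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, applicable)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: the second action has no effect in the current state.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)
```

Because the masked action receives exactly zero probability, the agent never samples it, so exploration is spent only on actions that can change the state.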

Evaluation Beyond Task Performance: Analyzing Concepts in AlphaZero in Hex

Authors:Charles Lovering, Jessica Zosa Forde, George Konidaris, Ellie Pavlick, Michael L. Littman
Date:2022-11-26 21:59:11

AlphaZero, an approach to reinforcement learning that couples neural networks and Monte Carlo tree search (MCTS), has produced state-of-the-art strategies for traditional board games like chess, Go, shogi, and Hex. While researchers and game commentators have suggested that AlphaZero uses concepts that humans consider important, it is unclear how these concepts are captured in the network. We investigate AlphaZero's internal representations in the game of Hex using two evaluation techniques from natural language processing (NLP): model probing and behavioral tests. In doing so, we introduce new evaluation tools to the RL community and illustrate how evaluations other than task performance can be used to provide a more complete picture of a model's strengths and weaknesses. Our analyses in the game of Hex reveal interesting patterns and generate some testable hypotheses about how such models learn in general. For example, we find that MCTS discovers concepts before the neural network learns to encode them. We also find that concepts related to short-term end-game planning are best encoded in the final layers of the model, whereas concepts related to long-term planning are encoded in the middle layers of the model.

Operator Splitting Value Iteration

Authors:Amin Rakhsha, Andrew Wang, Mohammad Ghavamzadeh, Amir-massoud Farahmand
Date:2022-11-25 07:34:26

We introduce new planning and reinforcement learning algorithms for discounted MDPs that utilize an approximate model of the environment to accelerate the convergence of the value function. Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm called OS-Dyna. Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in presence of model approximation error.
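
The splitting idea can be sketched for policy evaluation. The update below, V <- (I - gamma * P_hat)^{-1} (r + gamma * (P - P_hat) V), is one natural form of such a splitting and is stated here as an assumption, since the abstract does not give the formula. The two-state chain, the model `P_HAT`, and the inner fixed-point loop used in place of an exact linear solve are likewise illustrative choices.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def solve_in_model(p_hat: Matrix, b: Vector, gamma: float, inner_iters: int = 2000) -> Vector:
    """Approximate (I - gamma * p_hat)^{-1} b by iterating x <- b + gamma * p_hat x."""
    x = [0.0] * len(b)
    for _ in range(inner_iters):
        x = [bi + gamma * pxi for bi, pxi in zip(b, matvec(p_hat, x))]
    return x


def os_vi_step(p: Matrix, p_hat: Matrix, r: Vector, v: Vector, gamma: float) -> Vector:
    """One splitting step: correct the reward with the model error, then plan in the model."""
    pv, phv = matvec(p, v), matvec(p_hat, v)
    b = [ri + gamma * (a - c) for ri, a, c in zip(r, pv, phv)]
    return solve_in_model(p_hat, b, gamma)


def vi_step(p: Matrix, r: Vector, v: Vector, gamma: float) -> Vector:
    """One standard policy-evaluation backup: v <- r + gamma * P v."""
    return [ri + gamma * x for ri, x in zip(r, matvec(p, v))]


# Hypothetical two-state chain under a fixed policy, with a slightly wrong model.
P = [[0.9, 0.1], [0.2, 0.8]]
P_HAT = [[0.85, 0.15], [0.25, 0.75]]
R = [1.0, 0.0]
GAMMA = 0.9

v_os, v_vi = [0.0, 0.0], [0.0, 0.0]
for _ in range(5):
    v_os = os_vi_step(P, P_HAT, R, v_os, GAMMA)
    v_vi = vi_step(P, R, v_vi, GAMMA)
print(v_os, v_vi)
```

Both iterations share the fixed point of the true Bellman equation, so the model error in `P_HAT` affects only the speed of convergence and not the limit. On this example, five splitting steps land much closer to the true values than five standard backups.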

Monte Carlo Tree Search Algorithms for Risk-Aware and Multi-Objective Reinforcement Learning

Authors:Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, Patrick Mannion
Date:2022-11-23 15:33:19

In many risk-aware and multi-objective reinforcement learning settings, the utility of the user is derived from a single execution of a policy. In these settings, making decisions based on the average future returns is not suitable. For example, in a medical setting a patient may only have one opportunity to treat their illness. Making decisions using just the expected future returns -- known in reinforcement learning as the value -- cannot account for the potential range of adverse or positive outcomes a decision may have. Therefore, we should use the distribution over expected future returns differently to represent the critical information that the agent requires at decision time by taking both the future and accrued returns into consideration. In this paper, we propose two novel Monte Carlo tree search algorithms. Firstly, we present a Monte Carlo tree search algorithm that can compute policies for nonlinear utility functions (NLU-MCTS) by optimising the utility of the different possible returns attainable from individual policy executions, resulting in good policies for both risk-aware and multi-objective settings. Secondly, we propose a distributional Monte Carlo tree search algorithm (DMCTS) which extends NLU-MCTS. DMCTS computes an approximate posterior distribution over the utility of the returns, and utilises Thompson sampling during planning to compute policies in risk-aware and multi-objective settings. Both algorithms outperform the state-of-the-art in multi-objective reinforcement learning for the expected utility of the returns.
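
The difference between deciding on expected returns and deciding on the expected utility of total (accrued plus future) returns can be shown with a small worked example. This sketch illustrates only the decision criterion, not NLU-MCTS or DMCTS; the outcome distributions, the accrued return, and the threshold utility are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Outcomes = List[Tuple[float, float]]  # (probability, future return)


def expected_return(outcomes: Outcomes) -> float:
    return sum(p * r for p, r in outcomes)


def expected_utility(outcomes: Outcomes, accrued: float,
                     utility: Callable[[float], float]) -> float:
    """Expected utility of the total return: accrued so far plus future."""
    return sum(p * utility(accrued + r) for p, r in outcomes)


def threshold_utility(total: float) -> float:
    """Hypothetical nonlinear utility: success only if the total return reaches 5."""
    return 1.0 if total >= 5.0 else 0.0


ACCRUED = 3.0
ACTIONS: Dict[str, Outcomes] = {
    "safe": [(1.0, 2.0)],                # always +2, total 5
    "risky": [(0.5, 10.0), (0.5, 0.0)],  # mean +5, but total is 13 or 3
}

best_by_mean = max(ACTIONS, key=lambda a: expected_return(ACTIONS[a]))
best_by_utility = max(
    ACTIONS, key=lambda a: expected_utility(ACTIONS[a], ACCRUED, threshold_utility)
)
print(best_by_mean, best_by_utility)  # prints: risky safe
```

The "risky" action has the higher expected return, yet it succeeds only half the time under the threshold utility, whereas "safe" always succeeds given the return already accrued.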

Model-based Trajectory Stitching for Improved Offline Reinforcement Learning

Authors:Charles A. Hepburn, Giovanni Montana
Date:2022-11-21 16:00:39

In many real-world applications, collecting large and high-quality datasets may be too costly or impractical. Offline reinforcement learning (RL) aims to infer an optimal decision-making policy from a fixed set of data. Getting the most information from historical data is then vital for good performance once the policy is deployed. We propose a model-based data augmentation strategy, Trajectory Stitching (TS), to improve the quality of sub-optimal historical trajectories. TS introduces unseen actions joining previously disconnected states: using a probabilistic notion of state reachability, it effectively 'stitches' together parts of the historical demonstrations to generate new, higher quality ones. A stitching event consists of a transition between a pair of observed states through a synthetic and highly probable action. New actions are introduced only when they are expected to be beneficial, according to an estimated state-value function. We show that using this data augmentation strategy jointly with behavioural cloning (BC) leads to improvements over the behaviour-cloned policy from the original dataset. Improving over the BC policy could then be used as a launchpad for online RL through planning and demonstration-guided RL.

Reward is not Necessary: How to Create a Modular & Compositional Self-Preserving Agent for Life-Long Learning

Authors:Thomas J. Ringstrom
Date:2022-11-20 02:48:01

Reinforcement Learning views the maximization of rewards and avoidance of punishments as central to explaining goal-directed behavior. However, over a life, organisms will need to learn about many different aspects of the world's structure: the states of the world and state-vector transition dynamics. The number of combinations of states grows exponentially as an agent incorporates new knowledge, and there is no obvious weighted combination of pre-existing rewards or costs defined for a given combination of states, as such a weighting would need to encode information about good and bad combinations prior to an agent's experience in the world. Therefore, we must develop more naturalistic accounts of behavior and motivation in large state-spaces. We show that it is possible to use only the intrinsic motivation metric of empowerment, which measures the agent's capacity to realize many possible futures under a transition operator. We propose to scale empowerment to hierarchical state-spaces by using Operator Bellman Equations. These equations produce state-time feasibility functions, which are compositional hierarchical state-time transition operators that map an initial state and time when an agent begins a policy to the final states and times of completing a goal. Because these functions are hierarchical operators we can define hierarchical empowerment measures on them. An agent can then optimize plans to distant states and times to maximize its hierarchical empowerment-gain, allowing it to discover goals that bring about a more favorable coupling of its internal structure (physiological states) to its external environment (world structure & spatial state). Life-long agents could therefore be primarily animated by principles of compositionality and empowerment, exhibiting self-concern for the growth & maintenance of their own structural integrity without recourse to reward-maximization.

Evaluating the Perceived Safety of Urban City via Maximum Entropy Deep Inverse Reinforcement Learning

Authors:Yaxuan Wang, Zhixin Zeng, Qijun Zhao
Date:2022-11-19 11:01:08

Inspired by expert evaluation policy for urban perception, we proposed a novel inverse reinforcement learning (IRL) based framework for predicting urban safety and recovering the corresponding reward function. We also presented a scalable state representation method to model the prediction problem as a Markov decision process (MDP) and use reinforcement learning (RL) to solve the problem. Additionally, we built a dataset called SmallCity based on the crowdsourcing method to conduct the research. As far as we know, this is the first time the IRL approach has been introduced to the urban safety perception and planning field to help experts quantitatively analyze perceptual features. Our results showed that IRL has promising prospects in this field. We will later open-source the crowdsourcing data collection site and the model proposed in this paper.

Prediction-aware and Reinforcement Learning based Altruistic Cooperative Driving

Authors:Rodolfo Valiente, Mahdi Razzaghpour, Behrad Toghi, Ghayoor Shah, Yaser P. Fallah
Date:2022-11-19 04:32:34

Autonomous vehicle (AV) navigation in the presence of human-driven vehicles (HVs) is challenging, as HVs continuously update their policies in response to AVs. In order to navigate safely in the presence of complex AV-HV social interactions, the AVs must learn to predict these changes. Humans are capable of navigating such challenging social interaction settings because of their intrinsic knowledge about other agents' behaviors, which they use to forecast what might happen in the future. Inspired by humans, we provide our AVs the capability of anticipating future states and leveraging prediction in a cooperative reinforcement learning (RL) decision-making framework, to improve safety and robustness. In this paper, we propose an integration of two essential and earlier-presented components of AVs: social navigation and prediction. We formulate the AV decision-making process as an RL problem and seek to obtain optimal policies that produce socially beneficial results utilizing a prediction-aware planning and social-aware optimization RL framework. We also propose a Hybrid Predictive Network (HPN) that anticipates future observations. The HPN is used in a multi-step prediction chain to compute a window of predicted future observations to be used by the value function network (VFN). Finally, a safe VFN is trained to optimize a social utility using a sequence of previous and predicted observations, and a safety prioritizer is used to leverage the interpretable kinematic predictions to mask the unsafe actions, constraining the RL policy. We compare our prediction-aware AV to state-of-the-art solutions and demonstrate performance improvements in terms of efficiency and safety in multiple simulated scenarios.

Planning Irregular Object Packing via Hierarchical Reinforcement Learning

Authors:Sichao Huang, Ziwei Wang, Jie Zhou, Jiwen Lu
Date:2022-11-17 07:16:37

Object packing by autonomous robots is an important challenge in warehouses and the logistics industry. Most conventional data-driven packing planning approaches focus on regular cuboid packing, which are usually heuristic and limit the practical use in realistic applications with everyday objects. In this paper, we propose a deep hierarchical reinforcement learning approach to simultaneously plan packing sequence and placement for irregular object packing. Specifically, the top manager network infers packing sequence from six principal view heightmaps of all objects, and then the bottom worker network receives heightmaps of the next object to predict the placement position and orientation. The two networks are trained hierarchically in a self-supervised Q-Learning framework, where the rewards are provided by the packing results based on the top height, object volume and placement stability in the box. The framework repeats sequence and placement planning iteratively until all objects have been packed into the box or no space remains for unpacked items. We compare our approach with existing robotic packing methods for irregular objects in a physics simulator. Experiments show that our approach can pack more objects with less time cost than the state-of-the-art packing methods of irregular objects. We also implement our packing plan with a robotic manipulator to show the generalization ability in the real world.

Data-pooling Reinforcement Learning for Personalized Healthcare Intervention

Authors:Xinyun Chen, Pengyi Shi, Shanwen Pu
Date:2022-11-16 15:52:49

Motivated by the emerging needs of personalized preventative intervention in many healthcare applications, we consider a multi-stage, dynamic decision-making problem in the online setting with unknown model parameters. To deal with the pervasive issue of small sample size in personalized planning, we develop a novel data-pooling reinforcement learning (RL) algorithm based on a general perturbed value iteration framework. Our algorithm adaptively pools historical data, with three main innovations: (i) the weight of pooling ties directly to the performance of decision (measured by regret) as opposed to estimation accuracy in conventional methods; (ii) no parametric assumptions are needed between historical and current data; and (iii) data-sharing is required only via aggregate statistics, as opposed to patient-level data. Our data-pooling algorithm framework applies to a variety of popular RL algorithms, and we establish a theoretical performance guarantee showing that our pooling version achieves a regret bound strictly smaller than that of the no-pooling counterpart. We substantiate the theoretical development with empirically better performance of our algorithm via a case study in the context of post-discharge intervention to prevent unplanned readmissions, generating practical insights for healthcare management. In particular, our algorithm alleviates privacy concerns about sharing health data, which (i) opens the door for individual organizations to leverage public datasets or published studies to better manage their own patients; and (ii) provides the basis for public policy makers to encourage organizations to share aggregate data to improve population health outcomes for the broader community.

Simulated Mental Imagery for Robotic Task Planning

Authors:Shijia Li, Tomas Kulvicius, Minija Tamosiunaite, Florentin Wörgötter
Date:2022-11-15 17:24:15

Traditional AI-planning methods for task planning in robotics require a symbolically encoded domain description. While powerful in well-defined scenarios, as well as human-interpretable, setting this up requires substantial effort. Different from this, most everyday planning tasks are solved by humans intuitively, using mental imagery of the different planning steps. Here we suggest that the same approach can be used for robots, too, in cases which require only limited execution accuracy. In the current study, we propose a novel sub-symbolic method called Simulated Mental Imagery for Planning (SiMIP), which consists of perception, simulated action, success-checking and re-planning performed on 'imagined' images. We show that it is possible to implement mental imagery-based planning in an algorithmically sound way by combining regular convolutional neural networks and generative adversarial networks. With this method, the robot acquires the capability to use the initially existing scene to generate action plans without symbolic domain descriptions, while at the same time plans remain human-interpretable, different from deep reinforcement learning, which is an alternative sub-symbolic approach. We create a dataset from real scenes for a packing problem of having to correctly place different objects into different target slots. This way efficiency and success rate of this algorithm could be quantified.

Legged Locomotion in Challenging Terrains using Egocentric Vision

Authors:Ananye Agarwal, Ashish Kumar, Jitendra Malik, Deepak Pathak
Date:2022-11-14 18:59:58

Animals are capable of precise and agile locomotion using vision. Replicating this ability has been a long-standing goal in robotics. The traditional approach has been to decompose this problem into elevation mapping and foothold planning phases. The elevation mapping, however, is susceptible to failure and large noise artifacts, requires specialized hardware, and is biologically implausible. In this paper, we present the first end-to-end locomotion system capable of traversing stairs, curbs, stepping stones, and gaps. We show this result on a medium-sized quadruped robot using a single front-facing depth camera. The small size of the robot necessitates discovering specialized gait patterns not seen elsewhere. The egocentric camera requires the policy to remember past information to estimate the terrain under its hind feet. We train our policy in simulation. Training has two phases - first, we train a policy using reinforcement learning with a cheap-to-compute variant of depth image and then in phase 2 distill it into the final policy that uses depth using supervised learning. The resulting policy transfers to the real world and is able to run in real-time on the limited compute of the robot. It can traverse a large variety of terrain while being robust to perturbations like pushes, slippery surfaces, and rocky terrain. Videos are at https://vision-locomotion.github.io

Control Transformer: Robot Navigation in Unknown Environments through PRM-Guided Return-Conditioned Sequence Modeling

Authors:Daniel Lawson, Ahmed H. Qureshi
Date:2022-11-11 18:44:41

Learning long-horizon tasks such as navigation has presented difficult challenges for successfully applying reinforcement learning to robotics. From another perspective, under known environments, sampling-based planning can robustly find collision-free paths in environments without learning. In this work, we propose Control Transformer that models return-conditioned sequences from low-level policies guided by a sampling-based Probabilistic Roadmap (PRM) planner. We demonstrate that our framework can solve long-horizon navigation tasks using only local information. We evaluate our approach on partially-observed maze navigation with MuJoCo robots, including Ant, Point, and Humanoid. We show that Control Transformer can successfully navigate through mazes and transfer to unknown environments. Additionally, we apply our method to a differential drive robot (Turtlebot3) and show zero-shot sim2real transfer under noisy observations.

Robust N-1 secure HV Grid Flexibility Estimation for TSO-DSO coordinated Congestion Management with Deep Reinforcement Learning

Authors:Zhenqi Wang, Sebastian Wende-von Berg, Martin Braun
Date:2022-11-10 20:22:34

Nowadays, the PQ flexibility from the distributed energy resources (DERs) in the high voltage (HV) grids plays a more critical and significant role in grid congestion management in TSO grids. This work proposes a multi-stage deep reinforcement learning approach to estimate the PQ flexibility (PQ area) at the TSO-DSO interfaces and identify the DER PQ setpoints for each operating point in such a way that DERs in the meshed HV grid can be coordinated to offer flexibility for the transmission grid. In the estimation process, we consider the steady-state grid limits and the robustness in the resulting voltage profile against uncertainties and the N-1 security criterion regarding thermal line loading, essential for real-life grid operational planning applications. Using deep reinforcement learning (DRL) for PQ flexibility estimation is the first of its kind. Furthermore, our approach of considering N-1 security criterion for meshed grids and robustness against uncertainty directly in the optimization tasks offers a new perspective besides the common relaxation schema in finding a solution with mathematical optimal power flow (OPF). Finally, significant improvements in the computational efficiency of estimating the PQ area are a highlight of the proposed method.

RARE: Renewable Energy Aware Resource Management in Datacenters

Authors:Vanamala Venkataswamy, Jake Grigsby, Andrew Grimshaw, Yanjun Qi
Date:2022-11-10 05:17:14

The exponential growth in demand for digital services drives massive datacenter energy consumption and negative environmental impacts. Promoting sustainable solutions to pressing energy and digital infrastructure challenges is crucial. Several hyperscale cloud providers have announced plans to power their datacenters using renewable energy. However, integrating renewables to power the datacenters is challenging because the power generation is intermittent, necessitating approaches to tackle power supply variability. Hand engineering domain-specific heuristics-based schedulers to meet specific objective functions in such complex dynamic green datacenter environments is time-consuming, expensive, and requires extensive tuning by domain experts. The green datacenters need smart systems and system software to employ multiple renewable energy sources (wind and solar) by intelligently adapting computing to renewable energy generation. We present RARE (Renewable energy Aware REsource management), a Deep Reinforcement Learning (DRL) job scheduler that automatically learns effective job scheduling policies while continually adapting to datacenters' complex dynamic environment. The resulting DRL scheduler performs better than heuristic scheduling policies with different workloads and adapts to the intermittent power supply from renewables. We demonstrate DRL scheduler system design parameters that, when tuned correctly, produce better performance. Finally, we demonstrate that the DRL scheduler can learn from and improve upon existing heuristic policies using Offline Learning.

Vision-based navigation and obstacle avoidance via deep reinforcement learning

Authors:Paul Blum, Peter Crowley, George Lykotrafitis
Date:2022-11-09 22:36:41

Development of navigation algorithms is essential for the successful deployment of robots in rapidly changing hazardous environments for which prior knowledge of configuration is often limited or unavailable. Use of traditional path-planning algorithms, which are based on localization and require detailed obstacle maps with goal locations, is not possible. In this regard, vision-based algorithms hold great promise, as visual information can be readily acquired by a robot's onboard sensors and provides a much richer source of information from which deep neural networks can extract complex patterns. Deep reinforcement learning has been used to achieve vision-based robot navigation. However, the efficacy of these algorithms in environments with dynamic obstacles and high variation in the configuration space has not been thoroughly investigated. In this paper, we employ a deep Dyna-Q learning algorithm for room evacuation and obstacle avoidance in partially observable environments based on low-resolution raw image data from an onboard camera. We explore the performance of a robotic agent in environments containing no obstacles, convex obstacles, and concave obstacles, both static and dynamic. Obstacles and the exit are initialized in random positions at the start of each episode of reinforcement learning. Overall, we show that our algorithm and training approach can generalize learning for collision-free evacuation of environments with complex obstacle configurations. It is evident that the agent can navigate to a goal location while avoiding multiple static and dynamic obstacles, and can escape from a concave obstacle while searching for and navigating to the exit.

RL-DWA Omnidirectional Motion Planning for Person Following in Domestic Assistance and Monitoring

Authors:Andrea Eirale, Mauro Martini, Marcello Chiaberge
Date:2022-11-09 16:11:41

Robot assistants are emerging as high-tech solutions to support people in everyday life. Following and assisting the user in the domestic environment requires flexible mobility to safely move in cluttered spaces. We introduce a new approach to person following for assistance and monitoring. Our methodology exploits an omnidirectional robotic platform to detach the computation of linear and angular velocities and navigate within the domestic environment without losing track of the assisted person. While linear velocities are managed by a conventional Dynamic Window Approach (DWA) local planner, we trained a Deep Reinforcement Learning (DRL) agent to predict optimized angular velocities commands and maintain the orientation of the robot towards the user. We evaluate our navigation system on a real omnidirectional platform in various indoor scenarios, demonstrating the competitive advantage of our solution compared to a standard differential steering following.

Design Process is a Reinforcement Learning Problem

Authors:Reza Kakooee, Benjamin Dillunberger
Date:2022-11-06 14:37:22

While reinforcement learning has been used widely in research during the past few years, it has found fewer real-world applications than supervised learning due to some weaknesses that RL algorithms suffer from, such as performance degradation in transitioning from the simulator to the real world. Here, we argue that the design process is a reinforcement learning problem and can potentially be a proper application for RL algorithms, as it is an offline process and is conventionally done in CAD software, a sort of simulator. This creates opportunities for using RL methods and, at the same time, raises challenges. While design processes are diverse, here we focus on space layout planning (SLP), frame it as an RL problem under the Markov Decision Process, and use PPO to address the layout design problem. To do so, we developed an environment named RLDesigner to simulate the SLP. RLDesigner is an OpenAI Gym compatible environment that can be easily customized to define a diverse range of design scenarios. We publicly share the environment to encourage both the RL and architecture communities to use it for testing different RL algorithms or in their design practice. The code is available in the following GitHub repository: https://github.com/RezaKakooee/rldesigner/tree/Second_Paper

Wall Street Tree Search: Risk-Aware Planning for Offline Reinforcement Learning

Authors:Dan Elbaz, Gal Novik, Oren Salzman
Date:2022-11-06 07:42:24

Offline reinforcement-learning (RL) algorithms learn to make decisions using a given, fixed training dataset without online data collection. This problem setting is captivating because it holds the promise of utilizing previously collected datasets without any costly or risky interaction with the environment. However, this promise also carries the setting's main drawback: the restricted dataset induces uncertainty, because the agent can encounter unfamiliar sequences of states and actions that the training data did not cover. To mitigate the destructive effects of this uncertainty, we need to balance the aspiration to take reward-maximizing actions with the risk incurred by incorrect ones. In financial economics, modern portfolio theory (MPT) is a method that risk-averse investors can use to construct diversified portfolios that maximize their returns without unacceptable levels of risk. We propose integrating MPT into the agent's decision-making process, presenting a new simple-yet-highly-effective risk-aware planning algorithm for offline RL. Our algorithm allows us to systematically account for the estimated quality of specific actions and their estimated risk due to the uncertainty. We show that our approach can be coupled with the Transformer architecture to yield a state-of-the-art planner, which maximizes the return for offline RL tasks. Moreover, our algorithm reduces the variance of the results significantly compared to conventional Transformer decoding, which results in a much more stable algorithm -- a property that is essential for the offline RL setting, where real-world exploration and failures can be costly or dangerous.
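The trade-off between estimated quality and estimated risk can be illustrated with a generic mean-variance criterion in the spirit of MPT. This is a sketch under assumptions, not the paper's planner: the function name `risk_aware_choice`, the `risk_aversion` weight, and the idea of scoring each action from a handful of sampled return estimates are all hypothetical.

```python
from statistics import mean, pvariance


def risk_aware_choice(return_samples, risk_aversion=1.0):
    """Pick the action maximizing mean return minus a variance penalty.

    return_samples maps each candidate action to a list of return
    estimates (for example, from several sampled rollouts). This is a
    generic mean-variance rule, assumed here only for illustration.
    """
    scores = {
        action: mean(samples) - risk_aversion * pvariance(samples)
        for action, samples in return_samples.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical estimates: "risky" has the higher mean but a large spread.
estimates = {"safe": [1.0, 1.0, 1.0], "risky": [0.0, 4.0, 0.5]}
cautious_pick, _ = risk_aware_choice(estimates, risk_aversion=1.0)
greedy_pick, _ = risk_aware_choice(estimates, risk_aversion=0.0)
```

With `risk_aversion=0.0` the rule reduces to plain return maximization, so the weight makes explicit how much expected return the agent is willing to give up for lower uncertainty.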

Path Planning Using Wassertein Distributionally Robust Deep Q-learning

Authors:Cem Alpturk, Venkatraman Renganathan
Date:2022-11-04 11:00:59

We investigate the problem of risk averse robot path planning using the deep reinforcement learning and distributionally robust optimization perspectives. Our problem formulation involves modelling the robot as a stochastic linear dynamical system, assuming that a collection of process noise samples is available. We cast the risk averse motion planning problem as a Markov decision process and propose a continuous reward function design that explicitly takes into account the risk of collision with obstacles while encouraging the robot's motion towards the goal. We learn the risk-averse robot control actions through Lipschitz approximated Wasserstein distributionally robust deep Q-learning to hedge against the noise uncertainty. The learned control actions result in a safe and risk averse trajectory from the source to the goal, avoiding all the obstacles. Various supporting numerical simulations are presented to demonstrate our proposed approach.

A Survey on Reinforcement Learning in Aviation Applications

Authors:Pouria Razzaghi, Amin Tabrizian, Wei Guo, Shulu Chen, Abenezer Taye, Ellis Thompson, Alexis Bregeon, Ali Baheri, Peng Wei
Date:2022-11-03 21:10:25

Compared with model-based control and optimization methods, reinforcement learning (RL) provides a data-driven, learning-based framework to formulate and solve sequential decision-making problems. The RL framework has become promising due to largely improved data availability and computing power in the aviation industry. Many aviation-based applications can be formulated or treated as sequential decision-making problems. Some of them are offline planning problems, while others need to be solved online and are safety-critical. In this survey paper, we first describe standard RL formulations and solutions. Then we survey the landscape of existing RL-based applications in aviation. Finally, we summarize the paper, identify the technical gaps, and suggest future directions of RL research in aviation.

Spatial-temporal recurrent reinforcement learning for autonomous ships

Authors:Martin Waltz, Ostap Okhrin
Date:2022-11-02 10:08:31

This paper proposes a spatial-temporal recurrent neural network architecture for deep $Q$-networks that can be used to steer an autonomous ship. The network design makes it possible to handle an arbitrary number of surrounding target ships while offering robustness to partial observability. Furthermore, a state-of-the-art collision risk metric is proposed to enable an easier assessment of different situations by the agent. The COLREG rules of maritime traffic are explicitly considered in the design of the reward function. The final policy is validated on a custom set of newly created single-ship encounters called `Around the Clock' problems and the commonly used Imazu (1987) problems, which include 18 multi-ship scenarios. Performance comparisons with artificial potential field and velocity obstacle methods demonstrate the potential of the proposed approach for maritime path planning. Furthermore, the new architecture exhibits robustness when it is deployed in multi-agent scenarios and it is compatible with other deep reinforcement learning algorithms, including actor-critic frameworks.

Wind Power Forecasting Considering Data Privacy Protection: A Federated Deep Reinforcement Learning Approach

Authors:Yang Li, Ruinong Wang, Yuanzheng Li, Meng Zhang, Chao Long
Date:2022-11-02 08:36:32

In a modern power system with an increasing proportion of renewable energy, wind power prediction is crucial to the arrangement of power grid dispatching plans due to the volatility of wind power. However, traditional centralized forecasting methods raise concerns regarding data privacy and the data-island problem. To address data privacy and openness, we propose a forecasting scheme that combines federated learning and deep reinforcement learning (DRL) for ultra-short-term wind power forecasting, called federated deep reinforcement learning (FedDRL). Firstly, this paper uses the deep deterministic policy gradient (DDPG) algorithm as the basic forecasting model to improve prediction accuracy. Secondly, we integrate the DDPG forecasting model into the framework of federated learning. The designed FedDRL can obtain an accurate prediction model in a decentralized way by sharing model parameters instead of private data, which avoids sensitive privacy issues. The simulation results show that the proposed FedDRL outperforms traditional prediction methods in terms of forecasting accuracy. More importantly, while ensuring forecasting performance, FedDRL can effectively protect data privacy and relieve communication pressure compared with the traditional centralized forecasting method. In addition, a simulation with different federated learning parameters is conducted to confirm the robustness of the proposed scheme.

Learning to Solve Voxel Building Embodied Tasks from Pixels and Natural Language Instructions

Authors:Alexey Skrynnik, Zoya Volovikova, Marc-Alexandre Côté, Anton Voronov, Artem Zholus, Negar Arabzadeh, Shrestha Mohanty, Milagro Teruel, Ahmed Awadallah, Aleksandr Panov, Mikhail Burtsev, Julia Kiseleva
Date:2022-11-01 18:30:42

The adoption of pre-trained language models to generate action plans for embodied agents is a promising research strategy. However, execution of instructions in real or simulated environments requires verification of the feasibility of actions as well as their relevance to the completion of a goal. We propose a new method that combines a language model and reinforcement learning for the task of building objects in a Minecraft-like environment according to the natural language instructions. Our method first generates a set of consistently achievable sub-goals from the instructions and then completes associated sub-tasks with a pre-trained RL policy. The proposed method formed the RL baseline at the IGLU 2022 competition.

Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction

Authors:Dilip Arumugam, Satinder Singh
Date:2022-10-30 16:30:23

The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we establish a computationally-intractable, exact planning algorithm that takes advantage of this measure to show more efficient planning. We then conclude by introducing a specific form of state abstraction with the potential to reduce BAMDP complexity and give rise to a computationally-tractable, approximate planning algorithm.

A Bibliometric Analysis and Review on Reinforcement Learning for Transportation Applications

Authors:Can Li, Lei Bai, Lina Yao, S. Travis Waller, Wei Liu
Date:2022-10-26 07:34:51

Transportation is the backbone of the economy and urban development. Improving the efficiency, sustainability, resilience, and intelligence of transportation systems is critical and also challenging. The constantly changing traffic conditions, the uncertain influence of external factors (e.g., weather, accidents), and the interactions among multiple travel modes and multi-type flows result in the dynamic and stochastic natures of transportation systems. The planning, operation, and control of transportation systems require flexible and adaptable strategies in order to deal with uncertainty, non-linearity, variability, and high complexity. In this context, Reinforcement Learning (RL) that enables autonomous decision-makers to interact with the complex environment, learn from the experiences, and select optimal actions has been rapidly emerging as one of the most useful approaches for smart transportation. This paper conducts a bibliometric analysis to identify the development of RL-based methods for transportation applications, typical journals/conferences, and leading topics in the field of intelligent transportation in recent ten years. Then, this paper presents a comprehensive literature review on applications of RL in transportation by categorizing different methods with respect to the specific application domains. The potential future research directions of RL applications and developments are also discussed.

Evaluating Long-Term Memory in 3D Mazes

Authors:Jurgis Pasukonis, Timothy Lillicrap, Danijar Hafner
Date:2022-10-24 16:32:28

Intelligent agents need to remember salient information to reason in partially-observed environments. For example, agents with a first-person view should remember the positions of relevant objects even if they go out of view. Similarly, to effectively navigate through rooms agents need to remember the floor plan of how rooms are connected. However, most benchmark tasks in reinforcement learning do not test long-term memory in agents, slowing down progress in this important research direction. In this paper, we introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents. Unlike existing benchmarks, Memory Maze measures long-term memory separate from confounding agent abilities and requires the agent to localize itself by integrating information over time. With Memory Maze, we propose an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation. Recording a human player establishes a strong baseline and verifies the need to build up and retain memories, which is reflected in their gradually increasing rewards within each episode. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes, leaving room for future algorithmic designs to be evaluated on the Memory Maze.

DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics

Authors:Siwei Chen, Yiqing Xu, Cunjun Yu, Linfeng Li, Xiao Ma, Zhongwen Xu, David Hsu
Date:2022-10-24 09:33:59

Deformable Object Manipulation (DOM) is of significant importance to both daily and industrial applications. Recent successes in differentiable physics simulators allow learning algorithms to train a policy with analytic gradients through environment dynamics, which significantly facilitates the development of DOM algorithms. However, existing DOM benchmarks are either single-object-based or non-differentiable. This leaves the questions of 1) how a task-specific algorithm performs on other tasks and 2) how a differentiable-physics-based algorithm compares with the non-differentiable ones in general. In this work, we present DaXBench, a differentiable DOM benchmark with a wide object and task coverage. DaXBench includes 9 challenging high-fidelity simulated tasks, covering rope, cloth, and liquid manipulation with various difficulty levels. To better understand the performance of general algorithms on different DOM tasks, we conduct comprehensive experiments over representative DOM methods, ranging from planning to imitation learning and reinforcement learning. In addition, we provide careful empirical studies of existing decision-making algorithms based on differentiable physics, and discuss their limitations, as well as potential future directions.

The Design and Realization of Multi-agent Obstacle Avoidance based on Reinforcement Learning

Authors:Enyu Zhao, Chanjuan Liu, Houfu Su, Yang Liu
Date:2022-10-24 02:46:32

Intelligent agents and multi-agent systems play important roles in settings such as the control of groups of drones, and multi-agent navigation and obstacle avoidance, a foundational function for advanced applications, is of great importance. In multi-agent navigation and obstacle avoidance tasks, the decision-making interactions and dynamic changes of agents become difficult for traditional route planning algorithms or reinforcement learning algorithms to handle as the complexity of the environment increases. The classical multi-agent reinforcement learning algorithm, multi-agent deep deterministic policy gradient (MADDPG), solved earlier algorithms' problems of a non-stationary training process and an inability to deal with environment randomness. However, MADDPG ignores the temporal information hidden in agents' interactions with the environment. Moreover, because its centralized training with decentralized execution (CTDE) technique lets each agent's critic network compute over all agents' actions and the whole environment's information, it scales poorly to larger numbers of agents. To address MADDPG's neglect of temporal information in the data, this article proposes a new algorithm called MADDPG-LSTMactor, which combines MADDPG with long short-term memory (LSTM). By using an agent's observations over consecutive timesteps as the input of its policy network, it allows the LSTM layer to process the hidden temporal information. Experimental results demonstrate that this algorithm performs better in scenarios where the number of agents is small. In addition, to address MADDPG's inefficiency in scenarios with many agents, this article puts forward a lightweight MADDPG (MADDPG-L) algorithm, which simplifies the input of the critic network. Experimental results show that this algorithm performs better than MADDPG when the number of agents is large.

Active Exploration for Robotic Manipulation

Authors:Tim Schneider, Boris Belousov, Georgia Chalvatzaki, Diego Romeres, Devesh K. Jha, Jan Peters
Date:2022-10-23 18:07:51

Robotic manipulation stands as a largely unsolved problem despite significant advances in robotics and machine learning in recent years. One of the key challenges in manipulation is the exploration of the dynamics of the environment when there is continuous contact between the objects being manipulated. This paper proposes a model-based active exploration approach that enables efficient learning in sparse-reward robotic manipulation tasks. The proposed method estimates an information gain objective using an ensemble of probabilistic models and deploys model predictive control (MPC) to plan actions online that maximize the expected reward while also performing directed exploration. We evaluate our proposed algorithm in simulation and on a real robot, trained from scratch with our method, on a challenging ball pushing task on tilted tables, where the target ball position is not known to the agent a-priori. Our real-world robot experiment serves as a fundamental application of active exploration in model-based reinforcement learning of complex robotic manipulation tasks.

LEAGUE: Guided Skill Learning and Abstraction for Long-Horizon Manipulation

Authors:Shuo Cheng, Danfei Xu
Date:2022-10-23 06:57:05

To assist with everyday human activities, robots must solve complex long-horizon tasks and generalize to new settings. Recent deep reinforcement learning (RL) methods show promise in fully autonomous learning, but they struggle to reach long-term goals in large environments. On the other hand, Task and Motion Planning (TAMP) approaches excel at solving and generalizing across long-horizon tasks, thanks to their powerful state and action abstractions. But they assume predefined skill sets, which limits their real-world applications. In this work, we combine the benefits of these two paradigms and propose an integrated task planning and skill learning framework named LEAGUE (Learning and Abstraction with Guidance). LEAGUE leverages the symbolic interface of a task planner to guide RL-based skill learning and creates abstract state space to enable skill reuse. More importantly, LEAGUE learns manipulation skills in-situ of the task planning system, continuously growing its capability and the set of tasks that it can solve. We evaluate LEAGUE on four challenging simulated task domains and show that LEAGUE outperforms baselines by large margins. We also show that the learned skills can be reused to accelerate learning in new tasks domains and transfer to a physical robot platform.

Active Predictive Coding: A Unified Neural Framework for Learning Hierarchical World Models for Perception and Planning

Authors:Rajesh P. N. Rao, Dimitrios C. Gklezakos, Vishwas Sathish
Date:2022-10-23 05:44:22

Predictive coding has emerged as a prominent model of how the brain learns through predictions, anticipating the importance accorded to predictive learning in recent AI architectures such as transformers. Here we propose a new framework for predictive coding called active predictive coding which can learn hierarchical world models and solve two radically different open problems in AI: (1) how do we learn compositional representations, e.g., part-whole hierarchies, for equivariant vision? and (2) how do we solve large-scale planning problems, which are hard for traditional reinforcement learning, by composing complex action sequences from primitive policies? Our approach exploits hypernetworks, self-supervised learning and reinforcement learning to learn hierarchical world models that combine task-invariant state transition networks and task-dependent policy networks at multiple abstraction levels. We demonstrate the viability of our approach on a variety of vision datasets (MNIST, FashionMNIST, Omniglot) as well as on a scalable hierarchical planning problem. Our results represent, to our knowledge, the first demonstration of a unified solution to the part-whole learning problem posed by Hinton, the nested reference frames problem posed by Hawkins, and the integrated state-action hierarchy learning problem in reinforcement learning.

Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

Authors:Runlong Zhou, Ruosong Wang, Simon S. Du
Date:2022-10-20 21:32:01

We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\tilde{O}(\sqrt{\mathsf{Var}^\star M \Gamma S A K})$ regret bound where $\tilde{O}$ hides logarithmic factors, $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, $\Gamma \le S$ is the maximum transition degree of any state-action pair, and $\mathsf{Var}^\star$ is a variance quantity describing the determinism of the LMDP. The regret bound only scales logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. This is also the first problem-dependent regret bound for LMDPs. Key in our proof is an analysis of the total variance of alpha vectors (a generalization of value functions), which is handled with a truncation method. We complement our positive result with a novel $\Omega(\sqrt{\mathsf{Var}^\star M S A K})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant for the class of variance-bounded LMDPs. Our lower bound relies on new constructions of hard instances and an argument inspired by the symmetrization technique from theoretical computer science, both of which are technically different from existing lower-bound proofs for MDPs, and thus can be of independent interest.

Robotic Table Wiping via Reinforcement Learning and Whole-body Trajectory Optimization

Authors:Thomas Lew, Sumeet Singh, Mario Prats, Jeffrey Bingham, Jonathan Weisz, Benjie Holson, Xiaohan Zhang, Vikas Sindhwani, Yao Lu, Fei Xia, Peng Xu, Tingnan Zhang, Jie Tan, Montserrat Gonzalez
Date:2022-10-19 20:12:43

We propose a framework to enable multipurpose assistive mobile robots to autonomously wipe tables to clean spills and crumbs. This problem is challenging, as it requires planning wiping actions while reasoning over uncertain latent dynamics of crumbs and spills captured via high-dimensional visual observations. Simultaneously, we must guarantee constraints satisfaction to enable safe deployment in unstructured cluttered environments. To tackle this problem, we first propose a stochastic differential equation to model crumbs and spill dynamics and absorption with a robot wiper. Using this model, we train a vision-based policy for planning wiping actions in simulation using reinforcement learning (RL). To enable zero-shot sim-to-real deployment, we dovetail the RL policy with a whole-body trajectory optimization framework to compute base and arm joint trajectories that execute the desired wiping motions while guaranteeing constraints satisfaction. We extensively validate our approach in simulation and on hardware. Video: https://youtu.be/inORKP4F3EI

Robot Navigation with Reinforcement Learned Path Generation and Fine-Tuned Motion Control

Authors:Longyuan Zhang, Ziyue Hou, Ji Wang, Ziang Liu, Wei Li
Date:2022-10-19 15:10:52

In this paper, we propose a novel reinforcement learning (RL) based path generation (RL-PG) approach for mobile robot navigation without prior exploration of an unknown environment. Multiple predictive path points are dynamically generated by a deep Markov model optimized with an RL approach for the robot to track. To ensure safety when tracking the predictive points, the robot's motion is fine-tuned by a motion fine-tuning module. Such an approach, using the deep Markov model with an RL algorithm for planning, focuses on the relationship between adjacent path points. Our analysis shows that the proposed approach is more effective and achieves a higher success rate than the RL-based approach DWA-RL and the traditional navigation approach APF. We deploy our model on both simulation and physical platforms and demonstrate that it performs robot navigation effectively and safely.

Planning for Sample Efficient Imitation Learning

Authors:Zhao-Heng Yin, Weirui Ye, Qifeng Chen, Yang Gao
Date:2022-10-18 05:19:26

Imitation learning is a class of promising policy learning algorithms that is free from many practical issues with reinforcement learning, such as the reward design issue and the hardness of exploration. However, current imitation algorithms struggle to achieve both high performance and high in-environment sample efficiency simultaneously. Behavioral Cloning (BC) does not need in-environment interactions, but it suffers from the covariate shift problem, which harms its performance. Adversarial Imitation Learning (AIL) turns imitation learning into a distribution matching problem. It can achieve better performance on some tasks, but it requires a large number of in-environment interactions. Inspired by the recent success of EfficientZero in RL, we propose EfficientImitate (EI), a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously. Our algorithmic contribution in this paper is two-fold. First, we extend AIL into MCTS-based RL. Second, we show that the two seemingly incompatible classes of imitation algorithms (BC and AIL) can be naturally unified under our framework, enjoying the benefits of both. We benchmark our method not only on the state-based DeepMind Control Suite, but also on the image version, which many previous works find highly challenging. Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency. EI shows over 4x gain in performance in the limited sample setting on state-based and image-based tasks and can solve challenging problems like Humanoid, where previous methods fail with a small number of interactions. Our code is available at https://github.com/zhaohengyin/EfficientImitate.

Simple Emergent Action Representations from Multi-Task Policy Training

Authors:Pu Hua, Yubei Chen, Huazhe Xu
Date:2022-10-18 03:49:13

The low-level sensory and motor signals in deep reinforcement learning, which exist in high-dimensional spaces such as image observations or motor torques, are inherently challenging to understand or utilize directly for downstream tasks. While sensory representations have been extensively studied, the representations of motor actions are still an area of active exploration. Our work reveals that a space containing meaningful action representations emerges when a multi-task policy network takes as inputs both states and task embeddings. Moderate constraints are added to improve its representation ability. Therefore, interpolated or composed embeddings can function as a high-level interface within this space, providing instructions to the agent for executing meaningful action sequences. Empirical results demonstrate that the proposed action representations are effective for intra-action interpolation and inter-action composition with limited or no additional learning. Furthermore, our approach exhibits superior task adaptation ability compared to strong baselines in Mujoco locomotion tasks. Our work sheds light on the promising direction of learning action representations for efficient, adaptable, and composable RL, forming the basis of abstract action planning and the understanding of motor signal space. Project page: https://sites.google.com/view/emergent-action-representation/

Towards an Interpretable Hierarchical Agent Framework using Semantic Goals

Authors:Bharat Prakash, Nicholas Waytowich, Tim Oates, Tinoosh Mohsenin
Date:2022-10-16 02:04:13

Learning to solve long horizon temporally extended tasks with reinforcement learning has been a challenge for several years now. We believe that it is important to leverage both the hierarchical structure of complex tasks and to use expert supervision whenever possible to solve such tasks. This work introduces an interpretable hierarchical agent framework by combining planning and semantic goal directed reinforcement learning. We assume access to certain spatial and haptic predicates and construct a simple and powerful semantic goal space. These semantic goal representations are more interpretable, making expert supervision and intervention easier. They also eliminate the need to write complex, dense reward functions thereby reducing human engineering effort. We evaluate our framework on a robotic block manipulation task and show that it performs better than other methods, including both sparse and dense reward functions. We also suggest some next steps and discuss how this framework makes interaction and collaboration with humans easier.

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Authors:Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji
Date:2022-10-15 09:22:22

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to provide a time schedule for policy updates in advance, which is particularly suitable for scenarios where changing the policy adaptively is costly for the agent. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes using $O\left(H+\log_2\log_2(K) \right)$ batches with confidence parameter $\delta$, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$. To the best of our knowledge, it is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches is at least $\Omega\left(H/\log_A(K)+ \log_2\log_2(K) \right)$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
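To get a feel for how slowly the batch complexity grows with the number of episodes, the two expressions above can be evaluated numerically. The sketch below drops the constants hidden by the $O(\cdot)$ and $\Omega(\cdot)$ notation, so the numbers indicate scale only; the function names and the example values of $H$, $A$, and $K$ are assumptions chosen for illustration.

```python
import math


def upper_batch_scale(H, K):
    """H + log2(log2(K)), the upper-bound expression without constants."""
    return H + math.log2(math.log2(K))


def lower_batch_scale(H, A, K):
    """H / log_A(K) + log2(log2(K)), the lower-bound expression."""
    log_a_k = math.log2(K) / math.log2(A)  # change of base for log_A(K)
    return H / log_a_k + math.log2(math.log2(K))


# Hypothetical instance: horizon 10, 4 actions, 65536 episodes.
upper = upper_batch_scale(H=10, K=65536)       # 10 + log2(16) = 14
lower = lower_batch_scale(H=10, A=4, K=65536)  # 10 / 8 + 4 = 5.25
```

In this instance the policy would change on the order of 14 times across 65536 episodes, and squaring $K$ adds only one more unit to the $\log_2\log_2(K)$ term.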

WILD-SCAV: Benchmarking FPS Gaming AI on Unity3D-based Environments

Authors:Xi Chen, Tianyu Shi, Qingpeng Zhao, Yuchen Sun, Yunfei Gao, Xiangjun Wang
Date:2022-10-14 13:39:41

Recent advances in deep reinforcement learning (RL) have demonstrated complex decision-making capabilities in simulation environments such as Arcade Learning Environment, MuJoCo, and ViZDoom. However, they are hardly extensible to more complicated problems, mainly due to the lack of complexity and variations in the environments they are trained and tested on. Furthermore, they are not extensible to an open-world environment to facilitate long-term exploration research. To learn realistic task-solving capabilities, we need to develop an environment with greater diversity and complexity. We developed WILD-SCAV, a powerful and extensible environment based on a 3D open-world FPS (First-Person Shooter) game to bridge the gap. It provides realistic 3D environments of variable complexity, various tasks, and multiple modes of interaction, where agents can learn to perceive 3D environments, navigate and plan, compete and cooperate in a human-like manner. WILD-SCAV also supports different complexities, such as configurable maps with different terrains, building structures and distributions, and multi-agent settings with cooperative and competitive tasks. The experimental results on configurable complexity, multi-tasking, and multi-agent scenarios demonstrate the effectiveness of WILD-SCAV in benchmarking various RL algorithms, as well as its potential to give rise to intelligent agents with generalized task-solving abilities. The link to our open-sourced code can be found here: https://github.com/inspirai/wilderness-scavenger.

Decentralized Coverage Path Planning with Reinforcement Learning and Dual Guidance

Authors:Yongkai Liu, Jiawei Hu, Wei Dong
Date:2022-10-14 04:39:20

Planning coverage paths for multiple robots in a decentralized way enhances the robustness of coverage tasks against uncertain malfunctions. To achieve high efficiency in a distributed manner for each single robot, a comprehensive understanding of both the complicated environments and cooperative agents' intent is crucial. Unfortunately, existing works commonly consider only part of these factors, resulting in imbalanced subareas or unnecessary overlaps. To tackle this issue, we introduce a decentralized reinforcement learning framework with dual guidance to train each agent to solve the decentralized multiple coverage path planning problem directly from the environment states. As distributed robots require others' intentions to achieve better coverage efficiency, we utilize two guidance methods, artificial potential fields and heuristic guidance, to include and integrate others' intentions into observations for each robot. With our constructed framework, results show that our agents successfully learn to determine their own subareas while achieving full coverage, balanced subareas and low overlap rates. We then implement spanning tree cover within those subareas to construct actual routes for each robot and complete given coverage tasks. Our performance is also compared with the state-of-the-art decentralized method, showing up to 10 percent lower overlap rates while maintaining high efficiency in similar environments.

A Direct Approximation of AIXI Using Logical State Abstractions

Authors:Samuel Yang-Zhao, Tianyu Wang, Kee Siong Ng
Date:2022-10-13 11:30:56

We propose a practical integration of logical state abstraction with AIXI, a Bayesian optimality notion for reinforcement learning agents, to significantly expand the model class that AIXI agents can be approximated over to complex history-dependent and structured environments. The state representation and reasoning framework is based on higher-order logic, which can be used to define and enumerate complex features on non-Markovian and structured environments. We address the problem of selecting the right subset of features to form state abstractions by adapting the $\Phi$-MDP optimisation criterion from state abstraction theory. Exact Bayesian model learning is then achieved using a suitable generalisation of Context Tree Weighting over abstract state sequences. The resultant architecture can be integrated with different planning algorithms. Experimental results on controlling epidemics on large-scale contact networks validate the agent's performance.

Generalization with Lossy Affordances: Leveraging Broad Offline Data for Learning Visuomotor Tasks

Authors:Kuan Fang, Patrick Yin, Ashvin Nair, Homer Walke, Gengchen Yan, Sergey Levine
Date:2022-10-12 21:46:38

The utilization of broad datasets has proven to be crucial for generalization for a wide range of fields. However, how to effectively make use of diverse multi-task data for novel downstream tasks still remains a grand challenge in robotics. To tackle this challenge, we introduce a framework that acquires goal-conditioned policies for unseen temporally extended tasks via offline reinforcement learning on broad data, in combination with online fine-tuning guided by subgoals in learned lossy representation space. When faced with a novel task goal, the framework uses an affordance model to plan a sequence of lossy representations as subgoals that decomposes the original task into easier problems. Learned from the broad data, the lossy representation emphasizes task-relevant information about states and goals while abstracting away redundant contexts that hinder generalization. It thus enables subgoal planning for unseen tasks, provides a compact input to the policy, and facilitates reward shaping during fine-tuning. We show that our framework can be pre-trained on large-scale datasets of robot experiences from prior work and efficiently fine-tuned for novel tasks, entirely from visual inputs without any manual reward engineering.

Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning

Authors:Anton Bakhtin, David J Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H Miller, Noam Brown
Date:2022-10-11 14:47:35

No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.
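The idea of regularizing a reward-maximizing policy toward a human imitation-learned policy can be illustrated at the level of a single decision. Maximizing $\sum_a \pi(a)Q(a) - \lambda\,\mathrm{KL}(\pi \,\|\, \tau)$ over distributions $\pi$ has the closed-form solution $\pi(a) \propto \tau(a)\exp(Q(a)/\lambda)$, where $\tau$ is the anchor (imitation) policy. The sketch below computes only this closed form; it is not the DiL-piKL algorithm itself, and the function name and the example numbers are hypothetical.

```python
import math


def kl_regularized_policy(q_values, anchor, lam):
    """Return pi(a) proportional to anchor(a) * exp(q(a) / lam).

    q_values: estimated action values (assumed given).
    anchor:   probabilities of the imitation policy over the same actions.
    lam:      regularization strength; larger values stay closer to anchor.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if len(q_values) != len(anchor) or not any(p > 0 for p in anchor):
        raise ValueError("anchor must match q_values and have positive mass")
    # Work in log space and subtract the maximum for numerical stability.
    logits = [
        math.log(p) + q / lam if p > 0 else float("-inf")
        for q, p in zip(q_values, anchor)
    ]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    q = [1.0, 0.0]        # action 0 has the higher estimated value
    human = [0.1, 0.9]    # but humans rarely choose it
    for lam in (100.0, 1.0, 0.01):
        print(lam, kl_regularized_policy(q, human, lam))
```

With a large `lam` the result stays near the anchor policy, and with a small `lam` it concentrates on the highest-value action, which is the trade-off between human-likeness and reward that the abstract describes.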

Neurosymbolic Motion and Task Planning for Linear Temporal Logic Tasks

Authors:Xiaowu Sun, Yasser Shoukry
Date:2022-10-11 06:33:58

This paper presents a neurosymbolic framework to solve motion planning problems for mobile robots involving temporal goals. The temporal goals are described using temporal logic formulas such as Linear Temporal Logic (LTL) to capture complex tasks. The proposed framework trains Neural Network (NN)-based planners that enjoy strong correctness guarantees when applied to unseen tasks, i.e., the exact task (including workspace, LTL formula, and dynamic constraints of a robot) is unknown during the training of NNs. Our approach to achieving theoretical guarantees and computational efficiency is based on two insights. First, we incorporate a symbolic model into the training of NNs such that the resulting NN-based planner inherits the interpretability and correctness guarantees of the symbolic model. Moreover, the symbolic model serves as a discrete "memory", which is necessary for satisfying temporal logic formulas. Second, we train a library of neural networks offline and combine a subset of the trained NNs into a single NN-based planner at runtime when a task is revealed. In particular, we develop a novel constrained NN training procedure, named formal NN training, to enforce that each neural network in the library represents a "symbol" in the symbolic model. As a result, our neurosymbolic framework enjoys the scalability and flexibility benefits of machine learning and inherits the provable guarantees from control-theoretic and formal-methods techniques. We demonstrate the effectiveness of our framework in both simulations and on an actual robotic vehicle, and show that our framework can generalize to unknown tasks where state-of-the-art meta-reinforcement learning techniques fail.

Simulating Coverage Path Planning with Roomba

Authors:Robert Chuchro
Date:2022-10-10 19:50:44

Coverage Path Planning involves visiting every unoccupied state in an environment with obstacles. In this paper, we explore this problem in environments which are initially unknown to the agent, for purposes of simulating the task of a vacuum cleaning robot. A survey of prior work reveals sparse effort in applying learning to solve this problem. In this paper, we explore modeling a Coverage Path Planning problem using Deep Reinforcement Learning, and compare it with the performance of the built-in algorithm of the Roomba, a popular vacuum cleaning robot.

Exploration Policies for On-the-Fly Controller Synthesis: A Reinforcement Learning Approach

Authors:Tomás Delgado, Marco Sánchez Sorondo, Víctor Braberman, Sebastián Uchitel
Date:2022-10-07 20:28:25

Controller synthesis is in essence a case of model-based planning for non-deterministic environments in which plans (actually "strategies") are meant to preserve system goals indefinitely. In the case of supervisory control, environments are specified as the parallel composition of state machines and valid strategies are required to be "non-blocking" (i.e., always enabling the environment to reach certain marked states) in addition to safe (i.e., keep the system within a safe zone). Recently, On-the-fly Directed Controller Synthesis techniques were proposed to avoid the exploration of the entire (and exponentially large) environment space, at the cost of non-maximal permissiveness, to either find a strategy or conclude that there is none. The incremental exploration of the plant is currently guided by a domain-independent human-designed heuristic. In this work, we propose a new method for obtaining heuristics based on Reinforcement Learning (RL). The synthesis algorithm is thus framed as an RL task with an unbounded action space and a modified version of DQN is used. With a simple and general set of features that abstracts both states and actions, we show that it is possible to learn heuristics on small versions of a problem that generalize to the larger instances, effectively doing zero-shot policy transfer. Our agents learn from scratch in a highly partially observable RL task and outperform the existing heuristic overall, in instances unseen during training.

Inferring Smooth Control: Monte Carlo Posterior Policy Iteration with Gaussian Processes

Authors:Joe Watson, Jan Peters
Date:2022-10-07 12:56:31

Monte Carlo methods have become increasingly relevant for control of non-differentiable systems, approximate dynamics models and learning from data. These methods scale to high-dimensional spaces and are effective at the non-convex optimizations often seen in robot learning. We look at sample-based methods from the perspective of inference-based control, specifically posterior policy iteration. From this perspective, we highlight how Gaussian noise priors produce rough control actions that are unsuitable for physical robot deployment. Considering smoother Gaussian process priors, as used in episodic reinforcement learning and motion planning, we demonstrate how smoother model predictive control can be achieved using online sequential inference. This inference is realized through an efficient factorization of the action distribution and a novel means of optimizing the likelihood temperature to improve importance sampling accuracy. We evaluate this approach on several high-dimensional robot control tasks, matching the sample efficiency of prior heuristic methods while also ensuring smoothness. Simulation results can be seen at https://monte-carlo-ppi.github.io/.

Exploration via Planning for Information about the Optimal Trajectory

Authors:Viraj Mehta, Ian Char, Joseph Abbate, Rory Conlin, Mark D. Boyer, Stefano Ermon, Jeff Schneider, Willie Neiswanger
Date:2022-10-06 20:28:55

Many potential applications of reinforcement learning (RL) are stymied by the large numbers of samples required to learn an effective policy. This is especially true when applying RL to real-world control tasks, e.g. in the sciences or robotics, where executing a policy in the environment is costly. In popular RL algorithms, agents typically explore either by adding stochasticity to a reward-maximizing policy or by attempting to gather maximal information about environment dynamics without taking the given task into account. In this work, we develop a method that allows us to plan for exploration while taking both the task and the current knowledge about the dynamics into account. The key insight to our approach is to plan an action sequence that maximizes the expected information gain about the optimal trajectory for the task at hand. We demonstrate that our method learns strong policies with 2x fewer samples than strong exploration baselines and 200x fewer samples than model-free methods on a diverse set of low-to-medium dimensional control tasks in both the open-loop and closed-loop control settings.

Neuroevolution is a Competitive Alternative to Reinforcement Learning for Skill Discovery

Authors:Felix Chalumeau, Raphael Boige, Bryan Lim, Valentin Macé, Maxime Allard, Arthur Flajolet, Antoine Cully, Thomas Pierrot
Date:2022-10-06 11:06:39

Deep Reinforcement Learning (RL) has emerged as a powerful paradigm for training neural policies to solve complex control tasks. However, these policies tend to be overfit to the exact specifications of the task and environment they were trained on, and thus do not perform well when conditions deviate slightly or when composed hierarchically to solve even more complex tasks. Recent work has shown that training a mixture of policies, as opposed to a single one, that are driven to explore different regions of the state-action space can address this shortcoming by generating a diverse set of behaviors, referred to as skills, that can be collectively used to great effect in adaptation tasks or for hierarchical planning. This is typically realized by including a diversity term - often derived from information theory - in the objective function optimized by RL. However, these approaches often require careful hyperparameter tuning to be effective. In this work, we demonstrate that less widely-used neuroevolution methods, specifically Quality Diversity (QD), are a competitive alternative to information-theory-augmented RL for skill discovery. Through an extensive empirical evaluation comparing eight state-of-the-art algorithms (four flagship algorithms from each line of work) on the basis of (i) metrics directly evaluating the skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii) the skills' performance when used as primitives for hierarchical planning, QD methods are found to provide equal, and sometimes improved, performance whilst being less sensitive to hyperparameters and more scalable. As no single method is found to provide near-optimal performance across all environments, there is a rich scope for further research which we support by proposing future directions and providing optimized open-source implementations.

ReAct: Synergizing Reasoning and Acting in Language Models

Authors:Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
Date:2022-10-06 01:00:32

While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
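The interleaving of reasoning traces and actions described above can be pictured as a simple control loop: the model emits a thought and an action, the environment returns an observation, and the growing transcript is fed back to the model. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `react_loop`, `make_scripted_model`, the `Action: name[argument]` text format as parsed here, and the canned tool are all hypothetical, and a scripted stand-in replaces the language model so that the example runs offline.

```python
import re

# Matches a line such as "Action: search[some query]".
ACTION_PATTERN = re.compile(r"Action:\s*(\w+)\[(.*?)\]\s*$", re.MULTILINE)


def react_loop(model, tools, question, max_steps=5):
    """Alternate model steps (thought + action) with tool observations.

    model: callable mapping the transcript so far to the next step text.
    tools: dict mapping action names to callables taking one string.
    Returns (answer, transcript); answer is None if no "finish" was issued.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step.rstrip("\n") + "\n"
        match = ACTION_PATTERN.search(step)
        if match is None:
            break  # no parseable action, so stop early
        name, argument = match.group(1), match.group(2)
        if name == "finish":
            return argument, transcript
        tool = tools.get(name)
        observation = tool(argument) if tool else f"Unknown action: {name}"
        transcript += f"Observation: {observation}\n"
    return None, transcript


def make_scripted_model(steps):
    """Stand-in for an LLM: replays a fixed list of steps (assumption)."""
    remaining = iter(steps)
    return lambda transcript: next(remaining)


if __name__ == "__main__":
    model = make_scripted_model([
        "Thought: I need to find the year.\nAction: search[ReAct paper]",
        "Thought: The observation states the year.\nAction: finish[2022]",
    ])
    tools = {"search": lambda query: "ReAct was posted to arXiv in 2022."}
    answer, transcript = react_loop(model, tools, "When was ReAct posted?")
    print(transcript)
    print("Answer:", answer)
```

Keeping the whole transcript as the model input is what lets later thoughts refer to earlier observations; swapping the scripted model for a real one would only change the `model` callable.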

Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration and Planning

Authors:Reda Ouhamma, Debabrota Basu, Odalric-Ambrym Maillard
Date:2022-10-05 08:26:49

We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of $\tilde{\mathcal{O}}(\sqrt{d^3H^3K})$, where $d$ is the dimension of the parameters, $H$ is the episode length, and $K$ is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by $\sqrt{H}$ and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to $H$ and $K$.

Learning Perception-Aware Agile Flight in Cluttered Environments

Authors:Yunlong Song, Kexin Shi, Robert Penicka, Davide Scaramuzza
Date:2022-10-04 18:18:58

Recently, neural control policies have outperformed existing model-based planning-and-control methods for autonomously navigating quadrotors through cluttered environments in minimum time. However, they are not perception aware, a crucial requirement in vision-based navigation due to the camera's limited field of view and the underactuated nature of a quadrotor. We propose a learning-based system that achieves perception-aware, agile flight in cluttered environments. Our method combines imitation learning with reinforcement learning (RL) by leveraging a privileged learning-by-cheating framework. Using RL, we first train a perception-aware teacher policy with full-state information to fly in minimum time through cluttered environments. Then, we use imitation learning to distill its knowledge into a vision-based student policy that only perceives the environment via a camera. Our approach tightly couples perception and control, showing a significant advantage in computation speed (10 times faster) and success rate. We demonstrate the closed-loop control performance using hardware-in-the-loop simulation.

Learning Minimally-Violating Continuous Control for Infeasible Linear Temporal Logic Specifications

Authors:Mingyu Cai, Makai Mann, Zachary Serlin, Kevin Leahy, Cristian-Ioan Vasile
Date:2022-10-03 18:32:20

This paper explores continuous-time control synthesis for target-driven navigation to satisfy complex high-level tasks expressed as linear temporal logic (LTL). We propose a model-free framework using deep reinforcement learning (DRL) where the underlying dynamic system is unknown (an opaque box). Unlike prior work, this paper considers scenarios where the given LTL specification might be infeasible and therefore cannot be accomplished globally. Instead of modifying the given LTL formula, we provide a general DRL-based approach to satisfy it with minimal violation. To do this, we transform a previously multi-objective DRL problem, which requires simultaneous automata satisfaction and minimum violation cost, into a single objective. By guiding the DRL agent with a sampling-based path planning algorithm for the potentially infeasible LTL task, the proposed approach mitigates the myopic tendencies of DRL, which are often an issue when learning general LTL tasks that can have long or infinite horizons. This is achieved by decomposing an infeasible LTL formula into several reach-avoid sub-tasks with shorter horizons, which can be trained in a modular DRL architecture. Furthermore, we overcome the challenge of the exploration process for DRL in complex and cluttered environments by using path planners to design rewards that are dense in the configuration space. The benefits of the presented approach are demonstrated through testing on various complex nonlinear systems and compared with state-of-the-art baselines. The video demonstration can be found here: https://youtu.be/jBhx6Nv224E.

Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation

Authors:Dan Qiao, Yu-Xiang Wang
Date:2022-10-03 03:48:26

We study the problem of deployment efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.

Occlusion-Aware Crowd Navigation Using People as Sensors

Authors:Ye-Ji Mun, Masha Itkina, Shuijing Liu, Katherine Driggs-Campbell
Date:2022-10-02 15:18:32

Autonomous navigation in crowded spaces poses a challenge for mobile robots due to the highly dynamic, partially observable environment. Occlusions are highly prevalent in such settings due to a limited sensor field of view and obstructing human agents. Previous work has shown that observed interactive behaviors of human agents can be used to estimate potential obstacles despite occlusions. We propose integrating such social inference techniques into the planning pipeline. We use a variational autoencoder with a specially designed loss function to learn representations that are meaningful for occlusion inference. This work adopts a deep reinforcement learning approach to incorporate the learned representation for occlusion-aware planning. In simulation, our occlusion-aware policy achieves comparable collision avoidance performance to fully observable navigation by estimating agents in occluded spaces. We demonstrate successful policy transfer from simulation to the real-world Turtlebot 2i. To the best of our knowledge, this work is the first to use social occlusion inference for crowd navigation.

Visuo-Tactile Transformers for Manipulation

Authors:Yizhou Chen, Andrea Sipos, Mark Van der Merwe, Nima Fazeli
Date:2022-09-30 22:38:29

Learning representations in the joint domain of vision and touch can improve manipulation dexterity, robustness, and sample-complexity by exploiting mutual information and complementary cues. Here, we present Visuo-Tactile Transformers (VTTs), a novel multimodal representation learning approach suited for model-based reinforcement learning and planning. Our approach extends the Visual Transformer (Dosovitskiy et al., 2021) to handle visuo-tactile feedback. Specifically, VTT uses tactile feedback together with self and cross-modal attention to build latent heatmap representations that focus attention on important task features in the visual domain. We demonstrate the efficacy of VTT for representation learning with a comparative evaluation against baselines on four simulated robot tasks and one real world block pushing task. We conduct an ablation study over the components of VTT to highlight the importance of cross-modality in representation learning.

A Benchmark Comparison of Imitation Learning-based Control Policies for Autonomous Racing

Authors:Xiatao Sun, Mingyan Zhou, Zhijun Zhuang, Shuo Yang, Johannes Betz, Rahul Mangharam
Date:2022-09-29 19:49:13

Autonomous racing with scaled race cars has gained increasing attention as an effective approach for developing perception, planning and control algorithms for safe autonomous driving at the limits of the vehicle's handling. To train agile control policies for autonomous racing, learning-based approaches largely utilize reinforcement learning, albeit with mixed results. In this study, we benchmark a variety of imitation learning policies for racing vehicles that are applied directly or for bootstrapping reinforcement learning both in simulation and on scaled real-world environments. We show that interactive imitation learning techniques outperform traditional imitation learning methods and can greatly improve the performance of reinforcement learning policies by bootstrapping thanks to their better sample efficiency. Our benchmarks provide a foundation for future research on autonomous racing using Imitation Learning and Reinforcement Learning.

Does Zero-Shot Reinforcement Learning Exist?

Authors:Ahmed Touati, Jérémy Rapin, Yann Ollivier
Date:2022-09-29 16:54:05

A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards "controllable" agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) [BBQ+18] or forward-backward (FB) representations [TO21], but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark [LYL+21]. To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrix, contrastive learning, or diversity (APS), perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.

Learning Parsimonious Dynamics for Generalization in Reinforcement Learning

Authors:Tankred Saanum, Eric Schulz
Date:2022-09-29 13:45:34

Humans are skillful navigators: We aptly maneuver through new places, realize when we are back at a location we have seen before, and can even conceive of shortcuts that go through parts of our environments we have never visited. Current methods in model-based reinforcement learning on the other hand struggle with generalizing about environment dynamics out of the training distribution. We argue that two principles can help bridge this gap: latent learning and parsimonious dynamics. Humans tend to think about environment dynamics in simple terms -- we reason about trajectories not in reference to what we expect to see along a path, but rather in an abstract latent space, containing information about the places' spatial coordinates. Moreover, we assume that moving around in novel parts of our environment works the same way as in parts we are familiar with. These two principles work together in tandem: it is in the latent space that the dynamics show parsimonious characteristics. We develop a model that learns such parsimonious dynamics. Using a variational objective, our model is trained to reconstruct experienced transitions in a latent space using locally linear transformations, while encouraged to invoke as few distinct transformations as possible. Using our framework, we demonstrate the utility of learning parsimonious latent dynamics models in a range of policy learning and planning tasks.

Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling

Authors:Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, Jun Zhu
Date:2022-09-29 04:36:23

In offline reinforcement learning, weighted regression is a common method to ensure the learned policy stays close to the behavior policy and to prevent selecting out-of-sample actions. In this work, we show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training, which deviates from their initial motivation. To address this problem, we adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model. The key insight is that such decoupling avoids learning an explicitly parameterized policy model with a closed-form expression. Directly learning the behavior policy allows us to leverage existing advances in generative modeling, such as diffusion-based methods, to model diverse behaviors. As for action evaluation, we combine our method with an in-sample planning technique to further avoid selecting out-of-sample actions and increase computational efficiency. Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail.

Zero-Shot Retargeting of Learned Quadruped Locomotion Policies Using Hybrid Kinodynamic Model Predictive Control

Authors:He Li, Tingnan Zhang, Wenhao Yu, Patrick M. Wensing
Date:2022-09-28 14:21:31

Reinforcement Learning (RL) has witnessed great strides for quadruped locomotion, with continued progress in the reliable sim-to-real transfer of policies. However, it remains a challenge to reuse a policy on another robot, which could save time for retraining. In this work, we present a framework for zero-shot policy retargeting wherein diverse motor skills can be transferred between robots of different shapes and sizes. The new framework centers on a planning-and-control pipeline that systematically integrates RL and Model Predictive Control (MPC). The planning stage employs RL to generate a dynamically plausible trajectory as well as the contact schedule, avoiding the combinatorial complexity of contact sequence optimization. This information is then used to seed the MPC to stabilize and robustify the policy roll-out via a new Hybrid Kinodynamic (HKD) model that implicitly optimizes the foothold locations. Hardware results show an ability to transfer policies from both the A1 and Laikago robots to the MIT Mini Cheetah robot without requiring any policy re-tuning.

Exploiting Transformer in Sparse Reward Reinforcement Learning for Interpretable Temporal Logic Motion Planning

Authors:Hao Zhang, Hao Wang, Zhen Kan
Date:2022-09-27 07:41:11

Automaton-based approaches have enabled robots to perform various complex tasks. However, most existing automaton-based algorithms rely heavily on manually customized state representations for the considered task, limiting their applicability in deep reinforcement learning algorithms. To address this issue, by incorporating Transformer into reinforcement learning, we develop a Double-Transformer-guided Temporal Logic framework (T2TL) that exploits the structural feature of Transformer twice, i.e., first encoding the LTL instruction via the Transformer module for efficient understanding of task instructions during the training and then encoding the context variable via the Transformer again for improved task performance. Particularly, the LTL instruction is specified by co-safe LTL. As a semantics-preserving rewriting operation, LTL progression is exploited to decompose the complex task into learnable sub-goals, which not only converts non-Markovian reward decision processes to Markovian ones, but also improves the sampling efficiency by simultaneous learning of multiple sub-tasks. An environment-agnostic LTL pre-training scheme is further incorporated to facilitate the learning of the Transformer module resulting in an improved representation of LTL. The simulation results demonstrate the effectiveness of the T2TL framework.
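
As a rough illustration of the LTL progression step mentioned above, the following is a minimal sketch, not the T2TL implementation. It assumes formulas are encoded as nested Python tuples in negation normal form; the encoding, the helper names (`progress`, `_and`, `_or`) and the example task are hypothetical and chosen only for illustration. Progression rewrites the formula after each observed set of true propositions, so the remaining formula acts as the current sub-goal.

```python
# Formulas (hypothetical encoding): True, False, ("ap", name), ("not", ("ap", name)),
# ("and", f, g), ("or", f, g), ("next", f), ("until", f, g), ("eventually", f).

def _and(f, g):
    """Conjunction with constant simplification."""
    if f is False or g is False:
        return False
    if f is True:
        return g
    if g is True:
        return f
    return ("and", f, g)

def _or(f, g):
    """Disjunction with constant simplification."""
    if f is True or g is True:
        return True
    if f is False:
        return g
    if g is False:
        return f
    return ("or", f, g)

def progress(formula, labels):
    """Rewrite `formula` given the set of propositions true at the current step."""
    if formula is True or formula is False:
        return formula
    op = formula[0]
    if op == "ap":
        return formula[1] in labels
    if op == "not":  # assumed to wrap an atomic proposition only
        return not progress(formula[1], labels)
    if op == "and":
        return _and(progress(formula[1], labels), progress(formula[2], labels))
    if op == "or":
        return _or(progress(formula[1], labels), progress(formula[2], labels))
    if op == "next":
        return formula[1]
    if op == "until":  # f U g  ->  prog(g) or (prog(f) and f U g)
        return _or(progress(formula[2], labels),
                   _and(progress(formula[1], labels), formula))
    if op == "eventually":  # F f  ->  prog(f) or F f
        return _or(progress(formula[1], labels), formula)
    raise ValueError(f"unknown operator: {op}")

# Example task: eventually reach a, and after that eventually reach b.
task = ("eventually", ("and", ("ap", "a"), ("eventually", ("ap", "b"))))
after_a = progress(task, {"a"})      # remaining formula once a has been observed
done = progress(after_a, {"b"})      # observing b afterwards satisfies the task
```

In this sketch the task is satisfied once the formula progresses to `True`, which is the point where a sparse task reward could be issued.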

Advanced Skills by Learning Locomotion and Local Navigation End-to-End

Authors:Nikita Rudin, David Hoeller, Marko Bjelonic, Marco Hutter
Date:2022-09-26 16:35:00

The common approach for local navigation in challenging environments with legged robots requires path planning, path following and locomotion, which usually requires a locomotion control policy that accurately tracks a commanded velocity. However, by breaking down the navigation problem into these sub-tasks, we limit the robot's capabilities since the individual tasks do not consider the full solution space. In this work, we propose to solve the complete problem by training an end-to-end policy with deep reinforcement learning. Instead of continuously tracking a precomputed path, the robot needs to reach a target position within a provided time. The task's success is only evaluated at the end of an episode, meaning that the policy does not need to reach the target as fast as possible. It is free to select its path and the locomotion gait. Training a policy in this way opens up a larger set of possible solutions, which allows the robot to learn more complex behaviors. We compare our approach to velocity tracking and additionally show that the time dependence of the task reward is critical to successfully learn these new behaviors. Finally, we demonstrate the successful deployment of policies on a real quadrupedal robot. The robot is able to cross challenging terrains, which were not possible previously, while using a more energy-efficient gait and achieving a higher success rate.

SAFER: Safe Collision Avoidance using Focused and Efficient Trajectory Search with Reinforcement Learning

Authors:Mario Srouji, Hugues Thomas, Hubert Tsai, Ali Farhadi, Jian Zhang
Date:2022-09-23 18:08:08

Collision avoidance is key for mobile robots and agents to operate safely in the real world. In this work we present SAFER, an efficient and effective collision avoidance system that is able to improve safety by correcting the control commands sent by an operator. It combines real-world reinforcement learning (RL), search-based online trajectory planning, and automatic emergency intervention, e.g. automatic emergency braking (AEB). The goal of the RL is to learn an effective corrective control action that is used in a focused search for collision-free trajectories, and to reduce the frequency of triggering automatic emergency braking. This novel setup enables the RL policy to learn safely and directly on mobile robots in a real-world indoor environment, minimizing actual crashes even during training. Our real-world experiments show that, when compared with several baselines, our approach enjoys a higher average speed, lower crash rate, less emergency intervention, smaller computation overhead, and smoother overall control.

Learning Agile Flight Maneuvers: Deep SE(3) Motion Planning and Control for Quadrotors

Authors:Yixiao Wang, Bingheng Wang, Shenning Zhang, Han Wei Sia, Lin Zhao
Date:2022-09-22 15:36:08

Agile flights of autonomous quadrotors in cluttered environments require constrained motion planning and control subject to translational and rotational dynamics. Traditional model-based methods typically demand complicated design and heavy computation. In this paper, we develop a novel deep reinforcement learning-based method that tackles the challenging task of flying through a dynamic narrow gate. We design a model predictive controller with its adaptive tracking references parameterized by a deep neural network (DNN). These references include the traversal time and the quadrotor SE(3) traversal pose that encourage the robot to fly through the gate with maximum safety margins from various initial conditions. To cope with the difficulty of training in highly dynamic environments, we develop a reinforce-imitate learning framework to train the DNN efficiently that generalizes well to diverse settings. Furthermore, we propose a binary search algorithm that allows online adaption of the SE(3) references to dynamic gates in real-time. Finally, through extensive high-fidelity simulations, we show that our approach is robust to the gate's velocity uncertainties and adaptive to different gate trajectories and orientations.

Graph Value Iteration

Authors:Dieqiao Feng, Carla P. Gomes, Bart Selman
Date:2022-09-20 10:45:03

In recent years, deep Reinforcement Learning (RL) has been successful in various combinatorial search domains, such as two-player games and scientific discovery. However, directly applying deep RL in planning domains is still challenging. One major difficulty is that without a human-crafted heuristic function, reward signals remain zero unless the learning framework discovers any solution plan. Search space becomes exponentially larger as the minimum length of plans grows, which is a serious limitation for planning instances with a minimum plan length of hundreds to thousands of steps. Previous learning frameworks that augment graph search with deep neural networks and extra generated subgoals have achieved success in various challenging planning domains. However, generating useful subgoals requires extensive domain knowledge. We propose a domain-independent method that augments graph search with graph value iteration to solve hard planning instances that are out of reach for domain-specialized solvers. In particular, instead of receiving learning signals only from discovered plans, our approach also learns from failed search attempts where no goal state has been reached. The graph value iteration component can exploit the graph structure of local search space and provide more informative learning signals. We also show how we use a curriculum strategy to smooth the learning process and perform a full analysis of how graph value iteration scales and enables learning.

Active Predicting Coding: Brain-Inspired Reinforcement Learning for Sparse Reward Robotic Control Problems

Authors:Alexander Ororbia, Ankur Mali
Date:2022-09-19 16:49:32

In this article, we propose a backpropagation-free approach to robotic control through the neuro-cognitive computational framework of neural generative coding (NGC), designing an agent built completely from powerful predictive coding/processing circuits that facilitate dynamic, online learning from sparse rewards, embodying the principles of planning-as-inference. Concretely, we craft an adaptive agent system, which we call active predictive coding (ActPC), that balances an internally-generated epistemic signal (meant to encourage intelligent exploration) with an internally-generated instrumental signal (meant to encourage goal-seeking behavior) to ultimately learn how to control various simulated robotic systems as well as a complex robotic arm using a realistic robotics simulator, i.e., the Surreal Robotics Suite, for the block lifting task and can pick-and-place problems. Notably, our experimental results demonstrate that our proposed ActPC agent performs well in the face of sparse (extrinsic) reward signals and is competitive with or outperforms several powerful backprop-based RL approaches.

Latent Plans for Task-Agnostic Offline Reinforcement Learning

Authors:Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, Wolfram Burgard
Date:2022-09-19 12:27:15

Long-horizon everyday tasks that comprise a sequence of multiple implicit subtasks still pose a major challenge in offline robot control. While a number of prior methods aimed to address this setting with variants of imitation and offline reinforcement learning, the learned behavior is typically narrow and often struggles to reach configurable long-horizon goals. As both paradigms have complementary strengths and weaknesses, we propose a novel hierarchical approach that combines the strengths of both methods to learn task-agnostic long-horizon policies from high-dimensional camera observations. Concretely, we combine a low-level policy that learns latent skills via imitation learning and a high-level policy learned from offline reinforcement learning for skill-chaining the latent behavior priors. Experiments in various simulated and real robot control tasks show that our formulation enables producing previously unseen combinations of skills to reach temporally extended goals by "stitching" together latent skills through goal chaining with an order-of-magnitude improvement in performance upon state-of-the-art baselines. We even learn one multi-task visuomotor policy for 25 distinct manipulation tasks in the real world which outperforms both imitation learning and offline reinforcement learning techniques.

Unguided Self-exploration in Narrow Spaces with Safety Region Enhanced Reinforcement Learning for Ackermann-steering Robots

Authors:Zhaofeng Tian, Zichuan Liu, Xingyu Zhou, Weisong Shi
Date:2022-09-17 15:20:02

In narrow spaces, motion planning based on the traditional hierarchical autonomous system could cause collisions due to mapping, localization, and control noises, especially for car-like Ackermann-steering robots which suffer from non-convex and non-holonomic kinematics. To tackle these problems, we leverage deep reinforcement learning which is verified to be effective in self-decision-making, to self-explore in narrow spaces without a given map and destination while avoiding collisions. Specifically, based on our Ackermann-steering rectangular-shaped ZebraT robot and its Gazebo simulator, we propose the rectangular safety region to represent states and detect collisions for rectangular-shaped robots, and a carefully crafted reward function for reinforcement learning that does not require the waypoint guidance. For validation, the robot was first trained in a simulated narrow track. Then, the well-trained model was transferred to other simulation tracks and could outperform other traditional methods including classical and learning methods. Finally, the trained model is demonstrated in the real world with our ZebraT robot.

Technical Report for Trend Prediction Based Intelligent UAV Trajectory Planning for Large-scale Dynamic Scenarios

Authors:Jinjing Wang, Xindi Wang
Date:2022-09-17 03:46:23

Unmanned aerial vehicle (UAV)-enabled communication technology is regarded as an efficient and effective solution for some special application scenarios where existing terrestrial infrastructures are too overloaded to provide reliable services. To maximize the utility of the UAV-enabled system while meeting the QoS and energy constraints, the UAV needs to plan its trajectory considering the dynamic characteristics of scenarios, which is formulated as a Markov Decision Process (MDP). To solve the above problem, a deep reinforcement learning (DRL)-based scheme is proposed here, which predicts the trend of the dynamic scenarios to provide a long-term view for the UAV trajectory planning. Simulation results validate that our proposed scheme converges more quickly and achieves better performance in dynamic scenarios.

Value Summation: A Novel Scoring Function for MPC-based Model-based Reinforcement Learning

Authors:Mehran Raisi, Amirhossein Noohian, Luc Mccutcheon, Saber Fallah
Date:2022-09-16 20:52:39

This paper proposes a novel scoring function for the planning module of MPC-based reinforcement learning methods to address the inherent bias of using the reward function to score trajectories. The proposed method enhances the learning efficiency of existing MPC-based MBRL methods using the discounted sum of values. The method utilizes optimal trajectories to guide policy learning and updates its state-action value function based on real-world and augmented onboard data. The learning efficiency of the proposed method is evaluated in selected MuJoCo Gym environments as well as in learning locomotion skills for a simulated model of the Cassie robot. The results demonstrate that the proposed method outperforms the current state-of-the-art algorithms in terms of learning efficiency and average reward return.
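
To make the idea of scoring candidate trajectories by a discounted sum of values concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not give the exact scoring formula, so the form below (sum of gamma^t times V(s_t) over the planned states), the function names, and the toy value function are all assumptions for illustration.

```python
from typing import Callable, Sequence

def reward_sum_score(rewards: Sequence[float], gamma: float) -> float:
    """Conventional MPC score: discounted sum of predicted rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def value_sum_score(states: Sequence[float],
                    value_fn: Callable[[float], float],
                    gamma: float) -> float:
    """Assumed alternative score: discounted sum of state values along the plan."""
    return sum(gamma ** t * value_fn(s) for t, s in enumerate(states))

def pick_best(candidates: Sequence[Sequence[float]],
              value_fn: Callable[[float], float],
              gamma: float) -> int:
    """Return the index of the candidate trajectory with the highest value-sum score."""
    return max(range(len(candidates)),
               key=lambda i: value_sum_score(candidates[i], value_fn, gamma))

# Hypothetical 1-D example: the value is higher the closer a state is to 3.0.
toy_value = lambda s: -abs(s - 3.0)
candidates = [[0.0, 1.0, 2.0],   # moves toward the goal
              [0.0, 0.0, 0.0]]   # stays put
best = pick_best(candidates, toy_value, gamma=0.9)
```

In a full planner the candidate trajectories would come from rolling out a learned dynamics model, and `value_fn` would be a learned critic rather than a hand-written function.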

A Learning-Based Trajectory Planning of Multiple UAVs for AoI Minimization in IoT Networks

Authors:Eslam Eldeeb, Dian Echevarría Pérez, Jean Michel de Souza Sant'Ana, Mohammad Shehab, Nurul Huda Mahmood, Hirley Alves, Matti Latva-aho
Date:2022-09-13 12:39:23

Many emerging Internet of Things (IoT) applications rely on information collected by sensor nodes where the freshness of information is an important criterion. Age of Information (AoI) is a metric that quantifies information timeliness, i.e., the freshness of the received information or status update. This work considers a setup of deployed sensors in an IoT network, where multiple unmanned aerial vehicles (UAVs) serve as mobile relay nodes between the sensors and the base station. We formulate an optimization problem to jointly plan the UAVs' trajectory, while minimizing the AoI of the received messages. This ensures that the received information at the base station is as fresh as possible. The complex optimization problem is efficiently solved using a deep reinforcement learning (DRL) algorithm. In particular, we propose a deep Q-network, which works as a function approximation to estimate the state-action value function. The proposed scheme is quick to converge and results in a lower AoI than the random walk scheme. Our proposed algorithm reduces the average age by approximately 25% and requires down to 50% less energy when compared to the baseline scheme.
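
The AoI metric itself can be illustrated with a short sketch. This is a generic discrete-time definition for a single sensor, not the paper's system model: the slot-based timing, the `age_trace` helper, and the example delivery schedule are assumptions made for illustration. The age at slot t is t minus the generation time of the freshest update received so far.

```python
def age_trace(horizon, deliveries):
    """Return the age of information at the receiver for slots 1..horizon.

    deliveries maps a delivery slot to the generation slot of the update
    that arrives in that slot (hypothetical input format).
    """
    freshest = 0  # generation time of the freshest update received so far
    trace = []
    for t in range(1, horizon + 1):
        if t in deliveries:
            freshest = max(freshest, deliveries[t])
        trace.append(t - freshest)
    return trace

# An update generated in slot 2 arrives in slot 3; another, generated in
# slot 5, arrives immediately in slot 5.
trace = age_trace(6, {3: 2, 5: 5})
average_age = sum(trace) / len(trace)
```

An RL agent minimizing AoI would typically receive the negative age (summed over sensors) as its per-step reward.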

Model-based Reinforcement Learning with Multi-step Plan Value Estimation

Authors:Haoxin Lin, Yihao Sun, Jiaji Zhang, Yang Yu
Date:2022-09-12 18:22:11

A promising way to improve the sample efficiency of reinforcement learning is model-based methods, in which many explorations and evaluations can happen in the learned models to save real-world samples. However, when the learned model has a non-negligible model error, sequential steps in the model are hard to evaluate accurately, limiting the model's utilization. This paper proposes to alleviate this issue by introducing multi-step plans to replace multi-step actions for model-based RL. We employ the multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of action plans at a given state, and updates the policy by directly computing the multi-step policy gradient via plan value estimation. The new model-based reinforcement learning algorithm MPPVE (Model-based Planning Policy Learning with Multi-step Plan Value Estimation) shows a better utilization of the learned model and achieves a better sample efficiency than state-of-the-art model-based RL approaches.

Route Planning for Last-Mile Deliveries Using Mobile Parcel Lockers: A Hybrid Q-Learning Network Approach

Authors:Yubin Liu, Qiming Ye, Jose Escribano-Macias, Yuxiang Feng, Eduardo Candela, Panagiotis Angeloudis
Date:2022-09-09 11:59:42

Mobile parcel lockers have been recently proposed by logistics operators as a technology that could help reduce traffic congestion and operational costs in urban freight distribution. Given their ability to relocate throughout their area of deployment, they hold the potential to improve customer accessibility and convenience. In this study, we formulate the Mobile Parcel Locker Problem (MPLP), a special case of the Location-Routing Problem (LRP) which determines the optimal stopover location for MPLs throughout the day and plans corresponding delivery routes. A Hybrid Q Learning Network based Method (HQM) is developed to resolve the computational complexity of the resulting large problem instances while escaping local optima. In addition, the HQM is integrated with global and local search mechanisms to resolve the dilemma of exploration and exploitation faced by classic reinforcement learning methods. We examine the performance of HQM under different problem sizes (up to 200 nodes) and benchmark it against the exact approach and Genetic Algorithm (GA). Our results indicate that HQM achieves better optimisation performance with shorter computation time than the exact approach solved by the Gurobi solver in large problem instances. Additionally, the average reward obtained by HQM is 1.96 times greater than GA, which demonstrates that HQM has a better optimisation ability. Further, we identify critical factors that contribute to fleet size requirements, travel distances, and service delays. Our findings outline that the efficiency of MPLs is mainly contingent on the length of time windows and the deployment of MPL stopovers. Finally, we highlight managerial implications based on parametric analysis to provide guidance for logistics operators in the context of efficient last-mile distribution operations.

Task-Agnostic Learning to Accomplish New Tasks

Authors:Xianqi Zhang, Xingtao Wang, Xu Liu, Wenrui Wang, Xiaopeng Fan, Debin Zhao
Date:2022-09-09 03:02:49

Reinforcement Learning (RL) and Imitation Learning (IL) have made great progress in robotic control in recent years. However, these methods show obvious deterioration for new tasks that need to be completed through new combinations of actions. RL methods heavily rely on reward functions that cannot generalize well for new tasks, while IL methods are limited by expert demonstrations which do not cover new tasks. In contrast, humans can easily complete these tasks with the fragmented knowledge learned from task-agnostic experience. Inspired by this observation, this paper proposes a task-agnostic learning method (TAL for short) that can learn fragmented knowledge from task-agnostic data to accomplish new tasks. TAL consists of four stages. First, the task-agnostic exploration is performed to collect data from interactions with the environment. The collected data is organized via a knowledge graph. Compared with the previous sequential structure, the knowledge graph representation is more compact and fits better for environment exploration. Second, an action feature extractor is proposed and trained using the collected knowledge graph data for task-agnostic fragmented knowledge learning. Third, a candidate action generator is designed, which applies the action feature extractor on a new task to generate multiple candidate action sets. Finally, an action proposal is designed to produce the probabilities for actions in a new task according to the environmental information. The probabilities are then used to select actions to be executed from multiple candidate action sets to form the plan. Experiments on a virtual indoor scene show that the proposed method outperforms the state-of-the-art offline RL method CQL by 35.28% and the IL method BC by 22.22%.

Real-to-Sim: Predicting Residual Errors of Robotic Systems with Sparse Data using a Learning-based Unscented Kalman Filter

Authors:Alexander Schperberg, Yusuke Tanaka, Feng Xu, Marcel Menner, Dennis Hong
Date:2022-09-07 15:15:12

Achieving highly accurate dynamic or simulator models that are close to the real robot can facilitate model-based controls (e.g., model predictive control or linear-quadratic regulators), model-based trajectory planning (e.g., trajectory optimization), and decrease the amount of learning time necessary for reinforcement learning methods. Thus, the objective of this work is to learn the residual errors between a dynamic and/or simulator model and the real robot. This is achieved using a neural network, where the parameters of a neural network are updated through an Unscented Kalman Filter (UKF) formulation. Using this method, we model these residual errors with only small amounts of data -- a necessity as we improve the simulator/dynamic model by learning directly from real-world operation. We demonstrate our method on robotic hardware (e.g., manipulator arm, and a wheeled robot), and show that with the learned residual errors, we can further close the reality gap between dynamic models, simulations, and actual hardware.

Project proposal: A modular reinforcement learning based automated theorem prover

Authors:Boris Shminke
Date:2022-09-06 15:12:53

We propose to build a reinforcement learning prover of independent components: a deductive system (an environment), the proof state representation (how an agent sees the environment), and an agent training algorithm. To that purpose, we contribute an additional Vampire-based environment to the gym-saturation package of OpenAI Gym environments for saturation provers. We demonstrate a prototype of using gym-saturation together with a popular reinforcement learning framework (Ray RLlib). Finally, we discuss our plans for completing this work in progress to a competitive automated theorem prover.

Indoor Path Planning for Multiple Unmanned Aerial Vehicles via Curriculum Learning

Authors:Jongmin Park, Kwansik Park
Date:2022-09-05 06:07:42

Multi-agent reinforcement learning was performed in this study for indoor path planning of two unmanned aerial vehicles (UAVs). Each UAV performed the task of moving as fast as possible from a randomly paired initial position to a goal position in an environment with obstacles. To minimize training time and prevent the damage of UAVs, learning was performed by simulation. Considering the non-stationary characteristics of the multi-agent environment wherein the optimal behavior varies based on the actions of other agents, the action of the other UAV was also included in the state space of each UAV. Curriculum learning was performed in two stages to increase learning efficiency. A goal rate of 89.0% was obtained compared with other learning strategies that obtained goal rates of 73.6% and 79.9%.

Reinforcement Learning with Prior Policy Guidance for Motion Planning of Dual-Arm Free-Floating Space Robot

Authors:Yuxue Cao, Shengjie Wang, Xiang Zheng, Wenke Ma, Xinru Xie, Lei Liu
Date:2022-09-03 14:20:17

Reinforcement learning methods, as a promising technique, have achieved superior results in the motion planning of free-floating space robots. However, due to the increase in planning dimension and the intensification of system dynamics coupling, the motion planning of dual-arm free-floating space robots remains an open challenge. In particular, current studies cannot handle the task of capturing a non-cooperative object due to the lack of the pose constraint of the end-effectors. To address the problem, we propose a novel algorithm, EfficientLPT, to facilitate RL-based methods to improve planning accuracy efficiently. Our core contributions are constructing a mixed policy with prior knowledge guidance and introducing the infinity norm to build a more reasonable reward function. Furthermore, our method successfully captures a rotating object with different spinning speeds.

Inference and dynamic decision-making for deteriorating systems with probabilistic dependencies through Bayesian networks and deep reinforcement learning

Authors:Pablo G. Morato, Charalampos P. Andriotis, Konstantinos G. Papakonstantinou, Philippe Rigo
Date:2022-09-02 14:45:40

In the context of modern environmental and societal concerns, there is an increasing demand for methods able to identify management strategies for civil engineering systems, minimizing structural failure risks while optimally planning inspection and maintenance (I&M) processes. Most available methods simplify the I&M decision problem to the component level due to the computational complexity associated with global optimization methodologies under joint system-level state descriptions. In this paper, we propose an efficient algorithmic framework for inference and decision-making under uncertainty for engineering systems exposed to deteriorating environments, providing optimal management strategies directly at the system level. In our approach, the decision problem is formulated as a factored partially observable Markov decision process, whose dynamics are encoded in Bayesian network conditional structures. The methodology can handle environments under equal or general, unequal deterioration correlations among components, through Gaussian hierarchical structures and dynamic Bayesian networks. In terms of policy optimization, we adopt a deep decentralized multi-agent actor-critic (DDMAC) reinforcement learning approach, in which the policies are approximated by actor neural networks guided by a critic network. By including deterioration dependence in the simulated environment, and by formulating the cost model at the system level, DDMAC policies intrinsically consider the underlying system-effects. This is demonstrated through numerical experiments conducted for both a 9-out-of-10 system and a steel frame under fatigue deterioration. Results demonstrate that DDMAC policies offer substantial benefits when compared to state-of-the-art heuristic approaches. The inherent consideration of system-effects by DDMAC strategies is also interpreted based on the learned policies.

TarGF: Learning Target Gradient Field to Rearrange Objects without Explicit Goal Specification

Authors:Mingdong Wu, Fangwei Zhong, Yulong Xia, Hao Dong
Date:2022-09-02 07:20:34

Object Rearrangement is to move objects from an initial state to a goal state. Here, we focus on a more practical setting in object rearrangement, i.e., rearranging objects from shuffled layouts to a normative target distribution without explicit goal specification. However, it remains challenging for AI agents, as it is hard to describe the target distribution (goal specification) for reward engineering or collect expert trajectories as demonstrations. Hence, it is infeasible to directly employ reinforcement learning or imitation learning algorithms to address the task. This paper aims to search for a policy only with a set of examples from a target distribution instead of a handcrafted reward function. We employ the score-matching objective to train a Target Gradient Field (TarGF), indicating a direction on each object to increase the likelihood of the target distribution. For object rearrangement, the TarGF can be used in two ways: 1) For model-based planning, we can cast the target gradient into a reference control and output actions with a distributed path planner; 2) For model-free reinforcement learning, the TarGF is not only used for estimating the likelihood-change as a reward but also provides suggested actions in residual policy learning. Experimental results in ball and room rearrangement demonstrate that our method significantly outperforms the state-of-the-art methods in the quality of the terminal state, the efficiency of the control process, and scalability.

Neural Approaches to Co-Optimization in Robotics

Authors:Charles Schaff
Date:2022-09-01 16:49:22

Robots and intelligent systems that sense or interact with the world are increasingly being used to automate a wide array of tasks. The ability of these systems to complete these tasks depends on a large range of technologies such as the mechanical and electrical parts that make up the physical body of the robot and its sensors, perception algorithms to perceive the environment, and planning and control algorithms to produce meaningful actions. Therefore, it is often necessary to consider the interactions between these components when designing an embodied system. This thesis explores work on the task-driven co-optimization of robotics systems in an end-to-end manner, simultaneously optimizing the physical components of the system with inference or control algorithms directly for task performance. We start by considering the problem of optimizing a beacon-based localization system directly for localization accuracy. Designing such a system involves placing beacons throughout the environment and inferring location from sensor readings. In our work, we develop a deep learning approach to optimize both beacon placement and location inference directly for localization accuracy. We then turn our attention to the related problem of task-driven optimization of robots and their controllers. In our work, we start by proposing a data-efficient algorithm based on multi-task reinforcement learning. Our approach efficiently optimizes both physical design and control parameters directly for task performance by leveraging a design-conditioned controller capable of generalizing over the space of physical designs. We then follow this up with an extension to allow for the optimization over discrete morphological parameters such as the number and configuration of limbs. Finally, we conclude by exploring the fabrication and deployment of optimized soft robots.

Socially Fair Reinforcement Learning

Authors:Debmalya Mandal, Jiarui Gan
Date:2022-08-26 11:01:55

We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to different reward functions. Prior works have proposed different objectives that a fair policy must optimize, including minimum welfare and generalized Gini welfare. We first take an axiomatic view of the problem, and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that satisfies all four axioms, whereas prior objectives fail to satisfy all of them. We then consider the learning version of the problem where the underlying model, i.e., the Markov decision process, is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives -- minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound in regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of $O(H)$ for a weaker notion of regret.
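
The three fairness objectives named above can be compared on a toy example. The sketch below uses common textbook definitions, which are assumptions here and may differ in detail from the paper's: each agent's value is its expected return under a fixed policy, Nash social welfare is the product of the values, and generalized Gini welfare is a weighted sum with the largest weight on the worst-off agent. Function names and numbers are hypothetical.

```python
import math

def minimum_welfare(values):
    """Welfare of the worst-off agent."""
    return min(values)

def nash_social_welfare(values):
    """Product of the agents' (non-negative) values."""
    return math.prod(values)

def generalized_gini_welfare(values, weights):
    """Weighted sum of values sorted in ascending order.

    weights are assumed non-increasing, so the worst-off agent
    receives the largest weight.
    """
    return sum(w * v for w, v in zip(weights, sorted(values)))

# Two hypothetical policies evaluated for two agents.
unequal = [1.0, 5.0]   # higher total return, unevenly shared
equal = [2.5, 2.5]     # lower total return, evenly shared
weights = [0.75, 0.25]

scores = {
    "min": (minimum_welfare(unequal), minimum_welfare(equal)),
    "nash": (nash_social_welfare(unequal), nash_social_welfare(equal)),
    "gini": (generalized_gini_welfare(unequal, weights),
             generalized_gini_welfare(equal, weights)),
}
```

In this example all three objectives prefer the evenly shared policy, even though the uneven one has the larger total return (6.0 versus 5.0).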

A model-based approach to meta-Reinforcement Learning: Transformers and tree search

Authors:Brieuc Pinon, Jean-Charles Delvenne, Raphaël Jungers
Date:2022-08-24 13:30:26

Meta-learning is a line of research that develops the ability to leverage past experiences to efficiently solve new learning problems. Meta-Reinforcement Learning (meta-RL) methods demonstrate a capability to learn behaviors that efficiently acquire and exploit information in several meta-RL problems. In this context, the Alchemy benchmark has been proposed by Wang et al. [2021]. Alchemy features a rich structured latent space that is challenging for state-of-the-art model-free RL methods. These methods fail to learn to properly explore then exploit. We develop a model-based algorithm. We train a model whose principal block is a Transformer Encoder to fit the symbolic Alchemy environment dynamics. Then we define an online planner with the learned model using a tree search method. This algorithm significantly outperforms previously applied model-free RL methods on the symbolic Alchemy problem. Our results reveal the relevance of model-based approaches with online planning to perform exploration and exploitation successfully in meta-RL. Moreover, we show the efficiency of the Transformer architecture to learn complex dynamics that arise from latent spaces present in meta-RL problems.

Strategic Decision-Making in the Presence of Information Asymmetry: Provably Efficient RL with Algorithmic Instruments

Authors:Mengxin Yu, Zhuoran Yang, Jianqing Fan
Date:2022-08-23 15:32:44

We study offline reinforcement learning under a novel model called strategic MDP, which characterizes the strategic interactions between a principal and a sequence of myopic agents with private types. Due to the bilevel structure and private types, strategic MDP involves information asymmetry between the principal and the agents. We focus on the offline RL problem, where the goal is to learn the optimal policy of the principal concerning a target population of agents based on a pre-collected dataset that consists of historical interactions. The unobserved private types confound such a dataset as they affect both the rewards and observations received by the principal. We propose a novel algorithm, Pessimistic policy Learning with Algorithmic iNstruments (PLAN), which leverages the ideas of instrumental variable regression and the pessimism principle to learn a near-optimal principal's policy in the context of general function approximation. Our algorithm is based on the critical observation that the principal's actions serve as valid instrumental variables. In particular, under a partial coverage assumption on the offline dataset, we prove that PLAN outputs a $1 / \sqrt{K}$-optimal policy with $K$ being the number of collected trajectories. We further apply our framework to some special cases of strategic MDP, including strategic regression, strategic bandit, and noncompliance in recommendation systems.

Efficient Planning in a Compact Latent Action Space

Authors:Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, Yuandong Tian
Date:2022-08-22 13:19:02

Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency which is indifferent to the growing raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.

Path Planning of Cleaning Robot with Reinforcement Learning

Authors:Woohyeon Moon, Bumgeun Park, Sarvar Hussain Nengroo, Taeyoung Kim, Dongsoo Har
Date:2022-08-17 10:36:51

Recently, as the demand for cleaning robots has steadily increased, household electricity consumption has also been increasing. To address this electricity consumption issue, efficient path planning for cleaning robots has become important, and many studies have been conducted. However, most of them concern moving along a simple path segment rather than planning the whole path needed to clean all places. As an emerging deep learning technique, reinforcement learning (RL) has been adopted for cleaning robots. However, RL models operate only in a specific cleaning environment, not in varied cleaning environments, so the models have to be retrained whenever the cleaning environment changes. To solve this problem, the proximal policy optimization (PPO) algorithm is combined with efficient path planning that operates in various cleaning environments, using transfer learning (TL), detection of the nearest cleaned tile, reward shaping, and elite set construction. The proposed method is validated with an ablation study and a comparison with conventional methods such as random and zigzag. The experimental results demonstrate that the proposed method achieves improved training performance and increased convergence speed over the original PPO, and that it performs better than the conventional methods (random, zigzag).

Autonomous Resource Management in Construction Companies Using Deep Reinforcement Learning Based on IoT

Authors:Maryam Soleymani, Mahdi Bonyani, Meghdad Attarzadeh
Date:2022-08-17 05:58:02

Resource allocation is one of the most critical issues in planning construction projects, due to its direct impact on cost, time, and quality. There are usually specific allocation methods for autonomous resource management according to the project's objectives. However, integrated planning and optimization of utilizing resources in an entire construction organization are scarce. The purpose of this study is to present an automatic resource allocation structure for construction companies based on Deep Reinforcement Learning (DRL), which can be used in various situations. In this structure, Data Harvesting (DH) gathers resource information from the distributed Internet of Things (IoT) sensor devices all over the company's projects to be employed in the autonomous resource management approach. Then, Coverage Resources Allocation (CRA) is compared to the information obtained from DH in which the Autonomous Resource Management (ARM) determines the project of interest. Likewise, Double Deep Q-Networks (DDQNs) with similar models are trained on two distinct assignment situations based on structured resource information of the company to balance objectives with resource constraints. The suggested technique in this paper can efficiently adjust to large resource management systems by combining portfolio information with adopted individual project information. Also, the effects of important information processing parameters on resource allocation performance are analyzed in detail. Moreover, the results of the generalizability of management approaches are presented, indicating no need for additional training when the variables of situations change.
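
For readers unfamiliar with the Double Deep Q-Networks mentioned in this abstract, the standard Double DQN bootstrap target can be sketched as follows. This shows only the generic target computation, in which the online network selects the next action and the target network evaluates it. It is not the authors' resource-allocation model, and the function and argument names are hypothetical.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN target for a single transition.

    next_q_online: Q-values of the next state from the online network.
    next_q_target: Q-values of the next state from the target network.
    """
    if done:
        # No bootstrapping past a terminal state.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]


# Toy numbers: the online network prefers action 1, so the target
# network's estimate for action 1 (0.4) is used, not its own maximum (2.0).
y = double_dqn_target(1.0, [0.2, 0.9, 0.5], [1.0, 0.4, 2.0], gamma=0.5)
print(y)
```

Decoupling selection from evaluation in this way is what distinguishes Double DQN from plain DQN, which would bootstrap from the target network's own maximum.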

Fairness Based Energy-Efficient 3D Path Planning of a Portable Access Point: A Deep Reinforcement Learning Approach

Authors:Nithin Babu, Igor Donevski, Alvaro Valcarce, Petar Popovski, Jimmy Jessen Nielsen, Constantinos B. Papadias
Date:2022-08-10 10:48:16

In this work, we optimize the 3D trajectory of an unmanned aerial vehicle (UAV)-based portable access point (PAP) that provides wireless services to a set of ground nodes (GNs). Moreover, as per the Peukert effect, we consider pragmatic non-linear battery discharge for the battery of the UAV. Thus, we formulate the problem in a novel manner that represents the maximization of a fairness-based energy efficiency metric and is named fair energy efficiency (FEE). The FEE metric defines a system that lays importance on both the per-user service fairness and the energy efficiency of the PAP. The formulated problem takes the form of a non-convex problem with non-tractable constraints. To obtain a solution, we represent the problem as a Markov Decision Process (MDP) with continuous state and action spaces. Considering the complexity of the solution space, we use the twin delayed deep deterministic policy gradient (TD3) actor-critic deep reinforcement learning (DRL) framework to learn a policy that maximizes the FEE of the system. We perform two types of RL training to exhibit the effectiveness of our approach: the first (offline) approach keeps the positions of the GNs the same throughout the training phase; the second approach generalizes the learned policy to any arrangement of GNs by changing the positions of GNs after each training episode. Numerical evaluations show that neglecting the Peukert effect overestimates the air-time of the PAP and can be addressed by optimally selecting the PAP's flying speed. Moreover, the user fairness, energy efficiency, and hence the FEE value of the system can be improved by efficiently moving the PAP above the GNs. As such, we notice massive FEE improvements over baseline scenarios of up to 88.31%, 272.34%, and 318.13% for suburban, urban, and dense urban environments, respectively.
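
The Peukert effect referenced in this abstract can be illustrated with the textbook form of Peukert's law, t = H (C / (I H))^k, where C is the rated capacity, H the rated discharge time, I the actual discharge current, and k the Peukert constant. This is a minimal sketch of that general law only. The abstract does not give the paper's battery model, and the parameter values below are made up for illustration.

```python
def peukert_runtime_hours(capacity_ah, rated_hours, current_a, k):
    """Discharge time in hours under Peukert's law.

    capacity_ah: rated capacity (Ah) at the rated discharge time.
    rated_hours: rated discharge time (h).
    current_a:   actual discharge current (A), must be positive.
    k:           Peukert constant (k = 1 is an ideal battery).
    """
    if current_a <= 0:
        raise ValueError("current must be positive")
    return rated_hours * (capacity_ah / (current_a * rated_hours)) ** k


# Hypothetical 20 Ah battery rated over 20 h, discharged at 4 A.
ideal = peukert_runtime_hours(20.0, 20.0, 4.0, k=1.0)    # linear model: C / I
peukert = peukert_runtime_hours(20.0, 20.0, 4.0, k=1.2)  # non-linear discharge
print(ideal, peukert)
```

With k greater than 1 and a current above the rated one, the linear model predicts a longer runtime than Peukert's law, which matches the abstract's remark that neglecting the effect overestimates air-time.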

Vehicle Type Specific Waypoint Generation

Authors:Yunpeng Liu, Jonathan Wilder Lavington, Adam Scibior, Frank Wood
Date:2022-08-09 18:29:00

We develop a generic mechanism for generating vehicle-type specific sequences of waypoints from a probabilistic foundation model of driving behavior. Many foundation behavior models are trained on data that does not include vehicle information, which limits their utility in downstream applications such as planning. Our novel methodology conditionally specializes such a behavior predictive model to a vehicle-type by utilizing byproducts of the reinforcement learning algorithms used to produce vehicle specific controllers. We show how to compose a vehicle specific value function estimate with a generic probabilistic behavior model to generate vehicle-type specific waypoint sequences that are more likely to be physically plausible than their vehicle-agnostic counterparts.

Learning to Coordinate for a Worker-Station Multi-robot System in Planar Coverage Tasks

Authors:Jingtao Tang, Yuan Gao, Tin Lun Lam
Date:2022-08-05 05:36:42

For massive large-scale tasks, a multi-robot system (MRS) can effectively improve efficiency by utilizing each robot's different capabilities, mobility, and functionality. In this paper, we focus on the multi-robot coverage path planning (mCPP) problem in large-scale planar areas with random dynamic interferers in the environment, where the robots have limited resources. We introduce a worker-station MRS consisting of multiple workers with limited resources for actual work, and one station with enough resources for resource replenishment. We aim to solve the mCPP problem for the worker-station MRS by formulating it as a fully cooperative multi-agent reinforcement learning problem. Then we propose an end-to-end decentralized online planning method, which simultaneously solves coverage planning for the workers and rendezvous planning for the station. Our method manages to reduce the influence of random dynamic interferers on planning, while the robots can avoid collisions with them. We conduct simulation and real robot experiments, and the comparison results show that our method has competitive performance in solving the mCPP problem for the worker-station MRS in terms of task finish time.

DL-DRL: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-UAV

Authors:Xiao Mao, Zhiguang Cao, Mingfeng Fan, Guohua Wu, Witold Pedrycz
Date:2022-08-04 04:35:53

Exploiting unmanned aerial vehicles (UAVs) to execute tasks is gaining growing popularity recently. To solve the underlying task scheduling problem, the deep reinforcement learning (DRL) based methods demonstrate notable advantage over the conventional heuristics as they rely less on hand-engineered rules. However, their decision space will become prohibitively huge as the problem scales up, thus deteriorating the computation efficiency. To alleviate this issue, we propose a double-level deep reinforcement learning (DL-DRL) approach based on a divide and conquer framework (DCF), where we decompose the task scheduling of multi-UAV into task allocation and route planning. Particularly, we design an encoder-decoder structured policy network in our upper-level DRL model to allocate the tasks to different UAVs, and we exploit another attention based policy network in our lower-level DRL model to construct the route for each UAV, with the objective to maximize the number of executed tasks given the maximum flight distance of the UAV. To effectively train the two models, we design an interactive training strategy (ITS), which includes pre-training, intensive training and alternate training. Experimental results show that our DL-DRL performs favorably against the learning-based and conventional baselines including the OR-Tools, in terms of solution quality and computation efficiency. We also verify the generalization performance of our approach by applying it to larger sizes of up to 1000 tasks. Moreover, we also show via an ablation study that our ITS can help achieve a balance between the performance and training efficiency.

Hierarchical Reinforcement Learning for Precise Soccer Shooting Skills using a Quadrupedal Robot

Authors:Yandong Ji, Zhongyu Li, Yinan Sun, Xue Bin Peng, Sergey Levine, Glen Berseth, Koushil Sreenath
Date:2022-08-01 22:34:51

We address the problem of enabling quadrupedal robots to perform precise shooting skills in the real world using reinforcement learning. Developing algorithms to enable a legged robot to shoot a soccer ball to a given target is a challenging problem that combines robot motion control and planning into one task. To solve this problem, we need to consider the dynamics limitation and motion stability during the control of a dynamic legged robot. Moreover, we need to consider motion planning to shoot the hard-to-model deformable ball rolling on the ground with uncertain friction to a desired location. In this paper, we propose a hierarchical framework that leverages deep reinforcement learning to train (a) a robust motion control policy that can track arbitrary motions and (b) a planning policy to decide the desired kicking motion to shoot a soccer ball to a target. We deploy the proposed framework on an A1 quadrupedal robot and enable it to accurately shoot the ball to random targets in the real world.

A Maintenance Planning Framework using Online and Offline Deep Reinforcement Learning

Authors:Zaharah A. Bukhsh, Nils Jansen, Hajo Molegraaf
Date:2022-08-01 12:41:06

Cost-effective asset management is an area of interest across several industries. Specifically, this paper develops a deep reinforcement learning (DRL) solution to automatically determine an optimal rehabilitation policy for continuously deteriorating water pipes. We approach the problem of rehabilitation planning in an online and offline DRL setting. In online DRL, the agent interacts with a simulated environment of multiple pipes with distinct lengths, materials, and failure rate characteristics. We train the agent using deep Q-learning (DQN) to learn an optimal policy with minimal average costs and reduced failure probability. In offline learning, the agent uses static data, e.g., DQN replay data, to learn an optimal policy via a conservative Q-learning algorithm without further interactions with the environment. We demonstrate that DRL-based policies improve over standard preventive, corrective, and greedy planning alternatives. Additionally, learning from the fixed DQN replay dataset in an offline setting further improves the performance. The results warrant that the existing deterioration profiles of water pipes consisting of large and diverse states and action trajectories provide a valuable avenue to learn rehabilitation policies in the offline setting, which can be further fine-tuned using the simulator.

Planning and Learning: Path-Planning for Autonomous Vehicles, a Review of the Literature

Authors:Kevin Osanlou, Christophe Guettier, Tristan Cazenave, Eric Jacopin
Date:2022-07-26 20:56:18

This short review aims to make the reader familiar with state-of-the-art works relating to planning, scheduling and learning. First, we study state-of-the-art planning algorithms. We give a brief introduction of neural networks. Then we explore in more detail graph neural networks, a recent variant of neural networks suited for processing graph-structured inputs. We describe briefly the concept of reinforcement learning algorithms and some approaches designed to date. Next, we study some successful approaches combining neural networks for path-planning. Lastly, we focus on temporal planning problems with uncertainty.

Learning Bipedal Walking On Planned Footsteps For Humanoid Robots

Authors:Rohan Pratap Singh, Mehdi Benallegue, Mitsuharu Morisawa, Rafael Cisneros, Fumio Kanehiro
Date:2022-07-26 04:16:00

Deep reinforcement learning (RL) based controllers for legged robots have demonstrated impressive robustness for walking in different environments for several robot platforms. To enable the application of RL policies for humanoid robots in real-world settings, it is crucial to build a system that can achieve robust walking in any direction, on 2D and 3D terrains, and be controllable by a user-command. In this paper, we tackle this problem by learning a policy to follow a given step sequence. The policy is trained with the help of a set of procedurally generated step sequences (also called footstep plans). We show that simply feeding the upcoming 2 steps to the policy is sufficient to achieve omnidirectional walking, turning in place, standing, and climbing stairs. Our method employs curriculum learning on the complexity of terrains, and circumvents the need for reference motions or pre-trained weights. We demonstrate the application of our proposed method to learn RL policies for 2 new robot platforms - HRP5P and JVRC-1 - in the MuJoCo simulation environment. The code for training and evaluation is available online.

Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning

Authors:Deborah Cohen, Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor, Craig Boutilier, Gal Elidan
Date:2022-07-25 16:12:33

Despite recent advances in natural language understanding and generation, and decades of research on the development of conversational bots, building automated agents that can carry on rich open-ended conversations with humans "in the wild" remains a formidable challenge. In this work we develop a real-time, open-ended dialogue system that uses reinforcement learning (RL) to power a bot's conversational skill at scale. Our work pairs the succinct embedding of the conversation state generated using SOTA (supervised) language models with RL techniques that are particularly suited to a dynamic action space that changes as the conversation progresses. Trained using crowd-sourced data, our novel system is able to substantially exceed the (strong) baseline supervised model with respect to several metrics of interest in a live experiment with real users of the Google Assistant.

Continuous ErrP detections during multimodal human-robot interaction

Authors:Su Kyoung Kim, Michael Maurus, Mathias Trampler, Marc Tabie, Elsa Andrea Kirchner
Date:2022-07-25 15:39:32

Human-in-the-loop approaches are of great importance for robot applications. In the presented study, we implemented a multimodal human-robot interaction (HRI) scenario, in which a simulated robot communicates with its human partner through speech and gestures. The robot announces its intention verbally and selects the appropriate action using pointing gestures. The human partner, in turn, evaluates whether the robot's verbal announcement (intention) matches the action (pointing gesture) chosen by the robot. For cases where the verbal announcement of the robot does not match the corresponding action choice of the robot, we expect error-related potentials (ErrPs) in the human electroencephalogram (EEG). These intrinsic evaluations of robot actions by humans, evident in the EEG, were recorded in real time, continuously segmented online and classified asynchronously. For feature selection, we propose an approach that allows the combinations of forward and backward sliding windows to train a classifier. We achieved an average classification performance of 91% across 9 subjects. As expected, we also observed a relatively high variability between the subjects. In the future, the proposed feature selection approach will be extended to allow for customization of feature selection. To this end, the best combinations of forward and backward sliding windows will be automatically selected to account for inter-subject variability in classification performance. In addition, we plan to use the intrinsic human error evaluation evident in the error case by the ErrP in interactive reinforcement learning to improve multimodal human-robot interaction.

Successor Representation Active Inference

Authors:Beren Millidge, Christopher L Buckley
Date:2022-07-20 13:50:27

Recent work has uncovered close links between classical reinforcement learning algorithms, Bayesian filtering, and Active Inference, which lets us understand value functions in terms of Bayesian posteriors. An alternative, but less explored, model-free RL algorithm is the successor representation, which expresses the value function in terms of a successor matrix of expected future state occupancies. In this paper, we derive the probabilistic interpretation of the successor representation in terms of Bayesian filtering and thus design a novel active inference agent architecture utilizing successor representations instead of model-based planning. We demonstrate that active inference successor representations have significant advantages over current active inference agents in terms of planning horizon and computational cost. Moreover, we demonstrate how the successor representation agent can generalize to changing reward functions such as variants of the expected free energy.
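
The statement that the successor representation expresses the value function through a matrix of expected future state occupancies can be made concrete with a small tabular sketch. For a fixed policy with state-transition matrix P and discount gamma, the successor matrix is M = sum over t of gamma^t P^t, and the value function is V = M r. The code below covers only this classical tabular relation, not the paper's active inference agent. The helper names and the two-state example are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def successor_matrix(P, gamma, iters=500):
    """Approximate M = sum_t gamma^t P^t by iterating M <- I + gamma * P @ M."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        PM = matmul(P, M)
        M = [[identity[i][j] + gamma * PM[i][j] for j in range(n)]
             for i in range(n)]
    return M


def values_from_sr(M, rewards):
    """Value function V = M r: discounted occupancies weighted by state rewards."""
    return [sum(m * r for m, r in zip(row, rewards)) for row in M]


# Two-state chain: state 0 moves to state 1, and state 1 is absorbing.
P = [[0.0, 1.0],
     [0.0, 1.0]]
M = successor_matrix(P, gamma=0.5)
V = values_from_sr(M, rewards=[0.0, 1.0])
print(M, V)

# Changing the reward vector only needs a new product with the same M.
V_alt = values_from_sr(M, rewards=[1.0, 0.0])
```

Because M depends only on the dynamics under the policy, values for a new reward function can be recomputed without relearning M, which is the property behind the abstract's claim about generalizing to changing reward functions.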

New Auction Algorithms for Path Planning, Network Transport, and Reinforcement Learning

Authors:Dimitri Bertsekas
Date:2022-07-19 23:31:36

We consider some classical optimization problems in path planning and network transport, and we introduce new auction-based algorithms for their optimal and suboptimal solution. The algorithms are based on mathematical ideas that are related to competitive bidding by persons for objects and the attendant market equilibrium, which underlie auction processes. However, the starting point of our algorithms is different, namely weighted and unweighted path construction in directed graphs, rather than assignment of persons to objects. The new algorithms have several potential advantages over existing methods: they are empirically faster in some important contexts, such as max-flow, they are well-suited for on-line replanning, and they can be adapted to distributed asynchronous operation. Moreover, they allow arbitrary initial prices, without complementary slackness restrictions, and thus are better-suited to take advantage of reinforcement learning methods that use off-line training with data, as well as on-line training during real-time operation. The new algorithms may also find use in reinforcement learning contexts involving approximation, such as multistep lookahead and tree search schemes, and/or rollout algorithms.

An Enhanced Graph Representation for Machine Learning Based Automatic Intersection Management

Authors:Marvin Klimke, Jasper Gerigk, Benjamin Völz, Michael Buchholz
Date:2022-07-18 14:53:50

The improvement of traffic efficiency at urban intersections receives strong research interest in the field of automated intersection management. So far, mostly non-learning algorithms like reservation or optimization-based ones were proposed to solve the underlying multi-agent planning problem. At the same time, automated driving functions for a single ego vehicle are increasingly implemented using machine learning methods. In this work, we build upon a previously presented graph-based scene representation and graph neural network to approach the problem using reinforcement learning. The scene representation is improved in key aspects by using edge features in addition to the existing node features for the vehicles. This leads to an increased representation quality that is leveraged by an updated network architecture. The paper provides an in-depth evaluation of the proposed method against baselines that are commonly used in automatic intersection management. Compared to a traditional signalized intersection and an enhanced first-in-first-out scheme, a significant reduction of induced delay is observed at varying traffic densities. Finally, the generalization capability of the graph-based representation is evaluated by testing the policy on intersection layouts not seen during training. The model generalizes virtually without restrictions to smaller intersection layouts and within certain limits to larger ones.

Federated Deep Reinforcement Learning for RIS-Assisted Indoor Multi-Robot Communication Systems

Authors:Ruyu Luo, Wanli Ni, Hui Tian, Julian Cheng
Date:2022-07-17 02:41:05

Indoor multi-robot communications face two key challenges: one is the severe signal strength degradation caused by blockages (e.g., walls) and the other is the dynamic environment caused by robot mobility. To address these issues, we consider the reconfigurable intelligent surface (RIS) to overcome the signal blockage and assist the trajectory design among multiple robots. Meanwhile, the non-orthogonal multiple access (NOMA) is adopted to cope with the scarcity of spectrum and enhance the connectivity of robots. Considering the limited battery capacity of robots, we aim to maximize the energy efficiency by jointly optimizing the transmit power of the access point (AP), the phase shifts of the RIS, and the trajectory of robots. A novel federated deep reinforcement learning (F-DRL) approach is developed to solve this challenging problem with one dynamic long-term objective. Through each robot planning its path and downlink power, the AP only needs to determine the phase shifts of the RIS, which can significantly save the computation overhead due to the reduced training dimension. Simulation results reveal the following findings: I) the proposed F-DRL can reduce convergence time by at least 86% compared to the centralized DRL; II) the designed algorithm can adapt to the increasing number of robots; III) compared to traditional OMA-based benchmarks, NOMA-enhanced schemes can achieve higher energy efficiency.

Skill-based Model-based Reinforcement Learning

Authors:Lucy Xiaoyang Shi, Joseph J. Lim, Youngwoon Lee
Date:2022-07-15 16:06:33

Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every action for long-horizon tasks is not practical, akin to a human planning out every muscle movement. Instead, humans efficiently plan with high-level skills to solve complex tasks. From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts the skill outcomes, rather than predicting all small details in the intermediate states, step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency for both model-based RL and skill-based RL. Code and videos are available at https://clvrai.com/skimo

Inverse Resource Rational Based Stochastic Driver Behavior Model

Authors:Mehmet Ozkan, Yao Ma
Date:2022-07-14 17:40:02

Human drivers have limited and time-varying cognitive resources when making decisions in real-world traffic scenarios, which often leads to unique and stochastic behaviors that cannot be explained by the perfect rationality assumption, a widely accepted premise in modeling driving behaviors that presumes drivers rationally make decisions to maximize their own rewards under all circumstances. To explicitly address this disadvantage, this study presents a novel driver behavior model that aims to capture the resource rationality and stochasticity of the human driver's behaviors in realistic longitudinal driving scenarios. The resource rationality principle can provide a theoretic framework to better understand the human cognition processes by modeling humans' internal cognitive mechanisms as utility maximization subject to cognitive resource limitations, which can be represented as finite and time-varying preview horizons in the context of driving. An inverse resource rational-based stochastic inverse reinforcement learning approach (IRR-SIRL) is proposed to learn a distribution of the planning horizon and cost function of the human driver with a given series of human demonstrations. A nonlinear model predictive control (NMPC) with a time-varying horizon approach is used to generate driver-specific trajectories by using the learned distributions of the planning horizon and the cost function of the driver. The simulation experiments are carried out using human demonstrations gathered from the driver-in-the-loop driving simulator. The results reveal that the proposed inverse resource rational-based stochastic driver model can address the resource rationality and stochasticity of human driving behaviors in a variety of realistic longitudinal driving scenarios.

Visuo-Tactile Manipulation Planning Using Reinforcement Learning with Affordance Representation

Authors:Wenyu Liang, Fen Fang, Cihan Acar, Wei Qi Toh, Ying Sun, Qianli Xu, Yan Wu
Date:2022-07-14 02:02:21

Robots are increasingly expected to manipulate objects in ever more unstructured environments where the object properties have high perceptual uncertainty from any single sensory modality. This directly impacts successful object manipulation. In this work, we propose a reinforcement learning-based motion planning framework for object manipulation which makes use of both on-the-fly multisensory feedback and a learned attention-guided deep affordance model as perceptual states. The affordance model is learned from multiple sensory modalities, including vision and touch (tactile and force/torque), which is designed to predict and indicate the manipulable regions of multiple affordances (i.e., graspability and pushability) for objects with similar appearances but different intrinsic properties (e.g., mass distribution). A DQN-based deep reinforcement learning algorithm is then trained to select the optimal action for successful object manipulation. To validate the performance of the proposed framework, our method is evaluated and benchmarked using both an open dataset and our collected dataset. The results show that the proposed method and overall framework outperform existing methods and achieve better accuracy and higher efficiency.

Learning Temporally Extended Skills in Continuous Domains as Symbolic Actions for Planning

Authors:Jan Achterhold, Markus Krimmel, Joerg Stueckler
Date:2022-07-11 17:13:10

Problems which require both long-horizon planning and continuous control capabilities pose significant challenges to existing reinforcement learning agents. In this paper we introduce a novel hierarchical reinforcement learning agent which links temporally extended skills for continuous control with a forward model in a symbolic discrete abstraction of the environment's state for planning. We term our agent SEADS for Symbolic Effect-Aware Diverse Skills. We formulate an objective and corresponding algorithm which leads to unsupervised learning of a diverse set of skills through intrinsic motivation given a known state abstraction. The skills are jointly learned with the symbolic forward model which captures the effect of skill execution in the state abstraction. After training, we can leverage the skills as symbolic actions using the forward model for long-horizon planning and subsequently execute the plan using the learned continuous-action control skills. The proposed algorithm learns skills and forward models that can be used to solve complex tasks which require both continuous control and long-horizon planning capabilities with high success rate. It compares favorably with other flat and hierarchical reinforcement learning baseline agents and is successfully demonstrated with a real robot.

State Dropout-Based Curriculum Reinforcement Learning for Self-Driving at Unsignalized Intersections

Authors:Shivesh Khaitan, John M. Dolan
Date:2022-07-10 01:33:49

Traversing intersections is a challenging problem for autonomous vehicles, especially when the intersections do not have traffic control. Recently deep reinforcement learning has received massive attention due to its success in dealing with autonomous driving tasks. In this work, we address the problem of traversing unsignalized intersections using a novel curriculum for deep reinforcement learning. The proposed curriculum leads to: 1) A faster training process for the reinforcement learning agent, and 2) Better performance compared to an agent trained without curriculum. Our main contribution is two-fold: 1) Presenting a unique curriculum for training deep reinforcement learning agents, and 2) showing the application of the proposed curriculum for the unsignalized intersection traversal task. The framework expects processed observations of the surroundings from the perception system of the autonomous vehicle. We test our method in the CommonRoad motion planning simulator on T-intersections and four-way intersections.

Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Authors:Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba
Date:2022-07-07 17:59:59

The way an object looks and sounds provides complementary reflections of its physical properties. In many settings cues from vision and audition arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting in which to study multi-modal object localization in 3D virtual environments. An object is dropped somewhere in a room. An embodied robot agent, equipped with a camera and microphone, must determine what object has been dropped -- and where -- by combining audio and visual signals with knowledge of the underlying physics. To study this problem, we have generated a large-scale dataset -- the Fallen Objects dataset -- that includes 8000 instances of 30 physical object categories in 64 rooms. The dataset uses the ThreeDWorld platform, which can simulate physics-based impact sounds and complex physical interactions between objects in a photorealistic setting. As a first step toward addressing this challenge, we develop a set of embodied agent baselines, based on imitation learning, reinforcement learning, and modular planning, and perform an in-depth analysis of the challenge of this new task.

A Learning System for Motion Planning of Free-Float Dual-Arm Space Manipulator towards Non-Cooperative Object

Authors:Shengjie Wang, Yuxue Cao, Xiang Zheng, Tao Zhang
Date:2022-07-06 06:22:34

Recent years have seen the emergence of non-cooperative objects in space, like failed satellites and space junk. These objects are usually operated or collected by free-float dual-arm space manipulators. Because they eliminate the difficulties of modeling and manual parameter-tuning, reinforcement learning (RL) methods have shown promise in the trajectory planning of space manipulators. Although previous studies demonstrate their effectiveness, they cannot be applied to tracking dynamic targets with unknown rotation (non-cooperative objects). In this paper, we propose a learning system for motion planning of a free-float dual-arm space manipulator (FFDASM) towards non-cooperative objects. Specifically, our method consists of two modules. Module I realizes the multi-target trajectory planning for two end-effectors within a large target space. Next, Module II takes as input the point clouds of the non-cooperative object to estimate the motional property, and can then predict the position of target points on a non-cooperative object. We leverage the combination of Module I and Module II to successfully track target points on a spinning object with unknown regularity. Furthermore, the experiments also demonstrate the scalability and generalization of our learning system.

Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning

Authors:Lukas Schäfer, Filippos Christianos, Amos Storkey, Stefano V. Albrecht
Date:2022-07-05 18:23:20

Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.

Tackling Real-World Autonomous Driving using Deep Reinforcement Learning

Authors:Paolo Maramotti, Alessandro Paolo Capasso, Giulio Bacchiani, Alberto Broggi
Date:2022-07-05 16:33:20

In the typical autonomous driving stack, planning and control systems represent two of the most crucial components in which data retrieved by sensors and processed by perception algorithms are used to implement a safe and comfortable self-driving behavior. In particular, the planning module predicts the path the autonomous car should follow taking the correct high-level maneuver, while control systems perform a sequence of low-level actions, controlling steering angle, throttle and brake. In this work, we propose a model-free Deep Reinforcement Learning Planner training a neural network that predicts both acceleration and steering angle, thus obtaining a single module able to drive the vehicle using the data processed by localization and perception algorithms on board the self-driving car. In particular, the system, which was fully trained in simulation, is able to drive smoothly and safely in obstacle-free environments both in simulation and in a real-world urban area of the city of Parma, showing that the system generalizes well, including when driving in areas outside the training scenarios. Moreover, in order to deploy the system on board the real self-driving car and to reduce the gap between simulated and real-world performance, we also develop a module represented by a tiny neural network able to reproduce the real vehicle's dynamic behavior during the training in simulation.

Planning with RL and episodic-memory behavioral priors

Authors:Shivansh Beohar, Andrew Melnik
Date:2022-07-05 07:11:05

The practical application of learning agents requires sample efficient and interpretable algorithms. Learning from behavioral priors is a promising way to bootstrap agents with a better-than-random exploration policy or a safe-guard against the pitfalls of early learning. Existing solutions for imitation learning require a large number of expert demonstrations and rely on hard-to-interpret learning methods like Deep Q-learning. In this work we present a planning-based approach that can use these behavioral priors for effective exploration and learning in a reinforcement learning environment, and we demonstrate that curated exploration policies in the form of behavioral priors can help an agent learn faster.

Doubly-Asynchronous Value Iteration: Making Value Iteration Asynchronous in Actions

Authors:Tian Tian, Kenny Young, Richard S. Sutton
Date:2022-07-04 17:55:44

Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many applications. Asynchronous VI helps to address the large state space problem by updating one state at a time, in-place and in an arbitrary order. However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action space. To address this issue, we propose doubly-asynchronous value iteration (DAVI), a new algorithm that generalizes the idea of asynchrony from states to states and actions. More concretely, DAVI maximizes over a sampled subset of actions that can be of any user-defined size. This simple approach of using sampling to reduce computation maintains similarly appealing theoretical properties to VI without the need to wait for a full sweep through the entire action space in each update. In this paper, we show DAVI converges to the optimal value function with probability one, converges at a near-geometric rate with probability 1-delta, and returns a near-optimal policy in computation time that nearly matches a previously established bound for VI. We also empirically demonstrate DAVI's effectiveness in several experiments.
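The core idea described above, updating one state at a time while maximizing over only a sampled subset of actions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `davi`, the tabular layout of `P` and `R`, and the choice to always keep the current best-known action among the candidates are assumptions made here for illustration.

```python
import random


def davi(P, R, gamma, n_sampled, n_updates, seed=0):
    """Sketch of value iteration that is asynchronous in states and actions.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    expected immediate reward (hypothetical tabular layout).
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    policy = [0] * n_states  # best-known action per state so far

    def backup(s, a):
        # One-step lookahead value of taking action a in state s.
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    for _ in range(n_updates):
        s = rng.randrange(n_states)  # update a single state, in place
        # Maximize over a sampled action subset of user-defined size.
        candidates = set(rng.sample(range(n_actions), n_sampled))
        # Assumption: the incumbent action is always kept as a candidate.
        candidates.add(policy[s])
        best = max(candidates, key=lambda a: backup(s, a))
        V[s] = backup(s, best)
        policy[s] = best
    return V, policy
```

With `n_sampled` equal to the number of actions, each update reduces to an ordinary asynchronous VI backup; smaller values trade per-update cost against the number of updates needed.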

Reinforcement Learning Based User-Guided Motion Planning for Human-Robot Collaboration

Authors:Tian Yu, Qing Chang
Date:2022-07-01 15:26:36

Robots are good at performing repetitive tasks in modern manufacturing industries. However, robot motions are mostly planned and preprogrammed with a notable lack of adaptivity to task changes. Even for slightly changed tasks, the whole system must be reprogrammed by robotics experts. Therefore, it is highly desirable to have a flexible motion planning method, with which robots can adapt to specific task changes in unstructured environments, such as production systems or warehouses, with little or no intervention from non-expert personnel. In this paper, we propose a user-guided motion planning algorithm in combination with the reinforcement learning (RL) method to enable robots to automatically generate their motion plans for new tasks by learning from a few kinesthetic human demonstrations. To achieve adaptive motion plans for a specific application environment, e.g., desk assembly or warehouse loading/unloading, a library is built by abstracting features of common human-demonstrated tasks. A definition of semantic similarity between features in the library and features of a new task is proposed and further used to construct the reward function in RL. The RL policy can automatically generate motion plans for a new task if it determines that the new task constraints can be satisfied with the current library, and can request additional human demonstrations otherwise. Multiple experiments conducted on common tasks and scenarios demonstrate that the proposed user-guided RL-assisted motion planning method is effective.

A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers

Authors:Julio A. Placed, Jared Strader, Henry Carrillo, Nikolay Atanasov, Vadim Indelman, Luca Carlone, José A. Castellanos
Date:2022-07-01 07:56:48

Active Simultaneous Localization and Mapping (SLAM) is the problem of planning and controlling the motion of a robot to build the most accurate and complete model of the surrounding environment. Since the first foundational work in active perception appeared, more than three decades ago, this field has received increasing attention across different scientific communities. This has brought about many different approaches and formulations, and makes a review of the current trends necessary and extremely valuable for both new and experienced researchers. In this work, we survey the state-of-the-art in active SLAM and take an in-depth look at the open challenges that still require attention to meet the needs of modern applications. After providing a historical perspective, we present a unified problem formulation and review the well-established modular solution scheme, which decouples the problem into three stages that identify, select, and execute potential navigation actions. We then analyze alternative approaches, including belief-space planning and deep reinforcement learning techniques, and review related work on multi-robot coordination. The manuscript concludes with a discussion of new research directions, addressing reproducible research, active spatial perception, and practical applications, among other topics.

Left Heavy Tails and the Effectiveness of the Policy and Value Networks in DNN-based best-first search for Sokoban Planning

Authors:Dieqiao Feng, Carla Gomes, Bart Selman
Date:2022-06-28 21:48:54

Despite the success of practical solvers in various NP-complete domains such as SAT and CSP as well as using deep reinforcement learning to tackle two-player games such as Go, certain classes of PSPACE-hard planning problems have remained out of reach. Even carefully designed domain-specialized solvers can fail quickly due to the exponential search space on hard instances. Recent works that combine traditional search methods, such as best-first search and Monte Carlo tree search, with Deep Neural Network (DNN) heuristics have shown promising progress and can solve a significant number of hard planning instances beyond specialized solvers. To better understand why these approaches work, we studied the interplay of the policy and value networks of DNN-based best-first search on Sokoban and show the surprising effectiveness of the policy network, further enhanced by the value network, as a guiding heuristic for the search. To further understand the phenomena, we studied the cost distribution of the search algorithms and found that Sokoban instances can have heavy-tailed runtime distributions, with tails both on the left and right-hand sides. In particular, for the first time, we show the existence of left heavy tails and propose an abstract tree model that can empirically explain the appearance of these tails. The experiments show the critical role of the policy network as a powerful heuristic guiding the search, which can lead to left heavy tails with polynomial scaling by avoiding exploring exponentially sized subtrees. Our results also demonstrate the importance of random restarts, as are widely used in traditional combinatorial solvers, for DNN-based search methods to avoid left and right heavy tails.

DayDreamer: World Models for Physical Robot Learning

Authors:Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, Pieter Abbeel
Date:2022-06-28 17:44:48

To solve tasks in complex environments, robots need to learn from experience. Deep reinforcement learning is a common approach to robot learning but requires a large amount of trial and error to learn, limiting its deployment in the physical world. As a consequence, many advances in robot learning rely on simulators. On the other hand, learning inside of simulators fails to capture the complexity of the real world, is prone to simulator inaccuracies, and the resulting behaviors do not adapt to changes in the world. The Dreamer algorithm has recently shown great promise for learning from small amounts of interaction by planning within a learned world model, outperforming pure reinforcement learning in video games. Learning a world model to predict the outcomes of potential actions enables planning in imagination, reducing the amount of trial and error needed in the real environment. However, it is unknown whether Dreamer can facilitate faster learning on physical robots. In this paper, we apply Dreamer to 4 robots to learn online and directly in the real world, without simulators. Dreamer trains a quadruped robot to roll off its back, stand up, and walk from scratch and without resets in only 1 hour. We then push the robot and find that Dreamer adapts within 10 minutes to withstand perturbations or quickly roll over and stand back up. On two different robotic arms, Dreamer learns to pick and place multiple objects directly from camera images and sparse rewards, approaching human performance. On a wheeled robot, Dreamer learns to navigate to a goal position purely from camera images, automatically resolving ambiguity about the robot orientation. Using the same hyperparameters across all experiments, we find that Dreamer is capable of online learning in the real world, establishing a strong baseline. We release our infrastructure for future applications of world models to robot learning.

Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL

Authors:Ruiquan Huang, Jing Yang, Yingbin Liang
Date:2022-06-28 15:00:45

Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback information. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice, the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity in order to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that safety constraints hardly increase the sample complexity for RF-RL.

DistSPECTRL: Distributing Specifications in Multi-Agent Reinforcement Learning Systems

Authors:Joe Eappen, Suresh Jagannathan
Date:2022-06-28 04:53:33

While notable progress has been made in specifying and learning objectives for general cyber-physical systems, applying these methods to distributed multi-agent systems still poses significant challenges. Among these are the need to (a) craft specification primitives that allow expression and interplay of both local and global objectives, (b) tame explosion in the state and action spaces to enable effective learning, and (c) minimize coordination frequency and the set of engaged participants for global objectives. To address these challenges, we propose a novel specification framework that allows natural composition of local and global objectives used to guide training of a multi-agent system. Our technique enables learning expressive policies that allow agents to operate in a coordination-free manner for local objectives, while using a decentralized communication protocol for enforcing global ones. Experimental results support our claim that sophisticated multi-agent distributed planning problems can be effectively realized using specification-guided learning.

Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning

Authors:Yang Yue, Bingyi Kang, Zhongwen Xu, Gao Huang, Shuicheng Yan
Date:2022-06-25 03:02:25

Deep reinforcement learning (RL) algorithms suffer severe performance degradation when the interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and promising for boosting sample efficiency in RL. These methods usually rely on contrastive learning and data augmentation to train a transition model for state prediction, which is different from how the model is used in RL: performing value-based planning. Accordingly, the representation learned by these visual methods may be good for recognition but not optimal for estimating state value and solving the decision problem. To address this issue, we propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the "imagined state") based on the current one and a sequence of actions. Instead of aligning this imagined state with a real state returned by the environment, VCR applies a Q-value head on both states and obtains two distributions of action values. Then a distance is computed and minimized to force the imagined state to produce a similar action value prediction as that by the real state. We develop two implementations of the above idea for the discrete and continuous action spaces respectively. We conduct experiments on Atari 100K and DeepMind Control Suite benchmarks to validate their effectiveness for improving sample efficiency. It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms.
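To make the value-consistency idea concrete, the following is a rough sketch of such a loss. It assumes a linear Q-value head and uses a KL divergence between softmax-normalized action values as the distance; both choices, along with the names `q_values` and `value_consistency_loss`, are illustrative assumptions and are not taken from the paper.

```python
import math


def q_values(weights, state):
    """Hypothetical linear Q-value head: one weight vector per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def softmax(values):
    # Numerically stable softmax turning action values into a distribution.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def value_consistency_loss(weights, imagined_state, real_state):
    """KL divergence between action-value distributions of the two states.

    The states are compared only through the Q-value head, not directly.
    The KL distance is an assumption made for this sketch.
    """
    p_real = softmax(q_values(weights, real_state))
    p_imag = softmax(q_values(weights, imagined_state))
    return sum(p * math.log(p / q) for p, q in zip(p_real, p_imag))
```

Under this loss, an imagined state that differs from the real one only in features the Q-value head ignores incurs no penalty, whereas a state-reconstruction loss would still penalize it.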

World Value Functions: Knowledge Representation for Learning and Planning

Authors:Geraud Nangue Tasse, Benjamin Rosman, Steven James
Date:2022-06-23 18:49:54

We propose world value functions (WVFs), a type of goal-oriented general value function that represents how to solve not just a given task, but any other goal-reaching task in an agent's environment. This is achieved by equipping an agent with an internal goal space defined as all the world states where it experiences a terminal transition. The agent can then modify the standard task rewards to define its own reward function, which provably drives it to learn how to achieve all reachable internal goals, and the value of doing so in the current task. We demonstrate two key benefits of WVFs in the context of learning and planning. In particular, given a learned WVF, an agent can compute the optimal policy in a new task by simply estimating the task's reward function. Furthermore, we show that WVFs also implicitly encode the transition dynamics of the environment, and so can be used to perform planning. Experimental results show that WVFs can be learned faster than regular value functions, while their ability to infer the environment's dynamics can be used to integrate learning and planning methods to further improve sample efficiency.

Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation

Authors:Cansu Sancaktar, Sebastian Blaes, Georg Martius
Date:2022-06-22 22:08:50

It has been a long-standing dream to design artificial agents that explore their environment efficiently via intrinsic motivation, similar to how children perform curious free play. Despite recent advances in intrinsically motivated reinforcement learning (RL), sample-efficient exploration in object manipulation scenarios remains a significant challenge as most of the relevant information lies in the sparse agent-object and object-object interactions. In this paper, we propose to use structured world models to incorporate relational inductive biases in the control loop to achieve sample-efficient and interaction-rich exploration in compositional multi-object environments. By planning for future novelty inside structured world models, our method generates free-play behavior that starts to interact with objects early on and develops more complex behavior over time. Instead of using models only to compute intrinsic rewards, as commonly done, our method showcases that the self-reinforcing cycle between good models and good exploration also opens up another avenue: zero-shot generalization to downstream tasks via model-based planning. After the entirely intrinsic task-agnostic exploration phase, our method solves challenging downstream tasks such as stacking, flipping, pick & place, and throwing, and generalizes to unseen numbers and arrangements of objects without any additional training.

Global Planning for Contact-Rich Manipulation via Local Smoothing of Quasi-dynamic Contact Models

Authors:Tao Pang, H. J. Terry Suh, Lujie Yang, Russ Tedrake
Date:2022-06-22 00:55:35

The empirical success of Reinforcement Learning (RL) in the setting of contact-rich manipulation leaves much to be understood from a model-based perspective, where the key difficulties are often attributed to (i) the explosion of contact modes, (ii) stiff, non-smooth contact dynamics and the resulting exploding / discontinuous gradients, and (iii) the non-convexity of the planning problem. The stochastic nature of RL addresses (i) and (ii) by effectively sampling and averaging the contact modes. On the other hand, model-based methods have tackled the same challenges by smoothing contact dynamics analytically. Our first contribution is to establish the theoretical equivalence of the two methods for simple systems, and provide qualitative and empirical equivalence on a number of complex examples. In order to further alleviate (ii), our second contribution is a convex, differentiable and quasi-dynamic formulation of contact dynamics, which is amenable to both smoothing schemes, and has proven through experiments to be highly effective for contact-rich planning. Our final contribution resolves (iii), where we show that classical sampling-based motion planning algorithms can be effective in global planning when contact modes are abstracted via smoothing. Applying our method on a collection of challenging contact-rich manipulation tasks, we demonstrate that efficient model-based motion planning can achieve results comparable to RL with dramatically less computation. Video: https://youtu.be/12Ew4xC-VwA

Multi-UAV Planning for Cooperative Wildfire Coverage and Tracking with Quality-of-Service Guarantees

Authors:Esmaeil Seraj, Andrew Silva, Matthew Gombolay
Date:2022-06-21 17:20:54

In recent years, teams of robots and Unmanned Aerial Vehicles (UAVs) have been commissioned by researchers to enable accurate, online wildfire coverage and tracking. While the majority of prior work focuses on the coordination and control of such multi-robot systems, to date, these UAV teams have not been given the ability to reason about a fire's track (i.e., location and propagation dynamics) to provide a performance guarantee over a time horizon. Motivated by the problem of aerial wildfire monitoring, we propose a predictive framework which enables cooperation in multi-UAV teams towards collaborative field coverage and fire tracking with a probabilistic performance guarantee. Our approach enables UAVs to infer the latent fire propagation dynamics for time-extended coordination in safety-critical conditions. We derive a set of novel analytical temporal and tracking-error bounds to enable the UAV team to distribute their limited resources and cover the entire fire area according to the case-specific estimated states and provide a probabilistic performance guarantee. Our results are not limited to the aerial wildfire monitoring case study and are generally applicable to problems such as search-and-rescue, target tracking and border patrol. We evaluate our approach in simulation and provide demonstrations of the proposed framework on a physical multi-robot testbed to account for real robot dynamics and restrictions. Our quantitative evaluations validate the performance of our method, accumulating 7.5x and 9.0x smaller tracking error than state-of-the-art model-based and reinforcement learning benchmarks, respectively.

Hybridization of evolutionary algorithm and deep reinforcement learning for multi-objective orienteering optimization

Authors:Wei Liu, Rui Wang, Tao Zhang, Kaiwen Li, Wenhua Li, Hisao Ishibuchi
Date:2022-06-21 15:20:42

Multi-objective orienteering problems (MO-OPs) are classical multi-objective routing problems and have received a lot of attention in the past decades. This study seeks to solve MO-OPs through a problem-decomposition framework, that is, an MO-OP is decomposed into a multi-objective knapsack problem (MOKP) and a travelling salesman problem (TSP). The MOKP and TSP are then solved by a multi-objective evolutionary algorithm (MOEA) and a deep reinforcement learning (DRL) method, respectively. While the MOEA module is for selecting cities, the DRL module is for planning a Hamiltonian path for these cities. An iterative use of these two modules drives the population towards the Pareto front of MO-OPs. The effectiveness of the proposed method is compared against NSGA-II and NSGA-III on various types of MO-OP instances. Experimental results show that our method exhibits the best performance on almost all the test instances and has strong generalization ability.

Guided Safe Shooting: model based reinforcement learning with safety constraints

Authors:Giuseppe Paolo, Jonas Gonzalez-Billandon, Albert Thomas, Balázs Kégl
Date:2022-06-20 12:46:35

In the last decade, reinforcement learning successfully solved complex control tasks and decision-making problems, like the Go board game. Yet, there are few success stories when it comes to deploying those algorithms to real-world scenarios. One of the reasons is the lack of guarantees when dealing with and avoiding unsafe states, a fundamental requirement in critical control engineering systems. In this paper, we introduce Guided Safe Shooting (GuSS), a model-based RL approach that can learn to control systems with minimal violations of the safety constraints. The model is learned on the data collected during the operation of the system in an iterated batch fashion, and is then used to plan for the best action to perform at each time step. We propose three different safe planners, one based on a simple random shooting strategy and two based on MAP-Elites, a more advanced divergent-search algorithm. Experiments show that these planners help the learning agent avoid unsafe situations while maximally exploring the state space, a necessary aspect when learning an accurate model of the system. Furthermore, compared to model-free approaches, learning a model allows GuSS to reduce the number of interactions with the real system while still reaching high rewards, a fundamental requirement when handling engineering systems.
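As an illustration of planning by random shooting under a safety constraint, here is a small sketch. The abstract does not specify how GuSS scores or filters candidate plans, so the ranking rule used here (fewest predicted unsafe states first, then highest predicted return) and all names (`safe_random_shooting`, `model`, `reward`, `is_unsafe`) are assumptions for illustration only.

```python
import random


def safe_random_shooting(model, reward, is_unsafe, state, actions,
                         horizon=5, n_candidates=200, seed=0):
    """Return the first action of the best sampled plan.

    model(s, a) -> predicted next state, reward(s, a) -> float, and
    is_unsafe(s) -> bool are hypothetical user-supplied callables.
    """
    rng = random.Random(seed)
    best_key, best_first_action = None, None
    for _ in range(n_candidates):
        # Sample a random action sequence and roll it out in the model.
        plan = [rng.choice(actions) for _ in range(horizon)]
        s, total, violations = state, 0.0, 0
        for a in plan:
            s = model(s, a)
            total += reward(s, a)
            violations += int(is_unsafe(s))
        # Assumed ranking: fewer violations first, then higher return.
        key = (-violations, total)
        if best_key is None or key > best_key:
            best_key, best_first_action = key, plan[0]
    return best_first_action
```

Only the first action of the selected plan is executed, and planning is repeated at the next time step from the newly observed state.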

A deep inverse reinforcement learning approach to route choice modeling with context-dependent rewards

Authors:Zhan Zhao, Yuebing Liang
Date:2022-06-18 06:33:06

Route choice modeling is a fundamental task in transportation planning and demand forecasting. Classical methods generally adopt the discrete choice model (DCM) framework with linear utility functions and high-level route characteristics. While several recent studies have started to explore the applicability of deep learning for route choice modeling, they are limited to path-based models with relatively simple model architectures and rely on predefined choice sets. Existing link-based models can capture the dynamic nature of link choices within the trip without the need for choice set generation, but still assume linear relationships and link-additive features. To address these issues, this study proposes a general deep inverse reinforcement learning (IRL) framework for link-based route choice modeling, which is capable of incorporating diverse features (of the state, action and trip context) and capturing complex relationships. Specifically, we adapt an adversarial IRL model to the route choice problem for efficient estimation of context-dependent reward functions without value iteration. Experimental results based on taxi GPS data from Shanghai, China validate the superior prediction performance of the proposed model over conventional DCMs and other imitation learning baselines, even for destinations unseen in the training data. Further analysis shows that the model exhibits competitive computational efficiency and reasonable interpretability. The proposed methodology provides a new direction for future development of route choice models. It is general and can be adapted to other route choice problems across different modes and networks.

Generalised Policy Improvement with Geometric Policy Composition

Authors:Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Rémi Munos, André Barreto
Date:2022-06-17 12:52:13

We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.
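
For orientation, the standard GPI step that the abstract builds on can be sketched as follows: given action-value estimates for several base policies, act greedily with respect to their pointwise maximum. This is a minimal sketch of plain GPI only, not of the geometric policy composition proposed in the paper; the function name `gpi_action` and the toy Q-tables are hypothetical.

```python
def gpi_action(q_tables, state, actions):
    """Pick the action maximising max_i Q_i(state, action) over all base policies."""
    return max(actions, key=lambda a: max(q[state][a] for q in q_tables))


# Toy example with two hypothetical base policies and a single state "s0".
q_policy_1 = {"s0": {"left": 1.0, "right": 0.2}}
q_policy_2 = {"s0": {"left": 0.1, "right": 1.5}}

chosen = gpi_action([q_policy_1, q_policy_2], "s0", ["left", "right"])
print(chosen)
```

Here the best value for "left" is 1.0 and the best value for "right" is 1.5, so the GPI policy selects "right" in state "s0".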

Bootstrapped Transformer for Offline Reinforcement Learning

Authors:Kerong Wang, Hanye Zhao, Xufang Luo, Kan Ren, Weinan Zhang, Dongsheng Li
Date:2022-06-17 05:57:47

Offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment. Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem, adopting sequence models such as Transformer architecture to model distributions over trajectories, and repurposing beam search as a planning algorithm. However, the training datasets utilized in general offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which could be harmful to training sequence generation models yet has not drawn enough attention in the previous works. In this paper, we propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and beat other strong baseline methods. We also analyze the generated pseudo data and the revealed characteristics may shed some light on offline RL training. The codes are available at https://seqml.github.io/bootorl.

A Look at Value-Based Decision-Time vs. Background Planning Methods Across Different Settings

Authors:Safa Alver, Doina Precup
Date:2022-06-16 20:48:19

In model-based reinforcement learning (RL), an agent can leverage a learned model to improve its way of behaving in different ways. Two of the prevalent ways to do this are through decision-time and background planning methods. In this study, we are interested in understanding how the value-based versions of these two planning methods will compare against each other across different settings. Towards this goal, we first consider the simplest instantiations of value-based decision-time and background planning methods and provide theoretical results on which one will perform better in the regular RL and transfer learning settings. Then, we consider the modern instantiations of them and provide hypotheses on which one will perform better in the same settings. Finally, we perform illustrative experiments to validate these theoretical results and hypotheses. Overall, our findings suggest that even though value-based versions of the two planning methods perform on par in their simplest instantiations, the modern instantiations of value-based decision-time planning methods can perform on par or better than the modern instantiations of value-based background planning methods in both the regular RL and transfer learning settings.

Reinforcement Learning-based Placement of Charging Stations in Urban Road Networks

Authors:Leonie von Wahl, Nicolas Tempelmeier, Ashutosh Sao, Elena Demidova
Date:2022-06-13 10:03:32

The transition from conventional mobility to electromobility largely depends on charging infrastructure availability and optimal placement. This paper examines the optimal placement of charging stations in urban areas. We maximise the charging infrastructure supply over the area and minimise waiting, travel, and charging times while setting budget constraints. Moreover, we include the possibility of charging vehicles at home to obtain a more refined estimation of the actual charging demand throughout the urban area. We formulate the Placement of Charging Stations problem as a non-linear integer optimisation problem that seeks the optimal positions for charging stations and the optimal number of charging piles of different charging types. We design a novel Deep Reinforcement Learning approach to solve the charging station placement problem (PCRL). Extensive experiments on real-world datasets show how the PCRL reduces the waiting and travel time while increasing the benefit of the charging plan compared to five baselines. Compared to the existing infrastructure, we can reduce the waiting time by up to 97% and increase the benefit up to 497%.

Deep Reinforcement Learning for Optimal Investment and Saving Strategy Selection in Heterogeneous Profiles: Intelligent Agents working towards retirement

Authors:Fatih Ozhamaratli, Paolo Barucca
Date:2022-06-12 20:27:58

The transition from defined benefit to defined contribution pension plans shifts the responsibility for saving toward retirement from governments and institutions to individuals. Determining the optimal saving and investment strategy for individuals is paramount for a stable financial stance and for avoiding poverty during work-life and retirement, and it is a particularly challenging task in a world where the forms of employment and income trajectories experienced by different occupation groups are highly diversified. We introduce a model in which agents learn optimal portfolio allocation and saving strategies that are suitable for their heterogeneous profiles. We use deep reinforcement learning to train agents. The environment is calibrated with occupation and age dependent income evolution dynamics. The research focuses on heterogeneous income trajectories dependent on agent profiles and incorporates the behavioural parameterisation of agents. The model provides a flexible methodology to estimate lifetime consumption and investment choices for heterogeneous profiles under varying scenarios.

LTL-Transfer: Skill Transfer for Temporal Task Specification

Authors:Jason Xinyu Liu, Ankit Shah, Eric Rosen, Mingxi Jia, George Konidaris, Stefanie Tellex
Date:2022-06-10 13:43:03

Deploying robots in real-world environments, such as households and manufacturing lines, requires generalization across novel task specifications without violating safety constraints. Linear temporal logic (LTL) is a widely used task specification language with a compositional grammar that naturally induces commonalities among tasks while preserving safety guarantees. However, most prior work on reinforcement learning with LTL specifications treats every new task independently, thus requiring large amounts of training data to generalize. We propose LTL-Transfer, a zero-shot transfer algorithm that composes task-agnostic skills learned during training to safely satisfy a wide variety of novel LTL task specifications. Experiments in Minecraft-inspired domains show that after training on only 50 tasks, LTL-Transfer can solve over 90% of 100 challenging unseen tasks and 100% of 300 commonly used novel tasks without violating any safety constraints. We deployed LTL-Transfer at the task-planning level of a quadruped mobile manipulator to demonstrate its zero-shot transfer ability for fetch-and-deliver and navigation tasks.

Deep Surrogate Assisted Generation of Environments

Authors:Varun Bhatt, Bryon Tjanaka, Matthew C. Fontaine, Stefanos Nikolaidis
Date:2022-06-09 00:14:03

Recent progress in reinforcement learning (RL) has started producing generally capable agents that can solve a distribution of complex environments. These agents are typically tested on fixed, human-authored environments. On the other hand, quality diversity (QD) optimization has been proven to be an effective component of environment generation algorithms, which can generate collections of high-quality environments that are diverse in the resulting agent behaviors. However, these algorithms require potentially expensive simulations of agents on newly generated environments. We propose Deep Surrogate Assisted Generation of Environments (DSAGE), a sample-efficient QD environment generation algorithm that maintains a deep surrogate model for predicting agent behaviors in new environments. Results in two benchmark domains show that DSAGE significantly outperforms existing QD environment generation algorithms in discovering collections of environments that elicit diverse behaviors of a state-of-the-art RL agent and a planning agent. Our source code and videos are available at https://dsagepaper.github.io/.

Deep Hierarchical Planning from Pixels

Authors:Danijar Hafner, Kuang-Huei Lee, Ian Fischer, Pieter Abbeel
Date:2022-06-08 18:20:15

Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven to be challenging; current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.

Learning in Observable POMDPs, without Computationally Intractable Oracles

Authors:Noah Golowich, Ankur Moitra, Dhruv Rohatgi
Date:2022-06-07 17:05:27

Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.

Goal-Space Planning with Subgoal Models

Authors:Chunlok Lo, Kevin Roice, Parham Mohammad Panahi, Scott Jordan, Adam White, Gabor Mihucz, Farzane Aminmansour, Martha White
Date:2022-06-06 20:59:07

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.

Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path

Authors:Yihan Du, Siwei Wang, Longbo Huang
Date:2022-06-06 15:24:06

In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step, and focuses on tightly controlling the risk of getting into catastrophic situations at each stage. This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process, such as autonomous driving, clinical treatment planning and robotics. We investigate two performance metrics under Iterated CVaR RL, i.e., Regret Minimization and Best Policy Identification. For both metrics, we design efficient algorithms ICVaR-RM and ICVaR-BPI, respectively, and provide nearly matching upper and lower bounds with respect to the number of episodes $K$. We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward. For Worst Path RL, we propose an efficient algorithm with constant upper and lower bounds. Finally, our techniques for bounding the change of CVaR due to the value function shift and decomposing the regret via a distorted visitation distribution are novel, and can find applications in other risk-sensitive RL problems.
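
To make the risk measure concrete, the sketch below computes the CVaR of a set of equally weighted reward samples as the mean of the worst alpha-fraction of outcomes. It is a minimal illustration of the static CVaR operator only, not of the ICVaR-RM or ICVaR-BPI algorithms; the function name `cvar` and the sample values are hypothetical.

```python
def cvar(samples, alpha):
    """Mean of the worst alpha-fraction of equally weighted reward samples."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(samples)         # worst (lowest) rewards first
    mass_left = alpha * len(ordered)  # tail mass, measured in samples
    total = 0.0
    for x in ordered:
        weight = min(1.0, mass_left)  # last sample in the tail may count partially
        total += weight * x
        mass_left -= weight
        if mass_left <= 0.0:
            break
    return total / (alpha * len(ordered))


rewards = [0.0, 10.0, 20.0, 30.0]
tail_half = cvar(rewards, 0.5)      # average of the two worst outcomes
full_mean = cvar(rewards, 1.0)      # alpha = 1 recovers the plain expectation
tail_quarter = cvar(rewards, 0.25)  # only the single worst outcome
print(tail_half, full_mean, tail_quarter)
```

As alpha shrinks, the value approaches the minimum reward, which is consistent with the abstract's description of Worst Path RL as a limiting case of Iterated CVaR RL.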

Robust Adversarial Attacks Detection based on Explainable Deep Reinforcement Learning For UAV Guidance and Planning

Authors:Thomas Hickling, Nabil Aouf, Phillippa Spencer
Date:2022-06-06 15:16:10

The dangers of adversarial attacks on Uncrewed Aerial Vehicle (UAV) agents operating in public are increasing. Adopting AI-based techniques and, more specifically, Deep Learning (DL) approaches to control and guide these UAVs can be beneficial in terms of performance but can add concerns regarding the safety of those techniques and their vulnerability against adversarial attacks. Confusion in the agent's decision-making process caused by these attacks can seriously affect the safety of the UAV. This paper proposes an innovative approach based on the explainability of DL methods to build an efficient detector that will protect these DL schemes and the UAVs adopting them from attacks. The agent adopts a Deep Reinforcement Learning (DRL) scheme for guidance and planning. The agent is trained with a Deep Deterministic Policy Gradient (DDPG) with Prioritised Experience Replay (PER) DRL scheme that utilises Artificial Potential Field (APF) to improve training times and obstacle avoidance performance. A simulated environment for UAV explainable DRL-based planning and guidance, including obstacles and adversarial attacks, is built. The adversarial attacks are generated by the Basic Iterative Method (BIM) algorithm and reduced obstacle course completion rates from 97% to 35%. Two adversarial attack detectors are proposed to counter this reduction. The first one is a Convolutional Neural Network Adversarial Detector (CNN-AD), which achieves accuracy in the detection of 80%. The second detector utilises a Long Short Term Memory (LSTM) network. It achieves an accuracy of 91% with faster computing times compared to the CNN-AD, allowing for real-time adversarial detection.

Rapid Learning of Spatial Representations for Goal-Directed Navigation Based on a Novel Model of Hippocampal Place Fields

Authors:Adedapo Alabi, Dieter Vanderelst, Ali Minai
Date:2022-06-05 19:50:33

The discovery of place cells and other spatially modulated neurons in the hippocampal complex of rodents has been crucial to elucidating the neural basis of spatial cognition. More recently, the replay of neural sequences encoding previously experienced trajectories has been observed during consummatory behavior potentially with implications for quick memory consolidation and behavioral planning. Several promising models for robotic navigation and reinforcement learning have been proposed based on these and previous findings. Most of these models, however, use carefully engineered neural networks and are tested in simple environments. In this paper, we develop a self-organized model incorporating place cells and replay, and demonstrate its utility for rapid one-shot learning in non-trivial environments with obstacles.

Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

Authors:Dilip Arumugam, Benjamin Van Roy
Date:2022-06-04 23:36:38

The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent's capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn while only incurring bounded sub-optimality. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity.

Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL

Authors:Kin-Ho Lam, Delyar Tabatabai, Jed Irvine, Donald Bertucci, Anita Ruangrotsakun, Minsuk Kahng, Alan Fern
Date:2022-06-04 18:16:05

Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game. The results show the approach is effective in allowing users to identify previously-unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.

Between Rate-Distortion Theory & Value Equivalence in Model-Based Reinforcement Learning

Authors:Dilip Arumugam, Benjamin Van Roy
Date:2022-06-04 17:09:46

The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment. In this work, we entertain an extreme scenario wherein some combination of immense environment complexity and limited agent capacity entirely precludes identifying an exactly value-equivalent model. In light of this, we embrace a notion of approximate value equivalence and introduce an algorithm for incrementally synthesizing simple and useful approximations of the environment from which an agent might still recover near-optimal behavior. Crucially, we recognize the information-theoretic nature of this lossy environment compression problem and use the appropriate tools of rate-distortion theory to make mathematically precise how value equivalence can lend tractability to otherwise intractable sequential decision-making problems.

Robotic Planning under Uncertainty in Spatiotemporal Environments in Expeditionary Science

Authors:Victoria Preston, Genevieve Flaspohler, Anna P. M. Michel, John W. Fisher III, Nicholas Roy
Date:2022-06-03 02:04:15

In the expeditionary sciences, spatiotemporally varying environments -- hydrothermal plumes, algal blooms, lava flows, or animal migrations -- are ubiquitous. Mobile robots are uniquely well-suited to study these dynamic, mesoscale natural environments. We formalize expeditionary science as a sequential decision-making problem, modeled using the language of partially-observable Markov decision processes (POMDPs). Solving the expeditionary science POMDP under real-world constraints requires efficient probabilistic modeling and decision-making in problems with complex dynamics and observational models. Previous work in informative path planning, adaptive sampling, and experimental design has shown compelling results, largely in static environments, using data-driven models and information-based rewards. However, these methodologies do not trivially extend to expeditionary science in spatiotemporal environments: they generally do not make use of scientific knowledge such as equations of state dynamics, they focus on information gathering as opposed to scientific task execution, and they make use of decision-making approaches that scale poorly to large, continuous problems with long planning horizons and real-time operational constraints. In this work, we discuss these and other challenges related to probabilistic modeling and decision-making in expeditionary science, and present some of our preliminary work that addresses these gaps. We ground our results in a real expeditionary science deployment of an autonomous underwater vehicle (AUV) in the deep ocean for hydrothermal vent discovery and characterization. Our concluding thoughts highlight remaining work to be done, and the challenges that merit consideration by the reinforcement learning and decision-making community.

Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes

Authors:Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang
Date:2022-06-02 12:21:40

Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues. As another successful RL approach, algorithms based on Monte Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the game-playing domain. They are also effective when applied to non-Markov decision processes. However, the standard MCTS is a method for decision-time planning, which differs from the online RL setting. In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL setups. We then explore a combined policy approach of PG and MCTL to leverage their strengths. We derive conditions for asymptotic convergence with the results of a two-timescale stochastic approximation and propose an algorithm that satisfies these conditions and converges to a reasonable solution. Our numerical experiments validate the effectiveness of the proposed methods.

RLSS: A Deep Reinforcement Learning Algorithm for Sequential Scene Generation

Authors:Azimkhon Ostonov, Peter Wonka, Dominik L. Michels
Date:2022-06-01 08:39:33

We present RLSS: a reinforcement learning algorithm for sequential scene generation. It is based on employing the proximal policy optimization (PPO) algorithm for generative problems. In particular, we consider how to effectively reduce the action space by including a greedy search algorithm in the learning process. Our experiments demonstrate that our method converges for a relatively large number of actions and learns to generate scenes with predefined design objectives. The approach places objects iteratively in the virtual scene. In each step, the network chooses which objects to place and selects positions which result in maximal reward. A high reward is assigned if the last action resulted in desired properties, whereas the violation of constraints is penalized. We demonstrate the capability of our method to generate plausible and diverse scenes efficiently by solving indoor planning problems and generating Angry Birds levels.

Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation

Authors:Sanae Amani, Lin F. Yang, Ching-An Cheng
Date:2022-06-01 06:53:28

We study lifelong reinforcement learning (RL) in a regret minimization setting of a linear contextual Markov decision process (MDP), where the agent needs to learn a multi-task policy while solving a streaming sequence of tasks. We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks, which may be adaptively chosen based on the agent's past behaviors. Remarkably, our algorithm uses only a sublinear number of planning calls, which means that the agent eventually learns a policy that is near optimal for multiple tasks (seen or unseen) without the need for deliberate planning. A key to this property is a new structural assumption that enables computation sharing across tasks during exploration. Specifically, for $K$ task episodes of horizon $H$, our algorithm has a regret bound $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$ based on $\mathcal{O}(dH\log(K))$ planning calls, where $d$ and $d^\prime$ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to accumulate experiences and learn to rapidly solve new tasks.

BRExIt: On Opponent Modelling in Expert Iteration

Authors:Daniel Hernandez, Hendrik Baier, Michael Kaisers
Date:2022-05-31 20:49:10

Finding a best response policy is a central objective in game theory and multi-agent learning, with modern population-based training approaches employing reinforcement learning algorithms as best-response oracles to improve play against candidate opponents (typically previously learnt policies). We propose Best Response Expert Iteration (BRExIt), which accelerates learning in games by incorporating opponent models into the state-of-the-art learning algorithm Expert Iteration (ExIt). BRExIt aims to (1) improve feature shaping in the apprentice, with a policy head predicting opponent policies as an auxiliary task, and (2) bias opponent moves in planning towards the given or learnt opponent model, to generate apprentice targets that better approximate a best response. In an empirical ablation on BRExIt's algorithmic variants against a set of fixed test agents, we provide statistical evidence that BRExIt learns better performing policies than ExIt.

A Mixture-of-Expert Approach to RL-based Dialogue Management

Authors:Yinlam Chow, Aza Tulepbergenov, Ofir Nachum, MoonKyung Ryu, Mohammad Ghavamzadeh, Craig Boutilier
Date:2022-05-31 19:00:41

Despite recent advancements in language models (LMs), their application to dialogue management (DM) problems and ability to carry on rich conversations remain a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction. Most existing RL approaches to DM train the agent at the word-level, and thus, have to deal with a combinatorially complex action space even for a medium-size vocabulary. As a result, they struggle to produce a successful and engaging dialogue even if they are warm-started with a pre-trained LM. To address this issue, we develop an RL-based DM using a novel mixture of expert language model (MoE-LM) that consists of (i) an LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a particular attribute or personality, and (iii) an RL-based DM that performs dialogue planning with the utterances generated by the experts. Our MoE approach provides greater flexibility to generate sensible utterances with different intents and allows RL to focus on conversational-level DM. We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in terms of the diversity and sensibility of the generated utterances and the overall DM performance.

Multi-Domain Virtual Network Embedding Algorithm based on Horizontal Federated Learning

Authors:Peiying Zhang, Ning Chen, Shibao Li, Kim-Kwang Raymond Choo, Chunxiao Jiang
Date:2022-05-29 13:46:19

Network Virtualization (NV) is an emerging network dynamic planning technique to overcome network rigidity. As a core challenge of NV, Virtual Network Embedding (VNE) enhances the scalability and flexibility of the network by decoupling the resources and services of the underlying physical network. For future multi-domain physical network modeling, with its characteristics of dynamics, heterogeneity, privacy, and real-time operation, the existing related works perform unsatisfactorily. Federated learning (FL) jointly optimizes the network by sharing parameters among multiple parties and is widely employed to address concerns over data privacy and data silos. Aiming at the NV challenge of multi-domain physical networks, this work is the first to propose using FL to model VNE, and presents a VNE architecture based on Horizontal Federated Learning (HFL) (HFL-VNE). Specifically, combined with the distributed training paradigm of FL, we deploy local servers in each physical domain, which can effectively focus on local features and reduce resource fragmentation. A global server is deployed to aggregate and share training parameters, which enhances local data privacy and significantly improves learning efficiency. Furthermore, we deploy the Deep Reinforcement Learning (DRL) model in each server to dynamically adjust and optimize the resource allocation of the multi-domain physical network. In DRL-assisted FL, HFL-VNE jointly optimizes decision-making through specific local and federated reward mechanisms and loss functions. Finally, the superiority of HFL-VNE is demonstrated through simulation experiments and comparison with related works.

Learning to Use Chopsticks in Diverse Gripping Styles

Authors:Zeshi Yang, KangKang Yin, Libin Liu
Date:2022-05-28 03:07:13

Learning dexterous manipulation skills is a long-standing challenge in computer graphics and robotics, especially when the task involves complex and delicate interactions between the hands, tools and objects. In this paper, we focus on chopsticks-based object relocation tasks, which are common yet demanding. The key to successful chopsticks skills is steady gripping of the sticks that also supports delicate maneuvers. We automatically discover physically valid chopsticks holding poses by Bayesian Optimization (BO) and Deep Reinforcement Learning (DRL), which works for multiple gripping styles and hand morphologies without the need of example data. Given as input the discovered gripping poses and desired objects to be moved, we build physics-based hand controllers to accomplish relocation tasks in two stages. First, kinematic trajectories are synthesized for the chopsticks and hand in a motion planning stage. The key components of our motion planner include a grasping model to select suitable chopsticks configurations for grasping the object, and a trajectory optimization module to generate collision-free chopsticks trajectories. Then we train physics-based hand controllers through DRL again to track the desired kinematic trajectories produced by the motion planner. We demonstrate the capabilities of our framework by relocating objects of various shapes and sizes, in diverse gripping styles and holding positions for multiple hand morphologies. Our system achieves faster learning speed and better control robustness, when compared to vanilla systems that attempt to learn chopstick-based skills without a gripping pose optimization module and/or without a kinematic motion planner.

Provably Sample-Efficient RL with Side Information about Latent Dynamics

Authors:Yao Liu, Dipendra Misra, Miro Dudík, Robert E. Schapire
Date:2022-05-27 21:07:03

We study reinforcement learning (RL) in settings where observations are high-dimensional, but where an RL agent has access to abstract knowledge about the structure of the state space, as is the case, for example, when a robot is tasked to go to a specific room in a building using observations from its own camera, while having access to the floor plan. We formalize this setting as transfer reinforcement learning from an abstract simulator, which we assume is deterministic (such as a simple model of moving around the floor plan), but which is only required to capture the target domain's latent-state dynamics approximately up to unknown (bounded) perturbations (to account for environment stochasticity). Crucially, we assume no prior knowledge about the structure of observations in the target domain except that they can be used to identify the latent states (but the decoding map is unknown). Under these assumptions, we present an algorithm, called TASID, that learns a robust policy in the target domain, with sample complexity that is polynomial in the horizon, and independent of the number of states, which is not possible without access to some prior knowledge. In synthetic experiments, we verify various properties of our algorithm and show that it empirically outperforms transfer RL algorithms that require access to "full simulators" (i.e., those that also simulate observations).

Personalized Algorithmic Recourse with Preference Elicitation

Authors:Giovanni De Toni, Paolo Viappiani, Stefano Teso, Bruno Lepri, Andrea Passerini
Date:2022-05-27 03:12:18

Algorithmic Recourse (AR) is the problem of computing a sequence of actions that -- once performed by a user -- overturns an undesirable machine decision. It is paramount that the sequence of actions does not require too much effort for users to implement. Yet, most approaches to AR assume that actions cost the same for all users, and thus may recommend unfairly expensive recourse plans to certain users. Prompted by this observation, we introduce PEAR, the first human-in-the-loop approach capable of providing personalized algorithmic recourse tailored to the needs of any end-user. PEAR builds on insights from Bayesian Preference Elicitation to iteratively refine an estimate of the costs of actions by asking choice set queries to the target user. The queries themselves are computed by maximizing the Expected Utility of Selection, a principled measure of information gain accounting for uncertainty on both the cost estimate and the user's responses. PEAR integrates elicitation into a Reinforcement Learning agent coupled with Monte Carlo Tree Search to quickly identify promising recourse plans. Our empirical evaluation on real-world datasets highlights how PEAR produces high-quality personalized recourse in only a handful of iterations.

Dynamic Network Reconfiguration for Entropy Maximization using Deep Reinforcement Learning

Authors:Christoffel Doorman, Victor-Alexandru Darvariu, Stephen Hailes, Mirco Musolesi
Date:2022-05-26 18:44:22

A key problem in network theory is how to reconfigure a graph in order to optimize a quantifiable objective. Given the ubiquity of networked systems, such work has broad practical applications in a variety of situations, ranging from drug and material design to telecommunications. The large decision space of possible reconfigurations, however, makes this problem computationally intensive. In this paper, we cast the problem of network rewiring for optimizing a specified structural property as a Markov Decision Process (MDP), in which a decision-maker is given a budget of modifications that are performed sequentially. We then propose a general approach based on the Deep Q-Network (DQN) algorithm and graph neural networks (GNNs) that can efficiently learn strategies for rewiring networks. We then discuss a cybersecurity case study, i.e., an application to the computer network reconfiguration problem for intrusion protection. In a typical scenario, an attacker might have a (partial) map of the system they plan to penetrate; if the network is effectively "scrambled", they would not be able to navigate it since their prior knowledge would become obsolete. This can be viewed as an entropy maximization problem, in which the goal is to increase the surprise of the network. Indeed, entropy acts as a proxy measurement of the difficulty of navigating the network topology. We demonstrate the general ability of the proposed method to obtain better entropy gains than random rewiring on synthetic and real-world graphs while being computationally inexpensive, as well as being able to generalize to larger graphs than those seen during training. Simulations of attack scenarios confirm the effectiveness of the learned rewiring strategies.
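To make the "budget of sequential modifications" and "entropy gain" ideas concrete, here is a minimal sketch of a single rewiring step on a toy graph. It assumes, purely for illustration, that entropy is the Shannon entropy of the degree distribution; the abstract does not specify the measure, so this is a stand-in and not necessarily the paper's objective. The helper names `degree_entropy` and `rewire` are hypothetical.

```python
import math
from collections import Counter


def degree_entropy(edges, n_nodes):
    """Shannon entropy (bits) of the degree distribution of an undirected graph.

    Assumption: this is an illustrative entropy measure, not necessarily
    the one used in the paper.
    """
    degree = [0] * n_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree)
    return -sum((c / n_nodes) * math.log2(c / n_nodes) for c in counts.values())


def rewire(edges, old_edge, new_edge):
    """Return a new edge set with old_edge removed and new_edge added (one budget unit)."""
    updated = set(edges)
    updated.remove(old_edge)
    updated.add(new_edge)
    return updated


# A 4-cycle: every node has degree 2, so the degree distribution has zero entropy.
cycle = {(0, 1), (1, 2), (2, 3), (3, 0)}
h_before = degree_entropy(cycle, 4)

# One rewiring action: replace edge (3, 0) with edge (1, 3).
rewired = rewire(cycle, (3, 0), (1, 3))
h_after = degree_entropy(rewired, 4)
```

In an MDP formulation along these lines, the state would be the current graph, an action would be one such rewiring, and the reward could be the change `h_after - h_before`; a learned policy would replace the hand-picked action used here.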

Physics-Guided Hierarchical Reward Mechanism for Learning-Based Robotic Grasping

Authors:Yunsik Jung, Lingfeng Tao, Michael Bowman, Jiucai Zhang, Xiaoli Zhang
Date:2022-05-26 18:01:56

Learning-based grasping can afford real-time grasp motion planning of multi-fingered robotic hands thanks to its high computational efficiency. However, learning-based methods are required to explore large search spaces during the learning process. The large search space causes low learning efficiency, which has been the main barrier to practical adoption. In addition, the trained policy does not generalize well unless objects are identical to the trained objects. In this work, we develop a novel Physics-Guided Deep Reinforcement Learning with a Hierarchical Reward Mechanism to improve learning efficiency and generalizability for learning-based autonomous grasping. Unlike conventional observation-based grasp learning, physics-informed metrics are utilized to convey correlations between features associated with hand structures and objects to improve learning efficiency and outcomes. Further, the hierarchical reward mechanism enables the robot to learn prioritized components of the grasping tasks. Our method is validated in robotic grasping tasks with a 3-finger MICO robot arm. The results show that our method outperformed the standard Deep Reinforcement Learning methods in various robotic grasping tasks.

Hierarchical Planning Through Goal-Conditioned Offline Reinforcement Learning

Authors:Jinning Li, Chen Tang, Masayoshi Tomizuka, Wei Zhan
Date:2022-05-24 05:13:40

Offline reinforcement learning (RL) has shown potential in many safety-critical tasks in robotics where exploration is risky and expensive. However, it still struggles to acquire skills in temporally extended tasks. In this paper, we study the problem of offline RL for temporally extended tasks. We propose a hierarchical planning framework, consisting of a low-level goal-conditioned RL policy and a high-level goal planner. The low-level policy is trained via offline RL. We improve the offline training to deal with out-of-distribution goals by a perturbed goal sampling process. The high-level planner selects intermediate sub-goals by taking advantage of model-based planning methods. It plans over future sub-goal sequences based on the learned value function of the low-level policy. We adopt a Conditional Variational Autoencoder to sample meaningful high-dimensional sub-goal candidates and to solve the high-level long-term strategy optimization problem. We evaluate our proposed method in long-horizon driving and robot navigation tasks. Experiments show that our method outperforms baselines with different hierarchical designs and other regular planners without hierarchy in these complex tasks.

Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs

Authors:Dongruo Zhou, Quanquan Gu
Date:2022-05-23 17:59:18

Recent studies have shown that episodic reinforcement learning (RL) is not more difficult than contextual bandits, even with a long planning horizon and unknown state transitions. However, these results are limited to either tabular Markov decision processes (MDPs) or computationally inefficient algorithms for linear mixture MDPs. In this paper, we propose the first computationally efficient horizon-free algorithm for linear mixture MDPs, which achieves the optimal $\tilde O(d\sqrt{K} +d^2)$ regret up to logarithmic factors. Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics, where the weight is both \emph{variance-aware} and \emph{uncertainty-aware}. When applying our weighted least squares estimator to heterogeneous linear bandits, we can obtain an $\tilde O(d\sqrt{\sum_{k=1}^K \sigma_k^2} +d)$ regret in the first $K$ rounds, where $d$ is the dimension of the context and $\sigma_k^2$ is the variance of the reward in the $k$-th round. This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
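As a rough illustration of the variance-aware half of such an estimator, the sketch below fits a ridge regression in which each sample is weighted by the inverse of its (assumed known) noise variance $\sigma_k^2$. This is a generic weighted least squares example on synthetic heterogeneous data, not the paper's algorithm: the uncertainty-aware part of the weights is not modeled, and the function name `weighted_ridge`, the regularizer `lam`, and all data are assumptions.

```python
import numpy as np


def weighted_ridge(X, y, sigma2, lam=1.0):
    """Minimize sum_k (y_k - <x_k, theta>)^2 / sigma2_k + lam * ||theta||^2.

    Assumption: the per-round variances sigma2 are known; low-variance
    rounds therefore receive larger weight.
    """
    w = 1.0 / np.asarray(sigma2, dtype=float)
    Xw = X * w[:, None]                       # rows scaled by their weights
    A = lam * np.eye(X.shape[1]) + Xw.T @ X   # weighted Gram matrix
    b = Xw.T @ y
    return np.linalg.solve(A, b)


# Synthetic heterogeneous linear model: noise variance differs per round.
rng = np.random.default_rng(0)
d, K = 3, 2000
theta_true = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(K, d))
sigma2 = rng.uniform(0.01, 4.0, size=K)
y = X @ theta_true + rng.normal(size=K) * np.sqrt(sigma2)

theta_hat = weighted_ridge(X, y, sigma2)
```

Weighting by $1/\sigma_k^2$ is what lets the error scale with $\sum_k \sigma_k^2$ rather than with $K$ times the worst-case variance, which is the intuition behind the $\sqrt{\sum_{k=1}^K \sigma_k^2}$ term in the bound above.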

Cooperative Reinforcement Learning on Traffic Signal Control

Authors:Chi-Chun Chao, Jun-Wei Hsieh, Bor-Shiun Wang
Date:2022-05-23 13:25:15

Traffic signal control is a challenging real-world problem aiming to minimize overall travel time by coordinating vehicle movements at road intersections. Existing traffic signal control systems in use still rely heavily on oversimplified information and rule-based methods. Specifically, the periodicity of green/red light alternations can be considered as a prior for better planning of each agent in policy optimization. However, when learning such adaptive and predictive priors, traditional RL-based methods can only return a fixed length from a predefined action pool, using only local agents. Without cooperation between these agents, some agents often conflict with others and thus decrease overall throughput. This paper proposes a cooperative, multi-objective architecture with age-decaying weights to better estimate multiple reward terms for traffic signal control optimization, which we term COoperative Multi-Objective Multi-Agent Deep Deterministic Policy Gradient (COMMA-DDPG). Two types of agents run to maximize rewards for different goals: one for local traffic optimization at each intersection and the other for global traffic waiting time optimization. The global agent is used to guide the local agents as a means of aiding faster learning but is not used in the inference phase. We also provide an analysis of solution existence together with a convergence proof for the proposed RL optimization. Evaluation is performed using real-world traffic data collected using traffic cameras from an Asian country. Our method can effectively reduce the total delayed time by 60%. Results demonstrate its superiority when compared to SoTA methods.

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

Authors:Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang
Date:2022-05-23 09:03:24

We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy which is most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is only limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves the regret of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.

Should Models Be Accurate?

Authors:Esra'a Saleh, John D. Martin, Anna Koop, Arash Pourzarabi, Michael Bowling
Date:2022-05-22 04:23:54

Model-based Reinforcement Learning (MBRL) holds promise for data-efficiency by planning with model-generated experience in addition to learning with experience from the environment. However, in complex or changing environments, models in MBRL will inevitably be imperfect, and their detrimental effects on learning can be difficult to mitigate. In this work, we question whether the objective of these models should be the accurate simulation of environment dynamics at all. We focus our investigations on Dyna-style planning in a prediction setting. First, we highlight and support three motivating points: a perfectly accurate model of environment dynamics is not practically achievable, is not necessary, and is not always the most useful anyways. Second, we introduce a meta-learning algorithm for training models with a focus on their usefulness to the learner instead of their accuracy in modelling the environment. Our experiments show that in a simple non-stationary environment, our algorithm enables faster learning than even using an accurate model built with domain-specific knowledge of the non-stationarity.

Synthesis from Satisficing and Temporal Goals

Authors:Suguman Bansal, Lydia Kavraki, Moshe Y. Vardi, Andrew Wells
Date:2022-05-20 23:46:31

Reactive synthesis from high-level specifications that combine hard constraints expressed in Linear Temporal Logic (LTL) with soft constraints expressed by discounted-sum (DS) rewards has applications in planning and reinforcement learning. An existing approach combines techniques from LTL synthesis with optimization for the DS rewards but has failed to yield a sound algorithm. An alternative approach combining LTL synthesis with satisficing DS rewards (rewards that achieve a threshold) is sound and complete for integer discount factors, but, in practice, a fractional discount factor is desired. This work extends the existing satisficing approach, presenting the first sound algorithm for synthesis from LTL and DS rewards with fractional discount factors. The utility of our algorithm is demonstrated on robotic planning domains.
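The satisficing condition can be illustrated on a finite reward prefix. The sketch below assumes the convention in which the discount factor $d > 1$ divides the $i$-th reward by $d^i$, so that an "integer" factor is, e.g., $d = 2$ and a "fractional" one is, e.g., $d = 3/2$; this convention, the function names, and the reward sequence are all illustrative assumptions. The synthesis problem itself concerns infinite plays and is not addressed here.

```python
from fractions import Fraction


def discounted_sum(rewards, d):
    """Exact discounted sum  sum_i r_i / d**i  of a finite reward sequence.

    Assumption: discount factor d > 1 (integer or fractional), applied as 1/d**i.
    """
    d = Fraction(d)
    total = Fraction(0)
    weight = Fraction(1)
    for r in rewards:
        total += Fraction(r) * weight
        weight /= d
    return total


def satisfices(rewards, d, threshold):
    """True if the discounted sum reaches the threshold (the satisficing goal)."""
    return discounted_sum(rewards, d) >= threshold


rewards = [1, 0, 2, 1]
ds_integer = discounted_sum(rewards, 2)                  # integer discount factor
ds_fractional = discounted_sum(rewards, Fraction(3, 2))  # fractional discount factor
```

Using `Fraction` keeps the arithmetic exact, which makes threshold comparisons unambiguous for both integer and fractional discount factors.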

Towards biologically plausible Dreaming and Planning in recurrent spiking networks

Authors:Cristiano Capone, Pier Stanislao Paolucci
Date:2022-05-20 09:35:26

Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performance. Recent model-based approaches show promising results by reducing the number of necessary interactions with the environment to learn a desirable policy. However, these methods require biologically implausible ingredients, such as the detailed storage of older experiences, and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient to use an inner model. We propose a two-module (agent and model) spiking neural network in which "dreaming" (living new experiences in a model-based simulated environment) significantly boosts learning. We also explore "planning", an online alternative to dreaming, that shows comparable performance. Importantly, our model does not require the detailed storage of experiences, and learns the world-model and the policy online. Moreover, we stress that our network is composed of spiking neurons, further increasing the biological plausibility and implementability in neuromorphic hardware.

Planning with Diffusion for Flexible Behavior Synthesis

Authors:Michael Janner, Yilun Du, Joshua B. Tenenbaum, Sergey Levine
Date:2022-05-20 07:02:03

Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.

A Fully Controllable Agent in the Path Planning using Goal-Conditioned Reinforcement Learning

Authors:GyeongTaek Lee
Date:2022-05-20 05:18:03

The aim of path planning is to find a route by which an agent reaches a goal from a starting point. In path planning, routes may vary depending on a number of variables, so it is important for the agent to be able to reach various goals. Numerous studies, however, have dealt with a single goal that is predefined by the user. In the present study, I propose a novel reinforcement learning framework for a fully controllable agent in path planning. To do this, I propose bi-directional memory editing to obtain various bi-directional trajectories of the agent, in which the behavior of the agent and sub-goals are trained with goal-conditioned RL. To move the agent in various directions, I utilize a dedicated sub-goal network, separate from the policy network. Lastly, I present reward shaping to reduce the number of steps the agent needs to reach the goal. In the experiments, the agent was able to reach various goals that it had never visited during training. I confirmed that the agent could perform difficult missions such as a round trip, and that the agent used a shorter route with reward shaping.

TC-Driver: Trajectory Conditioned Driving for Robust Autonomous Racing -- A Reinforcement Learning Approach

Authors:Edoardo Ghignone, Nicolas Baumann, Mike Boss, Michele Magno
Date:2022-05-19 08:06:10

Autonomous racing is becoming popular for academic and industry researchers as a test for general autonomous driving by pushing perception, planning, and control algorithms to their limits. While traditional control methods such as MPC are capable of generating an optimal control sequence at the edge of the vehicle's physical controllability, these methods are sensitive to the accuracy of the modeling parameters. This paper presents TC-Driver, an RL approach for robust control in autonomous racing. In particular, the TC-Driver agent is conditioned by a trajectory generated by any arbitrary traditional high-level planner. The proposed TC-Driver addresses the tire parameter modeling inaccuracies by exploiting the heuristic nature of RL while leveraging the reliability of traditional planning methods in a hierarchical control structure. We train the agent under varying tire conditions, allowing it to generalize to different model parameters, aiming to increase the racing capabilities of the system in practice. The proposed RL method outperforms a non-learning-based MPC with a 2.7 lower crash ratio in a model mismatch setting, underlining robustness to parameter discrepancies. In addition, the average RL inference duration is 0.25 ms compared to the average MPC solving time of 11.5 ms, yielding a nearly 40-fold speedup, allowing for complex control deployment in computationally constrained devices. Lastly, we show that the frequently utilized end-to-end RL architecture, as a control policy directly learned from sensory input, is well suited to neither model mismatch robustness nor track generalization. Our realistic simulations show that TC-Driver achieves a 6.7 and 3-fold lower crash ratio under model mismatch and track generalization settings, while simultaneously achieving lower lap times than an end-to-end approach, demonstrating the viability of TC-Driver for robust autonomous racing.

Slowly Changing Adversarial Bandit Algorithms are Efficient for Discounted MDPs

Authors:Ian A. Kash, Lev Reyzin, Zishun Yu
Date:2022-05-18 16:40:30

Reinforcement learning generalizes multi-armed bandit problems with additional difficulties of a longer planning horizon and unknown transition kernel. We explore a black-box reduction from discounted infinite-horizon tabular reinforcement learning to multi-armed bandits, where, specifically, an independent bandit learner is placed in each state. We show that, under ergodicity and fast mixing assumptions, any slowly changing adversarial bandit algorithm achieving optimal regret in the adversarial bandit setting can also attain optimal expected regret in infinite-horizon discounted Markov decision processes, with respect to the number of rounds $T$. Furthermore, we examine our reduction using a specific instance of the exponential-weight algorithm.
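The structure of the reduction, one independent bandit learner per state, can be sketched as follows. This is a toy illustration under several assumptions: the learner is a standard EXP3-style exponential-weight algorithm with uniform exploration, the two-state environment and its losses are made up, and each learner is fed only the immediate loss, whereas the reduction studied in the paper would supply feedback that accounts for discounted future value. The class and function names are hypothetical.

```python
import math
import random


class Exp3:
    """Exponential-weight bandit learner with importance-weighted loss estimates."""

    def __init__(self, n_actions, eta=0.1, gamma=0.05):
        self.log_w = [0.0] * n_actions  # log-weights, for numerical stability
        self.eta = eta                  # learning rate
        self.gamma = gamma              # uniform exploration mixing

    def probs(self):
        m = max(self.log_w)
        exps = [math.exp(w - m) for w in self.log_w]
        z = sum(exps)
        n = len(exps)
        return [(1 - self.gamma) * e / z + self.gamma / n for e in exps]

    def act(self, rng):
        p = self.probs()
        return rng.choices(range(len(p)), weights=p)[0]

    def update(self, action, loss):
        # Only the chosen action is observed; divide by its probability
        # to obtain an unbiased loss estimate.
        p = self.probs()[action]
        self.log_w[action] -= self.eta * loss / p


def run(n_states=2, n_actions=2, steps=500, seed=0):
    """Place one independent learner in each state of a toy MDP (assumed dynamics)."""
    rng = random.Random(seed)
    learners = [Exp3(n_actions) for _ in range(n_states)]
    state = 0
    for _ in range(steps):
        action = learners[state].act(rng)
        loss = 0.0 if action == 0 else 1.0  # toy loss: action 0 is always better
        learners[state].update(action, loss)
        state = rng.randrange(n_states)     # toy ergodic transition, ignores action
    return learners


learners = run()
```

Because each learner's action distribution moves by a bounded amount per update, it changes slowly from round to round, which is the kind of property the abstract's "slowly changing" condition refers to.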

World Value Functions: Knowledge Representation for Multitask Reinforcement Learning

Authors:Geraud Nangue Tasse, Steven James, Benjamin Rosman
Date:2022-05-18 09:45:14

An open problem in artificial intelligence is how to learn and represent knowledge that is sufficient for a general agent that needs to solve multiple tasks in a given world. In this work we propose world value functions (WVFs), which are a type of general value function with mastery of the world - they represent not only how to solve a given task, but also how to solve any other goal-reaching task. To achieve this, we equip the agent with an internal goal space defined as all the world states where it experiences a terminal transition - a task outcome. The agent can then modify task rewards to define its own reward function, which provably drives it to learn how to achieve all achievable internal goals, and the value of doing so in the current task. We demonstrate a number of benefits of WVFs. When the agent's internal goal space is the entire state space, we demonstrate that the transition function can be inferred from the learned WVF, which allows the agent to plan using learned value functions. Additionally, we show that for tasks in the same world, a pretrained agent that has learned any WVF can then infer the policy and value function for any new task directly from its rewards. Finally, an important property for long-lived agents is the ability to reuse existing knowledge to solve new tasks. Using WVFs as the knowledge representation for learned tasks, we show that an agent is able to solve their logical combination zero-shot, resulting in a combinatorially increasing number of skills throughout their lifetime.

Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space

Authors:Kuan Fang, Patrick Yin, Ashvin Nair, Sergey Levine
Date:2022-05-17 06:58:17

General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments. To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach configurable goals for a wide range of tasks on command. However, such goal-conditioned policies are notoriously difficult and time-consuming to train from scratch. In this paper, we propose Planning to Practice (PTP), a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve. Our approach is based on two key ideas. First, we decompose the goal-reaching problem hierarchically, with a high-level planner that sets intermediate subgoals using conditional subgoal generators in the latent space for a low-level model-free policy. Second, we propose a hybrid approach which first pre-trains both the conditional subgoal generator and the policy on previously collected data through offline reinforcement learning, and then fine-tunes the policy via online exploration. This fine-tuning process is itself facilitated by the planned subgoals, which break down the original target task into short-horizon goal-reaching tasks that are significantly easier to learn. We conduct experiments in both simulation and the real world, in which the policy is pre-trained on demonstrations of short primitive behaviors and fine-tuned for temporally extended tasks that are unseen in the offline data. Our experimental results show that PTP can generate feasible sequences of subgoals that enable the policy to efficiently solve the target tasks.

GoalNet: Inferring Conjunctive Goal Predicates from Human Plan Demonstrations for Robot Instruction Following

Authors:Shreya Sharma, Jigyasa Gupta, Shreshth Tuli, Rohan Paul, Mausam
Date:2022-05-14 15:14:40

Our goal is to enable a robot to learn how to sequence its actions to perform tasks specified as natural language instructions, given successful demonstrations from a human partner. The ability to plan high-level tasks can be factored as (i) inferring specific goal predicates that characterize the task implied by a language instruction for a given world state and (ii) synthesizing a feasible goal-reaching action-sequence with such predicates. For the former, we leverage a neural network prediction model, while utilizing a symbolic planner for the latter. We introduce a novel neuro-symbolic model, GoalNet, for contextual and task dependent inference of goal predicates from human demonstrations and linguistic task descriptions. GoalNet combines (i) learning, where dense representations are acquired for language instruction and the world state that enables generalization to novel settings and (ii) planning, where the cause-effect modeling by the symbolic planner eschews irrelevant predicates facilitating multi-stage decision making in large domains. GoalNet demonstrates a significant improvement (51%) in the task completion rate in comparison to a state-of-the-art rule-based approach on a benchmark data set displaying linguistic variations, particularly for multi-stage instructions.

Provably Safe Deep Reinforcement Learning for Robotic Manipulation in Human Environments

Authors:Jakob Thumm, Matthias Althoff
Date:2022-05-12 18:51:07

Deep reinforcement learning (RL) has shown promising results in the motion planning of manipulators. However, no method guarantees the safety of highly dynamic obstacles, such as humans, in RL-based manipulator control. This lack of formal safety assurances prevents the application of RL for manipulators in real-world human environments. Therefore, we propose a shielding mechanism that ensures ISO-verified human safety while training and deploying RL algorithms on manipulators. We utilize a fast reachability analysis of humans and manipulators to guarantee that the manipulator comes to a complete stop before a human is within its range. Our proposed method guarantees safety and significantly improves the RL performance by preventing episode-ending collisions. We demonstrate the performance of our proposed method in simulation using human motion capture data.

Learning Generalized Policies Without Supervision Using GNNs

Authors:Simon Ståhlberg, Blai Bonet, Hector Geffner
Date:2022-05-12 10:28:46

We consider the problem of learning generalized policies for classical planning domains using graph neural networks from small instances represented in lifted STRIPS. The problem has been considered before but the proposed neural architectures are complex and the results are often mixed. In this work, we use a simple and general GNN architecture and aim at obtaining crisp experimental results and a deeper understanding: either the policy greedy in the learned value function achieves close to 100% generalization over instances larger than those used in training, or the failure must be understood, and possibly fixed, logically. For this, we exploit the relation established between the expressive power of GNNs and the $C_{2}$ fragment of first-order logic (namely, FOL with 2 variables and counting quantifiers). We find for example that domains with general policies that require more expressive features can be solved with GNNs once the states are extended with suitable "derived atoms" encoding role compositions and transitive closures that do not fit into $C_{2}$. The work follows the GNN approach for learning optimal general policies in a supervised fashion (Stahlberg, Bonet, Geffner, 2022); but the learned policies are no longer required to be optimal (which expands the scope, as many planning domains do not have general optimal policies) and are learned without supervision. Interestingly, value-based reinforcement learning methods that aim to produce optimal policies, do not always yield policies that generalize, as the goals of optimality and generality are in conflict in domains where optimal planning is NP-hard.

Learning to Brachiate via Simplified Model Imitation

Authors:Daniele Reda, Hung Yu Ling, Michiel van de Panne
Date:2022-05-08 19:44:19

Brachiation is the primary form of locomotion for gibbons and siamangs, in which these primates swing from tree limb to tree limb using only their arms. It is challenging to control because of the limited control authority, the required advance planning, and the precision of the required grasps. We present a novel approach to this problem using reinforcement learning, demonstrated on a finger-less 14-link planar model that learns to brachiate across challenging handhold sequences. Key to our method is the use of a simplified model, a point mass with a virtual arm, for which we first learn a policy that can brachiate across handhold sequences with a prescribed order. This facilitates the learning of the policy for the full model, for which it provides guidance by supplying an overall center-of-mass trajectory to imitate, as well as the timing of the holds. Lastly, the simplified model can also readily be used for planning suitable sequences of handholds in a given environment. Our results demonstrate brachiation motions with a variety of durations for the flight and hold phases, as well as emergent extra back-and-forth swings when this proves useful. The system is evaluated with a variety of ablations. The method enables future work towards more general 3D brachiation, as well as using simplified model imitation in other settings.

Learning Abstract and Transferable Representations for Planning

Authors:Steven James, Benjamin Rosman, George Konidaris
Date:2022-05-04 14:40:04

We are concerned with the question of how an agent can acquire its own representations from sensory data. We restrict our focus to learning representations for long-term planning, a class of problems that state-of-the-art learning methods are unable to solve. We propose a framework for autonomously learning state abstractions of an agent's environment, given a set of skills. Importantly, these abstractions are task-independent, and so can be reused to solve new tasks. We demonstrate how an agent can use an existing set of options to acquire representations from ego- and object-centric observations. These abstractions can immediately be reused by the same agent in new environments. We show how to combine these portable representations with problem-specific ones to generate a sound description of a specific task that can be used for abstract planning. Finally, we show how to autonomously construct a multi-level hierarchy consisting of increasingly abstract representations. Since these hierarchies are transferable, higher-order concepts can be reused in new tasks, relieving the agent from relearning them and improving sample efficiency. Our results demonstrate that our approach allows an agent to transfer previous knowledge to new tasks, improving sample efficiency as the number of tasks increases.

Multi-subgoal Robot Navigation in Crowds with History Information and Interactions

Authors:Xinyi Yu, Jianan Hu, Yuehai Fan, Wancai Zheng, Linlin Ou
Date:2022-05-04 11:24:49

Robot navigation in dynamic environments shared with humans is an important but challenging task, which suffers from performance deterioration as the crowd grows. In this paper, a multi-subgoal robot navigation approach based on deep reinforcement learning is proposed, which can reason about more comprehensive relationships among all agents (robot and humans). Specifically, the next position point is planned for the robot by introducing history information and interactions. First, based on a subgraph network, the history information of all agents is aggregated before encoding interactions through a graph neural network, so as to improve the ability of the robot to implicitly anticipate future scenarios. Furthermore, to reduce the probability of unreliable next position points, a selection module is placed after the policy network in the reinforcement learning framework. In addition, the next position point generated by the selection module satisfies the task requirements better than that obtained directly from the policy network. The experiments demonstrate that our approach outperforms state-of-the-art approaches in terms of both success rate and collision rate, especially in crowded human environments.

State Representation Learning for Goal-Conditioned Reinforcement Learning

Authors:Lorenzo Steccanella, Anders Jonsson
Date:2022-05-04 09:20:09

This paper presents a novel state representation for reward-free Markov decision processes. The idea is to learn, in a self-supervised manner, an embedding space where distances between pairs of embedded states correspond to the minimum number of actions needed to transition between them. Compared to previous methods, our approach does not require any domain knowledge, learning from offline and unlabeled data. We show how this representation can be leveraged to learn goal-conditioned policies, providing a notion of similarity between states and goals and a useful heuristic distance to guide planning and reinforcement learning algorithms. Finally, we empirically validate our method in classic control domains and multi-goal environments, demonstrating that our method can successfully learn representations in large and/or continuous domains.
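The distance target described above can be illustrated in a small tabular setting. The sketch below is not the authors' method, which learns an embedding from offline, unlabeled data. It only computes the quantity that embedding distances are meant to approximate, the minimum number of actions between two states, using breadth-first search. The transition graph `TRANSITIONS` and the function name are hypothetical.

```python
from collections import deque


def min_action_distance(transitions, start, goal):
    """Return the minimum number of actions from start to goal, or None if unreachable."""
    frontier = deque([(start, 0)])
    visited = {start}
    while frontier:
        state, steps = frontier.popleft()
        if state == goal:
            return steps
        for nxt in transitions.get(state, ()):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, steps + 1))
    return None


# Hypothetical deterministic environment: state -> states reachable in one action.
TRANSITIONS = {"A": ["B"], "B": ["C", "A"], "C": ["D"], "D": []}
```

Note that this quantity is generally asymmetric: in the toy graph, `D` is reachable from `A`, but `A` is not reachable from `D`.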

Go Back in Time: Generating Flashbacks in Stories with Event Temporal Prompts

Authors:Rujun Han, Hong Chen, Yufei Tian, Nanyun Peng
Date:2022-05-04 05:26:05

Stories or narratives are composed of a sequence of events. To compose interesting stories, professional writers often leverage a creative writing technique called flashback that inserts past events into current storylines, as we commonly observe in novels and plays. However, it is challenging for machines to generate flashbacks, as this requires a solid understanding of event temporal order (e.g. "feeling hungry" before "eat," not vice versa) and the creativity to arrange storylines so that earlier events do not always appear first in narrative order. Two major issues in existing systems exacerbate the challenges: 1) temporal bias in pretraining and story datasets that leads to monotonic event temporal orders; 2) lack of explicit guidance that helps machines decide where to insert flashbacks. We propose to address these issues using structured storylines to encode events and their pair-wise temporal relations (before, after and vague) as temporal prompts that guide how stories should unfold temporally. We leverage a Plan-and-Write framework enhanced by reinforcement learning to generate storylines and stories end-to-end. Evaluation results show that the proposed method can generate more interesting stories with flashbacks while maintaining textual diversity, fluency, and temporal coherence.

AlphaZero-Inspired Game Learning: Faster Training by Using MCTS Only at Test Time

Authors:Johannes Scheiermann, Wolfgang Konen
Date:2022-04-28 07:04:14

Recently, the seminal algorithms AlphaGo and AlphaZero have started a new era in game learning and deep reinforcement learning. While the achievements of AlphaGo and AlphaZero - playing Go and other complex games at superhuman level - are truly impressive, these architectures have the drawback that they require high computational resources. Many researchers are looking for methods that are similar to AlphaZero, but have lower computational demands and are thus more easily reproducible. In this paper, we pick an important element of AlphaZero - the Monte Carlo Tree Search (MCTS) planning stage - and combine it with temporal difference (TD) learning agents. We wrap MCTS for the first time around TD n-tuple networks and we use this wrapping only at test time to create versatile agents that at the same time keep the computational demands low. We apply this new architecture to several complex games (Othello, ConnectFour, Rubik's Cube) and show the advantages achieved with this AlphaZero-inspired MCTS wrapper. In particular, we present results showing that this agent is the first one trained on standard hardware (no GPU or TPU) to beat the very strong Othello program Edax up to and including level 7 (where most other learning-from-scratch algorithms could only defeat Edax up to level 2).

Hierarchical Control for Cooperative Teams in Competitive Autonomous Racing

Authors:Rishabh Saumil Thakkar, Aryaman Singh Samyal, David Fridovich-Keil, Zhe Xu, Ufuk Topcu
Date:2022-04-27 17:08:56

We investigate the problem of autonomous racing among teams of cooperative agents that are subject to realistic racing rules. Our work extends previous research on hierarchical control in head-to-head autonomous racing by considering a generalized version of the problem while maintaining the two-level hierarchical control structure. A high-level tactical planner constructs a discrete game that encodes the complex rules using simplified dynamics to produce a sequence of target waypoints. The low-level path planner uses these waypoints as a reference trajectory and computes high-resolution control inputs by solving a simplified formulation of a racing game with a simplified representation of the realistic racing rules. We explore two approaches for the low-level path planner: training a multi-agent reinforcement learning (MARL) policy and solving a linear-quadratic Nash game (LQNG) approximation. We evaluate our controllers on simple and complex tracks against three baselines: an end-to-end MARL controller, a MARL controller tracking a fixed racing line, and an LQNG controller tracking a fixed racing line. Quantitative results show our hierarchical methods outperform the baselines in terms of race wins, overall team performance, and compliance with the rules. Qualitatively, we observe the hierarchical controllers mimic actions performed by expert human drivers such as coordinated overtaking, defending against multiple opponents, and long-term planning for delayed advantages.

BATS: Best Action Trajectory Stitching

Authors:Ian Char, Viraj Mehta, Adam Villaflor, John M. Dolan, Jeff Schneider
Date:2022-04-26 01:48:32

The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts for developing algorithms in this area have revolved around introducing constraints to online reinforcement learning algorithms to ensure the actions of the learned policy are constrained to the logged data. In this work, we explore an alternative approach by planning on the fixed dataset directly. Specifically, we introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We do this by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on this constructed MDP, it becomes easy to identify which trajectories are advantageous to add to the MDP. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics. Finally, we demonstrate empirically how algorithms that uniformly constrain the learned policy to the entire dataset can result in unwanted behavior, and we show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem.
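The role of exact value iteration on a tabular MDP built from logged data can be sketched as follows. This is a minimal illustration under strong assumptions (deterministic transitions, a three-state toy problem), not the BATS algorithm itself. The names `LOGGED`, `STITCHED`, and the added `"stitch"` transition are hypothetical; the added transition stands in for a short trajectory that a learned dynamics model might propose.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Exact value iteration; transitions[s][a] = (next_state, reward), deterministic."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue  # terminal state keeps value 0
            best = max(r + gamma * values[s2] for s2, r in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


# Hypothetical logged data: s0 was never observed reaching the rewarding branch.
LOGGED = {
    "s0": {"stay": ("s0", 0.0)},
    "s1": {"go": ("s2", 1.0)},
    "s2": {},
}

# Same MDP plus one hypothetical stitched transition from s0 to s1.
STITCHED = {s: dict(a) for s, a in LOGGED.items()}
STITCHED["s0"]["stitch"] = ("s1", 0.0)
```

Comparing `value_iteration(LOGGED)["s0"]` with `value_iteration(STITCHED)["s0"]` shows whether the added transition is advantageous, which is the kind of check the abstract describes.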

"Think Before You Speak": Improving Multi-Action Dialog Policy by Planning Single-Action Dialogs

Authors:Shuo Zhang, Junzhou Zhao, Pinghui Wang, Yu Li, Yi Huang, Junlan Feng
Date:2022-04-25 07:55:53

Multi-action dialog policy (MADP), which generates multiple atomic dialog actions per turn, has been widely applied in task-oriented dialog systems to provide expressive and efficient system responses. Existing MADP models usually imitate action combinations from the labeled multi-action dialog samples. Due to data limitations, they generalize poorly toward unseen dialog flows. While interactive learning and reinforcement learning algorithms can be applied to incorporate external data sources of real users and user simulators, they take significant manual effort to build and suffer from instability. To address these issues, we propose Planning Enhanced Dialog Policy (PEDP), a novel multi-task learning framework that learns single-action dialog dynamics to enhance multi-action prediction. Our PEDP method employs model-based planning for conceiving what to express before deciding the current response through simulating single-action dialogs. Experimental results on the MultiWOZ dataset demonstrate that our fully supervised learning-based method achieves a solid task success rate of 90.6%, improving 3% compared to the state-of-the-art methods.

Road Traffic Law Adaptive Decision-making for Self-Driving Vehicles

Authors:Jiaxin Liu, Wenhui Zhou, Hong Wang, Zhong Cao, Wenhao Yu, Chengxiang Zhao, Ding Zhao, Diange Yang, Jun Li
Date:2022-04-25 03:04:04

Self-driving vehicles have their own intelligence to drive on open roads. However, vehicle managers, e.g., government or industrial companies, still need a way to tell these self-driving vehicles what behaviors are encouraged or forbidden. Unlike human drivers, current self-driving vehicles cannot understand traffic laws and thus rely on programmers manually writing the corresponding principles into the driving systems. This is inefficient and makes it hard to adapt to temporary traffic laws, especially when the vehicles use data-driven decision-making algorithms. Besides, current self-driving vehicle systems rarely take traffic law modification into consideration. This work aims to design a road traffic law adaptive decision-making method. The decision-making algorithm is designed based on reinforcement learning, in which the traffic rules are usually implicitly coded in deep neural networks. The main idea is to make self-driving vehicles adaptable to traffic laws through a law-adaptive backup policy. In this work, the natural language-based traffic laws are first translated into logical expressions using Linear Temporal Logic. Then, the system monitors in advance whether the self-driving vehicle may break the traffic laws by designing a long-term RL action space. Finally, a sample-based planning method re-plans the trajectory when the vehicle may break the traffic rules. The method is validated in a Beijing Winter Olympic Lane scenario and an overtaking case, built in the CARLA simulator. The results show that by adopting this method, self-driving vehicles can comply with newly issued or updated traffic laws effectively. This method helps self-driving vehicles be governed by digital traffic laws, which is necessary for the wide adoption of autonomous driving.

Adaptive Task Planning for Large-Scale Robotized Warehouses

Authors:Dingyuan Shi, Yongxin Tong, Zimu Zhou, Ke Xu, Wenzhe Tan, Hongbo Li
Date:2022-04-24 04:40:44

Robotized warehouses are deployed to automatically distribute millions of items brought by the massive logistic orders from e-commerce. A key to automated item distribution is to plan paths for robots, also known as task planning, where each task is to deliver racks with items to pickers for processing and then return the racks. Prior solutions are unfit for large-scale robotized warehouses due to their inflexibility to time-varying item arrivals and their low efficiency at high throughput. In this paper, we propose a new task planning problem called TPRW, which aims to minimize the end-to-end makespan that incorporates the entire item distribution pipeline, known as a fulfillment cycle. Direct extensions of state-of-the-art path finding methods are ineffective at solving the TPRW problem because they fail to adapt to the bottleneck variations of fulfillment cycles. In response, we propose Efficient Adaptive Task Planning, a framework for large-scale robotized warehouses with time-varying item arrivals. It adaptively selects racks to fulfill at each timestamp via reinforcement learning, accounting for the time-varying bottleneck of the fulfillment cycles. Then it finds paths for robots to transport the selected racks. The framework adopts a series of efficient optimizations on both time and memory to handle large-scale item throughput. Evaluations on both synthesized and real data show an improvement of $37.1\%$ in effectiveness and $75.5\%$ in efficiency over the state of the art.

Comparing Deep Reinforcement Learning Algorithms in Two-Echelon Supply Chains

Authors:Francesco Stranieri, Fabio Stella
Date:2022-04-20 16:33:01

In this study, we analyze and compare the performance of state-of-the-art deep reinforcement learning algorithms for solving the supply chain inventory management problem. This complex sequential decision-making problem consists of determining the optimal quantity of products to be produced and shipped across different warehouses over a given time horizon. In particular, we present a mathematical formulation of a two-echelon supply chain environment with stochastic and seasonal demand, which allows managing an arbitrary number of warehouses and product types. Through a rich set of numerical experiments, we compare the performance of different deep reinforcement learning algorithms under various supply chain structures, topologies, demands, capacities, and costs. The results of the experimental plan indicate that deep reinforcement learning algorithms outperform traditional inventory management strategies, such as the static (s, Q)-policy. Furthermore, this study provides detailed insight into the design and development of an open-source software library that provides a customizable environment for solving the supply chain inventory management problem using a wide range of data-driven approaches.
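For readers unfamiliar with the static (s, Q) baseline mentioned above: it places an order of fixed size Q whenever stock falls to or below the reorder point s. The sketch below simulates that rule for a single warehouse. It is an illustration only and does not reproduce the paper's two-echelon environment; zero lead time, lost sales, and the function name `simulate_sq_policy` are simplifying assumptions.

```python
def simulate_sq_policy(demands, s, Q, initial_stock):
    """Simulate a static (s, Q) policy, assuming zero lead time and lost sales.

    Returns (final_stock, orders_per_period, total_lost_demand).
    """
    stock = initial_stock
    orders = []
    lost = 0
    for demand in demands:
        if stock <= s:
            # Reorder point reached: order a fixed lot of size Q (arrives immediately).
            stock += Q
            orders.append(Q)
        else:
            orders.append(0)
        sold = min(stock, demand)
        lost += demand - sold  # unmet demand is lost, not backordered
        stock -= sold
    return stock, orders, lost
```

Because s and Q are fixed in advance, the rule cannot react to seasonal demand, which is one reason a learned policy may outperform it.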

Network Topology Optimization via Deep Reinforcement Learning

Authors:Zhuoran Li, Xing Wang, Ling Pan, Lin Zhu, Zhendong Wang, Junlan Feng, Chao Deng, Longbo Huang
Date:2022-04-19 07:45:07

Topology impacts important network performance metrics, including link utilization, throughput and latency, and is of central importance to network operators. However, due to the combinatorial nature of network topology, it is extremely difficult to obtain an optimal solution, especially since topology planning in networks also often comes with management-specific constraints. As a result, local optimization with hand-tuned heuristic methods from human experts is often adopted in practice. Yet, heuristic methods cannot cover the global topology design space while taking into account constraints, and cannot guarantee finding good solutions. In this paper, we propose a novel deep reinforcement learning (DRL) algorithm, called Advantage Actor Critic-Graph Searching (A2C-GS), for network topology optimization. A2C-GS consists of three novel components: a verifier to validate the correctness of a generated network topology, a graph neural network (GNN) to efficiently approximate topology rating, and a DRL actor layer to conduct a topology search. A2C-GS can efficiently search over a large topology space and output topologies with satisfactory performance. We conduct a case study based on a real network scenario, and our experimental results demonstrate the superior performance of A2C-GS in terms of both efficiency and performance.

Multi-UAV Collision Avoidance using Multi-Agent Reinforcement Learning with Counterfactual Credit Assignment

Authors:Shuangyao Huang, Haibo Zhang, Zhiyi Huang
Date:2022-04-19 00:28:51

Multi-UAV collision avoidance is a challenging task for UAV swarm applications due to the need for tight cooperation among swarm members for collision-free path planning. Centralized Training with Decentralized Execution (CTDE) in Multi-Agent Reinforcement Learning is a promising method for multi-UAV collision avoidance, in which the key challenge is to effectively learn decentralized policies that can maximize a global reward cooperatively. We propose a new multi-agent critic-actor learning scheme called MACA for UAV swarm collision avoidance. MACA uses a centralized critic to maximize the discounted global reward that considers both safety and energy efficiency, and an actor per UAV to find decentralized policies to avoid collisions. To solve the credit assignment problem in CTDE, we design a counterfactual baseline that marginalizes both an agent's state and action, making it possible to evaluate the importance of an agent in the joint observation-action space. To train and evaluate MACA, we design our own simulation environment MACAEnv to closely mimic the realistic behaviors of a UAV swarm. Simulation results show that MACA achieves more than 16% higher average reward than two state-of-the-art MARL algorithms and reduces failure rate by 90% and response time by over 99% compared to a conventional UAV swarm collision avoidance algorithm in all test scenarios.

CryoRL: Reinforcement Learning Enables Efficient Cryo-EM Data Collection

Authors:Quanfu Fan, Yilai Li, Yuguang Yao, John Cohn, Sijia Liu, Seychelle M. Vos, Michael A. Cianfrocco
Date:2022-04-15 17:00:06

Single-particle cryo-electron microscopy (cryo-EM) has become one of the mainstream structural biology techniques because of its ability to determine high-resolution structures of dynamic bio-molecules. However, cryo-EM data acquisition remains expensive and labor-intensive, requiring substantial expertise. Structural biologists need a more efficient and objective method to collect the best data in a limited time frame. In this work, we formulate the cryo-EM data collection task as an optimization problem. The goal is to maximize the total number of good images taken within a specified period. We show that reinforcement learning offers an effective way to plan cryo-EM data collection, successfully navigating heterogeneous cryo-EM grids. The approach we developed, cryoRL, demonstrates better performance than average users for data collection under similar settings.

Safe Reinforcement Learning Using Black-Box Reachability Analysis

Authors:Mahmoud Selim, Amr Alanwar, Shreyas Kousik, Grace Gao, Marco Pavone, Karl H. Johansson
Date:2022-04-15 10:51:09

Reinforcement learning (RL) is capable of sophisticated motion planning and control for robots in uncertain environments. However, state-of-the-art deep RL approaches typically lack safety guarantees, especially when the robot and environment models are unknown. To justify widespread deployment, robots must respect safety constraints without sacrificing performance. Thus, we propose a Black-box Reachability-based Safety Layer (BRSL) with three main components: (1) data-driven reachability analysis for a black-box robot model, (2) a trajectory rollout planner that predicts future actions and observations using an ensemble of neural networks trained online, and (3) a differentiable polytope collision check between the reachable set and obstacles that enables correcting unsafe actions. In simulation, BRSL outperforms other state-of-the-art safe RL methods on a Turtlebot 3, a quadrotor, a trajectory-tracking point mass, and a hexarotor in wind with an unsafe set adjacent to the area of highest reward.

Learning Design and Construction with Varying-Sized Materials via Prioritized Memory Resets

Authors:Yunfei Li, Tao Kong, Lei Li, Yi Wu
Date:2022-04-12 03:45:48

Can a robot autonomously learn to design and construct a bridge from varying-sized blocks without a blueprint? It is a challenging task with long horizon and sparse reward -- the robot has to figure out physically stable design schemes and feasible actions to manipulate and transport blocks. Due to diverse block sizes, the state space and action trajectories are vast to explore. In this paper, we propose a hierarchical approach for this problem. It consists of a reinforcement-learning designer to propose high-level building instructions and a motion-planning-based action generator to manipulate blocks at the low level. For high-level learning, we develop a novel technique, prioritized memory resetting (PMR) to improve exploration. PMR adaptively resets the state to those most critical configurations from a replay buffer so that the robot can resume training on partial architectures instead of from scratch. Furthermore, we augment PMR with auxiliary training objectives and fine-tune the designer with the locomotion generator. Our experiments in simulation and on a real deployed robotic system demonstrate that it is able to effectively construct bridges with blocks of varying sizes at a high success rate. Demos can be found at https://sites.google.com/view/bridge-pmr.

Automatically Learning Fallback Strategies with Model-Free Reinforcement Learning in Safety-Critical Driving Scenarios

Authors:Ugo Lecerf, Christelle Yemdji-Tchassi, Sébastien Aubert, Pietro Michiardi
Date:2022-04-11 15:34:49

When learning to behave in a stochastic environment where safety is critical, such as driving a vehicle in traffic, it is natural for human drivers to plan fallback strategies as a backup to use if ever there is an unexpected change in the environment. Knowing to expect the unexpected, and planning for such outcomes, increases our capability for being robust to unseen scenarios and may help prevent catastrophic failures. Control of Autonomous Vehicles (AVs) has a particular interest in knowing when and how to use fallback strategies in the interest of safety. Due to imperfect information available to an AV about its environment, it is important to have alternate strategies at the ready which might not have been deduced from the original training data distribution. In this paper we present a principled approach for a model-free Reinforcement Learning (RL) agent to capture multiple modes of behaviour in an environment. We introduce an extra pseudo-reward term to the reward model, to encourage exploration to areas of state-space different from areas privileged by the optimal policy. We base this reward term on a distance metric between the trajectories of agents, in order to force policies to focus on different areas of state-space than the initial exploring agent. Throughout the paper, we refer to this particular training paradigm as learning fallback strategies. We apply this method to an autonomous driving scenario, and show that we are able to learn useful policies that would have otherwise been missed out on during training, and unavailable to use when executing the control algorithm.
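One way to picture the pseudo-reward idea is a bonus that grows with the distance between a new agent's trajectory and the trajectory of the initial agent. The sketch below is an assumption-laden illustration, not the paper's formulation: the abstract does not specify the distance metric, so the mean Euclidean distance between time-aligned states, the weight `beta`, and both function names are hypothetical choices.

```python
import math


def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between time-aligned states of two trajectories."""
    n = min(len(traj_a), len(traj_b))
    if n == 0:
        return 0.0
    # zip stops at the shorter trajectory, so exactly n state pairs are compared.
    return sum(math.dist(a, b) for a, b in zip(traj_a, traj_b)) / n


def shaped_return(rewards, traj, reference_traj, beta=0.1):
    """Environment return plus a pseudo-reward for deviating from the reference trajectory."""
    return sum(rewards) + beta * trajectory_distance(traj, reference_traj)
```

With `beta = 0` this reduces to the ordinary return; larger values trade task reward for behavior that differs more from the reference agent.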

Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning

Authors:Carl Qi, Pieter Abbeel, Aditya Grover
Date:2022-04-07 17:16:52

The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approaches infers the (unknown) reward function via inverse reinforcement learning (IRL) and then maximizes this reward function via reinforcement learning (RL). The policies learned via these approaches are, however, very brittle in practice and deteriorate quickly even with small test-time perturbations due to compounding errors. We propose Imitation with Planning at Test-time (IMPLANT), a new meta-algorithm for imitation learning that utilizes decision-time planning to correct for compounding errors of any base imitation policy. In contrast to existing approaches, we retain both the imitation policy and the reward model at decision-time, thereby benefiting from the learning signal of the two components. Empirically, we demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments and excels at zero-shot generalization when subject to challenging perturbations in test-time dynamics.

Distributed Reinforcement Learning for Robot Teams: A Review

Authors:Yutong Wang, Mehul Damani, Pamela Wang, Yuhong Cao, Guillaume Sartoretti
Date:2022-04-07 15:34:19

Purpose of review: Recent advances in sensing, actuation, and computation have opened the door to multi-robot systems consisting of hundreds/thousands of robots, with promising applications to automated manufacturing, disaster relief, harvesting, last-mile delivery, port/airport operations, or search and rescue. The community has leveraged model-free multi-agent reinforcement learning (MARL) to devise efficient, scalable controllers for multi-robot systems (MRS). This review aims to provide an analysis of the state-of-the-art in distributed MARL for multi-robot cooperation. Recent findings: Decentralized MRS face fundamental challenges, such as non-stationarity and partial observability. Building upon the "centralized training, decentralized execution" paradigm, recent MARL approaches include independent learning, centralized critic, value decomposition, and communication learning approaches. Cooperative behaviors are demonstrated through AI benchmarks and fundamental real-world robotic capabilities such as multi-robot motion/path planning. Summary: This survey reports the challenges surrounding decentralized model-free MARL for multi-robot cooperation and existing classes of approaches. We present benchmarks and robotic applications along with a discussion on current open avenues for research.

A Framework for Following Temporal Logic Instructions with Unknown Causal Dependencies

Authors:Duo Xu, Faramarz Fekri
Date:2022-04-07 04:01:17

Teaching a deep reinforcement learning (RL) agent to follow instructions in multi-task environments is a challenging problem. We assume that the user defines every task by a linear temporal logic (LTL) formula. However, some causal dependencies in complex environments may be unknown to the user in advance. Hence, when a human user specifies instructions, the robot cannot solve the tasks by simply following the given instructions. In this work, we propose a hierarchical reinforcement learning (HRL) framework in which a symbolic transition model is learned to efficiently produce high-level plans that can guide the agent to efficiently solve different tasks. Specifically, the symbolic transition model is learned by inductive logic programming (ILP) to capture logic rules of state transitions. By planning over the product of the symbolic transition model and the automaton derived from the LTL formula, the agent can resolve causal dependencies and break a causally complex problem down into a sequence of simpler low-level sub-tasks. We evaluate the proposed framework on three environments in both discrete and continuous domains, showing advantages over previous representative methods.

Learning Pneumatic Non-Prehensile Manipulation with a Mobile Blower

Authors:Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Szymon Rusinkiewicz, Thomas Funkhouser
Date:2022-04-05 17:55:58

We investigate pneumatic non-prehensile manipulation (i.e., blowing) as a means of efficiently moving scattered objects into a target receptacle. Due to the chaotic nature of aerodynamic forces, a blowing controller must (i) continually adapt to unexpected changes from its actions, (ii) maintain fine-grained control, since the slightest misstep can result in large unintended consequences (e.g., scattering objects already in a pile), and (iii) infer long-range plans (e.g., moving the robot to strategic blowing locations). We tackle these challenges in the context of deep reinforcement learning, introducing a multi-frequency version of the spatial action maps framework. This allows for efficient learning of vision-based policies that effectively combine high-level planning and low-level closed-loop control for dynamic mobile manipulation. Experiments show that our system learns efficient behaviors for the task, demonstrating in particular that blowing achieves better downstream performance than pushing, and that our policies improve performance over baselines. Moreover, we show that our system naturally encourages emergent specialization between the different subpolicies spanning low-level fine-grained control and high-level planning. On a real mobile robot equipped with a miniature air blower, we show that our simulation-trained policies transfer well to a real environment and can generalize to novel objects.

CIRS: Bursting Filter Bubbles by Counterfactual Interactive Recommender System

Authors:Chongming Gao, Shiqi Wang, Shijun Li, Jiawei Chen, Xiangnan He, Wenqiang Lei, Biao Li, Yuan Zhang, Peng Jiang
Date:2022-04-04 06:20:48

While personalization increases the utility of recommender systems, it also brings the issue of filter bubbles. E.g., if the system keeps exposing and recommending the items that the user is interested in, it may also make the user feel bored and less satisfied. Existing work studies filter bubbles in static recommendation, where the effect of overexposure is hard to capture. In contrast, we believe it is more meaningful to study the issue in interactive recommendation and optimize long-term user satisfaction. Nevertheless, it is unrealistic to train the model online due to the high cost. As such, we have to leverage offline training data and disentangle the causal effect on user satisfaction. To achieve this goal, we propose a counterfactual interactive recommender system (CIRS) that augments offline reinforcement learning (offline RL) with causal inference. The basic idea is to first learn a causal user model on historical data to capture the overexposure effect of items on user satisfaction. It then uses the learned causal user model to help the planning of the RL policy. To conduct evaluation offline, we innovatively create an authentic RL environment (KuaiEnv) based on a real-world fully observed user rating dataset. The experiments show the effectiveness of CIRS in bursting filter bubbles and achieving long-term success in interactive recommendation. The implementation of CIRS is available via https://github.com/chongminggao/CIRS-codes.

Learning High-DOF Reaching-and-Grasping via Dynamic Representation of Gripper-Object Interaction

Authors:Qijin She, Ruizhen Hu, Juzhan Xu, Min Liu, Kai Xu, Hui Huang
Date:2022-04-03 07:03:54

We approach the problem of high-DOF reaching-and-grasping via learning joint planning of grasp and motion with deep reinforcement learning. To resolve the sample efficiency issue in learning the high-dimensional and complex control of dexterous grasping, we propose an effective representation of grasping state characterizing the spatial interaction between the gripper and the target object. To represent gripper-object interaction, we adopt the Interaction Bisector Surface (IBS), which is the Voronoi diagram between two close-by 3D geometric objects and has been successfully applied in characterizing spatial relations between 3D objects. We found that IBS is surprisingly effective as a state representation since it well informs the fine-grained control of each finger with spatial relation against the target object. This novel grasp representation, together with several technical contributions including a fast IBS approximation, a novel vector-based reward and an effective training strategy, facilitates learning a strong control model of high-DOF grasping with good sample efficiency, dynamic adaptability, and cross-category generality. Experiments show that it generates high-quality dexterous grasps for complex shapes with smooth grasping motions.

Learning to Accelerate by the Methods of Step-size Planning

Authors:Hengshuai Yao
Date:2022-04-01 19:59:40

Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation. The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and Hypergradient descent, and the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods of accelerating gradient descent that are distinct from existing techniques. The new methods, which we call step-size planning, use the update experience to learn an improved way of updating the parameters. The methods organize the experience into $K$ steps away from each other to facilitate planning. From the past experience, our planning algorithm, Csawg, learns a step-size model which is a form of multi-step machine that predicts future updates. We extend Csawg to apply step-size planning over multiple steps, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large scale applications. We show for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu, L$ are the strongly convex factor of the loss function $F$ and the Lipschitz constant of $F'$, which is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods achieve zero error below 500 gradient evaluations, while gradient descent takes about 10000 gradient evaluations to reach a $10^{-3}$ accuracy. We discuss the connection of step-size planning to planning in reinforcement learning, in particular, Dyna architectures. (This is a shorter abstract than in the paper because of length requirement)
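
As a concrete reference point for the step-size adaptation methods listed above, the following is a minimal sketch of the Polyak step-size on an ill-conditioned quadratic. It does not implement Csawg or step-size planning. The matrix `A`, the starting point, and the number of steps are illustrative assumptions, and the optimal value `f_star` is assumed to be known, which the Polyak rule requires.

```python
import numpy as np

# Ill-conditioned convex quadratic f(x) = 0.5 * x^T A x (condition number 100),
# minimized at x* = 0 with optimal value f* = 0. Chosen only for illustration.
A = np.diag([1.0, 100.0])


def f(x):
    return 0.5 * x @ A @ x


def grad(x):
    return A @ x


def polyak_descent(x0, steps, f_star=0.0):
    """Gradient descent with the Polyak step-size (f(x) - f*) / ||grad f(x)||^2."""
    x = np.asarray(x0, dtype=float)
    distances = [float(np.linalg.norm(x))]  # distance to the minimizer x* = 0
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = float(g @ g)
        if g_norm_sq == 0.0:  # gradient vanished: already at the minimizer
            break
        step = (f(x) - f_star) / g_norm_sq
        x = x - step * g
        distances.append(float(np.linalg.norm(x)))
    return x, distances


x_final, distances = polyak_descent([1.0, 1.0], steps=50)
print(f(x_final), distances[-1])
```

For convex problems this rule never increases the distance to the minimizer, which is the property the accompanying checks rely on. The loss itself need not decrease at every step.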

DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools

Authors:Xingyu Lin, Zhiao Huang, Yunzhu Li, Joshua B. Tenenbaum, David Held, Chuang Gan
Date:2022-03-31 17:59:38

We consider the problem of sequential robotic manipulation of deformable objects using tools. Previous works have shown that differentiable physics simulators provide gradients to the environment state and help trajectory optimization to converge orders of magnitude faster than model-free reinforcement learning algorithms for deformable object manipulation. However, such gradient-based trajectory optimization typically requires access to the full simulator states and can only solve short-horizon, single-skill tasks due to local optima. In this work, we propose a novel framework, named DiffSkill, that uses a differentiable physics simulator for skill abstraction to solve long-horizon deformable object manipulation tasks from sensory observations. In particular, we first obtain short-horizon skills using individual tools from a gradient-based optimizer, using the full state information in a differentiable simulator; we then learn a neural skill abstractor from the demonstration trajectories which takes RGBD images as input. Finally, we plan over the skills by finding the intermediate goals and then solve long-horizon tasks. We show the advantages of our method in a new set of sequential deformable object manipulation tasks compared to previous reinforcement learning algorithms and compared to the trajectory optimizer.

A Cooperative Optimal Control Framework for Connected and Automated Vehicles in Mixed Traffic Using Social Value Orientation

Authors:Viet-Anh Le, Andreas A. Malikopoulos
Date:2022-03-31 15:26:44

In this paper, we develop a socially cooperative optimal control framework to address the motion planning problem for connected and automated vehicles (CAVs) in mixed traffic using social value orientation (SVO) and a potential game approach. In the proposed framework, we formulate the interaction between a CAV and a human-driven vehicle (HDV) as a simultaneous game where each vehicle minimizes a weighted sum of its egoistic objective and a cooperative objective. The SVO angles are used to quantify preferences of the vehicles toward the egoistic and cooperative objectives. Using the potential game approach, we propose a single objective function for the optimal control problem whose weighting factors are chosen based on the SVOs of the vehicles. We prove that a Nash equilibrium can be obtained by minimizing the proposed objective function. To estimate the SVO angle of the HDV, we develop a moving horizon estimation algorithm based on maximum entropy inverse reinforcement learning. The effectiveness of the proposed approach is demonstrated by numerical simulations of a vehicle merging scenario.
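
The weighted-sum objective mentioned above can be sketched as follows. The cosine and sine weighting is a common way to parameterize SVO and is an assumption here; the paper's exact weighting factors may differ. The cost values are hypothetical.

```python
import math


def svo_cost(svo_angle, egoistic_cost, cooperative_cost):
    """Weighted sum of an egoistic and a cooperative cost.

    Assumption: the weights are cos(angle) and sin(angle), so an angle of 0
    is purely egoistic and an angle of pi/2 is purely cooperative.
    """
    return (math.cos(svo_angle) * egoistic_cost
            + math.sin(svo_angle) * cooperative_cost)


# Hypothetical costs of one candidate merging maneuver.
ego, coop = 2.0, 6.0
for name, angle in [("egoistic", 0.0),
                    ("prosocial", math.pi / 4),
                    ("altruistic", math.pi / 2)]:
    print(name, round(svo_cost(angle, ego, coop), 3))
```

Sweeping the angle shows how the same maneuver is scored differently depending on how much weight a vehicle places on the cooperative objective.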

End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps

Authors:Ke Guo, Wenxi Liu, Jia Pan
Date:2022-03-31 09:24:32

In this paper, we aim to forecast a future trajectory distribution of a moving agent in the real world, given the social scene images and historical trajectories. Yet, it is a challenging task because the ground-truth distribution is unknown and unobservable, while only one of its samples can be applied for supervising model learning, which is prone to bias. Most recent works focus on predicting diverse trajectories in order to cover all modes of the real distribution, but they may neglect precision and thus give too much credit to unrealistic predictions. To address the issue, we learn the distribution with symmetric cross-entropy using occupancy grid maps as an explicit and scene-compliant approximation to the ground-truth distribution, which can effectively penalize unlikely predictions. Specifically, we present an inverse reinforcement learning based multi-modal trajectory distribution forecasting framework that learns to plan by an approximate value iteration network in an end-to-end manner. Besides, based on the predicted distribution, we generate a small set of representative trajectories through a differentiable Transformer-based network, whose attention mechanism helps to model the relations of trajectories. In experiments, our method achieves state-of-the-art performance on the Stanford Drone Dataset and Intersection Drone Dataset.
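
A symmetric cross-entropy between two distributions over grid cells can be sketched as below. The definition used, H(p, q) + H(q, p), is one common form and is an assumption; the paper may weight or normalize the two terms differently. The two four-cell grids are hypothetical.

```python
import numpy as np


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), with q clipped to avoid log(0)."""
    return float(-np.sum(p * np.log(np.clip(q, eps, 1.0))))


def symmetric_cross_entropy(p, q):
    """Sum of the cross-entropy in both directions (assumed definition)."""
    return cross_entropy(p, q) + cross_entropy(q, p)


# Hypothetical flattened 2x2 grids: probability that the agent occupies each cell.
predicted = np.array([0.7, 0.1, 0.1, 0.1])
reference = np.array([0.4, 0.4, 0.1, 0.1])
print(symmetric_cross_entropy(predicted, reference))
```

The second term, H(q, p), is what penalizes probability mass that the prediction places on cells the reference considers unlikely.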

Maze Learning using a Hyperdimensional Predictive Processing Cognitive Architecture

Authors:Alexander Ororbia, M. Alex Kelly
Date:2022-03-31 04:44:28

We present the COGnitive Neural GENerative system (CogNGen), a cognitive architecture that combines two neurobiologically-plausible, computational models: predictive processing and hyperdimensional/vector-symbolic models. We draw inspiration from architectures such as ACT-R and Spaun/Nengo. CogNGen is in broad agreement with these, providing a level of detail between ACT-R's high-level symbolic description of human cognition and Spaun's low-level neurobiological description, furthermore creating the groundwork for designing agents that learn continually from diverse tasks and model human performance at larger scales than what is possible with current systems. We test CogNGen on four maze-learning tasks, including those that test memory and planning, and find that CogNGen matches performance of deep reinforcement learning models and exceeds on a task designed to test memory.

Learning Minimum-Time Flight in Cluttered Environments

Authors:Robert Penicka, Yunlong Song, Elia Kaufmann, Davide Scaramuzza
Date:2022-03-28 19:41:13

We tackle the problem of minimum-time flight for a quadrotor through a sequence of waypoints in the presence of obstacles while exploiting the full quadrotor dynamics. Early works relied on simplified dynamics or polynomial trajectory representations that did not exploit the full actuator potential of the quadrotor, and, thus, resulted in suboptimal solutions. Recent works can plan minimum-time trajectories; yet, the trajectories are executed with control methods that do not account for obstacles. Thus, a successful execution of such trajectories is prone to errors due to model mismatch and in-flight disturbances. To this end, we leverage deep reinforcement learning and classical topological path planning to train robust neural-network controllers for minimum-time quadrotor flight in cluttered environments. The resulting neural network controller demonstrates substantially better performance of up to 19% over state-of-the-art methods. More importantly, the learned policy solves the planning and control problem simultaneously online to account for disturbances, thus achieving much higher robustness. As such, the presented method achieves 100% success rate of flying minimum-time policies without collision, while traditional planning and control approaches achieve only 40%. The proposed method is validated in both simulation and the real world, with quadrotor speeds of up to 42 km/h and accelerations of 3.6g.

Aggressive Quadrotor Flight Using Curiosity-Driven Reinforcement Learning

Authors:Qiyu Sun, Jinbao Fang, Wei Xing Zheng, Yang Tang
Date:2022-03-26 09:29:23

The ability to perform aggressive movements, which are called aggressive flights, is important for quadrotors during navigation. However, aggressive quadrotor flights are still a great challenge to practical applications. The existing solutions to aggressive flights heavily rely on a predefined trajectory, which is a time-consuming preprocessing step. To avoid such path planning, we propose a curiosity-driven reinforcement learning method for aggressive flight missions and a similarity-based curiosity module is introduced to speed up the training procedure. A branch structure exploration (BSE) strategy is also applied to guarantee the robustness of the policy and to ensure the policy trained in simulations can be performed in real-world experiments directly. The experimental results in simulations demonstrate that our reinforcement learning algorithm performs well in aggressive flight tasks, speeds up the convergence process and improves the robustness of the policy. Besides, our algorithm shows a satisfactory simulated to real transferability and performs well in real-world experiments.

Unsupervised Learning of Temporal Abstractions with Slot-based Transformers

Authors:Anand Gopalakrishnan, Kazuki Irie, Jürgen Schmidhuber, Sjoerd van Steenkiste
Date:2022-03-25 10:59:46

The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in a purely unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about sub-routine boundary points in light of new incoming information. In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module and adaptive computation for learning about the number of such sub-routines in an unsupervised fashion. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, even for sequences containing variable amounts of sub-routines, while being up to 7x faster to train on existing benchmarks.

Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Authors:Zihan Zhang, Xiangyang Ji, Simon S. Du
Date:2022-03-24 08:14:12

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound independent of the planning horizon. Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or have an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.

Optimizing Trajectories for Highway Driving with Offline Reinforcement Learning

Authors:Branka Mirchevska, Moritz Werling, Joschka Boedecker
Date:2022-03-21 13:13:08

Implementing an autonomous vehicle that is able to output feasible, smooth and efficient trajectories is a long-standing challenge. Several approaches have been considered, roughly falling under two categories: rule-based and learning-based approaches. The rule-based approaches, while guaranteeing safety and feasibility, fall short when it comes to long-term planning and generalization. The learning-based approaches are able to account for long-term planning and generalization to unseen situations, but may fail to achieve smoothness, safety and the feasibility which rule-based approaches ensure. Hence, combining the two approaches is an evident step towards yielding the best compromise out of both. We propose a Reinforcement Learning-based approach, which learns target trajectory parameters for fully autonomous driving on highways. The trained agent outputs continuous trajectory parameters based on which a feasible polynomial-based trajectory is generated and executed. We compare the performance of our agent against four other highway driving agents. The experiments are conducted in the Sumo simulator, taking into consideration various realistic, dynamically changing highway scenarios, including surrounding vehicles with different driver behaviors. We demonstrate that our offline trained agent, with randomly collected data, learns to drive smoothly, achieving velocities as close as possible to the desired velocity, while outperforming the other agents. Code, training data and details available at: https://nrgit.informatik.uni-freiburg.de/branka.mirchevska/offline-rl-tp.

Long Short-Term Memory for Spatial Encoding in Multi-Agent Path Planning

Authors:Marc R. Schlichting, Stefan Notter, Walter Fichter
Date:2022-03-21 09:16:56

Reinforcement learning-based path planning for multi-agent systems of varying size constitutes a research topic with increasing significance as progress in domains such as urban air mobility and autonomous aerial vehicles continues. Reinforcement learning with continuous state and action spaces is used to train a policy network that accommodates desirable path planning behaviors and can be used for time-critical applications. A Long Short-Term Memory module is proposed to encode an unspecified number of states for a varying, indefinite number of agents. The described training strategies and policy architecture lead to a guidance that scales to an infinite number of agents and unlimited physical dimensions, although training takes place at a smaller scale. The guidance is implemented on a low-cost, off-the-shelf onboard computer. The feasibility of the proposed approach is validated by presenting flight test results of up to four drones, autonomously navigating collision-free in a real-world environment.

Learning on the Job: Long-Term Behavioural Adaptation in Human-Robot Interactions

Authors:Francesco Del Duchetto, Marc Hanheide
Date:2022-03-20 10:21:52

In this work, we propose a framework for allowing autonomous robots deployed for extended periods of time in public spaces to adapt their own behaviour online from user interactions. The robot behaviour planning is embedded in a Reinforcement Learning (RL) framework, where the objective is maximising the level of overall user engagement during the interactions. We use the Upper-Confidence-Bound Value-Iteration (UCBVI) algorithm, which gives a helpful way of managing the exploration-exploitation trade-off for real-time interactions. An engagement model trained end-to-end generates the reward function in real-time during policy execution. We test this approach in a public museum in Lincoln (UK), where the robot is deployed as a tour guide for the visitors. Results show that after a couple of months of exploration, the robot policy learned to maintain the engagement of users for longer, with an increase of 22.8% over the initial static policy in the number of items visited during the tour and a 30% increase in the probability of completing the tour. This work is a promising step toward behavioural adaptation in long-term scenarios for robotics applications in social settings.

Skill-based Multi-objective Reinforcement Learning of Industrial Robot Tasks with Planning and Knowledge Integration

Authors:Matthias Mayr, Faseeh Ahmad, Konstantinos Chatzilygeroudis, Luigi Nardi, Volker Krueger
Date:2022-03-18 16:03:27

In modern industrial settings with small batch sizes it should be easy to set up a robot system for a new task. Strategies exist, e.g. the use of skills, but when it comes to handling forces and torques, these systems often fall short. We introduce an approach that provides a combination of task-level planning with targeted learning of scenario-specific parameters for skill-based systems. We propose the following pipeline: (1) the user provides a task goal in the planning language PDDL, (2) a plan (i.e., a sequence of skills) is generated and the learnable parameters of the skills are automatically identified. An operator then chooses (3) reward functions and hyperparameters for the learning process. Two aspects of our methodology are critical: (a) learning is tightly integrated with a knowledge framework to support symbolic planning and to provide priors for learning, (b) using multi-objective optimization. This can help to balance key performance indicators (KPIs) such as safety and task performance since they can often affect each other. We adopt a multi-objective Bayesian optimization approach and learn entirely in simulation. We demonstrate the efficacy and versatility of our approach by learning skill parameters for two different contact-rich tasks. We show their successful execution on a real 7-DOF KUKA-iiwa manipulator and outperform the manual parameterization by human robot operators.

Investigating Compounding Prediction Errors in Learned Dynamics Models

Authors:Nathan Lambert, Kristofer Pister, Roberto Calandra
Date:2022-03-17 22:24:38

Accurately predicting the consequences of agents' actions is a key prerequisite for planning in robotic control. Model-based reinforcement learning (MBRL) is one paradigm which relies on the iterative learning and prediction of state-action transitions to solve a task. Deep MBRL has become a popular candidate, using a neural network to learn a dynamics model that predicts with each pass from high-dimensional states to actions. These "one-step" predictions are known to become inaccurate over longer horizons of composed prediction - called the compounding error problem. Given the prevalence of the compounding error problem in MBRL and related fields of data-driven control, we set out to understand the properties of and conditions causing these long-horizon errors. In this paper, we explore the effects of subcomponents of a control problem on long term prediction error: including choosing a system, collecting data, and training a model. These detailed quantitative studies on simulated and real-world data show that the underlying dynamics of a system are the strongest factor determining the shape and magnitude of prediction error. Given a clearer understanding of compounding prediction error, researchers can implement new types of models beyond "one-step" that are more useful for control.
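
The compounding error problem can be seen in a toy setting. The sketch below composes a one-step linear model over a horizon and compares it with the true system; the scalar dynamics and the 1% one-step model error are illustrative assumptions, not values from the paper.

```python
def rollout(coefficient, x0, horizon):
    """Compose the one-step map x_{t+1} = coefficient * x_t for `horizon` steps."""
    states = [x0]
    for _ in range(horizon):
        states.append(coefficient * states[-1])
    return states


TRUE_COEFF = 1.00     # hypothetical true system
LEARNED_COEFF = 1.01  # hypothetical learned model with a 1% one-step error

true_states = rollout(TRUE_COEFF, x0=1.0, horizon=50)
predicted_states = rollout(LEARNED_COEFF, x0=1.0, horizon=50)

# Absolute prediction error at each horizon length, starting from horizon 0.
errors = [abs(p - t) for p, t in zip(predicted_states, true_states)]

for h in (1, 10, 50):
    print(h, round(errors[h], 4))
```

Because each prediction is fed back in as the next input, the error at horizon 50 ends up larger than fifty times the one-step error, even though the model is only slightly wrong per step.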

Blocks Assemble! Learning to Assemble with Large-Scale Structured Reinforcement Learning

Authors:Seyed Kamyar Seyed Ghasemipour, Daniel Freeman, Byron David, Shixiang Shane Gu, Satoshi Kataoka, Igor Mordatch
Date:2022-03-15 18:21:02

Assembly of multi-part physical structures is both a valuable end product for autonomous robotics, as well as a valuable diagnostic task for open-ended training of embodied intelligent agents. We introduce a naturalistic physics-based environment with a set of connectable magnet blocks inspired by children's toy kits. The objective is to assemble blocks into a succession of target blueprints. Despite the simplicity of this objective, the compositional nature of building diverse blueprints from a set of blocks leads to an explosion of complexity in structures that agents encounter. Furthermore, assembly stresses agents' multi-step planning, physical reasoning, and bimanual coordination. We find that the combination of large-scale reinforcement learning and graph-based policies -- surprisingly without any additional complexity -- is an effective recipe for training agents that not only generalize to complex unseen blueprints in a zero-shot manner, but even operate in a reset-free setting without being trained to do so. Through extensive experiments, we highlight the importance of large-scale training, structured representations, contributions of multi-task vs. single-task learning, as well as the effects of curriculums, and discuss qualitative behaviors of trained agents.

Adaptive Environment Modeling Based Reinforcement Learning for Collision Avoidance in Complex Scenes

Authors:Shuaijun Wang, Rui Gao, Ruihua Han, Shengduo Chen, Chengyang Li, Qi Hao
Date:2022-03-15 07:57:39

The major challenges of collision avoidance for robot navigation in crowded scenes lie in accurate environment modeling, fast perceptions, and trustworthy motion planning policies. This paper presents a novel adaptive environment model based collision avoidance reinforcement learning (i.e., AEMCARL) framework for an unmanned robot to achieve collision-free motions in challenging navigation scenarios. The novelty of this work is threefold: (1) developing a hierarchical network of gated-recurrent-unit (GRU) for environment modeling; (2) developing an adaptive perception mechanism with an attention module; (3) developing an adaptive reward function for the reinforcement learning (RL) framework to jointly train the environment model, perception function and motion planning policy. The proposed method is tested with the Gym-Gazebo simulator and a group of robots (Husky and Turtlebot) under various crowded scenes. Both simulation and experimental results have demonstrated the superior performance of the proposed method over baseline methods.

An Introduction to Multi-Agent Reinforcement Learning and Review of its Application to Autonomous Mobility

Authors:Lukas M. Schmidt, Johanna Brosig, Axel Plinge, Bjoern M. Eskofier, Christopher Mutschler
Date:2022-03-15 06:40:28

Many scenarios in mobility and traffic involve multiple different agents that need to cooperate to find a joint solution. Recent advances in behavioral planning use Reinforcement Learning to find effective and performant behavior strategies. However, as autonomous vehicles and vehicle-to-X communications become more mature, solutions that only utilize single, independent agents leave potential performance gains on the road. Multi-Agent Reinforcement Learning (MARL) is a research field that aims to find optimal solutions for multiple agents that interact with each other. This work aims to give an overview of the field to researchers in autonomous mobility. We first explain MARL and introduce important concepts. Then, we discuss the central paradigms that underlie MARL algorithms, and give an overview of state-of-the-art methods and ideas in each paradigm. With this background, we survey applications of MARL in autonomous mobility scenarios and give an overview of existing scenarios and implementations.

Precise atom manipulation through deep reinforcement learning

Authors:I-Ju Chen, Markus Aapro, Abraham Kipnis, Alexander Ilin, Peter Liljeroth, Adam S. Foster
Date:2022-03-14 10:24:43

Atomic-scale manipulation in scanning tunneling microscopy has enabled the creation of quantum states of matter based on artificial structures and extreme miniaturization of computational circuitry based on individual atoms. The ability to autonomously arrange atomic structures with precision will enable the scaling up of nanoscale fabrication and expand the range of artificial structures hosting exotic quantum states. However, the a priori unknown manipulation parameters, the possibility of spontaneous tip apex changes, and the difficulty of modeling tip-atom interactions make it challenging to select manipulation parameters that can achieve atomic precision throughout extended operations. Here we use deep reinforcement learning (DRL) to control the real-world atom manipulation process. Several state-of-the-art reinforcement learning techniques are used jointly to boost data efficiency. The reinforcement learning agent learns to manipulate Ag adatoms on Ag(111) surfaces with optimal precision and is integrated with path planning algorithms to complete an autonomous atomic assembly system. The results demonstrate that state-of-the-art deep reinforcement learning can offer effective solutions to real-world challenges in nanofabrication and powerful approaches to increasingly complex scientific experiments at the atomic scale.

SAGE: Generating Symbolic Goals for Myopic Models in Deep Reinforcement Learning

Authors:Andrew Chester, Michael Dann, Fabio Zambetta, John Thangarajah
Date:2022-03-09 22:55:53

Model-based reinforcement learning algorithms are typically more sample efficient than their model-free counterparts, especially in sparse reward problems. Unfortunately, many interesting domains are too complex to specify the complete models required by traditional model-based approaches. Learning a model takes a large number of environment samples, and may not capture critical information if the environment is hard to explore. If we could specify an incomplete model and allow the agent to learn how best to use it, we could take advantage of our partial understanding of many domains. Existing hybrid planning and learning systems which address this problem often impose highly restrictive assumptions on the sorts of models which can be used, limiting their applicability to a wide range of domains. In this work we propose SAGE, an algorithm combining learning and planning to exploit a previously unusable class of incomplete models. This combines the strengths of symbolic planning and neural learning approaches in a novel way that outperforms competing methods on variations of taxi world and Minecraft.

Graph-based Reinforcement Learning meets Mixed Integer Programs: An application to 3D robot assembly discovery

Authors:Niklas Funk, Svenja Menzenbach, Georgia Chalvatzaki, Jan Peters
Date:2022-03-08 14:44:51

Robot assembly discovery is a challenging problem that lives at the intersection of resource allocation and motion planning. The goal is to combine a predefined set of objects to form something new while considering task execution with the robot-in-the-loop. In this work, we tackle the problem of building arbitrary, predefined target structures entirely from scratch using a set of Tetris-like building blocks and a robotic manipulator. Our novel hierarchical approach aims at efficiently decomposing the overall task into three feasible levels that benefit mutually from each other. On the high level, we run a classical mixed-integer program for global optimization of block-type selection and the blocks' final poses to recreate the desired shape. Its output is then exploited to efficiently guide the exploration of an underlying reinforcement learning (RL) policy. This RL policy draws its generalization properties from a flexible graph-based representation that is learned through Q-learning and can be refined with search. Moreover, it accounts for the necessary conditions of structural stability and robotic feasibility that cannot be effectively reflected in the previous layer. Lastly, a grasp and motion planner transforms the desired assembly commands into robot joint movements. We demonstrate our proposed method's performance on a set of competitive simulated RAD environments, showcase real-world transfer, and report performance and robustness gains compared to an unstructured end-to-end approach. Videos are available at https://sites.google.com/view/rl-meets-milp .

Policy Regularization for Legible Behavior

Authors:Michele Persiani, Thomas Hellström
Date:2022-03-08 10:55:46

In Reinforcement Learning, interpretability generally means providing insight into the agent's mechanisms such that its decisions are understandable by an expert upon inspection. This definition, with the resulting methods from the literature, may however fall short for online settings where the fluency of interactions prohibits deep inspections of the decision-making algorithm. To support interpretability in online settings it is useful to borrow from the Explainable Planning literature methods that focus on the legibility of the agent, by making its intention easily discernible in an observer model. As we propose in this paper, injecting legible behavior inside an agent's policy doesn't require modifying components of its learning algorithm. Rather, the agent's optimal policy can be regularized for legibility by evaluating how the policy may produce observations that would make an observer infer an incorrect policy. In our formulation, the decision boundary introduced by legibility impacts the states in which the agent's policy returns an action that has high likelihood also in other policies. In these cases, a trade-off is made between such an action and a legible/sub-optimal action.

Reinforcement Learning for Location-Aware Scheduling

Authors:Stelios Stavroulakis, Biswa Sengupta
Date:2022-03-07 15:51:00

Recent techniques in dynamical scheduling and resource management have found applications in warehouse environments due to their ability to organize and prioritize tasks in a higher temporal resolution. The rise of deep reinforcement learning, as a learning paradigm, has enabled decentralized agent populations to discover complex coordination strategies. However, training multiple agents simultaneously introduces many obstacles in training as observation and action spaces become exponentially large. In our work, we experimentally quantify how various aspects of the warehouse environment (e.g., floor plan complexity, information about agents' live location, level of task parallelizability) affect performance and execution priority. To achieve efficiency, we propose a compact representation of the state and action space for location-aware multi-agent systems, wherein each agent has knowledge of only self and task coordinates, hence only partial observability of the underlying Markov Decision Process. Finally, we show how agents trained in certain environments maintain performance in completely unseen settings and also correlate performance degradation with floor plan geometry.

MIRROR: Differentiable Deep Social Projection for Assistive Human-Robot Communication

Authors:Kaiqi Chen, Jeffrey Fong, Harold Soh
Date:2022-03-06 05:01:00

Communication is a hallmark of intelligence. In this work, we present MIRROR, an approach to (i) quickly learn human models from human demonstrations, and (ii) use the models for subsequent communication planning in assistive shared-control settings. MIRROR is inspired by social projection theory, which hypothesizes that humans use self-models to understand others. Likewise, MIRROR leverages self-models learned using reinforcement learning to bootstrap human modeling. Experiments with simulated humans show that this approach leads to rapid learning and more robust models compared to existing behavioral cloning and state-of-the-art imitation learning methods. We also present a human-subject study using the CARLA simulator which shows that (i) MIRROR is able to scale to complex domains with high-dimensional observations and complicated world physics and (ii) provides effective assistive communication that enabled participants to drive more safely in adverse weather conditions.

Vision-based Distributed Multi-UAV Collision Avoidance via Deep Reinforcement Learning for Navigation

Authors:Huaxing Huang, Guijie Zhu, Zhun Fan, Hao Zhai, Yuwei Cai, Ze Shi, Zhaohui Dong, Zhifeng Hao
Date:2022-03-05 03:01:01

Online path planning for multiple unmanned aerial vehicle (multi-UAV) systems is considered a challenging task. It needs to ensure collision-free path planning in real-time, especially when the multi-UAV systems can become very crowded on certain occasions. In this paper, we present a vision-based decentralized collision-avoidance policy for multi-UAV systems, which takes depth images and inertial measurements as sensory inputs and outputs the UAV's steering commands. The policy is trained together with the latent representation of depth images using a policy gradient-based reinforcement learning algorithm and autoencoder in the multi-UAV three-dimensional workspaces. Each UAV follows the same trained policy and acts independently to reach the goal without colliding or communicating with other UAVs. We validate our policy in various simulated scenarios. The experimental results show that our learned policy can guarantee fully autonomous collision-free navigation for multi-UAV in the three-dimensional workspaces with good robustness and scalability.

Where to Look Next: Learning Viewpoint Recommendations for Informative Trajectory Planning

Authors:Max Lodel, Bruno Brito, Álvaro Serra-Gómez, Laura Ferranti, Robert Babuška, Javier Alonso-Mora
Date:2022-03-04 15:38:19

Search missions require motion planning and navigation methods for information gathering that continuously replan based on new observations of the robot's surroundings. Current methods for information gathering, such as Monte Carlo Tree Search, are capable of reasoning over long horizons, but they are computationally expensive. An alternative for fast online execution is to train, offline, an information gathering policy, which indirectly reasons about the information value of new observations. However, these policies lack safety guarantees and do not account for the robot dynamics. To overcome these limitations we train an information-aware policy via deep reinforcement learning, that guides a receding-horizon trajectory optimization planner. In particular, the policy continuously recommends a reference viewpoint to the local planner, such that the resulting dynamically feasible and collision-free trajectories lead to observations that maximize the information gain and reduce the uncertainty about the environment. In simulation tests in previously unseen environments, our method consistently outperforms greedy next-best-view policies and achieves competitive performance compared to Monte Carlo Tree Search, in terms of information gains and coverage time, with a reduction in execution time by three orders of magnitude.

Pareto Frontier Approximation Network (PA-Net) to Solve Bi-objective TSP

Authors:Ishaan Mehta, Sharareh Taghipour, Sajad Saeedi
Date:2022-03-02 18:25:45

The travelling salesperson problem (TSP) is a classic resource allocation problem used to find an optimal order of doing a set of tasks while minimizing (or maximizing) an associated objective function. It is widely used in robotics for applications such as planning and scheduling. In this work, we solve TSP for two objectives using reinforcement learning (RL). Often in multi-objective optimization problems, the associated objective functions can be conflicting in nature. In such cases, the optimality is defined in terms of Pareto optimality. A set of these Pareto optimal solutions in the objective space form a Pareto front (or frontier). Each solution has its trade-off. We present the Pareto frontier approximation network (PA-Net), a network that generates good approximations of the Pareto front for the bi-objective travelling salesperson problem (BTSP). Firstly, BTSP is converted into a constrained optimization problem. We then train our network to solve this constrained problem using the Lagrangian relaxation and policy gradient. With PA-Net we improve the performance over an existing deep RL-based method. The average improvement in the hypervolume metric, which is used to measure the optimality of the Pareto front, is 2.3%. At the same time, PA-Net has 4.5x faster inference time. Finally, we present the application of PA-Net to find optimal visiting order in a robotic navigation task/coverage planning. Our code is available on the project website.
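
To make the notions of Pareto front and hypervolume used above concrete, the following is a minimal sketch for two objectives. It assumes both objectives are minimized and that the hypervolume is measured against a user-chosen reference point; the function names `pareto_front` and `hypervolume_2d` are illustrative and are not taken from the PA-Net code.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, assuming both objectives are minimized."""
    front: List[Point] = []
    best_second = float("inf")
    # Sweep in increasing order of the first objective; a point is kept only
    # if it strictly improves the best second objective seen so far.
    for p in sorted(set(points)):
        if p[1] < best_second:
            front.append(p)
            best_second = p[1]
    return front


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by the front and bounded by the reference point `ref`."""
    # Only points that are strictly better than the reference contribute.
    front = [p for p in pareto_front(points) if p[0] < ref[0] and p[1] < ref[1]]
    area = 0.0
    for i, (f1, f2) in enumerate(front):
        # Each front point contributes a vertical strip up to the next point.
        next_f1 = front[i + 1][0] if i + 1 < len(front) else ref[0]
        area += (next_f1 - f1) * (ref[1] - f2)
    return area


if __name__ == "__main__":
    candidates = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]
    print(pareto_front(candidates))
    print(hypervolume_2d(candidates, ref=(5.0, 5.0)))
```

A larger hypervolume means the front covers more of the objective space below the reference point, which is why it can serve as a single-number summary when comparing two approximations of the same front.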

Imitation and Adaptation Based on Consistency: A Quadruped Robot Imitates Animals from Videos Using Deep Reinforcement Learning

Authors:Qingfeng Yao, Jilong Wang, Shuyu Yang, Cong Wang, Hongyin Zhang, Qifeng Zhang, Donglin Wang
Date:2022-03-02 15:27:10

The essence of quadrupeds' movements is the movement of the center of gravity, which has a pattern in the action of quadrupeds. However, the gait motion planning of the quadruped robot is time-consuming. Animals in nature can provide a large amount of gait information for robots to learn and imitate. Common methods learn animal posture with a motion capture system or numerous motion data points. In this paper, we propose a video imitation adaptation network (VIAN) that can imitate the action of animals and adapt it to the robot from a few seconds of video. The deep learning model extracts key points during animal motion from videos. The VIAN eliminates noise and extracts key information of motion with a motion adaptor, and then applies the extracted movements function as the motion pattern into deep reinforcement learning (DRL). To ensure similarity between the learning result and the animal motion in the video, we introduce rewards that are based on the consistency of the motion. DRL explores and learns to maintain balance from movement patterns from videos, imitates the action of animals, and eventually, allows the model to learn the gait or skills from short motion videos of different animals and to transfer the motion pattern to the real robot.

Hierarchical Reinforcement Learning with AI Planning Models

Authors:Junkyu Lee, Michael Katz, Don Joven Agravante, Miao Liu, Geraud Nangue Tasse, Tim Klinger, Shirin Sohrabi
Date:2022-03-01 18:38:41

Two common approaches to sequential decision-making are AI planning (AIP) and reinforcement learning (RL). Each has strengths and weaknesses. AIP is interpretable, easy to integrate with symbolic knowledge, and often efficient, but requires an up-front logical domain specification and is sensitive to noise; RL only requires specification of rewards and is robust to noise but is sample inefficient and not easily supplied with external knowledge. We propose an integrative approach that combines high-level planning with RL, retaining interpretability, transfer, and efficiency, while allowing for robust learning of the lower-level plan actions. Our approach defines options in hierarchical reinforcement learning (HRL) from AIP operators by establishing a correspondence between the state transition model of AI planning problem and the abstract state transition system of a Markov Decision Process (MDP). Options are learned by adding intrinsic rewards to encourage consistency between the MDP and AIP transition models. We demonstrate the benefit of our integrated approach by comparing the performance of RL and HRL algorithms in both MiniGrid and N-rooms environments, showing the advantage of our method over the existing ones.

A Theory of Abstraction in Reinforcement Learning

Authors:David Abel
Date:2022-03-01 12:46:28

Reinforcement learning defines the problem facing agents that learn to make good decisions through action and observation alone. To be effective problem solvers, such agents must efficiently explore vast worlds, assign credit from delayed feedback, and generalize to new experiences, all while making use of limited data, computational resources, and perceptual bandwidth. Abstraction is essential to all of these endeavors. Through abstraction, agents can form concise models of their environment that support the many practices required of a rational, adaptive decision maker. In this dissertation, I present a theory of abstraction in reinforcement learning. I first offer three desiderata for functions that carry out the process of abstraction: they should 1) preserve representation of near-optimal behavior, 2) be learned and constructed efficiently, and 3) lower planning or learning time. I then present a suite of new algorithms and analysis that clarify how agents can learn to abstract according to these desiderata. Collectively, these results provide a partial path toward the discovery and use of abstraction that minimizes the complexity of effective reinforcement learning.

Affordance Learning from Play for Sample-Efficient Policy Learning

Authors:Jessica Borja-Diaz, Oier Mees, Gabriel Kalweit, Lukas Hermann, Joschka Boedecker, Wolfram Burgard
Date:2022-03-01 11:00:35

Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal. To this end, we propose a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy learning and motion planning. We combine model-based planning with model-free deep reinforcement learning (RL) to learn policies that favor the same object regions favored by people, while requiring minimal robot interactions with the environment. We evaluate our algorithm, Visual Affordance-guided Policy Optimization (VAPO), with both diverse simulation manipulation tasks and real world robot tidy-up experiments to demonstrate the effectiveness of our affordance-guided policies. We find that our policies train 4x faster than the baselines and generalize better to novel objects because our visual affordance model can anticipate their affordance regions.

Hierarchical Policy Learning for Mechanical Search

Authors:Oussama Zenkri, Ngo Anh Vien, Gerhard Neumann
Date:2022-02-28 11:00:31

Retrieving objects from clutters is a complex task, which requires multiple interactions with the environment until the target object can be extracted. These interactions involve executing action primitives like grasping or pushing as well as setting priorities for the objects to manipulate and the actions to execute. Mechanical Search (MS) is a framework for object retrieval, which uses a heuristic algorithm for pushing and rule-based algorithms for high-level planning. While rule-based policies profit from human intuition in how they work, they usually perform sub-optimally in many cases. Deep reinforcement learning (RL) has shown great performance in complex tasks such as taking decisions through evaluating pixels, which makes it suitable for training policies in the context of object-retrieval. In this work, we first formulate the MS problem in a principled formulation as a hierarchical POMDP. Based on this formulation, we propose a hierarchical policy learning approach for the MS problem. For demonstration, we present two main parameterized sub-policies: a push policy and an action selection policy. When integrated into the hierarchical POMDP's policy, our proposed sub-policies increase the success rate of retrieving the target object from less than 32% to nearly 80%, while reducing the computation time for push actions from multiple seconds to less than 10 milliseconds.

Neural-Progressive Hedging: Enforcing Constraints in Reinforcement Learning with Stochastic Programming

Authors:Supriyo Ghosh, Laura Wynter, Shiau Hong Lim, Duc Thien Nguyen
Date:2022-02-27 19:39:19

We propose a framework, called neural-progressive hedging (NP), that leverages stochastic programming during the online phase of executing a reinforcement learning (RL) policy. The goal is to ensure feasibility with respect to constraints and risk-based objectives such as conditional value-at-risk (CVaR) during the execution of the policy, using probabilistic models of the state transitions to guide policy adjustments. The framework is particularly amenable to the class of sequential resource allocation problems since feasibility with respect to typical resource constraints cannot be enforced in a scalable manner. The NP framework provides an alternative that adds modest overhead during the online phase. Experimental results demonstrate the efficacy of the NP framework on two continuous real-world tasks: (i) the portfolio optimization problem with liquidity constraints for financial planning, characterized by non-stationary state distributions; and (ii) the dynamic repositioning problem in bike sharing systems, that embodies the class of supply-demand matching problems. We show that the NP framework produces policies that are better than deep RL and other baseline approaches, adapting to non-stationarity, whilst satisfying structural constraints and accommodating risk measures in the resulting policies. Additional benefits of the NP framework are ease of implementation and better explainability of the policies.
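
As a small illustration of the risk measure named above, the sketch below estimates conditional value-at-risk (CVaR) from sampled losses as the average of the worst tail. It is a simple tail-average estimator under assumed conventions (larger values are worse, and the tail is specified as a fraction of the samples); the function name `empirical_cvar` is hypothetical and not part of the NP framework.

```python
import math
from typing import Sequence


def empirical_cvar(losses: Sequence[float], tail_fraction: float) -> float:
    """Average of the worst `tail_fraction` share of the sampled losses."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be in (0, 1]")
    # Number of samples in the tail; always keep at least one sample.
    k = max(1, math.ceil(tail_fraction * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


if __name__ == "__main__":
    sampled_losses = [float(x) for x in range(1, 11)]
    print(empirical_cvar(sampled_losses, tail_fraction=0.2))
```

With `tail_fraction=1.0` the estimator reduces to the plain mean, so shrinking the fraction moves the objective from risk-neutral toward increasingly risk-averse.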

Decision Making in Non-Stationary Environments with Policy-Augmented Monte Carlo Tree Search

Authors:Geoffrey Pettet, Ayan Mukhopadhyay, Abhishek Dubey
Date:2022-02-25 22:31:37

Decision-making under uncertainty (DMU) is present in many important problems. An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time. Reinforcement Learning (RL), a popular approach for DMU problems, learns a policy by interacting with a model of the environment offline. Unfortunately, if the environment changes, the policy can become stale and take sub-optimal actions, and relearning the policy for the updated environment takes time and computational effort. An alternative is online planning approaches such as Monte Carlo Tree Search (MCTS), which perform their computation at decision time. Given the current environment, MCTS plans using high-fidelity models to determine promising action trajectories. These models can be updated as soon as environmental changes are detected to immediately incorporate them into decision making. However, MCTS's convergence can be slow for domains with large state-action spaces. In this paper, we present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses. Our approach, called Policy Augmented MCTS (PA-MCTS), integrates a policy's action-value estimates into MCTS, using the estimates to seed the action trajectories favored by the search. We hypothesize that PA-MCTS will converge more quickly than standard MCTS while making better decisions than the policy can make on its own when faced with non-stationary environments. We test our hypothesis by comparing PA-MCTS with pure MCTS and an RL agent applied to the classical CartPole environment. We find that PA-MCTS can achieve higher cumulative rewards than the policy in isolation under several environmental shifts while converging in significantly fewer iterations than pure MCTS.
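
One simple way to picture how a learned policy's action-value estimates and search estimates could be blended at decision time is a convex combination, sketched below. This is an illustrative assumption, not necessarily the rule used in PA-MCTS: the abstract does not specify the combination, and the names `combine_action_values`, `select_action`, and the weight `alpha` are hypothetical.

```python
from typing import Dict, Hashable

ActionValues = Dict[Hashable, float]


def combine_action_values(
    q_policy: ActionValues, q_search: ActionValues, alpha: float
) -> ActionValues:
    """Blend the two estimates: alpha weights the learned policy's values."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Assumes both dictionaries cover the same set of actions.
    return {
        action: alpha * q_policy[action] + (1.0 - alpha) * q_search[action]
        for action in q_search
    }


def select_action(
    q_policy: ActionValues, q_search: ActionValues, alpha: float
) -> Hashable:
    """Pick the action with the highest blended value."""
    combined = combine_action_values(q_policy, q_search, alpha)
    return max(combined, key=combined.get)


if __name__ == "__main__":
    stale_policy = {"left": 1.0, "right": 0.0}   # learned before the shift
    fresh_search = {"left": 0.2, "right": 0.8}   # estimated on the current model
    for weight in (0.0, 0.5, 1.0):
        print(weight, select_action(stale_policy, fresh_search, weight))
```

In this sketch, `alpha = 1` trusts the possibly stale policy entirely and `alpha = 0` falls back to pure search, so intermediate values trade off the policy's prior knowledge against estimates computed on the current environment model.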

Hierarchical Control for Head-to-Head Autonomous Racing

Authors:Rishabh Saumil Thakkar, Aryaman Singh Samyal, David Fridovich-Keil, Zhe Xu, Ufuk Topcu
Date:2022-02-25 18:11:52

We develop a hierarchical controller for head-to-head autonomous racing. We first introduce a formulation of a racing game with realistic safety and fairness rules. A high-level planner approximates the original formulation as a discrete game with simplified state, control, and dynamics to easily encode the complex safety and fairness rules and calculates a series of target waypoints. The low-level controller takes the resulting waypoints as a reference trajectory and computes high-resolution control inputs by solving an alternative formulation approximation with simplified objectives and constraints. We consider two approaches for the low-level planner, constructing two hierarchical controllers. One approach uses multi-agent reinforcement learning (MARL), and the other solves a linear-quadratic Nash game (LQNG) to produce control inputs. The controllers are compared against three baselines: an end-to-end MARL controller, a MARL controller tracking a fixed racing line, and an LQNG controller tracking a fixed racing line. Quantitative results show that the proposed hierarchical methods outperform their respective baseline methods in terms of head-to-head race wins and abiding by the rules. The hierarchical controller using MARL for low-level control consistently outperformed all other methods by winning over 90% of head-to-head races and more consistently adhered to the complex racing rules. Qualitatively, we observe the proposed controllers mimicking actions performed by expert human drivers such as shielding/blocking, overtaking, and long-term planning for delayed advantages. We show that hierarchical planning for game-theoretic reasoning produces competitive behavior even when challenged with complex rules and constraints.

Context-Hierarchy Inverse Reinforcement Learning

Authors:Wei Gao, David Hsu, Wee Sun Lee
Date:2022-02-25 10:29:05

An inverse reinforcement learning (IRL) agent learns to act intelligently by observing expert demonstrations and learning the expert's underlying reward function. Although learning the reward functions from demonstrations has achieved great success in various tasks, several other challenges are mostly ignored. Firstly, existing IRL methods try to learn the reward function from scratch without relying on any prior knowledge. Secondly, traditional IRL methods assume the reward functions are homogeneous across all the demonstrations. Some existing IRL methods managed to extend to the heterogeneous demonstrations. However, they still assume one hidden variable that affects the behavior and learn the underlying hidden variable together with the reward from demonstrations. To solve these issues, we present Context Hierarchy IRL (CHIRL), a new IRL algorithm that exploits the context to scale up IRL and learn reward functions of complex behaviors. CHIRL models the context hierarchically as a directed acyclic graph; it represents the reward function as a corresponding modular deep neural network that associates each network module with a node of the context hierarchy. The context hierarchy and the modular reward representation enable data sharing across multiple contexts and state abstraction, significantly improving the learning performance. CHIRL has a natural connection with hierarchical task planning when the context hierarchy represents subtask decomposition. It makes it possible to incorporate prior knowledge of causal dependencies between subtasks and to solve large complex tasks by decomposing them into several subtasks and conquering each subtask to solve the original task. Experiments on benchmark tasks, including a large scale autonomous driving task in the CARLA simulator, show promising results in scaling up IRL for tasks with complex reward functions.

Evolutionary Multi-Objective Reinforcement Learning Based Trajectory Control and Task Offloading in UAV-Assisted Mobile Edge Computing

Authors:Fuhong Song, Huanlai Xing, Xinhan Wang, Shouxi Luo, Penglin Dai, Zhiwen Xiao, Bowen Zhao
Date:2022-02-24 11:17:30

This paper studies the trajectory control and task offloading (TCTO) problem in an unmanned aerial vehicle (UAV)-assisted mobile edge computing system, where a UAV flies along a planned trajectory to collect computation tasks from smart devices (SDs). We consider a scenario in which SDs are not directly connected to the base station (BS) and the UAV has two roles to play: MEC server or wireless relay. The UAV makes task offloading decisions online, in which the collected tasks can be executed locally on the UAV or offloaded to the BS for remote processing. The TCTO problem involves multi-objective optimization as its objectives are to minimize the task delay and the UAV's energy consumption, and maximize the number of tasks collected by the UAV, simultaneously. This problem is challenging because the three objectives conflict with each other. The existing reinforcement learning (RL) algorithms, either single-objective RLs or single-policy multi-objective RLs, cannot address the problem well since they cannot output multiple policies for various preferences (i.e. weights) across objectives in a single run. This paper adapts the evolutionary multi-objective RL (EMORL), a multi-policy multi-objective RL, to the TCTO problem. This algorithm can output multiple optimal policies in just one run, each optimizing a certain preference. The simulation results demonstrate that the proposed algorithm can obtain better nondominated policies by striking a balance between the three objectives regarding policy quality, compared with two evolutionary and two multi-policy RL algorithms.

Cooperative Behavior Planning for Automated Driving using Graph Neural Networks

Authors:Marvin Klimke, Benjamin Völz, Michael Buchholz
Date:2022-02-23 09:36:15

Urban intersections are prone to delays and inefficiencies due to static precedence rules and occlusions limiting the view on prioritized traffic. Existing approaches to improve traffic flow, widely known as automatic intersection management systems, are mostly based on non-learning reservation schemes or optimization algorithms. Machine learning-based techniques show promising results in planning for a single ego vehicle. This work proposes to leverage machine learning algorithms to optimize traffic flow at urban intersections by jointly planning for multiple vehicles. Learning-based behavior planning poses several challenges, demanding a suitable input and output representation as well as large amounts of ground-truth data. We address the former issue by using a flexible graph-based input representation accompanied by a graph neural network. This allows the scene to be encoded efficiently and inherently provides individual outputs for all involved vehicles. To learn a sensible policy, without relying on the imitation of expert demonstrations, the cooperative planning task is considered as a reinforcement learning problem. We train and evaluate the proposed method in an open-source simulation environment for decision making in automated driving. Compared to a first-in-first-out scheme and traffic governed by static priority rules, the learned planner shows a significant gain in flow rate, while reducing the number of induced stops. In addition to synthetic simulations, the approach is also evaluated based on real-world traffic data taken from the publicly available inD dataset.

Reinforcement Learning in Practice: Opportunities and Challenges

Authors:Yuxi Li
Date:2022-02-23 03:58:46

This article is a gentle discussion about the field of reinforcement learning in practice, about opportunities and challenges, touching a broad range of topics, with perspectives and without technical details. The article is based on both historical and recent research papers, surveys, tutorials, talks, blogs, books, (panel) discussions, and workshops/conferences. Various groups of readers, like researchers, engineers, students, managers, investors, officers, and people wanting to know more about the field, may find the article interesting. In this article, we first give a brief introduction to reinforcement learning (RL), and its relationship with deep learning, machine learning and AI. Then we discuss opportunities of RL, in particular, products and services, games, bandits, recommender systems, robotics, transportation, finance and economics, healthcare, education, combinatorial optimization, computer systems, and science and engineering. Then we discuss challenges, in particular, 1) foundation, 2) representation, 3) reward, 4) exploration, 5) model, simulation, planning, and benchmarks, 6) off-policy/offline learning, 7) learning to learn a.k.a. meta-learning, 8) explainability and interpretability, 9) constraints, 10) software development and deployment, 11) business perspectives, and 12) more challenges. We conclude with a discussion, attempting to answer: "Why has RL not been widely adopted in practice yet?" and "When is RL helpful?".

Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning

Authors:Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan, Haifeng Xu
Date:2022-02-22 05:41:43

In today's economy, it becomes important for Internet platforms to consider the sequential information design problem to align their long-term interests with incentives of the gig service providers. This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs), where a sender, with informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximize the sender's cumulative utilities in a finite horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and induces the optimal long-term cumulative utilities of the sender. Nevertheless, in the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with the underlying MPP, without the knowledge of the sender's utility functions, prior distributions, and the Markov transition kernels. We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large space of outcomes and states via function approximation, and we showcase such a success under the linear setting.

Cooperative Artificial Intelligence

Authors:Tobias Baumann
Date:2022-02-20 16:50:37

In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings including social dilemmas. We argue that there is a need for research on the intersection between game theory and artificial intelligence, with the goal of achieving cooperative artificial intelligence that can navigate social dilemmas well. We consider the problem of how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the actions of the learners. We propose a rule for automatically learning how to create the right incentives by considering the anticipated parameter updates of each agent. Using this learning rule leads to cooperation with high social welfare in matrix games in which the agents would otherwise learn to defect with high probability. We show that the resulting cooperative outcome is stable in certain games even if the planning agent is turned off after a given number of episodes, while other games require ongoing intervention to maintain mutual cooperation. Finally, we reflect on what the goals of multi-agent reinforcement learning should be in the first place, and discuss the necessary building blocks towards the goal of building cooperative AI.

Learning to Help Emergency Vehicles Arrive Faster: A Cooperative Vehicle-Road Scheduling Approach

Authors:Lige Ding, Dong Zhao, Zhaofeng Wang, Guang Wang, Chang Tan, Lei Fan, Huadong Ma
Date:2022-02-20 10:25:15

The ever-increasing heavy traffic congestion potentially impedes the accessibility of emergency vehicles (EVs), resulting in detrimental impacts on critical services and even safety of people's lives. Hence, it is significant to propose an efficient scheduling approach to help EVs arrive faster. Existing vehicle-centric scheduling approaches aim to recommend the optimal paths for EVs based on the current traffic status while the road-centric scheduling approaches aim to improve the traffic condition and assign a higher priority for EVs to pass an intersection. With the intuition that real-time vehicle-road information interaction and strategy coordination can bring more benefits, we propose LEVID, a LEarning-based cooperative VehIcle-roaD scheduling approach including a real-time route planning module and a collaborative traffic signal control module, which interact with each other and make decisions iteratively. The real-time route planning module adapts the artificial potential field method to address the real-time changes of traffic signals and avoid falling into a local optimum. The collaborative traffic signal control module leverages a graph attention reinforcement learning framework to extract the latent features of different intersections and abstract their interplay to learn cooperative policies. Extensive experiments based on multiple real-world datasets show that our approach outperforms the state-of-the-art baselines.

Selective Credit Assignment

Authors:Veronica Chelu, Diana Borsa, Doina Precup, Hado van Hasselt
Date:2022-02-20 00:07:57

Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view on temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, as well as add new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.
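To make the idea of weighting learning updates concrete, here is a minimal sketch of tabular TD(0) in which each update is scaled by a per-state weight. The function name, the per-state form of the weighting, and the transition format are assumptions chosen for illustration; the abstract's framework is more general and covers backward and counterfactual credit assignment, which this sketch does not.

```python
import numpy as np


def weighted_td0(transitions, weights, n_states, alpha=0.1, gamma=0.9):
    """Tabular TD(0) where the update for state s is scaled by weights[s].

    transitions: iterable of (s, r, s_next, done) tuples.
    weights:     sequence of non-negative per-state weights (hypothetical
                 stand-in for a selective weighting function).
    """
    v = np.zeros(n_states)
    for s, r, s_next, done in transitions:
        # Bootstrapped one-step target; no bootstrap at terminal transitions.
        target = r + (0.0 if done else gamma * v[s_next])
        # A weight of 0 skips the update; a weight of 1 recovers plain TD(0).
        v[s] += alpha * weights[s] * (target - v[s])
    return v


values = weighted_td0([(0, 1.0, 1, True)], weights=[1.0, 0.0], n_states=2)
print(values)
```

Setting every weight to 1 recovers ordinary TD(0), which is one way to see such selective schemes as a generalization of the standard update.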

TransDreamer: Reinforcement Learning with Transformer World Models

Authors:Chang Chen, Yi-Fu Wu, Jaesik Yoon, Sungjin Ahn
Date:2022-02-19 00:30:52

The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so. In this paper, we propose a transformer-based MBRL agent, called TransDreamer. We first introduce the Transformer State-Space Model, a world model that leverages a transformer for dynamics predictions. We then share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent. In experiments, we apply the proposed model to 2D visual RL and 3D first-person visual RL tasks both requiring long-range memory access for memory-based reasoning. We show that the proposed model outperforms Dreamer in these complex tasks.

Safe Reinforcement Learning by Imagining the Near Future

Authors:Garrett Thomas, Yuping Luo, Tengyu Ma
Date:2022-02-15 23:28:24

Safe reinforcement learning is a promising path toward applying reinforcement learning algorithms to real-world problems, where suboptimal behaviors may lead to actual negative consequences. In this work, we focus on the setting where unsafe states can be avoided by planning ahead a short time into the future. In this setting, a model-based agent with a sufficiently accurate model can avoid unsafe states. We devise a model-based algorithm that heavily penalizes unsafe trajectories, and derive guarantees that our algorithm can avoid unsafe states under certain assumptions. Experiments demonstrate that our algorithm can achieve competitive rewards with fewer safety violations in several continuous control tasks.

Provably Efficient Causal Model-Based Reinforcement Learning for Systematic Generalization

Authors:Mirco Mutti, Riccardo De Santi, Emanuele Rossi, Juan Felipe Calderon, Michael Bronstein, Marcello Restelli
Date:2022-02-14 08:34:51

In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another that is shared. Consider a set of environments that share the laws of motion as an example. In this setting, the agent can take a finite amount of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide a tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that guarantees any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.

Learning Reward Models for Cooperative Trajectory Planning with Inverse Reinforcement Learning and Monte Carlo Tree Search

Authors:Karl Kurzer, Matthias Bitzer, J. Marius Zöllner
Date:2022-02-14 00:33:08

Cooperative trajectory planning methods for automated vehicles can solve traffic scenarios that require a high degree of cooperation between traffic participants. However, for cooperative systems to integrate into human-centered traffic, the automated systems must behave human-like so that humans can anticipate the system's decisions. While Reinforcement Learning has made remarkable progress in solving the decision-making part, it is non-trivial to parameterize a reward model that yields predictable actions. This work employs feature-based Maximum Entropy Inverse Reinforcement Learning combined with Monte Carlo Tree Search to learn reward models that maximize the likelihood of recorded multi-agent cooperative expert trajectories. The evaluation demonstrates that the approach can recover a reasonable reward model that mimics the expert and performs similarly to a manually tuned baseline reward model.

A Unified Perspective on Value Backup and Exploration in Monte-Carlo Tree Search

Authors:Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen
Date:2022-02-11 15:30:08

Monte-Carlo Tree Search (MCTS) is a class of methods for solving complex decision-making problems through the synergy of Monte-Carlo planning and Reinforcement Learning (RL). The highly combinatorial nature of the problems commonly addressed by MCTS requires the use of efficient exploration strategies for navigating the planning tree and quickly convergent value backup methods. These crucial problems are particularly evident in recent advances that combine MCTS with deep neural networks for function approximation. In this work, we propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator and entropy regularization. We provide strong theoretical guarantees to bound convergence rate, approximation error, and regret of our methods. Moreover, we introduce a mathematical framework based on the use of the $\alpha$-divergence for backup and exploration in MCTS. We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework, so that different methods can be obtained simply by changing the value of $\alpha$. In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by tuning the single $\alpha$ parameter according to the problem at hand. We validate our methods through a rigorous empirical study ranging from basic toy problems to the complex Atari games, including both MDP and POMDP problems.

Universal Learning Waveform Selection Strategies for Adaptive Target Tracking

Authors:Charles E. Thornton, R. Michael Buehrer, Harpreet S. Dhillon, Anthony F. Martone
Date:2022-02-10 19:21:03

Online selection of optimal waveforms for target tracking with active sensors has long been a problem of interest. Many conventional solutions utilize an estimation-theoretic interpretation, in which a waveform-specific Cramér-Rao lower bound on measurement error is used to select the optimal waveform for each tracking step. However, this approach is only valid in the high SNR regime, and requires a rather restrictive set of assumptions regarding the target motion and measurement models. Further, due to computational concerns, many traditional approaches are limited to near-term, or myopic, optimization, even though radar scenes exhibit strong temporal correlation. More recently, reinforcement learning has been proposed for waveform selection, in which the problem is framed as a Markov decision process (MDP), allowing for long-term planning. However, a major limitation of reinforcement learning is that the memory length of the underlying Markov process is often unknown for realistic target and channel dynamics, and a more general framework is desirable. This work develops a universal sequential waveform selection scheme which asymptotically achieves Bellman optimality in any radar scene which can be modeled as a $U^{\text{th}}$ order Markov process for a finite, but unknown, integer $U$. Our approach is based on well-established tools from the field of universal source coding, where a stationary source is parsed into variable length phrases in order to build a context-tree, which is used as a probabilistic model for the scene's behavior. We show that an algorithm based on a multi-alphabet version of the Context-Tree Weighting (CTW) method can be used to optimally solve a broad class of waveform-agile tracking problems while making minimal assumptions about the environment's behavior.

Provable Reinforcement Learning with a Short-Term Memory

Authors:Yonathan Efroni, Chi Jin, Akshay Krishnamurthy, Sobhan Miryoosefi
Date:2022-02-08 16:39:57

Real-world sequential decision making problems commonly involve partial observability, which requires the agent to maintain a memory of history in order to infer the latent states, plan and make good decisions. Coping with partial observability in general is extremely challenging, as a number of worst-case statistical and computational barriers are known in learning Partially Observable Markov Decision Processes (POMDPs). Motivated by the problem structure in several physical applications, as well as a commonly used technique known as "frame stacking", this paper proposes to study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$. We establish a set of upper and lower bounds on the sample complexity for learning near-optimal policies for this class of problems in both tabular and rich-observation settings (where the number of observations is enormous). In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon, and is independent of the number of observations. Our results show that a short-term memory suffices for reinforcement learning in these environments.
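The "frame stacking" technique mentioned above can be sketched as a small buffer that keeps the last $m$ observations and presents them to the agent as a single input. The class name, the zero-padding at the start of an episode, and the array layout are assumptions made for this illustration, not details taken from the paper.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keep the m most recent observations and expose them as one array."""

    def __init__(self, m, obs_shape):
        self.m = m
        # Start with m zero frames so the stacked shape is fixed from step one.
        self.frames = deque(
            [np.zeros(obs_shape) for _ in range(m)], maxlen=m
        )

    def push(self, obs):
        """Add a new observation and return the stack, oldest frame first,
        with shape (m, *obs_shape)."""
        self.frames.append(np.asarray(obs, dtype=float))
        return np.stack(list(self.frames), axis=0)


stack = FrameStack(m=3, obs_shape=(2,))
stacked = stack.push([1.0, 2.0])
print(stacked.shape)
```

A policy that consumes the stacked array sees a window of recent history, which is the kind of short-term memory the abstract argues can suffice when the latent state is decodable from the last $m$ observations.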

Multi-Agent Path Finding with Prioritized Communication Learning

Authors:Wenhao Li, Hongjun Chen, Bo Jin, Wenzhe Tan, Hongyuan Zha, Xiangfeng Wang
Date:2022-02-08 04:04:19

Multi-agent pathfinding (MAPF) has been widely used to solve large-scale real-world problems, e.g., automated warehouses. A learning-based, fully decentralized framework has been introduced to alleviate real-time problems while pursuing an optimal planning policy. However, existing methods might generate significantly more vertex conflicts (or collisions), which lead to a low success rate or a longer makespan. In this paper, we propose a PrIoritized COmmunication learning method (PICO), which incorporates implicit planning priorities into the communication topology within the decentralized multi-agent reinforcement learning framework. Combined with classic coupled planners, the implicit priority learning module can be utilized to form the dynamic communication topology, which also builds an effective collision-avoiding mechanism. PICO performs significantly better in large-scale MAPF tasks in success rates and collision rates than state-of-the-art learning-based planners.

GrASP: Gradient-Based Affordance Selection for Planning

Authors:Vivek Veeriah, Zeyu Zheng, Richard Lewis, Satinder Singh
Date:2022-02-08 03:24:36

Planning with a learned model is arguably a key component of intelligence. There are several challenges in realizing such a component in large-scale reinforcement learning (RL) problems. One such challenge is dealing effectively with continuous action spaces when using tree-search planning (e.g., it is not feasible to consider every action even at just the root node of the tree). In this paper we present a method for selecting affordances useful for planning -- for learning which small number of actions/options from a continuous space of actions/options to consider in the tree-expansion process during planning. We consider affordances that are goal-and-state-conditional mappings to actions/options as well as unconditional affordances that simply select actions/options available in all states. Our selection method is gradient based: we compute gradients through the planning procedure to update the parameters of the function that represents affordances. Our empirical work shows that it is feasible to learn to select both primitive-action and option affordances, and that simultaneously learning to select affordances and planning with a learned value-equivalent model can outperform model-free RL.

Reward-Respecting Subtasks for Model-Based Reinforcement Learning

Authors:Richard S. Sutton, Marlos C. Machado, G. Zacharias Holland, David Szepesvari, Finbarr Timbers, Brian Tanner, Adam White
Date:2022-02-07 19:09:27

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.

Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives

Authors:Anand Balakrishnan, Stefan Jakšić, Edgar A. Aguilar, Dejan Ničković, Jyotirmoy V. Deshmukh
Date:2022-02-04 21:54:36

Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments. However, the control policies trained for an RL agent crucially depend on user-defined, state-based reward functions. Poorly designed rewards can lead to policies that do get maximal rewards but fail to satisfy desired task objectives or are unsafe. There are several examples of the use of formal languages such as temporal logics and automata to specify high-level task specifications for robots (in lieu of Markovian rewards). Recent efforts have focused on inferring state-based rewards from formal specifications; here, the goal is to provide (probabilistic) guarantees that the policy learned using RL (with the inferred rewards) satisfies the high-level formal specification. A key drawback of several of these techniques is that the rewards that they infer are sparse: the agent receives positive rewards only upon completion of the task and no rewards otherwise. This naturally leads to poor convergence properties and high variance during RL. In this work, we propose using formal specifications in the form of symbolic automata: these serve as a generalization of both bounded-time temporal logic-based specifications as well as automata. Furthermore, our use of symbolic automata allows us to define non-sparse potential-based rewards which empirically shape the reward surface, leading to better convergence during RL. We also show that our potential-based rewarding strategy still allows us to obtain the policy that maximizes the satisfaction of the given specification.
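To illustrate how a potential over automaton states can turn a sparse task reward into a dense one, the sketch below uses the standard potential-based shaping form $r + \gamma \Phi(q') - \Phi(q)$. The potential chosen here (negative number of transitions to the nearest accepting state), the toy automaton, and all function names are assumptions for illustration; the paper's symbolic automata carry predicates on their edges and its potential may be defined differently.

```python
from collections import deque


def distances_to_accepting(edges, accepting):
    """Breadth-first search over reversed edges: for each automaton state,
    the number of transitions needed to reach the nearest accepting state."""
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, []).append(src)
    dist = {q: 0 for q in accepting}
    queue = deque(accepting)
    while queue:
        q = queue.popleft()
        for p in reverse.get(q, []):
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist


def shaped_reward(r, q, q_next, dist, gamma=0.99):
    """Potential-based shaping with Phi(q) = -dist[q] (a hypothetical choice)."""
    phi_q = -float(dist[q])
    phi_next = -float(dist[q_next])
    return r + gamma * phi_next - phi_q


# Toy automaton: 0 -> 1 -> 2 with self-loops on 0 and 1; state 2 is accepting.
EDGES = [(0, 1), (1, 2), (0, 0), (1, 1)]
DIST = distances_to_accepting(EDGES, accepting=[2])
print(shaped_reward(0.0, 0, 1, DIST, gamma=1.0))
```

With $\gamma = 1$, progress toward the accepting state earns a positive shaped reward and a self-loop earns none, so the agent receives a signal before the task is completed.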

Do Differentiable Simulators Give Better Policy Gradients?

Authors:H. J. Terry Suh, Max Simchowitz, Kaiqing Zhang, Russ Tedrake
Date:2022-02-02 00:12:28

Differentiable simulators promise faster computation time for reinforcement learning by replacing zeroth-order gradient estimates of a stochastic objective with an estimate based on first-order gradients. However, it is yet unclear what factors decide the performance of the two estimators on complex landscapes that involve long-horizon planning and control on physical systems, despite the crucial relevance of this question for the utility of differentiable simulators. We show that characteristics of certain physical systems, such as stiffness or discontinuities, may compromise the efficacy of the first-order estimator, and analyze this phenomenon through the lens of bias and variance. We additionally propose an $\alpha$-order gradient estimator, with $\alpha \in [0,1]$, which correctly utilizes exact gradients to combine the efficiency of first-order estimates with the robustness of zero-order methods. We demonstrate the pitfalls of traditional estimators and the advantages of the $\alpha$-order estimator on some numerical examples.
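One way to picture an interpolation between the two estimators is a convex combination of a first-order and a zeroth-order estimate of the gradient of a Gaussian-smoothed objective. The sketch below does this with a hand-fixed $\alpha$; the function name, the smoothing scale `sigma`, the baseline subtraction, and the fixed choice of $\alpha$ are assumptions for illustration and should not be read as the paper's exact construction or its rule for selecting $\alpha$.

```python
import numpy as np


def alpha_order_gradient(f, grad_f, theta, alpha, sigma=0.1,
                         n_samples=1000, seed=0):
    """Estimate the gradient of E_w[f(theta + w)], w ~ N(0, sigma^2 I), as
    alpha * (first-order estimate) + (1 - alpha) * (zeroth-order estimate)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    noise = rng.normal(0.0, sigma, size=(n_samples, theta.size))

    # First-order: average the exact gradient at perturbed parameters.
    first = np.mean([grad_f(theta + w) for w in noise], axis=0)

    # Zeroth-order: score-function style estimate from function values only,
    # with f(theta) subtracted as a baseline to reduce variance.
    base = f(theta)
    zeroth = np.mean([(f(theta + w) - base) * w for w in noise], axis=0)
    zeroth = zeroth / sigma**2

    return alpha * first + (1.0 - alpha) * zeroth


def quadratic(x):
    return float(np.sum(x**2))


def quadratic_grad(x):
    return 2.0 * x


estimate = alpha_order_gradient(quadratic, quadratic_grad, [1.0, -2.0],
                                alpha=0.5, n_samples=5000)
print(estimate)
```

On a smooth quadratic both estimators target the same gradient, so the combination mainly trades variance; the abstract's point is that on stiff or discontinuous systems the first-order term can become unreliable, which is where a smaller $\alpha$ would matter.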

Accelerating Deep Reinforcement Learning for Digital Twin Network Optimization with Evolutionary Strategies

Authors:Carlos Güemes-Palau, Paul Almasan, Shihan Xiao, Xiangle Cheng, Xiang Shi, Pere Barlet-Ros, Albert Cabellos-Aparicio
Date:2022-02-01 11:56:55

The recent growth of emergent network applications (e.g., satellite networks, vehicular networks) is increasing the complexity of managing modern communication networks. As a result, the community has proposed Digital Twin Networks (DTN) as a key enabler of efficient network management. Network operators can leverage the DTN to perform different optimization tasks (e.g., Traffic Engineering, Network Planning). Deep Reinforcement Learning (DRL) has shown high performance when applied to network optimization problems. In the context of DTN, DRL can be leveraged to solve optimization problems without directly impacting the real-world network behavior. However, DRL scales poorly with the problem size and complexity. In this paper, we explore the use of Evolutionary Strategies (ES) to train DRL agents for solving a routing optimization problem. The experimental results show that ES achieved a training time speed-up of 128 and 6 for the NSFNET and GEANT2 topologies respectively.

RFUniverse: A Multiphysics Simulation Platform for Embodied AI

Authors:Haoyuan Fu, Wenqiang Xu, Ruolin Ye, Han Xue, Zhenjun Yu, Tutian Tang, Yutong Li, Wenxin Du, Jieyi Zhang, Cewu Lu
Date:2022-02-01 03:35:13

Multiphysics phenomena, the coupling effects involving different aspects of physics laws, are pervasive in the real world and can often be encountered when performing everyday household tasks. Intelligent agents which seek to assist or replace human laborers will need to learn to cope with such phenomena in household task settings. To equip agents with such abilities, the research community needs a simulation environment that can serve as a testbed for training these intelligent agents and that supports multiphysics coupling effects. Though many mature multiphysics simulation software packages have been adopted in industrial production, such techniques have not been applied to robot learning or embodied AI research. To bridge the gap, we propose a novel simulation environment named RFUniverse. This simulator can not only compute rigid and multi-body dynamics, but also multiphysics coupling effects commonly observed in daily life, such as air-solid interaction, fluid-solid interaction, and heat transfer. Because of the unique multiphysics capacities of this simulator, we can benchmark tasks that involve complex dynamics due to multiphysics coupling effects in a simulation environment before deploying to the real world. RFUniverse provides multiple interfaces to let the users interact with the virtual world in various ways, which is helpful and essential for learning, planning, and control. We benchmark three tasks with reinforcement learning, including food cutting, water pushing, and towel catching. We also evaluate butter pushing with a classic planning-control paradigm. This simulator offers an enhancement of physics simulation in terms of the computation of multiphysics coupling effects.

Zeroth-Order Actor-Critic: An Evolutionary Framework for Sequential Decision Problems

Authors:Yuheng Lei, Yao Lyu, Guojian Zhan, Tao Zhang, Jiangtao Li, Jianyu Chen, Shengbo Eben Li, Sifa Zheng
Date:2022-01-29 07:09:03

Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. While these methods are highly versatile, they often suffer from high sample complexity because they ignore the underlying temporal structure. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov Decision Processes (MDPs). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework, Zeroth-Order Actor-Critic (ZOAC). We propose to use step-wise exploration in parameter space and theoretically derive the zeroth-order policy gradient. We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of gradient estimators. In each iteration, ZOAC employs samplers to collect trajectories with parameter space exploration, and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). To evaluate the effectiveness of ZOAC, we apply it to a challenging multi-lane driving task, optimizing the parameters in a rule-based, non-differentiable driving policy that consists of three sub-modules: behavior selection, path planning, and trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information, in terms of total average return across all tasks.

Planning and Learning with Adaptive Lookahead

Authors:Aviv Rosenberg, Assaf Hallak, Shie Mannor, Gal Chechik, Gal Dalal
Date:2022-01-28 20:26:55

Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.

A deep Q-learning method for optimizing visual search strategies in backgrounds of dynamic noise

Authors:Weimin Zhou, Miguel P. Eckstein
Date:2022-01-28 19:26:45

Humans process visual information with varying resolution (foveated visual system) and explore images by orienting, through eye movements, the high-resolution fovea to points of interest. The Bayesian ideal searcher (IS), which employs complete knowledge of task-relevant information, optimizes eye movement strategy and achieves the optimal search performance. The IS can be employed as an important tool to evaluate the optimality of human eye movements, and potentially provide guidance to improve human observer visual search strategies. Najemnik and Geisler (2005) derived an IS for backgrounds of spatial 1/f noise. The corresponding template responses follow Gaussian distributions and the optimal search strategy can be analytically determined. However, the computation of the IS can be intractable when considering more realistic and complex backgrounds such as medical images. Modern reinforcement learning methods, successfully applied to obtain optimal policies for a variety of tasks, do not require complete knowledge of the background generating functions and can potentially be applied to anatomical backgrounds. An important first step is to validate the optimality of the reinforcement learning method. In this study, we investigate the ability of a reinforcement learning method that employs a Q-network to approximate the IS. We demonstrate that the search strategy corresponding to the Q-network is consistent with the IS search strategy. The findings show the potential of reinforcement learning with a Q-network to estimate optimal eye movement planning with real anatomical backgrounds.

Overcoming Exploration: Deep Reinforcement Learning for Continuous Control in Cluttered Environments from Temporal Logic Specifications

Authors:Mingyu Cai, Erfan Aasi, Calin Belta, Cristian-Ioan Vasile
Date:2022-01-28 16:39:08

Model-free continuous control for robot navigation tasks using Deep Reinforcement Learning (DRL) that relies on noisy policies for exploration is sensitive to the density of rewards. In practice, robots are usually deployed in cluttered environments, containing many obstacles and narrow passageways. Designing dense effective rewards is challenging, resulting in exploration issues during training. Such a problem becomes even more serious when tasks are described using temporal logic specifications. This work presents a deep policy gradient algorithm for controlling a robot with unknown dynamics operating in a cluttered environment when the task is specified as a Linear Temporal Logic (LTL) formula. To overcome the environmental challenge of exploration during training, we propose a novel path planning-guided reward scheme by integrating sampling-based methods to effectively complete goal-reaching missions. To facilitate LTL satisfaction, our approach decomposes the LTL mission into sub-goal-reaching tasks that are solved in a distributed manner. Our framework is shown to significantly improve performance (effectiveness, efficiency) and exploration of robots tasked with complex missions in large-scale cluttered environments. A video demonstration can be found on YouTube: https://youtu.be/yMh_NUNWxho.

Dynamic Temporal Reconciliation by Reinforcement learning

Authors:Himanshi Charotia, Abhishek Garg, Gaurav Dhama, Naman Maheshwari
Date:2022-01-28 07:15:23

Planning based on long- and short-term time series forecasts is a common practice across many industries. In this context, temporal aggregation and reconciliation techniques have been useful in improving forecasts, reducing model uncertainty, and providing a coherent forecast across different time horizons. However, an underlying assumption spanning all these techniques is the complete availability of data across all levels of the temporal hierarchy. While this offers mathematical convenience, low frequency data is most of the time only partially complete and is not available at forecasting time. On the other hand, high frequency data can change significantly in a scenario like the COVID pandemic, and this change can be used to improve forecasts that would otherwise diverge significantly from long term actuals. We propose a dynamic reconciliation method whereby we formulate the problem of informing low frequency forecasts based on high frequency actuals as a Markov Decision Process (MDP), allowing for the fact that we do not have complete information about the dynamics of the process. This allows us to have the best long term estimates based on the most recent data available even if the low frequency cycles have only been partially completed. The MDP is solved using a Time Differenced Reinforcement learning (TDRL) approach with customizable actions, which improves the long term forecasts dramatically compared to relying solely on historical low frequency data. The result also underscores the fact that, while low frequency forecasts can improve the high frequency forecasts as mentioned in the temporal reconciliation literature (based on the assumption that low frequency forecasts have a lower noise-to-signal ratio), the high frequency forecasts can also be used to inform the low frequency forecasts.

Excavation Reinforcement Learning Using Geometric Representation

Authors:Qingkai Lu, Yifan Zhu, Liangjun Zhang
Date:2022-01-27 02:59:56

Excavation of irregular rigid objects in clutter, such as fragmented rocks and wood blocks, is very challenging due to their complex interaction dynamics and highly variable geometries. In this paper, we adopt reinforcement learning (RL) to tackle this challenge and learn policies to plan for a sequence of excavation trajectories for irregular rigid objects, given point clouds of excavation scenes. Moreover, we separately learn a compact representation of the point cloud on geometric tasks that do not require human labeling. We show that using the representation reduces training time for RL, while achieving similar asymptotic performance compared to an end-to-end RL algorithm. When using a policy trained in simulation directly on a real scene, we show that the policy trained with the representation outperforms end-to-end RL. To the best of our knowledge, this paper presents the first application of RL to plan a sequence of excavation trajectories of irregular rigid objects in clutter.

Migration of self-propelling agent in a turbulent environment with minimal energy consumption

Authors:Ao Xu, Hua-Lin Wu, Heng-Dong Xi
Date:2022-01-25 01:55:09

We present a numerical study of training a self-propelling agent to migrate in an unsteady flow environment. We control the agent to utilize the background flow structure by adopting a reinforcement learning algorithm to minimize energy consumption. We consider the agent migrating in two types of flows: one is a simple periodic double-gyre flow as a proof-of-concept example, while the other is complex turbulent Rayleigh-Bénard convection as a paradigm for migrating in the convective atmosphere or the ocean. The results show that the smart agent in both flows can learn to migrate from one position to another while utilizing background flow currents as much as possible to minimize the energy consumption, which is evident from comparing the smart agent with a naive agent that moves straight from the origin to the destination. In addition, we found that compared to the double-gyre flow, the flow field in the turbulent Rayleigh-Bénard convection exhibits more substantial fluctuations, and the training agent is more likely to explore different migration strategies; thus, the training process is more difficult to converge. Nevertheless, we can still identify an energy-efficient trajectory that corresponds to the strategy with the highest reward received by the agent. These results have important implications for many migration problems such as unmanned aerial vehicles flying in a turbulent convective environment, where planning energy-efficient trajectories is often involved.

Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning

Authors:Haichao Zhang, Wei Xu, Haonan Yu
Date:2022-01-24 15:53:32

Standard model-free reinforcement learning algorithms optimize a policy that generates the action to be taken in the current time step in order to maximize expected future return. While flexible, these algorithms face difficulties arising from inefficient exploration due to their single-step nature. In this work, we present the Generative Planning Method (GPM), which can generate actions not only for the current step, but also for a number of future steps (thus termed generative planning). This brings several benefits to GPM. Firstly, since GPM is trained by maximizing value, the plans generated from it can be regarded as intentional action sequences for reaching high-value regions. GPM can therefore leverage its generated multi-step plans for temporally coordinated exploration towards high-value regions, which is potentially more effective than a sequence of actions generated by perturbing each action at the single-step level, whose consistent movement decays exponentially with the number of exploration steps. Secondly, starting from a crude initial plan generator, GPM can refine it to be adaptive to the task, which, in turn, benefits future exploration. This is potentially more effective than the commonly used action-repeat strategy, which is non-adaptive in its form of plans. Additionally, since the multi-step plan can be interpreted as the intent of the agent from now to a span of time into the future, it offers a more informative and intuitive signal for interpretation. Experiments are conducted on several benchmark environments and the results demonstrate its effectiveness compared with several baseline methods.

The Paradox of Choice: Using Attention in Hierarchical Reinforcement Learning

Authors:Andrei Nica, Khimya Khetarpal, Doina Precup
Date:2022-01-24 13:18:02

Decision-making AI agents are often faced with two important challenges: the depth of the planning horizon, and the branching factor due to having many choices. Hierarchical reinforcement learning methods aim to solve the first problem, by providing shortcuts that skip over multiple time steps. To cope with the breadth, it is desirable to restrict the agent's attention at each step to a reasonable number of possible choices. The concept of affordances (Gibson, 1977) suggests that only certain actions are feasible in certain states. In this work, we model "affordances" through an attention mechanism that limits the available choices of temporally extended options. We present an online, model-free algorithm to learn affordances that can be used to further learn subgoal options. We investigate the role of hard versus soft attention in training data collection, abstract value learning in long-horizon tasks, and handling a growing number of choices. We identify and empirically illustrate the settings in which the paradox of choice arises, i.e. when having fewer but more meaningful choices improves the learning speed and performance of a reinforcement learning agent.

Learning to Reformulate for Linear Programming

Authors:Xijun Li, Qingyu Qu, Fangzhou Zhu, Jia Zeng, Mingxuan Yuan, Kun Mao, Jie Wang
Date:2022-01-17 04:58:46

It has been verified that linear programming (LP) is able to formulate many real-life optimization problems, whose optima can be obtained by resorting to corresponding solvers such as OptVerse, Gurobi and CPLEX. In the past decades, a series of traditional operations research algorithms have been proposed to obtain the optimum of a given LP in less solving time. Recently, there has been a trend of using machine learning (ML) techniques to improve the performance of the above solvers. However, almost no previous work takes advantage of ML techniques to improve solver performance from the front end, i.e., the modeling (or formulation). In this paper, we are the first to propose a reinforcement learning-based reformulation method for LP to improve the performance of the solving process. Using the open-source solver COIN-OR LP (CLP) as an environment, we implement the proposed method over two public research LP datasets and one large-scale LP dataset collected from a practical production planning scenario. The evaluation results suggest that the proposed method can effectively reduce both the number of solving iterations ($25\%\downarrow$) and the solving time ($15\%\downarrow$) over the above datasets on average, compared to directly solving the original LP instances.

Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning

Authors:Phillip Swazinna, Steffen Udluft, Daniel Hein, Thomas Runkler
Date:2022-01-14 13:08:19

Offline reinforcement learning (RL) algorithms are often designed with environments such as MuJoCo in mind, in which the planning horizon is extremely long and no noise exists. We compare model-free, model-based, as well as hybrid offline RL approaches on various industrial benchmark (IB) datasets to test the algorithms in settings closer to real-world problems, including complex noise and partially observable states. We find that on the IB, hybrid approaches face severe difficulties and that simpler algorithms, such as rollout-based algorithms or model-free algorithms with simpler regularizers, perform best on the datasets.

Planning in Observable POMDPs in Quasipolynomial Time

Authors:Noah Golowich, Ankur Moitra, Dhruv Rohatgi
Date:2022-01-12 23:16:37

Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e. the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.

Multi-echelon Supply Chains with Uncertain Seasonal Demands and Lead Times Using Deep Reinforcement Learning

Authors:Julio César Alves, Geraldo Robson Mateus
Date:2022-01-12 19:03:07

We address the problem of production planning and distribution in multi-echelon supply chains. We consider uncertain demands and lead times, which make the problem stochastic and non-linear. A Markov Decision Process formulation and a Non-linear Programming model are presented. Because this is a sequential decision-making problem, Deep Reinforcement Learning (RL) is a possible solution approach. This type of technique has gained a lot of attention from the Artificial Intelligence and Optimization communities in recent years. Considering the good results obtained with Deep RL approaches in different areas, there is a growing interest in applying them to problems from the Operations Research field. We have used a Deep RL technique, namely Proximal Policy Optimization (PPO2), to solve the problem considering uncertain, regular and seasonal demands and constant or stochastic lead times. Experiments are carried out in different scenarios to better assess the suitability of the algorithm. An agent based on a linearized model is used as a baseline. Experimental results indicate that PPO2 is a competitive and adequate tool for this type of problem. The PPO2 agent is better than the baseline in all scenarios with stochastic lead times (7.3-11.2%), regardless of whether demands are seasonal or not. In scenarios with constant lead times, the PPO2 agent is better when uncertain demands are non-seasonal (2.2-4.7%). The results show that the greater the uncertainty of the scenario, the greater the viability of this type of approach.

Dyna-T: Dyna-Q and Upper Confidence Bounds Applied to Trees

Authors:Tarek Faycal, Claudio Zito
Date:2022-01-12 15:06:30

In this work we present a preliminary investigation of a novel algorithm called Dyna-T. In reinforcement learning (RL) a planning agent has its own representation of the environment as a model. To discover an optimal policy to interact with the environment, the agent collects experience in a trial-and-error fashion. Experience can be used for learning a better model or for directly improving the value function and policy. While these two uses are typically kept separate, Dyna-Q is a hybrid approach which, at each iteration, exploits the real experience to update the model as well as the value function, while planning its action using simulated data from its model. However, the planning process is computationally expensive and strongly depends on the dimensionality of the state-action space. We propose to build an Upper Confidence Tree (UCT) on the simulated experience and search for the best action to be selected during the on-line learning process. We demonstrate the effectiveness of our proposed method in a set of preliminary tests on three testbed environments from OpenAI. In contrast to Dyna-Q, Dyna-T outperforms state-of-the-art RL agents in the stochastic environments by choosing a more robust action selection strategy.
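
To make the Dyna-Q loop described in this abstract concrete, here is a minimal tabular sketch. It covers only the Dyna-Q baseline (direct update, model learning, planning from simulated experience), not the UCT search that Dyna-T adds. The `step(state, action)` interface, the five-state chain environment, and all hyperparameters are assumptions made for illustration and do not come from the paper.

```python
import random
from collections import defaultdict

def dyna_q(step, start, actions, episodes=50, planning_steps=20,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q for a deterministic environment (illustrative only).

    step(state, action) -> (next_state, reward, done) is a hypothetical
    environment interface assumed for this sketch.
    """
    rng = random.Random(seed)
    q = defaultdict(float)   # action values, keyed by (state, action)
    model = {}               # learned model: (state, action) -> (next_state, reward, done)

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)
            update(s, a, r, s2, done)         # learn from real experience
            model[(s, a)] = (s2, r, done)     # update the model
            for _ in range(planning_steps):   # plan with simulated experience
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q

# Hypothetical test environment: a chain of states 0..4 with reward 1 at state 4.
def chain_step(s, a):
    s2 = min(max(s + a, 0), 4)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

q_values = dyna_q(chain_step, start=0, actions=[-1, 1])
policy = [max([-1, 1], key=lambda a: q_values[(s, a)]) for s in range(4)]
```

The inner planning loop is where the cost noted in the abstract arises: every real step triggers `planning_steps` additional updates sampled from the learned model.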

Combining Learning-based Locomotion Policy with Model-based Manipulation for Legged Mobile Manipulators

Authors:Yuntao Ma, Farbod Farshidian, Takahiro Miki, Joonho Lee, Marco Hutter
Date:2022-01-11 10:27:31

Deep reinforcement learning produces robust locomotion policies for legged robots over challenging terrains. To date, few studies have leveraged model-based methods to combine these locomotion skills with the precise control of manipulators. Here, we incorporate external dynamics plans into learning-based locomotion policies for mobile manipulation. We train the base policy by applying a random wrench sequence on the robot base in simulation and adding the noisified wrench sequence prediction to the policy observations. The policy then learns to counteract the partially-known future disturbance. The random wrench sequences are replaced with the wrench prediction generated with the dynamics plans from model predictive control to enable deployment. We show zero-shot adaptation for manipulators unseen during training. On the hardware, we demonstrate stable locomotion of legged robots with the prediction of the external wrench.

Assessing Policy, Loss and Planning Combinations in Reinforcement Learning using a New Modular Architecture

Authors:Tiago Gaspar Oliveira, Arlindo L. Oliveira
Date:2022-01-08 18:30:25

The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are quite complex and involve multiple components, factors that can create challenges for research. In this work, we propose a new modular software architecture suited for these types of agents, and a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include planning algorithms, policies, and loss functions. We illustrate the use of this architecture by combining several of these building blocks to implement and test agents that are optimized to three different test environments: Cartpole, Minigrid, and Tictactoe. One particular planning algorithm, made available in our implementation and not previously used in reinforcement learning, which we called averaged minimax, achieved good results in the three tested environments. Experiments performed with this architecture have shown that the best combination of planning algorithm, policy, and loss function is heavily problem dependent. This result provides evidence that the proposed architecture, which is modular and reusable, is useful for reinforcement learning researchers who want to study new environments and techniques.

Supervised Permutation Invariant Networks for Solving the CVRP with Bounded Fleet Size

Authors:Daniela Thyssens, Jonas Falkner, Lars Schmidt-Thieme
Date:2022-01-05 10:32:18

Learning to solve combinatorial optimization problems, such as the vehicle routing problem, offers great computational advantages over classical operations research solvers and heuristics. The recently developed deep reinforcement learning approaches either improve an initially given solution iteratively or sequentially construct a set of individual tours. However, most of the existing learning-based approaches are not able to work with a fixed number of vehicles and thus bypass the complex problem of assigning customers to an a priori given number of available vehicles. This makes them less suitable for real applications, as many logistics service providers rely on solutions provided for a specific bounded fleet size and cannot accommodate short-term changes to the number of vehicles. In contrast, we propose a powerful supervised deep learning framework that constructs a complete tour plan from scratch while respecting an a priori fixed number of available vehicles. In combination with an efficient post-processing scheme, our supervised approach is not only much faster and easier to train but also achieves competitive results that incorporate the practical aspect of vehicle costs. In thorough controlled experiments we compare our method to multiple state-of-the-art approaches, where we demonstrate stable performance while utilizing fewer vehicles, and we shed some light on existing inconsistencies in the experimentation protocols of the related work.

Have I done enough planning or should I plan more?

Authors:Ruiqi He, Yash Raj Jain, Falk Lieder
Date:2022-01-03 17:11:07

People's decisions about how to allocate their limited computational resources are essential to human intelligence. An important component of this metacognitive ability is deciding whether to continue thinking about what to do and move on to the next decision. Here, we show that people acquire this ability through learning and reverse-engineer the underlying learning mechanisms. Using a process-tracing paradigm that externalises human planning, we find that people quickly adapt how much planning they perform to the cost and benefit of planning. To discover the underlying metacognitive learning mechanisms we augmented a set of reinforcement learning models with metacognitive features and performed Bayesian model selection. Our results suggest that the metacognitive ability to adjust the amount of planning might be learned through a policy-gradient mechanism that is guided by metacognitive pseudo-rewards that communicate the value of planning.

Stochastic convex optimization for provably efficient apprenticeship learning

Authors:Angeliki Kamoutsi, Goran Banjac, John Lygeros
Date:2021-12-31 19:47:57

We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations. We adopt the apprenticeship learning formalism, which carries the assumption that the true cost function can be represented as a linear combination of some known features. Existing inverse reinforcement learning algorithms come with strong theoretical guarantees, but are computationally expensive because they use reinforcement learning or planning algorithms as a subroutine. On the other hand, state-of-the-art policy gradient based algorithms (like IM-REINFORCE, IM-TRPO, and GAIL), achieve significant empirical success in challenging benchmark tasks, but are not well understood in terms of theory. With an emphasis on non-asymptotic guarantees of performance, we propose a method that directly learns a policy from expert demonstrations, bypassing the intermediate step of learning the cost function, by formulating the problem as a single convex optimization problem over occupancy measures. We develop a computationally efficient algorithm and derive high confidence regret bounds on the quality of the extracted policy, utilizing results from stochastic convex optimization and recent works in approximate linear programming for solving forward MDPs.

Abstractions of General Reinforcement Learning

Authors:Sultan J. Majeed
Date:2021-12-26 15:50:05

The field of artificial intelligence (AI) is devoted to the creation of artificial decision-makers that can perform (at least) on par with their human counterparts in a domain of interest. Unlike the agents in traditional AI, the agents in artificial general intelligence (AGI) are required to replicate human intelligence in almost every domain of interest. Moreover, an AGI agent should be able to achieve this without (virtually any) further changes, retraining, or fine-tuning of the parameters. The real world is non-stationary, non-ergodic, and non-Markovian: we humans can neither revisit our past, nor are the most recent observations sufficient statistics. Yet, we excel at a variety of complex tasks. Many of these tasks require long-term planning. We can attribute this success to our natural faculty to abstract away task-irrelevant information from our overwhelming sensory experience. We make task-specific mental models of the world without much effort. Due to this ability to abstract, we can plan on a significantly more compact representation of a task without much loss of performance. Not only this, we also abstract our actions to produce high-level plans: the level of action-abstraction can be anywhere between small muscle movements and a mental notion of "doing an action". It is natural to assume that any AGI agent competing with humans (in every plausible domain) should also have these abilities to abstract its experiences and actions. This thesis is an inquiry into the existence of such abstractions, which aid efficient planning for a wide range of domains and, most importantly, come with some optimality guarantees.

Reducing Planning Complexity of General Reinforcement Learning with Non-Markovian Abstractions

Authors:Sultan J. Majeed, Marcus Hutter
Date:2021-12-26 14:26:41

The field of General Reinforcement Learning (GRL) formulates the problem of sequential decision-making from the ground up. The history of interaction constitutes a "ground" state of the system, which never repeats. On the one hand, this generality allows GRL to model almost every domain possible, e.g., bandits, MDPs, POMDPs, PSRs, and history-based environments. On the other hand, in general, the near-optimal policies in GRL are functions of the complete history, which hinders not only learning but also planning in GRL. The usual workaround for the planning part is that the agent is given a Markovian abstraction of the underlying process, so it can use any MDP planning algorithm to find a near-optimal policy. The Extreme State Aggregation (ESA) framework has extended this idea to non-Markovian abstractions without compromising on the possibility of planning through a (surrogate) MDP. A distinguishing feature of ESA is that it proves an upper bound of $O\left(\varepsilon^{-A} \cdot (1-\gamma)^{-2A}\right)$ on the number of states required for the surrogate MDP (where $A$ is the number of actions, $\gamma$ is the discount factor, and $\varepsilon$ is the optimality gap), which holds uniformly for all domains. While the possibility of a universal bound is quite remarkable, we show that this bound is very loose. We propose a novel non-MDP abstraction which allows for a much better upper bound of $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot A \cdot 2^{A}\right)$. Furthermore, we show that this bound can be improved further to $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot \log^3 A \right)$ by using an action-sequentialization method.
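
To get a feel for how far apart the three bounds quoted in this abstract are, the sketch below evaluates the expressions inside the $O(\cdot)$ for one parameter setting. Dropping the hidden constants and using the natural logarithm are assumptions of this illustration, so the numbers indicate growth rates only, not actual state counts.

```python
import math

def esa_bound(eps, gamma, num_actions):
    """Original ESA expression: eps^(-A) * (1 - gamma)^(-2A)."""
    return eps ** (-num_actions) * (1 - gamma) ** (-2 * num_actions)

def non_mdp_bound(eps, gamma, num_actions):
    """Non-MDP abstraction expression: eps^(-1) * (1 - gamma)^(-2) * A * 2^A."""
    return (1 / eps) * (1 - gamma) ** (-2) * num_actions * 2 ** num_actions

def sequentialized_bound(eps, gamma, num_actions):
    """Action-sequentialization expression: eps^(-1) * (1 - gamma)^(-2) * log^3(A).

    The natural logarithm is an assumption; the base only changes a constant.
    """
    return (1 / eps) * (1 - gamma) ** (-2) * math.log(num_actions) ** 3

# Example setting chosen for illustration: eps = 0.1, gamma = 0.9, A = 4.
eps, gamma, num_actions = 0.1, 0.9, 4
bounds = (
    esa_bound(eps, gamma, num_actions),
    non_mdp_bound(eps, gamma, num_actions),
    sequentialized_bound(eps, gamma, num_actions),
)
```

For this setting the three expressions come out at roughly $10^{12}$, $6.4 \times 10^{4}$, and $2.7 \times 10^{3}$. The first depends exponentially on $A$ through both $\varepsilon$ and $1-\gamma$, while the other two keep those factors at fixed powers.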

Graph augmented Deep Reinforcement Learning in the GameRLand3D environment

Authors:Edward Beeching, Maxim Peter, Philippe Marcotte, Jilles Debangoye, Olivier Simonin, Joshua Romoff, Christian Wolf
Date:2021-12-22 08:48:00

We address planning and navigation in challenging 3D video games featuring maps with disconnected regions reachable by agents using special actions. In this setting, classical symbolic planners are not applicable or difficult to adapt. We introduce a hybrid technique combining a low-level policy trained with reinforcement learning and a graph-based high-level classical planner. In addition to providing human-interpretable paths, the approach improves the generalization performance of an end-to-end approach in unseen maps, where it achieves a 20% absolute increase in success rate over a recurrent end-to-end agent on a point-to-point navigation task in previously unseen large-scale maps of size 1 km x 1 km. In an in-depth experimental study, we quantify the limitations of end-to-end Deep RL approaches in vast environments, and we also introduce "GameRLand3D", a new benchmark and soon-to-be-released environment that can generate complex procedural 3D maps for navigation tasks.

A deep reinforcement learning model for predictive maintenance planning of road assets: Integrating LCA and LCCA

Authors:Moein Latifi, Fateme Golivand Darvishvand, Omid Khandel, Mobin Latifi Nowsoud
Date:2021-12-20 13:46:39

Road maintenance planning is an integral part of road asset management. One of the main challenges in Maintenance and Rehabilitation (M&R) practices is to determine maintenance type and timing. This research proposes a framework using Reinforcement Learning (RL) based on the Long Term Pavement Performance (LTPP) database to determine the type and timing of M&R practices. A predictive DNN model is first developed in the proposed algorithm, which serves as the Environment for the RL algorithm. For the Policy estimation of the RL model, both DQN and PPO models are developed. However, PPO has been selected in the end due to better convergence and higher sample efficiency. Indicators used in this study are International Roughness Index (IRI) and Rutting Depth (RD). Initially, we considered Cracking Metric (CM) as the third indicator, but it was then excluded due to the much fewer data compared to other indicators, which resulted in lower accuracy of the results. Furthermore, in cost-effectiveness calculation (reward), we considered both the economic and environmental impacts of M&R treatments. Costs and environmental impacts have been evaluated with paLATE 2.0 software. Our method is tested on a hypothetical case study of a six-lane highway with 23 kilometers length located in Texas, which has a warm and wet climate. The results propose a 20-year M&R plan in which road condition remains in an excellent condition range. Because the early state of the road is at a good level of service, there is no need for heavy maintenance practices in the first years. Later, after heavy M&R actions, there are several 1-2 years of no need for treatments. All of these show that the proposed plan has a logical result. Decision-makers and transportation agencies can use this scheme to conduct better maintenance practices that can prevent budget waste and, at the same time, minimize the environmental impacts.

Online Grounding of Symbolic Planning Domains in Unknown Environments

Authors:Leonardo Lamanna, Luciano Serafini, Alessandro Saetti, Alfonso Gerevini, Paolo Traverso
Date:2021-12-18 21:48:20

If a robotic agent wants to exploit symbolic planning techniques to achieve some goal, it must be able to properly ground an abstract planning domain in the environment in which it operates. However, if the environment is initially unknown to the agent, the agent needs to explore it and discover the salient aspects of the environment needed to reach its goals. Namely, the agent has to discover: (i) the objects present in the environment, (ii) the properties of these objects and their relations, and finally (iii) how abstract actions can be successfully executed. The paper proposes a framework that aims to accomplish this for an agent that perceives the environment partially and subjectively, through real-valued sensors (e.g., GPS and an on-board camera), and can operate in the environment through low-level actuators (e.g., move forward by 20 cm). We evaluate the proposed architecture in photo-realistic simulated environments, where the sensors are an on-board RGB-D camera, GPS and a compass, and low-level actions include movements, grasping/releasing objects, and manipulating objects. The agent is placed in an unknown environment and asked to find objects of a certain type, place an object on top of another, or close or open an object of a certain type. We compare our approach with state-of-the-art methods for object goal navigation based on reinforcement learning, showing better performance.

Creativity of AI: Hierarchical Planning Model Learning for Facilitating Deep Reinforcement Learning

Authors:Hankz Hankui Zhuo, Shuting Deng, Mu Jin, Zhihao Ma, Kebing Jin, Chen Chen, Chao Yu
Date:2021-12-18 03:45:28

Despite achieving great success in real-world applications, Deep Reinforcement Learning (DRL) still suffers from three critical issues: data inefficiency and a lack of interpretability and transferability. Recent research shows that embedding symbolic knowledge into DRL is promising in addressing those challenges. Inspired by this, we introduce a novel deep reinforcement learning framework with symbolic options. Our framework features a loop training procedure, which guides the improvement of the policy by planning with planning models (including action models and hierarchical task network models) and symbolic options learned automatically from interactive trajectories. The learned symbolic options alleviate the heavy requirement for expert domain knowledge and provide inherent interpretability of policies. Moreover, transferability and data efficiency can be further improved by planning with the symbolic planning models. To validate the effectiveness of our framework, we conduct experiments on two domains, Montezuma's Revenge and Office World. The results demonstrate comparable performance and improved data efficiency, interpretability and transferability.

Resilient Branching MPC for Multi-Vehicle Traffic Scenarios Using Adversarial Disturbance Sequences

Authors:Victor Fors, Björn Olofsson, Erik Frisk
Date:2021-12-17 15:06:55

An approach to resilient planning and control of autonomous vehicles in multi-vehicle traffic scenarios is proposed. The proposed method is based on model predictive control (MPC), where alternative predictions of the surrounding traffic are determined automatically such that they are intentionally adversarial to the ego vehicle. This provides robustness against the inherent uncertainty in traffic predictions. To reduce conservatism, an assumption that other agents are of no ill intent is formalized. Simulation results from highway driving scenarios show that the proposed method in real-time negotiates traffic situations out of scope for a nominal MPC approach and performs favorably to state-of-the-art reinforcement-learning approaches without requiring prior training. The results also show that the proposed method performs effectively, with the ability to prune disturbance sequences with a lower risk for the ego vehicle.

CEM-GD: Cross-Entropy Method with Gradient Descent Planner for Model-Based Reinforcement Learning

Authors:Kevin Huang, Sahin Lale, Ugo Rosolia, Yuanyuan Shi, Anima Anandkumar
Date:2021-12-14 21:11:27

Current state-of-the-art model-based reinforcement learning algorithms use trajectory sampling methods, such as the Cross-Entropy Method (CEM), for planning in continuous control settings. These zeroth-order optimizers require sampling a large number of trajectory rollouts to select an optimal action, which scales poorly for large prediction horizons or high-dimensional action spaces. First-order methods that use the gradients of the rewards with respect to the actions as an update can mitigate this issue, but suffer from local optima due to the non-convex optimization landscape. To overcome these issues and achieve the best of both worlds, we propose a novel planner, Cross-Entropy Method with Gradient Descent (CEM-GD), that combines first-order methods with CEM. At the beginning of execution, CEM-GD uses CEM to sample a large number of trajectory rollouts to explore the optimization landscape and avoid poor local minima. It then uses the top trajectories as initialization for gradient descent and applies gradient updates to each of these trajectories to find the optimal action sequence. At each subsequent time step, however, CEM-GD samples far fewer trajectories from CEM before applying gradient updates. We show that as the dimensionality of the planning problem increases, CEM-GD maintains desirable performance with a constant small number of samples by using the gradient information, while avoiding local optima using initially well-sampled trajectories. Furthermore, CEM-GD achieves better performance than CEM on a variety of continuous control benchmarks in MuJoCo with 100x fewer samples per time step, resulting in around 25% less computation time and 10% less memory usage. The implementation of CEM-GD is available at https://github.com/KevinHuang8/CEM-GD.
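
The two-stage idea in this abstract (sample broadly with CEM, then refine the top candidates with gradient descent) can be sketched on a toy problem. This is not the authors' implementation: it has no dynamics model and no per-time-step resampling schedule, and the function name `cem_gd`, the non-convex toy cost, and every hyperparameter are assumptions made for illustration.

```python
import random

def cem_gd(cost, grad, dim, n_samples=64, n_elite=8, cem_iters=3,
           gd_steps=50, lr=0.05, seed=0):
    """Minimise `cost` over a vector (standing in for an action sequence)."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [2.0] * dim
    elites = []
    # Stage 1: CEM explores the landscape with a diagonal Gaussian.
    for _ in range(cem_iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        samples.sort(key=cost)
        elites = samples[:n_elite]
        mean = [sum(e[i] for e in elites) / n_elite for i in range(dim)]
        std = [max(1e-3, (sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite) ** 0.5)
               for i in range(dim)]
    # Stage 2: gradient descent refines each elite independently.
    refined = []
    for x in elites:
        x = list(x)
        for _ in range(gd_steps):
            x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
        refined.append(x)
    return min(refined, key=cost)

# Toy non-convex cost: each coordinate has a local minimum near +1
# and a lower one near -1.
def cost(u):
    return sum((x * x - 1.0) ** 2 + 0.3 * x for x in u)

def grad(u):
    return [4.0 * x * (x * x - 1.0) + 0.3 for x in u]

best = cem_gd(cost, grad, dim=2)
```

Refining several elites rather than only the CEM mean is what lets the gradient stage start from more than one basin, which is the property the abstract relies on to avoid poor local minima.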

Stochastic Planner-Actor-Critic for Unsupervised Deformable Image Registration

Authors:Ziwei Luo, Jing Hu, Xin Wang, Shu Hu, Bin Kong, Youbing Yin, Qi Song, Xi Wu, Siwei Lyu
Date:2021-12-14 14:08:56

Large deformations of organs, caused by diverse shapes and nonlinear shape changes, pose a significant challenge for medical image registration. Traditional registration methods need to iteratively optimize an objective function via a specific deformation model along with meticulous parameter tuning, yet have limited capability in registering images with large deformations. While deep learning-based methods can learn the complex mapping from input images to their respective deformation fields, they are regression-based and prone to getting stuck in local minima, particularly when large deformations are involved. To this end, we present Stochastic Planner-Actor-Critic (SPAC), a novel reinforcement learning-based framework that performs step-wise registration. The key notion is warping a moving image successively at each time step to finally align it to a fixed image. Considering that it is challenging to handle high-dimensional continuous action and state spaces in the conventional reinforcement learning (RL) framework, we introduce a new concept, 'Plan', to the standard Actor-Critic model, which is of low dimension and helps the actor generate a tractable high-dimensional action. The entire framework is based on unsupervised training and operates in an end-to-end manner. We evaluate our method on several 2D and 3D medical image datasets, some of which contain large deformations. Our empirical results highlight that our work achieves consistent, significant gains and outperforms state-of-the-art methods.

Branching Time Active Inference with Bayesian Filtering

Authors:Théophile Champion, Marek Grześ, Howard Bowman
Date:2021-12-14 14:01:07

Branching Time Active Inference (Champion et al., 2021b,a) is a framework proposing to look at planning as a form of Bayesian model expansion. Its roots can be found in Active Inference (Friston et al., 2016; Da Costa et al., 2020; Champion et al., 2021c), a neuroscientific framework widely used for brain modelling, as well as in Monte Carlo Tree Search (Browne et al., 2012), a method broadly applied in the Reinforcement Learning literature. Up to now, the inference of the latent variables was carried out by taking advantage of the flexibility offered by Variational Message Passing (Winn and Bishop, 2005), an iterative process that can be understood as sending messages along the edges of a factor graph (Forney, 2001). In this paper, we harness the efficiency of an alternative method for inference called Bayesian Filtering (Fox et al., 2003), which does not require the iteration of the update equations until convergence of the Variational Free Energy. Instead, this scheme alternates between two phases: integration of evidence and prediction of future states. Both of those phases can be performed efficiently, and this provides a seventy-fold speed-up over the state-of-the-art.
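
The alternation between integrating evidence and predicting future states can be sketched with a generic discrete Bayes filter. This is a textbook filter over a small finite state space, not the Branching Time Active Inference implementation; the function names and the matrix conventions are assumptions made for illustration.

```python
import numpy as np

def predict(belief, transition):
    # transition[i, j] = P(next state = j | current state = i)
    return belief @ transition

def integrate_evidence(belief, likelihood, obs):
    # likelihood[s, o] = P(observation = o | state = s)
    posterior = belief * likelihood[:, obs]
    return posterior / posterior.sum()

def bayes_filter(prior, transition, likelihood, observations):
    belief = np.asarray(prior, dtype=float)
    posteriors = []
    for obs in observations:
        # Phase 1: fold the new observation into the current belief.
        belief = integrate_evidence(belief, likelihood, obs)
        posteriors.append(belief)
        # Phase 2: push the belief one step forward through the dynamics.
        belief = predict(belief, transition)
    return posteriors, belief
```

Each phase is a single closed-form update, so no iteration to convergence is needed within a time step, which is the property the abstract contrasts with Variational Message Passing.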

A Reinforcement Learning-based Adaptive Control Model for Future Street Planning, An Algorithm and A Case Study

Authors:Qiming Ye, Yuxiang Feng, Jing Han, Marc Stettler, Panagiotis Angeloudis
Date:2021-12-10 10:32:46

With the emerging technologies in Intelligent Transportation System (ITS), the adaptive operation of road space is likely to be realised within decades. An intelligent street can learn and improve its decision-making on the right-of-way (ROW) for road users, liberating more active pedestrian space while maintaining traffic safety and efficiency. However, there is a lack of effective controlling techniques for these adaptive street infrastructures. To fill this gap in existing studies, we formulate this control problem as a Markov Game and develop a solution based on the multi-agent Deep Deterministic Policy Gradient (MADDPG) algorithm. The proposed model can dynamically assign ROW for sidewalks, autonomous vehicles (AVs) driving lanes and on-street parking areas in real-time. Integrated with the SUMO traffic simulator, this model was evaluated using the road network of the South Kensington District against three cases of divergent traffic conditions: pedestrian flow rates, AVs traffic flow rates and parking demands. Results reveal that our model can achieve an average reduction of 3.87% and 6.26% in street space assigned for on-street parking and vehicular operations. Combined with space gained by limiting the number of driving lanes, the average proportion of sidewalks to total widths of streets can significantly increase by 10.13%.

Learning Generalizable Behavior via Visual Rewrite Rules

Authors:Yiheng Xie, Mingxuan Li, Shangqun Yu, Michael Littman
Date:2021-12-09 21:23:26

Though deep reinforcement learning agents have achieved unprecedented success in recent years, their learned policies can be brittle, failing to generalize to even slight modifications of their environments or unfamiliar situations. The black-box nature of the neural network learning dynamics makes it impossible to audit trained deep agents and recover from such failures. In this paper, we propose a novel representation and learning approach to capture environment dynamics without using neural networks. It originates from the observation that, in games designed for people, the effect of an action can often be perceived in the form of local changes in consecutive visual observations. Our algorithm is designed to extract such vision-based changes and condense them into a set of action-dependent descriptive rules, which we call "visual rewrite rules" (VRRs). We also present preliminary results from a VRR agent that can explore, expand its rule set, and solve a game via planning with its learned VRR world model. In several classical games, our non-deep agent demonstrates superior performance, extreme sample efficiency, and robust generalization ability compared with several mainstream deep agents.

Model-Value Inconsistency as a Signal for Epistemic Uncertainty

Authors:Angelos Filos, Eszter Vértes, Zita Marinho, Gregory Farquhar, Diana Borsa, Abram Friesen, Feryal Behbahani, Tom Schaul, André Barreto, Simon Osindero
Date:2021-12-08 07:53:41

Using a model of the environment and a value function, an agent can construct many estimates of a state's value, by unrolling the model for different lengths and bootstrapping with its value function. Our key insight is that one can treat this set of value estimates as a type of ensemble, which we call an implicit value ensemble (IVE). Consequently, the discrepancy between these estimates can be used as a proxy for the agent's epistemic uncertainty; we term this signal model-value inconsistency, or self-inconsistency for short. Unlike prior work which estimates uncertainty by training an ensemble of many models and/or value functions, this approach requires only the single model and value function which are already being learned in most model-based reinforcement learning algorithms. We provide empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shifts, and (iii) for robustifying value-based planning with a learned model.
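
To make the construction concrete, here is a minimal tabular sketch of an implicit value ensemble. It assumes a deterministic model given as lookup tables and uses the standard deviation of the k-step estimates as the disagreement measure; these choices, and all names below, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def k_step_value(state, k, next_state, reward, value, gamma):
    # Unroll the (hypothetical, deterministic) model for k steps, then
    # bootstrap with the value function at the state that is reached.
    total, discount, s = 0.0, 1.0, state
    for _ in range(k):
        total += discount * reward[s]   # reward[s]: reward for leaving state s
        discount *= gamma
        s = next_state[s]
    return total + discount * value[s]

def self_inconsistency(state, n, next_state, reward, value, gamma=0.9):
    # The k = 0..n estimates form the implicit ensemble; their spread is
    # used here as a proxy for epistemic uncertainty.
    estimates = [k_step_value(state, k, next_state, reward, value, gamma)
                 for k in range(n + 1)]
    return float(np.std(estimates)), estimates
```

When the value function is consistent with the model (it satisfies the Bellman equation under that model), every k-step estimate coincides and the spread is zero; disagreement only appears where the model and value function conflict.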

Pragmatic Implementation of Reinforcement Algorithms For Path Finding On Raspberry Pi

Authors:Serena Raju, Sherin Shibu, Riya Mol Raji, Joel Thomas
Date:2021-12-07 09:00:14

In this paper, a pragmatic implementation of an indoor autonomous delivery system that exploits Reinforcement Learning algorithms for path planning and collision avoidance is examined. The proposed system is a cost-efficient approach in which a Raspberry Pi-controlled, four-wheel-drive, non-holonomic robot maps a grid. The approach computes and navigates the shortest path from a source key point to a destination key point to carry out the desired delivery. Q-learning and Deep Q-learning are used to find the optimal path while avoiding collisions with static obstacles. This work defines an approach to deploy these two algorithms on a robot. A novel algorithm to decode an array of directions into accurate movements in a certain action space is also proposed. The procedure followed to deploy this system with the said requirements is described, thereby presenting our proof of concept for indoor autonomous delivery vehicles.
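
The abstract does not specify how directions are decoded into movements, so the following is only a hypothetical illustration of the general idea: a robot that cannot move sideways must rotate to face each grid direction before driving forward. The command names and the four-heading convention are assumptions.

```python
HEADINGS = ["N", "E", "S", "W"]  # clockwise order

def decode(directions, heading="N"):
    # Convert a list of compass directions on a grid into turn/forward
    # commands for a robot that can only rotate in place and drive forward.
    commands = []
    for d in directions:
        turn = (HEADINGS.index(d) - HEADINGS.index(heading)) % 4
        if turn == 1:
            commands.append("turn_right")
        elif turn == 2:
            commands += ["turn_right", "turn_right"]
        elif turn == 3:
            commands.append("turn_left")
        commands.append("forward")
        heading = d
    return commands
```

For example, `decode(["N", "E", "E", "S"])` yields one forward move, a right turn followed by two forward moves, and another right turn followed by a forward move.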

Using Image Transformations to Learn Network Structure

Authors:Brayan Ortiz, Amitabh Sinha
Date:2021-12-06 23:28:38

Many learning tasks require observing a sequence of images and making a decision. In a transportation problem of designing and planning for shipping boxes between nodes, we show how to treat the network of nodes and the flows between them as images. These images have useful structural information that can be statistically summarized. Using image compression techniques, we reduce an image down to a set of numbers that contain interpretable geographic information that we call geographic signatures. Using geographic signatures, we learn network structure that can be utilized to recommend future network connectivity. We develop a Bayesian reinforcement algorithm that takes advantage of statistically summarized network information as priors and user-decisions to reinforce an agent's probabilistic decision. Additionally, we show how reinforcement learning can be used with compression directly without interpretation in simple tasks.

Reward-Free Attacks in Multi-Agent Reinforcement Learning

Authors:Ted Fujimoto, Timothy Doster, Adam Attarian, Jill Brandenberger, Nathan Hodas
Date:2021-12-02 02:36:09

We investigate how effective an attacker can be when it only learns from its victim's actions, without access to the victim's reward. In this work, we are motivated by the scenario where the attacker wants to behave strategically when the victim's motivations are unknown. We argue that one heuristic approach an attacker can use is to maximize the entropy of the victim's policy. The policy is generally not obfuscated, which implies it may be extracted simply by passively observing the victim. We provide such a strategy in the form of a reward-free exploration algorithm that maximizes the attacker's entropy during the exploration phase, and then maximizes the victim's empirical entropy during the planning phase. In our experiments, the victim agents are subverted through policy entropy maximization, implying an attacker might not need access to the victim's reward to succeed. Hence, reward-free attacks, which are based only on observing behavior, show the feasibility of an attacker to act strategically without knowledge of the victim's motives even if the victim's reward information is protected.

Joint Cluster Head Selection and Trajectory Planning in UAV-Aided IoT Networks by Reinforcement Learning with Sequential Model

Authors:Botao Zhu, Ebrahim Bedeer, Ha H. Nguyen, Robert Barton, Jerome Henry
Date:2021-12-01 07:59:53

Employing unmanned aerial vehicles (UAVs) has attracted growing interest and emerged as the state-of-the-art technology for data collection in Internet-of-Things (IoT) networks. In this paper, with the objective of minimizing the total energy consumption of the UAV-IoT system, we formulate the problem of jointly designing the UAV's trajectory and selecting cluster heads in the IoT network as a constrained combinatorial optimization problem which is classified as NP-hard and challenging to solve. We propose a novel deep reinforcement learning (DRL) with a sequential model strategy that can effectively learn the policy represented by a sequence-to-sequence neural network for the UAV's trajectory design in an unsupervised manner. Through extensive simulations, the obtained results show that the proposed DRL method can find a UAV trajectory that requires much less energy than those of other baseline algorithms and achieves close-to-optimal performance. In addition, simulation results show that the model trained by our proposed DRL algorithm has an excellent generalization ability to larger problem sizes without the need to retrain the model.

Energy-Efficient Autonomous Driving Using Cognitive Driver Behavioral Models and Reinforcement Learning

Authors:Huayi Li, Nan Li, Ilya Kolmanovsky, Anouck Girard
Date:2021-11-27 19:03:18

Autonomous driving technologies are expected to not only improve mobility and road safety but also bring energy efficiency benefits. In the foreseeable future, autonomous vehicles (AVs) will operate on roads shared with human-driven vehicles. To maintain safety and liveness while simultaneously minimizing energy consumption, the AV planning and decision-making process should account for interactions between the autonomous ego vehicle and surrounding human-driven vehicles. In this chapter, we describe a framework for developing energy-efficient autonomous driving policies on shared roads by exploiting human-driver behavior modeling based on cognitive hierarchy theory and reinforcement learning.

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

Authors:Ling Pan, Longbo Huang, Tengyu Ma, Huazhe Xu
Date:2021-11-22 13:27:42

Conservatism has led to significant progress in offline reinforcement learning (RL) where an agent learns from pre-collected datasets. However, as many real-world scenarios involve interaction among multiple agents, it is important to resolve offline RL in the multi-agent setting. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer to multi-agent settings directly. Surprisingly, we empirically observe that conservative offline RL algorithms do not work well in the multi-agent setting -- the performance degrades significantly with an increasing number of agents. Towards mitigating the degradation, we identify a key issue that non-concavity of the value function makes the policy gradient improvements prone to local optima. Multiple agents exacerbate the problem severely, since the suboptimal policy by any agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which combines the first-order policy gradients and zeroth-order optimization methods to better optimize the conservative value functions over the actor parameters. Despite the simplicity, OMAR achieves state-of-the-art results in a variety of multi-agent control tasks.

UMBRELLA: Uncertainty-Aware Model-Based Offline Reinforcement Learning Leveraging Planning

Authors:Christopher Diehl, Timo Sievernich, Martin Krüger, Frank Hoffmann, Torsten Bertram
Date:2021-11-22 10:37:52

Offline reinforcement learning (RL) provides a framework for learning decision-making from offline data and therefore constitutes a promising approach for real-world applications such as automated driving. Self-driving vehicles (SDV) learn a policy, which potentially even outperforms the behavior in the sub-optimal data set. Especially in safety-critical applications such as automated driving, explainability and transferability are key to success. This motivates the use of model-based offline RL approaches, which leverage planning. However, current state-of-the-art methods often neglect the influence of aleatoric uncertainty arising from the stochastic behavior of multi-agent systems. This work proposes a novel approach for Uncertainty-aware Model-Based Offline REinforcement Learning Leveraging plAnning (UMBRELLA), which solves the prediction, planning, and control problem of the SDV jointly in an interpretable learning-based fashion. A trained action-conditioned stochastic dynamics model captures distinctively different future evolutions of the traffic scene. The analysis provides empirical evidence for the effectiveness of our approach in challenging automated driving simulations and based on a real-world public dataset.

Vulcan: Solving the Steiner Tree Problem with Graph Neural Networks and Deep Reinforcement Learning

Authors:Haizhou Du, Zong Yan, Qiao Xiang, Qinqing Zhan
Date:2021-11-21 12:53:50

The Steiner Tree Problem (STP) in graphs aims to find a tree of minimum weight in the graph that connects a given set of vertices. It is a classic NP-hard combinatorial optimization problem and has many real-world applications (e.g., VLSI chip design, transportation network planning and wireless sensor networks). Many exact and approximate algorithms have been developed for STP, but they suffer from high computational complexity and weak worst-case solution guarantees, respectively. Heuristic algorithms have also been developed. However, each of them requires application domain knowledge to design and is only suitable for specific scenarios. Motivated by the recently reported observation that instances of the same NP-hard combinatorial problem may maintain the same or similar combinatorial structure but mainly differ in their data, we investigate the feasibility and benefits of applying machine learning techniques to solving STP. To this end, we design a novel model Vulcan based on novel graph neural networks and deep reinforcement learning. The core of Vulcan is a novel, compact graph embedding that transforms high-dimensional graph structure data (i.e., path-changed information) into a low-dimensional vector representation. Given an STP instance, Vulcan uses this embedding to encode its path-related information and sends the encoded graph to a deep reinforcement learning component based on a double deep Q network (DDQN) to find solutions. In addition to STP, Vulcan can also find solutions to a wide range of NP-hard problems (e.g., SAT, MVC and X3C) by reducing them to STP. We implement a prototype of Vulcan and demonstrate its efficacy and efficiency with extensive experiments using real-world and synthetic datasets.

Q-Learning Based Energy-Efficient Network Planning in IP-over-EON

Authors:Pramit Biswas, Md Shahbaz Akhtar, Aneek Adhya, Sriparna Saha, Sudhan Majhi
Date:2021-11-20 18:58:03

During the network planning phase, optimal network planning implemented through efficient resource allocation and static traffic demand provisioning in an IP-over-elastic optical network (IP-over-EON) is significantly more challenging than in a fixed-grid wavelength division multiplexing (WDM) network due to the increased flexibility of IP-over-EON. Mathematical optimization models used for this purpose may not provide a solution for large networks due to their high computational complexity. In this regard, a greedy heuristic may be used that intuitively selects traffic elements in sequence from the static traffic demand matrix and attempts to find the best solution. However, in general, such greedy heuristics offer suboptimal solutions, since the traffic sequence offering the optimal performance is rarely selected. We therefore propose a reinforcement learning technique (in particular a Q-learning method), combined with an auxiliary graph (AG)-based energy-efficient greedy method, to be used for large network planning. The Q-learning method is used to decide a suitable sequence of traffic allocation such that the overall power consumption in the network is reduced. In the proposed heuristic, each traffic demand from the given static traffic demand matrix is successively selected using the Q-learning technique and provisioned using the AG-based greedy method.

Successor Feature Landmarks for Long-Horizon Goal-Conditioned Reinforcement Learning

Authors:Christopher Hoang, Sungryull Sohn, Jongwook Choi, Wilka Carvalho, Honglak Lee
Date:2021-11-18 18:36:05

Operating in the real-world often requires agents to learn about a complex environment and apply this understanding to achieve a breadth of goals. This problem, known as goal-conditioned reinforcement learning (GCRL), becomes especially challenging for long-horizon goals. Current methods have tackled this problem by augmenting goal-conditioned policies with graph-based planning algorithms. However, they struggle to scale to large, high-dimensional state spaces and assume access to exploration mechanisms for efficiently collecting training data. In this work, we introduce Successor Feature Landmarks (SFL), a framework for exploring large, high-dimensional environments so as to obtain a policy that is proficient for any goal. SFL leverages the ability of successor features (SF) to capture transition dynamics, using it to drive exploration by estimating state-novelty and to enable high-level planning by abstracting the state-space as a non-parametric landmark-based graph. We further exploit SF to directly compute a goal-conditioned policy for inter-landmark traversal, which we use to execute plans to "frontier" landmarks at the edge of the explored state space. We show in our experiments on MiniGrid and ViZDoom that SFL enables efficient exploration of large, high-dimensional state spaces and outperforms state-of-the-art baselines on long-horizon GCRL tasks.

Learning to Execute: Efficient Learning of Universal Plan-Conditioned Policies in Robotics

Authors:Ingmar Schubert, Danny Driess, Ozgur S. Oguz, Marc Toussaint
Date:2021-11-15 16:58:50

Applications of Reinforcement Learning (RL) in robotics are often limited by high data demand. On the other hand, approximate models are readily available in many robotics scenarios, making model-based approaches like planning a data-efficient alternative. Still, the performance of these methods suffers if the model is imprecise or wrong. In this sense, the respective strengths and weaknesses of RL and model-based planners are complementary. In the present work, we investigate how both approaches can be integrated into one framework that combines their strengths. We introduce Learning to Execute (L2E), which leverages information contained in approximate plans to learn universal policies that are conditioned on plans. In our robotic manipulation experiments, L2E exhibits increased performance when compared to pure RL, pure planning, or baseline methods combining learning and planning.

Learning Representations for Pixel-based Control: What Matters and Why?

Authors:Manan Tomar, Utkarsh A. Mishra, Amy Zhang, Matthew E. Taylor
Date:2021-11-15 14:16:28

Learning representations for pixel-based control has garnered significant attention recently in reinforcement learning. A wide range of methods have been proposed to enable efficient learning, leading to sample complexities similar to those in the full state setting. However, moving beyond carefully curated pixel data sets (centered crop, appropriate lighting, clear background, etc.) remains challenging. In this paper, we adopt a more difficult setting, incorporating background distractors, as a first step towards addressing this challenge. We present a simple baseline approach that can learn meaningful representations with no metric-based learning, no data augmentations, no world-model learning, and no contrastive learning. We then analyze when and why previously proposed methods are likely to fail or reduce to the same performance as the baseline in this harder setting and why we should think carefully about extending such methods beyond the well curated environments. Our results show that finer categorization of benchmarks on the basis of characteristics like density of reward, planning horizon of the problem, presence of task-irrelevant components, etc., is crucial in evaluating algorithms. Based on these observations, we propose different metrics to consider when evaluating an algorithm on benchmark tasks. We hope such a data-centric view can motivate researchers to rethink representation learning when investigating how to best apply RL to real-world tasks.

Measuring Outcomes in Healthcare Economics using Artificial Intelligence: with Application to Resource Management

Authors:Chih-Hao Huang, Feras A. Batarseh, Adel Boueiz, Ajay Kulkarni, Po-Hsuan Su, Jahan Aman
Date:2021-11-15 02:39:39

The quality of service in healthcare is constantly challenged by outlier events such as pandemics (e.g., Covid-19) and natural disasters (such as hurricanes and earthquakes). In most cases, such events lead to critical uncertainties in decision making, as well as in multiple medical and economic aspects at a hospital. External (geographic) or internal (medical and managerial) factors lead to shifts in planning and budgeting and, most importantly, reduce confidence in conventional processes. In some cases, support from other hospitals proves necessary, which further complicates planning. This manuscript presents three data-driven methods that provide indicators to help healthcare managers organize their economics and identify the optimal plan for resource allocation and sharing. Conventional decision-making methods fall short in recommending validated policies for managers. Using reinforcement learning, genetic algorithms, traveling salesman, and clustering, we experimented with different healthcare variables and presented tools and outcomes that could be applied at health institutes. Experiments are performed; the results are recorded, evaluated, and presented.

Two steps to risk sensitivity

Authors:Chris Gagne, Peter Dayan
Date:2021-11-12 16:27:47

Distributional reinforcement learning (RL) -- in which agents learn about all the possible long-term consequences of their actions, and not just the expected value -- is of great recent interest. One of the most important affordances of a distributional view is facilitating a modern, measured, approach to risk when outcomes are not completely certain. By contrast, psychological and neuroscientific investigations into decision making under risk have utilized a variety of more venerable theoretical models such as prospect theory that lack axiomatically desirable properties such as coherence. Here, we consider a particularly relevant risk measure for modeling human and animal planning, called conditional value-at-risk (CVaR), which quantifies worst-case outcomes (e.g., vehicle accidents or predation). We first adopt a conventional distributional approach to CVaR in a sequential setting and reanalyze the choices of human decision-makers in the well-known two-step task, revealing substantial risk aversion that had been lurking under stickiness and perseveration. We then consider a further critical property of risk sensitivity, namely time consistency, showing alternatives to this form of CVaR that enjoy this desirable characteristic. We use simulations to examine settings in which the various forms differ in ways that have implications for human and animal planning and behavior.
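
As a concrete reference for the risk measure named in this abstract, the following is a minimal sketch of the simple empirical CVaR estimator: the mean of the worst alpha-fraction of sampled returns. It illustrates the static, single-shot quantity only and does not cover the sequential or time-consistent variants the paper discusses; the function name and the tail-size rounding rule are assumptions.

```python
import numpy as np

def cvar(returns, alpha):
    # Empirical CVaR at level alpha: average of the lowest alpha-fraction
    # of outcomes (alpha = 1 recovers the ordinary mean).
    returns = np.sort(np.asarray(returns, dtype=float))
    n_tail = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:n_tail].mean()
```

Smaller values of alpha focus on an ever smaller set of worst outcomes, so a risk-averse agent that ranks actions by `cvar(samples, alpha)` with low alpha will avoid actions with rare but severe losses even when their expected value is high.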

Distilling Motion Planner Augmented Policies into Visual Control Policies for Robot Manipulation

Authors:I-Chun Arthur Liu, Shagun Uppal, Gaurav S. Sukhatme, Joseph J. Lim, Peter Englert, Youngwoon Lee
Date:2021-11-11 18:52:00

Learning complex manipulation tasks in realistic, obstructed environments is a challenging problem due to hard exploration in the presence of obstacles and high-dimensional visual observations. Prior work tackles the exploration problem by integrating motion planning and reinforcement learning. However, the motion planner augmented policy requires access to state information, which is often not available in the real-world settings. To this end, we propose to distill a state-based motion planner augmented policy to a visual control policy via (1) visual behavioral cloning to remove the motion planner dependency along with its jittery motion, and (2) vision-based reinforcement learning with the guidance of the smoothed trajectories from the behavioral cloning agent. We evaluate our method on three manipulation tasks in obstructed environments and compare it against various reinforcement learning and imitation learning baselines. The results demonstrate that our framework is highly sample-efficient and outperforms the state-of-the-art algorithms. Moreover, coupled with domain randomization, our policy is capable of zero-shot transfer to unseen environment settings with distractors. Code and videos are available at https://clvrai.com/mopa-pd

VeSoNet: Traffic-Aware Content Caching for Vehicular Social Networks based on Path Planning and Deep Reinforcement Learning

Authors:Nyothiri Aung, Sahraoui Dhelim, Liming Chen, Wenyin Zhang, Abderrahmane Lakas, Huansheng Ning
Date:2021-11-10 08:28:35

Vehicular social networking is an emerging application of the promising Internet of Vehicles (IoV) which aims to achieve the seamless integration of vehicular networks and social networks. However, the unique characteristics of vehicular networks such as high mobility and frequent communication interruptions make content delivery to end-users under strict delay constraints an extremely challenging task. In this paper, we propose a social-aware vehicular edge computing architecture that solves the content delivery problem by using some of the vehicles in the network as edge servers that can store and stream popular content to close-by end-users. The proposed architecture includes three components. First, we propose a social-aware graph pruning search algorithm that computes and assigns the vehicles to the shortest path with the most relevant vehicular content providers. Second, we use a traffic-aware content recommendation scheme to recommend relevant content according to their social context. This scheme uses graph embeddings in which the vehicles are represented by a set of low-dimension vectors (vehicle2vec) to store information about previously consumed content. Finally, we propose a Deep Reinforcement Learning (DRL) method to optimize the distribution of content-provider vehicles across the network. The results obtained from a realistic traffic simulation show the effectiveness and robustness of the proposed system when compared to the state-of-the-art baselines.

Spatially and Seamlessly Hierarchical Reinforcement Learning for State Space and Policy space in Autonomous Driving

Authors:Jaehyun Kim, Jaeseung Jeong
Date:2021-11-10 01:35:14

Despite advances in hierarchical reinforcement learning, its applications to path planning in autonomous driving on highways are challenging. One reason is that conventional hierarchical reinforcement learning approaches are not amenable to autonomous driving due to its riskiness: the agent must move while avoiding multiple obstacles, such as other agents that are highly unpredictable, so safe regions are small, scattered, and changeable over time. To overcome this challenge, we propose a spatially hierarchical reinforcement learning method for state space and policy space. The high-level policy selects not only a behavioral sub-policy but also the regions to attend to in state space and the outline to follow in policy space. Subsequently, the low-level policy elaborates the short-term goal position of the agent within the outline of the region selected by the high-level command. The network structure and optimization suggested in our method are as concise as those of single-level methods. Experiments in environments with various road shapes showed that our method finds nearly optimal policies from early episodes, outperforming a baseline hierarchical reinforcement learning method, especially on narrow and complex roads. The resulting trajectories on the roads were similar to those of human strategies on the behavioral planning level.

Risk Sensitive Model-Based Reinforcement Learning using Uncertainty Guided Planning

Authors:Stefan Radic Webster, Peter Flach
Date:2021-11-09 07:28:00

Identifying uncertainty and taking mitigating actions is crucial for safe and trustworthy reinforcement learning agents, especially when deployed in high-risk environments. In this paper, risk sensitivity is promoted in a model-based reinforcement learning algorithm by exploiting the ability of a bootstrap ensemble of dynamics models to estimate environment epistemic uncertainty. We propose uncertainty guided cross-entropy method planning, which penalises action sequences that result in high variance state predictions during model rollouts, guiding the agent to known areas of the state space with low uncertainty. Experiments display the ability for the agent to identify uncertain regions of the state space during planning and to take actions that maintain the agent within high confidence areas, without the requirement of explicit constraints. The result is a reduction in the performance in terms of attaining reward, displaying a trade-off between risk and return.
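
The penalty described in this abstract can be sketched as a scoring function that a CEM planner would use to rank candidate action sequences. The sketch below assumes an ensemble of toy linear dynamics models, evaluates reward on the ensemble-mean prediction, and subtracts the summed variance of state predictions across ensemble members; the model form, reward handling, and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rollout(model, state, actions):
    # model is a hypothetical (A, B) pair for linear dynamics s' = A s + B a.
    A, B = model
    states = []
    for a in actions:
        state = A @ state + B @ a
        states.append(state)
    return np.stack(states)  # shape: (horizon, state_dim)

def penalised_score(models, state, actions, reward_fn, penalty=1.0):
    # Roll the same action sequence through every ensemble member.
    trajs = np.stack([rollout(m, state, actions) for m in models])
    # Reward is evaluated on the mean prediction (an assumption).
    reward = sum(reward_fn(s) for s in trajs.mean(axis=0))
    # Disagreement between members stands in for epistemic uncertainty.
    disagreement = trajs.var(axis=0).sum()
    return reward - penalty * disagreement
```

With `penalty=0` this reduces to an ordinary planning score; increasing `penalty` trades expected reward for staying in regions where the ensemble members agree, which mirrors the risk-return trade-off reported in the abstract.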

Multi-Agent Deep Reinforcement Learning For Optimising Energy Efficiency of Fixed-Wing UAV Cellular Access Points

Authors:Boris Galkin, Babatunji Omoniwa, Ivana Dusparic
Date:2021-11-03 14:49:17

Unmanned Aerial Vehicles (UAVs) promise to become an intrinsic part of next generation communications, as they can be deployed to provide wireless connectivity to ground users to supplement existing terrestrial networks. The majority of the existing research into the use of UAV access points for cellular coverage considers rotary-wing UAV designs (i.e. quadcopters). However, we expect fixed-wing UAVs to be more appropriate for connectivity purposes in scenarios where long flight times are necessary (such as for rural coverage), as fixed-wing UAVs rely on a more energy-efficient form of flight when compared to the rotary-wing design. As fixed-wing UAVs are typically incapable of hovering in place, their deployment optimisation involves optimising their individual flight trajectories in a way that allows them to deliver high quality service to the ground users in an energy-efficient manner. In this paper, we propose a multi-agent deep reinforcement learning approach to optimise the energy efficiency of fixed-wing UAV cellular access points while still allowing them to deliver high-quality service to users on the ground. In our decentralized approach, each UAV is equipped with a Dueling Deep Q-Network (DDQN) agent which can adjust the 3D trajectory of the UAV over a series of timesteps. By coordinating with their neighbours, the UAVs adjust their individual flight trajectories in a manner that optimises the total system energy efficiency. We benchmark the performance of our approach against a series of heuristic trajectory planning strategies, and demonstrate that our method can improve the system energy efficiency by as much as 70%.

Deployment Optimization for Shared e-Mobility Systems with Multi-agent Deep Neural Search

Authors:Man Luo, Bowen Du, Konstantin Klemmer, Hongming Zhu, Hongkai Wen
Date:2021-11-03 11:37:11

Shared e-mobility services have been widely tested and piloted in cities across the globe, and already woven into the fabric of modern urban planning. This paper studies a practical yet important problem in those systems: how to deploy and manage their infrastructure across space and time, so that the services are ubiquitous to the users while sustainable in profitability. However, in real-world systems evaluating the performance of different deployment strategies and then finding the optimal plan is prohibitively expensive, as it is often infeasible to conduct many iterations of trial-and-error. We tackle this by designing a high-fidelity simulation environment, which abstracts the key operation details of the shared e-mobility systems at fine-granularity, and is calibrated using data collected from the real-world. This allows us to try out arbitrary deployment plans to learn the optimal one for a given context, before actually implementing any in the real-world systems. In particular, we propose a novel multi-agent neural search approach, in which we design a hierarchical controller to produce tentative deployment plans. The generated deployment plans are then tested using a multi-simulation paradigm, i.e., evaluated in parallel, where the results are used to train the controller with deep reinforcement learning. With this closed loop, the controller can be steered to have higher probability of generating better deployment plans in future iterations. The proposed approach has been evaluated extensively in our simulation environment, and experimental results show that it outperforms baselines, e.g., human knowledge and state-of-the-art heuristic-based optimization approaches, in both service coverage and net revenue.

Procedural Generalization by Planning with Self-Supervised World Models

Authors:Ankesh Anand, Jacob Walker, Yazhe Li, Eszter Vértes, Julian Schrittwieser, Sherjil Ozair, Théophane Weber, Jessica B. Hamrick
Date:2021-11-02 13:32:21

One of the key promises of model-based reinforcement learning is the ability to generalize using an internal model of the world to make predictions in novel environments and tasks. However, the generalization ability of model-based agents is not well understood because existing work has focused on model-free agents when benchmarking generalization. Here, we explicitly measure the generalization ability of model-based agents in comparison to their model-free counterparts. We focus our analysis on MuZero (Schrittwieser et al., 2020), a powerful model-based agent, and evaluate its performance on both procedural and task generalization. We identify three factors of procedural generalization -- planning, self-supervised representation learning, and procedural data diversity -- and show that by combining these techniques, we achieve state-of-the-art generalization performance and data efficiency on Procgen (Cobbe et al., 2019). However, we find that these factors do not always provide the same benefits for the task generalization benchmarks in Meta-World (Yu et al., 2019), indicating that transfer remains a challenge and may require different approaches than procedural generalization. Overall, we suggest that building generalizable agents requires moving beyond the single-task, model-free paradigm and towards self-supervised model-based agents that are trained in rich, procedural, multi-task environments.

Learning to Explore by Reinforcement over High-Level Options

Authors:Liu Juncheng, McCane Brendan, Mills Steven
Date:2021-11-02 04:21:34

Autonomous 3D environment exploration is a fundamental task for various applications such as navigation. The goal of exploration is to investigate a new environment and build its occupancy map efficiently. In this paper, we propose a new method which grants an agent two intertwined options of behaviors: "look-around" and "frontier navigation". This is implemented by an option-critic architecture and trained by reinforcement learning algorithms. In each timestep, an agent produces an option and a corresponding action according to the policy. We also take advantage of macro-actions by incorporating classic path-planning techniques to increase training efficiency. We demonstrate the effectiveness of the proposed method on two publicly available 3D environment datasets and the results show our method achieves higher coverage than competing techniques with better efficiency.

A Decentralized Reinforcement Learning Framework for Efficient Passage of Emergency Vehicles

Authors:Haoran Su, Yaofeng Desmond Zhong, Biswadip Dey, Amit Chakraborty
Date:2021-10-30 16:13:48

Emergency vehicles (EMVs) play a critical role in a city's response to time-critical events such as medical emergencies and fire outbreaks. The existing approaches to reduce EMV travel time employ route optimization and traffic signal pre-emption without accounting for the coupling between these two subproblems. As a result, the planned route often becomes suboptimal. In addition, these approaches also do not focus on minimizing disruption to the overall traffic flow. To address these issues, we introduce EMVLight in this paper. This is a decentralized reinforcement learning (RL) framework for simultaneous dynamic routing and traffic signal control. EMVLight extends Dijkstra's algorithm to efficiently update the optimal route for an EMV in real-time as it travels through the traffic network. Consequently, the decentralized RL agents learn network-level cooperative traffic signal phase strategies that reduce EMV travel time and the average travel time of non-EMVs in the network. We have carried out comprehensive experiments with synthetic and real-world maps to demonstrate this benefit. Our results show that EMVLight outperforms benchmark transportation engineering techniques as well as existing RL-based traffic signal control methods.
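
For readers unfamiliar with the routing building block mentioned above, here is a minimal sketch of standard Dijkstra's algorithm, re-run after an edge travel time changes. This is only the textbook baseline: the abstract does not describe how EMVLight extends it, so the simple recomputation shown here, the toy graph, and the helper names `dijkstra` and `route` are all assumptions for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Textbook Dijkstra on a dict-of-dicts graph {u: {v: travel_time}}.

    Returns (dist, prev): shortest travel times from `source` and the
    predecessor of each reached node on a shortest path.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def route(prev, source, target):
    """Rebuild the path source -> target (assumes target is reachable)."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical intersections A..D with link travel times.
    graph = {
        "A": {"B": 2.0, "C": 6.0},
        "B": {"C": 1.0, "D": 4.5},
        "C": {"D": 1.0},
        "D": {},
    }
    dist, prev = dijkstra(graph, "A")
    print(route(prev, "A", "D"), dist["D"])

    # Congestion appears on link B -> C; recompute the route.
    graph["B"]["C"] = 10.0
    dist, prev = dijkstra(graph, "A")
    print(route(prev, "A", "D"), dist["D"])
```

Recomputing from scratch is the simplest way to react to changing link times; a real-time system would typically want something incremental, which is presumably what the extension in the paper addresses.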

Learning Coordinated Terrain-Adaptive Locomotion by Imitating a Centroidal Dynamics Planner

Authors:Philemon Brakel, Steven Bohez, Leonard Hasenclever, Nicolas Heess, Konstantinos Bousmalis
Date:2021-10-30 14:24:39

Dynamic quadruped locomotion over challenging terrains with precise foot placements is a hard problem for both optimal control methods and Reinforcement Learning (RL). Non-linear solvers can produce coordinated constraint satisfying motions, but often take too long to converge for online application. RL methods can learn dynamic reactive controllers but require carefully tuned shaping rewards to produce good gaits and can have trouble discovering precise coordinated movements. Imitation learning circumvents this problem and has been used with motion capture data to extract quadruped gaits for flat terrains. However, it would be costly to acquire motion capture data for a very large variety of terrains with height differences. In this work, we combine the advantages of trajectory optimization and learning methods and show that terrain adaptive controllers can be obtained by training policies to imitate trajectories that have been planned over procedural terrains by a non-linear solver. We show that the learned policies transfer to unseen terrains and can be fine-tuned to dynamically traverse challenging terrains that require precise foot placements and are very hard to solve with standard RL.

Sparsely Changing Latent States for Prediction and Planning in Partially Observable Domains

Authors:Christian Gumbsch, Martin V. Butz, Georg Martius
Date:2021-10-29 17:50:44

A common approach to prediction and planning in partially observable domains is to use recurrent neural networks (RNNs), which ideally develop and maintain a latent memory about hidden, task-relevant factors. We hypothesize that many of these hidden factors in the physical world are constant over time, changing only sparsely. To study this hypothesis, we propose Gated $L_0$ Regularized Dynamics (GateL0RD), a novel recurrent architecture that incorporates the inductive bias to maintain stable, sparsely changing latent states. The bias is implemented by means of a novel internal gating function and a penalty on the $L_0$ norm of latent state changes. We demonstrate that GateL0RD can compete with or outperform state-of-the-art RNNs in a variety of partially observable prediction and control tasks. GateL0RD tends to encode the underlying generative factors of the environment, ignores spurious temporal dependencies, and generalizes better, improving sampling efficiency and overall performance in model-based planning and reinforcement learning tasks. Moreover, we show that the developing latent states can be easily interpreted, which is a step towards better explainability in RNNs.

Brick-by-Brick: Combinatorial Construction with Deep Reinforcement Learning

Authors:Hyunsoo Chung, Jungtaek Kim, Boris Knyazev, Jinhwi Lee, Graham W. Taylor, Jaesik Park, Minsu Cho
Date:2021-10-29 01:09:51

Discovering a solution in a combinatorial space is prevalent in many real-world problems but it is also challenging due to diverse complex constraints and the vast number of possible combinations. To address such a problem, we introduce a novel formulation, combinatorial construction, which requires a building agent to assemble unit primitives (i.e., LEGO bricks) sequentially -- every connection between two bricks must follow a fixed rule, while no bricks mutually overlap. To construct a target object, we provide incomplete knowledge about the desired target (i.e., 2D images) instead of exact and explicit volumetric information to the agent. This problem requires a comprehensive understanding of partial information and long-term planning to append a brick sequentially, which leads us to employ reinforcement learning. The approach has to consider a variable-sized action space where a large number of invalid actions, which would cause overlap between bricks, exist. To resolve these issues, our model, dubbed Brick-by-Brick, adopts an action validity prediction network that efficiently filters invalid actions for an actor-critic network. We demonstrate that the proposed method successfully learns to construct an unseen object conditioned on a single image or multiple views of a target object.

A hierarchical behavior prediction framework at signalized intersections

Authors:Zhen Yang, Rusheng Zhang, Henry X. Liu
Date:2021-10-29 00:05:18

Road user behavior prediction is one of the most critical components in trajectory planning for autonomous driving, especially in urban scenarios involving traffic signals. In this paper, a hierarchical framework is proposed to predict vehicle behaviors at a signalized intersection, using the traffic signal information of the intersection. The framework is composed of two phases: a discrete intention prediction phase and a continuous trajectory prediction phase. In the discrete intention prediction phase, a Bayesian network is adopted to predict the vehicle's high-level intention. After that, maximum entropy inverse reinforcement learning is utilized to learn the human driving model offline; during the online trajectory prediction phase, a driver characteristic is designed and updated to capture the different driving preferences between human drivers. We applied the proposed framework to one of the most challenging scenarios in autonomous driving: the yellow light running scenario. Numerical experiment results are presented in the later part of the paper which show the viability of the method. The accuracy of the Bayesian network for discrete intention prediction is 91.1%, and the prediction results become increasingly accurate as the yellow time elapses. The average Euclidean distance error in continuous trajectory prediction is only 0.85 m in the yellow light running scenario.
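
The trajectory metric quoted above can be made concrete with a small sketch. It assumes the usual definition of average Euclidean distance error (the mean point-wise distance between predicted and observed positions at matching timesteps); the abstract does not state the exact averaging used, and the function name and sample points are hypothetical.

```python
import math


def average_displacement_error(predicted, actual):
    """Mean Euclidean distance between matching trajectory points.

    `predicted` and `actual` are equal-length sequences of (x, y)
    positions in metres, one per timestep.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.dist(p, a) for p, a in zip(predicted, actual))
    return total / len(predicted)


if __name__ == "__main__":
    pred = [(0.0, 0.0), (3.0, 4.0)]
    true = [(0.0, 0.0), (0.0, 0.0)]
    print(average_displacement_error(pred, true))
```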

Average-Reward Learning and Planning with Options

Authors:Yi Wan, Abhishek Naik, Richard S. Sutton
Date:2021-10-26 16:58:05

We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.

Self-Consistent Models and Values

Authors:Gregory Farquhar, Kate Baumli, Zita Marinho, Angelos Filos, Matteo Hessel, Hado van Hasselt, David Silver
Date:2021-10-25 12:09:42

Learned models of the environment provide reinforcement learning (RL) agents with flexible ways of making predictions about the environment. In particular, models enable planning, i.e. using more computation to improve value functions or policies, without requiring additional environment interactions. In this work, we investigate a way of augmenting model-based RL, by additionally encouraging a learned model and value function to be jointly self-consistent. Our approach differs from classic planning methods such as Dyna, which only update values to be consistent with the model. We propose multiple self-consistency updates, evaluate these in both tabular and function approximation settings, and find that, with appropriate choices, self-consistency helps both policy evaluation and control.

A Differentiable Newton-Euler Algorithm for Real-World Robotics

Authors:Michael Lutter, Johannes Silberbauer, Joe Watson, Jan Peters
Date:2021-10-24 12:19:41

Obtaining dynamics models is essential for robotics to achieve accurate model-based controllers and simulators for planning. The dynamics models are typically obtained using model specification of the manufacturer or simple numerical methods such as linear regression. However, this approach does not guarantee physically plausible parameters and can only be applied to kinematic chains consisting of rigid bodies. In this article, we describe a differentiable simulator that can be used to identify the system parameters of real-world mechanical systems with complex friction models, holonomic as well as non-holonomic constraints. To guarantee physically consistent parameters, we utilize virtual parameters and gradient-based optimization. The described Differentiable Newton-Euler Algorithm (DiffNEA) can be applied to a class of dynamical systems and guarantees physically plausible predictions. The extensive experimental evaluation shows that the proposed model learning approach learns accurate dynamics models of systems with complex friction and non-holonomic constraints. Especially in the offline reinforcement learning experiments, the identified DiffNEA models excel. For the challenging ball in a cup task, these models solve the task using model-based offline reinforcement learning on the physical system. The black-box baselines fail on this task in simulation and on the physical system despite using more data for learning the model.

DiffSRL: Learning Dynamical State Representation for Deformable Object Manipulation with Differentiable Simulator

Authors:Sirui Chen, Yunhao Liu, Jialong Li, Shang Wen Yao, Tingxiang Fan, Jia Pan
Date:2021-10-24 04:53:58

Dynamic state representation learning is an important task in robot learning. Latent space that can capture dynamics related information has wide application in areas such as accelerating model free reinforcement learning, closing the simulation to reality gap, as well as reducing the motion planning complexity. However, current dynamic state representation learning methods scale poorly on complex dynamic systems such as deformable objects, and cannot directly embed well defined simulation function into the training pipeline. We propose DiffSRL, a dynamic state representation learning pipeline utilizing differentiable simulation that can embed complex dynamics models as part of the end-to-end training. We also integrate differentiable dynamic constraints as part of the pipeline which provide incentives for the latent state to be aware of dynamical constraints. We further establish a state representation learning benchmark on a soft-body simulation system, PlasticineLab, and our model demonstrates superior performance in terms of capturing long-term dynamics as well as reward prediction.

C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks

Authors:Tianjun Zhang, Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine, Joseph E. Gonzalez
Date:2021-10-22 22:05:31

Goal-conditioned reinforcement learning (RL) can solve tasks in a wide range of domains, including navigation and manipulation, but learning to reach distant goals remains a central challenge to the field. Learning to reach such goals is particularly hard without any offline data, expert demonstrations, and reward shaping. In this paper, we propose an algorithm to solve the distant goal-reaching task by using search at training time to automatically generate a curriculum of intermediate states. Our algorithm, Classifier-Planning (C-Planning), frames the learning of the goal-conditioned policies as expectation maximization: the E-step corresponds to planning an optimal sequence of waypoints using graph search, while the M-step aims to learn a goal-conditioned policy to reach those waypoints. Unlike prior methods that combine goal-conditioned RL with graph search, ours performs search only during training and not testing, significantly decreasing the compute costs of deploying the learned policy. Empirically, we demonstrate that our method is more sample efficient than prior methods. Moreover, it is able to solve very long-horizon manipulation and navigation tasks that prior goal-conditioned methods and methods based on graph search fail to solve.

Feedback Linearization of Car Dynamics for Racing via Reinforcement Learning

Authors:Michael Estrada, Sida Li, Xiangyu Cai
Date:2021-10-20 09:11:18

Through the method of Learning Feedback Linearization, we seek to learn a linearizing controller to simplify the process of controlling a car to race autonomously. A soft actor-critic approach is used to learn a decoupling matrix and drift vector that effectively correct for errors in a hand-designed linearizing controller. The result is an exactly linearizing controller that can be used to enable the well-developed theory of linear systems to design path planning and tracking schemes that are easy to implement and significantly less computationally demanding. To demonstrate the method of feedback linearization, it is first used to learn a simulated model whose exact structure is known, but varied from the initial controller, so as to introduce error. We further seek to apply this method to a system that introduces even more error in the form of a gym environment specifically designed for modeling the dynamics of car racing. To do so, we posit an extension to the method of learning feedback linearization; a neural network that is trained using supervised learning to convert the output of our linearizing controller to the required input for the racing environment. Our progress towards these goals is reported and the next steps in their accomplishment are discussed.

Locally Differentially Private Reinforcement Learning for Linear Mixture Markov Decision Processes

Authors:Chonghua Liao, Jiafan He, Quanquan Gu
Date:2021-10-19 17:44:09

Reinforcement learning (RL) algorithms can be used to provide personalized services, which rely on users' private and sensitive data. To protect the users' privacy, privacy-preserving RL algorithms are in demand. In this paper, we study RL with linear function approximation and local differential privacy (LDP) guarantees. We propose a novel $(\varepsilon, \delta)$-LDP algorithm for learning a class of Markov decision processes (MDPs) dubbed linear mixture MDPs, and obtain an $\tilde{\mathcal{O}}( d^{5/4}H^{7/4}T^{3/4}\left(\log(1/\delta)\right)^{1/4}\sqrt{1/\varepsilon})$ regret, where $d$ is the dimension of feature mapping, $H$ is the length of the planning horizon, and $T$ is the number of interactions with the environment. We also prove a lower bound $\Omega(dH\sqrt{T}/\left(e^{\varepsilon}(e^{\varepsilon}-1)\right))$ for learning linear mixture MDPs under $\varepsilon$-LDP constraint. Experiments on synthetic datasets verify the effectiveness of our algorithm. To the best of our knowledge, this is the first provable privacy-preserving RL algorithm with linear function approximation.

Contrastive Active Inference

Authors:Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt
Date:2021-10-19 16:20:49

Active inference is a unifying theory for perception and action resting upon the idea that the brain maintains an internal model of the world by minimizing free energy. From a behavioral perspective, active inference agents can be seen as self-evidencing beings that act to fulfill their optimistic predictions, namely preferred outcomes or goals. In contrast, reinforcement learning requires human-designed rewards to accomplish any desired outcome. Although active inference could provide a more natural self-supervised objective for control, its applicability has been limited because of the shortcomings in scaling the approach to complex environments. In this work, we propose a contrastive objective for active inference that strongly reduces the computational burden in learning the agent's generative model and planning future actions. Our method performs notably better than likelihood-based active inference in image-based tasks, while also being computationally cheaper and easier to train. We compare to reinforcement learning agents that have access to human-designed reward functions, showing that our approach closely matches their performance. Finally, we also show that contrastive methods perform significantly better in the case of distractors in the environment and that our method is able to generalize goals to variations in the background. Website and code: https://contrastive-aif.github.io/

Offline Reinforcement Learning with Value-based Episodic Memory

Authors:Xiaoteng Ma, Yiqin Yang, Hao Hu, Qihan Liu, Jun Yang, Chongjie Zhang, Qianchuan Zhao, Bin Liang
Date:2021-10-19 08:20:11

Offline reinforcement learning (RL) shows promise of applying RL to real-world problems by effectively utilizing previously collected data. Most existing offline RL algorithms use regularization or constraints to suppress extrapolation error for actions outside the dataset. In this paper, we adopt a different framework, which learns the V-function instead of the Q-function to naturally keep the learning procedure within the support of an offline dataset. To enable effective generalization while maintaining proper conservatism in offline learning, we propose Expectile V-Learning (EVL), which smoothly interpolates between the optimal value learning and behavior cloning. Further, we introduce implicit planning along offline trajectories to enhance learned V-values and accelerate convergence. Together, we present a new offline method called Value-based Episodic Memory (VEM). We provide theoretical analysis for the convergence properties of our proposed VEM method, and empirical results in the D4RL benchmark show that our method achieves superior performance in most tasks, particularly in sparse-reward tasks.
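
To make the interpolation idea above more tangible, the sketch below fits a single scalar value to a handful of return targets with an asymmetric squared (expectile regression) loss. It assumes the standard expectile loss, which the abstract does not spell out; the names `expectile_loss` and `fit_expectile`, the parameter `tau`, and the sample targets are hypothetical. With `tau = 0.5` the fitted value is the plain mean of the targets (the behaviour-averaged estimate), while `tau` close to 1 pulls it towards the largest target (an optimistic, max-like estimate).

```python
def expectile_loss(target, value, tau):
    """Asymmetric squared error: under-estimates are weighted by tau,
    over-estimates by (1 - tau)."""
    diff = target - value
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(targets, tau, lr=0.1, steps=2000):
    """Gradient descent on the mean expectile loss for a scalar value."""
    v = sum(targets) / len(targets)
    for _ in range(steps):
        grad = sum(
            -2.0 * (tau if t > v else 1.0 - tau) * (t - v) for t in targets
        ) / len(targets)
        v -= lr * grad
    return v


if __name__ == "__main__":
    returns = [0.0, 1.0, 4.0]  # hypothetical returns observed from one state
    for tau in (0.5, 0.9, 0.99):
        print(tau, fit_expectile(returns, tau))
```

Because the fitted value never exceeds the largest target seen in the data, raising `tau` makes the estimate more optimistic without extrapolating beyond the dataset, which matches the conservatism the abstract emphasises.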

On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game

Authors:Shuang Qiu, Jieping Ye, Zhaoran Wang, Zhuoran Yang
Date:2021-10-19 07:26:33

To achieve sample efficiency in reinforcement learning (RL), it is necessary to explore the underlying environment efficiently. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the policy via a planning algorithm with offline data collected in the exploration phase. Moreover, we tackle this problem under the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for generating an $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.

Embracing advanced AI/ML to help investors achieve success: Vanguard Reinforcement Learning for Financial Goal Planning

Authors:Shareefuddin Mohammed, Rusty Bealer, Jason Cohen
Date:2021-10-18 18:46:20

In the world of advice and financial planning, there is seldom one right answer. While traditional algorithms have been successful in solving linear problems, their success often depends on choosing the right features from a dataset, which can be a challenge for nuanced financial planning scenarios. Reinforcement learning is a machine learning approach that can be employed with complex data sets where picking the right features can be nearly impossible. In this paper, we will explore the use of machine learning for financial forecasting, predicting economic indicators, and creating a savings strategy. The Vanguard ML algorithm for goals-based financial planning is based on deep reinforcement learning that identifies optimal savings rates across multiple goals and sources of income to help clients achieve financial success. Vanguard learning algorithms are trained to identify market indicators and behaviors too complex to capture with formulas and rules; instead, they model the financial success trajectory of investors and their investment outcomes as a Markov decision process. We believe that reinforcement learning can be used to create value for advisors and end-investors, creating efficiency, more personalized plans, and data to enable customized solutions.

Option Transfer and SMDP Abstraction with Successor Features

Authors:Dongge Han, Sebastian Tschiatschek
Date:2021-10-18 11:35:08

Abstraction plays an important role in the generalisation of knowledge and skills and is key to sample efficient learning. In this work, we study joint temporal and state abstraction in reinforcement learning, where temporally-extended actions in the form of options induce temporal abstractions, while aggregation of similar states with respect to abstract options induces state abstractions. Many existing abstraction schemes ignore the interplay of state and temporal abstraction. Consequently, the considered option policies often cannot be directly transferred to new environments due to changes in the state space and transition dynamics. To address this issue, we propose a novel abstraction scheme building on successor features. This includes an algorithm for transferring abstract options across different environments and a state abstraction mechanism that allows us to perform efficient planning with the transferred options.

Reinforcement Learning-Based Coverage Path Planning with Implicit Cellular Decomposition

Authors:Javad Heydari, Olimpiya Saha, Viswanath Ganapathy
Date:2021-10-18 05:18:52

Coverage path planning in a generic known environment is shown to be NP-hard. When the environment is unknown, it becomes more challenging as the robot is required to rely on its online map information built during coverage for planning its path. A significant research effort focuses on designing heuristic or approximate algorithms that achieve reasonable performance. Such algorithms have sub-optimal performance in terms of covering the area or the cost of coverage, e.g., coverage time or energy consumption. In this paper, we provide a systematic analysis of the coverage problem and formulate it as an optimal stopping time problem, where the trade-off between coverage performance and its cost is explicitly accounted for. Next, we demonstrate that reinforcement learning (RL) techniques can be leveraged to solve the problem computationally. To this end, we provide some technical and practical considerations to facilitate the application of the RL algorithms and improve the efficiency of the solutions. Finally, through experiments in grid world environments and Gazebo simulator, we show that reinforcement learning-based algorithms efficiently cover realistic unknown indoor environments, and outperform the current state of the art.

Provable RL with Exogenous Distractors via Multistep Inverse Dynamics

Authors:Yonathan Efroni, Dipendra Misra, Akshay Krishnamurthy, Alekh Agarwal, John Langford
Date:2021-10-17 15:21:27

Many real-world applications of reinforcement learning (RL) require the agent to deal with high-dimensional observations such as those generated from a megapixel camera. Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations and subsequently plan efficiently. However, such approaches can fail in the presence of temporally correlated noise in the observations, a phenomenon that is common in practice. We initiate the formal study of latent state discovery in the presence of such exogenous noise sources by proposing a new model, the Exogenous Block MDP (EX-BMDP), for rich observation RL. We start by establishing several negative results, by highlighting failure cases of prior representation learning based approaches. Then, we introduce the Predictive Path Elimination (PPE) algorithm, that learns a generalization of inverse dynamics and is provably sample and computationally efficient in EX-BMDPs when the endogenous state dynamics are near deterministic. The sample complexity of PPE depends polynomially on the size of the latent endogenous state space while not directly depending on the size of the observation space, nor the exogenous state space. We provide experiments on challenging exploration problems which show that our approach works empirically.

Improving Hyperparameter Optimization by Planning Ahead

Authors:Hadi S. Jomaa, Jonas Falkner, Lars Schmidt-Thieme
Date:2021-10-15 11:46:14

Hyperparameter optimization (HPO) is generally treated as a bi-level optimization problem that involves fitting a (probabilistic) surrogate model to a set of observed hyperparameter responses, e.g. validation loss, and consequently maximizing an acquisition function using a surrogate model to identify good hyperparameter candidates for evaluation. The choice of a surrogate and/or acquisition function can be further improved via knowledge transfer across related tasks. In this paper, we propose a novel transfer learning approach, defined within the context of model-based reinforcement learning, where we represent the surrogate as an ensemble of probabilistic models that allows trajectory sampling. We further propose a new variant of model predictive control which employs a simple look-ahead strategy as a policy that optimizes a sequence of actions, representing hyperparameter candidates to expedite HPO. Our experiments on three meta-datasets comparing to state-of-the-art HPO algorithms including a model-free reinforcement learning approach show that the proposed method can outperform all baselines by exploiting a simple planning-based policy.

Integrated Path Planning and Tracking Control of Marine Current Turbine in Uncertain Ocean Environments

Authors:Arezoo Hasankhani, Ertugrul Baris Ondes, Yufei Tang, Cornel Sultan, James VanZwieten
Date:2021-10-14 01:03:19

This paper presents an integrated path planning and tracking control of marine hydrokinetic energy harvesting devices. To address the highly nonlinear and uncertain oceanic environment, the path planner is designed based on a reinforcement learning (RL) approach by fully exploring the historical ocean current profiles. The planner will search for a path to optimize a chosen cost criterion, such as maximizing the total harvested energy for a given time. Model predictive control (MPC) is then utilized to design the tracking control for the optimal path command from the planner subject to problem constraints. The planner and the tracking control are accommodated in an integrated framework to optimize these two parts in a real-time manner. The proposed approach is validated on a marine current turbine (MCT) that executes vertical waypoint path searching to maximize the net power due to spatiotemporal uncertainties in the ocean environment, as well as the path following via an MPC tracking controller to navigate the MCT to the optimal path. Results demonstrate that the path planning increases harnessed power compared to the baseline (i.e., maintaining MCT at an equilibrium depth), and the tracking controller can successfully follow the reference path under different shear profiles.

Feudal Reinforcement Learning by Reading Manuals

Authors:Kai Wang, Zhonghao Wang, Mo Yu, Humphrey Shi
Date:2021-10-13 03:50:15

Reading to act is a prevalent but challenging task that requires the ability to reason from a concise instruction. However, previous works face a semantic mismatch between low-level actions and high-level language descriptions, and require a human-designed curriculum to work properly. In this paper, we present a Feudal Reinforcement Learning (FRL) model consisting of a manager agent and a worker agent. The manager agent is a multi-hop plan generator that deals with high-level abstract information and generates a series of sub-goals in a backward manner. The worker agent deals with low-level perceptions and actions to achieve the sub-goals one by one. In comparison with previous works, our FRL model effectively alleviates the mismatch between text-level inference and low-level perceptions and actions, is general to various forms of environments, instructions and manuals, and its multi-hop plan generator can significantly boost performance on challenging tasks where multi-step reasoning from the texts is critical to resolving the instructed goals. We show that our approach achieves competitive performance on two challenging tasks, Read to Fight Monsters (RTFM) and Messenger, without human-designed curriculum learning.

Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Authors:Weitong Zhang, Dongruo Zhou, Quanquan Gu
Date:2021-10-12 23:03:58

We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{\mathcal{O}}(H^5d^2\epsilon^{-2})$ episodes during the exploration phase. Here, $H$ is the length of the episode, $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using Bernstein-type bonus and show that it needs to sample at most $\tilde{\mathcal{O}}(H^4d(H + d)\epsilon^{-2})$ to achieve an $\epsilon$-optimal policy. By constructing a special class of linear Mixture MDPs, we also prove that for any reward-free algorithm, it needs to sample at least $\tilde \Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in terms of the dependence on $\epsilon$ and the dependence on $d$ if $H \ge d$.
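
The closing claim about matching rates can be checked directly from the bounds quoted in the abstract. The short derivation below uses only those stated rates and ignores logarithmic factors; it is a reading aid, not part of the paper's proof.

```latex
\begin{align*}
H \ge d \;&\Longrightarrow\; H + d \le 2H
\;\Longrightarrow\;
\tilde{\mathcal{O}}\!\left(H^4 d\,(H + d)\,\epsilon^{-2}\right)
= \tilde{\mathcal{O}}\!\left(H^5 d\,\epsilon^{-2}\right), \\
\frac{H^5 d\,\epsilon^{-2}}{H^2 d\,\epsilon^{-2}} &= H^3 .
\end{align*}
```

In this regime the Bernstein-type upper bound and the lower bound $\tilde \Omega(H^2d\epsilon^{-2})$ share the factor $d\epsilon^{-2}$, and the remaining gap is a factor of $H^3$ in the episode length.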

Temporal Abstraction in Reinforcement Learning with the Successor Representation

Authors:Marlos C. Machado, Andre Barreto, Doina Precup, Michael Bowling
Date:2021-10-12 05:07:43

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation (SR), which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the SR can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent's representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the SR allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for exploration and on the use of the SR to combine them. The results of our experiments shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the SR, such as eigenoptions and the option keyboard.
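
To make the successor representation concrete, here is a minimal tabular sketch. It assumes a fixed policy with a known transition matrix `P` and uses the standard closed form $\Psi = (I - \gamma P)^{-1}$. The five-state random-walk chain and the helper names are hypothetical choices for illustration and do not come from the paper.

```python
import numpy as np

def successor_representation(P, gamma):
    """Closed-form SR for a fixed policy: Psi = (I - gamma * P)^-1."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

def random_walk_chain(n):
    """Unbiased random walk on an n-state chain (hypothetical toy environment).
    At either end, the move that would leave the chain keeps the agent in place."""
    P = np.zeros((n, n))
    for s in range(n):
        P[s, max(s - 1, 0)] += 0.5
        P[s, min(s + 1, n - 1)] += 0.5
    return P

P = random_walk_chain(5)
gamma = 0.9
psi = successor_representation(P, gamma)

# Cross-check against the truncated series sum_t gamma^t P^t,
# i.e. the expected discounted number of future visits to each state.
series = sum((gamma ** t) * np.linalg.matrix_power(P, t) for t in range(500))
```

Row `psi[s]` encodes state `s` by the discounted visitation pattern that follows it, which is the sense in which the abstract describes the SR as a representation of states.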

Learning Efficient Multi-Agent Cooperative Visual Exploration

Authors:Chao Yu, Xinyi Yang, Jiaxuan Gao, Huazhong Yang, Yu Wang, Yi Wu
Date:2021-10-12 04:48:10

We tackle the problem of cooperative visual exploration where multiple agents need to jointly explore unseen regions as fast as possible based on visual signals. Classical planning-based methods often suffer from expensive computation overhead at each step and limited expressiveness for complex cooperation strategies. By contrast, reinforcement learning (RL) has recently become a popular paradigm for tackling this challenge due to its capability to model arbitrarily complex strategies and its minimal inference overhead. In this paper, we extend the state-of-the-art single-agent visual navigation method, Active Neural SLAM (ANS), to the multi-agent setting by introducing a novel RL-based planning module, Multi-agent Spatial Planner (MSP). MSP leverages a transformer-based architecture, Spatial-TeamFormer, which effectively captures spatial relations and intra-agent interactions via hierarchical spatial self-attentions. In addition, we implement a few multi-agent enhancements to process local information from each agent for an aligned spatial representation and more precise planning. Finally, we perform policy distillation to extract a meta policy that significantly improves the generalization capability of the final policy. We call this overall solution Multi-Agent Active Neural SLAM (MAANS). MAANS substantially outperforms classical planning-based baselines for the first time in a photo-realistic 3D simulator, Habitat. Code and videos can be found at https://sites.google.com/view/maans.

Addressing crash-imminent situations caused by human driven vehicle errors in a mixed traffic stream: a model-based reinforcement learning approach for CAV

Authors:Jiqian Dong, Sikai Chen, Samuel Labi
Date:2021-10-11 18:54:05

It is anticipated that the era of fully autonomous vehicle operations will be preceded by a lengthy "Transition Period" where the traffic stream will be mixed, that is, consisting of connected autonomous vehicles (CAVs), human-driven vehicles (HDVs) and connected human-driven vehicles (CHDVs). In recognition of the fact that public acceptance of CAVs will hinge on the safety performance of automated driving systems, and that there will likely be safety challenges in the early part of the transition period, significant research efforts have been expended in the development of safety-conscious automated driving systems. Yet there still appears to be a lacuna in the literature regarding the handling of crash-imminent situations that are caused by errant HDVs in the vicinity of the CAV during operations on the roadway. In this paper, we develop a simple model-based Reinforcement Learning (RL) system that can be deployed in the CAV to generate trajectories that anticipate and avoid potential collisions caused by drivers of the HDVs. The model involves an end-to-end data-driven approach that contains a motion prediction model based on deep learning, and a fast trajectory planning algorithm based on model predictive control (MPC). The proposed system requires no prior knowledge or assumption about the physical environment, including the vehicle dynamics, and therefore represents a general approach that can be deployed on any type of vehicle (e.g., truck, bus, motorcycle, etc.). The framework is trained and tested in the CARLA simulator with multiple collision-imminent scenarios, and the results indicate that the proposed model can avoid collisions at a high success rate (>85%) even in highly compact and dangerous situations.

Neural Algorithmic Reasoners are Implicit Planners

Authors:Andreea Deac, Petar Veličković, Ognjen Milinković, Pierre-Luc Bacon, Jian Tang, Mladen Nikolić
Date:2021-10-11 17:29:20

Implicit planning has emerged as an elegant technique for combining learned models of the world with end-to-end model-free reinforcement learning. We study the class of implicit planners inspired by value iteration, an algorithm that is guaranteed to yield perfect policies in fully-specified tabular environments. We find that prior approaches either assume that the environment is provided in such a tabular form -- which is highly restrictive -- or infer "local neighbourhoods" of states to run value iteration over -- for which we discover an algorithmic bottleneck effect. This effect is caused by explicitly running the planning algorithm based on scalar predictions in every state, which can be harmful to data efficiency if such scalars are improperly predicted. We propose eXecuted Latent Value Iteration Networks (XLVINs), which alleviate the above limitations. Our method performs all planning computations in a high-dimensional latent space, breaking the algorithmic bottleneck. It maintains alignment with value iteration by carefully leveraging neural graph-algorithmic reasoning and contrastive self-supervised learning. Across eight low-data settings -- including classical control, navigation and Atari -- XLVINs provide significant improvements to data efficiency against value iteration-based implicit planners, as well as relevant model-free baselines. Lastly, we empirically verify that XLVINs can closely align with value iteration.
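
For reference, the tabular value iteration that these implicit planners take inspiration from can be sketched as below. This is the textbook algorithm on a fully specified MDP, not the XLVIN architecture. The three-state chain, the array layout, and all names are hypothetical choices for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Tabular value iteration.
    P: (A, S, S) transition probabilities, R: (S, A) rewards.
    Returns the value function and a greedy policy."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Q[s, a] = R[s, a] + gamma * sum_n P[a, s, n] * V[n]
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1)

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right,
# state 2 is absorbing, and stepping from state 1 into state 2 pays reward 1.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 2] = 1.0   # left
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0   # right
R = np.zeros((3, 2))
R[1, 1] = 1.0

V, policy = value_iteration(P, R, gamma=0.9)
```

The sketch requires the full transition tensor `P` up front, which is the tabular assumption the abstract calls highly restrictive.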

Interactive Hierarchical Guidance using Language

Authors:Bharat Prakash, Nicholas Waytowich, Tim Oates, Tinoosh Mohsenin
Date:2021-10-09 21:34:32

Reinforcement learning has been successful in many tasks, ranging from robotic control and games to energy management. In complex real-world environments with sparse rewards and long task horizons, sample efficiency is still a major challenge. Most complex tasks can be easily decomposed into high-level planning and low-level control. Therefore, it is important to enable agents to leverage the hierarchical structure and decompose bigger tasks into multiple smaller sub-tasks. We introduce an approach where we use language to specify sub-tasks and a high-level planner issues language commands to a low-level controller. The low-level controller executes the sub-tasks based on the language commands. Our experiments show that this method is able to solve complex long-horizon planning tasks with limited human supervision. Using language has the added benefit of interpretability and allows expert humans to take over the high-level planning task and provide language commands if necessary.

Improving Kinodynamic Planners for Vehicular Navigation with Learned Goal-Reaching Controllers

Authors:Aravind Sivaramakrishnan, Edgar Granados, Seth Karten, Troy McMahon, Kostas E. Bekris
Date:2021-10-08 16:45:42

This paper aims to improve the path quality and computational efficiency of sampling-based kinodynamic planners for vehicular navigation. It proposes a learning framework for identifying promising controls during the expansion process of sampling-based planners. Given a dynamics model, a reinforcement learning process is trained offline to return a low-cost control that reaches a local goal state (i.e., a waypoint) in the absence of obstacles. By focusing on the system's dynamics and not knowing the environment, this process is data-efficient and takes place once for a robotic system. In this way, it can be reused in different environments. The planner generates online local goal states for the learned controller in an informed manner to bias towards the goal and consecutively in an exploratory, random manner. For the informed expansion, local goal states are generated either via (a) medial axis information in environments with obstacles, or (b) wavefront information for setups with traversability costs. The learning process and the resulting planning framework are evaluated for a first and second-order differential drive system, as well as a physically simulated Segway robot. The results show that the proposed integration of learning and planning can produce higher quality paths than sampling-based kinodynamic planning with random controls in fewer iterations and computation time.

Near-Optimal Reward-Free Exploration for Linear Mixture MDPs with Plug-in Solver

Authors:Xiaoyu Chen, Jiachen Hu, Lin F. Yang, Liwei Wang
Date:2021-10-07 07:59:50

Although model-based reinforcement learning (RL) approaches are considered more sample efficient, existing algorithms usually rely on a sophisticated planning algorithm that is tightly coupled with the model-learning procedure. Hence the learned models may not be reusable with more specialized planners. In this paper we address this issue and provide approaches to learn an RL model efficiently without the guidance of a reward signal. In particular, we take a plug-in solver approach, where we focus on learning a model in the exploration phase and demand that any planning algorithm on the learned model can give a near-optimal policy. Specifically, we focus on the linear mixture MDP setting, where the probability transition matrix is an (unknown) convex combination of a set of existing models. We show that, by establishing a novel exploration algorithm, the plug-in approach learns a model by taking $\tilde{O}(d^2H^3/\epsilon^2)$ interactions with the environment, and any $\epsilon$-optimal planner on the model gives an $O(\epsilon)$-optimal policy on the original model. This sample complexity matches lower bounds for non-plug-in approaches and is statistically optimal. We achieve this result by leveraging a careful maximum total-variance bound using the Bernstein inequality and properties specific to linear mixture MDPs.
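
The linear mixture structure described above can be illustrated with a small sketch: the true transition kernel is a convex combination of known base models, and only the mixing weights are unknown to the learner. The sizes, the randomly drawn base models, and the function name are hypothetical. The exploration algorithm from the paper is not shown.

```python
import numpy as np

def mixture_transition(base_models, theta):
    """Combine base transition models of shape (d, S, A, S) using
    mixing weights theta of shape (d,) on the probability simplex."""
    theta = np.asarray(theta, dtype=float)
    if np.any(theta < 0) or not np.isclose(theta.sum(), 1.0):
        raise ValueError("theta must lie on the probability simplex")
    # Contract the model axis: P[s, a, s'] = sum_i theta[i] * P_i[s, a, s']
    return np.tensordot(theta, base_models, axes=1)

rng = np.random.default_rng(0)
d, S, A = 3, 4, 2                                   # illustrative sizes only
# Each base_models[i, s, a] is a probability vector over next states.
base_models = rng.dirichlet(np.ones(S), size=(d, S, A))
theta = np.array([0.2, 0.5, 0.3])                   # unknown to the learner in the paper's setting

P = mixture_transition(base_models, theta)
```

Because the weights are non-negative and sum to one, every `P[s, a]` is again a valid distribution over next states, so learning the model reduces to estimating the `d` entries of `theta`.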

Compositional Q-learning for electrolyte repletion with imbalanced patient sub-populations

Authors:Aishwarya Mandyam, Andrew Jones, Jiayu Yao, Krzysztof Laudanski, Barbara Engelhardt
Date:2021-10-06 16:08:05

Reinforcement learning (RL) is an effective framework for solving sequential decision-making tasks. However, applying RL methods in medical care settings is challenging in part due to heterogeneity in treatment response among patients. Some patients can be treated with standard protocols whereas others, such as those with chronic diseases, need personalized treatment planning. Traditional RL methods often fail to account for this heterogeneity, because they assume that all patients respond to the treatment in the same way (i.e., transition dynamics are shared). We introduce Compositional Fitted $Q$-iteration (CFQI), which uses a compositional task structure to represent heterogeneous treatment responses in medical care settings. A compositional task consists of several variations of the same task, each progressing in difficulty; solving simpler variants of the task can enable efficient solving of harder variants. CFQI uses a compositional $Q$-value function with separate modules for each task variant, allowing it to take advantage of shared knowledge while learning distinct policies for each variant. We validate CFQI's performance using a Cartpole environment and use CFQI to recommend electrolyte repletion for patients with and without renal disease. Our results demonstrate that CFQI is robust even in the presence of class imbalance, enabling effective information usage across patient sub-populations. CFQI exhibits great promise for clinical applications in scenarios characterized by known compositional structures.

Mismatched No More: Joint Model-Policy Optimization for Model-Based RL

Authors:Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, Ruslan Salakhutdinov
Date:2021-10-06 13:43:27

Many model-based reinforcement learning (RL) methods follow a similar template: fit a model to previously observed data, and then use data from that model for RL or planning. However, models that achieve better training performance (e.g., lower MSE) are not necessarily better for control: an RL agent may seek out the small fraction of states where an accurate model makes mistakes, or it might act in ways that do not expose the errors of an inaccurate model. As noted in prior work, there is an objective mismatch: models are useful if they yield good policies, but they are trained to maximize their accuracy, rather than the performance of the policies that result from them. In this work, we propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return. To the best of our knowledge, this is the first lower bound for model-based RL that holds globally and can be efficiently estimated in continuous settings; it is the only lower bound that mends the objective mismatch problem. A version of this bound becomes tight under certain assumptions. Optimizing this bound resembles a GAN: a classifier distinguishes between real and fake transitions, the model is updated to produce transitions that look realistic, and the policy is updated to avoid states where the model predictions are unrealistic. Numerical simulations demonstrate that optimizing this bound yields reward maximizing policies and yields dynamics that (perhaps surprisingly) can aid in exploration. We also show that a deep RL algorithm loosely based on our lower bound can achieve performance competitive with prior model-based methods, and better performance on certain hard exploration tasks.

Improved Reinforcement Learning Coordinated Control of a Mobile Manipulator using Joint Clamping

Authors:Denis Hadjivelichkov, Kostas Vlachos, Dimitrios Kanoulas
Date:2021-10-05 10:21:10

Many robotic path planning problems are continuous, stochastic, and high-dimensional. Coordinating the base and manipulator of a mobile manipulator to control its whole body online is particularly challenging when self and environment collision avoidance is required. Reinforcement Learning techniques have the potential to solve such problems through their ability to generalise over environments. We study joint penalties and joint limits of a state-of-the-art mobile manipulator whole-body controller that uses LIDAR sensing for obstacle collision avoidance. We propose directions to improve the reinforcement learning method. Our agent achieves significantly higher success rates than the baseline in a goal-reaching environment, and it can solve environments requiring coordinated whole-body control where the baseline fails.

Mapless Navigation: Learning UAVs Motion for Exploration of Unknown Environments

Authors:Sunggoo Jung, David Hyunchul Shim
Date:2021-10-04 23:38:58

This study presents a new methodology for learning-based motion planning for autonomous exploration using aerial robots. Through reinforcement learning, that is, learning by trial and error, an action policy is derived that can guide autonomous exploration of underground and tunnel environments. A new Markov decision process state is designed so that the robot's action policy can be learned in simulation only, and the results are applied to the real-world environment without further learning. This reduces the need for the precise map used by grid-based path planners and achieves map-less navigation. The proposed method can produce a path at a lower computing cost than the grid-based planner while offering similar performance. The trained action policy is broadly evaluated in both simulation and field trials related to autonomous exploration of underground mines or indoor spaces.

Multi-Agent Path Planning Using Deep Reinforcement Learning

Authors:Mert Çetinkaya
Date:2021-10-04 13:56:23

In this paper, a deep reinforcement learning based multi-agent path planning approach is introduced. The experiments are carried out in a simulation environment in which different multi-agent path planning problems are generated. The generated problems are similar to vehicle routing problems and are solved using multi-agent deep reinforcement learning. The model is trained on consecutive problems in the simulation environment, and its performance in solving a problem is observed to increase over time. The same simulation environment is always used, and only the locations of the target points that the agents must visit are changed. This helps the model learn its environment and the right approach to a problem as the episodes pass. In the end, a model is obtained that has already learned a great deal about solving path planning or routing problems in this environment, and it can find a good solution to a given unseen problem instantly, without any further training. In routing problems, standard mathematical modeling and heuristics tend to suffer from high computation times, which makes it difficult to obtain an instant solution when one is critical. This paper proposes a new solution method that addresses these points, and its efficiency is demonstrated experimentally.

AI based Algorithms of Path Planning, Navigation and Control for Mobile Ground Robots and UAVs

Authors:Jian Zhang
Date:2021-10-03 02:33:08

As the demand for autonomous mobile robots has increased in recent years, path planning/navigation algorithms should not be content with reaching the target without collisions, but should also try to achieve an optimal or suboptimal path from the initial position to the target according to the robot's constraints in practice. This report investigates path planning and control strategies for mobile robots, including ground mobile robots and flying UAVs, using machine learning techniques. First, the hybrid reactive collision-free navigation problem in an unknown static environment is investigated. By combining reactive navigation and Q-learning, we intend to keep the good characteristics of both and overcome the shortcomings of relying on only one of them. The proposed method is then extended to 3D environments. The performance of these strategies is verified by extensive computer simulations, and good results are obtained. Furthermore, the more challenging case of dynamic environments is considered. We tackle this problem by developing a new path planning method that utilizes an integrated environment representation and reinforcement learning. Our novel approach finds the optimal path to the target efficiently and avoids collisions in a cluttered environment with steady and moving obstacles. The performance of these methods is also compared across several different aspects.

A Novel Automated Curriculum Strategy to Solve Hard Sokoban Planning Instances

Authors:Dieqiao Feng, Carla P. Gomes, Bart Selman
Date:2021-10-03 00:44:50

In recent years, we have witnessed tremendous progress in deep reinforcement learning (RL) for tasks such as Go, Chess, video games, and robot control. Nevertheless, other combinatorial domains, such as AI planning, still pose considerable challenges for RL approaches. The key difficulty in those domains is that a positive reward signal becomes exponentially rare as the minimal solution length increases. So, an RL approach loses its training signal. There has been promising recent progress by using a curriculum-driven learning approach that is designed to solve a single hard instance. We present a novel automated curriculum approach that dynamically selects from a pool of unlabeled training instances of varying task complexity guided by our difficulty quantum momentum strategy. We show how the smoothness of the task hardness impacts the final learning results. In particular, as the size of the instance pool increases, the "hardness gap" decreases, which facilitates a smoother automated curriculum based learning process. Our automated curriculum approach dramatically improves upon the previous approaches. We show our results on Sokoban, which is a traditional PSPACE-complete planning problem and presents a great challenge even for specialized solvers. Our RL agent can solve hard instances that are far out of reach for any previous state-of-the-art Sokoban solver. In particular, our approach can uncover plans that require hundreds of steps, while the best previous search methods would take many years of computing time to solve such instances. In addition, we show that we can further boost the RL performance with an intricate coupling of our automated curriculum approach with a curiosity-driven search strategy and a graph neural net representation.

AB-Mapper: Attention and BicNet Based Multi-agent Path Finding for Dynamic Crowded Environment

Authors:Huifeng Guan, Yuan Gao, Min Zhao, Yong Yang, Fuqin Deng, Tin Lun Lam
Date:2021-10-02 08:56:01

Multi-agent path finding in dynamic crowded environments is of great academic and practical value for multi-robot systems in the real world. To improve the effectiveness and efficiency of the communication and learning process during path planning in dynamic crowded environments, we introduce an algorithm called Attention and BicNet based Multi-agent path planning with effective reinforcement (AB-Mapper) under the actor-critic reinforcement learning framework. In this framework, on the one hand, we utilize the BicNet with communication function in the actor-network to achieve intra-team coordination. On the other hand, we propose a centralized critic network that can selectively allocate attention weights to surrounding agents. This attention mechanism allows an individual agent to automatically learn a better evaluation of actions by also considering the behaviours of its surrounding agents. Compared with the state-of-the-art method Mapper, our AB-Mapper is more effective (85.86% vs. 81.56% in terms of success rate) in solving the general path finding problems with dynamic obstacles. In addition, in crowded scenarios, our method outperforms the Mapper method by a large margin, reaching a stunning gap of more than 40% for each experiment.

Multi-lane Cruising Using Hierarchical Planning and Reinforcement Learning

Authors:Kasra Rezaee, Peyman Yadmellat, Masoud S. Nosrati, Elmira Amirloo Abolfathi, Mohammed Elmahgiubi, Jun Luo
Date:2021-10-01 21:03:39

Competent multi-lane cruising requires using lane changes and within-lane maneuvers to achieve good speed and maintain safety. This paper proposes a design for autonomous multi-lane cruising by combining a hierarchical reinforcement learning framework with a novel state-action space abstraction. While the proposed solution follows the classical hierarchy of behavior decision, motion planning and control, it introduces a key intermediate abstraction within the motion planner to discretize the state-action space according to high level behavioral decisions. We argue that this design allows principled modular extension of motion planning, in contrast to using either monolithic behavior cloning or a large set of hand-written rules. Moreover, we demonstrate that our state-action space abstraction allows transferring of the trained models without retraining from a simulated environment with virtually no dynamics to one with significantly more realistic dynamics. Together, these results suggest that our proposed hierarchical architecture is a promising way to allow reinforcement learning to be applied to complex multi-lane cruising in the real world.

Motion Planning for Autonomous Vehicles in the Presence of Uncertainty Using Reinforcement Learning

Authors:Kasra Rezaee, Peyman Yadmellat, Simon Chamorro
Date:2021-10-01 20:32:25

Motion planning under uncertainty is one of the main challenges in developing autonomous driving vehicles. In this work, we focus on the uncertainty in sensing and perception resulting from a limited field of view, occlusions, and sensing range. This problem is often tackled by considering hypothetical hidden objects in occluded areas or beyond the sensing range to guarantee passive safety. However, this may result in conservative planning and expensive computation, particularly when numerous hypothetical objects need to be considered. We propose a reinforcement learning (RL) based solution to manage uncertainty by optimizing for the worst-case outcome. This approach is in contrast to traditional RL, where the agents try to maximize the average expected reward. The proposed approach is built on top of distributional RL, with its policy optimization maximizing the lower bound of the stochastic outcomes. This modification can be applied to a range of RL algorithms. As a proof of concept, the approach is applied to two different RL algorithms, Soft Actor-Critic and DQN. The approach is evaluated on two challenging scenarios: pedestrians crossing with occlusion, and curved roads with a limited field of view. The algorithm is trained and evaluated using the SUMO traffic simulator. The proposed approach yields much better motion planning behavior than conventional RL algorithms and behaves comparably to human driving.
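
The contrast between maximizing the average return and optimizing for the worst case can be sketched with a toy action-selection rule over estimated return distributions. The abstract does not define the lower bound precisely, so this sketch assumes a lower quantile as a stand-in. The quantile values, the `alpha` level, and the function names are hypothetical and are not the authors' implementation.

```python
import numpy as np

def mean_greedy_action(quantiles):
    """Conventional choice: maximize the mean of the return distribution.
    quantiles: (n_actions, n_quantiles) estimated return samples per action."""
    return int(np.argmax(quantiles.mean(axis=1)))

def worst_case_action(quantiles, alpha=0.1):
    """Risk-averse choice (assumed here): maximize the lower alpha-quantile
    of each action's estimated return distribution."""
    lower = np.quantile(quantiles, alpha, axis=1)
    return int(np.argmax(lower))

# Hypothetical return estimates for two actions at one state.
quantiles = np.array([
    [-10.0, 4.0, 5.0, 6.0, 7.0],  # action 0: higher mean, occasional severe outcome
    [1.0, 1.5, 2.0, 2.5, 3.0],    # action 1: lower mean, no severe outcome
])

risk_neutral = mean_greedy_action(quantiles)
risk_averse = worst_case_action(quantiles)
```

In this toy case the mean-based rule prefers action 0 while the lower-quantile rule prefers action 1, which is the kind of behavioral difference the worst-case objective is meant to produce near occlusions.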

Validating Robotics Simulators on Real-World Impacts

Authors:Brian Acosta, William Yang, Michael Posa
Date:2021-10-01 17:12:05

A realistic simulation environment is an essential tool in every roboticist's toolkit, with uses ranging from planning and control to training policies with reinforcement learning. Despite the centrality of simulation in modern robotics, little work has been done to compare the performance of robotics simulators against real-world data, especially for scenarios involving dynamic motions with high speed impact events. Handling dynamic contact is the computational bottleneck for most simulations, and thus the modeling and algorithmic choices surrounding impacts and friction form the largest distinctions between popular tools. Here, we evaluate the ability of several simulators to reproduce real-world trajectories involving impacts. Using experimental data, we identify system-specific contact parameters of popular simulators Drake, MuJoCo, and Bullet, analyzing the effects of modeling choices around these parameters. For the simple example of a cube tossed onto a table, simulators capture inelastic impacts well while failing to capture elastic impacts. For the higher-dimensional case of a Cassie biped landing from a jump, the simulators capture the bulk motion well but the accuracy is limited by numerous model differences between the real robot and the simulators.

Trajectory Planning with Deep Reinforcement Learning in High-Level Action Spaces

Authors:Kyle R. Williams, Rachel Schlossman, Daniel Whitten, Joe Ingram, Srideep Musuvathy, Anirudh Patel, James Pagan, Kyle A. Williams, Sam Green, Anirban Mazumdar, Julie Parish
Date:2021-09-30 18:50:16

This paper presents a technique for trajectory planning based on continuously parameterized high-level actions (motion primitives) of variable duration. This technique leverages deep reinforcement learning (Deep RL) to formulate a policy which is suitable for real-time implementation. There is no separation of motion primitive generation and trajectory planning: each individual short-horizon motion is formed during the Deep RL training to achieve the full-horizon objective. Effectiveness of the technique is demonstrated numerically on a well-studied trajectory generation problem and a planning problem on a known obstacle-rich map. This paper also develops a new loss function term for policy-gradient-based Deep RL, which is analogous to an anti-windup mechanism in feedback control. We demonstrate the inclusion of this new term in the underlying optimization increases the average policy return in our numerical example.

Scalable Online Planning via Reinforcement Learning Fine-Tuning

Authors:Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown
Date:2021-09-30 17:59:11

Lookahead search has been a critical component of recent AI successes, such as in the games of chess, go, and poker. However, the search methods used in these games, and in many other settings, are tabular. Tabular search methods do not scale well with the size of the search space, and this problem is exacerbated by stochasticity and partial observability. In this work we replace tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and show that this approach outperforms state-of-the-art search algorithms in benchmark settings. In particular, we use our search algorithm to achieve a new state-of-the-art result in self-play Hanabi, and show the generality of our algorithm by also showing that it outperforms tabular search in the Atari game Ms. Pacman.

Is Policy Learning Overrated?: Width-Based Planning and Active Learning for Atari

Authors:Benjamin Ayton, Masataro Asai
Date:2021-09-30 17:52:00

Width-based planning has shown promising results on Atari 2600 games using pixel input, while using substantially fewer environment interactions than reinforcement learning. Recent width-based approaches have computed feature vectors for each screen using a hand-designed feature set or a variational autoencoder trained on game screens (VAE-IW), and prune screens that do not have novel features during the search. We propose Olive (Online-VAE-IW), which updates the VAE features online using active learning to maximize the utility of screens observed during planning. Experimental results in 55 Atari games demonstrate that it outperforms Rollout-IW by 42-to-11 and VAE-IW by 32-to-20. Moreover, Olive outperforms existing work based on policy-learning ($\pi$-IW, DQN) trained with 100x training budget by 30-to-22 and 31-to-17, and a state-of-the-art data-efficient reinforcement learning method (EfficientZero) trained with the same training budget and run with 1.8x planning budget by 18-to-7 in the Atari 100k benchmark, with no policy learning at all. The source code is available at github.com/ibm/atari-active-learning .

Reinforcement Learning for Classical Planning: Viewing Heuristics as Dense Reward Generators

Authors:Clement Gehring, Masataro Asai, Rohan Chitnis, Tom Silver, Leslie Pack Kaelbling, Shirin Sohrabi, Michael Katz
Date:2021-09-30 03:36:01

Recent advances in reinforcement learning (RL) have led to a growing interest in applying RL to classical planning domains or applying classical planning methods to some complex RL domains. However, the long-horizon goal-based problems found in classical planning lead to sparse rewards for RL, making direct application inefficient. In this paper, we propose to leverage domain-independent heuristic functions commonly used in the classical planning literature to improve the sample efficiency of RL. These classical heuristics act as dense reward generators to alleviate the sparse-rewards issue and enable our RL agent to learn domain-specific value functions as residuals on these heuristics, making learning easier. Correct application of this technique requires consolidating the discounted metric used in RL and the non-discounted metric used in heuristics. We implement the value functions using Neural Logic Machines, a neural network architecture designed for grounded first-order logic inputs. We demonstrate on several classical planning domains that using classical heuristics for RL allows for good sample efficiency compared to sparse-reward RL. We further show that our learned value functions generalize to novel problem instances in the same domain.
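
One standard way to turn a heuristic into a dense reward signal is potential-based reward shaping with the potential set to the negated heuristic value. The sketch below shows that general technique under stated assumptions; it is not necessarily the paper's exact formulation. The function name `shaped_reward` and the toy heuristic values are hypothetical.

```python
def shaped_reward(reward: float, h_state: float, h_next: float,
                  gamma: float = 0.99) -> float:
    """Potential-based shaping with potential phi(s) = -h(s).

    h_state and h_next are heuristic cost-to-go estimates (for example, an
    estimated number of remaining steps) for the current and next state.
    Assumption: the heuristic is used directly as the potential; mixing an
    undiscounted heuristic with gamma < 1 is the kind of mismatch the
    abstract says must be reconciled.
    """
    phi_state = -h_state
    phi_next = -h_next
    return reward + gamma * phi_next - phi_state


# Toy example with an undiscounted setting (gamma = 1) and a sparse reward of 0.
toward_goal = shaped_reward(0.0, h_state=3, h_next=2, gamma=1.0)
away_from_goal = shaped_reward(0.0, h_state=3, h_next=4, gamma=1.0)
print(toward_goal, away_from_goal)
```

With this shaping, a step that lowers the heuristic estimate receives a positive signal and a step that raises it receives a negative one, even when the environment reward itself is zero.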

Surveillance Evasion Through Bayesian Reinforcement Learning

Authors:Dongping Qi, David Bindel, Alexander Vladimirsky
Date:2021-09-30 02:29:21

We consider a task of surveillance-evading path-planning in a continuous setting. An Evader strives to escape from a 2D domain while minimizing the risk of detection (and immediate capture). The probability of detection is path-dependent and determined by the spatially inhomogeneous surveillance intensity, which is fixed but a priori unknown and gradually learned in the multi-episodic setting. We introduce a Bayesian reinforcement learning algorithm that relies on a Gaussian Process regression (to model the surveillance intensity function based on the information from prior episodes), numerical methods for Hamilton-Jacobi PDEs (to plan the best continuous trajectories based on the current model), and Confidence Bounds (to balance the exploration vs exploitation). We use numerical experiments and regret metrics to highlight the significant advantages of our approach compared to traditional graph-based algorithms of reinforcement learning.

Improving Safety in Deep Reinforcement Learning using Unsupervised Action Planning

Authors:Hao-Lun Hsu, Qiuhua Huang, Sehoon Ha
Date:2021-09-29 10:26:29

One of the key challenges to deep reinforcement learning (deep RL) is to ensure safety at both training and testing phases. In this work, we propose a novel technique of unsupervised action planning to improve the safety of on-policy reinforcement learning algorithms, such as trust region policy optimization (TRPO) or proximal policy optimization (PPO). We design our safety-aware reinforcement learning by storing all the history of "recovery" actions that rescue the agent from dangerous situations into a separate "safety" buffer and finding the best recovery action when the agent encounters similar states. Because this functionality requires the algorithm to query similar states, we implement the proposed safety mechanism using an unsupervised learning algorithm, k-means clustering. We evaluate the proposed algorithm on six robotic control tasks that cover navigation and manipulation. Our results show that the proposed safety RL algorithm can achieve higher rewards compared with multiple baselines in both discrete and continuous control problems. The supplemental video can be found at: https://youtu.be/AFTeWSohILo.
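
The idea of a separate safety buffer queried by state similarity can be sketched as follows. This is a minimal illustration under stated assumptions: recovery actions are stored with a scalar score, states are clustered with a small hand-written k-means, and the best-scoring action in the query state's cluster is returned. The class name `SafetyBuffer`, the scoring rule, and the example data are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

Vec = Tuple[float, ...]


def sq_dist(a: Vec, b: Vec) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vec], k: int, iters: int = 10) -> List[Vec]:
    """Plain Lloyd iterations; the first k points are the initial centroids."""
    centroids = list(points[:k])
    for _ in range(iters):
        groups: List[List[Vec]] = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[idx].append(p)
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


class SafetyBuffer:
    """Stores (state, recovery action, score) and answers similarity queries."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.records: List[Tuple[Vec, str, float]] = []
        self.centroids: List[Vec] = []

    def add(self, state: Vec, action: str, score: float) -> None:
        self.records.append((state, action, score))

    def fit(self) -> None:
        self.centroids = kmeans([s for s, _, _ in self.records], self.k)

    def _cluster(self, state: Vec) -> int:
        return min(range(len(self.centroids)),
                   key=lambda i: sq_dist(state, self.centroids[i]))

    def best_recovery(self, state: Vec) -> Optional[str]:
        """Best-scoring stored action among records in the same cluster."""
        c = self._cluster(state)
        candidates = [(r, a) for s, a, r in self.records if self._cluster(s) == c]
        if not candidates:
            return None
        return max(candidates, key=lambda item: item[0])[1]


# Hypothetical recovery experiences in a 2-D state space.
buffer = SafetyBuffer(k=2)
buffer.add((0.0, 0.0), "brake", 1.0)
buffer.add((5.0, 5.0), "reverse", 0.5)
buffer.add((0.1, 0.0), "steer_left", 2.0)
buffer.add((5.1, 4.9), "stop", 1.5)
buffer.fit()
print(buffer.best_recovery((0.05, 0.05)))
print(buffer.best_recovery((4.9, 5.0)))
```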

Learning Dynamics Models for Model Predictive Agents

Authors:Michael Lutter, Leonard Hasenclever, Arunkumar Byravan, Gabriel Dulac-Arnold, Piotr Trochim, Nicolas Heess, Josh Merel, Yuval Tassa
Date:2021-09-29 09:50:25

Model-Based Reinforcement Learning involves learning a \textit{dynamics model} from data, and then using this model to optimise behaviour, most often with an online \textit{planner}. Much of the recent research along these lines presents a particular set of design choices, involving problem definition, model learning and planning. Given the multiple contributions, it is difficult to evaluate the effects of each. This paper sets out to disambiguate the role of different design choices for learning dynamics models, by comparing their performance to planning with a ground-truth model -- the simulator. First, we collect a rich dataset from the training sequence of a model-free agent on 5 domains of the DeepMind Control Suite. Second, we train feed-forward dynamics models in a supervised fashion, and evaluate planner performance while varying and analysing different model design choices, including ensembling, stochasticity, multi-step training and timestep size. Besides the quantitative analysis, we describe a set of qualitative findings, rules of thumb, and future research directions for planning with learned dynamics models. Videos of the results are available at https://sites.google.com/view/learning-better-models.

Identifying Reasoning Flaws in Planning-Based RL Using Tree Explanations

Authors:Kin-Ho Lam, Zhengxian Lin, Jed Irvine, Jonathan Dodge, Zeyad T Shureih, Roli Khanna, Minsuk Kahng, Alan Fern
Date:2021-09-28 18:39:03

Enabling humans to identify potential flaws in an agent's decision making is an important Explainable AI application. We consider identifying such flaws in a planning-based deep reinforcement learning (RL) agent for a complex real-time strategy game. In particular, the agent makes decisions via tree search using a learned model and evaluation function over interpretable states and actions. This gives the potential for humans to identify flaws at the level of reasoning steps in the tree, even if the entire reasoning process is too complex to understand. However, it is unclear whether humans will be able to identify such flaws due to the size and complexity of trees. We describe a user interface and case study, where a small group of AI experts and developers attempt to identify reasoning flaws due to inaccurate agent learning. Overall, the interface allowed the group to identify a number of significant flaws of varying types, demonstrating the promise of this approach.

A First-Occupancy Representation for Reinforcement Learning

Authors:Ted Moskovitz, Spencer R. Wilson, Maneesh Sahani
Date:2021-09-28 16:48:16

Both animals and artificial agents benefit from state representations that support rapid transfer of learning across tasks and which enable them to efficiently traverse their environments to reach rewarding states. The successor representation (SR), which measures the expected cumulative, discounted state occupancy under a fixed policy, enables efficient transfer to different reward structures in an otherwise constant Markovian environment and has been hypothesized to underlie aspects of biological behavior and neural activity. However, in the real world, rewards may move or only be available for consumption once, may shift location, or agents may simply aim to reach goal states as rapidly as possible without the constraint of artificially imposed task horizons. In such cases, the most behaviorally-relevant representation would carry information about when the agent was likely to first reach states of interest, rather than how often it should expect to visit them over a potentially infinite time span. To reflect such demands, we introduce the first-occupancy representation (FR), which measures the expected temporal discount to the first time a state is accessed. We demonstrate that the FR facilitates exploration, the selection of efficient paths to desired states, allows the agent, under certain conditions, to plan provably optimal trajectories defined by a sequence of subgoals, and induces similar behavior to animals avoiding threatening stimuli.
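
The difference between the two representations can be made concrete on a tiny Markov chain. The sketch below assumes a fixed policy folded into a transition matrix `P` and computes both quantities by fixed-point iteration: the SR accumulates discounted occupancy over all visits, while the FR stops accumulating once the target state is first reached. The recursion used for the FR follows from the verbal definition in the abstract; the function names and the three-state cycle are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def successor_representation(P: Matrix, gamma: float, iters: int = 200) -> Matrix:
    """M[s][g] = expected discounted number of visits to g, starting from s."""
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        M = [[(1.0 if s == g else 0.0)
              + gamma * sum(P[s][k] * M[k][g] for k in range(n))
              for g in range(n)] for s in range(n)]
    return M


def first_occupancy_representation(P: Matrix, gamma: float,
                                   iters: int = 200) -> Matrix:
    """F[s][g] = expected gamma ** (time of first visit to g), starting from s."""
    n = len(P)
    F = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        F = [[1.0 if s == g
              else gamma * sum(P[s][k] * F[k][g] for k in range(n))
              for g in range(n)] for s in range(n)]
    return F


# Deterministic cycle 0 -> 1 -> 2 -> 0 (hypothetical example).
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
M = successor_representation(P, gamma=0.5)
F = first_occupancy_representation(P, gamma=0.5)
print(M[0][0], F[0][0])  # SR counts repeat visits; FR counts only the first
print(M[0][2], F[0][2])
```

On the cycle, the SR entry for the start state exceeds 1 because the state is revisited every three steps, whereas the FR entry is exactly 1 because the first visit happens at time zero.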

Adaptive Informative Path Planning Using Deep Reinforcement Learning for UAV-based Active Sensing

Authors:Julius Rückin, Liren Jin, Marija Popović
Date:2021-09-28 09:00:55

Aerial robots are increasingly being utilized for environmental monitoring and exploration. However, a key challenge is efficiently planning paths to maximize the information value of acquired data as an initially unknown environment is explored. To address this, we propose a new approach for informative path planning based on deep reinforcement learning (RL). Combining recent advances in RL and robotic applications, our method combines tree search with an offline-learned neural network predicting informative sensing actions. We introduce several components making our approach applicable for robotic tasks with high-dimensional state and large action spaces. By deploying the trained network during a mission, our method enables sample-efficient online replanning on platforms with limited computational resources. Simulations show that our approach performs on par with existing methods while reducing runtime by 8-10x. We validate its performance using real-world surface temperature data.

Model-Free Reinforcement Learning for Optimal Control of Markov Decision Processes Under Signal Temporal Logic Specifications

Authors:Krishna C. Kalagarla, Rahul Jain, Pierluigi Nuzzo
Date:2021-09-27 22:44:55

We present a model-free reinforcement learning algorithm to find an optimal policy for a finite-horizon Markov decision process while guaranteeing a desired lower bound on the probability of satisfying a signal temporal logic (STL) specification. We propose a method to effectively augment the MDP state space to capture the required state history and express the STL objective as a reachability objective. The planning problem can then be formulated as a finite-horizon constrained Markov decision process (CMDP). For a general finite horizon CMDP problem with unknown transition probability, we develop a reinforcement learning scheme that can leverage any model-free RL algorithm to provide an approximately optimal policy out of the general space of non-stationary randomized policies. We illustrate the effectiveness of our approach in the context of robotic motion planning for complex missions under uncertainty and performance objectives.

Solving Challenging Control Problems Using Two-Staged Deep Reinforcement Learning

Authors:Nitish Sontakke, Sehoon Ha
Date:2021-09-27 20:27:47

We present a deep reinforcement learning (deep RL) algorithm that consists of learning-based motion planning and imitation to tackle challenging control problems. Deep RL has been an effective tool for solving many high-dimensional continuous control problems, but it cannot effectively solve challenging problems with certain properties, such as sparse reward functions or sensitive dynamics. In this work, we propose an approach that decomposes the given problem into two deep RL stages: motion planning and motion imitation. The motion planning stage seeks to compute a feasible motion plan by leveraging the powerful planning capability of deep RL. Subsequently, the motion imitation stage learns a control policy that can imitate the given motion plan with realistic sensors and actuation models. This new formulation requires only a nominal added cost to the user because both stages require minimal changes to the original problem. We demonstrate that our approach can solve challenging control problems, rocket navigation, and quadrupedal locomotion, which cannot be solved by the monolithic deep RL formulation or the version with Probabilistic Roadmap.

A Dynamic Programming Algorithm for Finding an Optimal Sequence of Informative Measurements

Authors:Peter N. Loxley, Ka-Wai Cheung
Date:2021-09-24 08:40:06

An informative measurement is the most efficient way to gain information about an unknown state. We present a first-principles derivation of a general-purpose dynamic programming algorithm that returns an optimal sequence of informative measurements by sequentially maximizing the entropy of possible measurement outcomes. This algorithm can be used by an autonomous agent or robot to decide where best to measure next, planning a path corresponding to an optimal sequence of informative measurements. The algorithm is applicable to states and controls that are either continuous or discrete, and agent dynamics that is either stochastic or deterministic; including Markov decision processes and Gaussian processes. Recent results from the fields of approximate dynamic programming and reinforcement learning, including on-line approximations such as rollout and Monte Carlo tree search, allow the measurement task to be solved in real time. The resulting solutions include non-myopic paths and measurement sequences that can generally outperform, sometimes substantially, commonly used greedy approaches. This is demonstrated for a global search task, where on-line planning for a sequence of local searches is found to reduce the number of measurements in the search by approximately half. A variant of the algorithm is derived for Gaussian processes for active sensing.
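
The one-step criterion of maximizing the entropy of possible measurement outcomes can be sketched for a discrete belief. The example assumes noise-free measurements, each of which maps the hidden state to an outcome label, and picks the measurement whose outcome distribution has the highest Shannon entropy. It illustrates only the greedy criterion, not the dynamic programming algorithm of the paper, and all names and the search scenario are hypothetical.

```python
import math
from typing import Callable, Dict, Hashable


def outcome_entropy(belief: Dict[int, float],
                    measurement: Callable[[int], Hashable]) -> float:
    """Shannon entropy (bits) of the outcome distribution of a measurement."""
    outcome_probs: Dict[Hashable, float] = {}
    for state, p in belief.items():
        outcome = measurement(state)
        outcome_probs[outcome] = outcome_probs.get(outcome, 0.0) + p
    return -sum(p * math.log2(p) for p in outcome_probs.values() if p > 0)


def most_informative(belief: Dict[int, float],
                     measurements: Dict[str, Callable[[int], Hashable]]) -> str:
    """Name of the measurement with maximal outcome entropy (greedy choice)."""
    return max(measurements,
               key=lambda name: outcome_entropy(belief, measurements[name]))


# Hypothetical search task: a target hides in one of 8 cells, uniform belief.
belief = {cell: 1.0 / 8.0 for cell in range(8)}
# Each measurement asks "is the target among the leftmost k cells?".
measurements = {
    f"left_{k}": (lambda state, k=k: state < k) for k in (1, 2, 4)
}
print(most_informative(belief, measurements))
print(outcome_entropy(belief, measurements["left_4"]))
```

Splitting the belief in half yields one full bit per measurement, which is why the balanced query is preferred over the lopsided ones in this toy setting.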

Deep Reinforcement Learning-Based Long-Range Autonomous Valet Parking for Smart Cities

Authors:Muhammad Khalid, Liang Wang, Kezhi Wang, Cunhua Pan, Nauman Aslam, Yue Cao
Date:2021-09-23 21:55:12

In this paper, to reduce the congestion rate at the city center and increase the quality of experience (QoE) of each user, the framework of long-range autonomous valet parking (LAVP) is presented, where an Autonomous Vehicle (AV) is deployed in the city, which can pick up and drop off users at their required spots, and then drive to the car park outside the city center autonomously. In this framework, we aim to minimize the overall distance traveled by the AV while guaranteeing that all users are served, i.e., picked up and dropped off at their required spots, by optimizing the path planning of the AV and the number of serving time slots. To this end, we first propose a learning-based algorithm, named the Double-Layer Ant Colony Optimization (DL-ACO) algorithm, to solve the above problem in an iterative way. Then, to make real-time decisions while considering the dynamic environment (i.e., the AV may pick up and drop off users from different locations), we further present a deep reinforcement learning (DRL) based algorithm, which is known as deep Q network (DQN). The experimental results show that the DL-ACO and DQN-based algorithms both achieve considerable performance.

All-in-One: A DRL-based Control Switch Combining State-of-the-art Navigation Planners

Authors:Linh Kästner, Johannes Cox, Teham Buiyan, Jens Lambrecht
Date:2021-09-23 20:42:03

Autonomous navigation of mobile robots is an essential aspect in use cases such as delivery, assistance or logistics. Although traditional planning methods are well integrated into existing navigation systems, they struggle in highly dynamic environments. On the other hand, Deep-Reinforcement-Learning-based methods show superior performance in dynamic obstacle avoidance but are not suitable for long-range navigation and struggle with local minima. In this paper, we propose a Deep-Reinforcement-Learning-based control switch, which has the ability to select between different planning paradigms based solely on sensor data observations. Therefore, we develop an interface to efficiently operate multiple model-based, as well as learning-based local planners and integrate a variety of state-of-the-art planners to be selected by the control switch. Subsequently, we evaluate our approach against each planner individually and find improvements in navigation performance, especially for highly dynamic scenarios. Our planner was able to prefer learning-based approaches in situations with a high number of obstacles while relying on the traditional model-based planners in long corridors or empty spaces.

Enhancing Navigational Safety in Crowded Environments using Semantic-Deep-Reinforcement-Learning-based Navigation

Authors:Linh Kästner, Junhui Li, Zhengcheng Shen, Jens Lambrecht
Date:2021-09-23 10:50:47

Intelligent navigation among social crowds is an essential aspect of mobile robotics for applications such as delivery, health care, or assistance. Deep Reinforcement Learning emerged as an alternative planning method to conservative approaches and promises more efficient and flexible navigation. However, in highly dynamic environments employing different kinds of obstacle classes, safe navigation still presents a grand challenge. In this paper, we propose a semantic Deep-reinforcement-learning-based navigation approach that teaches object-specific safety rules by considering high-level obstacle information. In particular, the agent learns object-specific behavior by contemplating the specific danger zones to enhance safety around vulnerable object classes. We tested the approach against a benchmark obstacle avoidance approach and found an increase in safety. Furthermore, we demonstrate that the agent could learn to navigate more safely by keeping an individual safety distance dependent on the semantic information.

Hierarchies of Planning and Reinforcement Learning for Robot Navigation

Authors:Jan Wöhlke, Felix Schmitt, Herke van Hoof
Date:2021-09-23 07:18:15

Solving robotic navigation tasks via reinforcement learning (RL) is challenging due to their sparse reward and long decision horizon nature. However, in many navigation tasks, high-level (HL) task representations, like a rough floor plan, are available. Previous work has demonstrated efficient learning by hierarchical approaches consisting of path planning in the HL representation and using sub-goals derived from the plan to guide the RL policy in the source task. However, these approaches usually neglect the complex dynamics and sub-optimal sub-goal-reaching capabilities of the robot during planning. This work overcomes these limitations by proposing a novel hierarchical framework that utilizes a trainable planning policy for the HL representation. Thereby robot capabilities and environment conditions can be learned utilizing collected rollout data. We specifically introduce a planning policy based on value iteration with a learned transition model (VI-RL). In simulated robotic navigation tasks, VI-RL results in consistently strong improvements over vanilla RL, is on par with vanilla hierarchical RL on single layouts but more broadly applicable to multiple layouts, and is on par with trainable HL path planning baselines except for a parking task with difficult non-holonomic dynamics where it shows marked improvements.
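
Value iteration over a transition model estimated from rollout data can be sketched in tabular form. The example below assumes a small abstract state graph, estimates transition probabilities from counts of observed (state, action, next state) triples, and then plans with value iteration using a reward of 1 for reaching the goal. It is a minimal illustration of the general idea only; the state names, the reward definition, and the function names are hypothetical and do not reproduce the VI-RL architecture.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Model = Dict[Tuple[str, str], Dict[str, float]]


def estimate_transitions(rollouts: List[Tuple[str, str, str]]) -> Model:
    """Empirical transition probabilities P[(s, a)][s'] from rollout triples."""
    counts: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(
        lambda: defaultdict(int))
    for s, a, s_next in rollouts:
        counts[(s, a)][s_next] += 1
    return {sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
            for sa, nxt in counts.items()}


def q_value(P: Model, V: Dict[str, float], s: str, a: str,
            goal: str, gamma: float) -> float:
    # Assumed reward: 1 when the goal is reached, 0 otherwise.
    return sum(p * ((1.0 if s2 == goal else 0.0) + gamma * V[s2])
               for s2, p in P[(s, a)].items())


def value_iteration(P: Model, goal: str, gamma: float = 0.9,
                    iters: int = 100) -> Dict[str, float]:
    states = {s for s, _ in P} | {s2 for nxt in P.values() for s2 in nxt}
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            actions = [a for (s0, a) in P if s0 == s]
            if s == goal or not actions:
                new_V[s] = 0.0  # goal is terminal; unknown states get 0
            else:
                new_V[s] = max(q_value(P, V, s, a, goal, gamma) for a in actions)
        V = new_V
    return V


def greedy_action(P: Model, V: Dict[str, float], s: str,
                  goal: str, gamma: float = 0.9) -> str:
    actions = [a for (s0, a) in P if s0 == s]
    return max(actions, key=lambda a: q_value(P, V, s, a, goal, gamma))


# Hypothetical rollouts: the direct route from A usually fails, the detour works.
rollouts = ([("A", "direct", "G")] + [("A", "direct", "A")] * 3
            + [("A", "via_B", "B")] * 4 + [("B", "go", "G")] * 4)
P = estimate_transitions(rollouts)
V = value_iteration(P, goal="G")
print(greedy_action(P, V, "A", goal="G"), V["A"])
```

Because the model is learned from the robot's own rollouts, the planner prefers the detour that the robot can execute reliably over the nominally shorter route that it rarely completes.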

PredictionNet: Real-Time Joint Probabilistic Traffic Prediction for Planning, Control, and Simulation

Authors:Alexey Kamenev, Lirui Wang, Ollin Boer Bohan, Ishwar Kulkarni, Bilal Kartal, Artem Molchanov, Stan Birchfield, David Nistér, Nikolai Smolyanskiy
Date:2021-09-23 01:23:47

Predicting the future motion of traffic agents is crucial for safe and efficient autonomous driving. To this end, we present PredictionNet, a deep neural network (DNN) that predicts the motion of all surrounding traffic agents together with the ego-vehicle's motion. All predictions are probabilistic and are represented in a simple top-down rasterization that allows an arbitrary number of agents. Conditioned on a multi-layer map with lane information, the network outputs future positions, velocities, and backtrace vectors jointly for all agents including the ego-vehicle in a single pass. Trajectories are then extracted from the output. The network can be used to simulate realistic traffic, and it produces competitive results on popular benchmarks. More importantly, it has been used to successfully control a real-world vehicle for hundreds of kilometers, by combining it with a motion planning/control subsystem. The network runs faster than real-time on an embedded GPU, and the system shows good generalization (across sensory modalities and locations) due to the choice of input representation. Furthermore, we demonstrate that by extending the DNN with reinforcement learning (RL), it can better handle rare or unsafe events like aggressive maneuvers and crashes.

Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks

Authors:Bohan Wu, Suraj Nair, Li Fei-Fei, Chelsea Finn
Date:2021-09-21 16:48:07

In this paper, we study the problem of learning a repertoire of low-level skills from raw images that can be sequenced to complete long-horizon visuomotor tasks. Reinforcement learning (RL) is a promising approach for acquiring short-horizon skills autonomously. However, the focus of RL algorithms has largely been on the success of those individual skills, more so than learning and grounding a large repertoire of skills that can be sequenced to complete extended multi-stage tasks. The latter demands robustness and persistence, as errors in skills can compound over time, and may require the robot to have a number of primitive skills in its repertoire, rather than just one. To this end, we introduce EMBER, a model-based RL method for learning primitive skills that are suitable for completing long-horizon visuomotor tasks. EMBER learns and plans using a learned model, critic, and success classifier, where the success classifier serves both as a reward function for RL and as a grounding mechanism to continuously detect if the robot should retry a skill when unsuccessful or under perturbations. Further, the learned model is task-agnostic and trained using data from all skills, enabling the robot to efficiently learn a number of distinct primitives. These visuomotor primitive skills and their associated pre- and post-conditions can then be directly combined with off-the-shelf symbolic planners to complete long-horizon tasks. On a Franka Emika robot arm, we find that EMBER enables the robot to complete three long-horizon visuomotor tasks at 85% success rate, such as organizing an office desk, a file cabinet, and drawers, which require sequencing up to 12 skills, involve 14 unique learned primitives, and demand generalization to novel objects.

Hierarchical Policy for Non-prehensile Multi-object Rearrangement with Deep Reinforcement Learning and Monte Carlo Tree Search

Authors:Fan Bai, Fei Meng, Jianbang Liu, Jiankun Wang, Max Q. -H. Meng
Date:2021-09-18 17:24:37

Non-prehensile multi-object rearrangement is a robotic task of planning feasible paths and transferring multiple objects to their predefined target poses without grasping. It needs to consider how each object reaches the target and the order of object movement, which significantly deepens the complexity of the problem. To address these challenges, we propose a hierarchical policy to divide and conquer for non-prehensile multi-object rearrangement. In the high-level policy, guided by a designed policy network, the Monte Carlo Tree Search efficiently searches for the optimal rearrangement sequence among multiple objects, which benefits from imitation and reinforcement. In the low-level policy, the robot plans the paths according to the order of path primitives and manipulates the objects to approach the goal poses one by one. We verify through experiments that the proposed method can achieve a higher success rate, fewer steps, and shorter path length compared with the state-of-the-art.

Integrating Deep Reinforcement and Supervised Learning to Expedite Indoor Mapping

Authors:Elchanan Zwecher, Eran Iceland, Sean R. Levy, Shmuel Y. Hayoun, Oren Gal, Ariel Barel
Date:2021-09-17 12:07:07

The challenge of mapping indoor environments is addressed. Typical heuristic algorithms for solving the motion planning problem are frontier-based methods, that are especially effective when the environment is completely unknown. However, in cases where prior statistical data on the environment's architectonic features is available, such algorithms can be far from optimal. Furthermore, their calculation time may increase substantially as more areas are exposed. In this paper we propose two means by which to overcome these shortcomings. One is the use of deep reinforcement learning to train the motion planner. The second is the inclusion of a pre-trained generative deep neural network, acting as a map predictor. Each one helps to improve the decision making through use of the learned structural statistics of the environment, and both, being realized as neural networks, ensure a constant calculation time. We show that combining the two methods can shorten the duration of the mapping process by up to 4 times, compared to frontier-based motion planning.

Target Languages (vs. Inductive Biases) for Learning to Act and Plan

Authors:Hector Geffner
Date:2021-09-15 10:24:13

Recent breakthroughs in AI have shown the remarkable power of deep learning and deep reinforcement learning. These developments, however, have been tied to specific tasks, and progress in out-of-distribution generalization has been limited. While it is assumed that these limitations can be overcome by incorporating suitable inductive biases, the notion of inductive biases itself is often left vague and does not provide meaningful guidance. In the paper, I articulate a different learning approach where representations do not emerge from biases in a neural architecture but are learned over a given target language with a known semantics. The basic ideas are implicit in mainstream AI where representations have been encoded in languages ranging from fragments of first-order logic to probabilistic structural causal models. The challenge is to learn from data the representations that have traditionally been crafted by hand. Generalization is then a result of the semantics of the language. The goals of this paper are to make these ideas explicit, to place them in a broader context where the design of the target language is crucial, and to illustrate them in the context of learning to act and plan. For this, after a general discussion, I consider learning representations of actions, general policies, and subgoals ("intrinsic rewards"). In these cases, learning is formulated as a combinatorial problem but nothing prevents the use of deep learning techniques instead. Indeed, learning representations over languages with a known semantics provides an account of what is to be learned, while learning representations with neural nets provides a complementary account of how representations can be learned. The challenge and the opportunity is to bring the two together.

Automatic Inverse Treatment Planning for Gamma Knife Radiosurgery via Deep Reinforcement Learning

Authors:Yingzi Liu, Chenyang Shen, Tonghe Wang, Jiahan Zhang, Xiaofeng Yang, Tian Liu, Shannon Kahn, Hui-Kuo Shu, Zhen Tian
Date:2021-09-14 16:48:08

Purpose: Several inverse planning algorithms have been developed for Gamma Knife (GK) radiosurgery to determine a large number of plan parameters via solving an optimization problem, which typically consists of multiple objectives. The priorities among these objectives need to be repetitively adjusted to achieve a clinically good plan for each patient. This study aimed to achieve automatic and intelligent priority-tuning, by developing a deep reinforcement learning (DRL) based method to model the tuning behaviors of human planners. Methods: We built a priority-tuning policy network using deep convolutional neural networks. Its input was a vector composed of the plan metrics that were used in our institution for GK plan evaluation. The network can determine which tuning action to take, based on the observed quality of the intermediate plan. We trained the network using an end-to-end DRL framework to approximate the optimal action-value function. A scoring function was designed to measure the plan quality. Results: Vestibular schwannoma was chosen as the test bed in this study. The numbers of training, validation and testing cases were 5, 5, and 16, respectively. For these three datasets, the average plan scores with initial priorities were 3.63 $\pm$ 1.34, 3.83 $\pm$ 0.86 and 4.20 $\pm$ 0.78, respectively, and could be improved to 5.28 $\pm$ 0.23, 4.97 $\pm$ 0.44 and 5.22 $\pm$ 0.26 through manual priority tuning by human expert planners. Our network achieved competitive results with 5.42 $\pm$ 0.11, 5.10 $\pm$ 0.42 and 5.28 $\pm$ 0.20, respectively. Conclusions: Our network can generate GK plans of comparable or slightly higher quality compared with the plans generated by human planners via manual priority tuning. The network can potentially be incorporated into the clinical workflow to improve GK planning efficiency.

DSDF: An approach to handle stochastic agents in collaborative multi-agent reinforcement learning

Authors:Satheesh K. Perepu, Kaushik Dey
Date:2021-09-14 12:02:28

Multi-agent reinforcement learning has received a lot of attention in recent years and has applications in many different areas. Existing methods based on centralized training and decentralized execution attempt to train the agents to learn a pattern of coordinated actions that arrives at an optimal joint policy. However, if some agents are stochastic to varying degrees, these methods often fail to converge and provide poor coordination among agents. In this paper we show how this stochasticity of agents, which could be a result of malfunction or aging of robots, can add to the uncertainty in coordination and thereby contribute to unsatisfactory global coordination. In this case, the deterministic agents have to understand the behavior and limitations of the stochastic agents while arriving at an optimal joint policy. Our solution, DSDF, tunes the discounted factor for the agents according to uncertainty and uses the values to update the utility networks of individual agents. DSDF also helps impart a degree of reliability in coordination by granting stochastic agents tasks that are immediate and of shorter trajectory, with deterministic agents taking the tasks that involve longer planning. Such a method enables joint coordination of agents, some of which may be only partially performing, and can thereby reduce or delay the investment in agent/robot replacement in many circumstances. Results on a benchmark environment for different scenarios show the efficacy of the proposed approach when compared with existing approaches.

Computation Rate Maximum for Mobile Terminals in UAV-assisted Wireless Powered MEC Networks with Fairness Constraint

Authors:Xiaoyi Zhou, Liang Huang, Tong Ye, Weiqiang Sun
Date:2021-09-13 08:15:41

This paper investigates an unmanned aerial vehicle (UAV)-assisted wireless powered mobile-edge computing (MEC) system, where the UAV powers the mobile terminals by wireless power transfer (WPT) and provides computation service for them. We aim to maximize the computation rate of terminals while ensuring fairness among them. Considering the random trajectories of mobile terminals, we propose a soft actor-critic (SAC)-based UAV trajectory planning and resource allocation (SAC-TR) algorithm, which combines off-policy and maximum entropy reinforcement learning to promote the convergence of the algorithm. We design the reward as a heterogeneous function of computation rate, fairness, and reaching of destination. Simulation results show that SAC-TR can quickly adapt to varying network environments and outperform representative benchmarks in a variety of situations.

Bundled Gradients through Contact via Randomized Smoothing

Authors:H. J. Terry Suh, Tao Pang, Russ Tedrake
Date:2021-09-11 00:03:28

The empirical success of derivative-free methods in reinforcement learning for planning through contact seems at odds with the perceived fragility of classical gradient-based optimization methods in these domains. What is causing this gap, and how might we use the answer to improve gradient-based methods? We believe a stochastic formulation of dynamics is one crucial ingredient. We use tools from randomized smoothing to analyze sampling-based approximations of the gradient, and formalize such approximations through the gradient bundle. We show that using the gradient bundle in lieu of the gradient mitigates fast-changing gradients of non-smooth contact dynamics modeled by the implicit time-stepping, or the penalty method. Finally, we apply the gradient bundle to optimal control using iLQR, introducing a novel algorithm which improves convergence over using exact gradients. Combining our algorithm with a convex implicit time-stepping formulation of contact, we show that we can tractably tackle planning-through-contact problems in manipulation.
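
To make the idea of averaging gradients under randomized smoothing concrete, here is a one-dimensional sketch. It is an illustration under stated assumptions, not the authors' implementation: the penalty-style function f(x) = max(0, x), the Gaussian noise model, and the names `contact_force_grad` and `gradient_bundle` are all hypothetical choices made for this example.

```python
import random


def contact_force_grad(x):
    """Exact derivative of the toy penalty f(x) = max(0, x): a step function."""
    return 1.0 if x > 0.0 else 0.0


def gradient_bundle(grad_fn, x, sigma=0.1, num_samples=1000, rng=None):
    """Monte Carlo average of exact gradients at Gaussian-perturbed points.

    Assumption: zero-mean Gaussian perturbations with standard deviation sigma.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_samples):
        total += grad_fn(x + rng.gauss(0.0, sigma))
    return total / num_samples
```

At the contact boundary x = 0 the exact gradient jumps between 0 and 1, while the averaged estimate lies close to 0.5, which is the kind of smoothing the abstract describes.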

TERP: Reliable Planning in Uneven Outdoor Environments using Deep Reinforcement Learning

Authors:Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Utsav Patel, Dinesh Manocha
Date:2021-09-10 22:06:14

We present a novel method for reliable robot navigation in uneven outdoor terrains. Our approach employs a novel fully-trained Deep Reinforcement Learning (DRL) network that uses elevation maps of the environment, robot pose, and goal as inputs to compute an attention mask of the environment. The attention mask is used to identify reduced stability regions in the elevation map and is computed using channel and spatial attention modules and a novel reward function. We continuously compute and update a navigation cost-map that encodes the elevation information or the level-of-flatness of the terrain using the attention mask. We then generate locally least-cost waypoints on the cost-map and compute the final dynamically feasible trajectory using another DRL-based method. Our approach guarantees safe, locally least-cost paths and dynamically feasible robot velocities in uneven terrains. We observe an increase of 35.18% in success rate and a decrease of 26.14% in the cumulative elevation gradient of the robot's trajectory compared to prior navigation methods in high-elevation regions. We evaluate our method on a Husky robot in real-world uneven terrains (~4 m of elevation gain) and demonstrate its benefits.

Potential-based Reward Shaping in Sokoban

Authors:Zhao Yang, Mike Preuss, Aske Plaat
Date:2021-09-10 06:28:09

Learning to solve sparse-reward reinforcement learning problems is difficult, due to the lack of guidance towards the goal. But in some problems, prior knowledge can be used to augment the learning process. Reward shaping is a way to incorporate prior knowledge into the original reward function in order to speed up learning. While previous work has investigated the use of expert knowledge to generate potential functions, in this work we study whether we can use a search algorithm (A*) to automatically generate a potential function for reward shaping in Sokoban, a well-known planning task. The results show that learning with a shaped reward function is faster than learning from scratch. Our results indicate that distance functions could be a suitable potential function for Sokoban. This work demonstrates the possibility of solving multiple instances with the help of reward shaping. The result can be compressed into a single policy, which can be seen as the first phase towards training a general policy that is able to solve unseen instances.
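
As a rough illustration of potential-based reward shaping, the sketch below adds the standard shaping term gamma * Phi(s') - Phi(s) to an environment reward. The potential used here (negative Manhattan distance from a single box to a single goal) is a hypothetical stand-in chosen for brevity. It is not the A*-generated potential studied in the paper, and the function names are assumptions.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def potential(box, goal):
    """Hypothetical potential: higher (less negative) when the box is nearer the goal."""
    return -float(manhattan(box, goal))


def shaped_reward(reward, box, next_box, goal, gamma=0.99):
    """Environment reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_box, goal) - potential(box, goal)
```

With gamma = 1, pushing the box one cell closer to the goal yields a shaping bonus of +1 and pushing it one cell away yields -1, so the agent receives guidance even when the environment reward is zero.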

DAN: Decentralized Attention-based Neural Network for the MinMax Multiple Traveling Salesman Problem

Authors:Yuhong Cao, Zhanhong Sun, Guillaume Sartoretti
Date:2021-09-09 12:26:04

The multiple traveling salesman problem (mTSP) is a well-known NP-hard problem with numerous real-world applications. In particular, this work addresses MinMax mTSP, where the objective is to minimize the max tour length among all agents. Many robotic deployments require recomputing potentially large mTSP instances frequently, making the natural trade-off between computing time and solution quality of great importance. However, exact and heuristic algorithms become inefficient as the number of cities increases, due to their computational complexity. Encouraged by the recent developments in deep reinforcement learning (dRL), this work approaches the mTSP as a cooperative task and introduces DAN, a decentralized attention-based neural method that aims at tackling this key trade-off. In DAN, agents learn fully decentralized policies to collaboratively construct a tour, by predicting each other's future decisions. Our model relies on the Transformer architecture and is trained using multi-agent RL with parameter sharing, providing natural scalability to the numbers of agents and cities. Our experimental results on small- to large-scale mTSP instances ($50$ to $1000$ cities and $5$ to $20$ agents) show that DAN is able to match or outperform state-of-the-art solvers while keeping planning times low. In particular, given the same computation time budget, DAN outperforms all conventional and dRL-based baselines on larger-scale instances (more than 100 cities, more than 5 agents), and exhibits enhanced agent collaboration. A video explaining our approach and presenting our results is available at https://youtu.be/xi3cLsDsLvs.
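
The MinMax objective described above can be written down in a few lines. This is only a sketch of the cost function, not of DAN itself. It assumes Euclidean distances and closed tours that return to their first city, and the names `tour_length` and `minmax_cost` are hypothetical.

```python
import math


def tour_length(tour, coords):
    """Length of a closed tour that visits the cities in order and returns to the start."""
    return sum(
        math.dist(coords[a], coords[b])
        for a, b in zip(tour, tour[1:] + tour[:1])
    )


def minmax_cost(tours, coords):
    """MinMax mTSP objective: the longest tour among all agents."""
    return max(tour_length(tour, coords) for tour in tours)
```

For example, with cities `{0: (0, 0), 1: (3, 0), 2: (3, 4), 3: (0, 1)}` and tours `[[0, 1, 2], [0, 3]]`, the two tour lengths are 12 and 2, so the MinMax cost is 12. Rebalancing cities between agents is what lowers this quantity.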

Hierarchical Object-to-Zone Graph for Object Navigation

Authors:Sixian Zhang, Xinhang Song, Yubing Bai, Weijie Li, Yakui Chu, Shuqiang Jiang
Date:2021-09-05 13:02:17

The goal of object navigation is to reach the expected objects according to visual information in unseen environments. Previous works usually implement deep models to train an agent to predict actions in real time. However, in an unseen environment, when the target object is not in the egocentric view, the agent may not be able to make wise decisions due to the lack of guidance. In this paper, we propose a hierarchical object-to-zone (HOZ) graph to guide the agent in a coarse-to-fine manner, and an online-learning mechanism is also proposed to update HOZ according to the real-time observation in new environments. In particular, the HOZ graph is composed of scene nodes, zone nodes and object nodes. With the pre-learned HOZ graph, the real-time observation and the target goal, the agent can constantly plan an optimal path from zone to zone. In the estimated path, the next potential zone is regarded as a sub-goal, which is also fed into the deep reinforcement learning model for action prediction. Our methods are evaluated on the AI2-Thor simulator. In addition to the widely used evaluation metrics SR and SPL, we also propose a new evaluation metric, SAE, that focuses on the effective action rate. Experimental results demonstrate the effectiveness and efficiency of our proposed method.

Learning Practically Feasible Policies for Online 3D Bin Packing

Authors:Hang Zhao, Chenyang Zhu, Xin Xu, Hui Huang, Kai Xu
Date:2021-08-31 08:37:58

We tackle the Online 3D Bin Packing Problem, a challenging yet practically useful variant of the classical Bin Packing Problem. In this problem, the items are delivered to the agent without revealing the full sequence information. The agent must directly pack these items into the target bin stably without changing their arrival order, and no further adjustment is permitted. Online 3D-BPP can be naturally formulated as a Markov Decision Process (MDP). We adopt deep reinforcement learning, in particular the on-policy actor-critic framework, to solve this MDP with a constrained action space. To learn a practically feasible packing policy, we propose three critical designs. First, we propose an online analysis of packing stability based on a novel stacking tree. It attains a high analysis accuracy while reducing the computational complexity from $O(N^2)$ to $O(N \log N)$, making it especially suited for RL training. Second, we propose decoupled packing policy learning for different dimensions of placement, which enables high-resolution spatial discretization and hence high packing precision. Third, we introduce a reward function that directs the robot to place items in a far-to-near order and therefore simplifies collision avoidance in movement planning of the robotic arm. Furthermore, we provide a comprehensive discussion of several key implementation issues. The extensive evaluation demonstrates that our learned policy outperforms the state-of-the-art methods significantly and is practically usable for real-world applications.

A review of mobile robot motion planning methods: from classical motion planning workflows to reinforcement learning-based architectures

Authors:Lu Dong, Zichen He, Chunwei Song, Changyin Sun
Date:2021-08-31 05:05:30

Motion planning is critical to realize the autonomous operation of mobile robots. As the complexity and randomness of robot application scenarios increase, the planning capability of the classical hierarchical motion planners is challenged. With the development of machine learning, deep reinforcement learning (DRL)-based motion planners have gradually become a research hotspot due to several advantageous features. DRL-based motion planners are model-free and do not rely on a prior structured map. Most importantly, DRL-based motion planners unify the global planner and the local planner. In this paper, we provide a systematic review of various motion planning methods. First, we summarize the representative and state-of-the-art works for each submodule of the classical motion planning architecture and analyze their performance features. Subsequently, we concentrate on summarizing RL-based motion planning approaches, including motion planners combined with RL improvements, map-free RL-based motion planners, and multi-robot cooperative planning methods. Last but not least, we analyze the urgent challenges faced by these mainstream RL-based motion planners in detail, review some state-of-the-art works for these issues, and propose suggestions for future research.

Path Planning for Cellular-Connected UAV: A DRL Solution with Quantum-Inspired Experience Replay

Authors:Yuanjian Li, A. Hamid Aghvami, Daoyi Dong
Date:2021-08-30 12:41:43

In a cellular-connected unmanned aerial vehicle (UAV) network, a minimization problem on the weighted sum of time cost and expected outage duration is considered. Taking advantage of the UAV's adjustable mobility, an intelligent UAV navigation approach is formulated to achieve the aforementioned optimization goal. Specifically, after mapping the navigation task into a Markov decision process (MDP), a deep reinforcement learning (DRL) solution with a novel quantum-inspired experience replay (QiER) framework is proposed to help the UAV find the optimal flying direction within each time slot, so that the designed trajectory towards the destination can be generated. By relating each experienced transition's importance to its associated quantum bit (qubit) and applying a Grover-iteration-based amplitude amplification technique, the proposed DRL-QiER solution achieves a better trade-off between sampling priority and diversity. Numerical results demonstrate the effectiveness and superiority of the proposed DRL-QiER solution over several representative baselines.

Active Inference for Stochastic Control

Authors:Aswin Paul, Noor Sajid, Manoj Gopalkrishnan, Adeel Razi
Date:2021-08-27 12:51:42

Active inference has emerged as an alternative approach to control problems given its intuitive (probabilistic) formalism. However, despite its theoretical utility, computational implementations have largely been restricted to low-dimensional, deterministic settings. This paper highlights that this is a consequence of the inability to adequately model stochastic transition dynamics, particularly when an extensive policy (i.e., action trajectory) space must be evaluated during planning. Fortunately, recent advancements propose a modified planning algorithm for finite temporal horizons. We build upon this work to assess the utility of active inference for a stochastic control setting. For this, we simulate the classic windy grid-world task with additional complexities, namely: 1) environment stochasticity; 2) learning of transition dynamics; and 3) partial observability. Our results demonstrate the advantage of using active inference, compared to reinforcement learning, in both deterministic and stochastic settings.

Deep Reinforcement Learning for Dynamic Band Switch in Cellular-Connected UAV

Authors:Gianluca Fontanesi, Anding Zhu, Hamed Ahmadi
Date:2021-08-26 22:33:40

The choice of the transmitting frequency to provide a cellular-connected Unmanned Aerial Vehicle (UAV) with reliable connectivity and mobility support introduces several challenges. Conventional sub-6 GHz networks are optimized for ground Users (UEs). Operating at the millimeter Wave (mmWave) band would provide high-capacity but highly intermittent links. To reach the destination while minimizing a weighted function of traveling time and number of radio failures, we propose in this paper a UAV joint trajectory and band switch approach. By leveraging Double Deep Q-Learning, we develop two different approaches to learn a trajectory while managing the band switch. A first, blind approach switches the band along the trajectory any time the UAV-UE throughput is below a predefined threshold. In addition, we propose a smart approach for simultaneous learning-based path planning of the UAV and band switch. The two approaches are compared with an optimal band switch strategy in terms of radio failures and band switches for different thresholds. Results reveal that, in a high-threshold regime, the smart approach is able to reduce the number of radio failures and band switches while reaching the desired destination.
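
For readers unfamiliar with Double Deep Q-Learning, the sketch below shows the generic bootstrap target it uses: the online network selects the next action and the target network evaluates it. This is the textbook form only. The paper's state, action and reward definitions are not reproduced here, and the function name and list-based Q-value inputs are assumptions made for illustration.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Bootstrap target for one transition under Double Q-learning.

    next_q_online / next_q_target: Q-values of the next state, one entry per
    action, from the online and the target network respectively.
    """
    if done:
        # No bootstrapping past a terminal state.
        return reward
    # The online network picks the greedy action ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... and the target network supplies that action's value.
    return reward + gamma * next_q_target[best_action]
```

Decoupling action selection from action evaluation is what distinguishes this target from the single-network Q-learning target, which takes the maximum over one set of estimates.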

Indoor Path Planning for an Unmanned Aerial Vehicle via Curriculum Learning

Authors:Jongmin Park, Sooyoung Jang, Younghoon Shin
Date:2021-08-23 07:41:10

In this study, reinforcement learning was applied to learning two-dimensional path planning, including obstacle avoidance, by an unmanned aerial vehicle (UAV) in an indoor environment. The task assigned to the UAV was to reach the goal position in the shortest amount of time without colliding with any obstacles. Reinforcement learning was performed in a virtual environment created using Gazebo, a virtual environment simulator, to reduce the learning time and cost. Curriculum learning, which consists of two stages, was performed for more efficient learning. As a result of learning with two reward models, the maximum goal rates achieved were 71.2% and 88.0%.

An Independent Study of Reinforcement Learning and Autonomous Driving

Authors:Hanzhi Yang
Date:2021-08-20 23:46:12

Reinforcement learning has become one of the most popular subjects of the past decade. It has seen applications in various fields such as robot manipulation, autonomous driving, path planning, computer gaming, etc. We accomplished three tasks during the course of this project. Firstly, we studied the Q-learning algorithm for tabular environments and applied it successfully to an OpenAI Gym environment, Taxi. Secondly, we gained an understanding of and implemented the deep Q-network algorithm for the Cart-Pole environment. Thirdly, we also studied the application of reinforcement learning in autonomous driving and its combination with safety check constraints (safety controllers). We trained a rough autonomous driving agent using the highway-gym environment and explored the effects of various environment configurations, such as reward functions, on the agent's training performance.
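
The tabular Q-learning update mentioned above can be sketched as follows. This shows the standard update rule only and is not the project's code: the dictionary-based table, the function name and the default hyperparameters are assumptions chosen for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new value.

    q maps (state, action) pairs to values; unseen pairs default to 0.0.
    """
    # Greedy value of the next state; zero if the episode has ended.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage with a table that starts at zero everywhere.
q_table = defaultdict(float)
q_learning_update(q_table, "s0", "right", 1.0, "s1", ["left", "right"])
```

In an environment such as Taxi, the same update would be applied to every observed transition, with actions chosen by an exploration strategy such as epsilon-greedy.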

Adaptive Selection of Informative Path Planning Strategies via Reinforcement Learning

Authors:Taeyeong Choi, Grzegorz Cielniak
Date:2021-08-14 21:32:33

In our previous work, we designed a systematic policy to prioritize sampling locations, leading to significant accuracy improvement in spatial interpolation, by using the prediction uncertainty of Gaussian Process Regression (GPR) as an "attraction force" on deployed robots in path planning. Although the integration with Traveling Salesman Problem (TSP) solvers was also shown to produce relatively short travel distances, we here hypothesise several factors that could decrease the overall prediction precision as well, because sub-optimal locations may eventually be included in their paths. To address this issue, in this paper, we first explore "local planning" approaches that adopt various spatial ranges within which the next sampling locations are prioritized, to investigate their effects on the prediction performance as well as the incurred travel distance. In addition, Reinforcement Learning (RL)-based high-level controllers are trained to adaptively produce blended plans from a particular set of local planners, inheriting unique strengths from that selection depending on the latest prediction states. Our experiments on use cases of temperature-monitoring robots demonstrate that the dynamic mixtures of planners can not only generate sophisticated, informative plans that a single planner could not create alone, but also ensure significantly reduced travel distances at no cost to prediction reliability, without the assistance of additional modules for shortest-path calculation.

Offline-Online Reinforcement Learning for Energy Pricing in Office Demand Response: Lowering Energy and Data Costs

Authors:Doseok Jang, Lucas Spangher, Manan Khattar, Utkarsha Agwan, Selvaprabuh Nadarajah, Costas Spanos
Date:2021-08-14 17:29:59

Our team is proposing to run a full-scale energy demand response experiment in an office building. Although this is an exciting endeavor which will provide value to the community, collecting training data for the reinforcement learning agent is costly and will be limited. In this work, we examine how offline training can be leveraged to minimize data costs (accelerate convergence) and program implementation costs. We present two approaches to doing so: pretraining our model with simulated tasks to warm start the experiment, and using a planning model trained to simulate the real world's rewards to the agent. We present results that demonstrate the utility of offline reinforcement learning for efficient price-setting in the energy demand response problem.

Q-Mixing Network for Multi-Agent Pathfinding in Partially Observable Grid Environments

Authors:Vasilii Davydov, Alexey Skrynnik, Konstantin Yakovlev, Aleksandr I. Panov
Date:2021-08-13 09:44:47

In this paper, we consider the problem of multi-agent navigation in partially observable grid environments. This problem is challenging for centralized planning approaches as they typically rely on full knowledge of the environment. We suggest utilizing a reinforcement learning approach in which the agents first learn the policies that map observations to actions and then follow these policies to reach their goals. To tackle the challenge associated with learning cooperative behavior, i.e., that in many cases agents need to yield to each other to accomplish a mission, we use a mixing Q-network that complements learning individual policies. In the experimental evaluation, we show that such an approach leads to plausible results and scales well to a large number of agents.

Efficient Local Planning with Linear Function Approximation

Authors:Dong Yin, Botao Hao, Yasin Abbasi-Yadkori, Nevena Lazić, Csaba Szepesvári
Date:2021-08-12 04:56:33

We study query and computationally efficient planning algorithms with linear function approximation and a simulator. We assume that the agent only has local access to the simulator, meaning that the agent can only query the simulator at states that have been visited before. This setting is more practical than many prior works on reinforcement learning with a generative model. We propose two algorithms, named confident Monte Carlo least square policy iteration (Confident MC-LSPI) and confident Monte Carlo Politex (Confident MC-Politex) for this setting. Under the assumption that the Q-functions of all policies are linear in known features of the state-action pairs, we show that our algorithms have polynomial query and computational costs in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while these costs are independent of the size of the state space. One technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on $\ell_\infty$-bounded approximate policy iteration to show that our algorithm can learn the optimal policy for the given initial state even only with local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.

DQ-GAT: Towards Safe and Efficient Autonomous Driving with Deep Q-Learning and Graph Attention Networks

Authors:Peide Cai, Hengli Wang, Yuxiang Sun, Ming Liu
Date:2021-08-11 04:55:23

Autonomous driving in multi-agent dynamic traffic scenarios is challenging: the behaviors of road users are uncertain and are hard to model explicitly, and the ego-vehicle should apply complicated negotiation skills with them, such as yielding, merging and taking turns, to achieve both safe and efficient driving in various settings. Traditional planning methods are largely rule-based and scale poorly in these complex dynamic scenarios, often leading to reactive or even overly conservative behaviors. Therefore, they require tedious human efforts to maintain workability. Recently, deep learning-based methods have shown promising results with better generalization capability but less hand engineering efforts. However, they are either implemented with supervised imitation learning (IL), which suffers from dataset bias and distribution mismatch issues, or are trained with deep reinforcement learning (DRL) but focus on one specific traffic scenario. In this work, we propose DQ-GAT to achieve scalable and proactive autonomous driving, where graph attention-based networks are used to implicitly model interactions, and deep Q-learning is employed to train the network end-to-end in an unsupervised manner. Extensive experiments in a high-fidelity driving simulator show that our method achieves higher success rates than previous learning-based methods and a traditional rule-based method, and better trades off safety and efficiency in both seen and unseen scenarios. Moreover, qualitative results on a trajectory dataset indicate that our learned policy can be transferred to the real world for practical applications with real-time speeds. Demonstration videos are available at https://caipeide.github.io/dq-gat/.

Deep Reinforcement Learning for Demand Driven Services in Logistics and Transportation Systems: A Survey

Authors:Zefang Zong, Jingwei Wang, Tao Feng, Tong Xia, Depeng Jin, Yong Li
Date:2021-08-10 06:13:05

Recent technology development brings the boom of numerous new Demand-Driven Services (DDS) into urban lives, including ridesharing, on-demand delivery, express systems and warehousing. In DDS, a service loop is an elemental structure, including its service worker, the service providers and corresponding service targets. The service workers should transport either people or parcels from the providers to the target locations. Various planning tasks within DDS can thus be classified into two individual stages: 1) Dispatching, which is to form service loops from demand/supply distributions, and 2) Routing, which is to decide specific serving orders within the constructed loops. Generating high-quality strategies in both stages is important to develop DDS but faces several challenges. Meanwhile, deep reinforcement learning (DRL) has been developed rapidly in recent years. It is a powerful tool to solve these problems since DRL can learn a parametric model without relying on too many problem-based assumptions and optimize long-term effects by learning sequential decisions. In this survey, we first define DDS, then highlight common applications and important decision/control problems within. For each problem, we comprehensively introduce the existing DRL solutions. We also introduce open simulation environments for development and evaluation of DDS applications. Finally, we analyze remaining challenges and discuss further research opportunities in DRL solutions for DDS.

Mapless Humanoid Navigation Using Learned Latent Dynamics

Authors:Andre Brandenburger, Diego Rodriguez, Sven Behnke
Date:2021-08-09 08:24:54

In this paper, we propose a novel Deep Reinforcement Learning approach to address the mapless navigation problem, in which the locomotion actions of a humanoid robot are taken online based on the knowledge encoded in learned models. Planning happens by generating open-loop trajectories in a learned latent space that captures the dynamics of the environment. Our planner considers visual (RGB images) and non-visual observations (e.g., attitude estimations). This gives the agent awareness not only of the scenario, but also of its own state. In addition, we incorporate a termination likelihood predictor model as an auxiliary loss function of the control policy, which enables the agent to anticipate terminal states of success and failure. In this manner, the sample efficiency of the approach for episodic tasks is increased. Our model is evaluated on the NimbRo-OP2X humanoid robot, which navigates scenes while avoiding collisions efficiently, both in simulation and on the real hardware.

Towards real-world navigation with deep differentiable planners

Authors:Shu Ishida, João F. Henriques
Date:2021-08-08 11:29:16

We train embodied neural networks to plan and navigate unseen complex 3D environments, emphasising real-world deployment. Rather than requiring prior knowledge of the agent or environment, the planner learns to model the state transitions and rewards. To avoid the potentially hazardous trial-and-error of reinforcement learning, we focus on differentiable planners such as Value Iteration Networks (VIN), which are trained offline from safe expert demonstrations. Although they work well in small simulations, we address two major limitations that hinder their deployment. First, we observed that current differentiable planners struggle to plan long-term in environments with a high branching complexity. While they should ideally learn to assign low rewards to obstacles to avoid collisions, we posit that the constraints imposed on the network are not strong enough to guarantee the network to learn sufficiently large penalties for every possible collision. We thus impose a structural constraint on the value iteration, which explicitly learns to model any impossible actions. Secondly, we extend the model to work with a limited perspective camera under translation and rotation, which is crucial for real robot deployment. Many VIN-like planners assume a 360 degrees or overhead view without rotation. In contrast, our method uses a memory-efficient lattice map to aggregate CNN embeddings of partial observations, and models the rotational dynamics explicitly using a 3D state-space grid (translation and rotation). Our proposals significantly improve semantic navigation and exploration on several 2D and 3D environments, succeeding in settings that are otherwise challenging for this class of methods. As far as we know, we are the first to successfully perform differentiable planning on the difficult Active Vision Dataset, consisting of real images captured from a robot.

Temporally Abstract Partial Models

Authors:Khimya Khetarpal, Zafarali Ahmed, Gheorghe Comanici, Doina Precup
Date:2021-08-06 17:26:21

Humans and animals have the ability to reason and make predictions about different courses of action at many time scales. In reinforcement learning, option models (Sutton, Precup & Singh, 1999; Precup, 2000) provide the framework for this kind of temporally abstract prediction and reasoning. Natural intelligent agents are also able to focus their attention on courses of action that are relevant or feasible in a given situation, sometimes termed affordable actions. In this paper, we define a notion of affordances for options, and develop temporally abstract partial option models, that take into account the fact that an option might be affordable only in certain situations. We analyze the trade-offs between estimation and approximation error in planning and learning when using such models, and identify some interesting special cases. Additionally, we demonstrate empirically the potential impact of partial option models on the efficiency of planning.

Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

Authors:Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson
Date:2021-08-05 16:34:17

The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.

Planning with Learned Dynamic Model for Unsupervised Point Cloud Registration

Authors:Haobo Jiang, Jin Xie, Jianjun Qian, Jian Yang
Date:2021-08-05 13:47:11

Point cloud registration is a fundamental problem in 3D computer vision. In this paper, we cast point cloud registration into a planning problem in reinforcement learning, which can seek the transformation between the source and target point clouds through trial and error. By modeling the point cloud registration process as a Markov decision process (MDP), we develop a latent dynamic model of point clouds, consisting of a transformation network and evaluation network. The transformation network aims to predict the new transformed feature of the point cloud after performing a rigid transformation (i.e., action) on it while the evaluation network aims to predict the alignment precision between the transformed source point cloud and target point cloud as the reward signal. Once the dynamic model of the point cloud is trained, we employ the cross-entropy method (CEM) to iteratively update the planning policy by maximizing the rewards in the point cloud registration process. Thus, the optimal policy, i.e., the transformation between the source and target point clouds, can be obtained via gradually narrowing the search space of the transformation. Experimental results on ModelNet40 and 7Scene benchmark datasets demonstrate that our method can yield good registration performance in an unsupervised manner.
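
The cross-entropy method (CEM) step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `cem_plan`, its hyperparameters, and the quadratic `toy_reward` (a stand-in for the learned evaluation network) are all assumptions made for the example.

```python
import numpy as np

def cem_plan(reward_fn, dim, iters=20, pop=64, n_elite=8, seed=0):
    """Cross-entropy method: iteratively narrow a Gaussian over candidate actions."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        # Sample candidate actions from the current search distribution.
        samples = rng.normal(mean, std, size=(pop, dim))
        rewards = np.array([reward_fn(s) for s in samples])
        # Keep the highest-reward candidates and refit the Gaussian to them.
        elite = samples[np.argsort(rewards)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy stand-in for a learned reward model (assumption): reward peaks at `target`.
target = np.array([0.5, -0.3, 0.8])
toy_reward = lambda a: -np.sum((a - target) ** 2)
best = cem_plan(toy_reward, dim=3)
```

Refitting the Gaussian to the elite samples is what gradually narrows the search space; in the registration setting, the sampled vector would presumably parameterize a rigid transformation.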

Learning to Design and Construct Bridge without Blueprint

Authors:Yunfei Li, Tao Kong, Lei Li, Yifeng Li, Yi Wu
Date:2021-08-05 08:17:22

Autonomous assembly has been a desired functionality of many intelligent robot systems. We study a new challenging assembly task, designing and constructing a bridge without a blueprint. In this task, the robot needs to first design a feasible bridge architecture for arbitrarily wide cliffs and then manipulate the blocks reliably to construct a stable bridge according to the proposed design. In this paper, we propose a bi-level approach to tackle this task. At the high level, the system learns a bridge blueprint policy in a physical simulator using deep reinforcement learning and curriculum learning. A policy is represented as an attention-based neural network with object-centric input, which enables generalization to different numbers of blocks and cliff widths. For low-level control, we implement a motion-planning-based policy for real-robot motion control, which can be directly combined with a trained blueprint policy for real-world bridge construction without tuning. In our field study, our bi-level robot system demonstrates the capability of manipulating blocks to construct a diverse set of bridges with different architectures.

Risk Conditioned Neural Motion Planning

Authors:Xin Huang, Meng Feng, Ashkan Jasour, Guy Rosman, Brian Williams
Date:2021-08-04 05:33:52

Risk-bounded motion planning is an important yet difficult problem for safety-critical tasks. While existing mathematical programming methods offer theoretical guarantees in the context of constrained Markov decision processes, they either lack scalability in solving larger problems or produce conservative plans. Recent advances in deep reinforcement learning improve scalability by learning policy networks as function approximators. In this paper, we propose an extension of the soft actor-critic model that estimates the execution risk of a plan through a risk critic and produces risk-bounded policies efficiently by adding an extra risk term to the loss function of the policy network. We define the execution risk in an accurate form, as opposed to approximating it through a summation of immediate risks at each time step, which leads to conservative plans. Our proposed model is conditioned on a continuous spectrum of risk bounds, allowing the user to adjust the risk-averse level of the agent on the fly. Through a set of experiments, we show the advantage of our model in terms of both computational time and plan quality, compared to a state-of-the-art mathematical programming baseline, and validate its performance in more complicated scenarios, including nonlinear dynamics and a larger state space.

MBDP: A Model-based Approach to Achieve both Robustness and Sample Efficiency via Double Dropout Planning

Authors:Wanpeng Zhang, Xi Xiao, Yao Yao, Mingzhe Chen, Dijun Luo
Date:2021-08-03 04:55:16

Model-based reinforcement learning is a widely accepted solution for solving excessive sample demands. However, the predictions of the dynamics models are often not accurate enough, and the resulting bias may incur catastrophic decisions due to insufficient robustness. Therefore, it is highly desired to investigate how to improve the robustness of model-based RL algorithms while maintaining high sampling efficiency. In this paper, we propose Model-Based Double-dropout Planning (MBDP) to balance robustness and efficiency. MBDP consists of two kinds of dropout mechanisms, where the rollout-dropout aims to improve the robustness with a small cost of sample efficiency, while the model-dropout is designed to compensate for the lost efficiency at a slight expense of robustness. By combining them in a complementary way, MBDP provides a flexible control mechanism to meet different demands of robustness and efficiency by tuning two corresponding dropout ratios. The effectiveness of MBDP is demonstrated both theoretically and experimentally.

Time-based Dynamic Controllability of Disjunctive Temporal Networks with Uncertainty: A Tree Search Approach with Graph Neural Network Guidance

Authors:Kevin Osanlou, Jeremy Frank, J. Benton, Andrei Bursuc, Christophe Guettier, Eric Jacopin, Tristan Cazenave
Date:2021-08-02 17:54:25

Scheduling in the presence of uncertainty is an area of interest in artificial intelligence due to the large number of applications. We study the problem of dynamic controllability (DC) of disjunctive temporal networks with uncertainty (DTNU), which seeks a strategy to satisfy all constraints in response to uncontrollable action durations. We introduce a more restricted, stronger form of controllability than DC for DTNUs, time-based dynamic controllability (TDC), and present a tree search approach to determine whether or not a DTNU is TDC. Moreover, we leverage the learning capability of a message passing neural network (MPNN) as a heuristic for tree search guidance. Finally, we conduct experiments for which the tree search shows superior results to state-of-the-art timed-game automata (TGA) based approaches. We observe that using an MPNN for tree search guidance leads to a significant increase in solving performance and scalability to harder DTNU problems.

UAV Trajectory Planning in Wireless Sensor Networks for Energy Consumption Minimization by Deep Reinforcement Learning

Authors:Botao Zhu, Ebrahim Bedeer, Ha H. Nguyen, Robert Barton, Jerome Henry
Date:2021-08-01 03:02:11

Unmanned aerial vehicles (UAVs) have emerged as a promising candidate solution for data collection in large-scale wireless sensor networks (WSNs). In this paper, we investigate a UAV-aided WSN, where cluster heads (CHs) receive data from their member nodes, and a UAV is dispatched to collect data from CHs along the planned trajectory. We aim to minimize the total energy consumption of the UAV-WSN system in a complete round of data collection. Toward this end, we formulate the energy consumption minimization problem as a constrained combinatorial optimization problem by jointly selecting CHs from nodes within clusters and planning the UAV's visiting order to the selected CHs. The formulated energy consumption minimization problem is NP-hard, and hence hard to solve optimally. To tackle this challenge, we propose a novel deep reinforcement learning (DRL) technique, pointer network-A* (Ptr-A*), which can efficiently learn from experience the UAV trajectory policy for minimizing the energy consumption. The UAV's start point and the WSN with a set of pre-determined clusters are fed into the Ptr-A*, and the Ptr-A* outputs a group of CHs and the visiting order to these CHs, i.e., the UAV's trajectory. The parameters of the Ptr-A* are trained on small-scale cluster problem instances for faster training by using the actor-critic algorithm in an unsupervised manner. At inference, three search strategies are also proposed to improve the quality of solutions. Simulation results show that the models trained on 20-cluster and 40-cluster instances generalize well to the UAV's trajectory planning problem in WSNs with different numbers of clusters, without the need to retrain the models. Furthermore, the results show that our proposed DRL algorithm outperforms two baseline techniques.

Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning

Authors:Pedro A. Tsividis, Joao Loula, Jake Burga, Nathan Foss, Andres Campero, Thomas Pouncy, Samuel J. Gershman, Joshua B. Tenenbaum
Date:2021-07-27 01:38:13

Reinforcement learning (RL) studies how an agent comes to achieve reward in an environment through interactions over time. Recent advances in machine RL have surpassed human expertise at the world's oldest board games and many classic video games, but they require vast quantities of experience to learn successfully -- none of today's algorithms account for the human ability to learn so many different tasks, so quickly. Here we propose a new approach to this challenge based on a particularly strong form of model-based RL which we call Theory-Based Reinforcement Learning, because it uses human-like intuitive theories -- rich, abstract, causal models of physical objects, intentional agents, and their interactions -- to explore and model an environment, and plan effectively to achieve task goals. We instantiate the approach in a video game playing agent called EMPA (the Exploring, Modeling, and Planning Agent), which performs Bayesian inference to learn probabilistic generative models expressed as programs for a game-engine simulator, and runs internal simulations over these models to support efficient object-based, relational exploration and heuristic planning. EMPA closely matches human learning efficiency on a suite of 90 challenging Atari-style video games, learning new games in just minutes of game play and generalizing robustly to new game situations and new levels. The model also captures fine-grained structure in people's exploration trajectories and learning dynamics. Its design and behavior suggest a way forward for building more general human-like AI systems.

Using reinforcement learning to autonomously identify sources of error for agents in group missions

Authors:Keishu Utimula, Ken-taro Hayaschi, Trevor J. Bihl, Kenta Hongo, Ryo Maezono
Date:2021-07-20 02:40:19

When agents swarm to execute a mission, some of them frequently exhibit sudden failure, as observed from the command base. It is generally difficult to determine whether a failure is caused by actuators (hypothesis, $h_a$) or sensors (hypothesis, $h_s$) by solely relying on the communication between the command base and the agent concerned. However, by instigating collisions between the agents, the cause of failure can be identified; in other words, we expect to detect corresponding displacements for $h_a$ but not for $h_s$. In this study, we considered the question of whether artificial intelligence can autonomously generate an action plan $\boldsymbol{g}$ to pinpoint the cause as described above. Because the expected response to $\boldsymbol{g}$ generally depends upon the adopted hypothesis [let the difference be denoted by $D(\boldsymbol{g})$], a formulation that uses $D(\boldsymbol{g})$ to pinpoint the cause can be made. Although a $\boldsymbol{g}^*$ that maximizes $D(\boldsymbol{g})$ would be a suitable action plan for this task, such an optimization is difficult to achieve using the conventional gradient method, as $D(\boldsymbol{g})$ becomes nonzero only in rare events such as collisions with other agents, and most swarm actions $\boldsymbol{g}$ give $D(\boldsymbol{g})=0$. In other words, throughout almost the entire space of $\boldsymbol{g}$, $D(\boldsymbol{g})$ has zero gradient, and the gradient method is not applicable. To overcome this problem, we formulated an action plan using Q-table reinforcement learning. Surprisingly, the optimal action plan generated via reinforcement learning presented a human-like solution to pinpoint the problem by making other agents collide with the failed agent. Using this simple prototype, we demonstrated the potential of applying Q-table reinforcement learning methods to plan autonomous actions to pinpoint the causes of failure.
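
To illustrate why a Q-table can cope with a signal that is zero almost everywhere, here is a minimal tabular Q-learning sketch on a toy chain whose reward is nonzero only at the goal. The environment, the constants, and the names `step` and `q_learning` are assumptions made for illustration; this is not the swarm setup or the $D(\boldsymbol{g})$ objective of the paper.

```python
import numpy as np

N_STATES, LEFT, RIGHT = 6, 0, 1   # states 0..5; state 5 is the goal

def step(state, action):
    """Deterministic chain; reward is zero everywhere except on reaching the goal."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, 2))
    for _ in range(episodes):
        state = 0
        for _ in range(100):
            # Uniform random behaviour policy; Q-learning is off-policy.
            action = int(rng.integers(2))
            nxt, reward, done = step(state, action)
            # Bootstrapped target: the rare reward propagates backwards
            # through the table, with no gradient of the reward required.
            target = reward + (0.0 if done else gamma * q[nxt].max())
            q[state, action] += alpha * (target - q[state, action])
            state = nxt
            if done:
                break
    return q

q_table = q_learning()
greedy = q_table[:-1].argmax(axis=1)   # greedy action in each non-goal state
```

The greedy policy read from `q_table` moves toward the goal from every state, even though a gradient of the reward with respect to the action sequence would be zero almost everywhere.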

Structured World Belief for Reinforcement Learning in POMDP

Authors:Gautam Singh, Skand Peri, Junghyun Kim, Hyunseok Kim, Sungjin Ahn
Date:2021-07-19 01:47:53

Object-centric world models provide structured representation of the scene and can be an important backbone in reinforcement learning and planning. However, existing approaches suffer in partially-observable environments due to the lack of belief states. In this paper, we propose Structured World Belief, a model for learning and inference of object-centric belief states. Inferred by Sequential Monte Carlo (SMC), our belief states provide multiple object-centric scene hypotheses. To synergize the benefits of SMC particles with object representations, we also propose a new object-centric dynamics model that considers the inductive bias of object permanence. This enables tracking of object states even when they are invisible for a long time. To further facilitate object tracking in this regime, we allow our model to attend flexibly to any spatial location in the image which was restricted in previous models. In experiments, we show that object-centric belief provides a more accurate and robust performance for filtering and generation. Furthermore, we show the efficacy of structured world belief in improving the performance of reinforcement learning, planning and supervised reasoning.

Vision-Based Autonomous Car Racing Using Deep Imitative Reinforcement Learning

Authors:Peide Cai, Hengli Wang, Huaiyang Huang, Yuxuan Liu, Ming Liu
Date:2021-07-18 00:00:48

Autonomous car racing is a challenging task in the robotic control area. Traditional modular methods require accurate mapping, localization and planning, which makes them computationally inefficient and sensitive to environmental changes. Recently, deep-learning-based end-to-end systems have shown promising results for autonomous driving/racing. However, they are commonly implemented by supervised imitation learning (IL), which suffers from the distribution mismatch problem, or by reinforcement learning (RL), which requires a huge amount of risky interaction data. In this work, we present a general deep imitative reinforcement learning approach (DIRL), which successfully achieves agile autonomous racing using visual inputs. The driving knowledge is acquired from both IL and model-based RL, where the agent can learn from human teachers as well as perform self-improvement by safely interacting with an offline world model. We validate our algorithm both in a high-fidelity driving simulation and on a real-world 1/20-scale RC-car with limited onboard computation. The evaluation results demonstrate that our method outperforms previous IL and RL methods in terms of sample efficiency and task performance. Demonstration videos are available at https://caipeide.github.io/autorace-dirl/

High-Accuracy Model-Based Reinforcement Learning, a Survey

Authors:Aske Plaat, Walter Kosters, Mike Preuss
Date:2021-07-17 14:01:05

Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision-making problems from game playing and robotics have been solved with deep model-free methods. Unfortunately, the sample complexity of model-free methods is often high. To reduce the number of environment samples, model-based reinforcement learning creates an explicit model of the environment dynamics. Achieving high model accuracy is a challenge in high-dimensional problems. In recent years, a diverse landscape of model-based methods has been introduced to improve model accuracy, using methods such as uncertainty modeling, model-predictive control, latent models, and end-to-end learning and planning. Some of these methods succeed in achieving high accuracy at low sample complexity; most do so in either a robotics or a games context. In this paper, we survey these methods; we explain in detail how they work and what their strengths and weaknesses are. We conclude with a research agenda for future work to make the methods more robust and more widely applicable to other applications.

PC-MLP: Model-based Reinforcement Learning with Policy Cover Guided Exploration

Authors:Yuda Song, Wen Sun
Date:2021-07-15 15:49:30

Model-based Reinforcement Learning (RL) is a popular learning paradigm due to its potential sample efficiency compared to model-free RL. However, existing empirical model-based RL approaches lack the ability to explore. This work studies a computationally and statistically efficient model-based algorithm for both Kernelized Nonlinear Regulators (KNR) and linear Markov Decision Processes (MDPs). For both models, our algorithm guarantees polynomial sample complexity and only uses access to a planning oracle. Experimentally, we first demonstrate the flexibility and efficacy of our algorithm on a set of exploration-challenging control tasks where existing empirical model-based RL approaches completely fail. We then show that our approach retains excellent performance even in common dense-reward control benchmarks that do not require heavy exploration. Finally, we demonstrate that our method can also perform reward-free exploration efficiently. Our code can be found at https://github.com/yudasong/PCMLP.

Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks

Authors:Ingmar Schubert, Ozgur S. Oguz, Marc Toussaint
Date:2021-07-14 12:55:41

In high-dimensional state spaces, the usefulness of Reinforcement Learning (RL) is limited by the problem of exploration. This issue has been addressed using potential-based reward shaping (PB-RS) previously. In the present work, we introduce Final-Volume-Preserving Reward Shaping (FV-RS). FV-RS relaxes the strict optimality guarantees of PB-RS to a guarantee of preserved long-term behavior. Being less restrictive, FV-RS allows for reward shaping functions that are even better suited for improving the sample efficiency of RL algorithms. In particular, we consider settings in which the agent has access to an approximate plan. Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS.
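
For reference, the PB-RS baseline mentioned above adds a shaping term of the form F(s, s') = gamma * Phi(s') - Phi(s) to the reward. The sketch below shows only this standard construction; the distance-based `potential`, the goal location, and the sample trajectory are hypothetical, and FV-RS itself is not reproduced because the abstract does not define it.

```python
import numpy as np

GAMMA = 0.99
GOAL = np.array([1.0, 1.0])   # hypothetical goal position

def potential(state):
    """Hypothetical potential: negative Euclidean distance to the goal."""
    return -float(np.linalg.norm(np.asarray(state) - GOAL))

def shaping(state, next_state, gamma=GAMMA):
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_state) - potential(state)

# Along any trajectory the discounted shaping terms telescope, so their sum
# depends only on the endpoints; this is why PB-RS preserves optimal policies.
trajectory = [np.array(p) for p in [(0.0, 0.0), (0.3, 0.1), (0.6, 0.7), (1.0, 1.0)]]
discounted_sum = sum(
    GAMMA ** t * shaping(s, s_next)
    for t, (s, s_next) in enumerate(zip(trajectory[:-1], trajectory[1:]))
)
endpoint_form = (
    GAMMA ** (len(trajectory) - 1) * potential(trajectory[-1])
    - potential(trajectory[0])
)
```

Here `discounted_sum` and `endpoint_form` agree up to floating-point error, which is the telescoping property behind the strict optimality guarantee that FV-RS is said to relax.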

Conservative Offline Distributional Reinforcement Learning

Authors:Yecheng Jason Ma, Dinesh Jayaraman, Osbert Bastani
Date:2021-07-12 15:38:06

Many reinforcement learning (RL) problems in practice are offline, learning purely from observational data. A key challenge is how to ensure the learned policy is safe, which requires quantifying the risk associated with different actions. In the online setting, distributional RL algorithms do so by learning the distribution over returns (i.e., cumulative rewards) instead of the expected return; beyond quantifying risk, they have also been shown to learn better representations for planning. We propose Conservative Offline Distributional Actor Critic (CODAC), an offline RL algorithm suitable for both risk-neutral and risk-averse domains. CODAC adapts distributional RL to the offline setting by penalizing the predicted quantiles of the return for out-of-distribution actions. We prove that CODAC learns a conservative return distribution -- in particular, for finite MDPs, CODAC converges to a uniform lower bound on the quantiles of the return distribution; our proof relies on a novel analysis of the distributional Bellman operator. In our experiments, on two challenging robot navigation tasks, CODAC successfully learns risk-averse policies using offline data collected purely from risk-neutral agents. Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of both expected and risk-sensitive performance.
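
As a rough illustration of how quantile estimates of the return support both risk-neutral and risk-averse decisions, the sketch below ranks two actions by mean return and by conditional value-at-risk (CVaR). The quantile values, the choice of CVaR as the risk measure, and the helper `cvar_from_quantiles` are assumptions for this example; the sketch does not implement CODAC's out-of-distribution penalty.

```python
import numpy as np

def cvar_from_quantiles(quantiles, alpha):
    """Approximate CVaR_alpha as the mean of the lowest alpha-fraction of quantile estimates."""
    q = np.sort(np.asarray(quantiles, dtype=float))
    k = max(1, int(np.ceil(alpha * len(q))))
    return q[:k].mean()

# Hypothetical per-action return quantiles (rows: actions, columns: quantile levels).
return_quantiles = np.array([
    [-10.0, 2.0, 6.0, 8.0, 14.0],   # action 0: higher mean, heavy lower tail
    [1.0, 2.0, 3.0, 4.0, 5.0],      # action 1: lower mean, no bad outcomes
])

# Risk-neutral choice: maximize the mean of the quantile estimates.
risk_neutral_action = int(return_quantiles.mean(axis=1).argmax())
# Risk-averse choice: maximize the average of the worst 20% of outcomes.
risk_averse_action = int(
    np.argmax([cvar_from_quantiles(q, 0.2) for q in return_quantiles])
)
```

With these made-up numbers the risk-neutral rule prefers action 0 and the risk-averse rule prefers action 1, because only the lower tail of each return distribution enters the CVaR estimate.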

Entropy Regularized Motion Planning via Stein Variational Inference

Authors:Alexander Lambert, Byron Boots
Date:2021-07-11 23:39:24

Many Imitation and Reinforcement Learning approaches rely on the availability of expert-generated demonstrations for learning policies or value functions from data. Obtaining a reliable distribution of trajectories from motion planners is non-trivial, since it must broadly cover the space of states likely to be encountered during execution while also satisfying task-based constraints. We propose a sampling strategy based on variational inference to generate distributions of feasible, low-cost trajectories for high-DoF motion planning tasks. This includes a distributed, particle-based motion planning algorithm which leverages structured graphical representations for inference over multi-modal posterior distributions. We also make explicit connections to both approximate inference for trajectory optimization and entropy-regularized reinforcement learning.
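
The particle-based inference named in the title can be illustrated with a generic Stein variational gradient descent (SVGD) update. The sketch below uses a fixed-bandwidth RBF kernel and a toy 2-D Gaussian target; the function `svgd_step`, the step size, and the target are assumptions chosen for illustration and do not reproduce the authors' trajectory planner or its graphical structure.

```python
import numpy as np

def svgd_step(x, grad_log_p, step=0.1, bandwidth=1.0):
    """One SVGD update for particles x of shape (n, d) with an RBF kernel."""
    diff = x[:, None, :] - x[None, :, :]                 # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * bandwidth ** 2))
    # Attraction: kernel-weighted average of the score pulls particles
    # toward high-density regions of the target.
    attract = k @ grad_log_p(x)
    # Repulsion: kernel gradient pushes particles apart, keeping them diverse.
    repulse = np.sum(k[:, :, None] * diff, axis=1) / bandwidth ** 2
    return x + step * (attract + repulse) / len(x)

# Toy target (assumption): unit-covariance Gaussian centred at `mu`.
mu = np.array([2.0, -1.0])
score = lambda x: -(x - mu)        # gradient of log N(mu, I)

rng = np.random.default_rng(0)
particles = rng.normal(size=(50, 2))
for _ in range(500):
    particles = svgd_step(particles, score)
```

The repulsion term is what keeps the particle set from collapsing onto a single mode, which is one way to read the connection to entropy regularization mentioned in the abstract.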

Distributed Deep Reinforcement Learning for Intelligent Traffic Monitoring with a Team of Aerial Robots

Authors:Behzad Khamidehi, Elvino S. Sousa
Date:2021-07-10 22:41:32

This paper studies the traffic monitoring problem in a road network using a team of aerial robots. The problem is challenging due to two main reasons. First, the traffic events are stochastic, both temporally and spatially. Second, the problem has a non-homogeneous structure as the traffic events arrive at different locations of the road network at different rates. Accordingly, some locations require more visits by the robots compared to other locations. To address these issues, we define an uncertainty metric for each location of the road network and formulate a path planning problem for the aerial robots to minimize the network's average uncertainty. We express this problem as a partially observable Markov decision process (POMDP) and propose a distributed and scalable algorithm based on deep reinforcement learning to solve it. We consider two different scenarios depending on the communication mode between the agents (aerial robots) and the traffic management center (TMC). The first scenario assumes that the agents continuously communicate with the TMC to send/receive real-time information about the traffic events. Hence, the agents have global and real-time knowledge of the environment. However, in the second scenario, we consider a challenging setting where the observation of the aerial robots is partial and limited to their sensing ranges. Moreover, in contrast to the first scenario, the information exchange between the aerial robots and the TMC is restricted to specific time instances. We evaluate the performance of our proposed algorithm in both scenarios for a real road network topology and demonstrate its functionality in a traffic monitoring system.

Learning-to-Dispatch: Reinforcement Learning Based Flight Planning under Emergency

Authors:Kai Zhang, Yupeng Yang, Chengtao Xu, Dahai Liu, Houbing Song
Date:2021-07-10 19:21:14

The effectiveness of resource allocation under emergencies, especially hurricane disasters, is crucial. However, most researchers focus on emergency resource allocation in a ground transportation system. In this paper, we propose Learning-to-Dispatch (L2D), a reinforcement learning (RL) based air route dispatching system that aims to add flights for hurricane evacuation while minimizing the airspace's complexity and the air traffic controller's workload. Given a bipartite graph with weights that are learned from the historical flight data using RL in consideration of short- and long-term gains, we formulate the flight dispatch as an online maximum weight matching problem. Different from the conventional order dispatch problem, there is no actual or estimated index that can evaluate how the additional evacuation flights influence the air traffic complexity. We therefore propose a multivariate reward function in the learning phase and compare it with other univariate reward designs to show its superior performance. The experiments using the real-world dataset for Hurricane Irma demonstrate the efficacy and efficiency of our proposed scheme.

Backprop-Free Reinforcement Learning with Active Neural Generative Coding

Authors:Alexander Ororbia, Ankur Mali
Date:2021-07-10 19:02:27

In humans, perceptual awareness facilitates the fast recognition and extraction of information from sensory input. This awareness largely depends on how the human agent interacts with the environment. In this work, we propose active neural generative coding, a computational framework for learning action-driven generative models without backpropagation of errors (backprop) in dynamic environments. Specifically, we develop an intelligent agent that operates even with sparse rewards, drawing inspiration from the cognitive theory of planning as inference. We demonstrate on several simple control problems that our framework performs competitively with deep Q-learning. The robust performance of our agent offers promising evidence that a backprop-free approach for neural inference and learning can drive goal-directed behavior.

Learning Interaction-aware Guidance Policies for Motion Planning in Dense Traffic Scenarios

Authors:Bruno Brito, Achin Agarwal, Javier Alonso-Mora
Date:2021-07-09 16:43:12

Autonomous navigation in dense traffic scenarios remains challenging for autonomous vehicles (AVs) because the intentions of other drivers are not directly observable and AVs have to deal with a wide range of driving behaviors. To maneuver through dense traffic, AVs must be able to reason about how their actions affect others (interaction model) and exploit this reasoning to navigate through dense traffic safely. This paper presents a novel framework for interaction-aware motion planning in dense traffic scenarios. We explore the connection between human driving behavior and their velocity changes when interacting. Hence, we propose to learn, via deep Reinforcement Learning (RL), an interaction-aware policy providing global guidance about the cooperativeness of other vehicles to an optimization-based planner ensuring safety and kinematic feasibility through constraint satisfaction. The learned policy can reason and guide the local optimization-based planner with interactive behavior to pro-actively merge in dense traffic while remaining safe in case the other vehicles do not yield. We present qualitative and quantitative results in highly interactive simulation environments (highway merging and unprotected left turns) against two baseline approaches, a learning-based and an optimization-based method. The presented results demonstrate that our method significantly reduces the number of collisions and increases the success rate with respect to both learning-based and optimization-based baselines.

Reinforcement Learning based Negotiation-aware Motion Planning of Autonomous Vehicles

Authors:Zhitao Wang, Yuzheng Zhuang, Qiang Gu, Dong Chen, Hongbo Zhang, Wulong Liu
Date:2021-07-08 04:39:35

Autonomous vehicles integrating onto roadways with human traffic participants must understand and adapt to the participants' intentions and driving styles, responding in predictable ways without explicit communication. This paper proposes a reinforcement learning based negotiation-aware motion planning framework, which adopts RL to adjust the driving style of the planner by dynamically modifying the prediction horizon length of the motion planner in real time in response to changes in the environment, typically triggered by traffic participants' switch of intents with different driving styles. The framework models the interaction between the autonomous vehicle and other traffic participants as a Markov Decision Process. A temporal sequence of occupancy grid maps is taken as input to the RL module to embed implicit intention reasoning. Curriculum learning is employed to enhance the training efficiency and the robustness of the algorithm. We applied our method to narrow lane navigation in both simulation and the real world to demonstrate that the proposed method outperforms the common alternative due to its advantage in alleviating the social dilemma problem with proper negotiation skills.

Meta-Reinforcement Learning for Heuristic Planning

Authors:Ricardo Luna Gutierrez, Matteo Leonetti
Date:2021-07-06 13:25:52

In Meta-Reinforcement Learning (meta-RL) an agent is trained on a set of tasks to prepare for and learn faster in new, unseen, but related tasks. The training tasks are usually hand-crafted to be representative of the expected distribution of test tasks and hence all used in training. We show that given a set of training tasks, learning can be both faster and more effective (leading to better performance in the test tasks), if the training tasks are appropriately selected. We propose a task selection algorithm, Information-Theoretic Task Selection (ITTS), based on information theory, which optimizes the set of tasks used for training in meta-RL, irrespectively of how they are generated. The algorithm establishes which training tasks are both sufficiently relevant for the test tasks, and different enough from one another. We reproduce different meta-RL experiments from the literature and show that ITTS improves the final performance in all of them.

Control of rough terrain vehicles using deep reinforcement learning

Authors:Viktor Wiberg, Erik Wallin, Martin Servin, Tomas Nordfjell
Date:2021-07-05 08:43:05

We explore the potential to control terrain vehicles using deep reinforcement learning in scenarios where human operators and traditional control methods are inadequate. This letter presents a controller that perceives, plans, and successfully controls a 16-tonne forestry vehicle with two frame articulation joints, six wheels, and their actively articulated suspensions to traverse rough terrain. The carefully shaped reward signal promotes safe, environmental, and efficient driving, which leads to the emergence of unprecedented driving skills. We test learned skills in a virtual environment, including terrains reconstructed from high-density laser scans of forest sites. The controller displays the ability to handle obstacles, slopes up to 27$^\circ$, and a variety of natural terrains, all with limited wheel slip and smooth, upright traversal with intelligent use of the active suspensions. The results confirm that deep reinforcement learning has the potential to enhance control of vehicles with complex dynamics and high-dimensional observation data compared to human operators or traditional control methods, especially in rough terrain.

Sample Efficient Reinforcement Learning via Model-Ensemble Exploration and Exploitation

Authors:Yao Yao, Li Xiao, Zhicheng An, Wanpeng Zhang, Dijun Luo
Date:2021-07-05 07:18:20

Model-based deep reinforcement learning has achieved success in various domains that require high sample efficiency, such as Go and robotics. However, there are some remaining issues, such as planning efficient explorations to learn more accurate dynamic models, evaluating the uncertainty of the learned models, and more rational utilization of models. To mitigate these issues, we present MEEE, a model-ensemble method that consists of optimistic exploration and weighted exploitation. During exploration, unlike prior methods that directly select the optimal action that maximizes the expected cumulative return, our agent first generates a set of action candidates and then seeks out the optimal action that takes both expected return and future observation novelty into account. During exploitation, different discounted weights are assigned to imagined transition tuples according to their respective model uncertainty, which prevents the propagation of model prediction errors in agent training. Experiments on several challenging continuous control benchmark tasks demonstrate that our approach outperforms other model-free and model-based state-of-the-art methods, especially in sample complexity.

Restless and Uncertain: Robust Policies for Restless Bandits via Deep Multi-Agent Reinforcement Learning

Authors:Jackson A. Killian, Lily Xu, Arpita Biswas, Milind Tambe
Date:2021-07-04 17:21:26

We introduce robustness in \textit{restless multi-armed bandits} (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant \emph{uncertainty}, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret -- robust policies for RMABs. Our approach uses a double oracle framework (oracles for \textit{agent} and \textit{nature}), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary "$\lambda$-network" in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest -- we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.
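
To make the minimax regret criterion concrete, here is a small sketch over a finite table of policies and nature's parameter settings. The table and the function name are hypothetical; the paper's double oracle method addresses the case where such a table cannot be enumerated.

```python
from typing import List, Tuple


def minimax_regret_policy(rewards: List[List[float]]) -> Tuple[int, float]:
    """rewards[p][s] is the return of policy p under nature's scenario s.

    Regret of p in s is (best return achievable in s) - rewards[p][s].
    Returns the index of the policy whose worst-case regret is smallest,
    together with that worst-case regret.
    """
    num_scenarios = len(rewards[0])
    best_per_scenario = [max(row[s] for row in rewards) for s in range(num_scenarios)]
    worst_regret = [
        max(best_per_scenario[s] - row[s] for s in range(num_scenarios))
        for row in rewards
    ]
    best_policy = min(range(len(rewards)), key=worst_regret.__getitem__)
    return best_policy, worst_regret[best_policy]


# Hypothetical returns for three policies under two scenarios.
table = [[10.0, 0.0], [6.0, 5.0], [5.5, 5.5]]
policy, regret = minimax_regret_policy(table)
```

In this toy table the minimax regret choice is policy 1 (worst-case regret 4.0), whereas maximizing the worst-case reward would pick policy 2, which shows that the two robustness criteria can disagree.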

Hierarchical Policies for Cluttered-Scene Grasping with Latent Plans

Authors:Lirui Wang, Xiangyun Meng, Yu Xiang, Dieter Fox
Date:2021-07-04 01:26:48

6D grasping in cluttered scenes is a longstanding problem in robotic manipulation. Open-loop manipulation pipelines may fail due to inaccurate state estimation, while most end-to-end grasping methods have not yet scaled to complex scenes with obstacles. In this work, we propose a new method for end-to-end learning of 6D grasping in cluttered scenes. Our hierarchical framework learns collision-free target-driven grasping based on partial point cloud observations. We learn an embedding space to encode expert grasping plans during training and a variational autoencoder to sample diverse grasping trajectories at test time. Furthermore, we train a critic network for plan selection and an option classifier for switching to an instance grasping policy through hierarchical reinforcement learning. We evaluate our method and compare against several baselines in simulation, as well as demonstrate that our latent planning can generalize to real-world cluttered-scene grasping tasks. Our videos and code can be found at https://sites.google.com/view/latent-grasping .

Collaborative Visual Navigation

Authors:Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, Liwei Wang
Date:2021-07-02 15:48:16

As a fundamental problem for Artificial Intelligence, research on multi-agent systems (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques. However, previous MARL methods largely focused on grid-world-like or game environments; MAS in visually rich environments has remained less explored. To narrow this gap and emphasize the crucial role of perception in MAS, we propose a large-scale 3D dataset, CollaVN, for multi-agent visual navigation (MAVN). In CollaVN, multiple agents are required to cooperatively navigate across photo-realistic environments to reach target locations. Diverse MAVN variants are explored to make our problem more general. Moreover, a memory-augmented communication framework is proposed. Each agent is equipped with a private, external memory to persistently store communication information. This allows agents to make better use of their past communication information, enabling more efficient collaboration and robust long-term planning. In our experiments, several baselines and evaluation metrics are designed. We also empirically verify the efficacy of our proposed MARL approach across different MAVN task settings.

Social Coordination and Altruism in Autonomous Driving

Authors:Behrad Toghi, Rodolfo Valiente, Dorsa Sadigh, Ramtin Pedarsani, Yaser P. Fallah
Date:2021-07-01 03:37:05

Despite the advances in the autonomous driving domain, autonomous vehicles (AVs) are still inefficient and limited in terms of cooperating with each other or coordinating with vehicles operated by humans. A group of autonomous and human-driven vehicles (HVs) which work together to optimize an altruistic social utility -- as opposed to the egoistic individual utility -- can co-exist seamlessly and assure safety and efficiency on the road. Achieving this mission without explicit coordination among agents is challenging, mainly due to the difficulty of predicting the behavior of humans with heterogeneous preferences in mixed-autonomy environments. Formally, we model an AV's maneuver planning in mixed-autonomy traffic as a partially-observable stochastic game and attempt to derive optimal policies that lead to socially-desirable outcomes using a multi-agent reinforcement learning framework. We introduce a quantitative representation of the AVs' social preferences and design a distributed reward structure that induces altruism into their decision making process. Our altruistic AVs are able to form alliances, guide the traffic, and affect the behavior of the HVs to handle competitive driving scenarios. As a case study, we compare egoistic AVs to our altruistic autonomous agents in a highway merging setting and demonstrate the emerging behaviors that lead to a noticeable improvement in the number of successful merges as well as the overall traffic flow and safety.

Survivable Robotic Control through Guided Bayesian Policy Search with Deep Reinforcement Learning

Authors:Sayyed Jaffar Ali Raza, Apan Dastider, Mingjie Lin
Date:2021-06-29 18:03:53

Many robot manipulation skills can be represented with deterministic characteristics and there exist efficient techniques for learning parameterized motor plans for those skills. However, sustaining manipulation capabilities after a mechanical failure remains an active research challenge. Ideally, like biological creatures, a robotic agent should be able to reconfigure its control policy by adapting to dynamic adversaries. In this paper, we propose a method that allows an agent to survive in a situation of mechanical loss, and adaptively learn manipulation with compromised degrees of freedom -- we call our method Survivable Robotic Learning (SRL). Our key idea is to leverage Bayesian policy gradient by encoding knowledge bias in posterior estimation, which in turn reduces the exploration needed in subsequent policy search, improving sample efficiency compared to policy search methods based on random exploration. SRL represents policy priors as a Gaussian process, which allows tractable computation of an approximate posterior (when the true gradient is intractable) by incorporating guided bias as a proxy from prior replays. We evaluate our proposed method against an off-the-shelf model-free learning algorithm (DDPG) on a hexapod robot platform subjected to incrementally emulated failures, and our experiments show that our method substantially improves sample requirements and success ratio in all failure modes. A demonstration video of our experiments can be viewed at: https://sites.google.com/view/survivalrl

Action Set Based Policy Optimization for Safe Power Grid Management

Authors:Bo Zhou, Hongsheng Zeng, Yuecheng Liu, Kejiao Li, Fan Wang, Hao Tian
Date:2021-06-29 09:36:36

Maintaining the stability of the modern power grid is becoming increasingly difficult due to fluctuating power consumption, unstable power supply coming from renewable energies, and unpredictable accidents such as man-made and natural disasters. As the operation on the power grid must consider its impact on future stability, reinforcement learning (RL) has been employed to provide sequential decision-making in power grid management. However, existing methods have not considered the environmental constraints. As a result, the learned policy risks selecting actions that violate the constraints in emergencies, which will escalate the issue of overloaded power lines and lead to large-scale blackouts. In this work, we propose a novel method for this problem, which builds on top of the search-based planning algorithm. At the planning stage, the search space is limited to the action set produced by the policy. The selected action strictly follows the constraints by testing its outcome with the simulation function provided by the system. At the learning stage, to address the problem that gradients cannot be propagated to the policy, we introduce Evolutionary Strategies (ES) with black-box policy optimization to improve the policy directly, maximizing long-run returns. In the NeurIPS 2020 Learning to Run Power Network (L2RPN) competition, our solution safely managed the power grid and ranked first in both tracks.

Habitat 2.0: Training Home Assistants to Rearrange their Habitat

Authors:Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, Dhruv Batra
Date:2021-06-28 05:42:15

We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from 'hand-off problems', and (3) SPA pipelines are more brittle than RL policies.

Continuous Control with Deep Reinforcement Learning for Autonomous Vessels

Authors:Nader Zare, Bruno Brandoli, Mahtab Sarvmaili, Amilcar Soares, Stan Matwin
Date:2021-06-27 03:12:32

Maritime autonomous transportation has played a crucial role in the globalization of the world economy. Deep Reinforcement Learning (DRL) has been applied to automatic path planning to simulate vessel collision avoidance situations in open seas. End-to-end approaches that learn complex mappings directly from the input generalize poorly when reaching targets in different environments. In this work, we present a new strategy called state-action rotation to improve the agent's performance in unseen situations by rotating the obtained experience (state-action-state) and preserving it in the replay buffer. We designed our model based on Deep Deterministic Policy Gradient, local view maker, and planner. Our agent uses two deep Convolutional Neural Networks to estimate the policy and action-value functions. The proposed model was exhaustively trained and tested in maritime scenarios with real maps from cities such as Montreal and Halifax. Experimental results show that the state-action rotation on top of the CVN consistently improves the rate of arrival to a destination (RATD) by up to 11.96% with respect to the Vessel Navigator with Planner and Local View (VNPLV), and achieves superior performance in unseen mappings by up to 30.82%. Our proposed approach exhibits advantages in terms of robustness when tested in a new environment, supporting the idea that generalization can be achieved by using state-action rotation.
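
The idea of state-action rotation can be illustrated with a minimal sketch. It assumes, purely for illustration, that states are 2D positions and actions are 2D displacement vectors; the paper's agent works with image-like local views, so the function names and data layout here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]
Transition = Tuple[Vec, Vec, Vec]  # (state, action, next_state)


def rotate(vec: Vec, theta: float) -> Vec:
    """Rotate a 2D vector counter-clockwise by theta radians."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def rotated_transitions(transition: Transition, angles: Sequence[float]) -> List[Transition]:
    """Create one rotated copy of the transition per angle.

    Each copy could be added to the replay buffer alongside the original.
    """
    state, action, next_state = transition
    return [(rotate(state, t), rotate(action, t), rotate(next_state, t)) for t in angles]


# One observed transition where next_state = state + action.
original = ((1.0, 2.0), (0.5, -1.0), (1.5, 1.0))
augmented = rotated_transitions(original, [math.pi / 2, math.pi, 3 * math.pi / 2])
```

Because rotation is linear, each rotated copy still satisfies the same displacement relation as the original transition, so the augmented data stays consistent under these simplified dynamics.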

Predictive Control Using Learned State Space Models via Rolling Horizon Evolution

Authors:Alvaro Ovalle, Simon M. Lucas
Date:2021-06-25 23:23:42

A large part of the interest in model-based reinforcement learning derives from the potential utility to acquire a forward model capable of strategic long term decision making. Assuming that an agent succeeds in learning a useful predictive model, it still requires a mechanism to harness it to generate and select among competing simulated plans. In this paper, we explore this theme combining evolutionary algorithmic planning techniques with models learned via deep learning and variational inference. We demonstrate the approach with an agent that reliably performs online planning in a set of visual navigation tasks.

Compositional Reinforcement Learning from Logical Specifications

Authors:Kishor Jothimurugan, Suguman Bansal, Osbert Bastani, Rajeev Alur
Date:2021-06-25 22:54:28

We study the problem of learning control policies for complex tasks given by logical specifications. Recent approaches automatically generate a reward function from a given specification and use a suitable reinforcement learning algorithm to learn a policy that maximizes the expected reward. These approaches, however, scale poorly to complex tasks that require high-level planning. In this work, we develop a compositional learning approach, called DiRL, that interleaves high-level planning and reinforcement learning. First, DiRL encodes the specification as an abstract graph; intuitively, vertices and edges of the graph correspond to regions of the state space and simpler sub-tasks, respectively. Our approach then incorporates reinforcement learning to learn neural network policies for each edge (sub-task) within a Dijkstra-style planning algorithm to compute a high-level plan in the graph. An evaluation of the proposed approach on a set of challenging control benchmarks with continuous state and action spaces demonstrates that it outperforms state-of-the-art baselines.
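
The high-level planning step can be sketched as a shortest-path search over the abstract graph. This sketch assumes each edge already carries an estimated success probability for its sub-task policy and uses the negative log probability as the edge cost, so the shortest path maximizes the product of probabilities. DiRL itself learns the edge policies during the search; the graph, names, and numbers below are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, success_prob)]


def most_reliable_plan(graph: Graph, source: str, target: str) -> Tuple[Optional[List[str]], float]:
    """Dijkstra with cost -log(p): returns (path, overall success probability)."""
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    finished = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in finished:
            continue
        finished.add(u)
        if u == target:
            break
        for v, p in graph.get(u, []):
            if p <= 0.0:
                continue  # an edge that never succeeds is unusable
            nd = d - math.log(p)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[target])


abstract_graph: Graph = {
    "s": [("a", 0.9), ("b", 0.5)],
    "a": [("t", 0.5)],
    "b": [("t", 0.99)],
}
plan, prob = most_reliable_plan(abstract_graph, "s", "t")
```

Here the route through `b` wins (0.5 × 0.99 = 0.495) over the route through `a` (0.9 × 0.5 = 0.45), even though its first edge looks worse in isolation.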

Building Intelligent Autonomous Navigation Agents

Authors:Devendra Singh Chaplot
Date:2021-06-25 04:10:58

Breakthroughs in machine learning in the last decade have led to `digital intelligence', i.e. machine learning models capable of learning from vast amounts of labeled data to perform several digital tasks such as speech recognition, face recognition, machine translation and so on. The goal of this thesis is to make progress towards designing algorithms capable of `physical intelligence', i.e. building intelligent autonomous navigation agents capable of learning to perform complex navigation tasks in the physical world involving visual perception, natural language understanding, reasoning, planning, and sequential decision making. Despite several advances in classical navigation methods in the last few decades, current navigation agents struggle at long-term semantic navigation tasks. In the first part of the thesis, we discuss our work on short-term navigation using end-to-end reinforcement learning to tackle challenges such as obstacle avoidance, semantic perception, language grounding, and reasoning. In the second part, we present a new class of navigation methods based on modular learning and structured explicit map representations, which leverage the strengths of both classical and end-to-end learning methods, to tackle long-term navigation tasks. We show that these methods are able to effectively tackle challenges such as localization, mapping, long-term planning, exploration and learning semantic priors. These modular learning methods are capable of long-term spatial and semantic understanding and achieve state-of-the-art results on various navigation tasks.

Hierarchically Integrated Models: Learning to Navigate from Heterogeneous Robots

Authors:Katie Kang, Gregory Kahn, Sergey Levine
Date:2021-06-24 19:07:40

Deep reinforcement learning algorithms require large and diverse datasets in order to learn successful policies for perception-based mobile navigation. However, gathering such datasets with a single robot can be prohibitively expensive. Collecting data with multiple different robotic platforms with possibly different dynamics is a more scalable approach to large-scale data collection. But how can deep reinforcement learning algorithms leverage such heterogeneous datasets? In this work, we propose a deep reinforcement learning algorithm with hierarchically integrated models (HInt). At training time, HInt learns separate perception and dynamics models, and at test time, HInt integrates the two models in a hierarchical manner and plans actions with the integrated model. This method of planning with hierarchically integrated models allows the algorithm to train on datasets gathered by a variety of different platforms, while respecting the physical capabilities of the deployment robot at test time. Our mobile navigation experiments show that HInt outperforms conventional hierarchical policies and single-source approaches.

Model-Based Reinforcement Learning via Latent-Space Collocation

Authors:Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, Sergey Levine
Date:2021-06-24 17:59:18

The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad capabilities. Visual model-based reinforcement learning (RL) methods that plan future actions directly have shown impressive results on tasks that require only short-horizon reasoning; however, these methods struggle on temporally extended tasks. We argue that it is easier to solve long-horizon tasks by planning sequences of states rather than just actions, as the effects of actions greatly compound over time and are harder to optimize. To achieve this, we draw on the idea of collocation, which has shown good results on long-horizon tasks in optimal control literature, and adapt it to the image-based setting by utilizing learned latent state space models. The resulting latent collocation method (LatCo) optimizes trajectories of latent states, which improves over previously proposed shooting methods for visual model-based RL on tasks with sparse rewards and long-term goals. Videos and code at https://orybkin.github.io/latco/.

Lifted Model Checking for Relational MDPs

Authors:Wen-Chi Yang, Jean-François Raskin, Luc De Raedt
Date:2021-06-22 13:12:36

Probabilistic model checking has been developed for verifying systems that have stochastic and nondeterministic behavior. Given a probabilistic system, a probabilistic model checker takes a property and checks whether or not the property holds in that system. For this reason, probabilistic model checking provides rigorous guarantees. So far, however, probabilistic model checking has focused on propositional models where a state is represented by a symbol. On the other hand, it is commonly required to make relational abstractions in planning and reinforcement learning. Various frameworks handle relational domains, for instance, STRIPS planning and relational Markov Decision Processes. Using propositional model checking in relational settings requires one to ground the model, which leads to the well-known state explosion problem and intractability. We present pCTL-REBEL, a lifted model checking approach for verifying pCTL properties of relational MDPs. It extends REBEL, a relational model-based reinforcement learning technique, toward relational pCTL model checking. pCTL-REBEL is lifted, which means that rather than grounding, the model exploits symmetries to reason about a group of objects as a whole at the relational level. Theoretically, we show that pCTL model checking is decidable for relational MDPs that have a possibly infinite domain, provided that the states have a bounded size. Practically, we contribute algorithms and an implementation of lifted relational model checking, and we show that the lifted approach improves the scalability of the model checking approach.

Proper Value Equivalence

Authors:Christopher Grimm, André Barreto, Gregory Farquhar, David Silver, Satinder Singh
Date:2021-06-18 19:05:20

One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that increase in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.
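
The order-one VE definition above can be checked directly on a small tabular example. The sketch below assumes each policy has already been reduced to its induced reward vector and transition matrix; the helper names and the two-state numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Induced = Tuple[List[float], Matrix]  # (reward vector, transition matrix) under one policy


def bellman(r: Sequence[float], P: Matrix, v: Sequence[float], gamma: float) -> List[float]:
    """Apply the policy's Bellman operator: (T v)(s) = r(s) + gamma * sum_t P[s][t] v(t)."""
    n = len(v)
    return [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]


def is_value_equivalent(env: Sequence[Induced], model: Sequence[Induced],
                        functions: Sequence[Sequence[float]],
                        gamma: float, tol: float = 1e-9) -> bool:
    """True if the model's Bellman operators match the environment's
    for every listed policy and every listed function."""
    for (r, P), (rm, Pm) in zip(env, model):
        for v in functions:
            target = bellman(r, P, v, gamma)
            approx = bellman(rm, Pm, v, gamma)
            if any(abs(x - y) > tol for x, y in zip(target, approx)):
                return False
    return True


# One policy, two states: the model gets the rewards right but the dynamics wrong.
env = [([1.0, 0.0], [[0.9, 0.1], [0.2, 0.8]])]
model = [([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]])]
ve_on_constant = is_value_equivalent(env, model, [[1.0, 1.0]], gamma=0.9)
ve_on_indicator = is_value_equivalent(env, model, [[1.0, 0.0]], gamma=0.9)
```

The wrong dynamics are invisible to a constant function, so the model is VE with respect to `{[1, 1]}` but not with respect to `{[1, 0]}`. This illustrates why enlarging the function set shrinks the set of VE models.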

Goal-Directed Planning by Reinforcement Learning and Active Inference

Authors:Dongqi Han, Kenji Doya, Jun Tani
Date:2021-06-18 06:41:01

What is the difference between goal-directed and habitual behavior? We propose a novel computational framework of decision making with Bayesian inference, in which everything is integrated as an entire neural network model. The model learns to predict environmental state transitions by self-exploration and to generate motor actions by sampling stochastic internal states ${z}$. Habitual behavior, which is obtained from the prior distribution of ${z}$, is acquired by reinforcement learning. Goal-directed behavior is determined from the posterior distribution of ${z}$ by planning, using active inference which optimizes the past, current and future ${z}$ by minimizing the variational free energy for the desired future observation constrained by the observed sensory sequence. We demonstrate the effectiveness of the proposed framework by experiments in a sensorimotor navigation task with camera observations and continuous motor actions.

Learning from Demonstration without Demonstrations

Authors:Tom Blau, Gilad Francis, Philippe Morere
Date:2021-06-17 01:57:08

State-of-the-art reinforcement learning (RL) algorithms suffer from high sample complexity, particularly in the sparse reward case. A popular strategy for mitigating this problem is to learn control policies by imitating a set of expert demonstrations. The drawback of such approaches is that an expert needs to produce demonstrations, which may be costly in practice. To address this shortcoming, we propose Probabilistic Planning for Demonstration Discovery (P2D2), a technique for automatically discovering demonstrations without access to an expert. We formulate discovering demonstrations as a search problem and leverage widely-used planning algorithms such as Rapidly-exploring Random Tree to find demonstration trajectories. These demonstrations are used to initialize a policy, then refined by a generic RL algorithm. We provide theoretical guarantees of P2D2 finding successful trajectories, as well as bounds for its sampling complexity. We experimentally demonstrate the method outperforms classic and intrinsic exploration RL techniques in a range of classic control and robotics tasks, requiring only a fraction of exploration samples and achieving better asymptotic performance.

Contrastive Reinforcement Learning of Symbolic Reasoning Domains

Authors:Gabriel Poesia, WenXin Dong, Noah Goodman
Date:2021-06-16 21:46:07

Abstract symbolic reasoning, as required in domains such as mathematics and logic, is a key component of human intelligence. Solvers for these domains have important applications, especially to computer-assisted education. But learning to solve symbolic problems is challenging for machine learning algorithms. Existing models either learn from human solutions or use hand-engineered features, making them expensive to apply in new domains. In this paper, we instead consider symbolic domains as simple environments where states and actions are given as unstructured text, and binary rewards indicate whether a problem is solved. This flexible setup makes it easy to specify new domains, but search and planning become challenging. We introduce four environments inspired by the Mathematics Common Core Curriculum, and observe that existing Reinforcement Learning baselines perform poorly. We then present a novel learning algorithm, Contrastive Policy Learning (ConPoLe) that explicitly optimizes the InfoNCE loss, which lower bounds the mutual information between the current state and next states that continue on a path to the solution. ConPoLe successfully solves all four domains. Moreover, problem representations learned by ConPoLe enable accurate prediction of the categories of problems in a real mathematics curriculum. Our results suggest new directions for reinforcement learning in symbolic domains, as well as applications to mathematics education.
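
For readers unfamiliar with the InfoNCE loss mentioned above, here is a minimal sketch for a single positive example and a list of negatives. The scores are assumed to come from some learned compatibility function between the current state and a candidate next state; how ConPoLe computes and samples them is not shown, and the function name is hypothetical.

```python
import math
from typing import Sequence


def info_nce_loss(positive_score: float, negative_scores: Sequence[float]) -> float:
    """InfoNCE for one example: -log( exp(s+) / (exp(s+) + sum_i exp(s_i-)) ).

    Computed with the log-sum-exp trick for numerical stability.
    """
    scores = [positive_score, *negative_scores]
    m = max(scores)
    log_partition = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_partition - positive_score


# With indistinguishable scores the loss is log(K) for K candidates in total.
uninformed = info_nce_loss(0.0, [0.0, 0.0, 0.0])
# A clearly preferred positive drives the loss towards zero.
confident = info_nce_loss(10.0, [0.0, 0.0, 0.0])
```

The loss is `log(4)` when all four candidates score equally and shrinks as the positive next state is scored above the negatives.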

Robust Reinforcement Learning Under Minimax Regret for Green Security

Authors:Lily Xu, Andrew Perrault, Fei Fang, Haipeng Chen, Milind Tambe
Date:2021-06-15 20:11:12

Green security domains feature defenders who plan patrols in the face of uncertainty about the adversarial behavior of poachers, illegal loggers, and illegal fishers. Importantly, the deterrence effect of patrols on adversaries' future behavior makes patrol planning a sequential decision-making problem. Therefore, we focus on robust sequential patrol planning for green security following the minimax regret criterion, which has not been considered in the literature. We formulate the problem as a game between the defender and nature who controls the parameter values of the adversarial behavior and design an algorithm MIRROR to find a robust policy. MIRROR uses two reinforcement learning-based oracles and solves a restricted game considering limited defender strategies and parameter values. We evaluate MIRROR on real-world poaching data.

On the Power of Multitask Representation Learning in Linear MDP

Authors:Rui Lu, Gao Huang, Simon S. Du
Date:2021-06-15 11:21:06

While multitask representation learning has become a popular approach in reinforcement learning (RL), theoretical understanding of why and when it works remains limited. This paper presents analyses for the statistical benefit of multitask representation learning in linear Markov Decision Process (MDP) under a generative model. In this paper, we consider an agent to learn a representation function $\phi$ out of a function class $\Phi$ from $T$ source tasks with $N$ data per task, and then use the learned $\hat{\phi}$ to reduce the required number of samples for a new task. We first discover a \emph{Least-Activated-Feature-Abundance} (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-square algorithm learns a policy which is $\tilde{O}(H^2\sqrt{\frac{\mathcal{C}(\Phi)^2 \kappa d}{NT}+\frac{\kappa d}{n}})$ sub-optimal. Here $H$ is the planning horizon, $\mathcal{C}(\Phi)$ is $\Phi$'s complexity measure, $d$ is the dimension of the representation (usually $d\ll \mathcal{C}(\Phi)$) and $n$ is the number of samples for the new task. Thus the required $n$ is $O(\kappa d H^4)$ for the sub-optimality to be close to zero, which is much smaller than $O(\mathcal{C}(\Phi)^2\kappa d H^4)$ in the setting without multitask representation learning, whose sub-optimality gap is $\tilde{O}(H^2\sqrt{\frac{\kappa \mathcal{C}(\Phi)^2d}{n}})$. This theoretically explains the power of multitask representation learning in reducing sample complexity. Further, we note that to ensure high sample efficiency, the LAFA criterion $\kappa$ should be small. In fact, $\kappa$ varies widely in magnitude depending on the sampling distribution for the new task. This indicates that an adaptive sampling technique is important to make $\kappa$ depend solely on $d$. Finally, we provide empirical results on a noisy grid-world environment to corroborate our theoretical findings.
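
The step from the sub-optimality bounds to the stated sample requirements can be spelled out as a rough sketch. It assumes a target sub-optimality $\epsilon$ (a symbol introduced here, not in the abstract), ignores the logarithmic factors hidden by $\tilde{O}$, and for the multitask case takes $NT$ large enough that the first term under the square root is negligible:

$$
H^2\sqrt{\frac{\kappa d}{n}} \le \epsilon \iff n \ge \frac{\kappa d H^4}{\epsilon^2},
\qquad
H^2\sqrt{\frac{\kappa\,\mathcal{C}(\Phi)^2 d}{n}} \le \epsilon \iff n \ge \frac{\mathcal{C}(\Phi)^2 \kappa d H^4}{\epsilon^2}.
$$

For a fixed constant $\epsilon$ these give $n = O(\kappa d H^4)$ with multitask representation learning and $n = O(\mathcal{C}(\Phi)^2 \kappa d H^4)$ without it, so the saving is a factor of $\mathcal{C}(\Phi)^2$.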

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Authors:Haque Ishfaq, Qiwen Cui, Viet Nguyen, Alex Ayoub, Zhuoran Yang, Zhaoran Wang, Doina Precup, Lin F. Yang
Date:2021-06-15 02:23:07

We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class $\mathcal{F}$, our algorithm achieves a worst-case regret bound of $\widetilde{O}(\mathrm{poly}(d_EH)\sqrt{T})$ where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the $\textit{eluder dimension}$ of $\mathcal{F}$. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\widetilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
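
The core mechanism, perturbing regression targets with i.i.d. noise and then being optimistic across several perturbed fits, can be sketched in one dimension. This is a simplified illustration, not LSVI-PHE itself: it uses a single scalar feature, perturbs only the targets, and the function names, noise scale `sigma`, and sample count are assumptions.

```python
import random
from typing import Optional, Sequence


def ridge_weight(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """Closed-form 1D ridge regression: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def optimistic_value(xs: Sequence[float], ys: Sequence[float], x_query: float,
                     lam: float = 1.0, sigma: float = 0.1, num_samples: int = 5,
                     rng: Optional[random.Random] = None) -> float:
    """Fit `num_samples` regressions on independently perturbed targets
    and return the largest prediction at `x_query`."""
    rng = rng or random.Random(0)
    predictions = []
    for _ in range(num_samples):
        noisy = [y + rng.gauss(0.0, sigma) for y in ys]
        predictions.append(ridge_weight(xs, noisy, lam) * x_query)
    return max(predictions)


features, targets = [1.0, 2.0], [2.0, 4.0]
# With zero noise every fit is identical and equals the plain regression.
no_noise = optimistic_value(features, targets, 3.0, lam=0.0, sigma=0.0)
one_draw = optimistic_value(features, targets, 3.0, num_samples=1, rng=random.Random(1))
five_draws = optimistic_value(features, targets, 3.0, num_samples=5, rng=random.Random(1))
```

Taking the maximum over more perturbed fits can only raise the estimate, which is the sense in which repeated sampling yields optimism without an explicit UCB-style bonus.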

Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity

Authors:Dhruv Malik, Aldo Pacchiano, Vishwak Srinivasan, Yuanzhi Li
Date:2021-06-15 00:06:59

Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces. By contrast, the majority of theoretical RL literature requires the MDP to satisfy some form of linear structure, in order to guarantee sample efficient RL. Such efforts typically assume the transition dynamics or value function of the MDP are described by linear functions of the state features. To resolve this discrepancy between theory and practice, we introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes no linearity assumptions. We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition. Our algorithm requires minimal assumptions on the policy class, which can include multi-layer neural networks with nonlinear activation functions. Notably, the EPW condition is directly motivated by popular gaming benchmarks, and we show that many classic Atari games satisfy this condition. We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.

Temporal Predictive Coding For Model-Based Planning In Latent Space

Authors:Tung Nguyen, Rui Shu, Tuan Pham, Hung Bui, Stefano Ermon
Date:2021-06-14 04:31:15

High-dimensional observations are a major challenge in the application of model-based reinforcement learning (MBRL) to real-world environments. To handle high-dimensional sensory inputs, existing approaches use representation learning to map high-dimensional observations into a lower-dimensional latent space that is more amenable to dynamics estimation and planning. In this work, we present an information-theoretic approach that employs temporal predictive coding to encode elements in the environment that can be predicted across time. Since this approach focuses on encoding temporally-predictable information, we implicitly prioritize the encoding of task-relevant components over nuisance information within the environment that is provably task-irrelevant. By learning this representation in conjunction with a recurrent state space model, we can then perform planning in latent space. We evaluate our model on a challenging modification of standard DMControl tasks where the background is replaced with natural videos that contain complex but irrelevant information to the planning task. Our experiments show that our model is superior to existing methods in the challenging complex-background setting while remaining competitive with current state-of-the-art models in the standard setting.

Planning for Novelty: Width-Based Algorithms for Common Problems in Control, Planning and Reinforcement Learning

Authors:Nir Lipovetzky
Date:2021-06-09 07:46:19

Width-based algorithms search for solutions through a general definition of state novelty. These algorithms have been shown to result in state-of-the-art performance in classical planning, and have been successfully applied to model-based and model-free settings where the dynamics of the problem are given through simulation engines. The performance of width-based algorithms is understood theoretically through the notion of planning width, providing polynomial guarantees on their runtime and memory consumption. To facilitate synergies across research communities, this paper summarizes the area of width-based planning, and surveys current and future research directions.

Don't Get Yourself into Trouble! Risk-aware Decision-Making for Autonomous Vehicles

Authors:Kasra Mokhtari, Alan R. Wagner
Date:2021-06-08 18:24:02

Risk is traditionally described as the expected likelihood of an undesirable outcome, such as collisions for autonomous vehicles. Accurately predicting risk or potentially risky situations is critical for the safe operation of autonomous vehicles. In our previous work, we showed that risk could be characterized by two components: 1) the probability of an undesirable outcome and 2) an estimate of how undesirable the outcome is (loss). This paper extends our previous work. Using our trained deep reinforcement learning model for navigating around crowds, we developed a risk-based decision-making framework for the autonomous vehicle that integrates high-level risk-based path planning with reinforcement learning-based low-level control. We evaluated our method in the high-fidelity simulator CARLA. This work can improve safety by allowing an autonomous vehicle to one day avoid and react to risky situations.
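The two-component view of risk (probability of an undesirable outcome, and its loss) can be made concrete with a short sketch. This is an illustration only: combining the two components as a probability-weighted sum, and picking the path with the lowest total, are assumptions rather than the authors' framework. The names `Outcome`, `path_risk`, and `lowest_risk_path`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    probability: float  # chance of the undesirable outcome on this path
    loss: float         # how undesirable that outcome is (arbitrary units)


def path_risk(outcomes):
    """Risk of a path as the probability-weighted sum of losses (assumed form)."""
    return sum(o.probability * o.loss for o in outcomes)


def lowest_risk_path(candidate_paths):
    """Return the name of the candidate path with the smallest risk."""
    return min(candidate_paths, key=lambda name: path_risk(candidate_paths[name]))


# Hypothetical candidates for a high-level planner to choose between.
paths = {
    "through_crowd": [Outcome(0.10, 100.0)],
    "detour": [Outcome(0.02, 100.0), Outcome(0.30, 5.0)],
}
best = lowest_risk_path(paths)
```

In a hierarchical setup of the kind the abstract describes, a choice like `best` would be handed to the low-level controller, which here is left out entirely.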

Reconciling Rewards with Predictive State Representations

Authors:Andrea Baisero, Christopher Amato
Date:2021-06-07 19:32:08

Predictive state representations (PSRs) are models of controlled non-Markov observation sequences which exhibit the same generative process governing POMDP observations without relying on an underlying latent state. In that respect, a PSR is indistinguishable from the corresponding POMDP. However, PSRs notoriously ignore the notion of rewards, which undermines the general utility of PSR models for control, planning, or reinforcement learning. Therefore, we describe a sufficient and necessary accuracy condition which determines whether a PSR is able to accurately model POMDP rewards, we show that rewards can be approximated even when the accuracy condition is not satisfied, and we find that a non-trivial number of POMDPs taken from a well-known third-party repository do not satisfy the accuracy condition. We propose reward-predictive state representations (R-PSRs), a generalization of PSRs which accurately models both observations and rewards, and develop value iteration for R-PSRs. We show that there is a mismatch between optimal POMDP policies and the optimal PSR policies derived from approximate rewards. On the other hand, optimal R-PSR policies perfectly match optimal POMDP policies, reconfirming R-PSRs as accurate state-less generative models of observations and rewards.

Verifiable and Compositional Reinforcement Learning Systems

Authors:Cyrus Neary, Christos Verginis, Murat Cubuktepe, Ufuk Topcu
Date:2021-06-07 17:05:14

We propose a framework for verifiable and compositional reinforcement learning (RL) in which a collection of RL subsystems, each of which learns to accomplish a separate subtask, are composed to achieve an overall task. The framework consists of a high-level model, represented as a parametric Markov decision process (pMDP) which is used to plan and to analyze compositions of subsystems, and of the collection of low-level subsystems themselves. By defining interfaces between the subsystems, the framework enables automatic decompositions of task specifications, e.g., reach a target set of states with a probability of at least 0.95, into individual subtask specifications, i.e. achieve the subsystem's exit conditions with at least some minimum probability, given that its entry conditions are met. This in turn allows for the independent training and testing of the subsystems; if they each learn a policy satisfying the appropriate subtask specification, then their composition is guaranteed to satisfy the overall task specification. Conversely, if the subtask specifications cannot all be satisfied by the learned policies, we present a method, formulated as the problem of finding an optimal set of parameters in the pMDP, to automatically update the subtask specifications to account for the observed shortcomings. The result is an iterative procedure for defining subtask specifications, and for training the subsystems to meet them. As an additional benefit, this procedure allows for particularly challenging or important components of an overall task to be determined automatically, and focused on, during training. Experimental results demonstrate the presented framework's novel capabilities.
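The compositional guarantee can be illustrated in the simplest possible case. The sketch below assumes the subsystems run strictly one after another and that each subtask's success probability is a lower bound that holds whenever its entry conditions are met, so the product of the bounds is a lower bound for the whole chain. The paper's high-level model is a parametric MDP, which is more general than this chain; the function names are hypothetical.

```python
from math import prod


def composition_lower_bound(subtask_success_probs):
    """Lower bound on overall success for subsystems executed in sequence."""
    return prod(subtask_success_probs)


def meets_specification(subtask_success_probs, required=0.95):
    """Check whether the chained subsystems satisfy the overall task specification."""
    return composition_lower_bound(subtask_success_probs) >= required
```

For example, three subsystems that each succeed with probability at least 0.99, 0.98, and 0.99 give a bound of about 0.96, which meets a 0.95 requirement, while two subsystems at 0.97 each give about 0.94 and would call for updated subtask specifications.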

Hierarchical Robot Navigation in Novel Environments using Rough 2-D Maps

Authors:Chengguang Xu, Christopher Amato, Lawson L. S. Wong
Date:2021-06-07 14:42:51

In robot navigation, generalizing quickly to unseen environments is essential. Hierarchical methods inspired by human navigation have been proposed, typically consisting of a high-level landmark proposer and a low-level controller. However, these methods either require precise high-level information to be given in advance or need to construct such guidance from extensive interaction with the environment. In this work, we propose an approach that leverages a rough 2-D map of the environment to navigate in novel environments without requiring further learning. In particular, we introduce a dynamic topological map that can be initialized from the rough 2-D map along with a high-level planning approach for proposing reachable 2-D map patches of the intermediate landmarks between the start and goal locations. To use proposed 2-D patches, we train a deep generative model to generate intermediate landmarks in observation space which are used as subgoals by low-level goal-conditioned reinforcement learning. Importantly, because the low-level controller is only trained with local behaviors (e.g. go across the intersection, turn left at a corner) on existing environments, this framework allows us to generalize to novel environments given only a rough 2-D map, without requiring further learning. Experimental results demonstrate the effectiveness of the proposed framework in both seen and novel environments.

UAV Swarm Path Planning with Reinforcement Learning for Field prospecting

Authors:Alejandro Puente-Castro, Daniel Rivero, Alejandro Pazos, Enrique Fernandez-Blanco
Date:2021-06-04 08:04:14

The adoption of Unmanned Aerial Vehicle (UAV) swarms is growing steadily among operators because of the time and cost benefits their use brings. However, this kind of system faces an important problem: the calculation of many optimal paths for each UAV. Solving this problem would allow many UAVs to be controlled at the same time without human intervention, while saving battery between recharges and performing several tasks simultaneously. The main aim is to develop a system capable of calculating the optimal flight path for a UAV swarm. These paths should achieve full coverage of a flight area for tasks such as field prospection, regardless of the size of the maps and the number of UAVs in the swarm. It is not necessary to establish targets or any prior knowledge other than the given map. Experiments have been conducted to determine whether it is optimal to establish a single control for all UAVs in the swarm or a control for each UAV. The results show that it is better to use one control for all UAVs because of the shorter flight time. In addition, the flight time is greatly affected by the size of the map. The results give starting points for future research such as finding the optimal map size for each situation.

A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

Authors:Mingde Zhao, Zhen Liu, Sitao Luan, Shuyuan Zhang, Doina Precup, Yoshua Bengio
Date:2021-06-03 19:35:19

We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state during planning. The agent uses a bottleneck mechanism over a set-based representation to force the number of entities to which the agent attends at each planning step to be small. In experiments, we investigate the bottleneck mechanism with several sets of customized environments featuring different challenges. We consistently observe that the design allows the planning agents to generalize their learned task-solving abilities in compatible unseen environments by attending to the relevant objects, leading to better out-of-distribution generalization performance.

Towards Learning to Play Piano with Dexterous Hands and Touch

Authors:Huazhe Xu, Yuping Luo, Shaoxiong Wang, Trevor Darrell, Roberto Calandra
Date:2021-06-03 17:59:31

The virtuoso plays the piano with passion, poetry and extraordinary technical ability. As Liszt said, "(a virtuoso) must call up scent and blossom, and breathe the breath of life." The strongest robots that can play a piano are based on a combination of specialized robot hands/piano and hardcoded planning algorithms. In contrast, in this paper we demonstrate how an agent can learn directly from a machine-readable music score to play the piano with dexterous hands on a simulated piano using reinforcement learning (RL) from scratch. We demonstrate that the RL agents can not only find the correct key position but also deal with various rhythmic, volume, and fingering requirements. We achieve this by using a touch-augmented reward and a novel curriculum of tasks. We conclude by carefully studying the aspects that are important for enabling such learning algorithms and that can potentially shed light on future research in this direction.

Offline Reinforcement Learning as One Big Sequence Modeling Problem

Authors:Michael Janner, Qiyang Li, Sergey Levine
Date:2021-06-03 17:58:51

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
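Repurposing beam search as a planner can be sketched without any neural network. In the code below, a hand-written function `toy_model` stands in for the learned trajectory model (the paper uses a Transformer; this stand-in is purely an assumption for illustration), and beams are ranked by cumulative predicted reward rather than by sequence likelihood. All names are hypothetical.

```python
def beam_search_plan(start_state, actions, model, horizon, beam_width):
    """Keep the beam_width partial plans with the highest predicted return."""
    # Each beam entry is (total_reward, current_state, actions_so_far).
    beams = [(0.0, start_state, [])]
    for _ in range(horizon):
        candidates = []
        for total, state, plan in beams:
            for action in actions:
                next_state, reward = model(state, action)
                candidates.append((total + reward, next_state, plan + [action]))
        # Rank by cumulative reward and prune to the beam width.
        candidates.sort(key=lambda entry: entry[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_model(state, action):
    """Stand-in dynamics and reward: move along a line toward position 3."""
    next_state = state + action
    return next_state, -abs(next_state - 3)


total_reward, final_state, plan = beam_search_plan(
    start_state=0, actions=(-1, 0, 1), model=toy_model, horizon=4, beam_width=3
)
```

On this toy problem the best beam walks three steps to the right and then stays put. A narrow beam keeps the search cheap but can discard plans whose reward only arrives late, which is one reason planners of this kind are often paired with a value estimate.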

Least-Restrictive Multi-Agent Collision Avoidance via Deep Meta Reinforcement Learning and Optimal Control

Authors:Salar Asayesh, Mo Chen, Mehran Mehrandezh, Kamal Gupta
Date:2021-06-02 04:49:33

Multi-agent collision-free trajectory planning and control subject to different goal requirements and system dynamics has been extensively studied, and is gaining recent attention in the realm of machine and reinforcement learning. However, in particular when using a large number of agents, constructing a least-restrictive collision avoidance policy is of utmost importance for both classical and learning-based methods. In this paper, we propose a Least-Restrictive Collision Avoidance Module (LR-CAM) that evaluates the safety of multi-agent systems and takes over control only when needed to prevent collisions. The LR-CAM is a single policy that can be wrapped around policies of all agents in a multi-agent system. It allows each agent to pursue any objective as long as it is safe to do so. The benefit of the proposed least-restrictive policy is to only interrupt and overrule the default controller in case of an upcoming inevitable danger. We use a Long Short-Term Memory (LSTM) based Variational Auto-Encoder (VAE) to enable the LR-CAM to account for a varying number of agents in the environment. Moreover, we propose an off-policy meta-reinforcement learning framework with a novel reward function based on a Hamilton-Jacobi value function to train the LR-CAM. The proposed method is fully meta-trained through a ROS-based simulation and tested on a real multi-agent system. Our results show that LR-CAM outperforms the classical least-restrictive baseline by 30 percent. In addition, we show that even if a subset of agents in a multi-agent system use LR-CAM, the success rate of all agents will increase significantly.

Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition

Authors:Shining Liang, Ming Gong, Jian Pei, Linjun Shou, Wanli Zuo, Xianglin Zuo, Daxin Jiang
Date:2021-06-01 05:46:22

Named entity recognition (NER) is a fundamental component in many applications, such as Web Search and Voice Assistants. Although deep neural networks greatly improve the performance of NER, due to the requirement of large amounts of training data, deep neural networks can hardly scale out to many languages in an industry setting. To tackle this challenge, cross-lingual NER transfers knowledge from a rich-resource language to languages with low resources through pre-trained multilingual language models. Instead of using training data in target languages, cross-lingual NER has to rely on only training data in source languages, and optionally adds the translated training data derived from source languages. However, the existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages, which is relatively easy to collect in industry applications. To address the opportunities and challenges, in this paper we describe our novel practice in Microsoft to leverage such large amounts of unlabeled data in target languages in real production settings. To effectively extract weak supervision signals from the unlabeled data, we develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning. The empirical study on three benchmark data sets verifies that our approach establishes new state-of-the-art performance by a clear margin. Now, the NER techniques reported in this paper are on their way to becoming a fundamental component for Web ranking, Entity Pane, Answers Triggering, and Question Answering in the Microsoft Bing search engine. Moreover, our techniques will also serve as part of the Spoken Language Understanding module for a commercial voice assistant. We plan to open source the code of the prototype framework after deployment.

Procedural Content Generation: Better Benchmarks for Transfer Reinforcement Learning

Authors:Matthias Müller-Brockhausen, Mike Preuss, Aske Plaat
Date:2021-05-31 08:21:03

The idea of transfer in reinforcement learning (TRL) is intriguing: being able to transfer knowledge from one problem to another problem without learning everything from scratch. This promises quicker learning and learning more complex methods. To gain an insight into the field and to detect emerging trends, we performed a database search. We note a surprisingly late adoption of deep learning that starts in 2018. The introduction of deep learning has not yet solved the greatest challenge of TRL: generalization. Transfer between different domains works well when domains have strong similarities (e.g. MountainCar to Cartpole), and most TRL publications focus on different tasks within the same domain that have few differences. Most TRL applications we encountered compare their improvements against self-defined baselines, and the field is still missing unified benchmarks. We consider this to be a disappointing situation. For the future, we note that: (1) A clear measure of task similarity is needed. (2) Generalization needs to improve. Promising approaches merge deep learning with planning via MCTS or introduce memory through LSTMs. (3) The lack of benchmarking tools will be remedied to enable meaningful comparison and measure progress. Already Alchemy and Meta-World are emerging as interesting benchmark suites. We note that another development, the increase in procedural content generation (PCG), can improve both benchmarking and generalization in TRL.

A Survey of Deep Reinforcement Learning Algorithms for Motion Planning and Control of Autonomous Vehicles

Authors:Fei Ye, Shen Zhang, Pin Wang, Ching-Yao Chan
Date:2021-05-29 05:27:07

In this survey, we systematically summarize the current literature on studies that apply reinforcement learning (RL) to the motion planning and control of autonomous vehicles. Many existing contributions can be attributed to the pipeline approach, which consists of many hand-crafted modules, each with a functionality selected for the ease of human interpretation. However, this approach does not automatically guarantee maximal performance due to the lack of a system-level optimization. Therefore, this paper also presents a growing trend of work that falls into the end-to-end approach, which typically offers better performance and smaller system scales. However, their performance also suffers from the lack of expert data and generalization issues. Finally, the remaining challenges in applying deep RL algorithms to autonomous driving are summarized, and future research directions are also presented to tackle these challenges.

An Offline Risk-aware Policy Selection Method for Bayesian Markov Decision Processes

Authors:Giorgio Angelotti, Nicolas Drougard, Caroline Ponzoni Carvalho Chanel
Date:2021-05-27 20:12:20

In Offline Model Learning for Planning and in Offline Reinforcement Learning, the limited data set hinders the estimation of the value function of the corresponding Markov Decision Process (MDP). Consequently, the performance of the obtained policy in the real world is bounded and possibly risky, especially when the deployment of a wrong policy can lead to catastrophic consequences. For this reason, several pathways are being followed with the aim of reducing the model error (or the distributional shift between the learned model and the true one) and, more broadly, obtaining risk-aware solutions with respect to model uncertainty. But when it comes to the final application, which baseline should a practitioner choose? In an offline context where computational time is not an issue and robustness is the priority, we propose Exploitation vs Caution (EvC), a paradigm that (1) elegantly incorporates model uncertainty abiding by the Bayesian formalism, and (2) selects the policy that maximizes a risk-aware objective over the Bayesian posterior among a fixed set of candidate policies provided, for instance, by the current baselines. We validate EvC with state-of-the-art approaches in different discrete, yet simple, environments offering a fair variety of MDP classes. In the tested scenarios EvC manages to select robust policies and hence stands out as a useful tool for practitioners that aim to apply offline planning and reinforcement learning solvers in the real world.
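The selection step (pick, from a fixed set of candidates, the policy that maximizes a risk-aware objective over the posterior) can be sketched as follows. The abstract does not name the objective, so the use of conditional value-at-risk (CVaR) here is an assumption, as is the input format: for each candidate policy, a list of returns evaluated on MDPs sampled from the posterior. The names `cvar` and `select_policy` are hypothetical.

```python
import math


def cvar(returns, alpha):
    """Mean of the worst alpha-fraction of returns (at least one sample)."""
    ordered = sorted(returns)
    tail_size = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


def select_policy(returns_by_policy, alpha=0.25):
    """Choose the candidate policy with the highest CVaR over posterior samples."""
    return max(returns_by_policy, key=lambda name: cvar(returns_by_policy[name], alpha))


# Hypothetical returns of two candidate policies on four sampled models.
posterior_returns = {
    "bold": [10.0, 12.0, -2.0, 12.0],
    "cautious": [5.0, 6.0, 4.0, 5.0],
}
```

With `alpha=1.0` the objective is the plain posterior mean and `"bold"` wins; with `alpha=0.25` only the worst sampled model counts and `"cautious"` is selected, which is the exploitation-versus-caution trade-off in miniature.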

From Motor Control to Team Play in Simulated Humanoid Football

Authors:Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, S. M. Ali Eslami, Daniel Hennes, Wojciech M. Czarnecki, Yuval Tassa, Shayegan Omidshafiei, Abbas Abdolmaleki, Noah Y. Siegel, Leonard Hasenclever, Luke Marris, Saran Tunyasuvunakool, H. Francis Song, Markus Wulfmeier, Paul Muller, Tuomas Haarnoja, Brendan D. Tracey, Karl Tuyls, Thore Graepel, Nicolas Heess
Date:2021-05-25 20:17:10

Intelligent behaviour in the physical world exhibits structure at multiple spatial and temporal scales. Although movements are ultimately executed at the level of instantaneous muscle tensions or joint torques, they must be selected to serve goals defined on much longer timescales, and in terms of relations that extend far beyond the body itself, ultimately involving coordination with other agents. Recent research in artificial intelligence has shown the promise of learning-based approaches to the respective problems of complex movement, longer-term planning and multi-agent coordination. However, there is limited research aimed at their integration. We study this problem by training teams of physically simulated humanoid avatars to play football in a realistic virtual environment. We develop a method that combines imitation learning, single- and multi-agent reinforcement learning and population-based training, and makes use of transferable representations of behaviour for decision making at different levels of abstraction. In a sequence of stages, players first learn to control a fully articulated body to perform realistic, human-like movements such as running and turning; they then acquire mid-level football skills such as dribbling and shooting; finally, they develop awareness of others and play as a team, bridging the gap between low-level motor control at a timescale of milliseconds, and coordinated goal-directed behaviour as a team at the timescale of tens of seconds. We investigate the emergence of behaviours at different levels of abstraction, as well as the representations that underlie these behaviours using several analysis techniques, including statistics from real-world sports analytics. Our work constitutes a complete demonstration of integrated decision-making at multiple scales in a physically embodied multi-agent setting. See project video at https://youtu.be/KHMwq9pv7mg.

Transfer Learning and Curriculum Learning in Sokoban

Authors:Zhao Yang, Mike Preuss, Aske Plaat
Date:2021-05-25 07:01:32

Transfer learning can speed up training in machine learning and is regularly used in classification tasks. It reuses prior knowledge from other tasks to pre-train networks for new tasks. In reinforcement learning, learning actions for a behavior policy that can be applied to new environments is still a challenge, especially for tasks that involve much planning. Sokoban is a challenging puzzle game. It has been used widely as a benchmark in planning-based reinforcement learning. In this paper, we show how prior knowledge improves learning in Sokoban tasks. We find that reusing feature representations learned previously can accelerate learning new, more complex, instances. In effect, we show how curriculum learning, from simple to complex tasks, works in Sokoban. Furthermore, feature representations learned in simpler instances are more general, and thus lead to positive transfers towards more complex tasks, but not vice versa. We have also studied which part of the knowledge is most important for transfer to succeed, and identify which layers should be used for pre-training.

A Reinforcement Learning based Path Planning Approach in 3D Environment

Authors:Geesara Kulathunga
Date:2021-05-21 13:33:14

Optimal motion planning involves obstacle avoidance, and path planning is the key to its success. Due to their computational demands, most path planning algorithms cannot be employed in real-time applications. Model-based reinforcement learning approaches for path planning have had some success in the recent past. Yet, most such approaches do not produce deterministic output because of their randomness. We analyzed several types of reinforcement learning-based approaches for path planning. One of them is a deterministic tree-based approach, and the other two are based on Q-learning and approximate policy gradient, respectively. We tested these approaches on two different simulators, each of which consists of a set of random obstacles that can be changed or moved dynamically. After analysing the results and computation time, we concluded that the deterministic tree search approach provides highly stable results. However, its computational time is considerably higher than that of the other two approaches. Finally, comparative results in terms of accuracy and computational time are provided as evidence.

Objective-aware Traffic Simulation via Inverse Reinforcement Learning

Authors:Guanjie Zheng, Hanyang Liu, Kai Xu, Zhenhui Li
Date:2021-05-20 07:26:34

Traffic simulators act as an essential component in the operating and planning of transportation systems. Conventional traffic simulators usually employ a calibrated physical car-following model to describe vehicles' behaviors and their interactions with the traffic environment. However, there is no universal physical model that can accurately predict the pattern of vehicles' behaviors in different situations. A fixed physical model tends to be less effective in a complicated environment given the non-stationary nature of traffic dynamics. In this paper, we formulate traffic simulation as an inverse reinforcement learning problem, and propose a parameter-sharing adversarial inverse reinforcement learning model for dynamics-robust simulation learning. Our proposed model is able to imitate a vehicle's trajectories in the real world while simultaneously recovering the reward function that reveals the vehicle's true objective, which is invariant to different dynamics. Extensive experiments on synthetic and real-world datasets show the superior performance of our approach compared to state-of-the-art methods and its robustness to varying traffic dynamics.

Robo-Advising: Enhancing Investment with Inverse Optimization and Deep Reinforcement Learning

Authors:Haoran Wang, Shi Yu
Date:2021-05-19 17:20:03

Machine Learning (ML) has been embraced as a powerful tool by the financial industry, with notable applications spreading in various domains including investment management. In this work, we propose a full-cycle data-driven investment robo-advising framework, consisting of two ML agents. The first agent, an inverse portfolio optimization agent, infers an investor's risk preference and expected return directly from historical allocation data using online inverse optimization. The second agent, a deep reinforcement learning (RL) agent, aggregates the inferred sequence of expected returns to formulate a new multi-period mean-variance portfolio optimization problem that can be solved using deep RL approaches. The proposed investment pipeline is applied to real market data from April 1, 2016 to February 1, 2021 and is shown to consistently outperform the S&P 500 benchmark portfolio that represents the aggregate market optimal allocation. The outperformance may be attributed to the multi-period planning (versus single-period planning) and the data-driven RL approach (versus classical estimation approach).

Adaptive ABAC Policy Learning: A Reinforcement Learning Approach

Authors:Leila Karimi, Mai Abdelhakim, James Joshi
Date:2021-05-18 15:18:02

With rapid advances in computing systems, there is an increasing demand for more effective and efficient access control (AC) approaches. Recently, Attribute Based Access Control (ABAC) approaches have been shown to be promising in fulfilling the AC needs of such emerging complex computing environments. An ABAC model grants access to a requester based on attributes of entities in a system and an authorization policy; however, its generality and flexibility come with a higher cost. Further, increasing complexities of organizational systems and the need for federated accesses to their resources make the task of AC enforcement and management much more challenging. In this paper, we propose an adaptive ABAC policy learning approach to automate the authorization management task. We model ABAC policy learning as a reinforcement learning problem. In particular, we propose a contextual bandit system, in which an authorization engine adapts an ABAC model through a feedback control loop; it relies on interacting with users/administrators of the system to receive their feedback that assists the model in making authorization decisions. We propose four methods for initializing the learning model and a planning approach based on attribute value hierarchy to accelerate the learning process. We focus on developing an adaptive ABAC policy learning model for a home IoT environment as a running example. We evaluate our proposed approach over real and synthetic data. We consider both complete and sparse datasets in our evaluations. Our experimental results show that the proposed approach achieves performance that is comparable to ones based on supervised learning in many scenarios and even outperforms them in several situations.

Online Multimodal Transportation Planning using Deep Reinforcement Learning

Authors:Amirreza Farahani, Laura Genga, Remco Dijkman
Date:2021-05-18 09:01:44

In this paper we propose a Deep Reinforcement Learning approach to solve a multimodal transportation planning problem, in which containers must be assigned to a truck or to trains that will transport them to their destination. While traditional planning methods work "offline" (i.e., they take decisions for a batch of containers before the transportation starts), the proposed approach is "online", in that it can take decisions for individual containers, while transportation is being executed. Planning transportation online helps to effectively respond to unforeseen events that may affect the original transportation plan, thus supporting companies in lowering transportation costs. We implemented different container selection heuristics within the proposed Deep Reinforcement Learning algorithm and we evaluated its performance for each heuristic using data that simulate a realistic scenario, designed on the basis of a real case study at a logistics company. The experimental results revealed that the proposed method was able to learn effective patterns of container assignment. It outperformed tested competitors in terms of total transportation costs and utilization of train capacity by 20.48% to 55.32% for the cost and by 7.51% to 20.54% for the capacity. Furthermore, it obtained results within 2.7% for the cost and 0.72% for the capacity of the optimal solution generated by an Integer Linear Programming solver in an offline setting.

Model-Based Offline Planning with Trajectory Pruning

Authors:Xianyuan Zhan, Xiangyu Zhu, Haoran Xu
Date:2021-05-16 05:00:54

Recent offline reinforcement learning (RL) studies have made much progress toward making RL usable in real-world systems by learning policies from pre-collected datasets without environment interaction. Unfortunately, existing offline RL methods still face many practical challenges in real-world system control tasks, such as computational restrictions during agent training and the requirement of extra control flexibility. The model-based planning framework provides an attractive alternative. However, most model-based planning algorithms are not designed for offline settings. Simply combining the ingredients of offline RL with existing methods either provides over-restrictive planning or leads to inferior performance. We propose a new lightweight model-based offline planning framework, namely MOPP, which tackles the dilemma between the restrictions of offline learning and high-performance planning. MOPP encourages more aggressive trajectory rollout guided by the behavior policy learned from data, and prunes out problematic trajectories to avoid potential out-of-distribution samples. Experimental results show that MOPP provides competitive performance compared with existing model-based offline planning and RL approaches.

Acting upon Imagination: when to trust imagined trajectories in model based reinforcement learning

Authors:Adrian Remonda, Eduardo Veas, Granit Luzhnica
Date:2021-05-12 15:04:07

Model-based reinforcement learning (MBRL) aims to learn model(s) of the environment dynamics that can predict the outcome of its actions. Forward application of the model yields so-called imagined trajectories (sequences of action, predicted state-reward) used to optimize the set of candidate actions that maximize expected reward. The outcome, an ideal imagined trajectory or plan, is imperfect, and typically MBRL relies on model predictive control (MPC) to overcome this by continuously re-planning from scratch, thus incurring major computational cost and increasing complexity in tasks with a longer receding horizon. We propose uncertainty estimation methods for online evaluation of imagined trajectories to assess whether further planned actions can be trusted to deliver acceptable reward. These methods include comparing the error after performing the last action with the standard expected error and using model uncertainty to assess the deviation from expected outcomes. Additionally, we introduce methods that exploit the forward propagation of the dynamics model to evaluate if the remainder of the plan aligns with expected results and assess the remainder of the plan in terms of the expected reward. Our experiments demonstrate the effectiveness of the proposed uncertainty estimation methods by applying them to avoid unnecessary trajectory replanning in a shooting MBRL setting. Results highlight a significant reduction in computational costs without sacrificing performance.
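
One of the ideas above, comparing the error after the last action with an expected error to decide whether the rest of the plan can still be trusted, can be sketched as follows. This is a minimal illustration rather than the authors' method: the function name execute_with_trust, the scalar state, and the tolerance multiplier are assumptions introduced here.

```python
def execute_with_trust(plan, predict, step, state, expected_error, tolerance=2.0):
    """Follow an imagined plan until the one-step model error looks untrustworthy.

    plan           -- list of actions chosen by the planner
    predict        -- learned model: (state, action) -> predicted next state
    step           -- real environment: (state, action) -> actual next state
    expected_error -- typical one-step model error (hypothetical calibration value)
    tolerance      -- assumed multiplier; larger values replan less often

    Returns (final_state, actions_executed, replan_needed).
    """
    for i, action in enumerate(plan):
        predicted = predict(state, action)
        state = step(state, action)
        # Replan only when reality deviates more than expected from the model.
        if abs(state - predicted) > tolerance * expected_error:
            return state, i + 1, True
    return state, len(plan), False
```

With an accurate model the whole plan is executed without replanning. With a disturbance the loop stops early and signals that a new plan is needed, which is where the savings over replanning at every step would come from.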

PEARL: Parallelized Expert-Assisted Reinforcement Learning for Scene Rearrangement Planning

Authors:Hanqing Wang, Zan Wang, Wei Liang, Lap-Fai Yu
Date:2021-05-10 03:27:16

Scene Rearrangement Planning (SRP) is an interior task proposed recently. The previous work defines the action space of this task with handcrafted coarse-grained actions that are inflexible to be used for transforming scene arrangement and intractable to be deployed in practice. Additionally, this new task lacks realistic indoor scene rearrangement data to feed popular data-hungry learning approaches and meet the needs of quantitative evaluation. To address these problems, we propose a fine-grained action definition for SRP and introduce a large-scale scene rearrangement dataset. We also propose a novel learning paradigm to efficiently train an agent through self-playing, without any prior knowledge. The agent trained via our paradigm achieves superior performance on the introduced dataset compared to the baseline agents. We provide a detailed analysis of the design of our approach in our experiments.

Scalable, Decentralized Multi-Agent Reinforcement Learning Methods Inspired by Stigmergy and Ant Colonies

Authors:Austin Anhkhoi Nguyen
Date:2021-05-08 01:04:51

Bolstering multi-agent learning algorithms to tackle complex coordination and control tasks has been a long-standing challenge of on-going research. Numerous methods have been proposed to help reduce the effects of non-stationarity and unscalability. In this work, we investigate a novel approach to decentralized multi-agent learning and planning that attempts to address these two challenges. In particular, this method is inspired by the cohesion, coordination, and behavior of ant colonies. As a result, these algorithms are designed to be naturally scalable to systems with numerous agents. While no optimality is guaranteed, the method is intended to work well in practice and scale better in efficacy with the number of agents present than others. The approach combines single-agent RL and an ant-colony-inspired decentralized, stigmergic algorithm for multi-agent path planning and environment modification. Specifically, we apply this algorithm in a setting where agents must navigate to a goal location, learning to push rectangular boxes into holes to yield new traversable pathways. It is shown that while the approach yields promising success in this particular environment, it may not be as easily generalized to others. The algorithm designed is notably scalable to numerous agents but is limited in its performance due to its relatively simplistic, rule-based approach. Furthermore, the composability of RL-trained policies is called into question, where, while policies are successful in their training environments, applying trained policies to a larger-scale, multi-agent framework results in unpredictable behavior.

Solving Sokoban with forward-backward reinforcement learning

Authors:Yaron Shoham, Gal Elidan
Date:2021-05-05 07:37:57

Despite seminal advances in reinforcement learning in recent years, many domains where the rewards are sparse, e.g. given only at task completion, remain quite challenging. In such cases, it can be beneficial to tackle the task both from its beginning and end, and make the two ends meet. Existing approaches that do so, however, are not effective in the common scenario where the strategy needed near the end goal is very different from the one that is effective earlier on. In this work we propose a novel RL approach for such settings. In short, we first train a backward-looking agent with a simple relaxed goal, and then augment the state representation of the forward-looking agent with straightforward hint features. This allows the learned forward agent to leverage information from backward plans, without mimicking their policy. We demonstrate the efficacy of our approach on the challenging game of Sokoban, where we substantially surpass learned solvers that generalize across levels, and are competitive with SOTA performance of the best highly-crafted systems. Impressively, we achieve these results while learning from a small number of practice levels and using simple RL techniques.

Data-Efficient Reinforcement Learning for Malaria Control

Authors:Lixin Zou, Long Xia, Linfang Hou, Xiangyu Zhao, Dawei Yin
Date:2021-05-04 16:54:16

Sequential decision-making under cost-sensitive tasks is prohibitively daunting, especially for problems that have a significant impact on people's daily lives, such as malaria control and treatment recommendation. The main challenge faced by policymakers is to learn a policy from scratch by interacting with a complex environment in a few trials. This work introduces a practical, data-efficient policy learning method, named Variance-Bonus Monte Carlo Tree Search (VB-MCTS), which can cope with very little data and facilitate learning from scratch in only a few trials. Specifically, the solution is a model-based reinforcement learning method. To avoid model bias, we apply Gaussian Process (GP) regression to estimate the transitions explicitly. With the GP world model, we propose a variance-bonus reward to measure the uncertainty about the world. Adding the reward to the planning with MCTS can result in more efficient and effective exploration. Furthermore, the derived polynomial sample complexity indicates that VB-MCTS is sample efficient. Finally, outstanding performance on a competitive world-level RL competition and extensive experimental results verify its advantage over the state-of-the-art on the challenging malaria control task.
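
The idea of a variance bonus can be illustrated in a few lines. The exact form of the bonus used in VB-MCTS is not given in the abstract, so the additive form below (mean plus beta times the predictive standard deviation), the weight beta, and both function names are assumptions made for this sketch. The predictive mean and variance would come from a world model such as a GP, which is not implemented here.

```python
import math


def variance_bonus_reward(mean_reward, variance, beta=1.0):
    """Predicted reward plus an exploration bonus that grows with model uncertainty."""
    return mean_reward + beta * math.sqrt(max(variance, 0.0))


def pick_action(predictions, beta=1.0):
    """Choose the action with the best bonus-augmented reward.

    predictions -- dict mapping action -> (predictive mean, predictive variance)
    """
    return max(
        predictions,
        key=lambda a: variance_bonus_reward(*predictions[a], beta),
    )
```

With beta set to 0 the choice is purely greedy on the predicted mean. A positive beta shifts the choice toward actions the model is uncertain about, which is the exploration effect the abstract attributes to the bonus.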

Emotional Contagion-Aware Deep Reinforcement Learning for Antagonistic Crowd Simulation

Authors:Pei Lv, Qingqing Yu, Boya Xu, Chaochao Li, Bing Zhou, Mingliang Xu
Date:2021-04-29 01:18:13

The antagonistic behavior in the crowd usually exacerbates the seriousness of the situation in sudden riots, where the antagonistic emotional contagion and behavioral decision making play very important roles. However, the complex mechanism of antagonistic emotion influencing decision making, especially in the environment of sudden confrontation, has not yet been explored very clearly. In this paper, we propose an Emotional contagion-aware Deep reinforcement learning model for Antagonistic Crowd Simulation (ACSED). Firstly, we build a group emotional contagion module based on the improved Susceptible Infected Susceptible (SIS) infection disease model, and estimate the emotional state of the group at each time step during the simulation. Then, the tendency of crowd antagonistic action is estimated based on Deep Q Network (DQN), where the agent learns the action autonomously, and leverages the mean field theory to quickly calculate the influence of other surrounding individuals on the central one. Finally, the rationality of the predicted actions by DQN is further analyzed in combination with group emotion, and the final action of the agent is determined. The proposed method in this paper is verified through several experiments with different settings. The results prove that the antagonistic emotion has a vital impact on the group combat, and positive emotional states are more conducive to combat. Moreover, by comparing the simulation results with real scenes, the feasibility of our method is further confirmed, which can provide good reference to formulate battle plans and improve the win rate of righteous groups in a variety of situations.
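
For readers unfamiliar with the SIS model mentioned above, the standard (unmodified) form can be sketched as a discrete-time update on the infected fraction of a group. The paper uses an improved variant whose details are not in the abstract, so this sketch shows only the textbook dynamics. The function names, the Euler step, and the parameter names beta (contagion rate) and gamma (recovery rate) are assumptions for illustration.

```python
def sis_step(infected, beta, gamma, dt=1.0):
    """One Euler step of the standard SIS model on a normalized population.

    infected -- fraction of the group currently "infected" (0..1)
    beta     -- contagion rate, gamma -- recovery rate
    """
    susceptible = 1.0 - infected
    delta = beta * susceptible * infected - gamma * infected
    # Clamp so the fraction stays a valid proportion.
    return min(1.0, max(0.0, infected + dt * delta))


def simulate_sis(infected0, beta, gamma, steps, dt=1.0):
    """Return the infected fraction at every time step, starting from infected0."""
    trajectory = [infected0]
    for _ in range(steps):
        trajectory.append(sis_step(trajectory[-1], beta, gamma, dt))
    return trajectory
```

When beta exceeds gamma, the infected fraction in this standard model settles near 1 - gamma/beta. A simulation could read the group's emotional state at each time step from such a trajectory.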

Planning with Expectation Models for Control

Authors:Katya Kudashkina, Yi Wan, Abhishek Naik, Richard S. Sutton
Date:2021-04-17 13:37:14

In model-based reinforcement learning (MBRL), Wan et al. (2019) showed conditions under which the environment model could produce the expectation of the next feature vector rather than the full distribution, or a sample thereof, with no loss in planning performance. Such expectation models are of interest when the environment is stochastic and non-stationary, and the model is approximate, such as when it is learned using function approximation. In these cases a full distribution model may be impractical and a sample model may be either more expensive computationally or of high variance. Wan et al. considered only planning for prediction to evaluate a fixed policy. In this paper, we treat the control case - planning to improve and find a good approximate policy. We prove that planning with an expectation model must update a state-value function, not an action-value function as previously suggested (e.g., Sorg & Singh, 2010). This opens the question of how planning influences action selections. We consider three strategies for this and present general MBRL algorithms for each. We identify the strengths and weaknesses of these algorithms in computational experiments. Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.

Hierarchical Human-Motion Prediction and Logic-Geometric Programming for Minimal Interference Human-Robot Tasks

Authors:An T. Le, Philipp Kratzer, Simon Hagenmayer, Marc Toussaint, Jim Mainprice
Date:2021-04-16 14:35:59

In this paper, we tackle the problem of human-robot coordination in sequences of manipulation tasks. Our approach integrates hierarchical human motion prediction with Task and Motion Planning (TAMP). We first devise a hierarchical motion prediction approach by combining Inverse Reinforcement Learning and short-term motion prediction using a Recurrent Neural Network. In a second step, we propose a dynamic version of the TAMP algorithm Logic-Geometric Programming (LGP). Our version, Dynamic LGP, replans periodically to handle the mismatch between the human motion prediction and the actual human behavior. We assess the efficacy of the approach by training the prediction algorithms and testing the framework on the publicly available MoGaze dataset.

Rule-Based Reinforcement Learning for Efficient Robot Navigation with Space Reduction

Authors:Yuanyang Zhu, Zhi Wang, Chunlin Chen, Daoyi Dong
Date:2021-04-15 07:40:27

For real-world deployments, it is critical to allow robots to navigate in complex environments autonomously. Traditional methods usually maintain an internal map of the environment, and then design several simple rules, in conjunction with a localization and planning approach, to navigate through the internal map. These approaches often involve a variety of assumptions and prior knowledge. In contrast, recent reinforcement learning (RL) methods can provide a model-free, self-learning mechanism as the robot interacts with an initially unknown environment, but are expensive to deploy in real-world scenarios due to inefficient exploration. In this paper, we focus on efficient navigation with the RL technique and combine the advantages of these two kinds of methods into a rule-based RL (RuRL) algorithm for reducing the sample complexity and cost of time. First, we use the rule of wall-following to generate a closed-loop trajectory. Second, we employ a reduction rule to shrink the trajectory, which in turn effectively reduces the redundant exploration space. Besides, we give the detailed theoretical guarantee that the optimal navigation path is still in the reduced space. Third, in the reduced space, we utilize the Pledge rule to guide the exploration strategy for accelerating the RL process at the early stage. Experiments conducted on real robot navigation problems in hex-grid environments demonstrate that RuRL can achieve improved navigation performance.

Learning and Planning in Complex Action Spaces

Authors:Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver
Date:2021-04-13 15:48:48

Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.

Online and Offline Reinforcement Learning by Planning with a Learned Model

Authors:Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver
Date:2021-04-13 15:36:06

Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm could demonstrate state-of-the-art results in both settings. In this work, we describe the Reanalyse algorithm which uses model-based policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline Reinforcement Learning (offline RL). Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings. MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.

Selection-Expansion: A Unifying Framework for Motion-Planning and Diversity Search Algorithms

Authors:Alexandre Chenu, Nicolas Perrin-Gilbert, Stéphane Doncieux, Olivier Sigaud
Date:2021-04-10 13:52:27

Reinforcement learning agents need a reward signal to learn successful policies. When this signal is sparse or the corresponding gradient is deceptive, such agents need a dedicated mechanism to efficiently explore their search space without relying on the reward. Looking for a large diversity of behaviors or using Motion Planning (MP) algorithms are two options in this context. In this paper, we build on the common roots between these two options to investigate the properties of two diversity search algorithms, the Novelty Search and the Goal Exploration Process algorithms. These algorithms look for diversity in an outcome space or behavioral space which is generally hand-designed to represent what matters for a given task. The relation to MP algorithms reveals that the smoothness, or lack of smoothness of the mapping between the policy parameter space and the outcome space plays a key role in the search efficiency. In particular, we show empirically that, if the mapping is smooth enough, i.e. if two close policies in the parameter space lead to similar outcomes, then diversity algorithms tend to inherit exploration properties of MP algorithms. By contrast, if it is not, diversity algorithms lose these properties and their performance strongly depends on specific heuristics, notably filtering mechanisms that discard some of the explored policies.

Jamming-Resilient Path Planning for Multiple UAVs via Deep Reinforcement Learning

Authors:Xueyuan Wang, M. Cenk Gursoy, Tugba Erpek, Yalin E. Sagduyu
Date:2021-04-09 16:52:33

Unmanned aerial vehicles (UAVs) are expected to be an integral part of wireless networks. In this paper, we aim to find collision-free paths for multiple cellular-connected UAVs, while satisfying requirements of connectivity with ground base stations (GBSs) in the presence of a dynamic jammer. We first formulate the problem as a sequential decision making problem in discrete domain, with connectivity, collision avoidance, and kinematic constraints. We, then, propose an offline temporal difference (TD) learning algorithm with online signal-to-interference-plus-noise ratio (SINR) mapping to solve the problem. More specifically, a value network is constructed and trained offline by TD method to encode the interactions among the UAVs and between the UAVs and the environment; and an online SINR mapping deep neural network (DNN) is designed and trained by supervised learning, to encode the influence and changes due to the jammer. Numerical results show that, without any information on the jammer, the proposed algorithm can achieve performance levels close to that of the ideal scenario with the perfect SINR-map. Real-time navigation for multi-UAVs can be efficiently performed with high success rates, and collisions are avoided.

Connecting Deep-Reinforcement-Learning-based Obstacle Avoidance with Conventional Global Planners using Waypoint Generators

Authors:Linh Kästner, Teham Buiyan, Xinlin Zhao, Zhengcheng Shen, Cornelius Marx, Jens Lambrecht
Date:2021-04-08 10:23:23

Deep Reinforcement Learning has emerged as an efficient dynamic obstacle avoidance method in highly dynamic environments. It has the potential to replace overly conservative or inefficient navigation approaches. However, the integration of Deep Reinforcement Learning into existing navigation systems is still an open frontier due to the myopic nature of Deep-Reinforcement-Learning-based navigation, which hinders its widespread integration into current navigation systems. In this paper, we propose the concept of an intermediate planner to interconnect novel Deep-Reinforcement-Learning-based obstacle avoidance with conventional global planning methods using waypoint generation. Therefore, we integrate different waypoint generators into existing navigation systems and compare the joint system against traditional ones. We found an increased performance in terms of safety, efficiency and path smoothness especially in highly dynamic environments.

Arena-Rosnav: Towards Deployment of Deep-Reinforcement-Learning-Based Obstacle Avoidance into Conventional Autonomous Navigation Systems

Authors:Linh Kästner, Teham Buiyan, Xinlin Zhao, Lei Jiao, Zhengcheng Shen, Jens Lambrecht
Date:2021-04-08 08:56:53

Recently, mobile robots have become important tools in various industries, especially in logistics. Deep reinforcement learning emerged as an alternative planning method to replace overly conservative approaches and promises more efficient and flexible navigation. However, deep reinforcement learning approaches are not suitable for long-range navigation due to their proneness to local minima and lack of long term memory, which hinders its widespread integration into industrial applications of mobile robotics. In this paper, we propose a navigation system incorporating deep-reinforcement-learning-based local planners into conventional navigation stacks for long-range navigation. Therefore, a framework for training and testing the deep reinforcement learning algorithms along with classic approaches is presented. We evaluated our deep-reinforcement-learning-enhanced navigation system against various conventional planners and found that our system outperforms them in terms of safety, efficiency and robustness.

PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics

Authors:Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B. Tenenbaum, Chuang Gan
Date:2021-04-07 17:59:23

Simulated virtual environments serve as one of the main driving forces behind developing and evaluating skill learning algorithms. However, existing environments typically only simulate rigid body physics. Additionally, the simulation process usually does not provide gradients that might be useful for planning and control optimizations. We introduce a new differentiable physics benchmark called PlasticineLab, which includes a diverse collection of soft body manipulation tasks. In each task, the agent uses manipulators to deform the plasticine into the desired configuration. The underlying physics engine supports differentiable elastic and plastic deformation using the DiffTaichi system, posing many under-explored challenges to robotic agents. We evaluate several existing reinforcement learning (RL) methods and gradient-based methods on this benchmark. Experimental results suggest that 1) RL-based approaches struggle to solve most of the tasks efficiently; 2) gradient-based approaches, by optimizing open-loop control sequences with the built-in differentiable physics engine, can rapidly find a solution within tens of iterations, but still fall short on multi-stage tasks that require long-term planning. We expect that PlasticineLab will encourage the development of novel algorithms that combine differentiable physics and RL for more complex physics-based skill learning tasks.

The Value of Planning for Infinite-Horizon Model Predictive Control

Authors:Nathan Hatch, Byron Boots
Date:2021-04-07 02:21:55

Model Predictive Control (MPC) is a classic tool for optimal control of complex, real-world systems. Although it has been successfully applied to a wide range of challenging tasks in robotics, it is fundamentally limited by the prediction horizon, which, if too short, will result in myopic decisions. Recently, several papers have suggested using a learned value function as the terminal cost for MPC. If the value function is accurate, it effectively allows MPC to reason over an infinite horizon. Unfortunately, Reinforcement Learning (RL) solutions to value function approximation can be difficult to realize for robotics tasks. In this paper, we suggest a more efficient method for value function approximation that applies to goal-directed problems, like reaching and navigation. In these problems, MPC is often formulated to track a path or trajectory returned by a planner. However, this strategy is brittle in that unexpected perturbations to the robot will require replanning, which can be costly at runtime. Instead, we show how the intermediate data structures used by modern planners can be interpreted as an approximate value function. We show that this value function can be used by MPC directly, resulting in more efficient and resilient behavior at runtime.
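
The idea of reusing a planner's intermediate data as a terminal cost can be illustrated on a toy grid. This is a sketch under simplifying assumptions and not the authors' implementation: a breadth-first distance map stands in for the planner's data structure, the MPC horizon is a single step, and the names cost_to_go, mpc_step, and MOVES are hypothetical.

```python
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def cost_to_go(grid, goal):
    """Breadth-first distances from the goal over free cells (0 = free, 1 = obstacle).

    The resulting map plays the role of an approximate value function.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def mpc_step(state, dist):
    """One-step lookahead: unit stage cost plus dist[next] as the terminal cost."""
    r, c = state
    reachable = [(r + dr, c + dc) for dr, dc in MOVES if (r + dr, c + dc) in dist]
    return min(reachable, key=lambda s: 1 + dist[s])
```

Because the distance map covers every reachable cell, a robot pushed off its nominal path can keep calling mpc_step from wherever it ends up without rerunning the planner, which is the resilience argument made in the abstract.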

GEM: Group Enhanced Model for Learning Dynamical Control Systems

Authors:Philippe Hansen-Estruch, Wenling Shang, Lerrel Pinto, Pieter Abbeel, Stas Tiomkin
Date:2021-04-07 01:08:18

Learning the dynamics of a physical system wherein an autonomous agent operates is an important task. Often these systems present apparent geometric structures. For instance, the trajectories of a robotic manipulator can be broken down into a collection of its translational and rotational motions, fully characterized by the corresponding Lie groups and Lie algebras. In this work, we take advantage of these structures to build effective dynamical models that are amenable to sample-based learning. We hypothesize that learning the dynamics on a Lie algebra vector space is more effective than learning a direct state transition model. To verify this hypothesis, we introduce the Group Enhanced Model (GEM). GEMs significantly outperform conventional transition models on tasks of long-term prediction, planning, and model-based reinforcement learning across a diverse suite of standard continuous-control environments, including Walker, Hopper, Reacher, Half-Cheetah, Inverted Pendulums, Ant, and Humanoid. Furthermore, plugging GEM into existing state-of-the-art systems enhances their performance, which we demonstrate on the PETS system. This work sheds light on a connection between learning of dynamics and Lie group properties, which opens doors for new research directions and practical applications. Our code is publicly available at: https://tinyurl.com/GEMMBRL.

Design and implementation of an environment for Learning to Run a Power Network (L2RPN)

Authors:Marvin Lerousseau
Date:2021-04-06 13:31:11

This report summarizes work performed as part of an internship at INRIA, in partial fulfillment of the requirements for a master's degree in math and informatics. The goal of the internship was to develop a software environment to simulate electricity transmission in a power grid and the actions performed by operators to keep this grid secure. Our environment lends itself to automating the control of the power grid with reinforcement learning agents, assisting human operators. It is amenable to organizing benchmarks, including a challenge in machine learning planned by INRIA and RTE for 2019. Our framework, built on top of open-source libraries, is available at https://github.com/MarvinLer/pypownet. In this report we present intermediary results and its usage in the context of a reinforcement learning game.

SOLO: Search Online, Learn Offline for Combinatorial Optimization Problems

Authors:Joel Oren, Chana Ross, Maksym Lefarov, Felix Richter, Ayal Taitler, Zohar Feldman, Christian Daniel, Dotan Di Castro
Date:2021-04-04 17:12:24

We study combinatorial problems with real world applications such as machine scheduling, routing, and assignment. We propose a method that combines Reinforcement Learning (RL) and planning. This method can equally be applied to both the offline, as well as online, variants of the combinatorial problem, in which the problem components (e.g., jobs in scheduling problems) are not known in advance, but rather arrive during the decision-making process. Our solution is quite generic, scalable, and leverages distributional knowledge of the problem parameters. We frame the solution process as an MDP, and take a Deep Q-Learning approach wherein states are represented as graphs, thereby allowing our trained policies to deal with arbitrary changes in a principled manner. Though learned policies work well in expectation, small deviations can have substantial negative effects in combinatorial settings. We mitigate these drawbacks by employing our graph-convolutional policies as non-optimal heuristics in a compatible search algorithm, Monte Carlo Tree Search, to significantly improve overall performance. We demonstrate our method on two problems: Machine Scheduling and Capacitated Vehicle Routing. We show that our method outperforms custom-tailored mathematical solvers, state of the art learning-based algorithms, and common heuristics, both in computation time and performance.

AdaPool: A Diurnal-Adaptive Fleet Management Framework using Model-Free Deep Reinforcement Learning and Change Point Detection

Authors:Marina Haliem, Vaneet Aggarwal, Bharat Bhargava
Date:2021-04-01 02:14:01

This paper introduces an adaptive model-free deep reinforcement learning approach that can recognize and adapt to the diurnal patterns in the ride-sharing environment with car-pooling. Deep Reinforcement Learning (RL) suffers from catastrophic forgetting due to being agnostic to the timescale of changes in the distribution of experiences. Although RL algorithms are guaranteed to converge to optimal policies in Markov decision processes (MDPs), this only holds in the presence of static environments. However, this assumption is very restrictive. In many real-world problems like ride-sharing, traffic control, etc., we are dealing with highly dynamic environments, where RL methods yield only sub-optimal decisions. To mitigate this problem in highly dynamic environments, we (1) adopt an online Dirichlet change point detection (ODCP) algorithm to detect the changes in the distribution of experiences, and (2) develop a Deep Q Network (DQN) agent that is capable of recognizing diurnal patterns and making informed dispatching decisions according to the changes in the underlying environment. Rather than fixing patterns by time of week, the proposed approach automatically detects that the MDP has changed, and uses the results of the new model. In addition to the adaptation logic in dispatching, this paper also proposes a dynamic, demand-aware vehicle-passenger matching and route planning framework that dynamically generates optimal routes for each vehicle based on online demand, vehicle capacities, and locations. Evaluation on the public New York City Taxi dataset shows the effectiveness of our approach in improving the fleet utilization, where less than 50% of the fleet is utilized to serve the demand of up to 90% of the requests, while maximizing profits and minimizing idle times.

RIS-Assisted UAV for Timely Data Collection in IoT Networks

Authors:Ahmed Al-Hilo, Moataz Samir, Mohamed Elhattab, Chadi Assi, Sanaa Sharafeddine
Date:2021-03-31 15:25:36

Intelligent Transportation Systems are thriving thanks to a wide range of technological advances, namely 5G communications, Internet of Things, artificial intelligence and edge computing. Central to this is the wide deployment of smart sensing devices and accordingly the large amount of harvested information to be processed for timely decision making. Robust network access is, hence, essential for offloading the collected data before a set deadline, beyond which the data loses its value. In environments where direct communication can be impaired by, for instance, blockages such as in urban cities, unmanned aerial vehicles (UAVs) can be considered as an alternative for providing and enhancing connectivity, particularly when IoT devices (IoTDs) are resource-constrained. Also, to conserve energy, IoTDs are assumed to alternate between their active and passive modes. This paper, therefore, considers a time-constrained data gathering problem from a network of sensing devices with assistance from a UAV. A Reconfigurable Intelligent Surface (RIS) is deployed to further improve both the connectivity and energy efficiency of the UAV, particularly when multiple devices are served concurrently and experience different channel impairments. This integrated problem brings challenges related to the configuration of the phase shift elements of the RIS, the scheduling of IoTD transmissions, as well as the trajectory of the UAV. First, the problem is formulated with the objective of maximizing the total number of served devices, each during its activation period. Owing to its complexity and the incomplete knowledge about the environment, we leverage deep reinforcement learning in our solution; the UAV trajectory planning is modeled as a Markov Decision Process, and Proximal Policy Optimization is invoked to solve it. Next, the RIS configuration is handled via Block Coordinate Descent.

Deep Reinforcement Learning for Constrained Field Development Optimization in Subsurface Two-phase Flow

Authors:Yusuf Nasir, Jincong He, Chaoshun Hu, Shusei Tanaka, Kainan Wang, XianHuan Wen
Date:2021-03-31 07:08:24

We present a deep reinforcement learning-based artificial intelligence agent that could provide optimized development plans given a basic description of the reservoir and rock/fluid properties with minimal computational cost. This artificial intelligence agent, comprising a convolutional neural network, provides a mapping from a given state of the reservoir model, constraints, and economic condition to the optimal decision (drill/do not drill and well location) to be taken in the next stage of the defined sequential field development planning process. The state of the reservoir model is defined using parameters that appear in the governing equations of the two-phase flow. A feedback loop training process referred to as deep reinforcement learning is used to train an artificial intelligence agent with such a capability. The training entails millions of flow simulations with varying reservoir model descriptions (structural, rock and fluid properties), operational constraints, and economic conditions. The parameters that define the reservoir model, operational constraints, and economic conditions are randomly sampled from a defined range of applicability. Several algorithmic treatments are introduced to enhance the training of the artificial intelligence agent. After appropriate training, the artificial intelligence agent provides an optimized field development plan instantly for new scenarios within the defined range of applicability. This approach has advantages over traditional optimization algorithms (e.g., particle swarm optimization, genetic algorithm) that are generally used to find a solution for a specific field development scenario and are typically not generalizable to different scenarios.

Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos

Authors:Annie S. Chen, Suraj Nair, Chelsea Finn
Date:2021-03-31 05:25:05

We are motivated by the goal of generalist robots that can complete a wide range of tasks across many environments. Critical to this is the robot's ability to acquire some metric of task success or reward, which is necessary for reinforcement learning, planning, or knowing when to ask for help. For a general-purpose robot operating in the real world, this reward function must also be able to generalize broadly across environments, tasks, and objects, while depending only on on-board sensor observations (e.g. RGB images). While deep learning on large and diverse datasets has shown promise as a path towards such generalization in computer vision and natural language, collecting high quality datasets of robotic interaction at scale remains an open challenge. In contrast, "in-the-wild" videos of humans (e.g. YouTube) contain an extensive collection of people doing interesting tasks across a diverse range of settings. In this work, we propose a simple approach, Domain-agnostic Video Discriminator (DVD), that learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task, and can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos. We find that by leveraging diverse human datasets, this reward function (a) can generalize zero shot to unseen environments, (b) generalize zero shot to unseen tasks, and (c) can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.

Simultaneous Navigation and Construction Benchmarking Environments

Authors:Wenyu Han, Chen Feng, Haoran Wu, Alexander Gao, Armand Jordana, Dong Liu, Lerrel Pinto, Ludovic Righetti
Date:2021-03-31 00:05:54

We need intelligent robots for mobile construction, the process of navigating in an environment and modifying its structure according to a geometric design. In this task, a major robot vision and learning challenge is how to exactly achieve the design without GPS, due to the difficulty caused by the bi-directional coupling of accurate robot localization and navigation together with strategic environment manipulation. However, many existing robot vision and learning tasks such as visual navigation and robot manipulation address only one of these two coupled aspects. To stimulate the pursuit of a generic and adaptive solution, we reasonably simplify mobile construction as a partially observable Markov decision process (POMDP) in 1/2/3D grid worlds and benchmark the performance of a handcrafted policy with basic localization and planning, and state-of-the-art deep reinforcement learning (RL) methods. Our extensive experiments show that the coupling makes this problem very challenging for those methods, and emphasize the need for novel task-specific solutions.

Increasing the Efficiency of Policy Learning for Autonomous Vehicles by Multi-Task Representation Learning

Authors:Eshagh Kargar, Ville Kyrki
Date:2021-03-26 20:16:59

Driving in a dynamic, multi-agent, and complex urban environment is a difficult task requiring a complex decision-making policy. The learning of such a policy requires a state representation that can encode the entire environment. Mid-level representations that encode a vehicle's environment as images have become a popular choice. Still, they are quite high-dimensional, limiting their use in data-hungry approaches such as reinforcement learning. In this article, we propose to learn a low-dimensional and rich latent representation of the environment by leveraging the knowledge of relevant semantic factors. To do this, we train an encoder-decoder deep neural network to predict multiple application-relevant factors such as the trajectories of other agents and the ego car. Furthermore, we propose a hazard signal based on other vehicles' future trajectories and the planned route which is used in conjunction with the learned latent representation as input to a down-stream policy. We demonstrate that using the multi-head encoder-decoder neural network results in a more informative representation than a standard single-head model. In particular, the proposed representation learning and the hazard signal help reinforcement learning to learn faster, with increased performance and less data than baseline methods.

Personalized Adaptive Cruise Control and Impacts on Mixed Traffic

Authors:Mehmet Ozkan, Yao Ma
Date:2021-03-26 19:40:36

This paper presents a personalized adaptive cruise control (PACC) design that can learn driver behavior and adaptively control the semi-autonomous vehicle (SAV) in the car-following scenario, and investigates its impacts on mixed traffic. In mixed traffic where the SAV and human-driven vehicles share the road, the SAV's driver can choose a PACC tuning that better fits the driver's preferred driving behaviors. The individual driver's preferences are learned through the inverse reinforcement learning (IRL) approach by recovering a unique cost function from the driver's demonstrated driving data that best explains the observed driving style. The proposed PACC design plans the motion of the SAV by minimizing the learned unique cost function considering the short preview information of the preceding human-driven vehicle. The results reveal that the learned driver model can identify and replicate the personalized driving behaviors accurately and consistently when following the preceding vehicle in a variety of traffic conditions. Furthermore, we investigated the impacts of the PACC with different drivers on mixed traffic by considering time headway, gap distance, and fuel economy assessments. A statistical investigation shows that the impacts of the PACC on mixed traffic vary among tested drivers due to their intrinsic driving preferences.

Character Controllers Using Motion VAEs

Authors:Hung Yu Ling, Fabio Zinno, George Cheng, Michiel van de Panne
Date:2021-03-26 05:51:41

A fundamental problem in computer animation is that of realizing purposeful and realistic human movement given a sufficiently-rich set of motion capture clips. We learn data-driven generative models of human movement using autoregressive conditional variational autoencoders, or Motion VAEs. The latent variables of the learned autoencoder define the action space for the movement and thereby govern its evolution over time. Planning or control algorithms can then use this action space to generate desired motions. In particular, we use deep reinforcement learning to learn controllers that achieve goal-directed movements. We demonstrate the effectiveness of the approach on multiple tasks. We further evaluate system-design choices and describe the current limitations of Motion VAEs.

Self-Imitation Learning by Planning

Authors:Sha Luo, Hamidreza Kasaei, Lambert Schomaker
Date:2021-03-25 13:28:38

Imitation learning (IL) enables robots to acquire skills quickly by transferring expert knowledge, which is widely adopted in reinforcement learning (RL) to initialize exploration. However, in long-horizon motion planning tasks, a challenging problem in deploying IL and RL methods is how to generate and collect massive, broadly distributed data such that these methods can generalize effectively. In this work, we solve this problem using our proposed approach called self-imitation learning by planning (SILP), where demonstration data are collected automatically by planning on the visited states from the current policy. SILP is inspired by the observation that successfully visited states in the early reinforcement learning stage are collision-free nodes in the graph-search based motion planner, so we can plan and relabel the robot's own trials as demonstrations for policy learning. Due to these self-generated demonstrations, we relieve the human operator from the laborious data preparation process required by IL and RL methods in solving complex motion planning tasks. The evaluation results show that our SILP method achieves higher success rates and enhances sample efficiency compared to selected baselines, and the policy learned in simulation performs well in a real-world placement task with changing goals and obstacles.

CLAMGen: Closed-Loop Arm Motion Generation via Multi-view Vision-Based RL

Authors:Iretiayo Akinola, Zizhao Wang, Peter Allen
Date:2021-03-24 15:33:03

We propose a vision-based reinforcement learning (RL) approach for closed-loop trajectory generation in an arm reaching problem. Arm trajectory generation is a fundamental robotics problem which entails finding collision-free paths to move the robot's body (e.g. arm) in order to satisfy a goal (e.g. place end-effector at a point). While classical methods typically require the model of the environment to solve a planning, search or optimization problem, learning-based approaches hold the promise of directly mapping from observations to robot actions. However, learning a collision-avoidance policy using RL remains a challenge for various reasons, including, but not limited to, partial observability, poor exploration, low sample efficiency, and learning instabilities. To address these challenges, we present a residual-RL method that leverages a greedy goal-reaching RL policy as the base to improve exploration, and the base policy is augmented with residual state-action values and residual actions learned from images to avoid obstacles. Furthermore, we introduce novel learning objectives and techniques to improve 3D understanding from multiple image views and sample efficiency of our algorithm. Compared to RL baselines, our method achieves superior performance in terms of success rate.

Discriminator Augmented Model-Based Reinforcement Learning

Authors:Behzad Haghgoo, Allan Zhou, Archit Sharma, Chelsea Finn
Date:2021-03-24 06:01:55

By planning through a learned dynamics model, model-based reinforcement learning (MBRL) offers the prospect of good performance with little environment interaction. However, it is common in practice for the learned model to be inaccurate, impairing planning and leading to poor performance. This paper aims to improve planning with an importance sampling framework that accounts and corrects for discrepancy between the true and learned dynamics. This framework also motivates an alternative objective for fitting the dynamics model: to minimize the variance of value estimation during planning. We derive and implement this objective, which encourages better prediction on trajectories with larger returns. We observe empirically that our approach improves the performance of current MBRL algorithms on two stochastic control problems, and provide a theoretical basis for our method.
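The importance-sampling correction mentioned in this abstract can be illustrated on a toy problem. The sketch below is not the paper's method: it assumes a hypothetical three-outcome setting in which both the true and the learned trajectory distributions are known exactly, so the ratio can be computed directly (in practice it would have to be estimated, for example with a discriminator as the title suggests). All names and numbers are illustrative assumptions.

```python
# Toy illustration of importance sampling between a learned and a true model.
# Assumption: three possible trajectories with known probabilities and returns.
returns = [1.0, 0.0, 5.0]   # hypothetical return of each trajectory
p_true = [0.5, 0.3, 0.2]    # hypothetical true dynamics
p_model = [0.6, 0.3, 0.1]   # hypothetical (inaccurate) learned dynamics


def expectation(probs, values):
    """Exact expectation of `values` under the discrete distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))


# Value under the true dynamics (what planning would like to know).
true_value = expectation(p_true, returns)

# Value obtained by planning naively through the learned model.
naive_value = expectation(p_model, returns)

# Importance weights p_true / p_model reweight model samples so that the
# expectation under the model matches the expectation under the true dynamics.
weights = [pt / pm for pt, pm in zip(p_true, p_model)]
corrected_value = expectation(
    p_model, [w * r for w, r in zip(weights, returns)]
)

print(true_value, naive_value, corrected_value)
```

The naive estimate differs from the true value because the model underweights the high-return trajectory; reweighting removes that bias, at the cost of higher variance when the weights are large.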

Integrated Decision and Control: Towards Interpretable and Computationally Efficient Driving Intelligence

Authors:Yang Guan, Yangang Ren, Qi Sun, Shengbo Eben Li, Haitong Ma, Jingliang Duan, Yifan Dai, Bo Cheng
Date:2021-03-18 14:43:31

Decision and control are core functionalities of high-level automated vehicles. Current mainstream methods, such as functionality decomposition and end-to-end reinforcement learning (RL), suffer from either high time complexity or poor interpretability and adaptability on real-world autonomous driving tasks. In this paper, we present an interpretable and computationally efficient framework called integrated decision and control (IDC) for automated vehicles, which decomposes the driving task into static path planning and dynamic optimal tracking that are structured hierarchically. First, the static path planning generates several candidate paths only considering static traffic elements. Then, the dynamic optimal tracking is designed to track the optimal path while considering the dynamic obstacles. To that end, we formulate a constrained optimal control problem (OCP) for each candidate path, optimize them separately and follow the one with the best tracking performance. To unload the heavy online computation, we propose a model-based reinforcement learning (RL) algorithm that can serve as an approximate constrained OCP solver. Specifically, the OCPs for all paths are considered together to construct a single complete RL problem and then solved offline in the form of value and policy networks, for real-time online path selection and tracking respectively. We verify our framework in both simulations and the real world. Results show that compared with baseline methods IDC has an order of magnitude higher online computing efficiency, as well as better driving performance including traffic efficiency and safety. In addition, it yields great interpretability and adaptability among different driving tasks. The effectiveness of the proposed method is also demonstrated in real road tests with complicated traffic conditions.

Reward Signal Design for Autonomous Racing

Authors:Benjamin Evans, Herman A. Engelbrecht, Hendrik W. Jordaan
Date:2021-03-18 09:21:44

Reinforcement learning (RL) has been shown to be a valuable tool in training neural networks for autonomous motion planning. The application of RL to a specific problem is dependent on a reward signal to quantify how good or bad a certain action is. This paper addresses the problem of reward signal design for robotic control in the context of local planning for autonomous racing. We aim to design reward signals that are able to perform well in multiple, competing, continuous metrics. Three different methodologies of position-based, velocity-based, and action-based rewards are considered and evaluated in the context of F1/10th racing. A novel method of rewarding the agent on its state relative to an optimal trajectory is presented. Agents are trained and tested in simulation and the behaviors generated by the reward signals are compared to each other on the basis of average lap time and completion rate. The results indicate that a reward based on the distance and velocity relative to a minimum curvature trajectory produces the fastest lap times.
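A reward of the kind described in the last sentence, based on distance and velocity relative to a reference trajectory, might look roughly like the sketch below. The abstract does not give the functional form, so the linear penalties, the weights `w_dist` and `w_vel`, and the function name are all assumptions made for illustration only.

```python
import math


def trajectory_reward(pos, speed, ref_pos, ref_speed, w_dist=1.0, w_vel=0.1):
    """Hypothetical reward relative to a reference (e.g. minimum curvature) trajectory.

    pos, ref_pos: (x, y) of the vehicle and of the nearest reference point.
    speed, ref_speed: current speed and the reference speed at that point.
    The weights and the linear penalty form are illustrative assumptions.
    """
    # Euclidean distance between the vehicle and the reference point.
    dist = math.hypot(pos[0] - ref_pos[0], pos[1] - ref_pos[1])
    # Absolute deviation from the reference speed.
    vel_error = abs(speed - ref_speed)
    # Zero when exactly on the reference; increasingly negative otherwise.
    return -w_dist * dist - w_vel * vel_error


on_reference = trajectory_reward((0.0, 0.0), 5.0, (0.0, 0.0), 5.0)
off_reference = trajectory_reward((3.0, 4.0), 3.0, (0.0, 0.0), 5.0)
print(on_reference, off_reference)
```

The relative size of the two weights sets the trade-off between staying on the line and matching the reference speed, which is the kind of competing-metric balance the abstract refers to.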

A Practical Guide to Multi-Objective Reinforcement Learning and Planning

Authors:Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, Diederik M. Roijers
Date:2021-03-17 11:07:28

Real-world decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems, and is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods who wish to adopt a multi-objective perspective on their research, as well as practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems for complex problems.
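One well-known reason a simple linear combination can oversimplify the problem is that it can only select policies on the convex hull of the Pareto front. The sketch below illustrates this with four hypothetical policies evaluated on two objectives (both maximized); the policy returns are invented for the example and are not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point (maximization)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


def best_linear(points: List[Point], w: float) -> Point:
    """Policy that maximizes the scalarized return w * f1 + (1 - w) * f2."""
    return max(points, key=lambda p: w * p[0] + (1 - w) * p[1])


# Hypothetical two-objective returns of four policies.
policies = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4), (0.2, 0.2)]

front = pareto_front(policies)

# Sweep the scalarization weight and collect every policy it can select.
weights = [i / 100 for i in range(101)]
reachable = {best_linear(policies, w) for w in weights}

print("Pareto front:", front)
print("Reachable by linear scalarization:", sorted(reachable))
```

Here the balanced policy `(0.4, 0.4)` is Pareto-optimal, yet no weight selects it, because one of the two extreme policies always scores at least 0.5 under the weighted sum.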

Hierarchical Reinforcement Learning Framework for Stochastic Spaceflight Campaign Design

Authors:Yuji Takubo, Hao Chen, Koki Ho
Date:2021-03-16 11:17:02

This paper develops a hierarchical reinforcement learning architecture for multimission spaceflight campaign design under uncertainty, including vehicle design, infrastructure deployment planning, and space transportation scheduling. This problem involves a high-dimensional design space and is challenging especially with uncertainty present. To tackle this challenge, the developed framework has a hierarchical structure with reinforcement learning and network-based mixed-integer linear programming (MILP), where the former optimizes campaign-level decisions (e.g., design of the vehicle used throughout the campaign, destination demand assigned to each mission in the campaign), whereas the latter optimizes the detailed mission-level decisions (e.g., when to launch what from where to where). The framework is applied to a set of human lunar exploration campaign scenarios with uncertain in situ resource utilization performance as a case study. The main value of this work is its integration of the rapidly growing reinforcement learning research and the existing MILP-based space logistics methods through a hierarchical framework to handle the otherwise intractable complexity of space mission design under uncertainty. This unique framework is expected to be a critical steppingstone for the emerging research direction of artificial intelligence for space mission design.

Autonomous Drone Racing with Deep Reinforcement Learning

Authors:Yunlong Song, Mats Steinweg, Elia Kaufmann, Davide Scaramuzza
Date:2021-03-15 18:05:49

In many robotic tasks, such as autonomous drone racing, the goal is to travel through a set of waypoints as fast as possible. A key challenge for this task is planning the time-optimal trajectory, which is typically solved by assuming perfect knowledge of the waypoints to pass in advance. The resulting solution is either highly specialized for a single-track layout, or suboptimal due to simplifying assumptions about the platform dynamics. In this work, a new approach to near-time-optimal trajectory generation for quadrotors is presented. Leveraging deep reinforcement learning and relative gate observations, our approach can compute near-time-optimal trajectories and adapt the trajectory to environment changes. Our method exhibits computational advantages over approaches based on trajectory optimization for non-trivial track configurations. The proposed approach is evaluated on a set of race tracks in simulation and the real world, achieving speeds of up to 60 km/h with a physical quadrotor.

Goal-Driven Autonomous Exploration Through Deep Reinforcement Learning

Authors:Reinis Cimurs, Il Hong Suh, Jin Han Lee
Date:2021-03-12 07:37:24

In this paper, we present an autonomous navigation system for goal-driven exploration of unknown environments through deep reinforcement learning (DRL). Points of interest (POI) for possible navigation directions are obtained from the environment and an optimal waypoint is selected, based on the available data. Following the waypoints, the robot is guided towards the global goal and the local optimum problem of reactive navigation is mitigated. Then, a motion policy for local navigation is learned through a DRL framework in a simulation. We develop a navigation system where this learned policy is integrated into a motion planning stack as the local navigation layer to move the robot between waypoints towards a global goal. The fully autonomous navigation is performed without any prior knowledge while a map is recorded as the robot moves through the environment. Experiments show that the proposed method has an advantage over similar exploration methods, without reliance on a map or prior information in complex static as well as dynamic environments.

Adapting User Interfaces with Model-based Reinforcement Learning

Authors:Kashyap Todi, Gilles Bailly, Luis A. Leiva, Antti Oulasvirta
Date:2021-03-11 17:24:34

Adapting an interface requires taking into account both the positive and negative effects that changes may have on the user. A carelessly picked adaptation may impose high costs on the user -- for example, due to surprise or relearning effort -- or "trap" the process in a suboptimal design prematurely. However, effects on users are hard to predict as they depend on factors that are latent and evolve over the course of interaction. We propose a novel approach for adaptive user interfaces that yields a conservative adaptation policy: It finds beneficial changes when there are such and avoids changes when there are none. Our model-based reinforcement learning method plans sequences of adaptations and consults predictive HCI models to estimate their effects. We present empirical and simulation results from the case of adaptive menus, showing that the method outperforms both a non-adaptive and a frequency-based policy.

Generalizable Episodic Memory for Deep Reinforcement Learning

Authors:Hao Hu, Jianing Ye, Guangxiang Zhu, Zhizhou Ren, Chongjie Zhang
Date:2021-03-11 05:31:21

Episodic memory-based methods can rapidly latch onto past successful strategies by a non-parametric memory and improve sample efficiency of traditional reinforcement learning. However, little effort is put into the continuous domain, where a state is never visited twice, and previous episodic methods fail to efficiently aggregate experience across trajectories. To address this problem, we propose Generalizable Episodic Memory (GEM), which effectively organizes the state-action values of episodic memory in a generalizable manner and supports implicit planning on memorized trajectories. GEM utilizes a double estimator to reduce the overestimation bias induced by value propagation in the planning process. Empirical evaluation shows that our method significantly outperforms existing trajectory-based methods on various MuJoCo continuous control tasks. To further show the general applicability, we evaluate our method on Atari games with discrete action space, which also shows a significant improvement over baseline algorithms.
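The overestimation bias that a double estimator targets can be seen in a small simulation. The sketch below is the generic double-estimator idea rather than GEM itself: it assumes a hypothetical setting where every action has true value zero and each estimate is corrupted by independent unit Gaussian noise. The function name and parameters are illustrative.

```python
import random


def overestimation_demo(n_actions=10, n_trials=2000, seed=0):
    """Compare a single max estimator with a double estimator.

    Assumption: all true action values are 0, so any nonzero average is bias.
    """
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(n_trials):
        # Two independent noisy estimates of the same (all-zero) values.
        est_a = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        est_b = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        # Single estimator: select and evaluate with the same estimates.
        single_total += max(est_a)
        # Double estimator: select with A, evaluate with independent B.
        best = max(range(n_actions), key=lambda a: est_a[a])
        double_total += est_b[best]
    return single_total / n_trials, double_total / n_trials


single_mean, double_mean = overestimation_demo()
print(single_mean, double_mean)
```

Taking the maximum over noisy estimates is biased upward, while evaluating the selected action with an independent estimate removes that bias in this toy setting.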

WFA-IRL: Inverse Reinforcement Learning of Autonomous Behaviors Encoded as Weighted Finite Automata

Authors:Tianyu Wang, Nikolay Atanasov
Date:2021-03-10 06:42:10

This paper presents a method for learning logical task specifications and cost functions from demonstrations. Constructing specifications by hand is challenging for complex objectives and constraints in autonomous systems. Instead, we consider demonstrated task executions, whose logic structure and transition costs need to be inferred by an autonomous agent. We employ a spectral learning approach to extract a weighted finite automaton (WFA), approximating the unknown task logic. Thereafter, we define a product between the WFA for high-level task guidance and a labeled Markov decision process for low-level control. An inverse reinforcement learning (IRL) problem is considered to learn a cost function by backpropagating the loss between agent and expert behaviors through the planning algorithm. Our proposed model, termed WFA-IRL, is capable of generalizing the execution of the inferred task specification in a suite of MiniGrid environments.

A Scavenger Hunt for Service Robots

Authors:Harel Yedidsion, Jennifer Suriadinata, Zifan Xu, Stefan Debruyn, Peter Stone
Date:2021-03-09 05:06:47

Creating robots that can perform general-purpose service tasks in a human-populated environment has been a longstanding grand challenge for AI and Robotics research. One particularly valuable skill that is relevant to a wide variety of tasks is the ability to locate and retrieve objects upon request. This paper models this skill as a Scavenger Hunt (SH) game, which we formulate as a variation of the NP-hard stochastic traveling purchaser problem. In this problem, the goal is to find a set of objects as quickly as possible, given probability distributions of where they may be found. We investigate the performance of several solution algorithms for the SH problem, both in simulation and on a real mobile robot. We use Reinforcement Learning (RL) to train an agent to plan a minimal cost path, and show that the RL agent can outperform a range of heuristic algorithms, achieving near optimal performance. In order to stimulate research on this problem, we introduce a publicly available software stack and associated website that enable users to upload scavenger hunts which robots can download, perform, and learn from to continually improve their performance on future hunts.

Vision-Based Mobile Robotics Obstacle Avoidance With Deep Reinforcement Learning

Authors:Patrick Wenzel, Torsten Schön, Laura Leal-Taixé, Daniel Cremers
Date:2021-03-08 13:05:46

Obstacle avoidance is a fundamental and challenging problem for autonomous navigation of mobile robots. In this paper, we consider the problem of obstacle avoidance in simple 3D environments where the robot has to solely rely on a single monocular camera. In particular, we are interested in solving this problem without relying on localization, mapping, or planning techniques. Most existing work considers obstacle avoidance as two separate problems, namely obstacle detection and control. Inspired by the recent advances of deep reinforcement learning in Atari games and understanding highly complex situations in Go, we tackle the obstacle avoidance problem as a data-driven end-to-end deep learning approach. Our approach takes raw images as input and generates control commands as output. We show that discrete action spaces outperform continuous control commands in terms of expected average reward in maze-like environments. Furthermore, we show how to accelerate the learning and increase the robustness of the policy by incorporating predicted depth maps by a generative adversarial network.

A Taxonomy of Similarity Metrics for Markov Decision Processes

Authors:Álvaro Visús, Javier García, Fernando Fernández
Date:2021-03-08 12:36:42

Although the notion of task similarity is potentially interesting in a wide range of areas such as curriculum learning or automated planning, it has mostly been tied to transfer learning. Transfer is based on the idea of reusing the knowledge acquired in the learning of a set of source tasks to a new learning process in a target task, assuming that the target and source tasks are close enough. In recent years, transfer learning has succeeded in making Reinforcement Learning (RL) algorithms more efficient (e.g., by reducing the number of samples needed to achieve the (near-)optimal performance). Transfer in RL is based on the core concept of similarity: whenever the tasks are similar, the transferred knowledge can be reused to solve the target task and significantly improve the learning performance. Therefore, the selection of good metrics to measure these similarities is a critical aspect when building transfer RL algorithms, especially when this knowledge is transferred from simulation to the real world. In the literature, there are many metrics to measure the similarity between MDPs, hence, many definitions of similarity or its complement distance have been considered. In this paper, we propose a categorization of these metrics and analyze the definitions of similarity proposed so far, taking into account such categorization. We also follow this taxonomy to survey the existing literature, as well as suggesting future directions for the construction of new metrics.

Real-world Ride-hailing Vehicle Repositioning using Deep Reinforcement Learning

Authors:Yan Jiao, Xiaocheng Tang, Zhiwei Qin, Shuaiji Li, Fan Zhang, Hongtu Zhu, Jieping Ye
Date:2021-03-08 05:34:05

We present a new practical framework based on deep reinforcement learning and decision-time planning for real-world vehicle repositioning on ride-hailing (a type of mobility-on-demand, MoD) platforms. Our approach learns the spatiotemporal state-value function using a batch training algorithm with deep value networks. The optimal repositioning action is generated on-demand through value-based policy search, which combines planning and bootstrapping with the value networks. For the large-fleet problems, we develop several algorithmic features that we incorporate into our framework and that we demonstrate to induce coordination among the algorithmically-guided vehicles. We benchmark our algorithm with baselines in a ride-hailing simulation environment to demonstrate its superiority in improving income efficiency measured by income-per-hour. We have also designed and run a real-world experiment program with regular drivers on a major ride-hailing platform. We have observed significantly positive results on key metrics comparing our method with experienced drivers who performed idle-time repositioning based on their own expertise.

Applying Machine Learning in Self-Adaptive Systems: A Systematic Literature Review

Authors:Omid Gheibi, Danny Weyns, Federico Quin
Date:2021-03-06 13:45:59

Recently, we have witnessed a rapid increase in the use of machine learning in self-adaptive systems. Machine learning has been used for a variety of reasons, ranging from learning a model of the environment of a system during operation to filtering large sets of possible configurations before analysing them. While a body of work on the use of machine learning in self-adaptive systems exists, there is currently no systematic overview of this area. Such an overview is important for researchers to understand the state of the art and direct future research efforts. This paper reports the results of a systematic literature review that aims at providing such an overview. We focus on self-adaptive systems that are based on a traditional Monitor-Analyze-Plan-Execute feedback loop (MAPE). The research questions are centred on the problems that motivate the use of machine learning in self-adaptive systems, the key engineering aspects of learning in self-adaptation, and open challenges. The search resulted in 6709 papers, of which 109 were retained for data collection. Analysis of the collected data shows that machine learning is mostly used for updating adaptation rules and policies to improve system qualities, and managing resources to better balance qualities and resources. These problems are primarily solved using supervised and interactive learning with classification, regression and reinforcement learning as the dominant methods. Surprisingly, unsupervised learning that naturally fits automation is only applied in a small number of studies. Key open challenges in this area include the performance of learning, managing the effects of learning, and dealing with more complex types of goals. From the insights derived from this systematic literature review we outline an initial design process for applying machine learning in self-adaptive systems that are based on MAPE feedback loops.

DeepFreight: Integrating Deep Reinforcement Learning and Mixed Integer Programming for Multi-transfer Truck Freight Delivery

Authors:Jiayu Chen, Abhishek K. Umrawal, Tian Lan, Vaneet Aggarwal
Date:2021-03-05 03:06:48

With freight delivery demands and shipping costs increasing rapidly, intelligent control of fleets to enable efficient and cost-conscious solutions becomes an important problem. In this paper, we propose DeepFreight, a model-free deep-reinforcement-learning-based algorithm for multi-transfer freight delivery, which includes two closely-collaborative components: truck-dispatch and package-matching. Specifically, a deep multi-agent reinforcement learning framework called QMIX is leveraged to learn a dispatch policy, with which we can obtain the multi-step joint vehicle dispatch decisions for the fleet with respect to the delivery requests. Then an efficient multi-transfer matching algorithm is executed to assign the delivery requests to the trucks. Also, DeepFreight is integrated with a Mixed-Integer Linear Programming optimizer for further optimization. The evaluation results show that the proposed system is highly scalable and ensures 100% delivery success while maintaining low delivery time and fuel consumption. The code is available at https://github.com/LucasCJYSDL/DeepFreight.

Efficient UAV Trajectory-Planning using Economic Reinforcement Learning

Authors:Alvi Ataur Khalil, Alexander J Byrne, Mohammad Ashiqur Rahman, Mohammad Hossein Manshaei
Date:2021-03-03 20:54:19

Advances in unmanned aerial vehicle (UAV) design have opened up applications as varied as surveillance, firefighting, cellular networks, and delivery applications. Additionally, due to decreases in cost, systems employing fleets of UAVs have become popular. The uniqueness of UAVs in systems creates a novel set of trajectory or path planning and coordination problems. Environments include many more points of interest (POIs) than UAVs, with obstacles and no-fly zones. We introduce REPlanner, a novel multi-agent reinforcement learning algorithm inspired by economic transactions to distribute tasks between UAVs. This system revolves around an economic theory, in particular an auction mechanism where UAVs trade assigned POIs. We formulate the path planning problem as a multi-agent economic game, where agents can cooperate and compete for resources. We then translate the problem into a Partially Observable Markov decision process (POMDP), which is solved using a reinforcement learning (RL) model deployed on each agent. As the system computes task distributions via UAV cooperation, it is highly resilient to any change in the swarm size. Our proposed network and economic game architecture can effectively coordinate the swarm as an emergent phenomenon while maintaining the swarm's operation. Evaluation results prove that REPlanner efficiently outperforms conventional RL-based trajectory search.

On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning

Authors:Baohe Zhang, Raghu Rajan, Luis Pineda, Nathan Lambert, André Biedenkapp, Kurtland Chua, Frank Hutter, Roberto Calandra
Date:2021-02-26 18:57:47

Model-based Reinforcement Learning (MBRL) is a promising framework for learning control in a data-efficient manner. MBRL algorithms can be fairly complex due to the separate dynamics modeling and the subsequent planning algorithm, and as a result, they often possess tens of hyperparameters and architectural choices. For this reason, MBRL typically requires significant human expertise before it can be applied to new problems and domains. To alleviate this problem, we propose to use automatic hyperparameter optimization (HPO). We demonstrate that this problem can be tackled effectively with automated HPO, which we demonstrate to yield significantly improved performance compared to human experts. In addition, we show that tuning of several MBRL hyperparameters dynamically, i.e. during the training itself, further improves the performance compared to using static hyperparameters which are kept fixed for the whole training. Finally, our experiments provide valuable insights into the effects of several hyperparameters, such as plan horizon or learning rate and their influence on the stability of training and resulting rewards.

Robot Navigation in a Crowd by Integrating Deep Reinforcement Learning and Online Planning

Authors:Zhiqian Zhou, Pengming Zhu, Zhiwen Zeng, Junhao Xiao, Huimin Lu, Zongtan Zhou
Date:2021-02-26 02:17:13

Navigating mobile robots along time-efficient and collision-free paths in a crowd remains an open and challenging problem. The main challenge comes from the complex and sophisticated interaction mechanism, which requires the robot to understand the crowd and perform proactive and foresighted behaviors. Deep reinforcement learning is a promising solution to this problem. However, most previous learning methods incur a tremendous computational burden. To address these problems, we propose a graph-based deep reinforcement learning method, SG-DQN, that (i) introduces a social attention mechanism to extract an efficient graph representation for the crowd-robot state; (ii) directly evaluates the coarse Q-values of the raw state with a learned dueling deep Q-network (DQN); and then (iii) refines the coarse Q-values via online planning on possible future trajectories. The experimental results indicate that our model can help the robot better understand the crowd and achieve a high success rate of more than 0.99 in the crowd navigation task. Compared against previous state-of-the-art algorithms, our algorithm achieves an equivalent, if not better, performance while requiring less than half of the computational cost.
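As a brief illustration of the dueling architecture mentioned in step (ii), the sketch below shows the commonly used mean-centred aggregation of a state value and per-action advantages into Q-values. The abstract does not describe SG-DQN's network in detail, so the function name and the toy numbers are assumptions made for illustration only.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    # Subtracting the mean keeps V and A identifiable: the advantages sum to zero.
    return [state_value + a - mean_advantage for a in advantages]


# Hypothetical outputs of the value and advantage heads for three actions.
q_values = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5, 0.0])
best_action = max(range(len(q_values)), key=q_values.__getitem__)
```

Centring the advantages means that adding a constant to every advantage leaves the resulting Q-values unchanged, which is the usual motivation for this form of aggregation.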

Bias-reduced Multi-step Hindsight Experience Replay for Efficient Multi-goal Reinforcement Learning

Authors:Rui Yang, Jiafei Lyu, Yu Yang, Jiangpeng Yan, Feng Luo, Dijun Luo, Lanqing Li, Xiu Li
Date:2021-02-25 16:05:57

Multi-goal reinforcement learning is widely applied in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle the two challenges via goal relabeling. However, HER-related works still need millions of samples and huge computation. In this paper, we propose Multi-step Hindsight Experience Replay (MHER), incorporating multi-step relabeled returns based on $n$-step relabeling to improve sample efficiency. Despite the advantages of $n$-step relabeling, we theoretically and experimentally prove that the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address the above issue, two bias-reduced MHER algorithms, MHER($\lambda$) and Model-based MHER (MMHER), are presented. MHER($\lambda$) exploits the $\lambda$ return while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions can successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER with little additional computation beyond HER.
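To make the $\lambda$ return concrete, here is a minimal sketch of the textbook truncated $\lambda$-return, which blends one-step and longer multi-step targets. It is not necessarily the exact weighting used in MHER($\lambda$); the function name, default discount, and input layout are assumptions, and in a hindsight setting the rewards would first be recomputed for the relabeled goal.

```python
def lambda_returns(rewards, next_values, gamma=0.98, lam=0.7):
    """Truncated lambda-returns for one trajectory segment.

    rewards[t]     -- reward r_t (assumed already relabeled for the hindsight goal)
    next_values[t] -- bootstrap estimate V(s_{t+1}) from a target network
    Uses G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).
    """
    returns = [0.0] * len(rewards)
    next_return = next_values[-1]  # bootstrap past the end of the segment
    for t in reversed(range(len(rewards))):
        blended = (1 - lam) * next_values[t] + lam * next_return
        returns[t] = rewards[t] + gamma * blended
        next_return = returns[t]
    return returns


# lam = 0 recovers one-step targets; lam = 1 recovers the full multi-step return.
one_step = lambda_returns([1.0, 2.0], [10.0, 20.0], gamma=0.5, lam=0.0)
multi_step = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

Intermediate values of `lam` trade the low bias of one-step targets against the faster reward propagation of multi-step targets, which is the trade-off the abstract refers to as off-policy $n$-step bias.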

Visualizing MuZero Models

Authors:Joery A. de Vries, Ken S. Voskuil, Thomas M. Moerland, Aske Plaat
Date:2021-02-25 15:25:17

MuZero, a model-based reinforcement learning algorithm that uses a value equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value equivalent models are trained to predict a future value, thereby emphasizing value relevant information in the representations. While value equivalent models have shown strong empirical success, there is no research yet that visualizes and investigates what types of representations these models actually learn. Therefore, in this paper we visualize the latent representation of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero's performance. Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value equivalent algorithms.

The Logical Options Framework

Authors:Brandon Araki, Xiao Li, Kiran Vodrahalli, Jonathan DeCastro, Micah J. Fry, Daniela Rus
Date:2021-02-24 21:43:16

Learning composable policies for environments with complex rules and tasks is a challenging problem. We introduce a hierarchical reinforcement learning framework called the Logical Options Framework (LOF) that learns policies that are satisfying, optimal, and composable. LOF efficiently learns policies that satisfy tasks by representing the task as an automaton and integrating it into learning and planning. We provide and prove conditions under which LOF will learn satisfying, optimal policies. And lastly, we show how LOF's learned policies can be composed to satisfy unseen tasks with only 10-50 retraining steps. We evaluate LOF on four tasks in discrete and continuous domains, including a 3D pick-and-place environment.

Deep Reinforcement Learning for Safe Landing Site Selection with Concurrent Consideration of Divert Maneuvers

Authors:Keidai Iiyama, Kento Tomita, Bhavi A. Jagatia, Tatsuwaki Nakagawa, Koki Ho
Date:2021-02-24 17:53:10

This research proposes a new integrated framework for identifying safe landing locations and planning in-flight divert maneuvers. The state-of-the-art algorithms for landing zone selection utilize local terrain features such as slopes and roughness to judge the safety and priority of the landing point. However, when there are additional chances of observation and diverting in the future, these algorithms are not able to evaluate the safety of the decision itself to target the selected landing point considering the overall descent trajectory. In response to this challenge, we propose a reinforcement learning framework that optimizes a landing site selection strategy concurrently with a guidance and control strategy to the target landing site. The trained agent could evaluate and select landing sites with explicit consideration of the terrain features, quality of future observations, and control to achieve a safe and efficient landing trajectory at a system level. The proposed framework was able to achieve 94.8% successful landings in highly challenging landing sites where over 80% of the area around the initial target landing point is hazardous, by effectively updating the target landing site and feedback control gain during descent.

Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic

Authors:Mingyu Cai, Mohammadhosein Hasanbeig, Shaoping Xiao, Alessandro Abate, Zhen Kan
Date:2021-02-24 01:11:25

This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDP) with unknown transition probabilities over continuous state and action spaces. Linear temporal logic (LTL) is used to specify high-level tasks over an infinite horizon, which can be converted into a limit deterministic generalized Büchi automaton (LDGBA) with several accepting sets. The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP by incorporating a synchronous tracking-frontier function to record unvisited accepting sets of the automaton, and to facilitate the satisfaction of the accepting conditions. The proposed LDGBA-based reward shaping and discounting schemes for model-free reinforcement learning (RL) only depend on the EP-MDP states and can overcome the issues of sparse rewards. Rigorous analysis shows that any RL method that optimizes the expected discounted return is guaranteed to find an optimal policy whose traces maximize the satisfaction probability. A modular deep deterministic policy gradient (DDPG) is then developed to generate such policies over continuous state and action spaces. The performance of our framework is evaluated via an array of OpenAI gym environments.

Deep Reinforcement Learning for Dynamic Spectrum Sharing of LTE and NR

Authors:Ursula Challita, David Sandberg
Date:2021-02-22 16:56:51

In this paper, a proactive dynamic spectrum sharing scheme between 4G and 5G systems is proposed. In particular, a controller decides on the resource split between NR and LTE every subframe while accounting for future network states such as high interference subframes and multimedia broadcast single frequency network (MBSFN) subframes. To solve this problem, a deep reinforcement learning (RL) algorithm based on Monte Carlo Tree Search (MCTS) is proposed. The introduced deep RL architecture is trained offline whereby the controller predicts a sequence of future states of the wireless access network by simulating hypothetical bandwidth splits over time starting from the current network state. The action sequence resulting in the best reward is then assigned. This is realized by predicting the quantities most directly relevant to planning, i.e., the reward, the action probabilities, and the value for each network state. Simulation results show that the proposed scheme is able to take actions while accounting for future states instead of being greedy in each subframe. The results also show that the proposed framework improves system-level performance.

Program Synthesis Guided Reinforcement Learning for Partially Observed Environments

Authors:Yichen David Yang, Jeevana Priya Inala, Osbert Bastani, Yewen Pu, Armando Solar-Lezama, Martin Rinard
Date:2021-02-22 16:05:32

A key challenge for reinforcement learning is solving long-horizon planning problems. Recent work has leveraged programs to guide reinforcement learning in these settings. However, these approaches impose a high manual burden on the user since they must provide a guiding program for every new task. Partially observed environments further complicate the programming task because the program must implement a strategy that correctly, and ideally optimally, handles every possible configuration of the hidden regions of the environment. We propose a new approach, model predictive program synthesis (MPPS), that uses program synthesis to automatically generate the guiding programs. It trains a generative model to predict the unobserved portions of the world, and then synthesizes a program based on samples from this model in a way that is robust to its uncertainty. In our experiments, we show that our approach significantly outperforms non-program-guided approaches on a set of challenging benchmarks, including a 2D Minecraft-inspired environment where the agent must complete a complex sequence of subtasks to achieve its goal, and achieves a similar performance as using handcrafted programs to guide the agent. Our results demonstrate that our approach can obtain the benefits of program-guided reinforcement learning without requiring the user to provide a new guiding program for every new task.

Learning the Subsystem of Local Planning for Autonomous Racing

Authors:Benjamin Evans, Hendrik W. Jordaan, Herman A. Engelbrecht
Date:2021-02-22 14:14:51

The problem of autonomous racing is to navigate through a race course as quickly as possible while not colliding with any obstacles. We approach the autonomous racing problem with the added constraint of not maintaining an updated obstacle map of the environment. Several current approaches to this problem use end-to-end learning systems where an agent replaces the entire navigation pipeline. This paper presents a hierarchical planning architecture that combines a high-level planner and path-following system with a reinforcement learning agent that learns the subsystem of obstacle avoidance. The novel "modification planner" uses the path follower to track the global plan and the deep reinforcement learning agent to modify the references generated by the path follower to avoid obstacles. Importantly, our architecture does not require an updated obstacle map and uses only 10 laser range finders to avoid obstacles. The modification planner is evaluated in the context of F1/10th autonomous racing and compared to an end-to-end learning baseline, the Follow the Gap Method and an optimisation-based planner. The results show that the modification planner can achieve faster average times compared to the baseline end-to-end planner and a 94% success rate, which is similar to the baseline.

Learning Efficient Navigation in Vortical Flow Fields

Authors:Peter Gunnarson, Ioannis Mandralis, Guido Novati, Petros Koumoutsakos, John O. Dabiri
Date:2021-02-21 07:25:03

Efficient point-to-point navigation in the presence of a background flow field is important for robotic applications such as ocean surveying. In such applications, robots may only have knowledge of their immediate surroundings or be faced with time-varying currents, which limits the use of optimal control techniques for planning trajectories. Here, we apply a novel Reinforcement Learning algorithm to discover time-efficient navigation policies to steer a fixed-speed swimmer through an unsteady two-dimensional flow field. The algorithm entails inputting environmental cues into a deep neural network that determines the swimmer's actions, and deploying Remember and Forget Experience replay. We find that the resulting swimmers successfully exploit the background flow to reach the target, but that this success depends on the type of sensed environmental cue. Surprisingly, a velocity sensing approach outperformed a bio-mimetic vorticity sensing approach by nearly two-fold in success rate. Equipped with local velocity measurements, the reinforcement learning algorithm achieved near 100% success in reaching the target locations while approaching the time-efficiency of paths found by a global optimal control planner.

Deep Latent Competition: Learning to Race Using Visual Control Policies in Latent Space

Authors:Wilko Schwarting, Tim Seyde, Igor Gilitschenski, Lucas Liebenwein, Ryan Sander, Sertac Karaman, Daniela Rus
Date:2021-02-19 09:00:29

Learning competitive behaviors in multi-agent settings such as racing requires long-term reasoning about potential adversarial interactions. This paper presents Deep Latent Competition (DLC), a novel reinforcement learning algorithm that learns competitive visual control policies through self-play in imagination. The DLC agent imagines multi-agent interaction sequences in the compact latent space of a learned world model that combines a joint transition function with opponent viewpoint prediction. Imagined self-play reduces costly sample generation in the real world, while the latent representation enables planning to scale gracefully with observation dimensionality. We demonstrate the effectiveness of our algorithm in learning competitive behaviors on a novel multi-agent racing benchmark that requires planning from image observations. Code and videos available at https://sites.google.com/view/deep-latent-competition.

Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

Authors:Zhihan Xiong, Ruoqi Shen, Qiwen Cui, Maryam Fazel, Simon S. Du
Date:2021-02-19 01:42:50

We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithm enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Processes, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions, and for the first time, matches the $\Omega\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.

Multi-Agent Reinforcement Learning of 3D Furniture Layout Simulation in Indoor Graphics Scenes

Authors:Xinhan Di, Pengqian Yu
Date:2021-02-18 03:20:35

In the industrial interior design process, professional designers plan the furniture layout to achieve a satisfactory 3D design for selling. In this paper, we explore the interior graphics scene design task as a Markov decision process (MDP) in 3D simulation, which is solved by multi-agent reinforcement learning. The goal is to produce a furniture layout in the 3D simulation of the indoor graphics scenes. In particular, we first transform the 3D interior graphics scenes into two 2D simulated scenes. We then design the simulated environment and apply two reinforcement learning agents to learn the optimal 3D layout for the MDP formulation in a cooperative way. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts as compared with the state-of-the-art model. The developed simulator and code are available at https://github.com/CODE-SUBMIT/simulator2.

Multi-Stage Transmission Line Flow Control Using Centralized and Decentralized Reinforcement Learning Agents

Authors:Xiumin Shang, Jinping Yang, Bingquan Zhu, Lin Ye, Jing Zhang, Jianping Xu, Qin Lyu, Ruisheng Diao
Date:2021-02-16 19:54:30

Planning future operational scenarios of bulk power systems that meet security and economic constraints typically requires intensive labor efforts in performing massive simulations. To automate this process and relieve engineers' burden, a novel multi-stage control approach is presented in this paper to train centralized and decentralized reinforcement learning agents that can automatically adjust grid controllers for regulating transmission line flows under normal conditions and under contingencies. The power grid flow control problem is formulated as a Markov Decision Process (MDP). At stage one, a centralized soft actor-critic (SAC) agent is trained to control generator active power outputs in a wide area to control transmission line flows against specified security limits. If line overloading issues remain unresolved, stage two is used to train a decentralized SAC agent via load throw-over at local substations. The effectiveness of the proposed approach is verified on a series of actual planning cases used for operating the power grid of SGCC Zhejiang Electric Power Company.

COMBO: Conservative Offline Model-Based Policy Optimization

Authors:Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn
Date:2021-02-16 18:50:32

Model-based algorithms, which learn a dynamics model from logged experience and perform some sort of pessimistic planning under the learned model, have emerged as a promising paradigm for offline reinforcement learning (offline RL). However, practical variants of such model-based algorithms rely on explicit uncertainty quantification for incorporating pessimism. Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model. This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation. We theoretically show that our method optimizes a lower bound on the true policy value, that this bound is tighter than that of prior methods, and that our approach satisfies a policy improvement guarantee in the offline setting. Through experiments, we find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods on widely studied offline RL benchmarks, including image-based tasks.

Deep Reinforcement Learning for Backup Strategies against Adversaries

Authors:Pascal Debus, Nicolas Müller, Konstantin Böttinger
Date:2021-02-12 17:19:44

Many defensive measures in cyber security are still dominated by heuristics, catalogs of standard procedures, and best practices. Considering the case of data backup strategies, we aim towards mathematically modeling the underlying threat models and decision problems. By formulating backup strategies in the language of stochastic processes, we can translate the challenge of finding optimal defenses into a reinforcement learning problem. This enables us to train autonomous agents that learn to optimally support planning of defense processes. In particular, we tackle the problem of finding an optimal backup scheme in the following adversarial setting: Given $k$ backup devices, the goal is to defend against an attacker who can infect data at one time but chooses to destroy or encrypt it at a later time, potentially also corrupting multiple backups made in between. In this setting, the usual round-robin scheme, which always replaces the oldest backup, is no longer optimal with respect to avoidable exposure. Thus, to find a defense strategy, we model the problem as a hybrid discrete-continuous action space Markov decision process and subsequently solve it using deep deterministic policy gradients. We show that the proposed algorithm can find storage device update schemes which match or exceed existing schemes with respect to various exposure metrics.
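The weakness of the round-robin scheme described above can be seen in a small simulation. The sketch below assumes a simple discrete-time model with one backup per time step, which is a simplification introduced here for illustration and not the paper's hybrid discrete-continuous formulation; all function names are hypothetical.

```python
def round_robin_backups(k, steps):
    """Simulate `steps` backups on k devices, always overwriting the oldest copy.

    Returns the time step at which each device was last written (None if unused).
    """
    last_written = [None] * k
    for t in range(steps):
        # Unused devices count as oldest; ties resolve to the lowest index.
        oldest = min(
            range(k),
            key=lambda i: -1 if last_written[i] is None else last_written[i],
        )
        last_written[oldest] = t
    return last_written


def has_clean_backup(last_written, infection_time):
    """A backup is clean only if it was written strictly before the infection."""
    return any(w is not None and w < infection_time for w in last_written)


# With 3 devices and 10 backups, the devices hold copies from steps 9, 7 and 8.
state = round_robin_backups(k=3, steps=10)
survives_recent_infection = has_clean_backup(state, infection_time=8)
survives_older_infection = has_clean_backup(state, infection_time=7)
```

Under this toy model, an attacker who infects the data and then waits at least k backup steps before striking finds every device already overwritten with a corrupted copy, which is why always replacing the oldest backup can be suboptimal in the adversarial setting.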

Hedging of Financial Derivative Contracts via Monte Carlo Tree Search

Authors:Oleg Szehr
Date:2021-02-11 21:17:01

The construction of approximate replication strategies for pricing and hedging of derivative contracts in incomplete markets is a key problem of financial engineering. Recently, Reinforcement Learning algorithms for hedging under realistic market conditions have attracted significant interest. While research in the derivatives area mostly focused on variations of $Q$-learning, in artificial intelligence Monte Carlo Tree Search is the recognized state-of-the-art method for various planning problems, such as the games of Hex, Chess, and Go. This article introduces Monte Carlo Tree Search as a method to solve the stochastic optimal control problem behind the pricing and hedging tasks. As compared to $Q$-learning, it combines Reinforcement Learning with tree search techniques. As a consequence, Monte Carlo Tree Search has higher sample efficiency, is less prone to over-fitting to specific market models and generally learns stronger policies faster. In our experiments we find that Monte Carlo Tree Search, being the world-champion in games like Chess and Go, is easily capable of maximizing the utility of investor's terminal wealth without setting up an auxiliary mathematical framework.

Improving Model-Based Reinforcement Learning with Internal State Representations through Self-Supervision

Authors:Julien Scholz, Cornelius Weber, Muhammad Burhan Hafez, Stefan Wermter
Date:2021-02-10 17:55:04

Using a model of the environment, reinforcement learning agents can plan their future moves and achieve superhuman performance in board games like Chess, Shogi, and Go, while remaining relatively sample-efficient. As demonstrated by the MuZero Algorithm, the environment model can even be learned dynamically, generalizing the agent to many more tasks while at the same time achieving state-of-the-art performance. Notably, MuZero uses internal state representations derived from real environment states for its predictions. In this paper, we bind the model's predicted internal state representation to the environment state via two additional terms: a reconstruction model loss and a simpler consistency loss, both of which work independently and unsupervised, acting as constraints to stabilize the learning process. Our experiments show that this new integration of reconstruction model loss and simpler consistency loss provides a significant performance increase in OpenAI Gym environments. Our modifications also enable self-supervised pretraining for MuZero, so the algorithm can learn about environment dynamics before a goal is made available.

Adaptive Processor Frequency Adjustment for Mobile Edge Computing with Intermittent Energy Supply

Authors:Tiansheng Huang, Weiwei Lin, Xiaobin Hong, Xiumin Wang, Qingbo Wu, Rui Li, Ching-Hsien Hsu, Albert Y. Zomaya
Date:2021-02-10 14:12:10

With astonishing speed, bandwidth, and scale, Mobile Edge Computing (MEC) has played an increasingly important role in the next generation of connectivity and service delivery. Yet, along with the massive deployment of MEC servers, the ensuing energy issue is now on an increasingly urgent agenda. In the current context, the large-scale deployment of renewable-energy-supplied MEC servers is perhaps the most promising solution for the incoming energy issue. Nonetheless, as a result of the intermittent nature of their power sources, these specially designed MEC servers must be more cautious about their energy usage, in a bid to maintain their service sustainability as well as service standard. Targeting optimization on a single-server MEC scenario, in this paper we propose NAFA, an adaptive processor frequency adjustment solution, to enable an effective plan of the server's energy usage. By learning from the historical data revealing request arrival and energy harvest patterns, the deep reinforcement learning-based solution is capable of making intelligent schedules on the server's processor frequency, so as to strike a good balance between service sustainability and service quality. The superior performance of NAFA is substantiated by real-data-based experiments, wherein NAFA demonstrates up to 20% increase in average request acceptance ratio and up to 50% reduction in average request processing time.

An advantage actor-critic algorithm for robotic motion planning in dense and dynamic scenarios

Authors:Chengmin Zhou, Bingding Huang, Pasi Fränti
Date:2021-02-05 12:30:23

Intelligent robots provide a new insight into efficiency improvement in industrial and service scenarios to replace human labor. However, these scenarios include dense and dynamic obstacles that make motion planning of robots challenging. Traditional algorithms like A* can plan collision-free trajectories in static environments, but their performance degrades and computational cost increases steeply in dense and dynamic scenarios. Optimal-value reinforcement learning (RL) algorithms can address these problems but suffer from slow and unstable network convergence. Policy gradient RL networks converge fast in Atari games where actions are discrete and finite, but little work has been done to address problems where continuous actions and large action spaces are required. In this paper, we modify the existing advantage actor-critic algorithm and adapt it to complex motion planning, so that optimal speeds and directions of the robot are generated. Experimental results demonstrate that our algorithm converges faster and more stably than optimal-value RL. It achieves a higher success rate in motion planning with less processing time for the robot to reach its goal.

Experience-Based Heuristic Search: Robust Motion Planning with Deep Q-Learning

Authors:Julian Bernhard, Robert Gieselmann, Klemens Esterle, Alois Knoll
Date:2021-02-05 12:08:11

Interaction-aware planning for autonomous driving requires an exploration of a combinatorial solution space when using conventional search- or optimization-based motion planners. With Deep Reinforcement Learning, optimal driving strategies for such problems can be derived also for higher-dimensional problems. However, these methods guarantee optimality of the resulting policy only in a statistical sense, which impedes their usage in safety critical systems, such as autonomous vehicles. Thus, we propose the Experience-Based-Heuristic-Search algorithm, which overcomes the statistical failure rate of a Deep-reinforcement-learning-based planner and still benefits computationally from the pre-learned optimal policy. Specifically, we show how experiences in the form of a Deep Q-Network can be integrated as heuristic into a heuristic search algorithm. We benchmark our algorithm in the field of path planning in semi-structured valet parking scenarios. There, we analyze the accuracy of such estimates and demonstrate the computational advantages and robustness of our method. Our method may encourage further investigation of the applicability of reinforcement-learning-based planning in the field of self-driving vehicles.

Deceptive Reinforcement Learning for Privacy-Preserving Planning

Authors:Zhengshang Liu, Yue Yang, Tim Miller, Peta Masters
Date:2021-02-05 06:50:04

In this paper, we study the problem of deceptive reinforcement learning to preserve the privacy of a reward function. Reinforcement learning is the problem of finding a behaviour policy based on rewards received from exploratory behaviour. A key ingredient in reinforcement learning is a reward function, which determines how much reward (negative or positive) is given and when. However, in some situations, we may want to keep a reward function private; that is, to make it difficult for an observer to determine the reward function used. We define the problem of privacy-preserving reinforcement learning, and present two models for solving it. These models are based on dissimulation -- a form of deception that `hides the truth'. We evaluate our models both computationally and via human behavioural experiments. Results show that the resulting policies are indeed deceptive, and that participants can determine the true reward function less reliably than that of an honest agent.

A review of motion planning algorithms for intelligent robotics

Authors:Chengmin Zhou, Bingding Huang, Pasi Fränti
Date:2021-02-04 02:24:04

We investigate and analyze the principles of typical motion planning algorithms. These include traditional planning algorithms, supervised learning, optimal value reinforcement learning, and policy gradient reinforcement learning. The traditional planning algorithms we investigate include graph search algorithms, sampling-based algorithms, and interpolating curve algorithms. Supervised learning algorithms include MSVM, LSTM, MCTS and CNN. Optimal value reinforcement learning algorithms include Q learning, DQN, double DQN, and dueling DQN. Policy gradient algorithms include the policy gradient method, the actor-critic algorithm, A3C, A2C, DPG, DDPG, TRPO and PPO. New general criteria are also introduced to evaluate the performance and application of motion planning algorithms through analytical comparisons. The convergence speed and stability of optimal value and policy gradient algorithms are analyzed in particular. Future directions are derived from the principles and analytical comparisons of motion planning algorithms. This paper provides researchers with a clear and comprehensive understanding of the advantages, disadvantages, relationships, and future of motion planning algorithms in robotics, and paves the way for better motion planning algorithms.

Multi-UAV Mobile Edge Computing and Path Planning Platform based on Reinforcement Learning

Authors:Huan Chang, Yicheng Chen, Baochang Zhang, David Doermann
Date:2021-02-03 14:22:36

Unmanned aerial vehicles (UAVs) are widely used as network processors in mobile networks, and more recently they have been used in Mobile Edge Computing as mobile servers. However, using UAVs in complex environments with obstacles, and coordinating cooperation between UAVs, poses significant challenges. We introduce a new multi-UAV Mobile Edge Computing platform, which aims to provide better Quality-of-Service and path planning based on reinforcement learning to address these issues. The contributions of our work include: 1) optimizing the quality of service for mobile edge computing and path planning in the same reinforcement learning framework; 2) using a sigmoid-like function to depict the terminal users' demand to ensure a higher quality of service; 3) jointly considering the terminal users' demand, risk, and geometric distance in the reinforcement learning reward matrix to ensure quality of service, risk avoidance, and cost savings. Simulations have shown the effectiveness and feasibility of our platform, which can help advance related research.
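A reward that combines a sigmoid-like demand term with risk and distance penalties might look like the sketch below. The abstract does not give the functional form, so everything here is an assumption for illustration: the logistic shape, the parameter `steepness`, and the weights `w_demand`, `w_risk`, and `w_dist` are hypothetical.

```python
import math

def demand_satisfaction(served, demand, steepness=1.0):
    """Logistic score in (0, 1): 0.5 when the served amount equals the demand."""
    return 1.0 / (1.0 + math.exp(-steepness * (served - demand)))

def cell_reward(served, demand, risk, distance,
                w_demand=1.0, w_risk=0.5, w_dist=0.1):
    """Hypothetical reward: reward satisfied demand, penalize risk and travel."""
    return (w_demand * demand_satisfaction(served, demand)
            - w_risk * risk
            - w_dist * distance)

# Serving more of a user's demand raises the reward; higher risk lowers it.
low = cell_reward(served=2.0, demand=5.0, risk=0.2, distance=3.0)
high = cell_reward(served=6.0, demand=5.0, risk=0.2, distance=3.0)
risky = cell_reward(served=6.0, demand=5.0, risk=0.9, distance=3.0)
```

A saturating function of this kind gives diminishing returns once demand is met, which discourages the agent from over-serving one user at the expense of others.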

A deep learning model for gas storage optimization

Authors:Nicolas Curin, Michael Kettler, Xi Kleisinger-Yu, Vlatka Komaric, Thomas Krabichler, Josef Teichmann, Hanna Wutte
Date:2021-02-03 09:54:44

To the best of our knowledge, the application of deep learning in the field of quantitative risk management is still a relatively recent phenomenon. In this article, we utilize techniques inspired by reinforcement learning in order to optimize the operation plans of underground natural gas storage facilities. We provide a theoretical framework and assess the performance of the proposed method numerically in comparison to a state-of-the-art least-squares Monte-Carlo approach. Due to the inherent intricacy originating from the high-dimensional forward market as well as the numerous constraints and frictions, the optimization exercise can hardly be tackled by means of traditional techniques.

Reinforcement Learning with Probabilistic Boolean Network Models of Smart Grid Devices

Authors:Pedro J. Rivera Torres, Carlos Gershenson García, Samir Kanaan Izquierdo
Date:2021-02-02 04:13:30

The area of Smart Power Grids needs to constantly improve its efficiency and resilience, to provide high quality electrical power in a resistant grid, managing faults and avoiding failures. Achieving this requires high component reliability, adequate maintenance, and a studied failure occurrence. Correct system operation involves those activities, as well as novel methodologies to detect, classify, and isolate faults and failures, and to model and simulate processes with predictive algorithms and analytics (using data analysis and asset condition to plan and perform activities). We showcase the application of a complex-adaptive, self-organizing modeling method, Probabilistic Boolean Networks (PBN), as a way towards understanding the dynamics of smart grid devices, and to model and characterize their behavior. This work demonstrates that PBNs are equivalent to the standard Reinforcement Learning Cycle, in which the agent/model interacts with its environment and receives feedback from it in the form of a reward signal. Different reward structures were created in order to characterize preferred behavior. This information can be used to guide the PBN to avoid fault conditions and failures.

Improving Human Decision-Making by Discovering Efficient Strategies for Hierarchical Planning

Authors:Saksham Consul, Lovis Heindrich, Jugoslav Stojcheski, Falk Lieder
Date:2021-01-31 19:46:00

To make good decisions in the real world people need efficient planning strategies because their computational resources are limited. Knowing which planning strategies would work best for people in different situations would be very useful for understanding and improving human decision-making. But our ability to compute those strategies used to be limited to very small and very simple planning tasks. To overcome this computational bottleneck, we introduce a cognitively-inspired reinforcement learning method that can overcome this limitation by exploiting the hierarchical structure of human behavior. The basic idea is to decompose sequential decision problems into two sub-problems: setting a goal and planning how to achieve it. This hierarchical decomposition enables us to discover optimal strategies for human planning in larger and more complex tasks than was previously possible. The discovered strategies outperform existing planning algorithms and achieve a super-human level of computational efficiency. We demonstrate that teaching people to use those strategies significantly improves their performance in sequential decision-making tasks that require planning up to eight steps ahead. By contrast, none of the previous approaches was able to improve human performance on these problems. These findings suggest that our cognitively-informed approach makes it possible to leverage reinforcement learning to improve human decision-making in complex sequential decision-problems. Future work can leverage our method to develop decision support systems that improve human decision making in the real world.

Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP

Authors:Zihan Zhang, Jiaqi Yang, Xiangyang Ji, Simon S. Du
Date:2021-01-29 18:57:52

This paper presents new \emph{variance-aware} confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). With the new confidence sets, we obtain the following regret bounds: For linear bandits, we obtain an $\tilde{O}(poly(d)\sqrt{1 + \sum_{k=1}^{K}\sigma_k^2})$ data-dependent regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the \emph{unknown} variance of the reward at the $k$-th round. This is the first regret bound that only scales with the variance and the dimension and has \emph{no explicit polynomial dependency on $K$}. When variances are small, this bound can be significantly smaller than the $\tilde{\Theta}\left(d\sqrt{K}\right)$ worst-case regret bound. For linear mixture MDPs, we obtain an $\tilde{O}(poly(d, \log H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon. This is the first regret bound that only scales \emph{logarithmically} with $H$ in the reinforcement learning with linear function approximation setting, thus \emph{exponentially improving} existing results, and resolving an open problem in \citep{zhou2020nearly}. We develop three technical ideas that may be of independent interest: 1) applications of the peeling technique to both the input norm and the variance magnitude, 2) a recursion-based estimator for the variance, and 3) a new convex potential lemma that generalizes the seminal elliptical potential lemma.

The MineRL 2020 Competition on Sample Efficient Reinforcement Learning using Human Priors

Authors:William H. Guss, Mario Ynocente Castro, Sam Devlin, Brandon Houghton, Noboru Sean Kuno, Crissman Loomis, Stephanie Milani, Sharada Mohanty, Keisuke Nakata, Ruslan Salakhutdinov, John Schulman, Shinya Shiroshita, Nicholay Topin, Avinash Ummadisingu, Oriol Vinyals
Date:2021-01-26 20:32:30

Although deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples, affording only a shrinking segment of the AI community access to their development. Resolution of these limitations requires new, sample-efficient methods. To facilitate research in this direction, we propose this second iteration of the MineRL Competition. The primary goal of the competition is to foster the development of algorithms which can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments. To that end, participants compete under a limited environment sample-complexity budget to develop systems which solve the MineRL ObtainDiamond task in Minecraft, a sequential decision making environment requiring long-term planning, hierarchical control, and efficient exploration methods. The competition is structured into two rounds in which competitors are provided several paired versions of the dataset and environment with different game textures and shaders. At the end of each round, competitors submit containerized versions of their learning algorithms to the AIcrowd platform, where they are trained from scratch on a hold-out dataset-environment pair for a total of 4 days on a pre-specified hardware platform. In this follow-up iteration to the NeurIPS 2019 MineRL Competition, we implement new features to expand the scale and reach of the competition. In response to the feedback of the previous participants, we introduce a second minor track focusing on solutions without access to environment interactions of any kind except during test-time. Further, we aim to prompt domain agnostic submissions by implementing several novel competition mechanics including action-space randomization and desemantization of observations and actions.

Independent Control and Path Planning of Microswimmers with a Uniform Magnetic Field

Authors:Lucas Amoudruz, Petros Koumoutsakos
Date:2021-01-26 08:35:14

Artificial bacterial flagella (ABFs) are magnetic helical micro-swimmers that can be remotely controlled via a uniform, rotating magnetic field. Previous studies have used the heterogeneous response of microswimmers to external magnetic fields for achieving independent control. Here we introduce analytical and reinforcement learning control strategies for path planning to a target by multiple swimmers using a uniform magnetic field. The comparison of the two algorithms shows the superiority of reinforcement learning in achieving minimal travel time to a target. The results demonstrate, for the first time, the effective independent navigation of realistic micro-swimmers with a uniform magnetic field in a viscous flow field.

Reinforcement Learning Based Temporal Logic Control with Soft Constraints Using Limit-deterministic Generalized Buchi Automata

Authors:Mingyu Cai, Shaoping Xiao, Zhijun Li, Zhen Kan
Date:2021-01-25 18:09:11

This paper studies the control synthesis of motion planning subject to uncertainties. The uncertainties are considered in robot motions and environment properties, giving rise to the probabilistic labeled Markov decision process (PL-MDP). A model-free reinforcement learning (RL) method is developed to generate a finite-memory control policy to satisfy high-level tasks expressed in linear temporal logic (LTL) formulas. Due to uncertainties and potentially conflicting tasks, this work focuses on infeasible LTL specifications, where a relaxed LTL constraint is developed to allow the agent to revise its motion plan and take violations of original tasks into account for partial satisfaction. In addition, a novel automaton is developed to improve the density of accepting rewards and enable deterministic policies. We propose an RL framework with rigorous analysis that is guaranteed to achieve multiple objectives in decreasing order: 1) satisfying the acceptance condition of the relaxed product MDP and 2) reducing the violation cost over long-term behaviors. We provide simulation and experimental results to validate the performance.

Deep Reinforcement Learning for Producing Furniture Layout in Indoor Scenes

Authors:Xinhan Di, Pengqian Yu
Date:2021-01-19 04:38:58

In the industrial interior design process, professional designers plan the size and position of furniture in a room to achieve a satisfactory design for selling. In this paper, we explore the interior scene design task as a Markov decision process (MDP), which is solved by deep reinforcement learning. The goal is to produce an accurate position and size of the furniture simultaneously for the indoor layout task. In particular, we first formulate the furniture layout task as an MDP problem by defining the state, action, and reward function. We then design the simulated environment and train reinforcement learning agents to produce the optimal layout for the MDP formulation. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts as compared with the state-of-the-art model. The developed simulator and code are available at \url{https://github.com/CODE-SUBMIT/simulator1}.

A Safe Hierarchical Planning Framework for Complex Driving Scenarios based on Reinforcement Learning

Authors:Jinning Li, Liting Sun, Jianyu Chen, Masayoshi Tomizuka, Wei Zhan
Date:2021-01-17 20:45:42

Autonomous vehicles need to handle various traffic conditions and make safe and efficient decisions and maneuvers. However, on the one hand, a single optimization/sampling-based motion planner cannot efficiently generate safe trajectories in real time, particularly when there are many interactive vehicles nearby. On the other hand, end-to-end learning methods cannot assure the safety of the outcomes. To address this challenge, we propose a hierarchical behavior planning framework with a set of low-level safe controllers and a high-level reinforcement learning algorithm (H-CtRL) as a coordinator for the low-level controllers. Safety is guaranteed by the low-level optimization/sampling-based controllers, while the high-level reinforcement learning algorithm makes H-CtRL an adaptive and efficient behavior planner. To train and test our proposed algorithm, we built a simulator that can reproduce traffic scenes using real-world datasets. The proposed H-CtRL proves effective in various realistic simulation scenarios, with satisfactory performance in terms of both safety and efficiency.

Affordance-based Reinforcement Learning for Urban Driving

Authors:Tanmay Agarwal, Hitesh Arora, Jeff Schneider
Date:2021-01-15 05:21:25

Traditional autonomous vehicle pipelines that follow a modular approach have been very successful in the past both in academia and industry, which has led to autonomy deployed on the road. Though this approach provides ease of interpretation, its generalizability to unseen environments is limited and hand-engineering of numerous parameters is required, especially in the prediction and planning systems. Recently, deep reinforcement learning has been shown to learn complex strategic games and perform challenging robotic tasks, which provides an appealing framework for learning to drive. In this work, we propose a deep reinforcement learning framework to learn an optimal control policy using waypoints and low-dimensional visual representations, also known as affordances. We demonstrate that our agents, when trained from scratch, learn the tasks of lane-following, driving around intersections, and stopping in front of other actors or traffic lights, even in the dense traffic setting. We note that our method achieves comparable or better performance than the baseline methods on the original and NoCrash benchmarks on the CARLA simulator.

Learning Kinematic Feasibility for Mobile Manipulation through Deep Reinforcement Learning

Authors:Daniel Honerkamp, Tim Welschehold, Abhinav Valada
Date:2021-01-13 20:00:44

Mobile manipulation tasks remain one of the critical challenges for the widespread adoption of autonomous robots in both service and industrial scenarios. While planning approaches are good at generating feasible whole-body robot trajectories, they struggle with dynamic environments as well as the incorporation of constraints given by the task and the environment. On the other hand, dynamic motion models in the action space struggle with generating kinematically feasible trajectories for mobile manipulation actions. We propose a deep reinforcement learning approach to learn feasible dynamic motions for a mobile base while the end-effector follows a trajectory in task space generated by an arbitrary system to fulfill the task at hand. This modular formulation has several benefits: it enables us to readily transform a broad range of end-effector motions into mobile applications, it allows us to use the kinematic feasibility of the end-effector trajectory as a dense reward signal and its modular formulation allows it to generalise to unseen end-effector motions at test time. We demonstrate the capabilities of our approach on multiple mobile robot platforms with different kinematic abilities and different types of wheeled platforms in extensive simulated as well as real-world experiments.

Comparative Analysis of Agent-Oriented Task Assignment and Path Planning Algorithms Applied to Drone Swarms

Authors:Rohith Gandhi Ganesan, Samantha Kappagoda, Giuseppe Loianno, David K. A. Mordecai
Date:2021-01-13 15:59:01

Autonomous drone swarms are a burgeoning technology with significant applications in the fields of mapping, inspection, transportation and monitoring. To complete a task, each drone has to accomplish a sub-goal within the context of the overall task at hand and navigate through the environment while avoiding collisions with obstacles and with other agents. In this work, we choose the task of optimal coverage of an environment with drone swarms, where the goal states and their positions are globally known but the obstacles are not. The drones have to choose which Points of Interest (PoI) in the environment to visit, along with the order in which to visit them, to ensure fast coverage. We model this task in a simulation and use an agent-oriented approach to solve the problem. We evaluate different policy networks trained with reinforcement learning algorithms based on their effectiveness, i.e. time taken to map the area, and efficiency, i.e. computational requirements. We couple the task assignment with path planning in a unique way to perform collision avoidance during navigation, and compare a grid-based global planning algorithm, i.e. Wavefront, and a gradient-based local planning algorithm, i.e. Potential Field. We also evaluate the Potential Field planning algorithm with different cost functions, propose a method to adaptively modify the velocity of the drone when using the Huber loss function to perform collision avoidance, and observe its effect on the trajectory of the drones. We demonstrate our experiments in 2D and 3D simulations.
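A gradient-based potential field planner with a Huber-style attractive term can be sketched as below. This is a generic textbook-style illustration, not the authors' implementation: the 2D point-mass model, the classic inverse-distance repulsive term, and all parameter names and values (`delta`, `influence`, `gain`, `step`, `tol`) are assumptions. The Huber gradient is linear near the goal and has constant magnitude far from it, which caps the commanded velocity at long range.

```python
import math

def huber_gradient(dx, dy, delta=1.0):
    """Attractive pull toward the goal: linear within `delta`, capped beyond it."""
    dist = math.hypot(dx, dy)
    if dist <= delta:
        return dx, dy
    return delta * dx / dist, delta * dy / dist

def repulsive_force(pos, obstacle, influence=2.0, gain=1.0):
    """Push away from an obstacle, active only within the `influence` radius."""
    dx, dy = pos[0] - obstacle[0], pos[1] - obstacle[1]
    rho = math.hypot(dx, dy)
    if rho >= influence or rho == 0.0:
        return 0.0, 0.0
    mag = gain * (1.0 / rho - 1.0 / influence) / rho ** 2
    return mag * dx / rho, mag * dy / rho

def plan(start, goal, obstacles, step=0.1, tol=0.05, max_iters=2000):
    """Follow the combined force field until within `tol` of the goal."""
    pos = start
    path = [pos]
    for _ in range(max_iters):
        if math.hypot(goal[0] - pos[0], goal[1] - pos[1]) < tol:
            break
        fx, fy = huber_gradient(goal[0] - pos[0], goal[1] - pos[1])
        for obs in obstacles:
            rx, ry = repulsive_force(pos, obs)
            fx += rx
            fy += ry
        pos = (pos[0] + step * fx, pos[1] + step * fy)
        path.append(pos)
    return path

# Obstacle-free run from the origin to a hypothetical goal at (3, 0).
trajectory = plan((0.0, 0.0), (3.0, 0.0), obstacles=[])
```

Local planners of this kind can stall in local minima when attractive and repulsive forces cancel, which is one reason to compare them against a global grid-based planner such as Wavefront.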

Automated Synthesis of Steady-State Continuous Processes using Reinforcement Learning

Authors:Quirin Göttl, Dominik G. Grimm, Jakob Burger
Date:2021-01-12 11:49:34

Automated flowsheet synthesis is an important field in computer-aided process engineering. The present work demonstrates how reinforcement learning can be used for automated flowsheet synthesis without any heuristics or prior knowledge of conceptual design. The environment consists of a steady-state flowsheet simulator that contains all physical knowledge. An agent is trained to take discrete actions and sequentially build up flowsheets that solve a given process problem. A novel method named SynGameZero is developed to ensure good exploration schemes in the complex problem. Therein, flowsheet synthesis is modelled as a game of two competing players. The agent plays this game against itself during training and consists of an artificial neural network and a tree search for forward planning. The method is applied successfully to a reaction-distillation process in a quaternary system.

Cross-Modal Contrastive Learning of Representations for Navigation using Lightweight, Low-Cost Millimeter Wave Radar for Adverse Environmental Conditions

Authors:Jui-Te Huang, Chen-Lung Lu, Po-Kai Chang, Ching-I Huang, Chao-Chun Hsu, Zu Lin Ewe, Po-Jui Huang, Hsueh-Cheng Wang
Date:2021-01-10 11:21:17

Deep reinforcement learning (RL), where the agent learns from mistakes, has been successfully applied to a variety of tasks. With the aim of learning collision-free policies for unmanned vehicles, deep RL has been used for training with various types of data, such as colored images, depth images, and LiDAR point clouds, without the use of classic map--localize--plan approaches. However, existing methods are limited by their reliance on cameras and LiDAR devices, which have degraded sensing under adverse environmental conditions (e.g., smoky environments). In response, we propose the use of single-chip millimeter-wave (mmWave) radar, which is lightweight and inexpensive, for learning-based autonomous navigation. However, because mmWave radar signals are often noisy and sparse, we propose a cross-modal contrastive learning for representation (CM-CLR) method that maximizes the agreement between mmWave radar data and LiDAR data in the training stage. We evaluated our method on a real-world robot and compared it with 1) a method with two separate networks using cross-modal generative reconstruction and an RL policy and 2) a baseline RL policy without cross-modal representation. Our proposed end-to-end deep RL policy with contrastive learning successfully navigated the robot through smoke-filled maze environments and achieved better performance compared with generative reconstruction methods, in which noisy artifact walls or obstacles were produced. All pretrained models and hardware settings are open access for reproducing this study and can be obtained at https://arg-nctu.github.io/projects/deeprl-mmWave.html

Identifying Decision Points for Safe and Interpretable Reinforcement Learning in Hypotension Treatment

Authors:Kristine Zhang, Yuanheng Wang, Jianzhun Du, Brian Chu, Leo Anthony Celi, Ryan Kindle, Finale Doshi-Velez
Date:2021-01-09 07:15:33

Many batch RL health applications first discretize time into fixed intervals. However, this discretization both loses resolution and forces a policy computation at each (potentially fine) interval. In this work, we develop a novel framework to compress continuous trajectories into a few interpretable decision points: places where the batch data support multiple alternatives. We apply our approach to create recommendations from a dataset of a cohort of hypotensive patients. Our reduced state space results in faster planning and allows easy inspection by a clinical expert.

Active Screening for Recurrent Diseases: A Reinforcement Learning Approach

Authors:Han-Ching Ou, Haipeng Chen, Shahin Jabbari, Milind Tambe
Date:2021-01-07 21:07:35

Active screening is a common approach in controlling the spread of recurring infectious diseases such as tuberculosis and influenza. In this approach, health workers periodically select a subset of the population for screening. However, given the limited number of health workers, only a small subset of the population can be visited in any given time period. Given the recurrent nature of the disease and rapid spreading, the goal is to minimize the number of infections over a long time horizon. Active screening can be formalized as a sequential combinatorial optimization over the network of people and their connections. The main computational challenges in this formalization arise from i) the combinatorial nature of the problem, ii) the need for sequential planning and iii) the uncertainties in the infectiousness states of the population. Previous works on active screening fail to scale to large time horizons while fully considering the future effect of current interventions. In this paper, we propose a novel reinforcement learning (RL) approach based on Deep Q-Networks (DQN), with several innovative adaptations that are designed to address the above challenges. First, we use graph convolutional networks (GCNs) to represent the Q-function, which exploits the node correlations of the underlying contact network. Second, to avoid solving a combinatorial optimization problem in each time period, we decompose the node set selection into a sub-sequence of decisions, and further design a two-level RL framework that solves the problem in a hierarchical way. Finally, to speed up the slow convergence of RL that arises from reward sparseness, we incorporate ideas from curriculum learning into our hierarchical RL approach. We evaluate our RL algorithm on several real-world networks.

qRRT: Quality-Biased Incremental RRT for Optimal Motion Planning in Non-Holonomic Systems

Authors:Nahas Pareekutty, Francis James, Balaraman Ravindran, Suril V. Shah
Date:2021-01-07 17:10:11

This paper presents a sampling-based method for optimal motion planning in non-holonomic systems in the absence of known cost functions. It uses the principle of learning through experience to deduce the cost-to-go of regions within the workspace. This cost information is used to bias an incremental graph-based search algorithm that produces solution trajectories. Iterative improvement of cost information and search biasing produces solutions that are proven to be asymptotically optimal. The proposed framework builds on incremental Rapidly-exploring Random Trees (RRT) for random sampling-based search and Reinforcement Learning (RL) to learn workspace costs. A series of experiments were performed to evaluate and demonstrate the performance of the proposed method.

A Survey of Deep RL and IL for Autonomous Driving Policy Learning

Authors:Zeyu Zhu, Huijing Zhao
Date:2021-01-06 12:43:30

Autonomous driving (AD) agents generate driving policies based on online perception results, which are obtained at multiple levels of abstraction, e.g., behavior planning, motion planning and control. Driving policies are crucial to the realization of safe, efficient and harmonious driving behaviors, where AD agents still face substantial challenges in complex scenarios. Due to their successful application in fields such as robotics and video games, the use of deep reinforcement learning (DRL) and deep imitation learning (DIL) techniques to derive AD policies has witnessed vast research efforts in recent years. This paper is a comprehensive survey of this body of work, which is conducted at three levels: First, a taxonomy of the literature studies is constructed from the system perspective, among which five modes of integration of DRL/DIL models into an AD architecture are identified. Second, the formulations of DRL/DIL models for conducting specified AD tasks are comprehensively reviewed, where various designs on the model state and action spaces and the reinforcement learning rewards are covered. Finally, an in-depth review is conducted on how the critical issues of AD applications regarding driving safety, interaction with other traffic participants and uncertainty of the environment are addressed by the DRL/DIL models. To the best of our knowledge, this is the first survey to focus on AD policy learning using DRL/DIL, which is addressed simultaneously from the system, task-driven and problem-driven perspectives. We share and discuss findings, which may lead to the investigation of various topics in the future.

An A* Curriculum Approach to Reinforcement Learning for RGBD Indoor Robot Navigation

Authors:Kaushik Balakrishnan, Punarjay Chakravarty, Shubham Shrivastava
Date:2021-01-05 20:35:14

Training robots to navigate diverse environments is a challenging problem as it involves the confluence of several different perception tasks such as mapping and localization, followed by optimal path-planning and control. Recently released photo-realistic simulators such as Habitat allow for the training of networks that output control actions directly from perception: agents use Deep Reinforcement Learning (DRL) to regress directly from the camera image to a control output in an end-to-end fashion. This is data-inefficient and can take several days to train on a GPU. Our paper tries to overcome this problem by separating the training of the perception and control neural nets and increasing the path complexity gradually using a curriculum approach. Specifically, a pre-trained twin Variational AutoEncoder (VAE) is used to compress RGBD (RGB & depth) sensing from an environment into a latent embedding, which is then used to train a DRL-based control policy. A*, a traditional path-planner, is used as a guide for the policy, and the distance between start and target locations is incrementally increased along the A* route as training progresses. We demonstrate the efficacy of the proposed approach, both in terms of increased performance and decreased training times, for the PointNav task in the Habitat simulation environment. This strategy of improving the training of direct-perception based DRL navigation policies is expected to hasten the deployment of robots of particular interest to industry, such as co-bots on the factory floor and last-mile delivery robots.
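The curriculum over an A* route can be sketched as a function that picks the episode's start waypoint according to training progress. This is a minimal sketch under stated assumptions: the linear schedule, the function name `curriculum_start`, and the parameters `progress` and `min_steps` are hypothetical, since the abstract states only that the start-to-target distance grows along the A* route as training progresses.

```python
def curriculum_start(route, progress, min_steps=1):
    """Pick a start waypoint on `route` (ordered from full start to target).

    `progress` in [0, 1] is the fraction of training completed. Early in
    training the agent starts `min_steps` waypoints from the target; by the
    end it starts from the first waypoint of the route.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    max_steps = len(route) - 1
    if max_steps < min_steps:
        raise ValueError("route is shorter than min_steps")
    # ASSUMPTION: linear schedule; other schedules are equally plausible.
    steps_from_target = min_steps + int(progress * (max_steps - min_steps))
    return route[max_steps - steps_from_target]

# Hypothetical route of 11 waypoints; the last one is the target.
route = [(i, 0) for i in range(11)]
early = curriculum_start(route, progress=0.0)
late = curriculum_start(route, progress=1.0)
```

Starting close to the target makes early episodes short and rewards easier to reach, and lengthening the route later exposes the policy to progressively harder navigation problems.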

Context-Aware Safe Reinforcement Learning for Non-Stationary Environments

Authors:Baiming Chen, Zuxin Liu, Jiacheng Zhu, Mengdi Xu, Wenhao Ding, Ding Zhao
Date:2021-01-02 23:52:22

Safety is a critical concern when deploying reinforcement learning agents for realistic tasks. Recently, safe reinforcement learning algorithms have been developed to optimize the agent's performance while avoiding violations of safety constraints. However, few studies have addressed the non-stationary disturbances in the environments, which may cause catastrophic outcomes. In this paper, we propose the context-aware safe reinforcement learning (CASRL) method, a meta-learning framework to realize safe adaptation in non-stationary environments. We use a probabilistic latent variable model to achieve fast inference of the posterior environment transition distribution given the context data. Safety constraints are then evaluated with uncertainty-aware trajectory sampling. The high cost of safety violations leads to the rareness of unsafe records in the dataset. We address this issue by enabling prioritized sampling during model training and formulating prior safety constraints with domain knowledge during constrained planning. The algorithm is evaluated in realistic safety-critical environments with non-stationary disturbances. Results show that the proposed algorithm significantly outperforms existing baselines in terms of safety and robustness.

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost

Authors:Minbo Gao, Tianle Xie, Simon S. Du, Lin F. Yang
Date:2021-01-02 18:41:27

Many real-world applications, such as those in medical domains, recommendation systems, etc., can be formulated as large state space reinforcement learning problems with only a small budget of the number of policy changes, i.e., low switching cost. This paper focuses on the linear Markov Decision Process (MDP) recently studied in [Yang et al 2019, Jin et al 2020] where the linear function approximation is used for generalization on the large state space. We present the first algorithm for linear MDP with a low switching cost. Our algorithm achieves an $\widetilde{O}\left(\sqrt{d^3H^4K}\right)$ regret bound with a near-optimal $O\left(d H\log K\right)$ global switching cost where $d$ is the feature dimension, $H$ is the planning horizon and $K$ is the number of episodes the agent plays. Our regret bound matches the best existing polynomial algorithm by [Jin et al 2020] and our switching cost is exponentially smaller than theirs. When specialized to tabular MDP, our switching cost bound improves those in [Bai et al 2019, Zhang et al 2020]. We complement our positive result with an $\Omega\left(dH/\log d\right)$ global switching cost lower bound for any no-regret algorithm.

Effective Communications: A Joint Learning and Communication Framework for Multi-Agent Reinforcement Learning over Noisy Channels

Authors:Tze-Yang Tung, Szymon Kobus, Joan Roig Pujol, Deniz Gunduz
Date:2021-01-02 10:43:41

We propose a novel formulation of the "effectiveness problem" in communications, put forth by Shannon and Weaver in their seminal work [2], by considering multiple agents communicating over a noisy channel in order to achieve better coordination and cooperation in a multi-agent reinforcement learning (MARL) framework. Specifically, we consider a multi-agent partially observable Markov decision process (MA-POMDP), in which the agents, in addition to interacting with the environment can also communicate with each other over a noisy communication channel. The noisy communication channel is considered explicitly as part of the dynamics of the environment and the message each agent sends is part of the action that the agent can take. As a result, the agents learn not only to collaborate with each other but also to communicate "effectively" over a noisy channel. This framework generalizes both the traditional communication problem, where the main goal is to convey a message reliably over a noisy channel, and the "learning to communicate" framework that has received recent attention in the MARL literature, where the underlying communication channels are assumed to be error-free. We show via examples that the joint policy learned using the proposed framework is superior to that where the communication is considered separately from the underlying MA-POMDP. This is a very powerful framework, which has many real world applications, from autonomous vehicle planning to drone swarm control, and opens up the rich toolbox of deep reinforcement learning for the design of multi-user communication systems.

Inverse reinforcement learning for autonomous navigation via differentiable semantic mapping and planning

Authors:Tianyu Wang, Vikas Dhiman, Nikolay Atanasov
Date:2021-01-01 07:41:08

This paper focuses on inverse reinforcement learning for autonomous navigation using distance and semantic category observations. The objective is to infer a cost function that explains demonstrated behavior while relying only on the expert's observations and state-control trajectory. We develop a map encoder, that infers semantic category probabilities from the observation sequence, and a cost encoder, defined as a deep neural network over the semantic features. Since the expert cost is not directly observable, the model parameters can only be optimized by differentiating the error between demonstrated controls and a control policy computed from the cost estimate. We propose a new model of expert behavior that enables error minimization using a closed-form subgradient computed only over a subset of promising states via a motion planning algorithm. Our approach allows generalizing the learned behavior to new environments with new spatial configurations of the semantic categories. We analyze the different components of our model in a minigrid environment. We also demonstrate that our approach learns to follow traffic rules in the autonomous driving CARLA simulator by relying on semantic observations of buildings, sidewalks, and road lanes.

Robotic Grasping of Fully-Occluded Objects using RF Perception

Authors:Tara Boroushaki, Junshan Leng, Ian Clester, Alberto Rodriguez, Fadel Adib
Date:2020-12-31 04:01:45

We present the design, implementation, and evaluation of RF-Grasp, a robotic system that can grasp fully-occluded objects in unknown and unstructured environments. Unlike prior systems that are constrained by the line-of-sight perception of vision and infrared sensors, RF-Grasp employs RF (Radio Frequency) perception to identify and locate target objects through occlusions, and perform efficient exploration and complex manipulation tasks in non-line-of-sight settings. RF-Grasp relies on an eye-in-hand camera and batteryless RFID tags attached to objects of interest. It introduces two main innovations: (1) an RF-visual servoing controller that uses the RFID's location to selectively explore the environment and plan an efficient trajectory toward an occluded target, and (2) an RF-visual deep reinforcement learning network that can learn and execute efficient, complex policies for decluttering and grasping. We implemented and evaluated an end-to-end physical prototype of RF-Grasp. We demonstrate it improves success rate and efficiency by up to 40-50% over a state-of-the-art baseline. We also demonstrate RF-Grasp in novel tasks such as mechanical search of fully-occluded objects behind obstacles, opening up new possibilities for robotic manipulation. Qualitative results (videos) available at rfgrasp.media.mit.edu

Model-Based Visual Planning with Self-Supervised Functional Distances

Authors:Stephen Tian, Suraj Nair, Frederik Ebert, Sudeep Dasari, Benjamin Eysenbach, Chelsea Finn, Sergey Levine
Date:2020-12-30 23:59:09

A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model as well as a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely using offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test-time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods. Videos and visualizations are available here: http://sites.google.com/berkeley.edu/mbold.

Disentangled Planning and Control in Vision Based Robotics via Reward Machines

Authors:Alberto Camacho, Jacob Varley, Deepali Jain, Atil Iscen, Dmitry Kalashnikov
Date:2020-12-28 19:54:40

In this work we augment a Deep Q-Learning agent with a Reward Machine (DQRM) to increase speed of learning vision-based policies for robot tasks, and overcome some of the limitations of DQN that prevent it from converging to good-quality policies. A reward machine (RM) is a finite state machine that decomposes a task into a discrete planning graph and equips the agent with a reward function to guide it toward task completion. The reward machine can be used for both reward shaping, and informing the policy what abstract state it is currently at. An abstract state is a high level simplification of the current state, defined in terms of task relevant features. These two supervisory signals of reward shaping and knowledge of current abstract state coming from the reward machine complement each other and can both be used to improve policy performance as demonstrated on several vision based robotic pick and place tasks. Particularly for vision based robotics applications, it is often easier to build a reward machine than to try and get a policy to learn the task without this structure.
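
To make the reward machine concept concrete, here is a minimal sketch of a finite state machine that tracks an abstract task state and emits shaped rewards. It is an assumption-laden illustration, not the DQRM implementation: the class `RewardMachine`, the event names, the abstract states, and the reward values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite state machine over abstract task states (hypothetical API)."""
    initial: str
    terminal: set
    # Maps (abstract_state, event) -> (next_abstract_state, shaped_reward).
    transitions: dict = field(default_factory=dict)

    def step(self, state, event):
        # Unlisted (state, event) pairs keep the abstract state and give zero reward.
        return self.transitions.get((state, event), (state, 0.0))


# Hypothetical pick-and-place task: reach, then grasp, then place.
rm = RewardMachine(
    initial="start",
    terminal={"done"},
    transitions={
        ("start", "gripper_near_object"): ("reached", 0.1),
        ("reached", "object_grasped"): ("holding", 0.3),
        ("holding", "object_dropped"): ("start", -0.3),
        ("holding", "object_in_bin"): ("done", 1.0),
    },
)


def run_reward_machine(machine, events):
    """Feed a sequence of detected events through the machine."""
    state, total = machine.initial, 0.0
    trace = [state]
    for event in events:
        if state in machine.terminal:
            break
        state, reward = machine.step(state, event)
        total += reward
        trace.append(state)
    return state, total, trace


final_state, total_reward, trace = run_reward_machine(
    rm, ["gripper_near_object", "object_grasped", "object_in_bin"]
)
```

In this sketch the current abstract state could be appended to the policy's observation, and the per-transition reward added to the environment reward, which mirrors the two supervisory signals the abstract describes.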

SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning

Authors:Vasanth Sarathy, Daniel Kasenberg, Shivam Goel, Jivko Sinapov, Matthias Scheutz
Date:2020-12-24 00:31:02

Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains. However, they are typically handcrafted and tend to require precise formulations that are not robust to human error. Reinforcement learning (RL) approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards. However, RL approaches tend to require millions of episodes of experience and often learn policies that are not easily transferable to other tasks. In this paper, we address one aspect of the open problem of integrating these approaches: how can decision-making agents resolve discrepancies in their symbolic planning models while attempting to accomplish goals? We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed by the agent to accomplish goals that are initially unreachable for the agent. SPOTTER outperforms pure-RL approaches while also discovering transferable symbolic knowledge and does not require supervision, successful plan traces or any a priori knowledge about the missing planning operator.

Are We On The Same Page? Hierarchical Explanation Generation for Planning Tasks in Human-Robot Teaming using Reinforcement Learning

Authors:Mehrdad Zakershahrak, Samira Ghodratnama
Date:2020-12-22 02:14:52

Providing explanations is considered an imperative ability for an AI agent in a human-robot teaming framework. The right explanation provides the rationale behind an AI agent's decision-making. However, to maintain the human teammate's cognitive demand to comprehend the provided explanations, prior works have focused on providing explanations in a specific order or intertwining the explanation generation with plan execution. Moreover, these approaches do not consider the degree of details required to share throughout the provided explanations. In this work, we argue that the agent-generated explanations, especially the complex ones, should be abstracted to be aligned with the level of details the human teammate desires to maintain the recipient's cognitive load. Therefore, learning a hierarchical explanations model is a challenging task. Moreover, the agent needs to follow a consistent high-level policy to transfer the learned teammate preferences to a new scenario while lower-level detailed plans are different. Our evaluation confirmed that the process of understanding an explanation, especially a complex and detailed explanation, is hierarchical. The human preference that reflected this aspect corresponded exactly to creating and employing abstraction for knowledge assimilation hidden deeper in our cognitive process. We showed that hierarchical explanations achieved better task performance and behavior interpretability while reducing cognitive load. These results shed light on designing explainable agents utilizing reinforcement learning and planning across various domains.

Forming Real-World Human-Robot Cooperation for Tasks With General Goal

Authors:Lingfeng Tao, Michael Bowman, Jiucai Zhang, Xiaoli Zhang
Date:2020-12-19 20:27:09

In human-robot cooperation, the robot cooperates with humans to accomplish the task together. Existing approaches assume the human has a specific goal during the cooperation, and the robot infers and acts toward it. However, in real-world environments, a human usually only has a general goal (e.g., general direction or area in motion planning) at the beginning of the cooperation, which needs to be clarified to a specific goal (i.e., an exact position) during cooperation. The specification process is interactive and dynamic, which depends on the environment and the partner's behavior. The robot that does not consider the goal specification process may cause frustration to the human partner, elongate the time to come to an agreement, and compromise team performance. This work presents the Evolutionary Value Learning approach to model the dynamics of the goal specification process with State-based Multivariate Bayesian Inference and goal specificity-related features. This model enables the robot to enhance the process of the human's goal specification actively and find a cooperative policy in a Deep Reinforcement Learning manner. Our method outperforms existing methods with faster goal specification processes and better team performance in a dynamic ball balancing task with real human subjects.

Exact Reduction of Huge Action Spaces in General Reinforcement Learning

Authors:Sultan Javed Majeed, Marcus Hutter
Date:2020-12-18 12:45:03

The reinforcement learning (RL) framework formalizes the notion of learning with interactions. Many real-world problems have large state-spaces and/or action-spaces such as in Go, StarCraft, protein folding, and robotics or are non-Markovian, which cause significant challenges to RL algorithms. In this work we address the large action-space problem by sequentializing actions, which can reduce the action-space size significantly, even down to two actions at the expense of an increased planning horizon. We provide explicit and exact constructions and equivalence proofs for all quantities of interest for arbitrary history-based processes. In the case of MDPs, this could help RL algorithms that bootstrap. In this work we show how action-binarization in the non-MDP case can significantly improve Extreme State Aggregation (ESA) bounds. ESA allows casting any (non-MDP, non-ergodic, history-based) RL problem into a fixed-sized non-Markovian state-space with the help of a surrogate Markovian process. On the upside, ESA enjoys similar optimality guarantees as Markovian models do. But a downside is that the size of the aggregated state-space becomes exponential in the size of the action-space. In this work, we patch this issue by binarizing the action-space. We provide an upper bound on the number of states of this binarized ESA that is logarithmic in the original action-space size, a double-exponential improvement.
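
The general idea of sequentializing a large action space into binary decisions can be illustrated with a simple bit encoding. This is only a sketch of that idea, not the paper's exact construction or its equivalence proofs; the function names are hypothetical.

```python
def bits_needed(num_actions):
    """Number of binary decisions needed to select one of num_actions actions."""
    return max(1, (num_actions - 1).bit_length())


def encode(action, num_actions):
    """Turn one original action index into a sequence of binary actions (MSB first)."""
    k = bits_needed(num_actions)
    return [(action >> (k - 1 - i)) & 1 for i in range(k)]


def decode(bits):
    """Recover the original action index from the binary action sequence."""
    action = 0
    for b in bits:
        action = (action << 1) | b
    return action


# With 1000 original actions, each decision becomes 10 binary decisions,
# so the action space shrinks to size 2 while the horizon grows by a factor of 10.
steps_per_action = bits_needed(1000)
example_bits = encode(5, 8)
```

The trade-off stated in the abstract is visible here: the number of binary steps grows only logarithmically with the original action-space size.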

Content Masked Loss: Human-Like Brush Stroke Planning in a Reinforcement Learning Painting Agent

Authors:Peter Schaldenbrand, Jean Oh
Date:2020-12-18 04:02:13

The objective of most Reinforcement Learning painting agents is to minimize the loss between a target image and the paint canvas. Human painter artistry emphasizes important features of the target image rather than simply reproducing it (DiPaola 2007). Using adversarial or L2 losses in the RL painting models, although its final output is generally a work of finesse, produces a stroke sequence that is vastly different from that which a human would produce since the model does not have knowledge about the abstract features in the target image. In order to increase the human-like planning of the model without the use of expensive human data, we introduce a new loss function for use with the model's reward function: Content Masked Loss. In the context of robot painting, Content Masked Loss employs an object detection model to extract features which are used to assign higher weight to regions of the canvas that a human would find important for recognizing content. The results, based on 332 human evaluators, show that the digital paintings produced by our Content Masked model show detectable subject matter earlier in the stroke sequence than existing methods without compromising on the quality of the final painting. Our code is available at https://github.com/pschaldenbrand/ContentMaskedLoss.
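
One simple way to picture "assigning higher weight to regions of the canvas" is a per-pixel weighted L2 loss. The sketch below is an assumption for illustration and not the paper's formulation: the weights are supplied by hand here, whereas the abstract says they are derived from an object detection model, and the function name `masked_l2` is hypothetical. See the linked repository for the authors' actual loss.

```python
def masked_l2(target, canvas, weights):
    """Weighted mean squared error between two images given as nested lists.

    target, canvas: 2-D lists of pixel intensities with identical shape.
    weights: 2-D list of non-negative importance weights (hypothetical mask).
    """
    weighted_error = 0.0
    weight_sum = 0.0
    for t_row, c_row, w_row in zip(target, canvas, weights):
        for t, c, w in zip(t_row, c_row, w_row):
            weighted_error += w * (t - c) ** 2
            weight_sum += w
    return weighted_error / weight_sum


# The left pixel is marked as three times more important than the right one.
mask = [[3.0, 1.0]]
loss_important_wrong = masked_l2([[1.0, 0.0]], [[0.0, 0.0]], mask)
loss_background_wrong = masked_l2([[0.0, 1.0]], [[0.0, 0.0]], mask)
```

An error in the highly weighted pixel costs more than the same error in the background pixel, which would push a painting agent to resolve important regions earlier.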

ViNG: Learning Open-World Navigation with Visual Goals

Authors:Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, Sergey Levine
Date:2020-12-17 18:22:32

We propose a learning-based navigation system for reaching visually indicated goals and demonstrate this system on a real mobile robot platform. Learning provides an appealing alternative to conventional methods for robotic navigation: instead of reasoning about environments in terms of geometry and maps, learning can enable a robot to learn about navigational affordances, understand what types of obstacles are traversable (e.g., tall grass) or not (e.g., walls), and generalize over patterns in the environment. However, unlike conventional planning algorithms, it is harder to change the goal for a learned policy during deployment. We propose a method for learning to navigate towards a goal image of the desired destination. By combining a learned policy with a topological graph constructed out of previously observed data, our system can determine how to reach this visually indicated goal even in the presence of variable appearance and lighting. Three key insights, waypoint proposal, graph pruning and negative mining, enable our method to learn to navigate in real-world environments using only offline data, a setting where prior methods struggle. We instantiate our method on a real outdoor ground robot and show that our system, which we call ViNG, outperforms previously-proposed methods for goal-conditioned reinforcement learning, including other methods that incorporate reinforcement learning and search. We also study how ViNG generalizes to unseen environments and evaluate its ability to adapt to such an environment with growing experience. Finally, we demonstrate ViNG on a number of real-world applications, such as last-mile delivery and warehouse inspection. We encourage the reader to visit the project website for videos of our experiments and demonstrations: sites.google.com/view/ving-robot.

Online Shielding for Stochastic Systems

Authors:Bettina Könighofer, Julian Rudolf, Alexander Palmisano, Martin Tappler, Roderick Bloem
Date:2020-12-17 12:25:48

In this paper, we propose a method to develop trustworthy reinforcement learning systems. To ensure safety especially during exploration, we automatically synthesize a correct-by-construction runtime enforcer, called a shield, that blocks all actions that are unsafe with respect to a temporal logic specification from the agent. Our main contribution is a new synthesis algorithm for computing the shield online. Existing offline shielding approaches exhaustively compute the safety of all state-action combinations ahead of time, resulting in huge offline computation times, large memory consumption, and significant delays at run-time due to the look-ups in a huge database. The intuition behind online shielding is to compute during run-time the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our proposed method is general and can be applied to a wide range of planning problems with stochastic behavior. For our evaluation, we selected a 2-player version of the classical computer game SNAKE. The game requires fast decisions and the multiplayer setting induces a large state space, computationally expensive to analyze exhaustively. The safety objective of collision avoidance is easily transferable to a variety of planning tasks.
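
The intuition of online shielding (analyse only the states reachable in the near future, then look up the allowed actions when one of them is reached) can be sketched as below. This is a toy illustration under strong assumptions, not the synthesis algorithm from the paper: `risk` is a hypothetical stand-in for the safety analysis against a temporal logic specification, and the one-dimensional world, function names, and threshold are invented for the example.

```python
def reachable(state, successors, horizon):
    """Collect all states reachable from `state` within `horizon` steps."""
    frontier, seen = {state}, {state}
    for _ in range(horizon):
        frontier = {s2 for s in frontier for s2 in successors(s)} - seen
        seen |= frontier
    return seen


def build_shield(state, actions, successors, risk, horizon, threshold):
    """Precompute the allowed actions for every state reachable in the near future."""
    return {
        s: [a for a in actions if risk(s, a) <= threshold]
        for s in reachable(state, successors, horizon)
    }


# Hypothetical 1-D world: integer positions, moves of -1 or +1, hazard at position 0.
actions = (-1, 1)
successors = lambda s: {s - 1, s + 1}
risk = lambda s, a: 1.0 if s + a == 0 else 0.0   # stand-in for a safety analysis

shield = build_shield(3, actions, successors, risk, horizon=2, threshold=0.0)
```

When the agent later arrives at position 1, the shield already knows that moving left is blocked, so no exhaustive offline table over all states is needed.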

Stabilizing Q Learning Via Soft Mellowmax Operator

Authors:Yaozhong Gan, Zhe Zhang, Xiaoyang Tan
Date:2020-12-17 09:11:13

Learning complicated value functions in high dimensional state space by function approximation is a challenging task, partly because the max-operator used in temporal difference updates can theoretically cause instability for most linear or non-linear approximation schemes. Mellowmax is a recently proposed differentiable and non-expansion softmax operator that allows a convergent behavior in learning and planning. Unfortunately, the performance bound for the fixed point it converges to remains unclear, and in practice, its parameter is sensitive to various domains and has to be tuned case by case. Finally, the Mellowmax operator may suffer from oversmoothing as it ignores the probability being taken for each action when aggregating them. In this paper, we address all the above issues with an enhanced Mellowmax operator, named SM2 (Soft Mellowmax). Particularly, the proposed operator is reliable, easy to implement, and has a provable performance guarantee, while preserving all the advantages of Mellowmax. Furthermore, we show that our SM2 operator can be applied to the challenging multi-agent reinforcement learning scenarios, leading to stable value function approximation and state of the art performance.
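
For reference, the baseline Mellowmax operator that this abstract builds on is commonly written as mm_w(x) = (1/w) * log((1/n) * sum_i exp(w * x_i)). The sketch below implements only that baseline in a numerically stable form; it does not implement SM2, whose definition is not given in the abstract, and the function name and the restriction to w > 0 are choices made for this example.

```python
import math


def mellowmax(values, omega):
    """Mellowmax of a list of action values for a temperature omega > 0.

    Computed as m + log(mean(exp(omega * (v - m)))) / omega with m = max(values),
    which avoids overflow for large omega.
    """
    if omega <= 0:
        raise ValueError("this sketch only handles omega > 0")
    m = max(values)
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega


q_values = [0.0, 1.0, 2.0]
near_mean = mellowmax(q_values, 1e-6)    # small omega behaves like the average
near_max = mellowmax(q_values, 1000.0)   # large omega behaves like the max
```

The single parameter interpolates between averaging and maximizing, which is why, as the abstract notes, its value has to be tuned per domain.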

Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning

Authors:Mohammad Babaeizadeh, Mohammad Taghi Saffar, Danijar Hafner, Harini Kannan, Chelsea Finn, Sergey Levine, Dumitru Erhan
Date:2020-12-08 18:03:21

Model-based reinforcement learning (MBRL) methods have shown strong sample efficiency and performance across a variety of tasks, including when faced with high-dimensional visual observations. These methods learn to predict the environment dynamics and expected reward from interaction and use this predictive model to plan and perform the task. However, MBRL methods vary in their fundamental design choices, and there is no strong consensus in the literature on how these design decisions affect performance. In this paper, we study a number of design decisions for the predictive model in visual MBRL algorithms, focusing specifically on methods that use a predictive model for planning. We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance. A big exception to this finding is that predicting future observations (i.e., images) leads to significant task performance improvement compared to only predicting rewards. We also empirically find that image prediction accuracy, somewhat surprisingly, correlates more strongly with downstream task performance than reward prediction accuracy. We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks (that require exploration) will perform the same as the best-performing models when trained on the same training data. Simultaneously, in the absence of exploration, models that fit the data better usually perform better on the downstream task as well, but surprisingly, these are often not the same models that perform the best when learning and exploring from scratch. These findings suggest that performance and exploration place important and potentially contradictory requirements on the model.

NavRep: Unsupervised Representations for Reinforcement Learning of Robot Navigation in Dynamic Human Environments

Authors:Daniel Dugas, Juan Nieto, Roland Siegwart, Jen Jen Chung
Date:2020-12-08 12:51:14

Robot navigation is a task where reinforcement learning approaches are still unable to compete with traditional path planning. State-of-the-art methods differ in small ways, and do not all provide reproducible, openly available implementations. This makes comparing methods a challenge. Recent research has shown that unsupervised learning methods can scale impressively, and be leveraged to solve difficult problems. In this work, we design ways in which unsupervised learning can be used to assist reinforcement learning for robot navigation. We train two end-to-end, and 18 unsupervised-learning-based architectures, and compare them, along with existing approaches, in unseen test cases. We demonstrate our approach working on a real life robot. Our results show that unsupervised learning methods are competitive with end-to-end methods. We also highlight the importance of various components such as input representation, predictive unsupervised learning, and latent features. We make all our models publicly available, as well as training and testing environments, and tools. This release also includes OpenAI-gym-compatible environments designed to emulate the training conditions described by other papers, with as much fidelity as possible. Our hope is that this helps in bringing together the field of RL for robot navigation, and allows meaningful comparisons across state-of-the-art methods.

Reset-Free Lifelong Learning with Skill-Space Planning

Authors:Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch
Date:2020-12-07 09:33:02

The objective of lifelong reinforcement learning (RL) is to optimize agents which can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.

Amortized Q-learning with Model-based Action Proposals for Autonomous Driving on Highways

Authors:Branka Mirchevska, Maria Hügle, Gabriel Kalweit, Moritz Werling, Joschka Boedecker
Date:2020-12-06 11:04:40

Well-established optimization-based methods can guarantee an optimal trajectory for a short optimization horizon, typically no longer than a few seconds. As a result, choosing the optimal trajectory for this short horizon may still result in a sub-optimal long-term solution. At the same time, the resulting short-term trajectories allow for effective, comfortable and provably safe maneuvers in a dynamic traffic environment. In this work, we address the question of how to ensure an optimal long-term driving strategy, while keeping the benefits of classical trajectory planning. We introduce a Reinforcement Learning based approach that, coupled with a trajectory planner, learns an optimal long-term decision-making strategy for driving on highways. By generating locally optimal maneuvers online as actions, we balance between the infinite low-level continuous action space and the limited flexibility of a fixed number of predefined standard lane-change actions. We evaluated our method on realistic scenarios in the open-source traffic simulator SUMO and were able to achieve better performance than the 4 benchmark approaches we compared against, including a random action selecting agent, greedy agent, high-level, discrete actions agent and an IDM-based SUMO-controlled agent.

RLOC: Terrain-Aware Legged Locomotion using Reinforcement Learning and Optimal Control

Authors:Siddhant Gangapurwala, Mathieu Geisert, Romeo Orsolino, Maurice Fallon, Ioannis Havoutis
Date:2020-12-05 18:30:23

We present a unified model-based and data-driven approach for quadrupedal planning and control to achieve dynamic locomotion over uneven terrain. We utilize on-board proprioceptive and exteroceptive feedback to map sensory information and desired base velocity commands into footstep plans using a reinforcement learning (RL) policy. This RL policy is trained in simulation over a wide range of procedurally generated terrains. When run online, the system tracks the generated footstep plans using a model-based motion controller. We evaluate the robustness of our method over a wide variety of complex terrains. It exhibits behaviors which prioritize stability over aggressive locomotion. Additionally, we introduce two ancillary RL policies for corrective whole-body motion tracking and recovery control. These policies account for changes in physical parameters and external perturbations. We train and evaluate our framework on a complex quadrupedal system, ANYmal version B, and demonstrate transferability to a larger and heavier robot, ANYmal C, without requiring retraining.

Transfer Learning as an Enabler of the Intelligent Digital Twin

Authors:Benjamin Maschler, Dominik Braun, Nasser Jazdi, Michael Weyrich
Date:2020-12-03 13:51:05

Digital Twins have been described as beneficial in many areas, such as virtual commissioning, fault prediction or reconfiguration planning. Equipping Digital Twins with artificial intelligence functionalities can greatly expand those beneficial applications or open up altogether new areas of application, among them cross-phase industrial transfer learning. In the context of machine learning, transfer learning represents a set of approaches that enhance learning new tasks based upon previously acquired knowledge. Here, knowledge is transferred from one lifecycle phase to another in order to reduce the amount of data or time needed to train a machine learning algorithm. Looking at common challenges in developing and deploying industrial machinery with deep learning functionalities, embracing this concept would offer several advantages: Using an intelligent Digital Twin, learning algorithms can be designed, configured and tested in the design phase before the physical system exists and real data can be collected. Once real data becomes available, the algorithms must merely be fine-tuned, significantly speeding up commissioning and reducing the probability of costly modifications. Furthermore, using the Digital Twin's simulation capabilities virtually injecting rare faults in order to train an algorithm's response or using reinforcement learning, e.g. to teach a robot, become practically feasible. This article presents several cross-phase industrial transfer learning use cases utilizing intelligent Digital Twins. A real cyber physical production system consisting of an automated welding machine and an automated guided vehicle equipped with a robot arm is used to illustrate the respective benefits.

Coinbot: Intelligent Robotic Coin Bag Manipulation Using Deep Reinforcement Learning And Machine Teaching

Authors:Aleksei Gonnochenko, Aleksandr Semochkin, Dmitry Egorov, Dmitrii Statovoy, Seyedhassan Zabihifar, Aleksey Postnikov, Elena Seliverstova, Ali Zaidi, Jayson Stemmler, Kevin Limkrailassiri
Date:2020-12-02 17:56:44

Given the laborious difficulty of moving heavy bags of physical currency in the cash center of the bank, there is a large demand for training and deploying safe autonomous systems capable of conducting such tasks in a collaborative workspace. However, the deformable properties of the bag, along with the large quantity of rigid-body coins contained within it, significantly increase the challenges of bag detection, grasping and manipulation by a robotic gripper and arm. In this paper, we apply deep reinforcement learning and machine learning techniques to the task of controlling a collaborative robot to automate the unloading of coin bags from a trolley. To accomplish the task-specific process of gripping flexible materials like coin bags where the center of the mass changes during manipulation, a special gripper was implemented in simulation and designed in physical hardware. Leveraging a depth camera and object detection using deep learning, bag detection and pose estimation are performed to choose the optimal grasping point. An intelligent approach based on deep reinforcement learning has been introduced to propose the best configuration of the robot end-effector to maximize successful grasping. Boosted motion planning is utilized to increase the speed of motion planning during robot operation. Real-world trials with the proposed pipeline have demonstrated success rates over 96% in a real-world setting.

Obtain Employee Turnover Rate and Optimal Reduction Strategy Based On Neural Network and Reinforcement Learning

Authors:Xiaohan Cheng
Date:2020-12-01 15:48:23

Nowadays, human resources are an important part of the various resources of enterprises. For enterprises, high-loyalty and high-quality talent is often the core of their competitiveness. Therefore, it is of great practical significance to predict whether employees will leave and to reduce the employee turnover rate. First, this paper establishes a multi-layer perceptron model to predict the employee turnover rate. Second, a model based on Sarsa, a reinforcement learning algorithm, is proposed to automatically generate a set of strategies to reduce the employee turnover rate. These strategies are the ones that reduce the employee turnover rate the most at the lowest cost from the perspective of the enterprise, and can be used as a reference plan for the enterprise to optimize the employee system. The experimental results show that the algorithm can indeed improve the efficiency and accuracy of the specific strategy.
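
For readers unfamiliar with Sarsa, the following is a minimal tabular sketch of its on-policy update. It does not reproduce the paper's employee-turnover model: the four-state chain environment (`chain_step`), the function names, and all hyperparameters are hypothetical choices made only for illustration.

```python
import random


def epsilon_greedy(q, state, n_actions, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[state][a])


def sarsa(step, n_states, n_actions, episodes=500, alpha=0.1,
          gamma=0.95, eps=0.1, max_steps=100, seed=0):
    """Tabular Sarsa: the target uses the action actually chosen next."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # every episode starts in state 0 (assumption of this sketch)
        a = epsilon_greedy(q, s, n_actions, eps, rng)
        for _ in range(max_steps):
            s2, r, done = step(s, a)
            a2 = epsilon_greedy(q, s2, n_actions, eps, rng)
            target = r if done else r + gamma * q[s2][a2]
            q[s][a] += alpha * (target - q[s][a])
            if done:
                break
            s, a = s2, a2
    return q


def chain_step(s, a):
    """Hypothetical toy environment: action 1 moves right, action 0 stays.

    Reaching state 3 ends the episode with reward 1.
    """
    s2 = min(s + 1, 3) if a == 1 else s
    done = s2 == 3
    return s2, (1.0 if done else 0.0), done


q_table = sarsa(chain_step, n_states=4, n_actions=2)
```

In this toy setting the learned table prefers moving right in the state next to the goal; a real application would replace `chain_step` with a model of the enterprise's decision problem.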

Deep Reinforcement Learning for Crowdsourced Urban Delivery: System States Characterization, Heuristics-guided Action Choice, and Rule-Interposing Integration

Authors:Tanvir Ahamed, Bo Zou, Nahid Parvez Farazi, Theja Tulabandhula
Date:2020-11-29 19:50:34

This paper investigates the problem of assigning shipping requests to ad hoc couriers in the context of crowdsourced urban delivery. The shipping requests are spatially distributed, each with a limited time window between the earliest time for pickup and latest time for delivery. The ad hoc couriers, termed crowdsourcees, also have limited time availability and carrying capacity. We propose a new deep reinforcement learning (DRL)-based approach to tackling this assignment problem. A deep Q network (DQN) algorithm is trained which entails two salient features of experience replay and target network that enhance the efficiency, convergence, and stability of DRL training. More importantly, this paper makes three methodological contributions: 1) presenting a comprehensive and novel characterization of crowdshipping system states that encompasses spatial-temporal and capacity information of crowdsourcees and requests; 2) embedding heuristics that leverage the information offered by the state representation and are based on intuitive reasoning to guide specific actions to take, to preserve tractability and enhance efficiency of training; and 3) integrating rule-interposing to prevent repeated visiting of the same routes and node sequences during routing improvement, thereby further enhancing the training efficiency by accelerating learning. The effectiveness of the proposed approach is demonstrated through extensive numerical analysis. The results show the benefits brought by the heuristics-guided action choice and rule-interposing in DRL training, and the superiority of the proposed approach over existing heuristics in solution quality, time, and scalability. Besides the potential to improve the efficiency of crowdshipping operation planning, the proposed approach also provides a new avenue and generic framework for other problems in the vehicle routing context.

Reinforcement Learning in Nonzero-sum Linear Quadratic Deep Structured Games: Global Convergence of Policy Optimization

Authors:Masoud Roudneshin, Jalal Arabneydi, Amir G. Aghdam
Date:2020-11-29 15:53:21

We study model-based and model-free policy optimization in a class of nonzero-sum stochastic dynamic games called linear quadratic (LQ) deep structured games. In such games, players interact with each other through a set of weighted averages (linear regressions) of the states and actions. In this paper, we focus our attention to homogeneous weights; however, for the special case of infinite population, the obtained results extend to asymptotically vanishing weights wherein the players learn the sequential weighted mean-field equilibrium. Despite the non-convexity of the optimization in policy space and the fact that policy optimization does not generally converge in game setting, we prove that the proposed model-based and model-free policy gradient descent and natural policy gradient descent algorithms globally converge to the sub-game perfect Nash equilibrium. To the best of our knowledge, this is the first result that provides a global convergence proof of policy optimization in a nonzero-sum LQ game. One of the salient features of the proposed algorithms is that their parameter space is independent of the number of players, and when the dimension of state space is significantly larger than that of the action space, they provide a more efficient way of computation compared to those algorithms that plan and learn in the action space. Finally, some simulations are provided to numerically verify the obtained theoretical results.

Minimax Sample Complexity for Turn-based Stochastic Game

Authors:Qiwen Cui, Lin F. Yang
Date:2020-11-29 03:58:45

The empirical success of multi-agent reinforcement learning is encouraging, while few theoretical guarantees have been revealed. In this work, we prove that the plug-in solver approach, probably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSGs). Specifically, we plan in an empirical TBSG by utilizing a 'simulator' that allows sampling from arbitrary state-action pairs. We show that the empirical Nash equilibrium strategy is an approximate Nash equilibrium strategy in the true TBSG and give both problem-dependent and problem-independent bounds. We develop absorbing TBSG and reward perturbation techniques to tackle the complex statistical dependence. The key idea is to artificially introduce a suboptimality gap in the TBSG, so that the Nash equilibrium strategy lies in a finite set.

Latent Skill Planning for Exploration and Transfer

Authors:Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, Florian Shkurti
Date:2020-11-27 18:40:03

To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that is learned over time while the skills it conditions on are chosen using online planning. We demonstrate the benefits of our design decisions across a suite of challenging locomotion tasks and demonstrate improved sample efficiency in single tasks as well as in transfer from one task to another, as compared to competitive baselines. Videos are available at: https://sites.google.com/view/latent-skill-planning/

Reinforcement Learning-based Joint Path and Energy Optimization of Cellular-Connected Unmanned Aerial Vehicles

Authors:Arash Hooshmand
Date:2020-11-27 14:16:55

Unmanned Aerial Vehicles (UAVs) have attracted considerable research interest recently. Especially in the realm of the Internet of Things, UAVs with Internet connectivity are in high demand. Furthermore, the energy constraint, i.e. the battery limit, is a bottleneck that can limit the applications of UAVs. We try to address and solve the energy problem. To this end, a path planning method for a cellular-connected UAV is proposed that enables the UAV to plan its path in an area much larger than its battery range by getting recharged at certain positions equipped with power stations (PSs). In addition to the energy constraint, there are also no-fly zones, for example due to Air to Air (A2A) and Air to Ground (A2G) interference or for lack of necessary connectivity, that impose extra constraints on the trajectory optimization of the UAV. No-fly zones determine the infeasible areas that should be avoided. We use reinforcement learning (RL) hierarchically to extend typical short-range path planners to consider battery recharge and solve the problem of UAVs in long missions. The problem is simulated for a UAV that flies over a large area, and the Q-learning algorithm enables the UAV to find the optimal path and recharge policy.

Optimization of the Model Predictive Control Update Interval Using Reinforcement Learning

Authors:Eivind Bøhn, Sebastien Gros, Signe Moe, Tor Arne Johansen
Date:2020-11-26 16:01:52

In control applications there is often a compromise that needs to be made with regards to the complexity and performance of the controller and the computational resources that are available. For instance, the typical hardware platform in embedded control applications is a microcontroller with limited memory and processing power, and for battery powered applications the control system can account for a significant portion of the energy consumption. We propose a controller architecture in which the computational cost is explicitly optimized along with the control objective. This is achieved by a three-part architecture where a high-level, computationally expensive controller generates plans, which a computationally simpler controller executes by compensating for prediction errors, while a recomputation policy decides when the plan should be recomputed. In this paper, we employ model predictive control (MPC) as the high-level plan-generating controller, a linear state feedback controller as the simpler compensating controller, and reinforcement learning (RL) to learn the recomputation policy. Simulation results for two examples showcase the architecture's ability to improve upon the MPC approach and find reasonable compromises weighing the performance on the control objective and the computational resources expended.

An End-to-end Deep Reinforcement Learning Approach for the Long-term Short-term Planning on the Frenet Space

Authors:Majid Moghadam, Ali Alizadeh, Engin Tekin, Gabriel Hugh Elkaim
Date:2020-11-26 02:40:07

Tactical decision making and strategic motion planning for autonomous highway driving are challenging due to the complication of predicting other road users' behaviors, diversity of environments, and complexity of the traffic interactions. This paper presents a novel end-to-end continuous deep reinforcement learning approach towards autonomous cars' decision-making and motion planning. For the first time, we define both states and action spaces on the Frenet space to make the driving behavior less variant to the road curvatures than the surrounding actors' dynamics and traffic interactions. The agent receives time-series data of past trajectories of the surrounding vehicles and applies convolutional neural networks along the time channels to extract features in the backbone. The algorithm generates continuous spatiotemporal trajectories on the Frenet frame for the feedback controller to track. Extensive high-fidelity highway simulations on CARLA show the superiority of the presented approach compared with commonly used baselines and discrete reinforcement learning on various traffic scenarios. Furthermore, the proposed method's advantage is confirmed with a more comprehensive performance evaluation against 1000 randomly generated test scenarios.

Learning Certified Control using Contraction Metric

Authors:Dawei Sun, Susmit Jha, Chuchu Fan
Date:2020-11-25 08:22:07

In this paper, we solve the problem of finding a certified control policy that drives a robot from any given initial state and under any bounded disturbance to the desired reference trajectory, with guarantees on the convergence or bounds on the tracking error. Such a controller is crucial in safe motion planning. We leverage the advanced theory in Control Contraction Metric and design a learning framework based on neural networks to co-synthesize the contraction metric and the controller for control-affine systems. We further provide methods to validate the convergence and bounded error guarantees. We demonstrate the performance of our method using a suite of challenging robotic models, including models with learned dynamics as neural networks. We compare our approach with leading methods using sum-of-squares programming, reinforcement learning, and model predictive control. Results show that our methods indeed can handle a broader class of systems with less tracking error and faster execution speed. Code is available at https://github.com/sundw2014/C3M.

World Model as a Graph: Learning Latent Landmarks for Planning

Authors:Lunjun Zhang, Ge Yang, Bradly C. Stadie
Date:2020-11-25 02:49:21

Planning - the ability to analyze the structure of a problem in the large and decompose it into interrelated subproblems - is a hallmark of human intelligence. While deep reinforcement learning (RL) has shown great promise for solving relatively straightforward control tasks, it remains an open problem how to best incorporate planning into existing deep RL paradigms to handle increasingly complex environments. One prominent framework, Model-Based RL, learns a world model and plans using step-by-step virtual rollouts. This type of world model quickly diverges from reality when the planning horizon increases, thus struggling at long-horizon planning. How can we learn world models that endow agents with the ability to do temporally extended reasoning? In this work, we propose to learn graph-structured world models composed of sparse, multi-step transitions. We devise a novel algorithm to learn latent landmarks that are scattered (in terms of reachability) across the goal space as the nodes on the graph. In this same graph, the edges are the reachability estimates distilled from Q-functions. On a variety of high-dimensional continuous control tasks ranging from robotic manipulation to navigation, we demonstrate that our method, named L3P, significantly outperforms prior work, and is oftentimes the only method capable of leveraging both the robustness of model-free RL and generalization of graph-search algorithms. We believe our work is an important step towards scalable planning in reinforcement learning.

C-Learning: Horizon-Aware Cumulative Accessibility Estimation

Authors:Panteha Naderian, Gabriel Loaiza-Ganem, Harry J. Braviner, Anthony L. Caterini, Jesse C. Cresswell, Tong Li, Animesh Garg
Date:2020-11-24 20:34:31

Multi-goal reaching is an important problem in reinforcement learning needed to achieve algorithmic generalization. Despite recent advances in this field, current algorithms suffer from three major challenges: high sample complexity, learning only a single way of reaching the goals, and difficulties in solving complex motion planning tasks. In order to address these limitations, we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon. We show that these functions obey a recurrence relation, which enables learning from offline interactions. We also prove that optimal cumulative accessibility functions are monotonic in the planning horizon. Additionally, our method can trade off speed and reliability in goal-reaching by suggesting multiple paths to a single goal depending on the provided horizon. We evaluate our approach on a set of multi-goal discrete and continuous control tasks. We show that our method outperforms state-of-the-art goal-reaching algorithms in success rate, sample complexity, and path optimality. Our code is available at https://github.com/layer6ai-labs/CAE, and additional visualizations can be found at https://sites.google.com/view/learning-cae/.
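
To make the horizon-dependent reachability idea concrete, here is a small tabular sketch under strong simplifying assumptions: a deterministic, fully known graph and a state-level (not state-action-level) table. It is not the paper's definition or learning algorithm; the function name `accessibility` and the five-state ring are hypothetical. The recurrence used is: the goal is accessible within `h` steps if the state is the goal, or if some successor can reach it within `h - 1` steps.

```python
def accessibility(successors, goal, max_horizon):
    """Return table c where c[h][s] is 1.0 if `goal` is reachable from
    state s in at most h steps, else 0.0 (deterministic toy setting)."""
    n = len(successors)
    # Horizon 0: only the goal itself counts as reached.
    c = [[1.0 if s == goal else 0.0 for s in range(n)]]
    for _ in range(max_horizon):
        prev = c[-1]
        row = []
        for s in range(n):
            if s == goal:
                row.append(1.0)
            else:
                # Best successor under the shorter horizon.
                row.append(max((prev[s2] for s2 in successors[s]),
                               default=0.0))
        c.append(row)
    return c


# Hypothetical 5-state ring: from s the agent may stay or move to (s + 1) % 5.
ring = [[s, (s + 1) % 5] for s in range(5)]
table = accessibility(ring, goal=3, max_horizon=4)
```

In this toy table, state 0 cannot reach the goal within two steps but can within three, and every entry is non-decreasing in the horizon, which mirrors the monotonicity property the abstract mentions.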

From Pixels to Legs: Hierarchical Learning of Quadruped Locomotion

Authors:Deepali Jain, Atil Iscen, Ken Caluwaerts
Date:2020-11-23 20:55:54

Legged robots navigating crowded scenes and complex terrains in the real world are required to execute dynamic leg movements while processing visual input for obstacle avoidance and path planning. We show that a quadruped robot can acquire both of these skills by means of hierarchical reinforcement learning (HRL). By virtue of their hierarchical structure, our policies learn to implicitly break down this joint problem by concurrently learning High Level (HL) and Low Level (LL) neural network policies. These two levels are connected by a low dimensional hidden layer, which we call latent command. HL receives a first-person camera view, whereas LL receives the latent command from HL and the robot's on-board sensors to control its actuators. We train policies to walk in two different environments: a curved cliff and a maze. We show that hierarchical policies can concurrently learn to locomote and navigate in these environments, and show they are more efficient than non-hierarchical neural network policies. This architecture also allows for knowledge reuse across tasks. LL networks trained on one task can be transferred to a new task in a new environment. Finally HL, which processes camera images, can be evaluated at much lower and varying frequencies compared to LL, thus reducing computation times and bandwidth requirements.

Evolutionary Planning in Latent Space

Authors:Thor V. A. N. Olesen, Dennis T. T. Nguyen, Rasmus Berg Palm, Sebastian Risi
Date:2020-11-23 09:21:30

Planning is a powerful approach to reinforcement learning with several desirable properties. However, it requires a model of the world, which is not readily available in many real-life problems. In this paper, we propose to learn a world model that enables Evolutionary Planning in Latent Space (EPLS). We use a Variational Auto Encoder (VAE) to learn a compressed latent representation of individual observations and extend a Mixture Density Recurrent Neural Network (MDRNN) to learn a stochastic, multi-modal forward model of the world that can be used for planning. We use the Random Mutation Hill Climbing (RMHC) to find a sequence of actions that maximize expected reward in this learned model of the world. We demonstrate how to build a model of the world by bootstrapping it with rollouts from a random policy and iteratively refining it with rollouts from an increasingly accurate planning policy using the learned world model. After a few iterations of this refinement, our planning agents are better than standard model-free reinforcement learning approaches demonstrating the viability of our approach.
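
As a rough illustration of planning with Random Mutation Hill Climbing, the sketch below searches over a fixed-length action sequence by mutating one action at a time and keeping the candidate if its return is no worse. The learned VAE/MDRNN world model is replaced here by a hypothetical hand-written 1-D rollout function (`rollout_return`); the horizon, action set, and iteration count are likewise assumptions for illustration only.

```python
import random


def rollout_return(actions, start=0, target=5):
    """Hypothetical stand-in for a learned model: a 1-D position that the
    actions shift; each step is penalised by the distance to `target`."""
    pos, total = start, 0.0
    for a in actions:
        pos += a
        total -= abs(pos - target)
    return total


def rmhc(horizon=8, iterations=300, action_set=(-1, 0, 1), seed=0):
    """Random Mutation Hill Climbing over an action sequence."""
    rng = random.Random(seed)
    best = [rng.choice(action_set) for _ in range(horizon)]
    best_score = rollout_return(best)
    for _ in range(iterations):
        candidate = list(best)
        # Mutate a single randomly chosen time step.
        candidate[rng.randrange(horizon)] = rng.choice(action_set)
        score = rollout_return(candidate)
        if score >= best_score:  # accept ties to drift across plateaus
            best, best_score = candidate, score
    return best, best_score


plan, plan_score = rmhc()
```

In a receding-horizon setup one would typically execute only the first action of `plan` and re-plan at the next step; whether EPLS does exactly this is not stated in the abstract.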

Multi-Agent Reinforcement Learning for Markov Routing Games: A New Modeling Paradigm For Dynamic Traffic Assignment

Authors:Zhenyu Shou, Xu Chen, Yongjie Fu, Xuan Di
Date:2020-11-22 02:31:14

This paper aims to develop a paradigm that models the learning behavior of intelligent agents (including but not limited to autonomous vehicles, connected and automated vehicles, or human-driven vehicles with intelligent navigation systems where human drivers follow the navigation instructions completely) with a utility-optimizing goal and the system's equilibrating processes in a routing game among atomic selfish agents. Such a paradigm can assist policymakers in devising optimal operational and planning countermeasures under both normal and abnormal circumstances. To this end, we develop a Markov routing game (MRG) in which each agent learns and updates her own en-route path choice policy while interacting with others in transportation networks. To efficiently solve MRG, we formulate it as multi-agent reinforcement learning (MARL) and devise a mean field multi-agent deep Q learning (MF-MA-DQL) approach that captures the competition among agents. The linkage between the classical DUE paradigm and our proposed Markov routing game (MRG) is discussed. We show that the routing behavior of intelligent agents converges to the classical notion of predictive dynamic user equilibrium (DUE) when traffic environments are simulated using dynamic loading models (DNL). In other words, the MRG depicts DUEs assuming perfect information and deterministic environments propagated by DNL models. Four examples are solved to illustrate the algorithm efficiency and consistency between DUE and the MRG equilibrium, on a simple network without and with spillback, the Ortuzar Willumsen (OW) Network, and a real-world network near Columbia University's campus in Manhattan of New York City.

Policy Teaching in Reinforcement Learning via Environment Poisoning Attacks

Authors:Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, Adish Singla
Date:2020-11-21 16:54:45

We study a security threat to reinforcement learning where an attacker poisons the learning environment to force the agent into executing a target policy chosen by the attacker. As a victim, we consider RL agents whose objective is to find a policy that maximizes reward in infinite-horizon problem settings. The attacker can manipulate the rewards and the transition dynamics in the learning environment at training-time, and is interested in doing so in a stealthy manner. We propose an optimization framework for finding an optimal stealthy attack for different measures of attack cost. We provide lower/upper bounds on the attack cost, and instantiate our attacks in two settings: (i) an offline setting where the agent is doing planning in the poisoned environment, and (ii) an online setting where the agent is learning a policy with poisoned feedback. Our results show that the attacker can easily succeed in teaching any target policy to the victim under mild conditions and highlight a significant security threat to reinforcement learning agents in practice.

When to stop value iteration: stability and near-optimality versus computation

Authors:Mathieu Granzotto, Romain Postoyan, Dragan Nešić, Lucian Buşoniu, Jamal Daafouz
Date:2020-11-20 01:27:30

Value iteration (VI) is a ubiquitous algorithm for optimal control, planning, and reinforcement learning schemes. Under the right assumptions, VI is a vital tool to generate inputs with desirable properties for the controlled system, like optimality and Lyapunov stability. As VI usually requires an infinite number of iterations to solve general nonlinear optimal control problems, a key question is when to terminate the algorithm to produce a "good" solution, with a measurable impact on optimality and stability guarantees. By carefully analysing VI under general stabilizability and detectability properties, we provide explicit and novel relationships of the stopping criterion's impact on near-optimality, stability and performance, thus allowing to tune these desirable properties against the induced computational cost. The considered class of stopping criteria encompasses those encountered in the control, dynamic programming and reinforcement learning literature and it allows considering new ones, which may be useful to further reduce the computational cost while endowing and satisfying stability and near-optimality properties. We therefore lay a foundation to endow machine learning schemes based on VI with stability and performance guarantees, while reducing computational complexity.
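
The trade-off between stopping early and solution quality can be seen even in the textbook discounted, tabular setting, which is much simpler than the general nonlinear problems the abstract addresses. The sketch below is that textbook variant, with a sup-norm stopping rule; the two-state MDP, the tolerance values, and the function name are assumptions made for illustration, not the paper's stopping criteria.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-6, max_iters=10_000):
    """Discounted tabular value iteration.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    immediate reward. Iteration stops once the largest change in any state's
    value falls below `tol`. Returns the value estimate and iteration count.
    """
    n = len(P)
    v = [0.0] * n
    iters = 0
    for iters in range(1, max_iters + 1):
        new_v = [
            max(R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n)
        ]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v, iters


# Hypothetical 2-state MDP: in each state, action 0 stays and action 1 switches.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]

v_tight, k_tight = value_iteration(P, R, tol=1e-6)
v_loose, k_loose = value_iteration(P, R, tol=1e-2)
```

For this toy problem the exact values are 19 and 20 (staying in state 1 forever yields 2 / (1 - 0.9) = 20). The looser tolerance stops after fewer sweeps at the price of a larger residual error, which is the kind of computation-versus-guarantee trade-off the paper analyses in a far more general setting.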

PassGoodPool: Joint Passengers and Goods Fleet Management with Reinforcement Learning aided Pricing, Matching, and Route Planning

Authors:Kaushik Manchella, Marina Haliem, Vaneet Aggarwal, Bharat Bhargava
Date:2020-11-17 23:15:03

The ubiquitous growth of mobility-on-demand services for passenger and goods delivery has brought various challenges and opportunities within the realm of transportation systems. As a result, intelligent transportation systems are being developed to maximize operational profitability, user convenience, and environmental sustainability. The growth of last mile deliveries alongside ridesharing calls for an efficient and cohesive system that transports both passengers and goods. Existing methods address this using static routing methods considering neither the demands of requests nor the transfer of goods between vehicles during route planning. In this paper, we present a dynamic and demand aware fleet management framework for combined goods and passenger transportation that is capable of (1) Involving both passengers and drivers in the decision-making process by allowing drivers to negotiate to a mutually suitable price, and passengers to accept/reject, (2) Matching of goods to vehicles, and the multi-hop transfer of goods, (3) Dynamically generating optimal routes for each vehicle considering demand along their paths, based on the insertion cost which then determines the matching, (4) Dispatching idle vehicles to areas of anticipated high passenger and goods demand using Deep Reinforcement Learning (RL), (5) Allowing for distributed inference at each vehicle while collectively optimizing fleet objectives. Our proposed model is deployable independently within each vehicle as this minimizes computational costs associated with the growth of distributed systems and democratizes decision-making to each individual. Simulations on a variety of vehicle types, goods, and passenger utility functions show the effectiveness of our approach as compared to other methods that do not consider combined load transportation or dynamic multi-hop route planning.

Combining Reinforcement Learning with Model Predictive Control for On-Ramp Merging

Authors:Joseph Lubars, Harsh Gupta, Sandeep Chinchali, Liyun Li, Adnan Raja, R. Srikant, Xinzhou Wu
Date:2020-11-17 07:42:11

We consider the problem of designing an algorithm to allow a car to autonomously merge on to a highway from an on-ramp. Two broad classes of techniques have been proposed to solve motion planning problems in autonomous driving: Model Predictive Control (MPC) and Reinforcement Learning (RL). In this paper, we first establish the strengths and weaknesses of state-of-the-art MPC and RL-based techniques through simulations. We show that the performance of the RL agent is worse than that of the MPC solution from the perspective of safety and robustness to out-of-distribution traffic patterns, i.e., traffic patterns which were not seen by the RL agent during training. On the other hand, the performance of the RL agent is better than that of the MPC solution when it comes to efficiency and passenger comfort. We subsequently present an algorithm which blends the model-free RL agent with the MPC solution and show that it provides better trade-offs between all metrics -- passenger comfort, efficiency, crash rate and robustness.

Reachability-based Trajectory Safeguard (RTS): A Safe and Fast Reinforcement Learning Safety Layer for Continuous Control

Authors:Yifei Simon Shao, Chao Chen, Shreyas Kousik, Ram Vasudevan
Date:2020-11-17 04:57:15

Reinforcement Learning (RL) algorithms have achieved remarkable performance in decision making and control tasks due to their ability to reason about long-term, cumulative reward using trial and error. However, during RL training, applying this trial-and-error approach to real-world robots operating in safety critical environment may lead to collisions. To address this challenge, this paper proposes a Reachability-based Trajectory Safeguard (RTS), which leverages reachability analysis to ensure safety during training and operation. Given a known (but uncertain) model of a robot, RTS precomputes a Forward Reachable Set of the robot tracking a continuum of parameterized trajectories. At runtime, the RL agent selects from this continuum in a receding-horizon way to control the robot; the FRS is used to identify if the agent's choice is safe or not, and to adjust unsafe choices. The efficacy of this method is illustrated on three nonlinear robot models, including a 12-D quadrotor drone, in simulation and in comparison with state-of-the-art safe motion planning methods.

Distilling a Hierarchical Policy for Planning and Control via Representation and Reinforcement Learning

Authors:Jung-Su Ha, Young-Jin Park, Hyeok-Joo Chae, Soon-Seo Park, Han-Lim Choi
Date:2020-11-16 23:58:49

We present a hierarchical planning and control framework that enables an agent to perform various tasks and adapt to a new task flexibly. Rather than learning an individual policy for each particular task, the proposed framework, DISH, distills a hierarchical policy from a set of tasks by representation and reinforcement learning. The framework is based on the idea of latent variable models that represent high-dimensional observations using low-dimensional latent variables. The resulting policy consists of two levels of hierarchy: (i) a planning module that reasons a sequence of latent intentions that would lead to an optimistic future and (ii) a feedback control policy, shared across the tasks, that executes the inferred intention. Because the planning is performed in low-dimensional latent space, the learned policy can immediately be used to solve or adapt to new tasks without additional training. We demonstrate the proposed framework can learn compact representations (3- and 1-dimensional latent states and commands for a humanoid with 197- and 36-dimensional state features and actions) while solving a small number of imitation tasks, and the resulting policy is directly applicable to other types of tasks, i.e., navigation in cluttered environments. Video: https://youtu.be/HQsQysUWOhg

Critic PI2: Master Continuous Planning via Policy Improvement with Path Integrals and Deep Actor-Critic Reinforcement Learning

Authors:Jiajun Fan, He Ba, Xian Guo, Jianye Hao
Date:2020-11-13 04:14:40

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods from AlphaGo to MuZero have enjoyed huge success in discrete domains, such as chess and Go. Unfortunately, in real-world applications like robot control and inverted pendulum, whose action space is normally continuous, those tree-based planning techniques struggle. To address those limitations, in this paper, we present a novel model-based reinforcement learning framework called Critic PI2, which combines the benefits of trajectory optimization, deep actor-critic learning, and model-based reinforcement learning. Our method is evaluated on inverted pendulum models and is applicable to many continuous control systems. Extensive experiments demonstrate that Critic PI2 achieves a new state of the art in a range of challenging continuous domains. Furthermore, we show that planning with a critic significantly increases the sample efficiency and real-time performance. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.

Generalized Inverse Planning: Learning Lifted non-Markovian Utility for Generalizable Task Representation

Authors:Sirui Xie, Feng Gao, Song-Chun Zhu
Date:2020-11-12 21:06:26

In searching for a generalizable representation of temporally extended tasks, we spot two necessary constituents: the utility needs to be non-Markovian to transfer temporal relations invariant to a probability shift, and the utility also needs to be lifted to abstract out specific grounding objects. In this work, we study learning such utility from human demonstrations. While inverse reinforcement learning (IRL) has been accepted as a general framework of utility learning, its fundamental formulation is one concrete Markov Decision Process. Thus the learned reward function does not specify the task independently of the environment. Going beyond that, we define a domain of generalization that spans a set of planning problems following a schema. We hence propose a new quest, Generalized Inverse Planning, for utility learning in this domain. We further outline a computational framework, Maximum Entropy Inverse Planning (MEIP), that learns non-Markovian utility and associated concepts in a generative manner. The learned utility and concepts form a task representation that generalizes regardless of probability shift or structural change. Seeing that the proposed generalization problem has not been widely studied yet, we carefully define an evaluation protocol, with which we illustrate the effectiveness of MEIP on two proof-of-concept domains and one challenging task: learning to fold from demonstrations.

Dynamic allocation of limited memory resources in reinforcement learning

Authors:Nisheet Patel, Luigi Acerbi, Alexandre Pouget
Date:2020-11-12 13:58:07

Biological brains are inherently limited in their capacity to process and store information, but are nevertheless capable of solving complex tasks with apparent ease. Intelligent behavior is related to these limitations, since resource constraints drive the need to generalize and assign importance differentially to features in the environment or memories of past experiences. Recently, there have been parallel efforts in reinforcement learning and neuroscience to understand strategies adopted by artificial and biological agents to circumvent limitations in information storage. However, the two threads have been largely separate. In this article, we propose a dynamical framework to maximize expected reward under constraints of limited resources, which we implement with a cost function that penalizes precise representations of action-values in memory, each of which may vary in its precision. We derive from first principles an algorithm, Dynamic Resource Allocator (DRA), which we apply to two standard tasks in reinforcement learning and a model-based planning task, and find that it allocates more resources to items in memory that have a higher impact on cumulative rewards. Moreover, DRA learns faster when starting with a higher resource budget than what it eventually allocates for performing well on tasks, which may explain why frontal cortical areas in biological brains appear more engaged in early stages of learning before settling to lower asymptotic levels of activity. Our work provides a normative solution to the problem of learning how to allocate costly resources to a collection of uncertain memories in a manner that is capable of adapting to changes in the environment.

Decentralized Motion Planning for Multi-Robot Navigation using Deep Reinforcement Learning

Authors:Sivanathan Kandhasamy, Vinayagam Babu Kuppusamy, Tanmay Vilas Samak, Chinmay Vilas Samak
Date:2020-11-11 07:35:21

This work presents a decentralized motion planning framework for addressing the task of multi-robot navigation using deep reinforcement learning. A custom simulator was developed in order to experimentally investigate the navigation problem of 4 cooperative non-holonomic robots sharing limited state information with each other in 3 different settings. The notion of decentralized motion planning with common and shared policy learning was adopted, which allowed robust training and testing of this approach in a stochastic environment since the agents were mutually independent and exhibited asynchronous motion behavior. The task was further aggravated by providing the agents with a sparse observation space and requiring them to generate continuous action commands so as to efficiently, yet safely navigate to their respective goal locations, while avoiding collisions with other dynamic peers and static obstacles at all times. The experimental results are reported in terms of quantitative measures and qualitative remarks for both training and deployment phases.

Bimanual Regrasping for Suture Needles using Reinforcement Learning for Rapid Motion Planning

Authors:Zih-Yun Chiu, Florian Richter, Emily K. Funk, Ryan K. Orosco, Michael C. Yip
Date:2020-11-09 22:49:19

Regrasping a suture needle is an important yet time-consuming process in suturing. To bring efficiency into regrasping, prior work either designs a task-specific mechanism or guides the gripper toward some specific pick-up point for proper grasping of a needle. Yet, these methods are usually not deployable when the working space is changed. Therefore, in this work, we present rapid trajectory generation for bimanual needle regrasping via reinforcement learning (RL). Demonstrations from a sampling-based motion planning algorithm are incorporated to speed up the learning. In addition, we propose the ego-centric state and action spaces for this bimanual planning problem, where the reference frames are on the end-effectors instead of some fixed frame. Thus, the learned policy can be directly applied to any feasible robot configuration. Our experiments in simulation show that the success rate of a single pass is 97%, and the planning time is 0.0212s on average, which outperforms other widely used motion planning algorithms. For the real-world experiments, the success rate is 73.3% if the needle pose is reconstructed from an RGB image, with a planning time of 0.0846s and a run time of 5.1454s. If the needle pose is known beforehand, the success rate becomes 90.5%, with a planning time of 0.0807s and a run time of 2.8801s.

Trajectory Planning for Autonomous Vehicles Using Hierarchical Reinforcement Learning

Authors:Kaleb Ben Naveed, Zhiqian Qiao, John M. Dolan
Date:2020-11-09 20:49:54

Planning safe trajectories under uncertain and dynamic conditions makes the autonomous driving problem significantly complex. Current sampling-based methods such as Rapidly Exploring Random Trees (RRTs) are not ideal for this problem because of the high computational cost. Supervised learning methods such as Imitation Learning lack generalization and safety guarantees. To address these problems and in order to ensure a robust framework, we propose a Hierarchical Reinforcement Learning (HRL) structure combined with a Proportional-Integral-Derivative (PID) controller for trajectory planning. HRL helps divide the task of autonomous vehicle driving into sub-goals and supports the network to learn policies for both high-level options and low-level trajectory planner choices. The introduction of sub-goals decreases convergence time and enables the policies learned to be reused for other scenarios. In addition, the proposed planner is made robust by guaranteeing smooth trajectories and by handling the noisy perception system of the ego-car. The PID controller is used for tracking the waypoints, which ensures smooth trajectories and reduces jerk. The problem of incomplete observations is handled by using a Long-Short-Term-Memory (LSTM) layer in the network. Results from the high-fidelity CARLA simulator indicate that the proposed method reduces convergence time, generates smoother trajectories, and is able to handle dynamic surroundings and noisy observations.
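
As a rough illustration of the waypoint-tracking role the abstract assigns to the PID controller, the following is a minimal textbook discrete-time PID sketch. The gains, the time step, and the toy plant x' = u are assumptions chosen for illustration; none of them come from the paper, and the paper's controller operates on vehicle dynamics in CARLA rather than on this toy system.

```python
class PID:
    """Textbook discrete-time PID controller (illustrative, not the paper's)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, target, measured):
        error = target - measured
        # Accumulate the integral term with a simple rectangle rule.
        self.integral += error * self.dt
        # No derivative estimate is available on the very first call.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track(waypoint, start=0.0, steps=200, dt=0.05):
    """Drive a hypothetical 1-D plant x' = u toward a single waypoint."""
    pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=dt)  # assumed gains
    x = start
    for _ in range(steps):
        u = pid.step(waypoint, x)
        x += u * dt  # forward-Euler integration of the toy plant
    return x
```

Calling `track(1.0)` moves the toy state from 0 to the neighbourhood of the waypoint; in a hierarchical setup like the one described, the waypoint would be supplied by the learned planner rather than fixed by hand.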

Safe Trajectory Planning Using Reinforcement Learning for Self Driving

Authors:Josiah Coad, Zhiqian Qiao, John M. Dolan
Date:2020-11-09 19:29:14

Self-driving vehicles must be able to act intelligently in diverse and difficult environments, marked by high-dimensional state spaces, a myriad of optimization objectives and complex behaviors. Traditionally, classical optimization and search techniques have been applied to the problem of self-driving, but they do not fully address operations in environments with high-dimensional states and complex behaviors. Recently, imitation learning has been proposed for the task of self-driving, but it is labor-intensive to obtain enough training data. Reinforcement learning has been proposed as a way to directly control the car, but this raises safety and comfort concerns. We propose using model-free reinforcement learning for the trajectory planning stage of self-driving and show that this approach allows us to operate the car in a safer, more general and more comfortable manner, as required for the task of self-driving.

Behavior Planning at Urban Intersections through Hierarchical Reinforcement Learning

Authors:Zhiqian Qiao, Jeff Schneider, John M. Dolan
Date:2020-11-09 19:23:26

For autonomous vehicles, effective behavior planning is crucial to ensure safety of the ego car. In many urban scenarios, it is hard to create sufficiently general heuristic rules, especially for challenging scenarios that some new human drivers find difficult. In this work, we propose a behavior planning structure based on reinforcement learning (RL) which is capable of performing autonomous vehicle behavior planning with a hierarchical structure in simulated urban environments. Application of the hierarchical structure allows the various layers of the behavior planning system to be satisfied. Our algorithms can perform better than heuristic-rule-based methods for elective decisions such as when to turn left between vehicles approaching from the opposite direction or possible lane-change when approaching an intersection due to lane blockage or delay in front of the ego car. Such behavior is hard to evaluate as correct or incorrect, but some aggressive expert human drivers handle such scenarios effectively and quickly. On the other hand, compared to traditional RL methods, our algorithm is more sample-efficient, due to the use of a hybrid reward mechanism and heuristic exploration during the training process. The results also show that the proposed method converges to an optimal policy faster than traditional RL methods.

Reinforced Deep Markov Models With Applications in Automatic Trading

Authors:Tadeu A. Ferreira
Date:2020-11-09 12:46:30

Inspired by the developments in deep generative models, we propose a model-based RL approach, coined Reinforced Deep Markov Model (RDMM), designed to integrate desirable properties of a reinforcement learning algorithm acting as an automatic trading system. The network architecture allows for the possibility that market dynamics are partially visible and are potentially modified by the agent's actions. The RDMM filters incomplete and noisy data, to create better-behaved input data for RL planning. The policy search optimisation also properly accounts for state uncertainty. Due to the complexity of the RKDF model architecture, we performed ablation studies to understand the contributions of individual components of the approach better. To test the financial performance of the RDMM we implement policies using variants of Q-Learning, DynaQ-ARIMA and DynaQ-LSTM algorithms. The experiments show that the RDMM is data-efficient and provides financial gains compared to the benchmarks in the optimal execution problem. The performance improvement becomes more pronounced when price dynamics are more complex, and this has been demonstrated using real data sets from the limit order book of Facebook, Intel, Vodafone and Microsoft.

On the role of planning in model-based deep reinforcement learning

Authors:Jessica B. Hamrick, Abram L. Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Buesing, Petar Veličković, Théophane Weber
Date:2020-11-08 16:55:16

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

The Value Equivalence Principle for Model-Based Reinforcement Learning

Authors:Christopher Grimm, André Barreto, Satinder Singh, David Silver
Date:2020-11-06 18:25:54

Learning models of the environment from data is often viewed as an essential component to building intelligent reinforcement learning (RL) agents. The common practice is to separate the learning of the model from its use, by constructing a model of the environment's dynamics that correctly predicts the observed state transitions. In this paper we argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning. As our main contribution, we introduce the principle of value equivalence: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates. We propose a formulation of the model learning problem based on the value equivalence principle and analyze how the set of feasible solutions is impacted by the choice of policies and functions. Specifically, we show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks, until eventually collapsing to a single point corresponding to a model that perfectly describes the environment. In many problems, directly modelling state-to-state transitions may be both difficult and unnecessary. By leveraging the value-equivalence principle one may find simpler models without compromising performance, saving computation and memory. We illustrate the benefits of value-equivalent model learning with experiments comparing it against more traditional counterparts like maximum likelihood estimation. More generally, we argue that the principle of value equivalence underlies a number of recent empirical successes in RL, such as Value Iteration Networks, the Predictron, Value Prediction Networks, TreeQN, and MuZero, and provides a first theoretical underpinning of those results.
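
The definition in the abstract (two models are value equivalent with respect to a set of policies and functions if they yield the same Bellman updates) can be checked directly in a small tabular setting. The sketch below is an illustration under assumptions: the tabular representation, the function names, and the two toy 3-state models are hypothetical and are not taken from the paper.

```python
def bellman_update(reward, trans, policy, v, gamma=0.9):
    """Apply the policy Bellman operator of a tabular model to v.

    reward[s][a] is a scalar, trans[s][a] is a list of next-state
    probabilities, and policy[s][a] is the probability of action a in s.
    """
    out = []
    for s in range(len(v)):
        total = 0.0
        for a, pi_sa in enumerate(policy[s]):
            expected_next = sum(p * v[s2] for s2, p in enumerate(trans[s][a]))
            total += pi_sa * (reward[s][a] + gamma * expected_next)
        out.append(total)
    return out


def value_equivalent(model_a, model_b, policies, functions, tol=1e-9):
    """True if both models give the same update for every (policy, v) pair."""
    for policy in policies:
        for v in functions:
            ua = bellman_update(*model_a, policy, v)
            ub = bellman_update(*model_b, policy, v)
            if any(abs(x - y) > tol for x, y in zip(ua, ub)):
                return False
    return True


# Two hypothetical 3-state, 1-action models with different dynamics.
reward = [[0.0], [0.0], [0.0]]
trans_a = [[[1.0, 0.0, 0.0]]] * 3  # always jump to state 0
trans_b = [[[0.0, 1.0, 0.0]]] * 3  # always jump to state 1
model_a, model_b = (reward, trans_a), (reward, trans_b)
policy = [[1.0]] * 3

v1 = [1.0, 1.0, 0.0]  # assigns the same value to states 0 and 1
v2 = [1.0, 0.0, 0.0]  # distinguishes states 0 and 1

print(value_equivalent(model_a, model_b, [policy], [v1]))      # True
print(value_equivalent(model_a, model_b, [policy], [v1, v2]))  # False
```

The two models disagree on the dynamics yet are indistinguishable with respect to `v1` alone; adding `v2` to the function set separates them, which mirrors the abstract's point that the class of value-equivalent models shrinks as the sets of policies and functions grow.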

LBGP: Learning Based Goal Planning for Autonomous Following in Front

Authors:Payam Nikdel, Richard Vaughan, Mo Chen
Date:2020-11-05 22:29:30

This paper investigates a hybrid solution which combines deep reinforcement learning (RL) and classical trajectory planning for the following-in-front application. Here, an autonomous robot aims to stay ahead of a person as the person freely walks around. Following in front is a challenging problem as the user's intended trajectory is unknown and needs to be estimated, explicitly or implicitly, by the robot. In addition, the robot needs to find a feasible way to safely navigate ahead of the human's trajectory. Our deep RL module implicitly estimates the human's trajectory and produces short-term navigational goals to guide the robot. These goals are used by a trajectory planner to smoothly navigate the robot to the short-term goals, and eventually in front of the user. We employ curriculum learning in the deep RL module to efficiently achieve a high return. Our system outperforms the state-of-the-art in following ahead and is more reliable compared to end-to-end alternatives in both the simulation and real world experiments. In contrast to a pure deep RL approach, we demonstrate zero-shot transfer of the trained policy from simulation to the real world.

Learning a Decentralized Multi-arm Motion Planner

Authors:Huy Ha, Jingxi Xu, Shuran Song
Date:2020-11-05 01:47:23

We present a closed-loop multi-arm motion planner that is scalable and flexible with team size. Traditional multi-arm robot systems have relied on centralized motion planners, whose runtimes often scale exponentially with team size, and thus, fail to handle dynamic environments with open-loop control. In this paper, we tackle this problem with multi-agent reinforcement learning, where a decentralized policy is trained to control one robot arm in the multi-arm system to reach its target end-effector pose given observations of its workspace state and target end-effector pose. The policy is trained using Soft Actor-Critic with expert demonstrations from a sampling-based motion planning algorithm (i.e., BiRRT). By leveraging classical planning algorithms, we can improve the learning efficiency of the reinforcement learning algorithm while retaining the fast inference time of neural networks. The resulting policy scales sub-linearly and can be deployed on multi-arm systems with variable team sizes. Thanks to the closed-loop and decentralized formulation, our approach generalizes to 5-10 multi-arm systems and dynamic moving targets (>90% success rate for a 10-arm system), despite being trained on only 1-4 arm planning tasks with static targets. Code and data links can be found at https://multiarm.cs.columbia.edu.

Secure Planning Against Stealthy Attacks via Model-Free Reinforcement Learning

Authors:Alper Kamil Bozkurt, Yu Wang, Miroslav Pajic
Date:2020-11-03 17:59:34

We consider the problem of security-aware planning in an unknown stochastic environment, in the presence of attacks on control signals (i.e., actuators) of the robot. We model the attacker as an agent who has the full knowledge of the controller as well as the employed intrusion-detection system and who wants to prevent the controller from performing tasks while staying stealthy. We formulate the problem as a stochastic game between the attacker and the controller and present an approach to express the objective of such an agent and the controller as a combined linear temporal logic (LTL) formula. We then show that the planning problem, described formally as the problem of satisfying an LTL formula in a stochastic game, can be solved via model-free reinforcement learning when the environment is completely unknown. Finally, we illustrate and evaluate our methods on two robotic planning case studies.

Deep Reinforcement Learning Based Dynamic Route Planning for Minimizing Travel Time

Authors:Yuanzhe Geng, Erwu Liu, Rui Wang, Yiming Liu
Date:2020-11-03 15:10:09

Route planning is important in transportation. Existing works focus on finding the shortest path solution or using metrics such as safety and energy consumption to determine the planning. It is noted that most of these studies rely on prior knowledge of the road network, which may not be available in certain situations. In this paper, we design a route planning algorithm based on deep reinforcement learning (DRL) for pedestrians. We use travel time consumption as the metric, and plan the route by predicting pedestrian flow in the road network. We put an agent, which is an intelligent robot, on a virtual map. Different from previous studies, our approach assumes that the agent does not need any prior information about the road network, but simply relies on the interaction with the environment. We propose a dynamically adjustable route planning (DARP) algorithm, where the agent learns strategies through a dueling deep Q network to avoid congested roads. Simulation results show that the DARP algorithm saves 52% of the time under congested conditions when compared with traditional shortest path planning algorithms.
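
For readers unfamiliar with the dueling architecture mentioned above, the sketch below shows the commonly used aggregation step that combines a state-value estimate with per-action advantages (mean-subtracted form). The abstract does not state which variant DARP uses, so this is an assumption; the function names and the numbers are hypothetical, and the neural network that would produce the two streams is omitted.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


def greedy_action(q_values):
    """Index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical outputs of the two network streams at one intersection,
# with three candidate roads as actions.
q = dueling_q_values(state_value=2.0, advantages=[1.0, 3.0, 2.0])
print(q)                 # [1.0, 3.0, 2.0]
print(greedy_action(q))  # 1
```

Subtracting the mean advantage keeps the value and advantage streams identifiable: the Q-values average to the state value, while their ordering is decided by the advantages alone.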

Sample-efficient reinforcement learning using deep Gaussian processes

Authors:Charles Gadd, Markus Heinonen, Harri Lähdesmäki, Samuel Kaski
Date:2020-11-02 13:37:57

Reinforcement learning provides a framework for learning to control which actions to take towards completing a task through trial-and-error. In many applications observing interactions is costly, necessitating sample-efficient learning. In model-based reinforcement learning efficiency is improved by learning to simulate the world dynamics. The challenge is that model inaccuracies rapidly accumulate over planned trajectories. We introduce deep Gaussian processes where the depth of the compositions introduces model complexity while incorporating prior knowledge on the dynamics brings smoothness and structure. Our approach is able to sample a Bayesian posterior over trajectories. We demonstrate highly improved early sample-efficiency over competing methods. This is shown across a number of continuous control tasks, including the half-cheetah whose contact dynamics have previously posed an insurmountable problem for earlier sample-efficient Gaussian process based models.

Deep Reactive Planning in Dynamic Environments

Authors:Kei Ota, Devesh K. Jha, Tadashi Onishi, Asako Kanezaki, Yusuke Yoshiyasu, Yoko Sasaki, Toshisada Mariyama, Daniel Nikovski
Date:2020-10-31 00:46:13

The main novelty of the proposed approach is that it allows a robot to learn an end-to-end policy which can adapt to changes in the environment during execution. While goal conditioning of policies has been studied in the RL literature, such approaches are not easily extended to cases where the robot's goal can change during execution. This is something that humans are naturally able to do. However, it is difficult for robots to learn such reflexes (i.e., to naturally respond to dynamic environments), especially when the goal location is not explicitly provided to the robot, and instead needs to be perceived through a vision sensor. In the current work, we present a method that can achieve such behavior by combining traditional kinematic planning, deep learning, and deep reinforcement learning in a synergistic fashion to generalize to arbitrary environments. We demonstrate the proposed approach for several reaching and pick-and-place tasks in simulation, as well as on a real system of a 6-DoF industrial manipulator. A video describing our work can be found at https://youtu.be/hE-Ew59GRPQ.

Towards Preference Learning for Autonomous Ground Robot Navigation Tasks

Authors:Cory Hayes, Matthew Marge
Date:2020-10-30 16:36:12

We are interested in the design of autonomous robot behaviors that learn the preferences of users over continued interactions, with the goal of efficiently executing navigation behaviors in a way that the user expects. In this paper, we discuss our work in progress to modify a general model for robot navigation behaviors in an exploration task on a per-user basis using preference-based reinforcement learning. The novel contribution of this approach is that it combines reinforcement learning, motion planning, and natural language processing to allow an autonomous agent to learn from sustained dialogue with a human teammate as opposed to one-off instructions.

Abstract Value Iteration for Hierarchical Reinforcement Learning

Authors:Kishor Jothimurugan, Osbert Bastani, Rajeev Alur
Date:2020-10-29 14:41:42

We propose a novel hierarchical reinforcement learning framework for control with continuous state and action spaces. In our framework, the user specifies subgoal regions which are subsets of states; then, we (i) learn options that serve as transitions between these subgoal regions, and (ii) construct a high-level plan in the resulting abstract decision process (ADP). A key challenge is that the ADP may not be Markov, which we address by proposing two algorithms for planning in the ADP. Our first algorithm is conservative, allowing us to prove theoretical guarantees on its performance, which help inform the design of subgoal regions. Our second algorithm is a practical one that interweaves planning at the abstract level and learning at the concrete level. In our experiments, we demonstrate that our approach outperforms state-of-the-art hierarchical reinforcement learning algorithms on several challenging benchmarks.

Forethought and Hindsight in Credit Assignment

Authors:Veronica Chelu, Doina Precup, Hado van Hasselt
Date:2020-10-26 16:00:47

We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. Particularly, we work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We establish the relative merits, limitations and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from explicit environment-dynamics predictors to more abstract planner-aware models.

Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning

Authors:Younggyo Seo, Kimin Lee, Ignasi Clavera, Thanard Kurutach, Jinwoo Shin, Pieter Abbeel
Date:2020-10-26 03:20:42

Model-based reinforcement learning (RL) has shown great potential in various control tasks in terms of both sample-efficiency and final performance. However, learning a generalizable dynamics model robust to changes in dynamics remains a challenge since the target transition dynamics follow a multi-modal distribution. In this paper, we present a new model-based RL algorithm, coined trajectory-wise multiple choice learning, that learns a multi-headed dynamics model for dynamics generalization. The main idea is updating the most accurate prediction head to specialize each head in certain environments with similar dynamics, i.e., clustering environments. Moreover, we incorporate context learning, which encodes dynamics-specific information from past experiences into the context latent vector, enabling the model to perform online adaptation to unseen environments. Finally, to utilize the specialized prediction heads more effectively, we propose an adaptive planning method, which selects the most accurate prediction head over a recent experience. Our method exhibits superior zero-shot generalization performance across a variety of control tasks, compared to state-of-the-art RL methods. Source code and videos are available at https://sites.google.com/view/trajectory-mcl.
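
The core update described above (only the most accurate prediction head, judged over a whole trajectory, is trained on that trajectory) can be sketched with scalar toy heads. Everything concrete here is an assumption for illustration: the heads are single weights predicting s' = s + w * a, the environments differ only in a gain, and context learning and the planner are left out.

```python
def trajectory_loss(weight, trajectory):
    """Summed squared one-step error of the toy head s' = s + weight * a."""
    return sum((s + weight * a - s_next) ** 2 for s, a, s_next in trajectory)


def best_head(weights, trajectory):
    """Index of the head with the lowest loss over the whole trajectory."""
    losses = [trajectory_loss(w, trajectory) for w in weights]
    return min(range(len(weights)), key=lambda k: losses[k])


def mcl_update(weights, trajectory, lr=0.1):
    """Take one gradient step on the winning head only; return its index."""
    k = best_head(weights, trajectory)
    grad = sum(2 * (s + weights[k] * a - s_next) * a for s, a, s_next in trajectory)
    weights[k] -= lr * grad / len(trajectory)
    return k


def rollout(gain, actions, s0=0.0):
    """Generate a trajectory from a hypothetical environment s' = s + gain * a."""
    trajectory, s = [], s0
    for a in actions:
        s_next = s + gain * a
        trajectory.append((s, a, s_next))
        s = s_next
    return trajectory


# Two heads, two hypothetical environments with gains 1.0 and 3.0.
weights = [0.5, 2.0]
actions = [1.0, -1.0, 0.5]
for _ in range(100):
    for gain in (1.0, 3.0):
        mcl_update(weights, rollout(gain, actions))
```

After training, each head has specialised to one gain, and `best_head` applied to a recent trajectory picks the matching head, which is the same selection rule the abstract describes for adaptive planning.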

XLVIN: eXecuted Latent Value Iteration Nets

Authors:Andreea Deac, Petar Veličković, Ognjen Milinković, Pierre-Luc Bacon, Jian Tang, Mladen Nikolić
Date:2020-10-25 16:04:30

Value Iteration Networks (VINs) have emerged as a popular method to incorporate planning algorithms within deep reinforcement learning, enabling performance improvements on tasks requiring long-range reasoning and understanding of environment dynamics. This came with several limitations, however: the model is not incentivised in any way to perform meaningful planning computations, the underlying state space is assumed to be discrete, and the Markov decision process (MDP) is assumed fixed and known. We propose eXecuted Latent Value Iteration Networks (XLVINs), which combine recent developments across contrastive self-supervised learning, graph representation learning and neural algorithmic reasoning to alleviate all of the above limitations, successfully deploying VIN-style models on generic environments. XLVINs match the performance of VIN-like models when the underlying MDP is discrete, fixed and known, and provide significant improvements to model-free baselines across three general MDP setups.

Robust Hierarchical Planning with Policy Delegation

Authors:Tin Lai, Philippe Morere
Date:2020-10-25 04:36:20

We propose a novel framework and algorithm for hierarchical planning based on the principle of delegation. This framework, the Markov Intent Process, features a collection of skills which are each specialised to perform a single task well. Skills are aware of their intended effects and are able to analyse planning goals to delegate planning to the best-suited skill. This principle dynamically creates a hierarchy of plans, in which each skill plans for sub-goals for which it is specialised. The proposed planning method features on-demand execution---skill policies are only evaluated when needed. Plans are only generated at the highest level, then expanded and optimised when the latest state information is available. The high-level plan retains the initial planning intent and previously computed skills, effectively reducing the computation needed to adapt to environmental changes. We show this planning approach is experimentally very competitive to classic planning and reinforcement learning techniques on a variety of domains, both in terms of solution length and planning time.

Improving the Exploration of Deep Reinforcement Learning in Continuous Domains using Planning for Policy Search

Authors:Jakob J. Hollenstein, Erwan Renaudo, Matteo Saveriano, Justus Piater
Date:2020-10-24 20:19:06

Local policy search is performed by most Deep Reinforcement Learning (D-RL) methods, which increases the risk of getting trapped in a local minimum. Furthermore, the availability of a simulation model is not fully exploited in D-RL even in simulation-based training, which potentially decreases efficiency. To better exploit simulation models in policy search, we propose to integrate a kinodynamic planner in the exploration strategy and to learn a control policy in an offline fashion from the generated environment interactions. We call the resulting model-based reinforcement learning method PPS (Planning for Policy Search). We compare PPS with state-of-the-art D-RL methods in typical RL settings including underactuated systems. The comparison shows that PPS, guided by the kinodynamic planner, collects data from a wider region of the state space. This generates training data that helps PPS discover better policies.

Planning with Exploration: Addressing Dynamics Bottleneck in Model-based Reinforcement Learning

Authors:Xiyao Wang, Junge Zhang, Wenzhen Huang, Qiyue Yin
Date:2020-10-24 15:29:02

Model-based reinforcement learning (MBRL) is believed to have higher sample efficiency compared with model-free reinforcement learning (MFRL). However, MBRL is plagued by the dynamics bottleneck dilemma: the performance of the algorithm falls into a local optimum instead of increasing as the number of interaction steps with the environment increases, which means more data cannot bring better performance. In this paper, we find through theoretical analysis that the trajectory reward estimation error is the main cause of the dynamics bottleneck dilemma. We give an upper bound of the trajectory reward estimation error and point out that increasing the agent's exploration ability is the key to reducing trajectory reward estimation error, thereby alleviating the dynamics bottleneck dilemma. Motivated by this, a model-based control method combined with exploration named MOdel-based Progressive Entropy-based Exploration (MOPE2) is proposed. We conduct experiments on several complex continuous control benchmark tasks. The results verify that MOPE2 can effectively alleviate the dynamics bottleneck dilemma and has higher sample efficiency than previous MBRL and MFRL algorithms.

Efficient Learning in Non-Stationary Linear Markov Decision Processes

Authors:Ahmed Touati, Pascal Vincent
Date:2020-10-24 11:02:45

We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e., both the reward and transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least squares value iteration which uses exponential weights to smoothly forget data that are far in the past. We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4}H^2 \Delta^{1/4} K^{3/4})$ where $d$ is the dimension of the feature space, $H$ is the planning horizon, $K$ is the number of episodes and $\Delta$ is a suitable measure of non-stationarity of the MDP. Moreover, we point out technical gaps in the study of forgetting strategies in the non-stationary linear bandit setting made by previous works and we propose a fix to their regret analysis.
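
The forgetting mechanism mentioned above (exponential weights inside a weighted least-squares fit) can be illustrated with a scalar regression. This is a simplified sketch, not OPT-WLSVI itself: it uses a one-dimensional feature, omits the optimism bonus and the value-iteration backup, and the names `discounted_ridge`, `eta`, and `lam` are assumptions introduced here.

```python
def discounted_ridge(features, targets, eta=0.9, lam=1e-3):
    """Scalar ridge regression where sample tau gets weight eta**(T - 1 - tau).

    With eta < 1 older samples are smoothly forgotten; eta = 1 recovers
    ordinary (unweighted) ridge regression.
    """
    T = len(features)
    num, den = 0.0, lam
    for tau, (phi, y) in enumerate(zip(features, targets)):
        w = eta ** (T - 1 - tau)
        num += w * phi * y
        den += w * phi * phi
    return num / den


# Hypothetical non-stationary data: the true coefficient jumps from 1 to 3
# halfway through the stream.
features = [1.0] * 100
targets = [1.0] * 50 + [3.0] * 50

forgetting = discounted_ridge(features, targets, eta=0.8)
no_forgetting = discounted_ridge(features, targets, eta=1.0)
```

With `eta=0.8` the estimate sits close to the current coefficient 3, whereas the unweighted fit averages the two regimes and lands near 2, which is the failure mode that forgetting is meant to avoid after an abrupt change.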

Multi-UAV Path Planning for Wireless Data Harvesting with Deep Reinforcement Learning

Authors:Harald Bayerlein, Mirco Theile, Marco Caccamo, David Gesbert
Date:2020-10-23 14:59:30

Harvesting data from distributed Internet of Things (IoT) devices with multiple autonomous unmanned aerial vehicles (UAVs) is a challenging problem requiring flexible path planning methods. We propose a multi-agent reinforcement learning (MARL) approach that, in contrast to previous work, can adapt to profound changes in the scenario parameters defining the data harvesting mission, such as the number of deployed UAVs, number, position and data amount of IoT devices, or the maximum flying time, without the need to perform expensive recomputations or relearn control policies. We formulate the path planning problem for a cooperative, non-communicating, and homogeneous team of UAVs tasked with maximizing collected data from distributed IoT sensor nodes subject to flying time and collision avoidance constraints. The path planning problem is translated into a decentralized partially observable Markov decision process (Dec-POMDP), which we solve through a deep reinforcement learning (DRL) approach, approximating the optimal UAV control policy without prior knowledge of the challenging wireless channel characteristics in dense urban environments. By exploiting a combination of centered global and local map representations of the environment that are fed into convolutional layers of the agents, we show that our proposed network architecture enables the agents to cooperate effectively by carefully dividing the data collection task among themselves, adapt to large complex environments and state spaces, and make movement decisions that balance data collection goals, flight-time efficiency, and navigation constraints. Finally, learning a control policy that generalizes over the scenario parameter space enables us to analyze the influence of individual parameters on collection performance and provide some intuition about system-level benefits.

Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning

Authors:Guangxiang Zhu, Minghao Zhang, Honglak Lee, Chongjie Zhang
Date:2020-10-23 03:22:01

Sample efficiency has been one of the major challenges for deep reinforcement learning. Recently, model-based reinforcement learning has been proposed to address this challenge by performing planning on imaginary trajectories with a learned world model. However, world model learning may suffer from overfitting to training trajectories, and thus model-based value estimation and policy search will be prone to getting stuck in an inferior local policy. In this paper, we propose a novel model-based reinforcement learning algorithm, called BrIdging Reality and Dream (BIRD). It maximizes the mutual information between imaginary and real trajectories so that the policy improvement learned from imaginary trajectories can be easily generalized to real trajectories. We demonstrate that our approach improves sample efficiency of model-based planning, and achieves state-of-the-art performance on challenging visual control benchmarks.

Motion Planner Augmented Reinforcement Learning for Robot Manipulation in Obstructed Environments

Authors:Jun Yamada, Youngwoon Lee, Gautam Salhotra, Karl Pertsch, Max Pflueger, Gaurav S. Sukhatme, Joseph J. Lim, Peter Englert
Date:2020-10-22 17:59:09

Deep reinforcement learning (RL) agents are able to learn contact-rich manipulation tasks by maximizing a reward signal, but require large amounts of experience, especially in environments with many obstacles that complicate exploration. In contrast, motion planners use explicit models of the agent and environment to plan collision-free paths to faraway goals, but suffer from inaccurate models in tasks that require contacts with the environment. To combine the benefits of both approaches, we propose motion planner augmented RL (MoPA-RL) which augments the action space of an RL agent with the long-horizon planning capabilities of motion planners. Based on the magnitude of the action, our approach smoothly transitions between directly executing the action and invoking a motion planner. We evaluate our approach on various simulated manipulation tasks and compare it to alternative action spaces in terms of learning efficiency and safety. The experiments demonstrate that MoPA-RL increases learning efficiency, leads to a faster exploration, and results in safer policies that avoid collisions with the environment. Videos and code are available at https://clvrai.com/mopa-rl .
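
The magnitude-based switch between direct execution and planner invocation can be sketched as follows. This is a simplified illustration rather than the MoPA-RL implementation: the hard threshold `direct_limit`, the function names, and the straight-line stand-in planner (which performs no collision checking) are all assumptions.

```python
import math


def plan_straight_line(q_start, q_goal, max_step):
    """Hypothetical stand-in for a motion planner.

    Returns waypoints by linear interpolation in joint space so that no joint
    moves more than max_step per waypoint. A real planner would also avoid
    obstacles.
    """
    largest = max(abs(g - s) for s, g in zip(q_start, q_goal))
    n = max(1, math.ceil(largest / max_step))
    return [[s + (g - s) * k / n for s, g in zip(q_start, q_goal)]
            for k in range(1, n + 1)]


def execute(q, action, direct_limit=0.25):
    """Small actions are applied directly; large ones become planner goals."""
    goal = [qi + ai for qi, ai in zip(q, action)]
    if max(abs(a) for a in action) <= direct_limit:
        return "direct", [goal]
    return "planner", plan_straight_line(q, goal, direct_limit)


mode_small, path_small = execute([0.0, 0.0], [0.05, -0.02])
mode_large, path_large = execute([0.0, 0.0], [1.0, 0.0])
```

Here the small action yields a single directly executed joint target, and the large action is expanded into a multi-waypoint path ending at the requested joint state.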

Learning Spring Mass Locomotion: Guiding Policies with a Reduced-Order Model

Authors:Kevin Green, Yesh Godse, Jeremy Dao, Ross L. Hatton, Alan Fern, Jonathan Hurst
Date:2020-10-21 18:29:58

In this paper, we describe an approach to achieve dynamic legged locomotion on physical robots which combines existing methods for control with reinforcement learning. Specifically, our goal is a control hierarchy in which highest-level behaviors are planned through reduced-order models, which describe the fundamental physics of legged locomotion, and lower level controllers utilize a learned policy that can bridge the gap between the idealized, simple model and the complex, full order robot. The high-level planner can use a model of the environment and be task specific, while the low-level learned controller can execute a wide range of motions so that it applies to many different tasks. In this letter we describe this learned dynamic walking controller and show that a range of walking motions from reduced-order models can be used as the command and primary training signal for learned policies. The resulting policies do not attempt to naively track the motion (as a traditional trajectory tracking controller would) but instead balance immediate motion tracking with long term stability. The resulting controller is demonstrated on a human scale, unconstrained, untethered bipedal robot at speeds up to 1.2 m/s. This letter builds the foundation of a generic, dynamic learned walking controller that can be applied to many different tasks.

Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning

Authors:Jonáš Kulhánek, Erik Derner, Robert Babuška
Date:2020-10-21 11:22:30

Visual navigation is essential for many applications in robotics, from manipulation, through mobile robotics to automated driving. Deep reinforcement learning (DRL) provides an elegant map-free approach integrating image processing, localization, and planning in one module, which can be trained and therefore optimized for a given environment. However, to date, DRL-based visual navigation was validated exclusively in simulation, where the simulator provides information that is not available in the real world, e.g., the robot's position or image segmentation masks. This precludes the use of the learned policy on a real robot. Therefore, we propose a novel approach that enables a direct deployment of the trained policy on real robots. We have designed visual auxiliary tasks, a tailored reward scheme, and a new powerful simulator to facilitate domain randomization. The policy is fine-tuned on images collected from real-world environments. We have evaluated the method on a mobile robot in a real office environment. The training took ~30 hours on a single GPU. In 30 navigation experiments, the robot reached a 0.3-meter neighborhood of the goal in more than 86.7% of cases. This result makes the proposed method directly applicable to tasks like mobile manipulation.

Negotiating Team Formation Using Deep Reinforcement Learning

Authors:Yoram Bachrach, Richard Everett, Edward Hughes, Angeliki Lazaridou, Joel Z. Leibo, Marc Lanctot, Michael Johanson, Wojciech M. Czarnecki, Thore Graepel
Date:2020-10-20 15:41:23

When autonomous agents interact in the same environment, they must often cooperate to achieve their goals. One way for agents to cooperate effectively is to form a team, make a binding agreement on a joint plan, and execute it. However, when agents are self-interested, the gains from team formation must be allocated appropriately to incentivize agreement. Various approaches for multi-agent negotiation have been proposed, but typically only work for particular negotiation protocols. More general methods usually require human input or domain-specific data, and so do not scale. To address this, we propose a framework for training agents to negotiate and form teams using deep reinforcement learning. Importantly, our method makes no assumptions about the specific negotiation protocol, and is instead completely experience driven. We evaluate our approach on both non-spatial and spatially extended team-formation negotiation environments, demonstrating that our agents beat hand-crafted bots and reach negotiation outcomes consistent with fair solutions predicted by cooperative game theory. Additionally, we investigate how the physical location of agents influences negotiation outcomes.

Dream and Search to Control: Latent Space Planning for Continuous Control

Authors:Anurag Koul, Varun V. Kumar, Alan Fern, Somdeb Majumdar
Date:2020-10-19 20:10:51

Learning and planning with latent space dynamics has been shown to be useful for sample efficiency in model-based reinforcement learning (MBRL) for discrete and continuous control tasks. In particular, recent work, for discrete action spaces, demonstrated the effectiveness of latent-space planning via Monte-Carlo Tree Search (MCTS) for bootstrapping MBRL during learning and at test time. However, the potential gains from latent-space tree search have not yet been demonstrated for environments with continuous action spaces. In this work, we propose and explore an MBRL approach for continuous action spaces based on tree-based planning over learned latent dynamics. We show that it is possible to demonstrate the types of bootstrapping benefits as previously shown for discrete spaces. In particular, the approach achieves improved sample efficiency and performance on a majority of challenging continuous-control benchmarks compared to the state-of-the-art.

Model-free conventions in multi-agent reinforcement learning with heterogeneous preferences

Authors:Raphael Köster, Kevin R. McKee, Richard Everett, Laura Weidinger, William S. Isaac, Edward Hughes, Edgar A. Duéñez-Guzmán, Thore Graepel, Matthew Botvinick, Joel Z. Leibo
Date:2020-10-18 18:18:37

Game theoretic views of convention generally rest on notions of common knowledge and hyper-rational models of individual behavior. However, decades of work in behavioral economics have questioned the validity of both foundations. Meanwhile, computational neuroscience has contributed a modernized 'dual process' account of decision-making where model-free (MF) reinforcement learning trades off with model-based (MB) reinforcement learning. The former captures habitual and procedural learning while the latter captures choices taken via explicit planning and deduction. Some conventions (e.g. international treaties) are likely supported by cognition that resonates with the game theoretic and MB accounts. However, convention formation may also occur via MF mechanisms like habit learning; though this possibility has been understudied. Here, we demonstrate that complex, large-scale conventions can emerge from MF learning mechanisms. This suggests that some conventions may be supported by habit-like cognition rather than explicit reasoning. We apply MF multi-agent reinforcement learning to a temporo-spatially extended game with incomplete information. In this game, large parts of the state space are reachable only by collective action. However, heterogeneity of tastes makes such coordinated action difficult: multiple equilibria are desirable for all players, but subgroups prefer a particular equilibrium over all others. This creates a coordination problem that can be solved by establishing a convention. We investigate start-up and free rider subproblems as well as the effects of group size, intensity of intrinsic preference, and salience on the emergence dynamics of coordination conventions. Results of our simulations show agents establish and switch between conventions, even working against their own preferred outcome when doing so is necessary for effective coordination.

Approximate information state for approximate planning and reinforcement learning in partially observed systems

Authors:Jayakumar Subramanian, Amit Sinha, Raihan Seraj, Aditya Mahajan
Date:2020-10-17 18:30:30

We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundamental notion of information state. We provide two equivalent definitions of information state -- i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS-based multi-time scale policy gradient algorithms and detailed numerical experiments with low, moderate and high dimensional environments.

Robot Navigation in Constrained Pedestrian Environments using Reinforcement Learning

Authors:Claudia Pérez-D'Arpino, Can Liu, Patrick Goebel, Roberto Martín-Martín, Silvio Savarese
Date:2020-10-16 19:40:08

Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in human environments, such as buildings and homes. While research on social navigation has focused mainly on the scalability with the number of pedestrians in open spaces, typical indoor environments present the additional challenge of constrained spaces such as corridors and doorways that limit maneuverability and influence patterns of pedestrian interaction. We present an approach based on reinforcement learning (RL) to learn policies capable of dynamic adaptation to the presence of moving pedestrians while navigating between desired locations in constrained environments. The policy network receives guidance from a motion planner that provides waypoints to follow a globally planned trajectory, whereas RL handles the local interactions. We explore a compositional principle for multi-layout training and find that policies trained in a small set of geometrically simple layouts successfully generalize to more complex unseen layouts that exhibit composition of the structural elements available during training. Going beyond walls-world like domains, we show transfer of the learned policy to unseen 3D reconstructions of two real environments. These results support the applicability of the compositional principle to navigation in real-world buildings and indicate promising usage of multi-agent simulation within reconstructed environments for tasks that involve interaction.

PRIMAL2: Pathfinding via Reinforcement and Imitation Multi-Agent Learning -- Lifelong

Authors:Mehul Damani, Zhiyao Luo, Emerson Wenzel, Guillaume Sartoretti
Date:2020-10-16 06:23:53

Multi-agent path finding (MAPF) is an indispensable component of large-scale robot deployments in numerous domains ranging from airport management to warehouse automation. In particular, this work addresses lifelong MAPF (LMAPF) - an online variant of the problem where agents are immediately assigned a new goal upon reaching their current one - in dense and highly structured environments, typical of real-world warehouse operations. Effectively solving LMAPF in such environments requires expensive coordination between agents as well as frequent replanning abilities, a daunting task for existing coupled and decoupled approaches alike. With the purpose of achieving considerable agent coordination without any compromise on reactivity and scalability, we introduce PRIMAL2, a distributed reinforcement learning framework for LMAPF where agents learn fully decentralized policies to reactively plan paths online in a partially observable world. We extend our previous work, which was effective in low-density sparsely occupied worlds, to highly structured and constrained worlds by identifying behaviors and conventions which improve implicit agent coordination, and enable their learning through the construction of a novel local agent observation and various training aids. We present extensive results of PRIMAL2 in both MAPF and LMAPF environments and compare its performance to state-of-the-art planners in terms of makespan and throughput. We show that PRIMAL2 significantly surpasses our previous work and performs comparably to these baselines, while allowing real-time re-planning and scaling up to 2048 agents.

Uncertainty-aware Contact-safe Model-based Reinforcement Learning

Authors:Cheng-Yu Kuo, Andreas Schaarschmidt, Yunduan Cui, Tamim Asfour, Takamitsu Matsubara
Date:2020-10-16 05:11:25

This letter presents contact-safe Model-based Reinforcement Learning (MBRL) for robot applications that achieves contact-safe behaviors in the learning process. In typical MBRL, we cannot expect the data-driven model to generate accurate and reliable policies for the intended robotic tasks during the learning process due to sample scarcity. Operating these unreliable policies in a contact-rich environment could cause damage to the robot and its surroundings. To alleviate the risk of causing damage through unexpected intensive physical contacts, we present the contact-safe MBRL that associates the probabilistic Model Predictive Control's (pMPC) control limits with the model uncertainty so that the allowed acceleration of controlled behavior is adjusted according to learning progress. Control planning with such uncertainty-aware control limits is formulated as a deterministic MPC problem using a computation-efficient approximated GP dynamics and an approximated inference technique. Our approach's effectiveness is evaluated through bowl mixing tasks with simulated and real robots and scooping tasks with a real robot, as examples of contact-rich manipulation skills. (video: https://youtu.be/sdhHP3NhYi0)

Applicability and Challenges of Deep Reinforcement Learning for Satellite Frequency Plan Design

Authors:Juan Jose Garau Luis, Edward Crawley, Bruce Cameron
Date:2020-10-15 20:51:03

The study and benchmarking of Deep Reinforcement Learning (DRL) models has become a trend in many industries, including aerospace engineering and communications. Recent studies in these fields propose these kinds of models to address certain complex real-time decision-making problems in which classic approaches do not meet time requirements or fail to obtain optimal solutions. While the good performance of DRL models has been proved for specific use cases or scenarios, most studies do not discuss the compromises and generalizability of such models during real operations. In this paper we explore the tradeoffs of different elements of DRL models and how they might impact the final performance. To that end, we choose the Frequency Plan Design (FPD) problem in the context of multibeam satellite constellations as our use case and propose a DRL model to address it. We identify 6 different core elements that have a major effect in its performance: the policy, the policy optimizer, the state, action, and reward representations, and the training environment. We analyze different alternatives for each of these elements and characterize their effect. We also use multiple environments to account for different scenarios in which we vary the dimensionality or make the environment nonstationary. Our findings show that DRL is a potential method to address the FPD problem in real operations, especially because of its speed in decision-making. However, no single DRL model is able to outperform the rest in all scenarios, and the best approach for each of the 6 core elements depends on the features of the operation environment. While we agree on the potential of DRL to solve future complex problems in the aerospace industry, we also reflect on the importance of designing appropriate models and training procedures, understanding the applicability of such models, and reporting the main performance tradeoffs.

UAV Path Planning using Global and Local Map Information with Deep Reinforcement Learning

Authors:Mirco Theile, Harald Bayerlein, Richard Nai, David Gesbert, Marco Caccamo
Date:2020-10-14 09:59:10

Path planning methods for autonomous unmanned aerial vehicles (UAVs) are typically designed for one specific type of mission. This work presents a method for autonomous UAV path planning based on deep reinforcement learning (DRL) that can be applied to a wide range of mission scenarios. Specifically, we compare coverage path planning (CPP), where the UAV's goal is to survey an area of interest, to data harvesting (DH), where the UAV collects data from distributed Internet of Things (IoT) sensor devices. By exploiting structured map information of the environment, we train double deep Q-networks (DDQNs) with identical architectures on both distinctly different mission scenarios to make movement decisions that balance the respective mission goal with navigation constraints. By introducing a novel approach exploiting a compressed global map of the environment combined with a cropped but uncompressed local map showing the vicinity of the UAV agent, we demonstrate that the proposed method can efficiently scale to large environments. We also extend previous results for generalizing control policies that require no retraining when scenario parameters change and offer a detailed analysis of crucial map processing parameters' effects on path planning performance.
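
The two map views described above can be sketched on a plain grid. This is an illustrative sketch only: the use of average pooling for compression, zero padding outside the map, and the function names `crop_local` and `compress_global` are assumptions and not details taken from the abstract.

```python
def crop_local(grid, row, col, radius, pad=0):
    """Uncompressed (2*radius+1) x (2*radius+1) window centered on the agent.

    Cells outside the map are filled with `pad`.
    """
    h, w = len(grid), len(grid[0])
    return [[grid[r][c] if 0 <= r < h and 0 <= c < w else pad
             for c in range(col - radius, col + radius + 1)]
            for r in range(row - radius, row + radius + 1)]


def compress_global(grid, factor):
    """Average-pool the whole map by `factor`.

    Assumes both map dimensions are divisible by `factor`.
    """
    h, w = len(grid), len(grid[0])
    return [[sum(grid[r + i][c + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(0, w, factor)]
            for r in range(0, h, factor)]


# 8x8 toy map with ones on the diagonal; the agent sits in the top-left corner.
grid = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
local = crop_local(grid, 0, 0, 1)   # 3x3 detail around the agent
coarse = compress_global(grid, 4)   # 2x2 summary of the full map
```

The local view keeps full resolution near the agent while the global view shrinks with the pooling factor, which is what lets the network input stay small as the environment grows.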

Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction

Authors:Mingyu Cai, Shaoping Xiao, Baoluo Li, Zhiliang Li, Zhen Kan
Date:2020-10-14 03:49:16

This paper presents a model-free reinforcement learning (RL) algorithm to synthesize a control policy that maximizes the satisfaction probability of linear temporal logic (LTL) specifications. Due to the consideration of environment and motion uncertainties, we model the robot motion as a probabilistic labeled Markov decision process with unknown transition probabilities and unknown probabilistic label functions. The LTL task specification is converted to a limit deterministic generalized Büchi automaton (LDGBA) with several accepting sets to maintain dense rewards during learning. The novelty of applying LDGBA is to construct an embedded LDGBA (E-LDGBA) by designing a synchronous tracking-frontier function, which enables recording of non-visited accepting sets without increasing dimensional and computational complexity. With appropriate dependent reward and discount functions, rigorous analysis shows that any method that optimizes the expected discounted return of the RL-based approach is guaranteed to find the optimal policy that maximizes the satisfaction probability of the LTL specifications. A model-free RL-based motion planning strategy is developed to generate the optimal policy in this paper. The effectiveness of the RL-based control synthesis is demonstrated via simulation and experimental results.

Deep Reinforcement Learning for Real-Time Optimization of Pumps in Water Distribution Systems

Authors:Gergely Hajgató, György Paál, Bálint Gyires-Tóth
Date:2020-10-13 15:13:49

Real-time control of pumps can be an infeasible task in water distribution systems (WDSs) because the calculation to find the optimal pump speeds is resource-intensive. The computational need cannot be lowered even with the capabilities of smart water networks when conventional optimization techniques are used. Deep reinforcement learning (DRL) is presented here as a controller of pumps in two WDSs. An agent based on a dueling deep Q-network is trained to maintain the pump speeds based on instantaneous nodal pressure data. General optimization techniques (e.g., Nelder-Mead method, differential evolution) serve as baselines. The total efficiency achieved by the DRL agent is above 0.98 of that achieved by the best-performing baseline, while the agent is around 2x faster than that baseline. The main contribution of the presented approach is that the agent can run the pumps in real-time because it depends only on measurement data. If the WDS is replaced with a hydraulic simulation, the agent still outperforms conventional techniques in search speed.
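
For readers unfamiliar with the dueling architecture, the sketch below shows the standard way its two output streams are combined into Q-values. The numbers and the interpretation of the three actions as discrete pump-speed adjustments are hypothetical; the abstract does not specify the action encoding.

```python
def dueling_q_values(state_value, advantages):
    """Combine the value stream V(s) and advantage stream A(s, a).

    Uses the common aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a),
    which makes the split between V and A identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


# Hypothetical outputs of the two network heads for one pressure observation.
# Actions (assumed): 0 = raise speed, 1 = keep speed, 2 = lower speed.
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
best_action = max(range(len(q)), key=q.__getitem__)
```

Acting greedily then only requires one forward pass and an argmax, which is consistent with the abstract's point that the trained agent can run in real time.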

Model-Based Reinforcement Learning for Type 1 Diabetes Blood Glucose Control

Authors:Taku Yamagata, Aisling O'Kane, Amid Ayobi, Dmitri Katz, Katarzyna Stawarz, Paul Marshall, Peter Flach, Raúl Santos-Rodríguez
Date:2020-10-13 10:17:30

In this paper we investigate the use of model-based reinforcement learning to assist people with Type 1 Diabetes with insulin dose decisions. The proposed architecture consists of multiple Echo State Networks to predict blood glucose levels combined with a Model Predictive Controller for planning. An Echo State Network is a type of recurrent neural network that allows us to learn long-term dependencies in time series input data in an online manner. Additionally, we address the quantification of uncertainty for more robust control. Here, we used ensembles of Echo State Networks to capture model (epistemic) uncertainty. We evaluated the approach with the FDA-approved UVa/Padova Type 1 Diabetes simulator and compared the results against baseline algorithms such as Basal-Bolus controller and Deep Q-learning. The results suggest that the model-based reinforcement learning algorithm can perform as well as or better than the baseline algorithms for the majority of virtual Type 1 Diabetes person profiles tested.

Nearly Minimax Optimal Reward-free Reinforcement Learning

Authors:Zihan Zhang, Simon S. Du, Xiangyang Ji
Date:2020-10-12 17:51:19

We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases. In the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal. In the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment for at most $O\left( \frac{S^2A}{\epsilon^2}\text{poly}\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase. Here, $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only logarithmically with $H$, in contrast to all existing results which scale polynomially with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition for the dataset to plan for an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated accumulative rewards efficiently.

Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning?

Authors:Qiwen Cui, Lin F. Yang
Date:2020-10-12 13:13:01

It is believed that a model-based approach for reinforcement learning (RL) is the key to reduce sample complexity. However, the understanding of the sample optimality of model-based RL is still largely missing, even for the linear case. This work considers sample complexity of finding an $\epsilon$-optimal policy in a Markov decision process (MDP) that admits a linear additive feature representation, given only access to a generative model. We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver. We prove that under the anchor-state assumption, which implies implicit non-negativity in the feature space, the minimax sample complexity of finding an $\epsilon$-optimal policy in a $\gamma$-discounted MDP is $O(K/(1-\gamma)^3\epsilon^2)$, which only depends on the dimensionality $K$ of the feature space and has no dependence on the state or action space. We further extend our results to a relaxed setting where anchor-states may not exist and show that a plug-in approach can be sample efficient as well, providing a flexible approach to design model-based algorithms for RL.
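
The plug-in idea (estimate a model from generative samples, then hand it to any planner) can be sketched in the tabular case. This is a simplified illustration and not the feature-based estimator analyzed in the paper; the toy MDP, the deterministic sampler, and the choice of value iteration as the plug-in solver are assumptions.

```python
def estimate_model(sample_next_state, n_states, n_actions, n_samples):
    """Build an empirical transition model from generative-model queries."""
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(s, a)] += 1
            row.append([c / n_samples for c in counts])
        p_hat.append(row)
    return p_hat


def q_value(p_hat, reward, gamma, v, s, a):
    return reward[s][a] + gamma * sum(p * v[t] for t, p in enumerate(p_hat[s][a]))


def plan(p_hat, reward, gamma, iters=500):
    """Plug-in solver: value iteration run on the empirical model."""
    n_states, n_actions = len(p_hat), len(p_hat[0])
    v = [0.0] * n_states
    for _ in range(iters):
        v = [max(q_value(p_hat, reward, gamma, v, s, a) for a in range(n_actions))
             for s in range(n_states)]
    policy = [max(range(n_actions),
                  key=lambda a: q_value(p_hat, reward, gamma, v, s, a))
              for s in range(n_states)]
    return v, policy


# Toy MDP (assumed): action a moves to state a; only (state 1, action 1) pays 1.
# The sampler is deterministic here; a real generative model would be random.
reward = [[0.0, 0.0], [0.0, 1.0]]
p_hat = estimate_model(lambda s, a: a, n_states=2, n_actions=2, n_samples=10)
values, policy = plan(p_hat, reward, gamma=0.9)
```

Any other planner that accepts `p_hat` and `reward` could replace `plan` without touching the estimation step, which is the flexibility the plug-in approach is meant to provide.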

A DRL-based Multiagent Cooperative Control Framework for CAV Networks: a Graphic Convolution Q Network

Authors:Jiqian Dong, Sikai Chen, Paul Young Joun Ha, Yujie Li, Samuel Labi
Date:2020-10-12 03:53:58

A Connected Autonomous Vehicle (CAV) network can be defined as a collection of CAVs operating at different locations on a multilane corridor, which provides a platform to facilitate the dissemination of operational information as well as control instructions. Cooperation is crucial in CAV operating systems since it can greatly enhance operation in terms of safety and mobility, and high-level cooperation between CAVs can be achieved through joint planning and control within the CAV network. However, due to the highly dynamic and combinatorial nature of a multiagent driving task, such as the changing number of agents (CAVs) and the exponentially growing joint action space, achieving cooperative control is NP-hard and cannot be governed by simple rule-based methods. In addition, existing literature contains abundant information on autonomous driving's sensing technology and control logic but relatively little guidance on how to fuse the information acquired from collaborative sensing and build a decision processor on top of the fused information. In this paper, a novel Deep Reinforcement Learning (DRL) based approach combining Graphic Convolution Neural Network (GCN) and Deep Q Network (DQN), namely Graphic Convolution Q network (GCQ), is proposed as the information fusion module and decision processor. The proposed model can aggregate the information acquired from collaborative sensing and output safe and cooperative lane changing decisions for multiple CAVs so that individual intention can be satisfied even under highly dynamic and partially observed mixed traffic. The proposed algorithm can be deployed on centralized control infrastructures such as road-side units (RSU) or cloud platforms to improve the CAV operation.

LaND: Learning to Navigate from Disengagements

Authors:Gregory Kahn, Pieter Abbeel, Sergey Levine
Date:2020-10-09 17:21:42

Consistently testing autonomous mobile robots in real world scenarios is a necessary aspect of developing autonomous navigation systems. Each time the human safety monitor disengages the robot's autonomy system due to the robot performing an undesirable maneuver, the autonomy developers gain insight into how to improve the autonomy system. However, we believe that these disengagements not only show where the system fails, which is useful for troubleshooting, but also provide a direct learning signal by which the robot can learn to navigate. We present a reinforcement learning approach for learning to navigate from disengagements, or LaND. LaND learns a neural network model that predicts which actions lead to disengagements given the current sensory observation, and then at test time plans and executes actions that avoid disengagements. Our results demonstrate LaND can successfully learn to navigate in diverse, real world sidewalk environments, outperforming both imitation learning and reinforcement learning approaches. Videos, code, and other material are available on our website https://sites.google.com/view/sidewalk-learning
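
The test-time planning step can be sketched as choosing, among candidate action sequences, the one with the lowest predicted disengagement risk. The predictor below is a hand-written toy standing in for the learned neural network, and the observation field `lateral_offset`, the candidate plans, and the scoring by summed per-step probabilities are all assumptions for illustration.

```python
def toy_disengagement_model(observation, actions):
    """Hypothetical stand-in for the learned network.

    Returns one disengagement probability per planned step; risk grows as the
    robot drifts away from the sidewalk center.
    """
    offset = observation["lateral_offset"]
    probs = []
    for steer in actions:
        offset += steer
        probs.append(min(1.0, abs(offset)))
    return probs


def choose_plan(observation, candidate_plans, model):
    """Pick the action sequence with the lowest total predicted risk."""
    return min(candidate_plans, key=lambda plan: sum(model(observation, plan)))


obs = {"lateral_offset": 0.4}
plans = [[0.0, 0.0], [0.2, 0.2], [-0.2, -0.2]]
best_plan = choose_plan(obs, plans, toy_disengagement_model)
```

In this toy case the plan that steers back toward the center is selected, since the other two keep or increase the offset that the model associates with disengagements.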

CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning

Authors:Ossama Ahmed, Frederik Träuble, Anirudh Goyal, Alexander Neitz, Yoshua Bengio, Bernhard Schölkopf, Manuel Wüthrich, Stefan Bauer
Date:2020-10-08 23:01:13

Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning as well as precise low-level motor control. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.

Text-based RL Agents with Commonsense Knowledge: New Challenges, Environments and Baselines

Authors:Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Pushkar Shukla, Sadhana Kumaravel, Gerald Tesauro, Kartik Talamadupula, Mrinmaya Sachan, Murray Campbell
Date:2020-10-08 06:20:00

Text-based games have emerged as an important test-bed for Reinforcement Learning (RL) research, requiring RL agents to combine grounded language understanding with sequential decision making. In this paper, we examine the problem of infusing RL agents with commonsense knowledge. Such knowledge would allow agents to efficiently act in the world by pruning out implausible actions, and to perform look-ahead planning to determine how current actions might affect future world states. We design a new text-based gaming environment called TextWorld Commonsense (TWC) for training and evaluating RL agents with a specific kind of commonsense knowledge about objects, their attributes, and affordances. We also introduce several baseline RL agents which track the sequential context and dynamically retrieve the relevant commonsense knowledge from ConceptNet. We show that agents which incorporate commonsense knowledge in TWC perform better, while acting more efficiently. We conduct user-studies to estimate human performance on TWC and show that there is ample room for future improvement.

QarSUMO: A Parallel, Congestion-optimized Traffic Simulator

Authors:Hao Chen, Ke Yang, Stefano Giovanni Rizzo, Giovanna Vantini, Phillip Taylor, Xiaosong Ma, Sanjay Chawla
Date:2020-10-07 09:10:42

Traffic simulators are important tools for tasks such as urban planning and transportation management. Microscopic simulators allow per-vehicle movement simulation, but require longer simulation time. The simulation overhead is exacerbated when there is traffic congestion and most vehicles move slowly. This in particular hurts the productivity of emerging urban computing studies based on reinforcement learning, where traffic simulations are heavily and repeatedly used for designing policies to optimize traffic related tasks. In this paper, we develop QarSUMO, a parallel, congestion-optimized version of the popular SUMO open-source traffic simulator. QarSUMO performs high-level parallelization on top of SUMO, to utilize powerful multi-core servers and enables future extension to multi-node parallel simulation if necessary. The proposed design, while partly sacrificing speedup, makes QarSUMO compatible with future SUMO improvements. We further contribute such an improvement by modifying the SUMO simulation engine for congestion scenarios where the update computation of consecutive and slow-moving vehicles can be simplified. We evaluate QarSUMO with both real-world and synthetic road network and traffic data, and examine its execution time as well as simulation accuracy relative to the original, sequential SUMO.

Reinforcement Learning in Deep Structured Teams: Initial Results with Finite and Infinite Valued Features

Authors:Jalal Arabneydi, Masoud Roudneshin, Amir G. Aghdam
Date:2020-10-06 16:45:49

In this paper, we consider Markov chain and linear quadratic models for deep structured teams with discounted and time-average cost functions under two non-classical information structures, namely, deep state sharing and no sharing. In deep structured teams, agents are coupled in dynamics and cost functions through deep state, where deep state refers to a set of orthogonal linear regressions of the states. In this article, we consider a homogeneous linear regression for Markov chain models (i.e., empirical distribution of states) and a few orthonormal linear regressions for linear quadratic models (i.e., weighted average of states). Some planning algorithms are developed for the case when the model is known, and some reinforcement learning algorithms are proposed for the case when the model is not known completely. The convergence of two model-free (reinforcement learning) algorithms, one for Markov chain models and one for linear quadratic models, is established. The results are then applied to a smart grid.
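
The two kinds of deep state named in the abstract can be illustrated with a minimal sketch. It assumes scalar or discrete agent states; the function names, the `weights` argument and the 1/n normalisation of the weighted average are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def empirical_distribution(agent_states, state_space):
    """Fraction of agents in each state (Markov chain setting)."""
    counts = Counter(agent_states)
    n = len(agent_states)
    return {s: counts[s] / n for s in state_space}


def weighted_average(agent_states, weights):
    """(1/n) * sum_i weights[i] * agent_states[i] (linear quadratic setting).

    The 1/n normalisation and the name `weights` are assumptions of this sketch.
    """
    if len(agent_states) != len(weights):
        raise ValueError("one weight per agent is required")
    n = len(agent_states)
    return sum(w * x for w, x in zip(weights, agent_states)) / n


# Four agents with binary states, e.g. 0 = "off", 1 = "on".
dist = empirical_distribution([0, 1, 1, 0], state_space=[0, 1])

# Three agents with scalar states and equal weights.
avg = weighted_average([1.0, 2.0, 3.0], weights=[1.0, 1.0, 1.0])
```

In both cases each agent only needs the aggregate, not the individual states of the other agents, which is what makes the coupling tractable for large teams.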

Heterogeneous Multi-Agent Reinforcement Learning for Unknown Environment Mapping

Authors:Ceyer Wakilpoor, Patrick J. Martin, Carrie Rebhuhn, Amanda Vu
Date:2020-10-06 12:23:05

Reinforcement learning in heterogeneous multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in homogeneous settings and simple benchmarks. In this work, we present an actor-critic algorithm that allows a team of heterogeneous agents to learn decentralized control policies for covering an unknown environment. This task is of interest to national security and emergency response organizations that would like to enhance situational awareness in hazardous areas by deploying teams of unmanned aerial vehicles. To solve this multi-agent coverage path planning problem in unknown environments, we augment a multi-agent actor-critic architecture with a new state encoding structure and triplet learning loss to support heterogeneous agent learning. We developed a simulation environment that includes real-world environmental factors such as turbulence, delayed communication, and agent loss, to train teams of agents as well as probe their robustness and flexibility to such disturbances.

Offline Learning for Planning: A Summary

Authors:Giorgio Angelotti, Nicolas Drougard, Caroline Ponzoni Carvalho Chanel
Date:2020-10-05 11:41:11

The training of autonomous agents often requires expensive and unsafe trial-and-error interactions with the environment. Nowadays, several data sets containing recorded experiences of intelligent agents performing various tasks, spanning from the control of unmanned vehicles to human-robot interaction and medical applications, are accessible on the internet. With the intention of limiting the costs of the learning procedure, it is convenient to exploit the information that is already available rather than collecting new data. Nevertheless, the inability to augment the batch can lead the autonomous agents to develop far from optimal behaviours when the sampled experiences do not allow for a good estimate of the true distribution of the environment. Offline learning is the area of machine learning concerned with efficiently obtaining an optimal policy with a batch of previously collected experiences without further interaction with the environment. In this paper we outline the ideas motivating the development of the state-of-the-art offline learning baselines. The listed methods consist in the introduction of epistemic uncertainty dependent constraints during the classical resolution of a Markov Decision Process, with and without function approximators, that aim to alleviate the bad effects of the distributional mismatch between the available samples and the real world. We provide comments on the practical utility of the theoretical bounds that justify the application of these algorithms and suggest the utilization of Generative Adversarial Networks to estimate the distributional shift that affects all of the proposed model-free and model-based approaches.

A Distributed Model-Free Ride-Sharing Approach for Joint Matching, Pricing, and Dispatching using Deep Reinforcement Learning

Authors:Marina Haliem, Ganapathy Mani, Vaneet Aggarwal, Bharat Bhargava
Date:2020-10-05 03:13:47

Significant development of ride-sharing services presents a plethora of opportunities to transform urban mobility by providing personalized and convenient transportation while ensuring efficiency of large-scale ride pooling. However, a core problem for such services is route planning for each driver to fulfill the dynamically arriving requests while satisfying given constraints. Current models are mostly limited to static routes with only two rides per vehicle (optimally) or three (with heuristics). In this paper, we present a dynamic, demand-aware, and pricing-based vehicle-passenger matching and route planning framework that (1) dynamically generates optimal routes for each vehicle based on online demand, pricing associated with each ride, vehicle capacities and locations, using a matching algorithm that starts greedily and optimizes over time using an insertion operation; (2) involves drivers in the decision-making process by allowing them to propose a different price based on the expected reward for a particular ride as well as the destination locations for future rides, which is influenced by supply and demand computed by the Deep Q-network; (3) allows customers to accept or reject rides based on their set of preferences with respect to pricing and delay windows, vehicle type and carpooling preferences; and (4) re-balances idle vehicles, based on demand prediction, by dispatching them to the areas of anticipated high demand using deep Reinforcement Learning (RL). Our framework is validated using the New York City Taxi public dataset; however, we consider different vehicle types and designed customer utility functions to validate the setup and study different settings. Experimental results show the effectiveness of our approach in real-time and large scale settings.
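
The abstract does not spell out the insertion operation. As a rough illustration of the general idea only, the sketch below implements a generic cheapest-insertion step: a new stop is placed at the position in an existing route that adds the least travel distance. Euclidean distance, the function names and the example coordinates are all assumptions; the paper's operation additionally accounts for pricing, capacities and delay windows, which are omitted here.

```python
import math


def route_length(route):
    """Total Euclidean length of a route given as a list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


def cheapest_insertion(route, stop):
    """Insert `stop` where it increases the route length the least.

    route[0] is treated as the vehicle's current location and stays first.
    Returns the new route and its total length.
    """
    best_route, best_cost = None, math.inf
    for i in range(1, len(route) + 1):
        candidate = route[:i] + [stop] + route[i:]
        cost = route_length(candidate)
        if cost < best_cost:
            best_route, best_cost = candidate, cost
    return best_route, best_cost


# Hypothetical vehicle at the origin with two planned stops.
route = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)]
new_route, cost = cheapest_insertion(route, (2.0, 0.0))
```

A new request that lies on the way to an existing stop is picked up with no detour, which is why insertion is a natural way to refine a route that was first built greedily.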

Deep Reinforcement Learning for Collaborative Edge Computing in Vehicular Networks

Authors:Mushu Li, Jie Gao, Lian Zhao, Xuemin Shen
Date:2020-10-05 00:06:37

Mobile edge computing (MEC) is a promising technology to support mission-critical vehicular applications, such as intelligent path planning and safety applications. In this paper, a collaborative edge computing framework is developed to reduce the computing service latency and improve service reliability for vehicular networks. First, a task partition and scheduling algorithm (TPSA) is proposed to decide the workload allocation and schedule the execution order of the tasks offloaded to the edge servers given a computation offloading strategy. Second, an artificial intelligence (AI) based collaborative computing approach is developed to determine the task offloading, computing, and result delivery policy for vehicles. Specifically, the offloading and computing problem is formulated as a Markov decision process. A deep reinforcement learning technique, i.e., deep deterministic policy gradient, is adopted to find the optimal solution in a complex urban transportation network. By our approach, the service cost, which includes computing service latency and service failure penalty, can be minimized via the optimal workload assignment and server selection in collaborative computing. Simulation results show that the proposed AI-based collaborative computing approach can adapt to a highly dynamic environment with outstanding performance.

Beyond Tabula-Rasa: a Modular Reinforcement Learning Approach for Physically Embedded 3D Sokoban

Authors:Peter Karkus, Mehdi Mirza, Arthur Guez, Andrew Jaegle, Timothy Lillicrap, Lars Buesing, Nicolas Heess, Theophane Weber
Date:2020-10-03 07:48:06

Intelligent robots need to achieve abstract objectives using concrete, spatiotemporally complex sensory information and motor control. Tabula rasa deep reinforcement learning (RL) has tackled demanding tasks in terms of either visual, abstract, or physical reasoning, but solving these jointly remains a formidable challenge. One recent, unsolved benchmark task that integrates these challenges is Mujoban, where a robot needs to arrange 3D warehouses generated from 2D Sokoban puzzles. We explore whether integrated tasks like Mujoban can be solved by composing RL modules together in a sense-plan-act hierarchy, where modules have well-defined roles similarly to classic robot architectures. Unlike classic architectures that are typically model-based, we use only model-free modules trained with RL or supervised learning. We find that our modular RL approach dramatically outperforms the state-of-the-art monolithic RL agent on Mujoban. Further, learned modules can be reused when, e.g., using a different robot platform to solve the same task. Together our results give strong evidence for the importance of research into modular RL designs. Project website: https://sites.google.com/view/modular-rl/

Model-Free Reinforcement Learning for Stochastic Games with Linear Temporal Logic Objectives

Authors:Alper Kamil Bozkurt, Yu Wang, Michael Zavlanos, Miroslav Pajic
Date:2020-10-02 15:29:32

We study the problem of synthesizing control strategies for Linear Temporal Logic (LTL) objectives in unknown environments. We model this problem as a turn-based zero-sum stochastic game between the controller and the environment, where the transition probabilities and the model topology are fully unknown. The winning condition for the controller in this game is the satisfaction of the given LTL specification, which can be captured by the acceptance condition of a deterministic Rabin automaton (DRA) directly derived from the LTL specification. We introduce a model-free reinforcement learning (RL) methodology to find a strategy that maximizes the probability of satisfying a given LTL specification when the Rabin condition of the derived DRA has a single accepting pair. We then generalize this approach to LTL formulas for which the Rabin condition has a larger number of accepting pairs, providing a lower bound on the satisfaction probability. Finally, we illustrate applicability of our RL method on two motion planning case studies.

MADRaS : Multi Agent Driving Simulator

Authors:Anirban Santara, Sohan Rudra, Sree Aditya Buridi, Meha Kaushik, Abhishek Naik, Bharat Kaul, Balaraman Ravindran
Date:2020-10-02 13:38:49

In this work, we present MADRaS, an open-source multi-agent driving simulator for use in the design and evaluation of motion planning algorithms for autonomous driving. MADRaS provides a platform for constructing a wide variety of highway and track driving scenarios where multiple driving agents can train for motion planning tasks using reinforcement learning and other machine learning algorithms. MADRaS is built on TORCS, an open-source car-racing simulator. TORCS offers a variety of cars with different dynamic properties and driving tracks with different geometries and surface properties. MADRaS inherits these functionalities from TORCS and introduces support for multi-agent training, inter-vehicular communication, noisy observations, stochastic actions, and custom traffic cars whose behaviours can be programmed to simulate challenging traffic conditions encountered in the real world. MADRaS can be used to create driving tasks whose complexities can be tuned along eight axes in well-defined steps. This makes it particularly suited for curriculum and continual learning. MADRaS is lightweight and it provides a convenient OpenAI Gym interface for independent control of each car. Apart from the primitive steering-acceleration-brake control mode of TORCS, MADRaS offers a hierarchical track-position -- speed control that can potentially be used to achieve better generalization. MADRaS uses multiprocessing to run each agent as a parallel process for efficiency and integrates well with popular reinforcement learning libraries like RLLib.

Goal-Auxiliary Actor-Critic for 6D Robotic Grasping with Point Clouds

Authors:Lirui Wang, Yu Xiang, Wei Yang, Arsalan Mousavian, Dieter Fox
Date:2020-10-02 07:42:00

6D robotic grasping beyond top-down bin-picking scenarios is a challenging task. Previous solutions based on 6D grasp synthesis with robot motion planning usually operate in an open-loop setting, which makes them sensitive to grasp synthesis errors. In this work, we propose a new method for learning closed-loop control policies for 6D grasping. Our policy takes a segmented point cloud of an object from an egocentric camera as input, and outputs continuous 6D control actions of the robot gripper for grasping the object. We combine imitation learning and reinforcement learning and introduce a goal-auxiliary actor-critic algorithm for policy learning. We demonstrate that our learned policy can be integrated into a tabletop 6D grasping system and a human-robot handover system to improve the grasping performance of unseen objects. Our videos and code can be found at https://sites.google.com/view/gaddpg .

Facilitating Connected Autonomous Vehicle Operations Using Space-weighted Information Fusion and Deep Reinforcement Learning Based Control

Authors:Jiqian Dong, Sikai Chen, Yujie Li, Runjia Du, Aaron Steinfeld, Samuel Labi
Date:2020-09-30 13:38:32

The connectivity aspect of connected autonomous vehicles (CAV) is beneficial because it facilitates dissemination of traffic-related information to vehicles through Vehicle-to-External (V2X) communication. Onboard sensing equipment including LiDAR and camera can reasonably characterize the traffic environment in the immediate locality of the CAV. However, their performance is limited by their sensor range (SR). On the other hand, longer-range information is helpful for characterizing imminent conditions downstream. By combining the short- and long-range information at the same time, the CAV can construct a comprehensive picture of its surrounding environment and thereby facilitate informed, safe, and effective movement planning in the short term (local decisions including lane change) and long term (route choice). In this paper, we describe a Deep Reinforcement Learning based approach that integrates the data collected through sensing and connectivity capabilities from other vehicles located in the proximity of the CAV and from those located further downstream, and we use the fused data to guide lane changing, a specific context of CAV operations. In addition, recognizing the importance of the connectivity range (CR) to the performance of not only the algorithm but also of the vehicle in the actual driving environment, we carried out a case study. The case study demonstrates the application of the proposed algorithm and identifies the appropriate CR for each level of prevailing traffic density. It is expected that implementation of the algorithm in CAVs can enhance the safety and mobility associated with CAV driving operations. From a general perspective, its implementation can provide guidance to connectivity equipment manufacturers and CAV operators regarding the default CR settings for CAVs or the recommended CR setting in a given traffic environment.

Learning to swim in potential flow

Authors:Yusheng Jiao, Feng Ling, Sina Heydari, Nicolas Heess, Josh Merel, Eva Kanso
Date:2020-09-30 06:31:27

Fish swim by undulating their bodies. These propulsive motions require coordinated shape changes of a body that interacts with its fluid environment, but the specific shape coordination that leads to robust turning and swimming motions remains unclear. To address the problem of underwater motion planning, we propose a simple model of a three-link fish swimming in a potential flow environment and we use model-free reinforcement learning for shape control. We arrive at optimal shape changes for two swimming tasks: swimming in a desired direction and swimming towards a known target. This fish model belongs to a class of problems in geometric mechanics, known as driftless dynamical systems, which allow us to analyze the swimming behavior in terms of geometric phases over the shape space of the fish. These geometric methods are less intuitive in the presence of drift. Here, we use the shape space analysis as a tool for assessing, visualizing, and interpreting the control policies obtained via reinforcement learning in the absence of drift. We then examine the robustness of these policies to drift-related perturbations. Although the fish has no direct control over the drift itself, it learns to take advantage of the presence of moderate drift to reach its target.

Bridging the gap between Markowitz planning and deep reinforcement learning

Authors:Eric Benhamou, David Saltiel, Sandrine Ungari, Abhishek Mukhopadhyay
Date:2020-09-30 04:03:27

While researchers in the asset management industry have mostly focused on financial and risk planning techniques such as the Markowitz efficient frontier, minimum variance, maximum diversification or equal risk parity, another community in machine learning has, in parallel, started working on reinforcement learning, and more particularly deep reinforcement learning, to solve other decision-making problems for challenging tasks such as autonomous driving, robot learning and, on a more conceptual side, game solving such as Go. This paper aims to bridge the gap between these two approaches by showing that Deep Reinforcement Learning (DRL) techniques can shed new light on portfolio allocation thanks to a more general optimization setting that casts portfolio allocation as an optimal control problem that is not just a one-step optimization, but rather a continuous control optimization with a delayed reward. The advantages are numerous: (i) DRL maps market conditions directly to actions by design and hence should adapt to a changing environment, (ii) DRL does not rely on any traditional financial risk assumptions, such as risk being represented by variance, (iii) DRL can incorporate additional data and be a multi-input method, as opposed to more traditional optimization methods. We present some encouraging experimental results using convolutional networks.
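
For contrast with the DRL formulation, the classical one-step minimum-variance allocation mentioned above has a closed form in the two-asset case. The sketch below assumes a fully invested portfolio with short selling allowed; the function names and the example variances are illustrative and do not come from the paper.

```python
def min_variance_weights(var1, var2, cov12):
    """Weights (w1, w2), w1 + w2 = 1, minimising portfolio variance.

    Closed form for two assets: w1 = (var2 - cov12) / (var1 + var2 - 2*cov12).
    """
    denom = var1 + var2 - 2.0 * cov12  # equals Var(r1 - r2)
    if denom <= 0.0:
        raise ValueError("r1 - r2 has zero variance; the minimiser is not unique")
    w1 = (var2 - cov12) / denom
    return w1, 1.0 - w1


def portfolio_variance(w1, w2, var1, var2, cov12):
    """Variance of the portfolio return w1*r1 + w2*r2."""
    return w1 * w1 * var1 + w2 * w2 * var2 + 2.0 * w1 * w2 * cov12


# Hypothetical uncorrelated assets with variances 0.04 and 0.09.
w1, w2 = min_variance_weights(0.04, 0.09, 0.0)
var_p = portfolio_variance(w1, w2, 0.04, 0.09, 0.0)
```

This allocation depends only on the covariance estimate and is recomputed independently at each rebalancing date, which is the one-step, variance-as-risk setting that the abstract contrasts with a delayed-reward control problem.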

Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Authors:Zihan Zhang, Xiangyang Ji, Simon S. Du
Date:2020-09-28 17:52:32

Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to the long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity. We consider episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, and total reward bounded by $1$, where the agent plays for $K$ episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter since it is based on a well-designed monotonic value function. In particular, the constants in the bonus should be subtly set to ensure optimism and monotonicity. We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \mathrm{poly}\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of contextual bandits up to logarithmic terms. Notably, this result 1) exponentially improves the state-of-the-art polynomial-time algorithms by Dann et al. [2019] and Zanette et al. [2019] in terms of the dependency on $H$, and 2) exponentially improves the running time in [Wang et al. 2020] and significantly improves the dependency on $S$, $A$ and $K$ in sample complexity.
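
As a worked reading of the stated bound (an added step, not a claim from the paper), the regime in which the leading term dominates the lower-order term follows from squaring both sides:

```latex
\sqrt{SAK} \ge S^2 A
\iff SAK \ge S^4 A^2
\iff K \ge S^3 A .
```

So once $K \ge S^3 A$, the sum $\sqrt{SAK} + S^2 A$ is at most $2\sqrt{SAK}$, and the upper bound matches the $\Omega\left(\sqrt{SAK}\right)$ lower bound up to the logarithmic factor.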

Graph neural induction of value iteration

Authors:Andreea Deac, Pierre-Luc Bacon, Jian Tang
Date:2020-09-26 14:09:16

Many reinforcement learning tasks can benefit from explicit planning based on an internal model of the environment. Previously, such planning components have been incorporated through a neural network that partially aligns with the computational graph of value iteration. Such networks have so far been focused on restrictive environments (e.g. grid-worlds), and modelled the planning procedure only indirectly. We relax these constraints, proposing a graph neural network (GNN) that executes the value iteration (VI) algorithm, across arbitrary environment models, with direct supervision on the intermediate steps of VI. The results indicate that GNNs are able to model value iteration accurately, recovering favourable metrics and policies across a variety of out-of-distribution tests. This suggests that GNN executors with strong supervision are a viable component within deep reinforcement learning systems.
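
For reference, the algorithm the GNN is trained to imitate is standard tabular value iteration. The sketch below runs it on a small made-up MDP; the transition format, the function name and the example chain are assumptions for illustration, and the intermediate value tables are the kind of per-step targets that direct supervision on VI would use.

```python
def value_iteration(P, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration.

    P[s][a] is a list of (probability, next_state, reward) triples.
    Returns the value table and a greedy policy.
    """
    n_states = len(P)

    def q_value(V, s, a):
        # Expected one-step return of taking action a in state s.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = [0.0] * n_states
    for _ in range(max_iters):
        # One Bellman optimality backup for every state.
        new_V = [max(q_value(V, s, a) for a in range(len(P[s])))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    policy = [max(range(len(P[s])), key=lambda a: q_value(V, s, a))
              for s in range(n_states)]
    return V, policy


# Hypothetical 3-state chain; state 2 is absorbing with zero reward.
# Action 0 stays in place, action 1 moves right (reward 1 on entering state 2).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 0.0)]],
    [[(1.0, 1, 0.0)], [(1.0, 2, 1.0)]],
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],
]
V, policy = value_iteration(P)
```

Each backup only combines a state's value with those of its successor states, which is the local, message-passing structure that makes a graph neural network a natural executor for it.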

Deep Reinforcement Learning with a Stage Incentive Mechanism of Dense Reward for Robotic Trajectory Planning

Authors:Gang Peng, Jin Yang, Xinde Lia, Mohammad Omar Khyam
Date:2020-09-25 07:36:32

(This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.) To improve the efficiency of deep reinforcement learning (DRL)-based methods for robot manipulator trajectory planning in random working environments, we present three dense reward functions. These rewards differ from the traditional sparse reward. First, a posture reward function is proposed to speed up the learning process with a more reasonable trajectory by modeling the distance and direction constraints, which can reduce the blindness of exploration. Second, a stride reward function is proposed to improve the stability of the learning process by modeling the distance and movement distance of joint constraints. Finally, in order to further improve learning efficiency, we are inspired by the cognitive process of human behavior and propose a stage incentive mechanism, including a hard stage incentive reward function and a soft stage incentive reward function. Extensive experiments show that the soft stage incentive reward function is able to improve the convergence rate by up to 46.9% with the state-of-the-art DRL methods. The percentage increase in the convergence mean reward was 4.4-15.5% and the percentage decreases with respect to standard deviation were 21.9-63.2%. In the evaluation experiments, the success rate of trajectory planning for a robot manipulator reached 99.6%.

Continual Model-Based Reinforcement Learning with Hypernetworks

Authors:Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti
Date:2020-09-25 01:46:26

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model - and the pause required between plan executions - grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it includes dynamics learning sessions that do not revisit training data from previous tasks, so it only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and performs competitively with baselines that remember an ever increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening. Our project website with videos is at this link https://rvl.cs.toronto.edu/blog/2020/hypercrl

Motion Planning by Reinforcement Learning for an Unmanned Aerial Vehicle in Virtual Open Space with Static Obstacles

Authors:Sanghyun Kim, Jongmin Park, Jae-Kwan Yun, Jiwon Seo
Date:2020-09-24 16:42:56

In this study, we applied reinforcement learning based on the proximal policy optimization algorithm to perform motion planning for an unmanned aerial vehicle (UAV) in an open space with static obstacles. The application of reinforcement learning through a real UAV has several limitations such as time and cost; thus, we used the Gazebo simulator to train a virtual quadrotor UAV in a virtual environment. As the reinforcement learning progressed, the mean reward and goal rate of the model increased. Furthermore, the test of the trained model shows that the UAV reaches the goal with an 81% goal rate using the simple reward function suggested in this work.
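
The abstract names proximal policy optimization without giving details. As background only, the sketch below shows the per-sample clipped surrogate term at the core of PPO; the function name and the clipping range `eps=0.2` are assumptions of this sketch, not settings reported by the authors.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one sample.

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimated advantage of action a in state s
    The policy is updated to maximise the average of this term.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, ratio already above 1 + eps: the gain is capped.
capped = ppo_clip_term(1.5, 1.0)

# Bad action, ratio already below 1 - eps: no extra credit for pushing further.
floored = ppo_clip_term(0.5, -1.0)

# Bad action made more likely: the full penalty is kept.
penalised = ppo_clip_term(1.5, -1.0)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy far from the one that collected the data, which is useful when simulator rollouts are expensive to gather.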

Multi-Agent Deep Reinforcement Learning Based Trajectory Planning for Multi-UAV Assisted Mobile Edge Computing

Authors:Liang Wang, Kezhi Wang, Cunhua Pan, Wei Xu, Nauman Aslam, Lajos Hanzo
Date:2020-09-23 17:44:07

An unmanned aerial vehicle (UAV)-aided mobile edge computing (MEC) framework is proposed, where several UAVs having different trajectories fly over the target area and support the user equipments (UEs) on the ground. We aim to jointly optimize the geographical fairness among all the UEs, the fairness of each UAV's UE-load and the overall energy consumption of UEs. The above optimization problem includes both integer and continuous variables and is challenging to solve. To address this problem, a multi-agent deep reinforcement learning based trajectory control algorithm is proposed for managing the trajectory of each UAV independently, where the popular Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method is applied. Given the UAVs' trajectories, a low-complexity approach is introduced for optimizing the offloading decisions of UEs. We show that our proposed solution offers considerable performance gains over other traditional algorithms in terms of the fairness for serving UEs, the fairness of UE-load at each UAV and the energy consumption for all the UEs.

Hierarchical Affordance Discovery using Intrinsic Motivation

Authors:Alexandre Manoury, Sao Mai Nguyen, Cédric Buche
Date:2020-09-23 07:18:21

To be capable of lifelong learning in a real-life environment, robots have to tackle multiple challenges. Being able to relate physical properties they may observe in their environment to possible interactions they may have is one of them. This skill, named affordance learning, is strongly related to embodiment and is mastered through each person's development: each individual learns affordances differently through their own interactions with their surroundings. Current methods for affordance learning usually use either fixed actions to learn these affordances or focus on static setups involving a robotic arm to be operated. In this article, we propose an algorithm using intrinsic motivation to guide the learning of affordances for a mobile robot. This algorithm is capable of autonomously discovering, learning and adapting interrelated affordances without pre-programmed actions. Once learned, these affordances may be used by the algorithm to plan sequences of actions in order to perform tasks of various difficulties. We then present one experiment and analyse our system before comparing it with other approaches from reinforcement learning and affordance learning.

What is the Reward for Handwriting? -- Handwriting Generation by Imitation Learning

Authors:Keisuke Kanda, Brian Kenji Iwana, Seiichi Uchida
Date:2020-09-23 07:04:08

Analyzing the handwriting generation process is an important issue and has been tackled by various generation models, such as kinematics-based models and stochastic models. In this study, we use a reinforcement learning (RL) framework to realize handwriting generation with careful future planning ability. In fact, the handwriting process of human beings is also supported by their future planning ability; for example, the ability is necessary to generate a closed trajectory like '0' because any shortsighted model, such as a Markovian model, cannot generate it. For the algorithm, we employ generative adversarial imitation learning (GAIL). Typical RL algorithms require the manual definition of the reward function, which is crucial for controlling the generation process. In contrast, GAIL trains the reward function along with the other modules of the framework. In other words, through GAIL, we can understand the reward of the handwriting generation process from handwriting examples. Our experimental results qualitatively and quantitatively show that the learned reward catches the trends in handwriting generation and thus GAIL is well suited for the acquisition of handwriting behavior.

Latent Representation Prediction Networks

Authors:Hlynur Davíð Hlynsson, Merlin Schüler, Robin Schiewer, Tobias Glasmachers, Laurenz Wiskott
Date:2020-09-20 14:26:03

Deeply-learned planning methods are often based on learning representations that are optimized for unrelated tasks. For example, they might be trained on reconstructing the environment. These representations are then combined with predictor functions for simulating rollouts to navigate the environment. We find this principle of learning representations unsatisfying and propose to learn them such that they are directly optimized for the task at hand: to be maximally predictable for the predictor function. This results in representations that are by design optimal for the downstream task of planning, where the learned predictor function is used as a forward model. To this end, we propose a new way of jointly learning this representation along with the prediction function, a system we dub Latent Representation Prediction Network (LARP). The prediction function is used as a forward model for search on a graph in a viewpoint-matching task and the representation learned to maximize predictability is found to outperform a pre-trained representation. Our approach is shown to be more sample-efficient than standard reinforcement learning methods and our learned representation transfers successfully to dissimilar objects.

POMP: Pomcp-based Online Motion Planning for active visual search in indoor environments

Authors:Yiming Wang, Francesco Giuliari, Riccardo Berra, Alberto Castellini, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Francesco Setti
Date:2020-09-17 08:23:50

In this paper we focus on the problem of learning an optimal policy for Active Visual Search (AVS) of objects in known indoor environments with an online setup. Our POMP method uses as input the current pose of an agent (e.g. a robot) and an RGB-D frame. The task is to plan the next move that brings the agent closer to the target object. We model this problem as a Partially Observable Markov Decision Process solved by a Monte-Carlo planning approach. This allows us to make decisions on the next moves by iterating over the known scenario at hand, exploring the environment and searching for the object at the same time. Unlike the current state of the art in Reinforcement Learning, POMP does not require extensive and expensive (in time and computation) labelled data, making it very agile in solving AVS in small and medium real scenarios. We only require the floormap of the environment, information that is usually available or that can be easily extracted from a single a priori exploration run. We validate our method on the publicly available AVD benchmark, achieving an average success rate of 0.76 with an average path length of 17.1, performing close to the state of the art but without any training needed. Additionally, we show experimentally the robustness of our method when the quality of the object detection goes from ideal to faulty.

Reward Maximisation through Discrete Active Inference

Authors:Lancelot Da Costa, Noor Sajid, Thomas Parr, Karl Friston, Ryan Smith
Date:2020-09-17 07:13:59

Active inference is a probabilistic framework for modelling the behaviour of biological and artificial agents, which derives from the principle of minimising free energy. In recent years, this framework has successfully been applied to a variety of situations where the goal was to maximise reward, offering comparable and sometimes superior performance to alternative approaches. In this paper, we clarify the connection between reward maximisation and active inference by demonstrating how and when active inference agents perform actions that are optimal for maximising reward. Precisely, we show the conditions under which active inference produces the optimal solution to the Bellman equation--a formulation that underlies several approaches to model-based reinforcement learning and control. On partially observed Markov decision processes, the standard active inference scheme can produce Bellman optimal actions for planning horizons of 1, but not beyond. In contrast, a recently developed recursive active inference scheme (sophisticated inference) can produce Bellman optimal actions on any finite temporal horizon. We append the analysis with a discussion of the broader relationship between active inference and reinforcement learning.
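
Illustrative sketch (not from the paper): the benchmark the abstract refers to, Bellman optimality on a finite horizon, can be computed by backward induction. The three-state chain, its rewards and all function names below are hypothetical and only show why a one-step horizon and a longer horizon can prescribe different actions. The sketch implements the standard Bellman recursion, not active inference.

```python
def backward_induction(states, actions, transition, reward, horizon):
    """Finite-horizon Bellman recursion.

    Returns (values, policy); policy[t][s] is the optimal action in state s
    when horizon - t steps remain, values[s] is the optimal return from s.
    """
    values = {s: 0.0 for s in states}  # nothing left to collect after the last step
    policy = []
    for _ in range(horizon):
        new_values, step_policy = {}, {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = reward(s, a) + sum(p * values[s2] for s2, p in transition(s, a).items())
                if q > best_q:
                    best_a, best_q = a, q
            new_values[s], step_policy[s] = best_q, best_a
        values = new_values
        policy.insert(0, step_policy)  # earlier time steps go to the front
    return values, policy


def transition(s, a):
    # Deterministic chain 0 -> 1 -> 2; state 2 is absorbing.
    return {s + 1: 1.0} if a == "right" and s < 2 else {s: 1.0}


def reward(s, a):
    if s == 0 and a == "stay":
        return 0.1   # small immediate payoff
    if s == 1 and a == "right":
        return 1.0   # large payoff that needs two steps from state 0
    return 0.0


states, actions = [0, 1, 2], ["stay", "right"]
values_h1, policy_h1 = backward_induction(states, actions, transition, reward, horizon=1)
values_h2, policy_h2 = backward_induction(states, actions, transition, reward, horizon=2)
```

With one step left, staying in state 0 is optimal; with two steps left, moving right is, because the larger reward becomes reachable.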

An interpretable planning bot for pancreas stereotactic body radiation therapy

Authors:Jiahan Zhang, Chunhao Wang, Yang Sheng, Manisha Palta, Brian Czito, Christopher Willett, Jiang Zhang, P James Jensen, Fang-Fang Yin, Qiuwen Wu, Yaorong Ge, Q Jackie Wu
Date:2020-09-17 01:06:21

Pancreas stereotactic body radiotherapy (SBRT) treatment planning requires planners to make sequential, time-consuming interactions with the treatment planning system (TPS) to reach the optimal dose distribution. We seek to develop a reinforcement learning (RL)-based planning bot to systematically address complex tradeoffs and achieve high plan quality consistently and efficiently. The focus of pancreas SBRT planning is finding a balance between organs-at-risk (OAR) sparing and planning target volume (PTV) coverage. Planners evaluate dose distributions and make planning adjustments to optimize PTV coverage while adhering to OAR dose constraints. We have formulated such interactions between the planner and the TPS into a finite-horizon RL model. First, planning status features are evaluated based on human planner experience and defined as planning states. Second, planning actions are defined to represent steps that planners would commonly implement to address different planning needs. Finally, we have derived a reward system based on an objective function guided by physician-assigned constraints. The planning bot trained itself with 48 plans augmented from 16 previously treated patients and generated plans for 24 cases in a separate validation set. All 24 bot-generated plans achieve similar PTV coverages compared to clinical plans while satisfying all clinical planning constraints. Moreover, the knowledge learned by the bot can be visualized and interpreted as consistent with human planning knowledge, and the knowledge maps learned in separate training sessions are consistent, indicating reproducibility of the learning process.

Time your hedge with Deep Reinforcement Learning

Authors:Eric Benhamou, David Saltiel, Sandrine Ungari, Abhishek Mukhopadhyay
Date:2020-09-16 06:43:41

Can an asset manager plan the optimal timing for her/his hedging strategies given market conditions? The standard approach based on Markowitz or other more or less sophisticated financial rules aims to find the best portfolio allocation thanks to forecasted expected returns and risk but fails to fully relate market conditions to hedging strategy decisions. In contrast, Deep Reinforcement Learning (DRL) can tackle this challenge by creating a dynamic dependency between market information and hedging strategy allocation decisions. In this paper, we present a realistic and augmented DRL framework that: (i) uses additional contextual information to decide an action, (ii) has a one-period lag between observations and actions to account for the one-day lag turnover of common asset managers to rebalance their hedge, (iii) is fully tested in terms of stability and robustness thanks to a repetitive train-test method called anchored walk forward training, similar in spirit to k-fold cross validation for time series, and (iv) allows managing leverage of our hedging strategy. Our experiment for an augmented asset manager interested in sizing and timing his hedges shows that our approach achieves superior returns and lower risk.
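
Illustrative sketch (not from the paper): the abstract names anchored walk forward training without giving details. Assuming the usual meaning, where the training window always starts at the first sample and grows while the test window slides forward, the splits could be generated as follows. The function name and parameters are hypothetical.

```python
def anchored_walk_forward(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) with a fixed start and a growing training window."""
    train_end = initial_train
    while train_end + test_size <= n_samples:
        # Training always starts at index 0 (the anchor); testing follows it in time.
        yield range(0, train_end), range(train_end, train_end + test_size)
        train_end += test_size


# Ten time steps, four used for the first fit, two per out-of-sample test block.
splits = [(list(train), list(test)) for train, test in anchored_walk_forward(10, 4, 2)]
```

Each test block lies strictly after its training data, so no future information leaks into a fit, which is the property that makes the scheme suitable for time series.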

Physically Embedded Planning Problems: New Challenges for Reinforcement Learning

Authors:Mehdi Mirza, Andrew Jaegle, Jonathan J. Hunt, Arthur Guez, Saran Tunyasuvunakool, Alistair Muldal, Théophane Weber, Peter Karkus, Sébastien Racanière, Lars Buesing, Timothy Lillicrap, Nicolas Heess
Date:2020-09-11 16:56:33

Recent work in deep reinforcement learning (RL) has produced algorithms capable of mastering challenging games such as Go, chess, or shogi. In these works the RL agent directly observes the natural state of the game and controls that state directly with its actions. However, when humans play such games, they do not just reason about the moves but also interact with their physical environment. They understand the state of the game by looking at the physical board in front of them and modify it by manipulating pieces using touch and fine-grained motor control. Mastering complicated physical systems with abstract goals is a central challenge for artificial intelligence, but it remains out of reach for existing RL algorithms. To encourage progress towards this goal we introduce a set of physically embedded planning problems and make them publicly available. We embed challenging symbolic tasks (Sokoban, tic-tac-toe, and Go) in a physics engine to produce a set of tasks that require perception, reasoning, and motor control over long time horizons. Although existing RL algorithms can tackle the symbolic versions of these tasks, we find that they struggle to master even the simplest of their physically embedded counterparts. As a first step towards characterizing the space of solutions to these tasks, we introduce a strong baseline that uses a pre-trained expert game player to provide hints in the abstract space to an RL agent's policy while training it on the full sensorimotor control task. The resulting agent solves many of the tasks, underlining the need for methods that bridge the gap between abstract planning and embodied control. See illustrating video at https://youtu.be/RwHiHlym_1k.

Multi-Objective Model-based Reinforcement Learning for Infectious Disease Control

Authors:Runzhe Wan, Xinyu Zhang, Rui Song
Date:2020-09-09 23:55:27

Severe infectious diseases such as the novel coronavirus (COVID-19) pose a huge threat to public health. Stringent control measures, such as school closures and stay-at-home orders, while having significant effects, also bring huge economic losses. In the face of an emerging infectious disease, a crucial question for policymakers is how to make the trade-off and implement the appropriate interventions in a timely manner given the huge uncertainty. In this work, we propose a Multi-Objective Model-based Reinforcement Learning framework to facilitate data-driven decision-making and minimize the overall long-term cost. Specifically, at each decision point, a Bayesian epidemiological model is first learned as the environment model, and then the proposed model-based multi-objective planning algorithm is applied to find a set of Pareto-optimal policies. This framework, combined with the prediction bands for each policy, provides a real-time decision support tool for policymakers. The application is demonstrated with the spread of COVID-19 in China.
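
Illustrative sketch (not from the paper): a set of Pareto-optimal policies is the set of candidates that no other candidate beats on every objective at once. The filter below shows that definition on made-up (infections, economic cost) pairs; it is not the paper's planning algorithm, and all numbers are hypothetical.

```python
def pareto_front(points):
    """Keep the points that no other point dominates (all objectives are minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (predicted infections, economic cost) for five candidate policies.
candidates = [(100, 9.0), (300, 4.0), (250, 6.0), (320, 4.5), (900, 1.0)]
front = pareto_front(candidates)
```

Here the policy (320, 4.5) is dropped because (300, 4.0) is better on both objectives; the remaining four represent genuine trade-offs that a policymaker would have to choose between.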

Vision-Based Autonomous Drone Control using Supervised Learning in Simulation

Authors:Max Christl
Date:2020-09-09 13:45:41

Limited power and computational resources, absence of high-end sensor equipment and GPS-denied environments are challenges faced by autonomous micro aerial vehicles (MAVs). We address these challenges in the context of autonomous navigation and landing of MAVs in indoor environments and propose a vision-based control approach using Supervised Learning. To achieve this, we collected data samples in a simulation environment which were labelled according to the optimal control command determined by a path planning algorithm. Based on these data samples, we trained a Convolutional Neural Network (CNN) that maps low resolution image and sensor input to high-level control commands. We have observed promising results in both obstructed and non-obstructed simulation environments, showing that our model is capable of successfully navigating an MAV towards a landing platform. Our approach requires shorter training times than similar Reinforcement Learning approaches and can potentially overcome the limitations of manual data collection faced by comparable Supervised Learning approaches.

Metis: Multi-Agent Based Crisis Simulation System

Authors:George Sidiropoulos, Chairi Kiourt, Lefteris Moussiades
Date:2020-09-08 18:22:27

With the advent of computational technologies (Graphics Processing Units - GPUs) and Machine Learning, the research domain of crowd simulation for crisis management has flourished. Along with the new techniques and methodologies that have been proposed over the years, aiming to increase the realism of crowd simulation, several crisis simulation systems/tools have been developed, but most of them focus on special cases without providing users the ability to adapt them based on their needs. Towards these directions, in this paper, we introduce a novel multi-agent-based crisis simulation system for indoor cases. The main advantage of the system is its ease of use, focusing on non-expert users (users with little to no programming skills) who can exploit its capabilities, adapt the entire environment based on their needs (case studies) and set up building evacuation planning experiments with some of the most popular Reinforcement Learning algorithms. Simply put, the system's features focus on dynamic environment design and crisis management, interconnection with popular Reinforcement Learning libraries, agents with different characteristics (behaviors), fire propagation parameterization, realistic physics based on a popular game engine, GPU-accelerated agent training and simulation end conditions. A case study exploiting a popular reinforcement learning algorithm for training the agents presents the dynamics and the capabilities of the proposed system, and the paper concludes with the highlights of the system and some future directions.

Graph neural networks-based Scheduler for Production planning problems using Reinforcement Learning

Authors:Mohammed Sharafath Abdul Hameed, Andreas Schwung
Date:2020-09-08 16:05:04

Reinforcement learning (RL) is increasingly adopted in job shop scheduling problems (JSSP). But RL for JSSP is usually done using a vectorized representation of machine features as the state space. It has three major problems: (1) the relationship between the machine units and the job sequence is not fully captured, (2) exponential increase in the size of the state space with increasing machines/jobs, and (3) the generalization of the agent to unseen scenarios. We present a novel framework - GraSP-RL, GRAph neural network-based Scheduler for Production planning problems using Reinforcement Learning. It represents JSSP as a graph and trains the RL agent using features extracted using a graph neural network (GNN). While the graph is itself in the non-euclidean space, the features extracted using the GNNs provide a rich encoding of the current production state in the euclidean space, which is then used by the RL agent to select the next job. Further, we cast the scheduling problem as a decentralized optimization problem in which the learning agent is assigned to all the production units and the agent learns asynchronously from the data collected on all the production units. The GraSP-RL is then applied to a complex injection molding production environment with 30 jobs and 4 machines. The task is to minimize the makespan of the production plan. The schedule planned by GraSP-RL is then compared and analyzed with a priority dispatch rule algorithm like first-in-first-out (FIFO) and metaheuristics like tabu search (TS) and genetic algorithm (GA). The proposed GraSP-RL outperforms the FIFO, TS, and GA for the trained task of planning 30 jobs in JSSP. We further test the generalization capability of the trained agent on two different problem classes: Open shop system (OSS) and Reactive JSSP (RJSSP) where our method produces results better than FIFO and comparable results to TS and GA.
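
Illustrative sketch (not from the paper): the FIFO priority dispatch rule used as a baseline can be simulated in a few lines to obtain a makespan. The three-job, two-machine instance and the tie-breaking by job index are assumptions made for this example; the paper's injection molding environment is not reproduced here.

```python
def fifo_makespan(jobs, n_machines):
    """Makespan of a job shop when every machine serves its queue first-in-first-out.

    jobs[j] is the ordered list of (machine, duration) operations of job j.
    """
    next_op = [0] * len(jobs)        # index of each job's next operation
    job_ready = [0] * len(jobs)      # time each job joins the queue of its next machine
    machine_free = [0] * n_machines  # time each machine finishes its current work
    remaining = sum(len(ops) for ops in jobs)
    while remaining:
        # Earliest queue arrival is served first; ties go to the lower job index (assumption).
        ready, j = min((job_ready[k], k) for k in range(len(jobs)) if next_op[k] < len(jobs[k]))
        machine, duration = jobs[j][next_op[j]]
        finish = max(ready, machine_free[machine]) + duration
        job_ready[j] = machine_free[machine] = finish
        next_op[j] += 1
        remaining -= 1
    return max(machine_free)


# Hypothetical instance: each job visits both machines in its own order.
jobs = [
    [(0, 3), (1, 2)],
    [(0, 2), (1, 4)],
    [(1, 3), (0, 1)],
]
makespan = fifo_makespan(jobs, n_machines=2)
```

A learned scheduler would be judged by how far below this baseline makespan it can push the same instance.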

Learning Topological Motion Primitives for Knot Planning

Authors:Mengyuan Yan, Gen Li, Yilin Zhu, Jeannette Bohg
Date:2020-09-05 23:44:33

In this paper, we approach the challenging problem of motion planning for knot tying. We propose a hierarchical approach in which the top layer produces a topological plan and the bottom layer translates this plan into continuous robot motion. The top layer decomposes a knotting task into sequences of abstract topological actions based on knot theory. The bottom layer translates each of these abstract actions into robot motion trajectories through learned topological motion primitives. To adapt each topological action to the specific rope geometry, the motion primitives take the observed rope configuration as input. We train the motion primitives by imitating human demonstrations and reinforcement learning in simulation. To generalize human demonstrations of simple knots into more complex knots, we observe similarities in the motion strategies of different topological actions and design the neural network structure to exploit such similarities. We demonstrate that our learned motion primitives can be used to efficiently generate motion plans for tying the overhand knot. The motion plan can then be executed on a real robot using visual tracking and Model Predictive Control. We also demonstrate that our learned motion primitives can be composed to tie a more complex pentagram-like knot despite being only trained on human demonstrations of simpler knots.

Adaptive Reinforcement Learning Model for Simulation of Urban Mobility during Crises

Authors:Chao Fan, Xiangqi Jiang, Ali Mostafavi
Date:2020-09-02 21:47:18

The objective of this study is to propose and test an adaptive reinforcement learning model that can learn the patterns of human mobility in a normal context and simulate the mobility during perturbations caused by crises, such as flooding, wildfire, and hurricanes. Understanding and predicting human mobility patterns, such as destination and trajectory selection, can inform emerging congestion and road closures raised by disruptions in emergencies. Data related to human movement trajectories are scarce, especially in the context of emergencies, which places a limitation on applications of existing urban mobility models learned from empirical data. Models with the capability of learning the mobility patterns from data generated in normal situations and which can adapt to emergency situations are needed to inform emergency response and urban resilience assessments. To address this gap, this study creates and tests an adaptive reinforcement learning model that can predict the destinations of movements, estimate the trajectory for each origin and destination pair, and examine the impact of perturbations on humans' decisions related to destinations and movement trajectories. The application of the proposed model is shown in the context of Houston and the flooding scenario caused by Hurricane Harvey in August 2017. The results show that the model can achieve more than 76% precision and recall. The results also show that the model could predict traffic patterns and congestion resulting from urban flooding. The outcomes of the analysis demonstrate the capabilities of the model for analyzing urban mobility during crises, which can inform the public and decision-makers about the response strategies and resilience planning to reduce the impacts of crises on urban mobility.

Flightmare: A Flexible Quadrotor Simulator

Authors:Yunlong Song, Selim Naji, Elia Kaufmann, Antonio Loquercio, Davide Scaramuzza
Date:2020-09-01 16:50:45

State-of-the-art quadrotor simulators have a rigid and highly-specialized structure: they are either really fast, physically accurate, or photo-realistic. In this work, we propose a novel quadrotor simulator: Flightmare. Flightmare is composed of two main components: a configurable rendering engine built on Unity and a flexible physics engine for dynamics simulation. Those two components are totally decoupled and can run independently of each other. This makes our simulator extremely fast: rendering achieves speeds of up to 230 Hz, while physics simulation runs at up to 200,000 Hz on a laptop. In addition, Flightmare comes with several desirable features: (i) a large multi-modal sensor suite, including an interface to extract the 3D point-cloud of the scene; (ii) an API for reinforcement learning which can simulate hundreds of quadrotors in parallel; and (iii) integration with a virtual-reality headset for interaction with the simulated environment. We demonstrate the flexibility of Flightmare by using it for two different robotic tasks: quadrotor control using deep reinforcement learning and collision-free path planning in a complex 3D environment.

Solving the single-track train scheduling problem via Deep Reinforcement Learning

Authors:Valerio Agasucci, Giorgio Grani, Leonardo Lamorgese
Date:2020-09-01 14:03:56

Every day, railways experience disturbances and disruptions, both on the network and the fleet side, that affect the stability of rail traffic. Induced delays propagate through the network, which leads to a mismatch in demand and offer for goods and passengers, and, in turn, to a loss in service quality. In these cases, it is the duty of human traffic controllers, the so-called dispatchers, to do their best to minimize the impact on traffic. However, dispatchers inevitably have a limited depth of perception of the knock-on effect of their decisions, particularly how they affect areas of the network that are outside their direct control. In recent years, much work in Decision Science has been devoted to developing methods to solve the problem automatically and support the dispatchers in this challenging task. This paper investigates Machine Learning-based methods for tackling this problem, proposing two different Deep Q-Learning methods (Decentralized and Centralized). Numerical results show the superiority of these techniques with respect to the classical linear Q-Learning based on matrices. Moreover, the Centralized approach is compared with a MILP formulation showing interesting results. The experiments are inspired by data provided by a U.S. Class 1 railroad.

Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL

Authors:Xiaoyu Chen, Jiachen Hu, Lihong Li, Liwei Wang
Date:2020-08-31 02:20:41

Reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs) is studied. We propose an algorithm called FMDP-BF, which leverages the factorization structure of FMDP. The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and improves on the best previous result for FMDPs \citep{osband2014near} by a factor of $\sqrt{H|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the factored state subspace and $H$ is the planning horizon. To show the optimality of our bounds, we also provide a lower bound for FMDP, which indicates that our algorithm is near-optimal w.r.t. timestep $T$, horizon $H$ and factored state-action subspace cardinality. Finally, as an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK), and provide the first sample-efficient algorithm based on FMDP-BF.

Assessment of Reward Functions for Reinforcement Learning Traffic Signal Control under Real-World Limitations

Authors:Alvaro Cabrejas-Egea, Shaun Howell, Maksis Knutins, Colm Connaughton
Date:2020-08-26 15:47:15

Adaptive traffic signal control is one key avenue for mitigating the growing consequences of traffic congestion. Incumbent solutions such as SCOOT and SCATS require regular and time-consuming calibration, can't optimise well for multiple road use modalities, and require the manual curation of many implementation plans. Recent alternatives to these approaches are deep reinforcement learning algorithms, in which an agent learns how to take the most appropriate action for a given state of the system. This is guided by neural networks approximating a reward function that provides feedback to the agent regarding the performance of the actions taken, making it sensitive to the specific reward function chosen. Several authors have surveyed the reward functions used in the literature, but attributing outcome differences to reward function choice across works is problematic as there are many uncontrolled differences, as well as different outcome metrics. This paper compares the performance of agents using different reward functions in a simulation of a junction in Greater Manchester, UK, across various demand profiles, subject to real-world constraints: realistic sensor inputs, controllers, calibrated demand, intergreen times and stage sequencing. The reward metrics considered are based on the time spent stopped, lost time, change in lost time, average speed, queue length, junction throughput and variations of these magnitudes. The performance of these reward functions is compared in terms of total waiting time. We find that speed maximisation resulted in the lowest average waiting times across all demand levels, displaying significantly better performance than other rewards previously introduced in the literature.

Learning Off-Policy with Online Planning

Authors:Harshit Sikchi, Wenxuan Zhou, David Held
Date:2020-08-23 16:18:44

Reinforcement learning (RL) in low-data and risk-sensitive domains requires performant and flexible deployment policies that can readily incorporate constraints during deployment. One such class of policies is the semi-parametric H-step lookahead policies, which select actions using trajectory optimization over a dynamics model for a fixed horizon with a terminal value function. In this work, we investigate a novel instantiation of H-step lookahead with a learned model and a terminal value function learned by a model-free off-policy algorithm, named Learning Off-Policy with Online Planning (LOOP). We provide a theoretical analysis of this method, suggesting a tradeoff between model errors and value function errors, and empirically demonstrate this tradeoff to be beneficial in deep reinforcement learning. Furthermore, we identify the "Actor Divergence" issue in this framework and propose Actor Regularized Control (ARC), a modified trajectory optimization procedure. We evaluate our method on a set of robotic tasks for Offline and Online RL and demonstrate improved performance. We also show the flexibility of LOOP to incorporate safety constraints during deployment with a set of navigation environments. We demonstrate that LOOP is a desirable framework for robotics applications based on its strong performance in various important RL settings. Project video and details can be found at https://hari-sikchi.github.io/loop .
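
Illustrative sketch (not from the paper): an H-step lookahead policy scores action sequences by rolling a dynamics model forward for H steps and adding a terminal value estimate, then executes the first action of the best sequence. The toy below replaces LOOP's learned model and learned value function with hand-written stand-ins and uses exhaustive enumeration in place of a real trajectory optimizer; every name and number is an assumption.

```python
from itertools import product


def h_step_lookahead(state, actions, model, reward, terminal_value, horizon, gamma=0.99):
    """Return the first action of the best H-step action sequence."""
    best_score, best_first = float("-inf"), None
    for seq in product(actions, repeat=horizon):
        s, score, discount = state, 0.0, 1.0
        for a in seq:
            score += discount * reward(s, a)
            s = model(s, a)                       # roll the dynamics model forward
            discount *= gamma
        score += discount * terminal_value(s)     # bootstrap beyond the horizon
        if score > best_score:
            best_score, best_first = score, seq[0]
    return best_first


GOAL = 5  # hypothetical target position on an integer line


def act(state):
    return h_step_lookahead(
        state,
        actions=(-1, 0, 1),
        model=lambda s, a: s + a,                  # stand-in for a learned model
        reward=lambda s, a: -0.1 * abs(a),         # small cost for moving
        terminal_value=lambda s: -abs(s - GOAL),   # stand-in for a learned value function
        horizon=2,
    )


moves = [act(0), act(5), act(8)]
```

Because the terminal value summarizes everything past the horizon, a short lookahead is enough to move toward the goal from either side and to stay put once it is reached.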

A Survey of Knowledge-based Sequential Decision Making under Uncertainty

Authors:Shiqi Zhang, Mohan Sridharan
Date:2020-08-19 16:48:03

Reasoning with declarative knowledge (RDK) and sequential decision-making (SDM) are two key research areas in artificial intelligence. RDK methods reason with declarative domain knowledge, including commonsense knowledge, that is either provided a priori or acquired over time, while SDM methods (probabilistic planning and reinforcement learning) seek to compute action policies that maximize the expected cumulative utility over a time horizon; both classes of methods reason in the presence of uncertainty. Despite the rich literature in these two areas, researchers have not fully explored their complementary strengths. In this paper, we survey algorithms that leverage RDK methods while making sequential decisions under uncertainty. We discuss significant developments, open problems, and directions for future work.

Heteroscedastic Uncertainty for Robust Generative Latent Dynamics

Authors:Oliver Limoyo, Bryan Chan, Filip Marić, Brandon Wagstaff, Rupam Mahmood, Jonathan Kelly
Date:2020-08-18 21:04:33

Learning or identifying dynamics from a sequence of high-dimensional observations is a difficult challenge in many domains, including reinforcement learning and control. The problem has recently been studied from a generative perspective through latent dynamics: high-dimensional observations are embedded into a lower-dimensional space in which the dynamics can be learned. Despite some successes, latent dynamics models have not yet been applied to real-world robotic systems where learned representations must be robust to a variety of perceptual confounds and noise sources not seen during training. In this paper, we present a method to jointly learn a latent state representation and the associated dynamics that is amenable for long-term planning and closed-loop control under perceptually difficult conditions. As our main contribution, we describe how our representation is able to capture a notion of heteroscedastic or input-specific uncertainty at test time by detecting novel or out-of-distribution (OOD) inputs. We present results from prediction and control experiments on two image-based tasks: a simulated pendulum balancing task and a real-world robotic manipulator reaching task. We demonstrate that our model produces significantly more accurate predictions and exhibits improved control performance, compared to a model that assumes homoscedastic uncertainty only, in the presence of varying degrees of input degradation.

Super-Human Performance in Gran Turismo Sport Using Deep Reinforcement Learning

Authors:Florian Fuchs, Yunlong Song, Elia Kaufmann, Davide Scaramuzza, Peter Duerr
Date:2020-08-18 15:06:44

Autonomous car racing is a major challenge in robotics. It raises fundamental problems for classical approaches such as planning minimum-time trajectories under uncertain dynamics and controlling the car at the limits of its handling. Besides, the requirement of minimizing the lap time, which is a sparse objective, and the difficulty of collecting training data from human experts have also hindered researchers from directly applying learning-based approaches to solve the problem. In the present work, we propose a learning-based system for autonomous car racing by leveraging a high-fidelity physical car simulation, a course-progress proxy reward, and deep reinforcement learning. We deploy our system in Gran Turismo Sport, a world-leading car simulator known for its realistic physics simulation of different race cars and tracks, which is even used to recruit human race car drivers. Our trained policy achieves autonomous racing performance that goes beyond what had been achieved so far by the built-in AI, and, at the same time, outperforms the fastest driver in a dataset of over 50,000 human players.

ReLMoGen: Leveraging Motion Generation in Reinforcement Learning for Mobile Manipulation

Authors:Fei Xia, Chengshu Li, Roberto Martín-Martín, Or Litany, Alexander Toshev, Silvio Savarese
Date:2020-08-18 08:05:15

Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen -- a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art Reinforcement Learning and Hierarchical Reinforcement Learning baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots.

MIDAS: Multi-agent Interaction-aware Decision-making with Adaptive Strategies for Urban Autonomous Navigation

Authors:Xiaoyi Chen, Pratik Chaudhari
Date:2020-08-17 04:34:25

Autonomous navigation in crowded, complex urban environments requires interacting with other agents on the road. A common solution to this problem is to use a prediction model to guess the likely future actions of other agents. While this is reasonable, it leads to overly conservative plans because it does not explicitly model the mutual influence of the actions of interacting agents. This paper builds a reinforcement learning-based method named MIDAS where an ego-agent learns to affect the control actions of other cars in urban driving scenarios. MIDAS uses an attention-mechanism to handle an arbitrary number of other agents and includes a "driver-type" parameter to learn a single policy that works across different planning objectives. We build a simulation environment that enables diverse interaction experiments with a large number of agents and methods for quantitatively studying the safety, efficiency, and interaction among vehicles. MIDAS is validated using extensive experiments and we show that it (i) can work across different road geometries, (ii) results in an adaptive ego policy that can be tuned easily to satisfy performance criteria such as aggressive or cautious driving, (iii) is robust to changes in the driving policies of external agents, and (iv) is more efficient and safer than existing approaches to interaction-aware decision-making.

Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings

Authors:Jesse Zhang, Brian Cheung, Chelsea Finn, Sergey Levine, Dinesh Jayaraman
Date:2020-08-15 01:40:59

Reinforcement learning (RL) in real-world safety-critical target settings like urban driving is hazardous, imperiling the RL agent, other agents, and the environment. To overcome this difficulty, we propose a "safety-critical adaptation" task setting: an agent first trains in non-safety-critical "source" environments such as in a simulator, before it adapts to the target environment where failures carry heavy costs. We propose a solution approach, CARL, that builds on the intuition that prior experience in diverse environments equips an agent to estimate risk, which in turn enables relative safety through risk-averse, cautious adaptation. CARL first employs model-based RL to train a probabilistic model to capture uncertainty about transition dynamics and catastrophic states across varied source environments. Then, when exploring a new safety-critical environment with unknown dynamics, the CARL agent plans to avoid actions that could lead to catastrophic states. In experiments on car driving, cartpole balancing, half-cheetah locomotion, and robotic object manipulation, CARL successfully acquires cautious exploration behaviors, yielding higher rewards with fewer failures than strong RL adaptation baselines. Website at https://sites.google.com/berkeley.edu/carl.

Sample-efficient Cross-Entropy Method for Real-time Planning

Authors:Cristina Pinneri, Shambhuraj Sawant, Sebastian Blaes, Jan Achterhold, Joerg Stueckler, Michal Rolinek, Georg Martius
Date:2020-08-14 14:25:59

Trajectory optimizers for model-based reinforcement learning, such as the Cross-Entropy Method (CEM), can yield compelling results even in high-dimensional control tasks and sparse-reward environments. However, their sampling inefficiency prevents them from being used for real-time planning and control. We propose an improved version of the CEM algorithm for fast planning, with novel additions including temporally-correlated actions and memory, requiring 2.7-22x fewer samples and yielding a performance increase of 1.2-10x in high-dimensional control problems.

Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter

Authors:Andrey Kurenkov, Joseph Taglic, Rohun Kulkarni, Marcus Dominguez-Kuhne, Animesh Garg, Roberto Martín-Martín, Silvio Savarese
Date:2020-08-13 18:23:00

When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way and fully reveal the object of interest and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcome is extremely hard, if not impossible. In problems where accurate (forward) models are lacking, Deep Reinforcement Learning (RL) has been shown to be a viable solution to map observations (e.g. images) to good interactions in the form of closed-loop visuomotor policies. However, Deep RL is sample inefficient and fails when applied directly to the problem of unoccluding objects based on images. In this work we present a novel Deep RL procedure that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample-efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects. Our experiments show that our approach trains faster and converges to more efficient uncovering solutions than baselines and ablations, and that our uncovering policies lead to an average improvement in the graspability of the target object, facilitating downstream retrieval applications.

Model-Based Offline Planning

Authors:Arthur Argenson, Gabriel Dulac-Arnold
Date:2020-08-12 20:06:52

Offline learning is a key part of making reinforcement learning (RL) usable in real systems. Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to have easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and create zero-shot goal-conditioned policies on a series of environments. An accompanying video can be found here: https://youtu.be/nxGGHdZOFts

Deep Model-Based Reinforcement Learning for High-Dimensional Problems, a Survey

Authors:Aske Plaat, Walter Kosters, Mike Preuss
Date:2020-08-11 08:49:04

Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision making problems have been solved in tasks such as game playing and robotics. Unfortunately, the sample complexity of most deep reinforcement learning methods is high, precluding their use in some important applications. Model-based reinforcement learning creates an explicit model of the environment dynamics to reduce the need for environment samples. Current deep learning methods use high-capacity networks to solve high-dimensional problems. Unfortunately, high-capacity models typically require many samples, negating the potential benefit of lower sample complexity in model-based methods. A challenge for deep model-based methods is therefore to achieve high predictive power while maintaining low sample complexity. In recent years, many model-based methods have been introduced to address this challenge. In this paper, we survey the contemporary model-based landscape. First we discuss definitions and relations to other fields. We propose a taxonomy based on three approaches: using explicit planning on given transitions, using explicit planning on learned transitions, and end-to-end learning of both planning and transitions. We use these approaches to organize a comprehensive overview of important recent developments such as latent models. We describe methods and benchmarks, and we suggest directions for future work for each of the approaches. Among promising research directions are curriculum learning, uncertainty modeling, and use of latent models for transfer learning.

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Authors:Yongchao Liu, Yue Jin, Yong Chen, Teng Teng, Hang Ou, Rui Zhao, Yao Zhang
Date:2020-08-11 07:50:34

Accelerating deep model training and inference is crucial in practice. Existing deep learning frameworks usually concentrate on optimizing training speed and pay less attention to inference-specific optimizations. In fact, model inference differs from training in terms of computation, e.g. parameters are refreshed at each gradient update step during training, but kept invariant during inference. These special characteristics of model inference open new opportunities for its optimization. In this paper, we propose a hardware-aware optimization framework, namely Woodpecker-DL (WPK), to accelerate inference by taking advantage of multiple joint optimizations from the perspectives of graph optimization, automated searches, domain-specific language (DSL) compiler techniques and system-level exploration. In WPK, we investigated two new automated search approaches based on genetic algorithm and reinforcement learning, respectively, to hunt the best operator code configurations targeting specific hardware. A customized DSL compiler is further attached to these search algorithms to generate efficient code. To create an optimized inference plan, WPK systematically explores high-speed operator implementations from third-party libraries besides our automatically generated code and singles out the best implementation per operator for use. Extensive experiments demonstrated that on a Tesla P100 GPU, we can achieve a maximum speedup of 5.40 over cuDNN and 1.63 over TVM on individual convolution operators, and run up to 1.18 times faster than TensorRT for end-to-end model inference.

Adaptive Coordination Offsets for Signalized Arterial Intersections using Deep Reinforcement Learning

Authors:Keith Anshilo Diaz, Damian Dailisan, Umang Sharaf, Carissa Santos, Qijian Gan, Francis Aldrine Uy, May T. Lim, Alexandre M. Bayen
Date:2020-08-06 14:50:15

Coordinating intersections in arterial networks is critical to the performance of urban transportation systems. Deep reinforcement learning (RL) has gained traction in traffic control research along with data-driven approaches for traffic control systems. To date, proposed deep RL-based traffic schemes control phase activation or duration. Yet, such approaches may bypass low volume links for several cycles in order to optimize the network-level traffic flow. Here, we propose a deep RL framework that dynamically adjusts offsets based on traffic states and preserves the planned phase timings and order derived from model-based methods. This framework allows us to improve arterial coordination while maintaining phase order and timing predictability. Using a validated and calibrated traffic model, we trained the policy of a deep RL agent that aims to reduce travel delays in the network. We evaluated the resulting policy by comparing its performance against the phase offsets deployed along a segment of Huntington Drive in the city of Arcadia. The resulting policy dynamically readjusts phase offsets in response to changes in traffic demand. Simulation results show that the proposed deep RL agent outperformed the baseline on average, effectively reducing delay time by 13.21% in the AM scenario, 2.42% in the Noon scenario, and 6.2% in the PM scenario when offsets are adjusted in 15-minute intervals. Finally, we also show the robustness of our agent to extreme traffic conditions, such as demand surges in off-peak hours and localized traffic incidents.

The Emergence of Adversarial Communication in Multi-Agent Reinforcement Learning

Authors:Jan Blumenkamp, Amanda Prorok
Date:2020-08-06 12:48:08

Many real-world problems require the coordination of multiple autonomous agents. Recent work has shown the promise of Graph Neural Networks (GNNs) to learn explicit communication strategies that enable complex multi-agent coordination. These works use models of cooperative multi-agent systems whereby agents strive to achieve a shared global goal. When considering agents with self-interested local objectives, the standard design choice is to model these as separate learning systems (albeit sharing the same environment). Such a design choice, however, precludes the existence of a single, differentiable communication channel, and consequently prohibits the learning of inter-agent communication strategies. In this work, we address this gap by presenting a learning model that accommodates individual non-shared rewards and a differentiable communication channel that is common among all agents. We focus on the case where agents have self-interested objectives, and develop a learning algorithm that elicits the emergence of adversarial communications. We perform experiments on multi-agent coverage and path planning problems, and employ a post-hoc interpretability technique to visualize the messages that agents communicate to each other. We show how a single self-interested agent is capable of learning highly manipulative communication strategies that allow it to significantly outperform a cooperative team of agents.

Offline Meta Learning of Exploration

Authors:Ron Dorfman, Idan Shenfeld, Aviv Tamar
Date:2020-08-06 12:09:18

Consider the following instance of the Offline Meta Reinforcement Learning (OMRL) problem: given the complete training logs of $N$ conventional RL agents, trained on $N$ different tasks, design a meta-agent that can quickly maximize reward in a new, unseen task from the same task distribution. In particular, while each conventional RL agent explored and exploited its own different task, the meta-agent must identify regularities in the data that lead to effective exploration/exploitation in the unseen task. Here, we take a Bayesian RL (BRL) view, and seek to learn a Bayes-optimal policy from the offline data. Building on the recent VariBAD BRL approach, we develop an off-policy BRL method that learns to plan an exploration strategy based on an adaptive neural belief estimate. However, learning to infer such a belief from offline data brings a new identifiability issue we term MDP ambiguity. We characterize the problem, and suggest resolutions via data collection and modification procedures. Finally, we evaluate our framework on a diverse set of domains, including difficult sparse reward tasks, and demonstrate learning of effective exploration behavior that is qualitatively different from the exploration used by any RL agent in the data.

Learning Transition Models with Time-delayed Causal Relations

Authors:Junchi Liang, Abdeslam Boularias
Date:2020-08-04 14:35:11

This paper introduces an algorithm for discovering implicit and delayed causal relations between events observed by a robot at arbitrary times, with the objective of improving data-efficiency and interpretability of model-based reinforcement learning (RL) techniques. The proposed algorithm initially predicts observations with the Markov assumption, and incrementally introduces new hidden variables to explain and reduce the stochasticity of the observations. The hidden variables are memory units that keep track of pertinent past events. Such events are systematically identified by their information gains. The learned transition and reward models are then used for planning. Experiments on simulated and real robotic tasks show that this method significantly improves over current RL techniques.

Tracking the Race Between Deep Reinforcement Learning and Imitation Learning -- Extended Version

Authors:Timo P. Gros, Daniel Höller, Jörg Hoffmann, Verena Wolf
Date:2020-08-03 10:31:44

Learning-based approaches for solving large sequential decision making problems have become popular in recent years. The resulting agents perform differently and their characteristics depend on those of the underlying learning approach. Here, we consider a benchmark planning problem from the reinforcement learning domain, the Racetrack, to investigate the properties of agents derived from different deep (reinforcement) learning approaches. We compare the performance of deep supervised learning, in particular imitation learning, to reinforcement learning for the Racetrack model. We find that imitation learning yields agents that follow more risky paths. In contrast, the decisions of deep reinforcement learning are more foresighted, i.e., avoid states in which fatal decisions are more likely. Our evaluations show that for this sequential decision making problem, deep reinforcement learning performs best in many aspects even though for imitation learning optimal decisions are considered.

MAPPER: Multi-Agent Path Planning with Evolutionary Reinforcement Learning in Mixed Dynamic Environments

Authors:Zuxin Liu, Baiming Chen, Hongyi Zhou, Guru Koushik, Martial Hebert, Ding Zhao
Date:2020-07-30 20:14:42

Multi-agent navigation in dynamic environments is of great industrial value when deploying a large-scale fleet of robots to real-world applications. This paper proposes a decentralized partially observable multi-agent path planning with evolutionary reinforcement learning (MAPPER) method to learn an effective local planning policy in mixed dynamic environments. Reinforcement learning-based methods usually suffer performance degradation on long-horizon tasks with goal-conditioned sparse rewards, so we decompose the long-range navigation task into many easier sub-tasks under the guidance of a global planner, which increases agents' performance in large environments. Moreover, most existing multi-agent planning approaches assume either perfect information of the surrounding environment or homogeneity of nearby dynamic agents, which may not hold in practice. Our approach models dynamic obstacles' behavior with an image-based representation and trains a policy in mixed dynamic environments without a homogeneity assumption. To ensure multi-agent training stability and performance, we propose an evolutionary training approach that can be easily scaled to large and complex environments. Experiments show that MAPPER is able to achieve higher success rates and more stable performance when exposed to a large number of non-cooperative dynamic obstacles compared with the traditional reaction-based planner LRA* and the state-of-the-art learning-based method.

Intelligent Trajectory Planning in UAV-mounted Wireless Networks: A Quantum-Inspired Reinforcement Learning Perspective

Authors:Yuanjian Li, A. Hamid Aghvami, Daoyi Dong
Date:2020-07-27 10:43:31

In this paper, we consider a wireless uplink transmission scenario in which an unmanned aerial vehicle (UAV) serves as an aerial base station collecting data from ground users. To optimize the expected sum uplink transmit rate without any prior knowledge of ground users (e.g., locations, channel state information and transmit power), the trajectory planning problem is solved via the quantum-inspired reinforcement learning (QiRL) approach. Specifically, the QiRL method adopts a novel probabilistic action selection policy and a new reinforcement strategy, which are inspired by the collapse phenomenon and amplitude amplification in quantum computation theory, respectively. Numerical results demonstrate that the proposed QiRL solution can offer natural balancing between exploration and exploitation via ranking collapse probabilities of possible actions, compared to the traditional reinforcement learning approaches which are highly dependent on tuned exploration parameters.

Learning Compositional Neural Programs for Continuous Control

Authors:Thomas Pierrot, Nicolas Perrin, Feryal Behbahani, Alexandre Laterre, Olivier Sigaud, Karim Beguir, Nando de Freitas
Date:2020-07-27 08:27:14

We propose a novel solution to challenging sparse-reward, continuous control problems that require hierarchical planning at multiple levels of abstraction. Our solution, dubbed AlphaNPI-X, involves three separate stages of learning. First, we use off-policy reinforcement learning algorithms with experience replay to learn a set of atomic goal-conditioned policies, which can be easily repurposed for many tasks. Second, we learn self-models describing the effect of the atomic policies on the environment. Third, the self-models are harnessed to learn recursive compositional programs with multiple levels of abstraction. The key insight is that the self-models enable planning by imagination, obviating the need for interaction with the world when learning higher-level compositional programs. To accomplish the third stage of learning, we extend the AlphaNPI algorithm, which applies AlphaZero to learn recursive neural programmer-interpreters. We empirically show that AlphaNPI-X can effectively learn to tackle challenging sparse manipulation tasks, such as stacking multiple blocks, where powerful model-free baselines fail.

Autonomous Exploration Under Uncertainty via Deep Reinforcement Learning on Graphs

Authors:Fanfei Chen, John D. Martin, Yewei Huang, Jinkun Wang, Brendan Englot
Date:2020-07-24 16:50:41

We consider an autonomous exploration problem in which a range-sensing mobile robot is tasked with accurately mapping the landmarks in an a priori unknown environment efficiently in real-time; it must choose sensing actions that both curb localization uncertainty and achieve information gain. For this problem, belief space planning methods that forward-simulate robot sensing and estimation may often fail in real-time implementation, scaling poorly with increasing size of the state, belief and action spaces. We propose a novel approach that uses graph neural networks (GNNs) in conjunction with deep reinforcement learning (DRL), enabling decision-making over graphs containing exploration information to predict a robot's optimal sensing action in belief space. The policy, which is trained in different random environments without human intervention, offers a real-time, scalable decision-making process whose high-performance exploratory sensing actions yield accurate maps and high rates of information gain.

Improving Efficiency of Training a Virtual Treatment Planner Network via Knowledge-guided Deep Reinforcement Learning for Intelligent Automatic Treatment Planning of Radiotherapy

Authors:Chenyang Shen, Liyuan Chen, Yesenia Gonzalez, Xun Jia
Date:2020-07-24 15:48:23

We previously proposed an intelligent automatic treatment planning framework for radiotherapy, in which a virtual treatment planner network (VTPN) was built using deep reinforcement learning (DRL) to operate a treatment planning system (TPS). Despite the success, the training of VTPN via DRL was time consuming. Also the training time is expected to grow with the complexity of the treatment planning problem, preventing the development of VTPN for more complicated but clinically relevant scenarios. In this study we proposed a knowledge-guided DRL (KgDRL) that incorporated knowledge from human planners to guide the training process to improve the training efficiency. Using prostate cancer intensity modulated radiation therapy as a testbed, we first summarized a number of rules of operating our in-house TPS. In training, in addition to randomly navigating the state-action space, as in the DRL using the epsilon-greedy algorithm, we also sampled actions defined by the rules. The priority of sampling actions from rules decreased over the training process to encourage VTPN to explore new policy that was not covered by the rules. We trained a VTPN using KgDRL and compared its performance with another VTPN trained using DRL. Both VTPNs trained via KgDRL and DRL spontaneously learned to operate the TPS to generate high-quality plans, achieving plan quality scores of 8.82 and 8.43, respectively. Both VTPNs outperformed treatment planning purely based on the rules, which had a plan score of 7.81. VTPN trained with 8 episodes using KgDRL was able to perform similarly to that trained using DRL with 100 episodes. The training time was reduced from more than a week to 13 hours. The proposed KgDRL framework accelerated the training process by incorporating human knowledge, which will facilitate the development of VTPN for more complicated treatment planning scenarios.
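
The action-sampling scheme described above (rule-derived actions mixed into epsilon-greedy exploration, with the rule priority decreasing over training) can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the function names, the geometric decay schedule, and its default values are hypothetical.

```python
import random

def select_action(q_values, rule_action, epsilon, rule_prob, rng=random):
    """Pick an action index: rule-guided, uniformly random, or greedy."""
    u = rng.random()
    if rule_action is not None and u < rule_prob:
        return rule_action                       # follow the human-derived rule
    if u < rule_prob + epsilon:
        # Random exploration; this branch also absorbs the rule share
        # whenever no rule applies in the current state.
        return rng.randrange(len(q_values))
    # Otherwise exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)

def rule_priority(episode, start=0.5, decay=0.9):
    """Hypothetical schedule: rule-sampling probability shrinks each episode."""
    return start * decay ** episode
```

Lowering `rule_prob` over time shifts sampling from the rules towards the learned Q-values, which matches the abstract's stated aim of letting the network explore policies the rules do not cover.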

Deep Reinforcement Learning based Automatic Exploration for Navigation in Unknown Environment

Authors:Haoran Li, Qichao Zhang, Dongbin Zhao
Date:2020-07-23 05:53:36

This paper investigates the automatic exploration problem in unknown environments, which is key to applying robotic systems to some social tasks. Solving this problem by stacking decision rules cannot cover the variety of environments and sensor properties. Learning-based control methods are adaptive in these scenarios. However, these methods are hampered by low learning efficiency and poor transferability from simulation to reality. In this paper, we construct a general exploration framework by decomposing the exploration process into decision, planning, and mapping modules, which increases the modularity of the robotic system. Based on this framework, we propose a deep reinforcement learning based decision algorithm which uses a deep neural network to learn an exploration strategy from the partial map. The results show that the proposed algorithm has better learning efficiency and adaptability for unknown environments. In addition, we conduct experiments on a physical robot, and the results suggest that the learned policy can be transferred well from simulation to the real robot.

Adaptive Traffic Control with Deep Reinforcement Learning: Towards State-of-the-art and Beyond

Authors:Siavash Alemzadeh, Ramin Moslemi, Ratnesh Sharma, Mehran Mesbahi
Date:2020-07-21 17:26:20

In this work, we study adaptive data-guided traffic planning and control using Reinforcement Learning (RL). We shift from the plain use of classic methods towards the state of the art in the deep RL community. We embed several recent techniques in our algorithm that improve the original Deep Q-Networks (DQN) for discrete control and discuss the traffic-related interpretations that follow. We propose a novel DQN-based algorithm for Traffic Control (called TC-DQN+) as a tool for fast and more reliable traffic decision-making. We introduce a new form of reward function which is further discussed using illustrative examples with comparisons to traditional traffic control methods.

UAV Target Tracking in Urban Environments Using Deep Reinforcement Learning

Authors:Sarthak Bhagat, Sujit PB
Date:2020-07-21 16:52:48

Persistent target tracking in urban environments using a UAV is a difficult task due to the limited field of view, visibility obstruction from obstacles and uncertain target motion. The vehicle needs to plan intelligently in 3D such that the target visibility is maximized. In this paper, we introduce Target Following DQN (TF-DQN), a deep reinforcement learning technique based on Deep Q-Networks with a curriculum training framework for the UAV to persistently track the target in the presence of obstacles and target motion uncertainty. The algorithm is evaluated both qualitatively and quantitatively through several simulation experiments. The results show that the UAV tracks the target persistently in diverse environments while avoiding obstacles, in the trained environments as well as in unseen environments.

Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Authors:Kaiqing Zhang, Sham M. Kakade, Tamer Başar, Lin F. Yang
Date:2020-07-15 03:25:24

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S,A,B$ denote the state space, and the action spaces for the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap on the $|A|,|B|$ dependence. Our results not only demonstrate the sample-efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and limitation (less adaptive and suboptimal in $|A|,|B|$), which particularly arises in the multi-agent context.
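
The gap on the $|A|,|B|$ dependence can be made explicit by dividing the upper bound by the reward-aware lower bound stated above. This is a simple worked ratio that ignores the constants and logarithmic factors hidden in the $\tilde O$ and $\tilde\Omega$ notation; the symbol $n$ for a common action-space size is introduced here only for illustration.

```latex
\[
\frac{|S|\,|A|\,|B|\,(1-\gamma)^{-3}\epsilon^{-2}}
     {|S|\,(|A|+|B|)\,(1-\gamma)^{-3}\epsilon^{-2}}
  = \frac{|A|\,|B|}{|A|+|B|},
\qquad
\text{so for } |A| = |B| = n \text{ the ratio is } \frac{n}{2}.
\]
```

The dependence on $|S|$, $\gamma$, and $\epsilon$ cancels, leaving a factor that grows linearly in the number of actions when both agents have equally many.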

Goal-Aware Prediction: Learning to Model What Matters

Authors:Suraj Nair, Silvio Savarese, Chelsea Finn
Date:2020-07-14 16:42:59

Learned dynamics models combined with both planning and policy learning algorithms have shown promise in enabling artificial agents to learn to perform many diverse tasks with limited supervision. However, one of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model (future state reconstruction), and that of the downstream planner or policy (completing a specified task). This issue is exacerbated by vision-based control tasks in diverse real-world environments, where the complexity of the real world dwarfs model capacity. In this paper, we propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space, resulting in a learning objective that more closely matches the downstream task. Further, we do so in an entirely self-supervised manner, without the need for a reward function or image labels. We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.

Reinforcement Learning of Musculoskeletal Control from Functional Simulations

Authors:Emanuel Joos, Fabien Péan, Orcun Goksel
Date:2020-07-13 20:20:01

To diagnose, plan, and treat musculoskeletal pathologies, understanding and reproducing muscle recruitment for complex movements is essential. With muscle activations for movements often being highly redundant, nonlinear, and time dependent, machine learning can provide a solution for their modeling and control for anatomy-specific musculoskeletal simulations. Sophisticated biomechanical simulations often require specialized computational environments, being numerically complex and slow, hindering their integration with typical deep learning frameworks. In this work, a deep reinforcement learning (DRL) based inverse dynamics controller is trained to control muscle activations of a biomechanical model of the human shoulder. In a generalizable end-to-end fashion, muscle activations are learned given current and desired position-velocity pairs. A customized reward function for trajectory control is introduced, enabling straightforward extension to additional muscles and higher degrees of freedom. Using the biomechanical model, multiple episodes are simulated on a cluster simultaneously using the evolving neural models of the DRL being trained. Results are presented for a single-axis motion control of shoulder abduction for the task of following randomly generated angular trajectories.

Learning Abstract Models for Strategic Exploration and Fast Reward Transfer

Authors:Evan Zheran Liu, Ramtin Keramati, Sudarshan Seshadri, Kelvin Guu, Panupong Pasupat, Emma Brunskill, Percy Liang
Date:2020-07-12 03:33:50

Model-based reinforcement learning (RL) is appealing because (i) it enables planning and thus more strategic exploration, and (ii) by decoupling dynamics from rewards, it enables fast transfer to new reward functions. However, learning an accurate Markov Decision Process (MDP) over high-dimensional states (e.g., raw pixels) is extremely challenging because it requires function approximation, which leads to compounding errors. Instead, to avoid compounding errors, we propose learning an abstract MDP over abstract states: low-dimensional coarse representations of the state (e.g., capturing agent position, ignoring other objects). We assume access to an abstraction function that maps the concrete states to abstract states. In our approach, we construct an abstract MDP, which grows through strategic exploration via planning. Similar to hierarchical RL approaches, the abstract actions of the abstract MDP are backed by learned subpolicies that navigate between abstract states. Our approach achieves strong results on three of the hardest Arcade Learning Environment games (Montezuma's Revenge, Pitfall!, and Private Eye), including superhuman performance on Pitfall! without demonstrations. After training on one task, we can reuse the learned abstract MDP for new reward functions, achieving higher reward in 1000x fewer samples than model-free methods trained from scratch.

Control as Hybrid Inference

Authors:Alexander Tschantz, Beren Millidge, Anil K. Seth, Christopher L. Buckley
Date:2020-07-11 19:44:09

The field of reinforcement learning can be split into model-based and model-free methods. Here, we unify these approaches by casting model-free policy optimisation as amortised variational inference, and model-based planning as iterative variational inference, within a `control as hybrid inference' (CHI) framework. We present an implementation of CHI which naturally mediates the balance between iterative and amortised inference. Using a didactic experiment, we demonstrate that the proposed algorithm operates in a model-based manner at the onset of learning, before converging to a model-free algorithm once sufficient data have been collected. We verify the scalability of our algorithm on a continuous control benchmark, demonstrating that it outperforms strong model-free and model-based baselines. CHI thus provides a principled framework for harnessing the sample efficiency of model-based planning while retaining the asymptotic performance of model-free policy optimisation.

Planning on the fast lane: Learning to interact using attention mechanisms in path integral inverse reinforcement learning

Authors:Sascha Rosbach, Xing Li, Simon Großjohann, Silviu Homoceanu, Stefan Roth
Date:2020-07-11 15:25:44

General-purpose trajectory planning algorithms for automated driving utilize complex reward functions to perform a combined optimization of strategic, behavioral, and kinematic features. The specification and tuning of a single reward function is a tedious task and does not generalize over a large set of traffic situations. Deep learning approaches based on path integral inverse reinforcement learning have been successfully applied to predict local situation-dependent reward functions using features of a set of sampled driving policies. Sample-based trajectory planning algorithms are able to approximate a spatio-temporal subspace of feasible driving policies that can be used to encode the context of a situation. However, the interaction with dynamic objects requires an extended planning horizon, which depends on sequential context modeling. In this work, we are concerned with the sequential reward prediction over an extended time horizon. We present a neural network architecture that uses a policy attention mechanism to generate a low-dimensional context vector by concentrating on trajectories with a human-like driving style. Apart from this, we propose a temporal attention mechanism to identify context switches and allow for stable adaptation of rewards. We evaluate our results on complex simulated driving situations, including other moving vehicles. Our evaluation shows that our policy attention mechanism learns to focus on collision-free policies in the configuration space. Furthermore, the temporal attention mechanism learns persistent interaction with other vehicles over an extended planning horizon.

Long-Term Planning with Deep Reinforcement Learning on Autonomous Drones

Authors:Ugurkan Ates
Date:2020-07-11 06:16:50

In this paper, we study a long-term planning scenario based on real-life drone racing competitions. We conducted this experiment on a framework created for "Game of Drones: Drone Racing Competition" at NeurIPS 2019. The racing environment was created using Microsoft's AirSim Drone Racing Lab. A reinforcement learning agent, a simulated quadrotor in our case, trained with the Proximal Policy Optimization (PPO) algorithm was able to successfully compete against another simulated quadrotor running a classical path planning algorithm. Agent observations consist of data from IMU sensors, GPS coordinates of the drone obtained through simulation, and the opponent drone's GPS information. Using the opponent drone's GPS information during training helps in dealing with complex state spaces; serving as expert guidance, it allows for an efficient and stable training process. All experiments performed in this paper can be found and reproduced with code at our GitHub repository.

Discourse Coherence, Reference Grounding and Goal Oriented Dialogue

Authors:Baber Khalid, Malihe Alikhani, Michael Fellner, Brian McMahan, Matthew Stone
Date:2020-07-08 20:53:14

Prior approaches to realizing mixed-initiative human-computer referential communication have adopted information-state or collaborative problem-solving approaches. In this paper, we argue for a new approach, inspired by coherence-based models of discourse such as SDRT (Asher and Lascarides, 2003), in which utterances attach to an evolving discourse structure and the associated knowledge graph of speaker commitments serves as an interface to real-world reasoning and conversational strategy. As first steps towards implementing the approach, we describe a simple dialogue system in a referential communication domain that accumulates constraints across discourse, interprets them using a learned probabilistic model, and plans clarification using reinforcement learning.

Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

Authors:Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, LanSong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, Wei Lin
Date:2020-07-08 12:38:03

The last decade has witnessed growth in the computational requirements for training deep neural networks. Current approaches (e.g., data/model parallelism, pipeline parallelism) parallelize training tasks onto multiple devices. However, these approaches always rely on specific deep learning frameworks and require elaborate manual design, which makes them difficult to maintain and share between different types of models. In this paper, we propose Auto-MAP, a framework for exploring distributed execution plans for DNN workloads, which can automatically discover fast parallelization strategies through reinforcement learning at the IR level of deep learning models. Efficient exploration remains a major challenge for reinforcement learning. We leverage DQN with task-specific pruning strategies to help efficiently explore the search space including optimized strategies. Our evaluation shows that Auto-MAP can find the optimal solution in two hours, while achieving better throughput on several NLP and convolution models.

Design, Control, and Applications of a Soft Robotic Arm

Authors:Hao Jiang, Zhanchi Wang, Yusong Jin, Xiaotong Chen, Peijin Li, Yinghao Gan, Sen Lin, Xiaoping Chen
Date:2020-07-08 11:46:19

This paper presents the design, control, and applications of a multi-segment soft robotic arm. In order to design a soft arm with large load capacity, several design principles are proposed by analyzing two kinds of buckling issues, under which we present a novel structure named Honeycomb Pneumatic Networks (HPN). A parameter optimization method, based on the finite element method (FEM), is proposed to optimize the HPN arm design parameters. Through a quick fabrication process, several prototypes with different performance are made, one of which can achieve a transverse load capacity of 3 kg under 3 bar pressure. Next, considering different internal and external conditions, we develop three controllers according to different levels of model precision. Specifically, based on an accurate model, an open-loop controller is realized by combining the piece-wise constant curvature (PCC) modeling method and a machine learning method. Based on an inaccurate model, a feedback controller, using an estimated Jacobian, is realized in 3D space. A model-free controller, using reinforcement learning to learn a control policy rather than a model, is realized in the 2D plane, with minimal training data. Then, these three control methods are compared on the same experimental platform to explore the applicability of different methods under different conditions. Lastly, we find that the soft arm can greatly simplify the perception, planning, and control of interaction tasks through its compliance, which is its main advantage over a rigid arm. Through extensive experiments in three interaction application scenarios (human-robot interaction, a free space interaction task, and a confined space interaction task), we demonstrate the potential application prospects of the soft arm.

Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning

Authors:Ming Yin, Yu Bai, Yu-Xiang Wang
Date:2020-07-07 19:44:14

The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a fixed target policy $\pi$, which does not provide useful bounds for offline policy learning as $\pi$ will then be data-dependent. We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE -- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy under the time-inhomogeneous episodic MDP model ($H$ is the planning horizon, $d_m$ is a quantity that reflects the exploration of the logging policy $\mu$). To the best of our knowledge, this is the first time the optimal rate is shown to be possible for the offline RL setting and the paper is the first that systematically investigates the uniform convergence in OPE.

Sharp Analysis of Smoothed Bellman Error Embedding

Authors:Ahmed Touati, Pascal Vincent
Date:2020-07-07 19:27:09

The Smoothed Bellman Error Embedding algorithm (Dai et al., 2018), known as SBEED, was proposed as a provably convergent reinforcement learning algorithm with general nonlinear function approximation. It has been successfully implemented with neural networks and achieved strong empirical results. In this work, we study the theoretical behavior of SBEED in batch-mode reinforcement learning. We prove a near-optimal performance guarantee that depends on the representation power of the used function classes and a tight notion of the distribution shift. Our results improve upon prior guarantees for SBEED in Dai et al. (2018) in terms of the dependence on the planning horizon and on the sample size. Our analysis builds on the recent work of Xie and Jiang (2020), which studies a related algorithm, MSBO, that could be interpreted as a non-smooth counterpart of SBEED.

Selective Dyna-style Planning Under Limited Model Capacity

Authors:Zaheer Abbas, Samuel Sokota, Erin J. Talvitie, Martha White
Date:2020-07-05 18:51:50

In model-based reinforcement learning, planning with an imperfect model of the environment has the potential to harm learning progress. But even when a model is imperfect, it may still contain information that is useful for planning. In this paper, we investigate the idea of using an imperfect model selectively. The agent should plan in parts of the state space where the model would be helpful but refrain from using the model where it would be harmful. An effective selective planning mechanism requires estimating predictive uncertainty, which arises out of aleatoric uncertainty, parameter uncertainty, and model inadequacy, among other sources. Prior work has focused on parameter uncertainty for selective planning. In this work, we emphasize the importance of model inadequacy. We show that heteroscedastic regression can signal predictive uncertainty arising from model inadequacy that is complementary to that which is detected by methods designed for parameter uncertainty, indicating that considering both parameter uncertainty and model inadequacy may be a more promising direction for effective selective planning than either in isolation.
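
As a rough illustration of heteroscedastic regression, the sketch below shows the Gaussian negative log-likelihood that such a model typically minimizes when it predicts both a mean and a log-variance per input. The `should_plan` threshold rule is a hypothetical selection mechanism added for illustration; it is not taken from the paper.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var))."""
    variance = math.exp(log_var)
    return 0.5 * (log_var + (target - mean) ** 2 / variance + math.log(2 * math.pi))


def should_plan(log_var, threshold):
    """Hypothetical rule: trust the model only where its predicted variance is low."""
    return math.exp(log_var) <= threshold


# A large prediction error is penalized less when the model also predicts a
# large variance, so the variance output can act as a signal of where the
# model's predictions are unreliable.
confident_loss = gaussian_nll(target=3.0, mean=0.0, log_var=0.0)
uncertain_loss = gaussian_nll(target=3.0, mean=0.0, log_var=math.log(9.0))
print(confident_loss > uncertain_loss)
```

With this loss, a model that cannot fit some region of the state space can lower its loss there by predicting a high variance, and that predicted variance is what a selective planner could then consult.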

Mission schedule of agile satellites based on Proximal Policy Optimization Algorithm

Authors:Xinrui Liu
Date:2020-07-05 14:28:44

Mission scheduling of satellites is an important part of space operations nowadays, since the number and types of satellites in orbit are increasing tremendously and their corresponding tasks are also becoming more and more complicated. In this paper, a mission schedule model combined with the Proximal Policy Optimization (PPO) algorithm is proposed. Unlike the traditional heuristic planning method, this paper incorporates reinforcement learning algorithms into the model and finds a new way to describe the problem. Several constraints, including data download, are considered in this paper.

Discount Factor as a Regularizer in Reinforcement Learning

Authors:Ron Amit, Ron Meir, Kamil Ciosek
Date:2020-07-04 08:10:09

Specifying a Reinforcement Learning (RL) task involves choosing a suitable planning horizon, which is typically modeled by a discount factor. It is known that applying RL algorithms with a lower discount factor can act as a regularizer, improving performance in the limited data regime. Yet the exact nature of this regularizer has not been investigated. In this work, we fill in this gap. For several Temporal-Difference (TD) learning methods, we show an explicit equivalence between using a reduced discount factor and adding an explicit regularization term to the algorithm's loss. Motivated by the equivalence, we empirically study this technique compared to standard $L_2$ regularization by extensive experiments in discrete and continuous domains, using tabular and functional representations. Our experiments suggest the regularization effectiveness is strongly related to properties of the available data, such as size, distribution, and mixing rate.
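
To show where the discount factor enters a TD method, here is a minimal tabular TD(0) sketch run with two different discount factors on the same data. The 5-state chain, the step size, and all names are assumptions made for illustration; the sketch does not reproduce the paper's regularization equivalence.

```python
def td0_values(episodes, num_states, gamma, alpha=0.1):
    """Tabular TD(0) value estimates.

    Each episode is a list of (state, reward, next_state) tuples,
    with next_state set to None on the terminal transition.
    """
    values = [0.0] * num_states
    for episode in episodes:
        for state, reward, next_state in episode:
            bootstrap = 0.0 if next_state is None else values[next_state]
            td_error = reward + gamma * bootstrap - values[state]
            values[state] += alpha * td_error
    return values


# Hypothetical 5-state chain: always move right, reward 1 on the final step.
chain_episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 0.0, 4), (4, 1.0, None)]
data = [chain_episode] * 2000

values_high = td0_values(data, num_states=5, gamma=0.9)
values_low = td0_values(data, num_states=5, gamma=0.5)
print(values_high[0], values_low[0])
```

On this deterministic chain the estimate for the first state approaches `gamma ** 4`, so the lower discount factor shrinks the values of states far from the reward toward zero.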

Bidirectional Model-based Policy Optimization

Authors:Hang Lai, Jian Shen, Weinan Zhang, Yong Yu
Date:2020-07-04 03:34:09

Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although there are several existing methods dedicated to combating the model error, the potential of the single forward model is still limited. In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions. We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization. Furthermore, we theoretically derive a tighter bound of return discrepancy, which shows the superiority of BMPO against the one using merely the forward model. Extensive experiments demonstrate that BMPO outperforms state-of-the-art model-based methods in terms of sample efficiency and asymptotic performance.

Deep reinforcement learning driven inspection and maintenance planning under incomplete information and constraints

Authors:C. P. Andriotis, K. G. Papakonstantinou
Date:2020-07-02 20:44:07

Determination of inspection and maintenance policies for minimizing long-term risks and costs in deteriorating engineering environments constitutes a complex optimization problem. Major computational challenges include the (i) curse of dimensionality, due to exponential scaling of state/action set cardinalities with the number of components; (ii) curse of history, related to exponentially growing decision-trees with the number of decision-steps; (iii) presence of state uncertainties, induced by inherent environment stochasticity and variability of inspection/monitoring measurements; (iv) presence of constraints, pertaining to stochastic long-term limitations, due to resource scarcity and other infeasible/undesirable system responses. In this work, these challenges are addressed within a joint framework of constrained Partially Observable Markov Decision Processes (POMDP) and multi-agent Deep Reinforcement Learning (DRL). POMDPs optimally tackle (ii)-(iii), combining stochastic dynamic programming with Bayesian inference principles. Multi-agent DRL addresses (i), through deep function parametrizations and decentralized control assumptions. Challenge (iv) is herein handled through proper state augmentation and Lagrangian relaxation, with emphasis on life-cycle risk-based constraints and budget limitations. The underlying algorithmic steps are provided, and the proposed framework is found to outperform well-established policy baselines and facilitate adept prescription of inspection and intervention actions, in cases where decisions must be made in the most resource- and risk-aware manner.

Learning a Distributed Control Scheme for Demand Flexibility in Thermostatically Controlled Loads

Authors:Bingqing Chen, Weiran Yao, Jonathan Francis, Mario Bergés
Date:2020-07-01 22:16:59

Demand flexibility is increasingly important for power grids, in light of growing penetration of renewable generation. Careful coordination of thermostatically controlled loads (TCLs) can potentially modulate energy demand, decrease operating costs, and increase grid resiliency. However, it is challenging to control a heterogeneous population of TCLs: the control problem has a large state-action space; each TCL has unique and complex dynamics; and multiple system-level objectives need to be optimized simultaneously. To address these challenges, we propose a distributed control solution, which consists of a central load aggregator that optimizes system-level objectives and building-level controllers that track the load profiles planned by the aggregator. To optimize our agents' policies, we draw inspiration from both reinforcement learning (RL) and model predictive control. Specifically, the aggregator is updated with an evolutionary strategy, which was recently demonstrated to be a competitive and scalable alternative to more sophisticated RL algorithms and enables policy updates independent of the building-level controllers. We evaluate our proposed approach across four climate zones in four nine-building clusters, using the newly introduced CityLearn simulation environment. Our approach achieved an average reduction of 16.8% in the environment cost compared to the benchmark rule-based controller.

UAV Path Planning for Wireless Data Harvesting: A Deep Reinforcement Learning Approach

Authors:Harald Bayerlein, Mirco Theile, Marco Caccamo, David Gesbert
Date:2020-07-01 15:14:16

Autonomous deployment of unmanned aerial vehicles (UAVs) supporting next-generation communication networks requires efficient trajectory planning methods. We propose a new end-to-end reinforcement learning (RL) approach to UAV-enabled data collection from Internet of Things (IoT) devices in an urban environment. An autonomous drone is tasked with gathering data from distributed sensor nodes subject to limited flying time and obstacle avoidance. While previous approaches, learning and non-learning based, must perform expensive recomputations or relearn a behavior when important scenario parameters such as the number of sensors, sensor positions, or maximum flying time, change, we train a double deep Q-network (DDQN) with combined experience replay to learn a UAV control policy that generalizes over changing scenario parameters. By exploiting a multi-layer map of the environment fed through convolutional network layers to the agent, we show that our proposed network architecture enables the agent to make movement decisions for a variety of scenario parameters that balance the data collection goal with flight time efficiency and safety constraints. Considerable advantages in learning efficiency from using a map centered on the UAV's position over a non-centered map are also illustrated.
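
For readers unfamiliar with double deep Q-networks, the sketch below shows the standard Double DQN bootstrap target for a single transition: the online network picks the next action and the target network evaluates it. The function name and the plain-list Q-values are illustrative assumptions; the paper's network architecture, map inputs, and combined experience replay are not shown.

```python
def double_dqn_target(reward, done, gamma, online_q_next, target_q_next):
    """Double DQN target for one transition.

    online_q_next and target_q_next hold the per-action Q-values of the
    next state from the online and target networks respectively.
    """
    if done:
        # No bootstrapping past a terminal state.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(online_q_next)), key=online_q_next.__getitem__)
    # ... while action evaluation uses the target network.
    return reward + gamma * target_q_next[best_action]


target = double_dqn_target(
    reward=1.0,
    done=False,
    gamma=0.9,
    online_q_next=[1.0, 2.0],
    target_q_next=[5.0, 3.0],
)
print(target)
```

Here the online network prefers action 1, so the target bootstraps from the target network's value for action 1 rather than from its own maximum at action 0.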

Convex Regularization in Monte-Carlo Tree Search

Authors:Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen
Date:2020-07-01 11:29:08

Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms in order to solve large scale sequential decision problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off exploitation of good actions and exploration of unvisited states, but their empirical success comes at the cost of poor sample-efficiency and high computation time. In this paper, we overcome these limitations by considering convex regularization in Monte-Carlo Tree Search (MCTS), which has been successfully used in RL to efficiently drive exploration. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the regret analysis and providing guarantees of exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update, and on the Tsallis entropy of the policy. Finally, we empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor, from a toy problem to several Atari games, showing their superiority w.r.t. representative baselines.
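
As background for the UCT algorithm mentioned above, here is a minimal sketch of a UCB1-style child selection rule of the kind UCT applies at each tree node. It shows only this baseline rule, not the regularized backup operators the paper proposes; the function name, the statistics layout, and the use of the summed child visits as the parent count are assumptions of this sketch.

```python
import math


def uct_select(child_stats, exploration=math.sqrt(2)):
    """Return the index of the child with the highest UCB1 score.

    child_stats is a list of (visit_count, total_value) pairs.
    The parent visit count is approximated by the sum of child visits.
    """
    total_visits = sum(visits for visits, _ in child_stats)
    best_index, best_score = None, -math.inf
    for index, (visits, total_value) in enumerate(child_stats):
        if visits == 0:
            # Unvisited children are tried before any scoring.
            return index
        mean_value = total_value / visits
        bonus = exploration * math.sqrt(math.log(total_visits) / visits)
        if mean_value + bonus > best_score:
            best_index, best_score = index, mean_value + bonus
    return best_index


# Equal visit counts: the child with the higher mean value wins.
print(uct_select([(100, 60.0), (100, 40.0)]))
# A rarely visited child receives a large exploration bonus.
print(uct_select([(1000, 600.0), (1, 0.5)]))
```

The first term of the score favors exploitation of children with high average return, while the second term grows for rarely visited children and drives exploration.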

Model-based Reinforcement Learning: A Survey

Authors:Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker
Date:2020-06-30 12:10:07

Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is an important challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This paper presents a survey of the integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematically cover approaches to dynamics model learning, including challenges like dealing with stochasticity, uncertainty, partial observability, and temporal abstraction. Second, we present a systematic categorization of planning-learning integration, including aspects like: where to start planning, what budgets to allocate to planning and real data collection, how to plan, and how to integrate planning in the learning and acting loop. After these two sections, we also discuss implicit model-based RL as an end-to-end alternative for model learning and planning, and we cover the potential benefits of model-based RL. Along the way, the survey also draws connections to several related RL fields, like hierarchical RL and transfer learning. Altogether, the survey presents a broad conceptual overview of the combination of planning and learning for MDP optimization.

Supervised Learning and Reinforcement Learning of Feedback Models for Reactive Behaviors: Tactile Feedback Testbed

Authors:Giovanni Sutanto, Katharina Rombach, Yevgen Chebotar, Zhe Su, Stefan Schaal, Gaurav S. Sukhatme, Franziska Meier
Date:2020-06-29 19:56:25

Robots need to be able to adapt to unexpected changes in the environment such that they can autonomously succeed in their tasks. However, hand-designing feedback models for adaptation is tedious, if at all possible, making data-driven methods a promising alternative. In this paper we introduce a full framework for learning feedback models for reactive motion planning. Our pipeline starts by segmenting demonstrations of a complete task into motion primitives via a semi-automated segmentation algorithm. Then, given additional demonstrations of successful adaptation behaviors, we learn initial feedback models through learning from demonstrations. In the final phase, a sample-efficient reinforcement learning algorithm fine-tunes these feedback models for novel task settings through few real system interactions. We evaluate our approach on a real anthropomorphic robot in learning a tactile feedback task.

What can I do here? A Theory of Affordances in Reinforcement Learning

Authors:Khimya Khetarpal, Zafarali Ahmed, Gheorghe Comanici, David Abel, Doina Precup
Date:2020-06-26 16:34:53

Reinforcement learning algorithms usually assume that all actions are always available to an agent. However, both people and animals understand the general link between the features of their environment and the actions that are feasible. Gibson (1977) coined the term "affordances" to describe the fact that certain states enable an agent to do certain actions, in the context of embodied agents. In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes. Affordances play a dual role in this case. On one hand, they allow faster planning, by reducing the number of actions available in any given situation. On the other hand, they facilitate more efficient and precise learning of transition models from data, especially when such models require function approximation. We establish these properties through theoretical results as well as illustrative examples. We also propose an approach to learn affordances and use it to estimate transition models that are simpler and generalize better.
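
One way to picture how affordances can speed up planning is value iteration that only backs up over the actions afforded in each state. The sketch below uses a hypothetical 3-state corridor with deterministic transitions; the MDP, the affordance sets, and all names are assumptions for illustration and do not come from the paper.

```python
def value_iteration(transitions, rewards, affordances, gamma=0.9, sweeps=100):
    """Value iteration over deterministic transitions, restricted to afforded actions.

    transitions[s][a] is the next state, rewards[s][a] the reward, and
    affordances[s] the collection of actions considered in state s.
    """
    values = {state: 0.0 for state in transitions}
    for _ in range(sweeps):
        for state in transitions:
            values[state] = max(
                rewards[state][action] + gamma * values[transitions[state][action]]
                for action in affordances[state]
            )
    return values


# Hypothetical corridor: state 2 is absorbing, and entering it from state 1 pays 1.
transitions = {
    0: {"left": 0, "right": 1},
    1: {"left": 0, "right": 2},
    2: {"left": 2, "right": 2},
}
rewards = {
    0: {"left": 0.0, "right": 0.0},
    1: {"left": 0.0, "right": 1.0},
    2: {"left": 0.0, "right": 0.0},
}
all_actions = {state: ["left", "right"] for state in transitions}
afforded = {state: ["right"] for state in transitions}

full_values = value_iteration(transitions, rewards, all_actions)
reduced_values = value_iteration(transitions, rewards, afforded)
print(full_values, reduced_values)
```

In this toy case the afforded set still contains the optimal action in every state, so the restricted planner reaches the same values while evaluating half as many actions per backup.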

A Unifying Framework for Reinforcement Learning and Planning

Authors:Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker
Date:2020-06-26 14:30:41

Sequential decision making, commonly formalized as optimization of a Markov Decision Process, is a key challenge in artificial intelligence. Two successful approaches to MDP optimization are reinforcement learning and planning, which both largely have their own research communities. However, if both research fields solve the same problem, then we might be able to disentangle the common factors in their solution approaches. Therefore, this paper presents a unifying algorithmic framework for reinforcement learning and planning (FRAP), which identifies underlying dimensions on which MDP planning and learning algorithms have to decide. At the end of the paper, we compare a variety of well-known planning, model-free and model-based RL algorithms along these dimensions. Altogether, the framework may help provide deeper insight into the algorithmic design space of planning and reinforcement learning.

Mobile Robot Path Planning in Dynamic Environments: A Survey

Authors:Kuanqi Cai, Chaoqun Wang, Jiyu Cheng, Clarence W De Silva, Max Q. -H. Meng
Date:2020-06-25 06:20:20

There are many challenges for robot navigation in densely populated dynamic environments. This paper presents a survey of the path planning methods for robot navigation in dense environments. Particularly, the path planning in the navigation framework of mobile robots is composed of global path planning and local path planning, with regard to the planning scope and the executability. Within this framework, the recent progress of the path planning methods is presented in the paper, while examining their strengths and weaknesses. Notably, the recently developed Velocity Obstacle method and its variants that serve as the local planner are analyzed comprehensively. Moreover, as a model-free method that is widely used in current robot applications, the reinforcement learning-based path planning algorithms are detailed in this paper.

The NetHack Learning Environment

Authors:Heinrich Küttler, Nantas Nardelli, Alexander H. Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, Tim Rocktäschel
Date:2020-06-24 14:12:56

Progress in Reinforcement Learning (RL) algorithms goes hand-in-hand with the development of challenging environments that test the limits of current methods. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both. Here, we present the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience. We compare NLE and its task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed Deep RL baseline and Random Network Distillation exploration, alongside qualitative analysis of various agents trained in the environment. NLE is open source at https://github.com/facebookresearch/nle.

On Reward-Free Reinforcement Learning with Linear Function Approximation

Authors:Ruosong Wang, Simon S. Du, Lin F. Yang, Ruslan Salakhutdinov
Date:2020-06-19 17:59:36

Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses samples collected during the exploration phase to compute a near-optimal policy. Jin et al. [2020] showed that in the tabular setting, the agent only needs to collect a polynomial number of samples (in terms of the number of states, the number of actions, and the planning horizon) for reward-free RL. However, in practice, the number of states and actions can be large, and thus function approximation schemes are required for generalization. In this work, we give both positive and negative results for reward-free RL with linear function approximation. We give an algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations. The sample complexity of our algorithm is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions. We further give an exponential lower bound for reward-free RL in the setting where only the optimal $Q$-function admits a linear representation. Our results imply several interesting exponential separations on the sample complexity of reward-free RL.

Active Learning for Nonlinear System Identification with Guarantees

Authors:Horia Mania, Michael I. Jordan, Benjamin Recht
Date:2020-06-18 04:54:11

While the identification of nonlinear dynamical systems is a fundamental building block of model-based reinforcement learning and feedback control, its sample complexity is only understood for systems that either have discrete states and actions or can be identified from data generated by i.i.d. random inputs. Nonetheless, many interesting dynamical systems have continuous states and actions and can only be identified through a judicious choice of inputs. Motivated by practical settings, we study a class of nonlinear dynamical systems whose state transitions depend linearly on a known feature embedding of state-action pairs. To estimate such systems in finite time, identification methods must explore all directions in feature space. We propose an active learning approach that achieves this by repeating three steps: trajectory planning, trajectory tracking, and re-estimation of the system from all available data. We show that our method estimates nonlinear dynamical systems at a parametric rate, similar to the statistical rate of standard linear regression.

Learning to Track Dynamic Targets in Partially Known Environments

Authors:Heejin Jeong, Hamed Hassani, Manfred Morari, Daniel D. Lee, George J. Pappas
Date:2020-06-17 22:45:24

We solve active target tracking, one of the essential tasks in autonomous systems, using a deep reinforcement learning (RL) approach. In this problem, an autonomous agent is tasked with acquiring information about targets of interest using its onboard sensors. The classical challenges in this problem are system model dependence and the difficulty of computing information-theoretic cost functions for a long planning horizon. RL provides solutions for these challenges as the length of its effective planning horizon does not affect the computational complexity, and it drops the strong dependency of an algorithm on system models. In particular, we introduce Active Tracking Target Network (ATTN), a unified RL policy that is capable of solving major sub-tasks of active target tracking -- in-sight tracking, navigation, and exploration. The policy shows robust behavior for tracking agile and anomalous targets with a partially known target model. Additionally, the same policy is able to navigate in obstacle environments to reach distant targets as well as explore the environment when targets are positioned in unexpected locations.

Delta Schema Network in Model-based Reinforcement Learning

Authors:Andrey Gorodetskiy, Alexandra Shlychkova, Aleksandr I. Panov
Date:2020-06-17 15:58:25

This work is devoted to an unresolved problem of Artificial General Intelligence: the inefficiency of transfer learning. One of the mechanisms used to address this problem in reinforcement learning is a model-based approach. In this paper we extend the schema networks method, which makes it possible to extract the logical relationships between objects and actions from environment data. We present algorithms for training a Delta Schema Network (DSN), predicting future states of the environment, and planning actions that will lead to positive reward. DSN shows strong transfer learning performance on the classic Atari game environment.

$Q$-learning with Logarithmic Regret

Authors:Kunhe Yang, Lin F. Yang, Simon S. Du
Date:2020-06-16 13:01:33

This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve a logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
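
The optimistic $Q$-learning rule of [Jin et al. 2018] that this abstract analyzes can be sketched roughly as follows. This is a minimal tabular sketch, not the authors' code: the class name `OptimisticQ`, the zero-based step index `h`, and the constants `c` and `iota` (standing in for the bonus scale and the logarithmic factor) are illustrative assumptions.

```python
import math
from collections import defaultdict


class OptimisticQ:
    """Tabular episodic Q-learning with a Hoeffding-style exploration bonus."""

    def __init__(self, n_actions, horizon, c=1.0, iota=1.0):
        self.A = n_actions
        self.H = horizon
        self.c = c          # assumed bonus scale (a tuning constant)
        self.iota = iota    # assumed stand-in for the log factor in the bonus
        # Optimistic initialization: every Q-value starts at the maximum return H.
        self.Q = defaultdict(lambda: float(horizon))
        self.N = defaultdict(int)  # visit counts per (h, s, a)

    def value(self, h, s):
        """V_h(s) = min(H, max_a Q_h(s, a)); zero after the final step."""
        if h >= self.H:
            return 0.0
        return min(float(self.H), max(self.Q[(h, s, a)] for a in range(self.A)))

    def act(self, h, s):
        """Greedy action with respect to the optimistic Q-values."""
        return max(range(self.A), key=lambda a: self.Q[(h, s, a)])

    def update(self, h, s, a, r, s_next):
        self.N[(h, s, a)] += 1
        t = self.N[(h, s, a)]
        lr = (self.H + 1) / (self.H + t)                      # step size (H+1)/(H+t)
        bonus = self.c * math.sqrt(self.H ** 3 * self.iota / t)
        target = r + self.value(h + 1, s_next) + bonus
        self.Q[(h, s, a)] = (1 - lr) * self.Q[(h, s, a)] + lr * target
```

The bonus shrinks as $1/\sqrt{t}$ with the visit count, so rarely tried actions keep inflated estimates and are selected by the greedy rule until enough data is collected.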

Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

Authors:Sebastian Curi, Felix Berkenkamp, Andreas Krause
Date:2020-06-15 18:37:38

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for learning the model, they ignore this distinction when optimizing the policy, which leads to greedy and insufficient exploration. At the same time, there are no practical solvers for optimistic exploration algorithms. In this paper, we propose a practical optimistic exploration algorithm (H-UCRL). H-UCRL reparameterizes the set of plausible models and hallucinates control directly on the epistemic uncertainty. By augmenting the input space with the hallucinated inputs, H-UCRL can be solved using standard greedy planners. Furthermore, we analyze H-UCRL and construct a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian Process models. Based on this theoretical foundation, we show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.

Learning Heuristic Selection with Dynamic Algorithm Configuration

Authors:David Speck, André Biedenkapp, Frank Hutter, Robert Mattmüller, Marius Lindauer
Date:2020-06-15 09:35:07

A key challenge in satisficing planning is to use multiple heuristics within one heuristic search. An aggregation of multiple heuristic estimates, for example by taking the maximum, has the disadvantage that bad estimates of a single heuristic can negatively affect the whole search. Since the performance of a heuristic varies from instance to instance, approaches such as algorithm selection can be successfully applied. In addition, alternating between multiple heuristics during the search makes it possible to use all heuristics equally and improve performance. However, all these approaches ignore the internal search dynamics of a planning system, which can help to select the most useful heuristics for the current expansion step. We show that dynamic algorithm configuration can be used for dynamic heuristic selection which takes into account the internal search dynamics of a planning system. Furthermore, we prove that this approach generalizes over existing approaches and that it can exponentially improve the performance of the heuristic search. To learn dynamic heuristic selection, we propose an approach based on reinforcement learning and show empirically that domain-wise learned policies, which take the internal search dynamics of a planning system into account, can exceed existing approaches.

Reinforcement Learning as Iterative and Amortised Inference

Authors:Beren Millidge, Alexander Tschantz, Anil K Seth, Christopher L Buckley
Date:2020-06-13 16:10:03

There are several ways to categorise reinforcement learning (RL) algorithms, such as either model-based or model-free, policy-based or planning-based, on-policy or off-policy, and online or offline. Broad classification schemes such as these help provide a unified perspective on disparate techniques and can contextualise and guide the development of new algorithms. In this paper, we utilise the control as inference framework to outline a novel classification scheme based on amortised and iterative inference. We demonstrate that a wide range of algorithms can be classified in this manner providing a fresh perspective and highlighting a range of existing similarities. Moreover, we show that taking this perspective allows us to identify parts of the algorithmic design space which have been relatively unexplored, suggesting new routes to innovative RL algorithms.

Online Bayesian Goal Inference for Boundedly-Rational Planning Agents

Authors:Tan Zhi-Xuan, Jordyn L. Mann, Tom Silver, Joshua B. Tenenbaum, Vikash K. Mansinghka
Date:2020-06-13 01:48:10

People routinely infer the goals of others by observing their actions over time. Remarkably, we can do so even when those actions lead to failure, enabling us to assist others when we detect that they might not achieve their goals. How might we endow machines with similar capabilities? Here we present an architecture capable of inferring an agent's goals online from both optimal and non-optimal sequences of actions. Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning, thereby accounting for sub-optimal behavior. These models are specified as probabilistic programs, allowing us to represent and perform efficient Bayesian inference over an agent's goals and internal planning processes. To perform such inference, we develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models, limiting computation by incrementally extending inferred plans as new actions are observed. We present experiments showing that this modeling and inference architecture outperforms Bayesian inverse reinforcement learning baselines, accurately inferring goals from both optimal and non-optimal trajectories involving failure and back-tracking, while generalizing across domains with compositional structure and sparse rewards.

Continuous Control for Searching and Planning with a Learned Model

Authors:Xuxi Yang, Werner Duvaud, Peng Wei
Date:2020-06-12 19:10:41

Decision-making agents with planning capabilities have achieved huge success in challenging domains like Chess, Shogi, and Go. In an effort to generalize this planning ability to more general tasks where the environment dynamics are not available to the agent, researchers proposed the MuZero algorithm, which can learn the dynamical model through interactions with the environment. In this paper, we provide a method and the necessary theoretical results to extend the MuZero algorithm to more general environments with continuous action spaces. Through numerical results on two relatively low-dimensional MuJoCo environments, we show that the proposed algorithm outperforms the soft actor-critic (SAC) algorithm, a state-of-the-art model-free deep reinforcement learning algorithm.

Potential Field Guided Actor-Critic Reinforcement Learning

Authors:Weiya Ren
Date:2020-06-12 03:09:25

In this paper, we consider the problem of actor-critic reinforcement learning. First, we extend the actor-critic architecture to an actor-critic-N architecture by introducing critics beyond the reward-based one. Second, we combine the reward-based critic with a potential-field-based critic to formulate the proposed potential field guided actor-critic reinforcement learning approach (actor-critic-2). This can be seen as a combination of model-based gradients and model-free gradients in policy improvement. A state with a large potential field often carries strong prior information, such as pointing toward a distant target or steering away from a nearby obstacle. In this situation, we should place more trust in the potential-field-based critic for policy evaluation to accelerate policy improvement, so the action policy tends to be guided. For example, in practical applications, learning to avoid obstacles should be guided rather than learned by trial and error. A state with a small potential field often lacks information, for example at a local minimum or around a moving target. In this case, we should place more trust in the reward-based critic for policy evaluation in order to estimate the long-term return, and the action policy tends to explore. In addition, potential field evaluation can be combined with planning to estimate a better state value function. In this way, reward design can focus on the final-stage reward rather than on reward shaping or phased rewards. Furthermore, potential field evaluation can compensate for the lack of communication in multi-agent cooperation problems: each agent has its own reward-based critic and a relatively unified potential-field-based critic with prior information. Third, simplified experiments on a predator-prey game demonstrate the effectiveness of the proposed approach.

Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework

Authors:Chuheng Zhang, Yuanying Cai, Longbo Huang, Jian Li
Date:2020-06-11 05:05:31

Exploration is essential for reinforcement learning (RL). To face the challenges of exploration, we consider a reward-free RL framework that completely separates exploration from exploitation and brings new challenges for exploration algorithms. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment and collects a dataset of transitions by executing the policy. In the planning phase, the agent computes a good policy for any reward function based on the dataset without further interacting with the environment. This framework is suitable for the meta RL setting where there are many reward functions of interest. In the exploration phase, we propose to maximize the Rényi entropy over the state-action space and justify this objective theoretically. The success of using Rényi entropy as the objective results from its encouragement to explore the hard-to-reach state-actions. We further deduce a policy gradient formulation for this objective and design a practical exploration algorithm that can deal with complex environments. In the planning phase, we solve for good policies given arbitrary reward functions using a batch RL algorithm. Empirically, we show that our exploration algorithm is effective and sample efficient, and results in superior policies for arbitrary reward functions in the planning phase.
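
For reference, the Rényi entropy of order $\alpha$ of a discrete distribution $p$ is $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$, with Shannon entropy as the limit $\alpha \to 1$. The sketch below only evaluates this standard definition on a toy state-action visitation distribution; the function name `renyi_entropy` and the example numbers are illustrative assumptions, and it is not the paper's policy gradient algorithm.

```python
import math


def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha (in nats) of a discrete distribution."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    support = [p for p in probs if p > 0]
    if abs(alpha - 1.0) < 1e-12:
        # Shannon entropy is the alpha -> 1 limit.
        return -sum(p * math.log(p) for p in support)
    return math.log(sum(p ** alpha for p in support)) / (1.0 - alpha)


# Toy visitation distribution: one easy-to-reach pair, three rare ones.
skewed = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]
for alpha in (0.5, 1.0, 2.0):
    print(alpha, renyi_entropy(skewed, alpha), renyi_entropy(uniform, alpha))
```

For every order the uniform distribution attains the maximum $\log 4$, and for a non-uniform distribution the entropy decreases as $\alpha$ grows.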

Deep reinforcement learning for optical systems: A case study of mode-locked lasers

Authors:Chang Sun, Eurika Kaiser, Steven L. Brunton, J. Nathan Kutz
Date:2020-06-10 00:30:36

We demonstrate that deep reinforcement learning (deep RL) provides a highly effective strategy for the control and self-tuning of optical systems. Deep RL integrates the two leading machine learning architectures of deep neural networks and reinforcement learning to produce robust and stable learning for control. Deep RL is ideally suited for optical systems, as their tuning and control rely on interactions with the environment under a goal-oriented objective to achieve optimal immediate or delayed rewards. This allows the optical system to recognize bi-stable structures and navigate, via trajectory planning, to optimally performing solutions, the first such algorithm demonstrated to do so in optical systems. We specifically demonstrate the deep RL architecture on a mode-locked laser, where robust self-tuning and control can be established by giving the deep RL agent access to its waveplates and polarizers. We further integrate transfer learning to help the deep RL agent rapidly learn new parameter regimes and generalize its control authority. Additionally, deep RL can be easily integrated with other control paradigms to provide a broad framework to control any optical system.

Learning Navigation Costs from Demonstration with Semantic Observations

Authors:Tianyu Wang, Vikas Dhiman, Nikolay Atanasov
Date:2020-06-09 04:35:57

This paper focuses on inverse reinforcement learning (IRL) for autonomous robot navigation using semantic observations. The objective is to infer a cost function that explains demonstrated behavior while relying only on the expert's observations and state-control trajectory. We develop a map encoder, which infers semantic class probabilities from the observation sequence, and a cost encoder, defined as a deep neural network over the semantic features. Since the expert cost is not directly observable, the representation parameters can only be optimized by differentiating the error between demonstrated controls and a control policy computed from the cost estimate. The error is optimized using a closed-form subgradient computed only over a subset of promising states via a motion planning algorithm. We show that our approach learns to follow traffic rules in the autonomous driving CARLA simulator by relying on semantic observations of cars, sidewalks and road lanes.

Hallucinating Value: A Pitfall of Dyna-style Planning with Imperfect Environment Models

Authors:Taher Jafferjee, Ehsan Imani, Erin Talvitie, Martha White, Micheal Bowling
Date:2020-06-08 05:30:09

Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we investigate one type of model error: hallucinated states. These are states generated by the model that are not real states of the environment. We present the Hallucinated Value Hypothesis (HVH): updating values of real states towards values of hallucinated states results in misleading state-action values which adversely affect the control policy. We discuss and evaluate four Dyna variants: three that update real states toward simulated -- and therefore potentially hallucinated -- states and one that does not. The experimental results provide evidence for the HVH, thus suggesting a fruitful direction toward developing Dyna algorithms robust to model error.
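
The Dyna-style update described above can be sketched with classic tabular Dyna-Q. This is a minimal sketch under stated assumptions: a deterministic learned model stored as a dictionary, a hypothetical function name `dyna_q_step`, and illustrative hyperparameters. It is not claimed to be any of the four variants evaluated in the paper.

```python
import random
from collections import defaultdict


def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.95, n_planning=5, rng=random):
    """One real Q-learning update followed by n_planning simulated updates."""

    def backup(state, action, reward, next_state):
        # Standard Q-learning backup toward reward + discounted best next value.
        best_next = max(Q[(next_state, b)] for b in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    backup(s, a, r, s_next)        # learn from the real transition
    model[(s, a)] = (r, s_next)    # assumed deterministic model: remember last outcome

    for _ in range(n_planning):
        ps, pa = rng.choice(list(model))   # previously seen state-action pair
        pr, ps_next = model[(ps, pa)]      # simulated outcome from the model
        backup(ps, pa, pr, ps_next)        # learn from simulated experience


Q = defaultdict(float)
model = {}
dyna_q_step(Q, model, "s0", "go", 1.0, "s1", ["go"], alpha=0.5, gamma=0.9, n_planning=0)
print(Q[("s0", "go")])
```

In this sketch the planning backups bootstrap from `Q[(ps_next, b)]`; if an imperfect model returned a `ps_next` that is not a real state, the real pair `(ps, pa)` would be updated toward the value of a hallucinated state, which is the situation the HVH concerns.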

Solving Hard AI Planning Instances Using Curriculum-Driven Deep Reinforcement Learning

Authors:Dieqiao Feng, Carla P. Gomes, Bart Selman
Date:2020-06-04 08:13:12

Despite significant progress in general AI planning, certain domains remain out of reach of current AI planning systems. Sokoban is a PSPACE-complete planning task and represents one of the hardest domains for current AI planners. Even domain-specific specialized search methods fail quickly due to the exponential search complexity on hard instances. Our approach based on deep reinforcement learning augmented with a curriculum-driven method is the first one to solve hard instances within one day of training while other modern solvers cannot solve these instances within any reasonable time limit. In contrast to prior efforts, which use carefully handcrafted pruning techniques, our approach automatically uncovers domain structure. Our results reveal that deep RL provides a promising framework for solving previously unsolved AI planning problems, provided a proper training curriculum can be devised.

Causality and Batch Reinforcement Learning: Complementary Approaches To Planning In Unknown Domains

Authors:James Bannon, Brad Windsor, Wenbo Song, Tao Li
Date:2020-06-03 23:14:14

Reinforcement learning algorithms have had tremendous successes in online learning settings. However, these successes have relied on low-stakes interactions between the algorithmic agent and its environment. In many settings where RL could be of use, such as health care and autonomous driving, the mistakes made by most online RL algorithms during early training come with unacceptable costs. These settings require developing reinforcement learning algorithms that can operate in the so-called batch setting, where the algorithms must learn from a set of data that is fixed, finite, and generated from some (possibly unknown) policy. Evaluating policies different from the one that collected the data is called off-policy evaluation, and naturally poses counter-factual questions. In this project we show how off-policy evaluation and the estimation of treatment effects in causal inference are two approaches to the same problem, and compare recent progress in these two areas.

Combining Reinforcement Learning and Constraint Programming for Combinatorial Optimization

Authors:Quentin Cappart, Thierry Moisan, Louis-Martin Rousseau, Isabeau Prémont-Schwarz, Andre Cire
Date:2020-06-02 13:54:27

Combinatorial optimization has found applications in numerous fields, from aerospace to transportation planning and economics. The goal is to find an optimal solution among a finite set of possibilities. The well-known challenge one faces with combinatorial optimization is the state-space explosion problem: the number of possibilities grows exponentially with the problem size, which makes solving intractable for large problems. In recent years, deep reinforcement learning (DRL) has shown promise for designing good heuristics dedicated to solving NP-hard combinatorial optimization problems. However, current approaches have two shortcomings: (1) they mainly focus on the standard traveling salesman problem and cannot be easily extended to other problems, and (2) they only provide an approximate solution with no systematic way to improve it or to prove optimality. In another context, constraint programming (CP) is a generic tool for solving combinatorial optimization problems. Based on a complete search procedure, it will always find the optimal solution given a large enough execution time. A critical design choice that makes CP non-trivial to use in practice is the branching decision, which directs how the search space is explored. In this work, we propose a general and hybrid approach, based on DRL and CP, for solving combinatorial optimization problems. The core of our approach is based on a dynamic programming formulation that acts as a bridge between both techniques. We experimentally show that our solver is efficient at solving two challenging problems: the traveling salesman problem with time windows, and the 4-moments portfolio optimization problem. The results show that the framework introduced outperforms the stand-alone RL and CP solutions, while being competitive with industrial solvers.

Model-Based Reinforcement Learning with Value-Targeted Regression

Authors:Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, Lin F. Yang
Date:2020-06-01 17:47:53

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_{\theta} = \sum_{i=1}^{d} \theta_{i}P_{i}$. We propose a model-based RL algorithm that is based on the optimism principle: in each episode, the set of models that are "consistent" with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$ and $d$ are the horizon, total number of steps and dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).
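
To illustrate why predicting values is a natural regression target under a linear mixture $P_{\theta} = \sum_{i=1}^{d} \theta_{i}P_{i}$: the expected next-state value under $P_\theta$ is linear in $\theta$, with one feature per base model. The sketch below checks this identity for a single state-action pair; the helper names and the numbers are illustrative assumptions, and it does not implement the paper's confidence sets or optimistic planning.

```python
def mixture_row(theta, base_rows):
    """Next-state distribution P_theta(.|s,a) = sum_i theta_i * P_i(.|s,a)."""
    n_next = len(base_rows[0])
    return [sum(t * row[j] for t, row in zip(theta, base_rows)) for j in range(n_next)]


def expected_value(row, value):
    """E_{s' ~ row}[V(s')] for a next-state distribution and a value vector."""
    return sum(p * v for p, v in zip(row, value))


def value_features(base_rows, value):
    """x_i = E_{s' ~ P_i(.|s,a)}[V(s')], one feature per base model."""
    return [expected_value(row, value) for row in base_rows]


# Two hypothetical base models over three next states, for one fixed (s, a).
base_rows = [[0.8, 0.2, 0.0],
             [0.1, 0.3, 0.6]]
theta = [0.25, 0.75]
value = [0.0, 1.0, 3.0]  # current value estimate V(s')

direct = expected_value(mixture_row(theta, base_rows), value)
linear = sum(t * x for t, x in zip(theta, value_features(base_rows, value)))
print(direct, linear)  # the two quantities agree up to rounding
```

Because the target $V(s')$ has conditional mean $\langle \theta, x(s,a) \rangle$, an estimate of $\theta$ can in principle be fit by least squares on observed transitions, which is the kind of squared-error consistency criterion the abstract refers to.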

PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals

Authors:Henry Charlesworth, Giovanni Montana
Date:2020-06-01 12:53:09

Learning with sparse rewards remains a significant challenge in reinforcement learning (RL), especially when the aim is to train a policy capable of achieving multiple different goals. To date, the most successful approaches for dealing with multi-goal, sparse reward environments have been model-free RL algorithms. In this work we propose PlanGAN, a model-based algorithm specifically designed for solving multi-goal tasks in environments with sparse rewards. Our method builds on the fact that any trajectory of experience collected by an agent contains useful information about how to achieve the goals observed during that trajectory. We use this to train an ensemble of conditional generative models (GANs) to generate plausible trajectories that lead the agent from its current state towards a specified goal. We then combine these imagined trajectories into a novel planning algorithm in order to achieve the desired goal as efficiently as possible. The performance of PlanGAN has been tested on a number of robotic navigation/manipulation tasks in comparison with a range of model-free reinforcement learning baselines, including Hindsight Experience Replay. Our studies indicate that PlanGAN can achieve comparable performance whilst being around 4-8 times more sample efficient.

Deep R-Learning for Continual Area Sweeping

Authors:Rishi Shah, Yuqian Jiang, Justin Hart, Peter Stone
Date:2020-05-31 19:15:28

Coverage path planning is a well-studied problem in robotics in which a robot must plan a path that passes through every point in a given area repeatedly, usually with a uniform frequency. To address the scenario in which some points need to be visited more frequently than others, this problem has been extended to non-uniform coverage planning. This paper considers the variant of non-uniform coverage in which the robot does not know the distribution of relevant events beforehand and must nevertheless learn to maximize the rate of detecting events of interest. This continual area sweeping problem has been previously formalized in a way that makes strong assumptions about the environment, and to date only a greedy approach has been proposed. We generalize the continual area sweeping formulation to include fewer environmental constraints, and propose a novel approach based on reinforcement learning in a Semi-Markov Decision Process. This approach is evaluated in an abstract simulation and in a high fidelity Gazebo simulation. These evaluations show significant improvement upon the existing approach in general settings, which is especially relevant in the growing area of service robotics.

Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

Authors:Gen Li, Yuting Wei, Yuejie Chi, Yuxin Chen
Date:2020-05-26 17:53:18

This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically infeasible).
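
As a quick worked comparison of the two sample-size thresholds quoted above, the sketch below evaluates $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ and $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ for one example problem size. The function names and the chosen numbers are illustrative assumptions, and constants and logarithmic factors are ignored.

```python
def prior_barrier(n_states, n_actions, gamma):
    """Order of the sample size required by prior results: |S||A| / (1 - gamma)^2."""
    return n_states * n_actions / (1.0 - gamma) ** 2


def new_threshold(n_states, n_actions, gamma):
    """Order of the sample size required in this paper: |S||A| / (1 - gamma)."""
    return n_states * n_actions / (1.0 - gamma)


S, A, gamma = 1000, 10, 0.99
old = prior_barrier(S, A, gamma)
new = new_threshold(S, A, gamma)
print(f"prior barrier ~ {old:.3g}, new threshold ~ {new:.3g}, ratio ~ {old / new:.3g}")
```

The ratio between the two thresholds is the effective horizon $\frac{1}{1-\gamma}$, so the gap widens as the discount factor approaches 1.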

Automatic Discovery of Interpretable Planning Strategies

Authors:Julian Skirzyński, Frederic Becker, Falk Lieder
Date:2020-05-24 12:24:52

When making decisions, people often overlook critical information or are overly swayed by irrelevant information. A common approach to mitigate these biases is to provide decision-makers, especially professionals such as medical doctors, with decision aids, such as decision trees and flowcharts. Designing effective decision aids is a difficult problem. We propose that recently developed reinforcement learning methods for discovering clever heuristics for good decision-making can be partially leveraged to assist human experts in this design process. One of the biggest remaining obstacles to leveraging the aforementioned methods is that the policies they learn are opaque to people. To solve this problem, we introduce AI-Interpret: a general method for transforming idiosyncratic policies into simple and interpretable descriptions. Our algorithm combines recent advances in imitation learning and program induction with a new clustering method for identifying a large subset of demonstrations that can be accurately described by a simple, high-performing decision rule. We evaluate our new algorithm and employ it to translate information-acquisition policies discovered through metalevel reinforcement learning. The results of large behavioral experiments showed that providing the decision rules generated by AI-Interpret as flowcharts significantly improved people's planning strategies and decisions across three different classes of sequential decision problems. Moreover, another experiment revealed that this approach is significantly more effective than training people by giving them performance feedback. Finally, a series of ablation studies confirmed that AI-Interpret is critical to the discovery of interpretable decision rules. We conclude that the methods and findings presented herein are an important step towards leveraging automatic strategy discovery to improve human decision-making.

Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension

Authors:Ruosong Wang, Ruslan Salakhutdinov, Lin F. Yang
Date:2020-05-21 17:36:09

Value function approximation has demonstrated phenomenal empirical success in reinforcement learning (RL). Nevertheless, despite recent progress on developing theory for RL with linear function approximation, the understanding of general function approximation schemes largely remains missing. In this paper, we establish a provably efficient RL algorithm with general value function approximation. We show that if the value functions admit an approximation with a function class $\mathcal{F}$, our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure of $\mathcal{F}$ that depends on the eluder dimension [Russo and Van Roy, 2013] and log-covering numbers, $H$ is the planning horizon, and $T$ is the number of interactions with the environment. Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment. Moreover, our algorithm is model-free and provides a framework to justify the effectiveness of algorithms used in practice.

Batch-Augmented Multi-Agent Reinforcement Learning for Efficient Traffic Signal Optimization

Authors:Yueh-Hua Wu, I-Hau Yeh, David Hu, Hong-Yuan Mark Liao
Date:2020-05-19 17:53:05

The goal of this work is to provide a viable solution based on reinforcement learning for traffic signal control problems. Although state-of-the-art reinforcement learning approaches have yielded great success in a variety of domains, directly applying them to alleviate traffic congestion can be challenging, considering the requirement of high sample efficiency and how training data is gathered. In this work, we address several challenges that we encountered when we attempted to mitigate serious traffic congestion occurring in a metropolitan area. Specifically, we are required to provide a solution that is able to (1) handle the traffic signal control when certain surveillance cameras that retrieve information for reinforcement learning are down, (2) learn from batch data without a traffic simulator, and (3) make control decisions without shared information across intersections. We present a two-stage framework to deal with the above-mentioned situations. The framework can be decomposed into an Evolution Strategies approach that gives a fixed-time traffic signal control schedule and a multi-agent off-policy reinforcement learning approach that is capable of learning from batch data with the aid of three proposed components: bounded action, batch augmentation, and surrogate reward clipping. Our experiments show that the proposed framework reduces traffic congestion by 36% in terms of waiting time compared with the currently used fixed-time traffic signal plan. Furthermore, the framework requires only 600 queries to a simulator to achieve the result.

The Second Type of Uncertainty in Monte Carlo Tree Search

Authors:Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker
Date:2020-05-19 09:10:51

Monte Carlo Tree Search (MCTS) efficiently balances exploration and exploitation in tree search based on count-derived uncertainty. However, these local visit counts ignore a second type of uncertainty induced by the size of the subtree below an action. We first show how, due to the lack of this second uncertainty type, MCTS may completely fail in well-known sparse exploration problems, known from the reinforcement learning community. We then introduce a new algorithm, which estimates the size of the subtree below an action, and leverages this information in the UCB formula to better direct exploration. Subsequently, we generalize these ideas by showing that loops, i.e., the repeated occurrence of (approximately) the same state in the same trace, are actually a special case of subtree depth variation. Testing on a variety of tasks shows that our algorithms increase sample efficiency, especially when the planning budget per timestep is small.
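
The count-derived uncertainty mentioned above is usually expressed through a UCB score. The following is a minimal sketch of the standard UCB1 selection rule only; it does not include the subtree-size estimate that the abstract proposes, and the names `ucb_score`, `select_child`, and the constant `c` are illustrative assumptions.

```python
import math


def ucb_score(mean_value, child_visits, parent_visits, c=1.4):
    """Standard UCB1 score: mean value plus a bonus that shrinks with visits.

    Unvisited children get +inf so that each child is tried at least once.
    """
    if child_visits == 0:
        return math.inf
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def select_child(children, c=1.4):
    """Pick the child with the highest UCB1 score.

    children: list of (mean_value, visit_count) tuples (hypothetical layout).
    Returns the index of the selected child.
    """
    parent_visits = sum(n for _, n in children)
    scores = [ucb_score(q, n, parent_visits, c) for q, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

Because the bonus depends only on local visit counts, two actions with equal counts receive the same bonus regardless of how large the subtree below each one is, which is the gap the abstract points to.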

Mutual Information Maximization for Robust Plannable Representations

Authors:Yiming Ding, Ignasi Clavera, Pieter Abbeel
Date:2020-05-16 21:58:47

Extending the capabilities of robotics to complex, unstructured real-world environments requires developing better perception systems while maintaining low sample complexity. When dealing with high-dimensional state spaces, current methods are either model-free or model-based with reconstruction objectives. The sample inefficiency of the former constitutes a major barrier to applying them in the real world. The latter, while presenting low sample complexity, learn latent spaces that need to reconstruct every single detail of the scene. In real environments, the task typically just represents a small fraction of the scene. Reconstruction objectives suffer in such scenarios as they capture all the unnecessary components. In this work, we present MIRO, an information theoretic representational learning algorithm for model-based reinforcement learning. We design a latent space that maximizes the mutual information with the future information while being able to capture all the information needed for planning. We show that our approach is more robust than reconstruction objectives in the presence of distractors and cluttered scenes.

Lifelong Control of Off-grid Microgrid with Model Based Reinforcement Learning

Authors:Simone Totaro, Ioannis Boukas, Anders Jonsson, Bertrand Cornélusse
Date:2020-05-16 14:45:55

The lifelong control problem of an off-grid microgrid is composed of two tasks, namely estimation of the condition of the microgrid devices and operational planning accounting for the uncertainties by forecasting the future consumption and the renewable production. The main challenge for effective control arises from the various changes that take place over time. In this paper, we present an open-source reinforcement learning framework for the modeling of an off-grid microgrid for rural electrification. The lifelong control problem of an isolated microgrid is formulated as a Markov Decision Process (MDP). We categorize the changes that can occur into progressive and abrupt changes. We propose a novel model-based reinforcement learning algorithm that is able to address both types of changes. In particular, the proposed algorithm demonstrates generalisation properties, transfer capabilities and better robustness in case of fast-changing system dynamics. The proposed algorithm is compared against a rule-based policy and a model predictive controller with look-ahead. The results show that the trained agent is able to outperform both benchmarks in the lifelong setting where the system dynamics are changing over time.

Think Too Fast Nor Too Slow: The Computational Trade-off Between Planning And Reinforcement Learning

Authors:Thomas M. Moerland, Anna Deichler, Simone Baldi, Joost Broekens, Catholijn M. Jonker
Date:2020-05-15 08:20:08

Planning and reinforcement learning are two key approaches to sequential decision making. Multi-step approximate real-time dynamic programming, a recently successful algorithm class of which AlphaZero [Silver et al., 2018] is an example, combines both by nesting planning within a learning loop. However, the combination of planning and learning introduces a new question: how should we balance time spent on planning, learning and acting? The importance of this trade-off has not been explicitly studied before. We show that it is actually of key importance, with computational results indicating that we should neither plan too long nor too short. Conceptually, we identify a new spectrum of planning-learning algorithms which ranges from exhaustive search (long planning) to model-free RL (no planning), with optimal performance achieved midway.

Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning

Authors:Kimin Lee, Younggyo Seo, Seunghyun Lee, Honglak Lee, Jinwoo Shin
Date:2020-05-14 08:10:54

Model-based reinforcement learning (RL) enjoys several benefits, such as data-efficiency and planning, by learning a model of the environment's dynamics. However, learning a global model that can generalize across different dynamics is a challenging task. To tackle this problem, we decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it. In order to encode dynamics-specific information into the context latent vector, we introduce a novel loss function that encourages the context latent vector to be useful for predicting both forward and backward dynamics. The proposed method achieves superior generalization ability across various simulated robotics and control tasks, compared to existing RL schemes.

From Simulation to Real World Maneuver Execution using Deep Reinforcement Learning

Authors:Alessandro Paolo Capasso, Giulio Bacchiani, Alberto Broggi
Date:2020-05-13 14:22:20

Deep Reinforcement Learning has proved to be able to solve many control tasks in different fields, but the behavior of these systems is not always as expected when deployed in real-world scenarios. This is mainly due to the lack of domain adaptation between simulated and real-world data together with the absence of distinction between train and test datasets. In this work, we investigate these problems in the autonomous driving field, especially for a maneuver planning module for roundabout insertions. In particular, we present a system based on multiple environments in which agents are trained simultaneously, evaluating the behavior of the model in different scenarios. Finally, we analyze techniques aimed at reducing the gap between simulated and real-world data showing that this increased the generalization capabilities of the system both on unseen and real-world scenarios.

Planning to Explore via Self-Supervised World Models

Authors:Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak
Date:2020-05-12 17:59:45

Reinforcement learning allows solving complex tasks; however, the learning tends to be task-specific and the sample efficiency remains a challenge. We present Plan2Explore, a self-supervised reinforcement learning agent that tackles both these challenges through a new approach to self-supervised exploration and fast adaptation to new tasks, which need not be known during exploration. During exploration, unlike prior methods which retrospectively compute the novelty of observations after the agent has already reached them, our agent acts efficiently by leveraging planning to seek out expected future novelty. After exploration, the agent quickly adapts to multiple downstream tasks in a zero- or few-shot manner. We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods, and in fact, almost matches the performance of an oracle which has access to rewards. Videos and code at https://ramanans1.github.io/plan2explore/

MOReL : Model-Based Offline Reinforcement Learning

Authors:Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims
Date:2020-05-12 17:52:43

In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline can greatly expand the applicability of RL, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; and (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g. generative modeling, uncertainty estimation, planning etc.) to directly translate into advances for offline RL.
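
The abstract does not say how the P-MDP is constructed. One way to obtain pessimism, assumed here purely for illustration, is to replace the reward with a large penalty wherever an ensemble of learned dynamics models disagrees about the next state. The functions `max_disagreement` and `pessimistic_reward`, and the parameters `threshold` and `penalty`, are hypothetical names for this sketch.

```python
def max_disagreement(predictions):
    """Largest coordinate-wise gap between any two ensemble predictions.

    predictions: list of next-state predictions, each a list of floats.
    Returns 0.0 when fewer than two predictions are given.
    """
    return max(
        (
            max(abs(a - b) for a, b in zip(p, q))
            for i, p in enumerate(predictions)
            for q in predictions[i + 1:]
        ),
        default=0.0,
    )


def pessimistic_reward(reward, predictions, threshold, penalty):
    """Return the reward used by the pessimistic model.

    Assumption: state-action pairs where the ensemble disagrees by more
    than `threshold` are treated as unknown and receive `-penalty`.
    """
    if max_disagreement(predictions) > threshold:
        return -penalty
    return reward
```

A policy optimized against such a penalized model is discouraged from visiting regions the dataset does not cover, which is the intuition behind the lower-bound property described in the abstract.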

Mobile Robot Path Planning in Dynamic Environments through Globally Guided Reinforcement Learning

Authors:Binyu Wang, Zhe Liu, Qingbiao Li, Amanda Prorok
Date:2020-05-11 20:42:29

Path planning for mobile robots in large dynamic environments is a challenging problem, as the robots are required to efficiently reach their given goals while simultaneously avoiding potential conflicts with other robots or dynamic objects. In the presence of dynamic obstacles, traditional solutions usually employ re-planning strategies, which re-call a planning algorithm to search for an alternative path whenever the robot encounters a conflict. However, such re-planning strategies often cause unnecessary detours. To address this issue, we propose a learning-based technique that exploits environmental spatio-temporal information. Different from existing learning-based methods, we introduce a globally guided reinforcement learning approach (G2RL), which incorporates a novel reward structure that generalizes to arbitrary environments. We apply G2RL to solve the multi-robot path planning problem in a fully distributed reactive manner. We evaluate our method across different map types, obstacle densities, and the number of robots. Experimental results show that G2RL generalizes well, outperforming existing distributed methods, and performing very similarly to fully centralized state-of-the-art benchmarks.

Deep Reinforcement Learning for Organ Localization in CT

Authors:Fernando Navarro, Anjany Sekuboyina, Diana Waldmannstetter, Jan C. Peeken, Stephanie E. Combs, Bjoern H. Menze
Date:2020-05-11 10:06:13

Robust localization of organs in computed tomography scans is a constant pre-processing requirement for organ-specific image retrieval, radiotherapy planning, and interventional image analysis. In contrast to current solutions based on exhaustive search or region proposals, which require large amounts of annotated data, we propose a deep reinforcement learning approach for organ localization in CT. In this work, an artificial agent is actively self-taught to localize organs in CT by learning from its asserts and mistakes. Within the context of reinforcement learning, we propose a novel set of actions tailored for organ localization in CT. Our method can be used as a plug-and-play module for localizing any organ of interest. We evaluate the proposed solution on the public VISCERAL dataset containing CT scans with varying fields of view and multiple organs. We achieved an overall intersection over union of 0.63, an absolute median wall distance of 2.25 mm, and a median distance between centroids of 3.65 mm.

TOMA: Topological Map Abstraction for Reinforcement Learning

Authors:Zhao-Heng Yin, Wu-Jun Li
Date:2020-05-11 05:24:47

Animals are able to discover the topological map (graph) of the surrounding environment, which is then used for navigation. Inspired by this biological phenomenon, researchers have recently proposed to generate graph representations for Markov decision processes (MDPs) and use such graphs for planning in reinforcement learning (RL). However, existing graph generation methods suffer from many drawbacks. One drawback is that existing methods do not learn an abstraction for graphs, which results in high memory and computation cost. This drawback also makes the generated graph non-robust, which degrades the planning performance. Another drawback is that existing methods cannot be used for facilitating exploration, which is important in RL. In this paper, we propose a new method, called topological map abstraction (TOMA), for graph generation. TOMA can generate an abstract graph representation for an MDP, which requires much less memory and computation than existing methods. Furthermore, TOMA can be used for facilitating exploration. In particular, we propose planning to explore, in which TOMA is used to accelerate exploration by guiding the agent towards unexplored states. A novel experience replay module called vertex memory is also proposed to improve exploration performance. Experimental results show that TOMA can outperform existing methods to achieve state-of-the-art performance.

Learning hierarchical behavior and motion planning for autonomous driving

Authors:Jingke Wang, Yue Wang, Dongkun Zhang, Yezhou Yang, Rong Xiong
Date:2020-05-08 05:34:55

Learning-based driving solutions, a new branch of autonomous driving, are expected to simplify the modeling of driving by learning the underlying mechanisms from data. To improve the tactical decision-making of learning-based driving solutions, we introduce hierarchical behavior and motion planning (HBMP) to explicitly model the behavior in learning-based solutions. Due to the coupled action space of behavior and motion, it is challenging to solve the HBMP problem using reinforcement learning (RL) for long-horizon driving tasks. We transform the HBMP problem by integrating a classical sampling-based motion planner, whose optimal cost is regarded as the reward for high-level behavior learning. As a result, this formulation reduces the action space and diversifies the rewards without losing the optimality of HBMP. In addition, we propose a sharable representation for input sensory data across simulation platforms and the real-world environment, so that models trained in a fast event-based simulator, SUMO, can be used to initialize and accelerate the RL training in a dynamics-based simulator, CARLA. Experimental results demonstrate the effectiveness of the method. Moreover, the model is successfully transferred to the real world, validating its generalization capability.

Plan2Vec: Unsupervised Representation Learning by Latent Plans

Authors:Ge Yang, Amy Zhang, Ari S. Morcos, Joelle Pineau, Pieter Abbeel, Roberto Calandra
Date:2020-05-07 17:52:23

In this paper we introduce plan2vec, an unsupervised representation learning approach that is inspired by reinforcement learning. Plan2vec constructs a weighted graph on an image dataset using near-neighbor distances, and then extrapolates this local metric to a global embedding by distilling the path integral over planned paths. When applied to control, plan2vec offers a compute- and sample-efficient way to learn goal-conditioned value estimates that are accurate over long horizons. We demonstrate the effectiveness of plan2vec on one simulated and two challenging real-world image datasets. Experimental results show that plan2vec successfully amortizes the planning cost, enabling reactive planning that is linear in memory and computation complexity rather than exhaustive over the entire state space.
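
The graph step described above can be sketched as follows: connect each sample to its nearest neighbors, then use shortest-path length as a global distance. This is only an illustration under simplifying assumptions (low-dimensional points with Euclidean distance instead of images, Dijkstra as the planner); the learned embedding that plan2vec distills from these path lengths is not shown, and `knn_graph` and `shortest_path_length` are hypothetical names.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric weighted graph linking each point to its k nearest neighbors."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i].append((j, d))
            graph[j].append((i, d))  # keep the graph undirected
    return graph


def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm; returns math.inf if target is unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf
```

Running the planner for every query is what the abstract calls exhaustive; amortizing it into an embedding removes that per-query search.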

A Survey of Algorithms for Black-Box Safety Validation of Cyber-Physical Systems

Authors:Anthony Corso, Robert J. Moss, Mark Koren, Ritchie Lee, Mykel J. Kochenderfer
Date:2020-05-06 17:31:51

Autonomous cyber-physical systems (CPS) can improve safety and efficiency for safety-critical applications, but require rigorous testing before deployment. The complexity of these systems often precludes the use of formal verification and real-world testing can be too dangerous during development. Therefore, simulation-based techniques have been developed that treat the system under test as a black box operating in a simulated environment. Safety validation tasks include finding disturbances in the environment that cause the system to fail (falsification), finding the most-likely failure, and estimating the probability that the system fails. Motivated by the prevalence of safety-critical artificial intelligence, this work provides a survey of state-of-the-art safety validation techniques for CPS with a focus on applied algorithms and their modifications for the safety validation problem. We present and discuss algorithms in the domains of optimization, path planning, reinforcement learning, and importance sampling. Problem decomposition techniques are presented to help scale algorithms to large state spaces, which are common for CPS. A brief overview of safety-critical applications is given, including autonomous vehicles and aircraft collision avoidance systems. Finally, we present a survey of existing academic and commercially available safety validation tools.

Generalized Planning With Deep Reinforcement Learning

Authors:Or Rivlin, Tamir Hazan, Erez Karpas
Date:2020-05-05 16:06:57

A hallmark of intelligence is the ability to deduce general principles from examples, which are correct beyond the range of those observed. Generalized Planning deals with finding such principles for a class of planning problems, so that principles discovered using small instances of a domain can be used to solve much larger instances of the same domain. In this work we study the use of Deep Reinforcement Learning and Graph Neural Networks to learn such generalized policies and demonstrate that they can generalize to instances that are orders of magnitude larger than those they were trained on.

Distributed Adaptive Reinforcement Learning: A Method for Optimal Routing

Authors:Salar Rahili, Benjamin Riviere, Soon-Jo Chung
Date:2020-05-05 07:28:46

In this paper, a learning-based optimal transportation algorithm for autonomous taxis and ridesharing vehicles is presented. The goal is to design a mechanism to solve the routing problem for multiple autonomous vehicles and multiple customers in order to maximize the transportation company's profit. As a result, each vehicle selects the customer whose request maximizes the company's profit in the long run. To solve this problem, the system is modeled as a Markov Decision Process (MDP) using past customer data. By solving the defined MDP, a centralized high-level planning recommendation is obtained, and this offline solution is used as an initial value for the real-time learning. Then, a distributed SARSA reinforcement learning algorithm is proposed to capture the model errors and the environment changes, such as variations in customer distributions in each area, traffic, and fares, thereby providing optimal routing policies in real-time. Vehicles, or agents, use only their local information and interaction, such as current passenger requests and estimates of neighbors' tasks and their optimal actions, to obtain the optimal policies in a distributed fashion. An optimal adaptive rate is introduced to make the distributed SARSA algorithm capable of adapting to changes in the environment and tracking the time-varying optimal policies. Furthermore, a game-theory-based task assignment algorithm is proposed, where each agent uses the optimal policies and their values from distributed SARSA to select its customer from the set of locally available requests in a distributed manner. Finally, customer data provided by the city of Chicago is used to validate the proposed algorithms.
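
For readers unfamiliar with SARSA, the following is a minimal sketch of the standard single-agent tabular update. It uses a fixed learning rate `alpha`, not the adaptive rate or the distributed scheme the abstract describes; the function name and the dictionary-based Q-table are assumptions made for illustration.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One on-policy SARSA step on a Q-table keyed by (state, action).

    The target uses the action actually taken in next_state, which is
    what distinguishes SARSA from Q-learning.
    """
    td_target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Usage: initialise Q from an offline solution (here simply zeros).
q_table = defaultdict(float)
sarsa_update(q_table, "zone_a", "pickup", 1.0, "zone_b", "wait", alpha=0.5)
```

In the setting of the abstract, the initial Q-values would come from the offline MDP solution rather than zeros.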

MARS: Malleable Actor-Critic Reinforcement Learning Scheduler

Authors:Betis Baheri, Jacob Tronge, Bo Fang, Ang Li, Vipin Chaudhary, Qiang Guan
Date:2020-05-04 15:51:41

In this paper, we introduce MARS, a new scheduling system for HPC-cloud infrastructures based on a cost-aware, flexible reinforcement learning approach, which serves as an intermediate layer for next-generation HPC-cloud resource managers. MARS ensembles the pre-trained models from heuristic workloads and decides on the most cost-effective strategy for optimization. A whole workflow application is split into several optimizable dependent sub-tasks; then, based on the pre-defined resource management plan, a reward is generated after executing a scheduled task. Lastly, MARS updates the Deep Neural Network (DNN) model based on the reward. MARS is designed to optimize the existing models through reinforcement mechanisms. MARS adapts to the dynamics of workflow applications and selects the most cost-effective scheduling solution among pre-built scheduling strategies (backfilling, SJF, etc.) and a self-learning deep neural network model at run-time. We evaluate MARS with different real-world workflow traces. MARS can achieve 5%-60% increased performance compared to the state-of-the-art approaches.

Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

Authors:Ruosong Wang, Simon S. Du, Lin F. Yang, Sham M. Kakade
Date:2020-05-01 17:56:38

Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is $\varepsilon$ near to that of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.

Plan-Space State Embeddings for Improved Reinforcement Learning

Authors:Max Pflueger, Gaurav S. Sukhatme
Date:2020-04-30 03:38:14

Robot control problems are often structured with a policy function that maps state values into control values, but in many dynamic problems the observed state can have a difficult to characterize relationship with useful policy actions. In this paper we present a new method for learning state embeddings from plans or other forms of demonstrations such that the embedding space has a specified geometric relationship with the demonstrations. We present a novel variational framework for learning these embeddings that attempts to optimize trajectory linearity in the learned embedding space. We show how these embedding spaces can then be used as an augmentation to the robot state in reinforcement learning problems. We use kinodynamic planning to generate training trajectories for some example environments, and then train embedding spaces for these environments. We show empirically that observing a system in the learned embedding space improves the performance of policy gradient reinforcement learning algorithms, particularly by reducing the variance between training runs. Our technique is limited to environments where demonstration data is available, but places no limits on how that data is collected. Our embedding technique provides a way to transfer domain knowledge from existing technologies such as planning and control algorithms, into more flexible policy learning algorithms, by creating an abstract representation of the robot state with meaningful geometry.

Actor-Critic Reinforcement Learning for Control with Stability Guarantee

Authors:Minghao Han, Lixian Zhang, Jun Wang, Wei Pan
Date:2020-04-29 16:14:30

Reinforcement Learning (RL) and its integration with deep learning have achieved impressive performance in various robotic control tasks, ranging from motion planning and navigation to end-to-end visual manipulation. However, stability is not guaranteed in model-free RL by solely using data. From a control-theoretic perspective, stability is the most important property for any control system, since it is closely related to safety, robustness, and reliability of robotic systems. In this paper, we propose an actor-critic RL framework for control which can guarantee closed-loop stability by employing the classic Lyapunov's method in control theory. First of all, a data-based stability theorem is proposed for stochastic nonlinear systems modeled by a Markov decision process. Then we show that the stability condition could be exploited as the critic in the actor-critic RL to learn a controller/policy. Finally, the effectiveness of our approach is evaluated on several well-known 3-dimensional robot control tasks and a synthetic biology gene network tracking task in three different popular physics simulation platforms. As an empirical evaluation of the advantage of stability, we show that the learned policies can enable the systems to recover to the equilibrium or way-points when interfered by uncertainties such as system parametric variations and external disturbances to a certain extent.

Learning Neural-Symbolic Descriptive Planning Models via Cube-Space Priors: The Voyage Home (to STRIPS)

Authors:Masataro Asai, Christian Muise
Date:2020-04-27 15:01:54

We achieved a new milestone in the difficult task of enabling agents to learn about their environment autonomously. Our neuro-symbolic architecture is trained end-to-end to produce a succinct and effective discrete state transition model from images alone. Our target representation (the Planning Domain Definition Language) is already in a form that off-the-shelf solvers can consume, and opens the door to the rich array of modern heuristic search capabilities. We demonstrate how the sophisticated innate prior we place on the learning process significantly reduces the complexity of the learned representation, and reveals a connection to the graph-theoretic notion of "cube-like graphs", thus opening the door to a deeper understanding of the ideal properties for learned symbolic representations. We show that the powerful domain-independent heuristics allow our system to solve visual 15-Puzzle instances which are beyond the reach of blind search, without resorting to the Reinforcement Learning approach that requires a huge amount of training on the domain-dependent reward information.

The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget

Authors:Anirudh Goyal, Yoshua Bengio, Matthew Botvinick, Sergey Levine
Date:2020-04-24 18:29:31

In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information), and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a "privileged" input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, this compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information.
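
For reference, the compression/prediction tradeoff of the standard information bottleneck is commonly written as the Lagrangian below. The symbols are not taken from the abstract and are assumptions for this note: $X$ is the input, $Y$ the target, $Z$ the bottleneck variable, $I(\cdot;\cdot)$ mutual information, and $\beta > 0$ the tradeoff weight.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term penalizes information retained about the input (compression) and the second rewards information about the target (prediction). The variational bandwidth bottleneck described above applies this kind of tradeoff to the privileged input only.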

PBCS : Efficient Exploration and Exploitation Using a Synergy between Reinforcement Learning and Motion Planning

Authors:Guillaume Matheron, Nicolas Perrin, Olivier Sigaud
Date:2020-04-24 11:37:09

The exploration-exploitation trade-off is at the heart of reinforcement learning (RL). However, most continuous control benchmarks used in recent RL research only require local exploration. This led to the development of algorithms that have basic exploration capabilities, and behave poorly in benchmarks that require more versatile exploration. For instance, as demonstrated in our empirical study, state-of-the-art RL algorithms such as DDPG and TD3 are unable to steer a point mass in even small 2D mazes. In this paper, we propose a new algorithm called "Plan, Backplay, Chain Skills" (PBCS) that combines motion planning and reinforcement learning to solve hard exploration environments. In a first phase, a motion planning algorithm is used to find a single good trajectory, then an RL algorithm is trained using a curriculum derived from the trajectory, by combining a variant of the Backplay algorithm and skill chaining. We show that this method outperforms state-of-the-art RL algorithms in 2D maze environments of various sizes, and is able to improve on the trajectory obtained by the motion planning phase.

Guiding Robot Exploration in Reinforcement Learning via Automated Planning

Authors:Yohei Hayamizu, Saeid Amiri, Kishan Chandan, Keiki Takadama, Shiqi Zhang
Date:2020-04-23 21:03:30

Reinforcement learning (RL) enables an agent to learn from trial-and-error experiences toward achieving long-term goals; automated planning aims to compute plans for accomplishing tasks using action knowledge. Despite their shared goal of completing complex tasks, the development of RL and automated planning has been largely isolated due to their different computational modalities. Focusing on improving RL agents' learning efficiency, we develop Guided Dyna-Q (GDQ) to enable RL agents to reason with action knowledge to avoid exploring less-relevant states. The action knowledge is used for generating artificial experiences from an optimistic simulation. GDQ has been evaluated in simulation and using a mobile robot conducting navigation tasks in a multi-room office environment. Compared with competitive baselines, GDQ significantly reduces the effort in exploration while improving the quality of learned policies.
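
As background for the Dyna-style idea of learning from artificial experiences, here is a minimal sketch of plain tabular Dyna-Q. It is not GDQ: the artificial experiences are replayed from a learned deterministic model, whereas the abstract generates them from action knowledge. All names (`q_update`, `dyna_q_step`, `n_planning`) are illustrative assumptions.

```python
import random
from collections import defaultdict


def q_update(q, s, a, r, s2, actions, alpha, gamma):
    """One Q-learning backup on a table keyed by (state, action)."""
    best_next = max(q[(s2, b)] for b in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


def dyna_q_step(q, model, s, a, r, s2, actions,
                n_planning=5, alpha=0.1, gamma=0.95, rng=random):
    """Learn from one real transition, then from n_planning simulated ones."""
    q_update(q, s, a, r, s2, actions, alpha, gamma)  # real experience
    model[(s, a)] = (r, s2)                          # deterministic model (assumption)
    for _ in range(n_planning):
        # Replay a previously seen state-action pair through the model.
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        q_update(q, ps, pa, pr, ps2, actions, alpha, gamma)
```

Each real step is thus amplified by several simulated backups, which is where the reduction in exploration effort comes from in Dyna-style methods.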

Divide-and-Conquer Monte Carlo Tree Search For Goal-Directed Planning

Authors:Giambattista Parascandolo, Lars Buesing, Josh Merel, Leonard Hasenclever, John Aslanides, Jessica B. Hamrick, Nicolas Heess, Alexander Neitz, Theophane Weber
Date:2020-04-23 18:08:58

Standard planners for sequential decision making (including Monte Carlo planning, tree search, dynamic programming, etc.) are constrained by an implicit sequential planning assumption: the order in which a plan is constructed is the same as the order in which it is executed. We consider alternatives to this assumption for the class of goal-directed Reinforcement Learning (RL) problems. Instead of an environment transition model, we assume an imperfect, goal-directed policy. This low-level policy can be improved by a plan, consisting of an appropriate sequence of sub-goals that guide it from the start to the goal state. We propose a planning algorithm, Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS), for approximating the optimal plan by proposing intermediate sub-goals which hierarchically partition the initial tasks into simpler ones that are then solved independently and recursively. The algorithm critically makes use of a learned sub-goal proposal for finding appropriate partition trees of new tasks based on prior experience. Different strategies for learning sub-goal proposals give rise to different planning strategies that strictly generalize sequential planning. We show that this algorithmic flexibility over planning order leads to improved results in navigation tasks in grid-worlds as well as in challenging continuous control environments.
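
The divide-and-conquer partitioning described above can be sketched as a plain recursion. This is only the recursive sub-goal split, with no tree search and no learned proposal: `can_reach` and `propose_subgoal` are hypothetical stand-ins for the low-level policy's competence and the sub-goal proposal, and the one-dimensional task is invented for the example.

```python
def plan_subgoals(start, goal, can_reach, propose_subgoal, max_depth=8):
    """Recursively split (start, goal) until each segment is directly reachable.

    Returns the sub-goals in execution order, ending with `goal`. Note that
    the middle sub-goal is chosen first, so the plan is not constructed in
    the order in which it is executed.
    """
    if max_depth == 0 or can_reach(start, goal):
        return [goal]
    middle = propose_subgoal(start, goal)
    left = plan_subgoals(start, middle, can_reach, propose_subgoal, max_depth - 1)
    right = plan_subgoals(middle, goal, can_reach, propose_subgoal, max_depth - 1)
    return left + right


# Usage on a toy 1-D task: the low-level policy can only cover 2 cells at a time.
subgoals = plan_subgoals(
    start=0,
    goal=8,
    can_reach=lambda a, b: abs(a - b) <= 2,
    propose_subgoal=lambda a, b: (a + b) // 2,
)
```

The `max_depth` guard is an added safety measure for the sketch so that a poor proposal cannot cause unbounded recursion.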

Flexible and Efficient Long-Range Planning Through Curious Exploration

Authors:Aidan Curtis, Minjian Xin, Dilip Arumugam, Kevin Feigelis, Daniel Yamins
Date:2020-04-22 21:47:29

Identifying algorithms that flexibly and efficiently discover temporally-extended multi-phase plans is an essential step for the advancement of robotics and model-based reinforcement learning. The core problem of long-range planning is finding an efficient way to search through the tree of possible action sequences. Existing non-learned planning solutions from the Task and Motion Planning (TAMP) literature rely on the existence of logical descriptions for the effects and preconditions for actions. This constraint allows TAMP methods to efficiently reduce the tree search problem but limits their ability to generalize to unseen and complex physical environments. In contrast, deep reinforcement learning (DRL) methods use flexible neural-network-based function approximators to discover policies that generalize naturally to unseen circumstances. However, DRL methods struggle to handle the very sparse reward landscapes inherent to long-range multi-step planning situations. Here, we propose the Curious Sample Planner (CSP), which fuses elements of TAMP and DRL by combining a curiosity-guided sampling strategy with imitation learning to accelerate planning. We show that CSP can efficiently discover interesting and complex temporally-extended plans for solving a wide range of physically realistic 3D tasks. In contrast, standard planning and learning methods often fail to solve these tasks at all or do so only with a huge and highly variable number of training samples. We explore the use of a variety of curiosity metrics with CSP and analyze the types of solutions that CSP discovers. Finally, we show that CSP supports task transfer so that the exploration policies learned during experience with one task can help improve efficiency on related tasks.

Reinforcement Learning to Optimize the Logistics Distribution Routes of Unmanned Aerial Vehicle

Authors:Linfei Feng
Date:2020-04-21 09:42:03

Path planning methods for unmanned aerial vehicles (UAVs) in goods delivery have drawn great attention from industry and academia because of their flexibility, which suits many situations in the "Last Kilometer" between customer and delivery nodes. However, complicated situations remain a problem for traditional combinatorial optimization methods. Based on state-of-the-art Reinforcement Learning (RL), this paper proposes an improved method for UAV path planning in complex surroundings with multiple no-fly zones. The improved approach leverages the attention mechanism and includes the embedding mechanism as the encoder and three different widths of beam search (i.e., 1, 5, and 10) as the decoders. Policy gradients are used to train the RL model to obtain the optimal strategies during inference. The results show the feasibility and efficiency of applying the model in this kind of complicated situation. Compared with the results obtained by the optimization solver OR-tools, the model improves the reliability of the distribution system and has guiding significance for the broad application of UAVs.

Model-Predictive Control via Cross-Entropy and Gradient-Based Optimization

Authors:Homanga Bharadhwaj, Kevin Xie, Florian Shkurti
Date:2020-04-19 03:54:50

Recent works in high-dimensional model-predictive control and model-based reinforcement learning with learned dynamics and reward models have resorted to population-based optimization methods, such as the Cross-Entropy Method (CEM), for planning a sequence of actions. To decide on an action to take, CEM conducts a search for the action sequence with the highest return according to the dynamics model and reward. Action sequences are typically randomly sampled from an unconditional Gaussian distribution and evaluated on the environment. This distribution is iteratively updated towards action sequences with higher returns. However, this planning method can be very inefficient, especially for high-dimensional action spaces. An alternative line of approaches optimizes action sequences directly via gradient descent, but is prone to local optima. We propose a method to solve this planning problem by interleaving CEM and gradient descent steps in optimizing the action sequence. Our experiments show faster convergence of the proposed hybrid approach, even for high-dimensional action spaces, avoidance of local minima, and better or equal performance to CEM. Code accompanying the paper is available at https://github.com/homangab/gradcem.
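
To make the interleaving concrete, here is a minimal sketch on a toy problem. It is not the code from the repository linked above: the quadratic return, its analytic gradient, and all hyperparameters are assumptions chosen so the example runs with the standard library alone. Each CEM iteration samples action sequences from a Gaussian, refines every sample with a few gradient-ascent steps on the return, then refits the Gaussian to the best refined samples.

```python
import random
import statistics


def toy_return(actions, target):
    """Hypothetical differentiable return: negative squared distance to a target sequence."""
    return -sum((a - t) ** 2 for a, t in zip(actions, target))


def toy_return_grad(actions, target):
    """Analytic gradient of toy_return with respect to each action."""
    return [-2.0 * (a - t) for a, t in zip(actions, target)]


def cem_with_gradient_steps(target, iterations=20, population=32, n_elite=8,
                            grad_steps=2, step_size=0.1, seed=0):
    rng = random.Random(seed)
    horizon = len(target)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        candidates = []
        for _ in range(population):
            # CEM step: sample an action sequence from the current Gaussian.
            seq = [rng.gauss(m, s) for m, s in zip(mean, std)]
            # Gradient step(s): refine the sample by ascending the return.
            for _ in range(grad_steps):
                grad = toy_return_grad(seq, target)
                seq = [a + step_size * g for a, g in zip(seq, grad)]
            candidates.append(seq)
        # Refit the sampling distribution to the highest-return refined samples.
        candidates.sort(key=lambda s: toy_return(s, target), reverse=True)
        elites = candidates[:n_elite]
        mean = [statistics.fmean([e[i] for e in elites]) for i in range(horizon)]
        std = [statistics.pstdev([e[i] for e in elites]) + 1e-6 for i in range(horizon)]
    return mean


target_sequence = [1.0, -0.5, 0.25]
planned = cem_with_gradient_steps(target_sequence)
```

In a model-based setting the toy return would be replaced by a rollout through learned dynamics and reward models, with gradients obtained by automatic differentiation.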

Deep Reinforcement Learning for Adaptive Learning Systems

Authors:Xiao Li, Hanchen Xu, Jinming Zhang, Hua-hua Chang
Date:2020-04-17 18:04:03

In this paper, we formulate the adaptive learning problem faced in adaptive learning systems, that is, the problem of how to find an individualized learning plan (called a policy) that chooses the most appropriate learning materials based on a learner's latent traits, as a Markov decision process (MDP). We assume latent traits to be continuous with an unknown transition model. We apply a model-free deep reinforcement learning algorithm, the deep Q-learning algorithm, that can effectively find the optimal learning policy from data on learners' learning process without knowing the actual transition model of the learners' continuous latent traits. To efficiently utilize available data, we also develop a transition model estimator that emulates the learner's learning process using neural networks. The transition model estimator can be used in the deep Q-learning algorithm so that it can more efficiently discover the optimal learning policy for a learner. Numerical simulation studies verify that the proposed algorithm is very efficient in finding a good learning policy; in particular, with the aid of a transition model estimator, it can find the optimal learning policy after training on a small number of learners.

Order Matters: Generating Progressive Explanations for Planning Tasks in Human-Robot Teaming

Authors:Mehrdad Zakershahrak, Shashank Rao Marpally, Akshay Sharma, Ze Gong, Yu Zhang
Date:2020-04-16 00:17:02

Prior work on generating explanations in a planning and decision-making context has focused on providing the rationale behind an AI agent's decision making. While these methods provide the right explanations from the explainer's perspective, they fail to heed the cognitive requirement of understanding an explanation from the explainee's (the human's) perspective. In this work, we set out to address this issue by first considering the influence of information order in an explanation, or the progressiveness of explanations. Intuitively, progression builds later concepts on previous ones and is known to contribute to better learning. In this work, we aim to investigate similar effects during explanation generation when an explanation is broken into multiple parts that are communicated sequentially. The challenge here lies in modeling the humans' preferences for information order in receiving such explanations to assist understanding. Given this sequential process, a formulation based on goal-based MDP for generating progressive explanations is presented. The reward function of this MDP is learned via inverse reinforcement learning based on explanations that are retrieved via human subject studies. We first evaluated our approach on a scavenger-hunt domain to demonstrate its effectiveness in capturing the humans' preferences. Analyzing the results revealed something more fundamental: the preferences arise strongly from both domain-dependent and domain-independent features. The correlation with domain-independent features pushed us to verify this result further in an escape room domain. Results confirmed our hypothesis that the process of understanding an explanation was a dynamic process. The human preference that reflected this aspect corresponded exactly to the progression for knowledge assimilation hidden deeper in our cognitive process.

Reinforcement Learning in a Physics-Inspired Semi-Markov Environment

Authors:Colin Bellinger, Rory Coles, Mark Crowley, Isaac Tamblyn
Date:2020-04-15 20:43:29

Reinforcement learning (RL) has been demonstrated to have great potential in many applications of scientific discovery and design. Recent work includes, for example, the design of new structures and compositions of molecules for therapeutic drugs. Much of the existing work related to the application of RL to scientific domains, however, assumes that the available state representation obeys the Markov property. For reasons associated with time, cost, sensor accuracy, and gaps in scientific knowledge, many scientific design and discovery problems do not satisfy the Markov property. Thus, something other than a Markov decision process (MDP) should be used to plan / find the optimal policy. In this paper, we present a physics-inspired semi-Markov RL environment, namely the phase change environment. In addition, we evaluate the performance of value-based RL algorithms for both MDPs and partially observable MDPs (POMDPs) on the proposed environment. Our results demonstrate deep recurrent Q-networks (DRQN) significantly outperform deep Q-networks (DQN), and that DRQNs benefit from training with hindsight experience replay. Implications for the use of semi-Markovian RL and POMDPs for scientific laboratories are also discussed.

Bootstrapped model learning and error correction for planning with uncertainty in model-based RL

Authors:Alvaro Ovalle, Simon M. Lucas
Date:2020-04-15 15:41:21

Having access to a forward model enables the use of planning algorithms such as Monte Carlo Tree Search and Rolling Horizon Evolution. Where a model is unavailable, a natural aim is to learn a model that reflects accurately the dynamics of the environment. In many situations it might not be possible and minimal glitches in the model may lead to poor performance and failure. This paper explores the problem of model misspecification through uncertainty-aware reinforcement learning agents. We propose a bootstrapped multi-headed neural network that learns the distribution of future states and rewards. We experiment with a number of schemes to extract the most likely predictions. Moreover, we also introduce a global error correction filter that applies high-level constraints guided by the context provided through the predictive distribution. We illustrate our approach on Minipacman. The evaluation demonstrates that when dealing with imperfect models, our methods exhibit increased performance and stability, both in terms of model accuracy and in its use within a planning algorithm.

A Text-based Deep Reinforcement Learning Framework for Interactive Recommendation

Authors:Chaoyang Wang, Zhiqiang Guo, Jianjun Li, Peng Pan, Guohui Li
Date:2020-04-14 16:46:01

Due to its nature of learning from dynamic interactions and planning for long-run performance, reinforcement learning (RL) recently has received much attention in interactive recommender systems (IRSs). IRSs usually face the large discrete action space problem, which makes most of the existing RL-based recommendation methods inefficient. Moreover, data sparsity is another challenging problem that most IRSs are confronted with. While the textual information like reviews and descriptions is less sensitive to sparsity, existing RL-based recommendation methods either neglect or are not suitable for incorporating textual information. To address these two problems, in this paper, we propose a Text-based Deep Deterministic Policy Gradient framework (TDDPG-Rec) for IRSs. Specifically, we leverage textual information to map items and users into a feature space, which greatly alleviates the sparsity problem. Moreover, we design an effective method to construct an action candidate set. By the policy vector dynamically learned from TDDPG-Rec that expresses the user's preference, we can select actions from the candidate set effectively. Through experiments on three public datasets, we demonstrate that TDDPG-Rec achieves state-of-the-art performance over several baselines in a time-efficient manner.

A reinforcement learning application of guided Monte Carlo Tree Search algorithm for beam orientation selection in radiation therapy

Authors:Azar Sadeghnejad-Barkousaraie, Gyanendra Bohara, Steve Jiang, Dan Nguyen
Date:2020-04-14 00:28:15

Due to the large combinatorial problem, current beam orientation optimization algorithms for radiotherapy, such as column generation (CG), are typically heuristic or greedy in nature, leading to suboptimal solutions. We propose a reinforcement learning strategy using Monte Carlo Tree Search capable of finding a superior beam orientation set in less time than CG. We utilized a reinforcement learning structure involving a supervised learning network to guide Monte Carlo tree search (GTS) to explore the decision space of the beam orientation selection problem. We have previously trained a deep neural network (DNN) that takes in the patient anatomy, organ weights, and current beams, and then approximates beam fitness values, indicating the next best beam to add. This DNN is used to probabilistically guide the traversal of the branches of the Monte Carlo decision tree to add a new beam to the plan. To test the feasibility of the algorithm, we solved for 5-beam plans, using 13 test prostate cancer patients, different from the 57 training and validation patients originally used to train the DNN. To show the strength of GTS relative to other search methods, the performances of three other search methods, namely guided search, uniform tree search, and random search, are also provided. On average, GTS outperforms all other methods: it finds a solution better than CG in 237 seconds on average, compared to CG, which takes 360 seconds, and outperforms all other methods in finding a solution with a lower objective function value in less than 1000 seconds. Using our guided tree search (GTS) method, we were able to maintain a similar planning target volume (PTV) coverage within 1% error and reduce the organ at risk (OAR) mean dose for body, rectum, and left and right femoral heads, with a slight increase of 1% in bladder mean dose.

Learning to Drive Off Road on Smooth Terrain in Unstructured Environments Using an On-Board Camera and Sparse Aerial Images

Authors:Travis Manderson, Stefan Wapnick, David Meger, Gregory Dudek
Date:2020-04-09 17:27:09

We present a method for learning to drive on smooth terrain while simultaneously avoiding collisions in challenging off-road and unstructured outdoor environments using only visual inputs. Our approach applies a hybrid model-based and model-free reinforcement learning method that is entirely self-supervised in labeling terrain roughness and collisions using on-board sensors. Notably, we provide both first-person and overhead aerial image inputs to our model. We find that the fusion of these complementary inputs improves planning foresight and makes the model robust to visual obstructions. Our results show the ability to generalize to environments with plentiful vegetation, various types of rock, and sandy trails. During evaluation, our policy attained 90% smooth terrain traversal and reduced the proportion of rough terrain driven over by 6.1 times compared to a model using only first-person imagery.

Online Constrained Model-based Reinforcement Learning

Authors:Benjamin van Niekerk, Andreas Damianou, Benjamin Rosman
Date:2020-04-07 15:51:34

Applying reinforcement learning to robotic systems poses a number of challenging problems. A key requirement is the ability to handle continuous state and action spaces while remaining within a limited time and resource budget. Additionally, for safe operation, the system must make robust decisions under hard constraints. To address these challenges, we propose a model based approach that combines Gaussian Process regression and Receding Horizon Control. Using sparse spectrum Gaussian Processes, we extend previous work by updating the dynamics model incrementally from a stream of sensory data. This results in an agent that can learn and plan in real-time under non-linear constraints. We test our approach on a cart pole swing-up environment and demonstrate the benefits of online learning on an autonomous racing task. The environment's dynamics are learned from limited training data and can be reused in new task instances without retraining.

Continuous Motion Planning with Temporal Logic Specifications using Deep Neural Networks

Authors:Chuanzheng Wang, Yinan Li, Stephen L. Smith, Jun Liu
Date:2020-04-02 17:58:03

In this paper, we propose a model-free reinforcement learning method to synthesize control policies for motion planning problems with continuous states and actions. The robot is modelled as a labeled discrete-time Markov decision process (MDP) with continuous state and action spaces. Linear temporal logics (LTL) are used to specify high-level tasks. We then train deep neural networks to approximate the value function and policy using an actor-critic reinforcement learning method. The LTL specification is converted into an annotated limit-deterministic Büchi automaton (LDBA) for continuously shaping the reward so that dense rewards are available during training. A naïve way of solving a motion planning problem with LTL specifications using reinforcement learning is to sample a trajectory and then assign a high reward for training if the trajectory satisfies the entire LTL formula. However, the sampling complexity needed to find such a trajectory is too high when we have a complex LTL formula for continuous state and action spaces. As a result, it is very unlikely that we get enough reward for training if all sample trajectories start from the initial state in the automaton. In this paper, we propose a method that samples not only an initial state from the state space, but also an arbitrary state in the automaton at the beginning of each training episode. We test our algorithm in simulation using a car-like robot and find that our method can learn policies for different working configurations and LTL specifications successfully.
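
The episode-reset idea at the end of the abstract above can be sketched in a few lines. The uniform sampling distributions, the box-shaped state space, and all names below are assumptions made for illustration; the paper's own sampling scheme may differ.

```python
import random


def sample_episode_start(state_low, state_high, automaton_states, rng):
    """Sample a start configuration for one training episode.

    Both the continuous environment state and the automaton state are
    randomized, so later stages of the LTL task are visited during
    training even before earlier stages have been learned.
    """
    env_state = [rng.uniform(lo, hi) for lo, hi in zip(state_low, state_high)]
    automaton_state = rng.choice(automaton_states)
    return env_state, automaton_state


# Usage: a 2-D workspace and a hypothetical three-state automaton.
rng = random.Random(0)
starts = [
    sample_episode_start([0.0, 0.0], [5.0, 5.0], ["q0", "q1", "q2"], rng)
    for _ in range(100)
]
```

Starting only from the automaton's initial state corresponds to always returning `automaton_states[0]`, which is the sparse-reward baseline the abstract argues against.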

A New Challenge: Approaching Tetris Link with AI

Authors:Matthias Muller-Brockhausen, Mike Preuss, Aske Plaat
Date:2020-04-01 12:25:36

Decades of research have been invested in making computer programs for playing games such as Chess and Go. This paper focuses on a new game, Tetris Link, a board game that is still lacking any scientific analysis. Tetris Link has a large branching factor, hampering a traditional heuristic planning approach. We explore heuristic planning and two other approaches: Reinforcement Learning and Monte Carlo tree search. We document our approaches and report on their relative performance in a tournament. Curiously, the heuristic approach is stronger than the planning/learning approaches. However, experienced human players easily win the majority of the matches against the heuristic planning AIs. We, therefore, surmise that Tetris Link is more difficult than expected. We offer our findings to the community as a challenge to improve upon.

Enhanced Rolling Horizon Evolution Algorithm with Opponent Model Learning: Results for the Fighting Game AI Competition

Authors:Zhentao Tang, Yuanheng Zhu, Dongbin Zhao, Simon M. Lucas
Date:2020-03-31 04:44:33

The Fighting Game AI Competition (FTGAIC) provides a challenging benchmark for 2-player video game AI. The challenge arises from the large action space, diverse styles of characters and abilities, and the real-time nature of the game. In this paper, we propose a novel algorithm that combines Rolling Horizon Evolution Algorithm (RHEA) with opponent model learning. The approach is readily applicable to any 2-player video game. In contrast to conventional RHEA, an opponent model is proposed and is optimized by supervised learning with cross-entropy and reinforcement learning with policy gradient and Q-learning respectively, based on historical observations of the opponent. The model is learned during the live gameplay. With the learned opponent model, the extended RHEA is able to make more realistic plans based on what the opponent is likely to do. This tends to lead to better results. We compared our approach directly with the bots from the FTGAIC 2018 competition, and found our method to significantly outperform all of them, for all three characters. Furthermore, our proposed bot with the policy-gradient-based opponent model is the only one not using Monte-Carlo Tree Search (MCTS) among the top five bots in the 2019 competition, in which it achieved second place while using much less domain knowledge than the winner.

Policy Teaching via Environment Poisoning: Training-time Adversarial Attacks against Reinforcement Learning

Authors:Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, Adish Singla
Date:2020-03-28 23:22:28

We study a security threat to reinforcement learning where an attacker poisons the learning environment to force the agent into executing a target policy chosen by the attacker. As a victim, we consider RL agents whose objective is to find a policy that maximizes average reward in undiscounted infinite-horizon problem settings. The attacker can manipulate the rewards or the transition dynamics in the learning environment at training-time and is interested in doing so in a stealthy manner. We propose an optimization framework for finding an optimal stealthy attack for different measures of attack cost. We provide sufficient technical conditions under which the attack is feasible and provide lower/upper bounds on the attack cost. We instantiate our attacks in two settings: (i) an offline setting where the agent is doing planning in the poisoned environment, and (ii) an online setting where the agent is learning a policy using a regret-minimization framework with poisoned feedback. Our results show that the attacker can easily succeed in teaching any target policy to the victim under mild conditions and highlight a significant security threat to reinforcement learning agents in practice.

Using Deep Reinforcement Learning Methods for Autonomous Vessels in 2D Environments

Authors:Mohammad Etemad, Nader Zare, Mahtab Sarvmaili, Amilcar Soares, Bruno Brandoli Machado, Stan Matwin
Date:2020-03-23 12:58:58

Unmanned Surface Vehicle (USV) technology is an exciting topic that essentially deploys an algorithm to safely and efficiently perform a mission. Although reinforcement learning is a well-known approach to modeling such a task, instability and divergence may occur when combining off-policy learning and function approximation. In this work, we used deep reinforcement learning combining Q-learning with a neural representation to avoid instability. Our methodology uses deep Q-learning and combines it with a rolling wave planning approach from agile methodology. Our method contains two critical parts in order to perform missions in an unknown environment. The first is a path planner that is responsible for generating a potentially effective path to a destination without considering the details of the route. The second is a decision-making module that is responsible for short-term decisions on avoiding obstacles during the near future steps of USV exploitation within the context of the value function. Simulations were performed using two algorithms: a basic vanilla vessel navigator (VVN) as a baseline and an improved one for the vessel navigator with a planner and local view (VNPLV). Experimental results show that the proposed method enhanced the performance of VVN by 55.31 on average for long-distance missions. Our model successfully demonstrated obstacle avoidance by means of deep reinforcement learning using planning adaptive paths in unknown environments.

Autonomous UAV Navigation: A DDPG-based Deep Reinforcement Learning Approach

Authors:Omar Bouhamed, Hakim Ghazzai, Hichem Besbes, Yehia Massoud
Date:2020-03-21 19:33:00

In this paper, we propose an autonomous UAV path planning framework using a deep reinforcement learning approach. The objective is to employ a self-trained UAV as a flying mobile unit to reach spatially distributed moving or static targets in a given three-dimensional urban area. In this approach, a Deep Deterministic Policy Gradient (DDPG) with continuous action space is designed to train the UAV to navigate through or over the obstacles to reach its assigned target. A customized reward function is developed to minimize the distance separating the UAV and its destination while penalizing collisions. Numerical simulations investigate the behavior of the UAV in learning the environment and autonomously determining trajectories for different selected scenarios.
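
A reward of the kind described above can be sketched as follows. The exact functional form and the penalty weight used in the paper are not given in the abstract, so the negative Euclidean distance and the `collision_penalty` value here are assumptions.

```python
import math


def navigation_reward(uav_pos, target_pos, collided, collision_penalty=10.0):
    """Reward that grows as the UAV nears its target and drops on collision.

    `collision_penalty` is an illustrative constant, not a value from the paper.
    """
    distance = math.dist(uav_pos, target_pos)  # Euclidean distance in 3-D
    reward = -distance
    if collided:
        reward -= collision_penalty
    return reward


# Usage: the UAV is 5 units from the target, with and without a collision.
free_flight = navigation_reward((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), collided=False)
after_crash = navigation_reward((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), collided=True)
```

A dense distance term like this gives the DDPG critic a learning signal at every step, while the penalty discourages shortcuts through obstacles.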

Adjust Planning Strategies to Accommodate Reinforcement Learning Agents

Authors:Xuerun Chen
Date:2020-03-19 03:35:10

In agent control problems, the idea of combining reinforcement learning and planning has attracted much attention. The two methods focus on micro and macro actions, respectively. Their advantages are realized together only if the two cooperate well. Essential to this cooperation is finding an appropriate boundary that assigns different functions to each method. Such a boundary can be represented by parameters in a planning algorithm. In this paper, we create an optimization strategy for planning parameters by analyzing the connection between reaction and planning; we also create a non-gradient method for accelerating the optimization. The whole algorithm can find a satisfactory setting of planning parameters, making full use of the reaction capability of specific agents.

Giving Up Control: Neurons as Reinforcement Learning Agents

Authors:Jordan Ott
Date:2020-03-17 04:47:40

Artificial Intelligence has historically relied on planning, heuristics, and handcrafted approaches designed by experts, all the while claiming to pursue the creation of intelligence. This approach fails to acknowledge that intelligence emerges from the dynamics within a complex system. Neurons in the brain are governed by local rules, where no single neuron, or group of neurons, coordinates or controls the others. This local structure gives rise to the appropriate dynamics in which intelligence can emerge. Populations of neurons must compete with their neighbors for resources, inhibition, and activity representation. At the same time, they must cooperate, so the population and organism can perform high-level functions. To this end, we introduce modeling neurons as reinforcement learning agents, where each neuron may be viewed as an independent actor trying to maximize its own self-interest. By framing learning in this way, we open the door to an entirely new approach to building intelligent systems.

Value Variance Minimization for Learning Approximate Equilibrium in Aggregation Systems

Authors:Tanvi Verma, Pradeep Varakantham
Date:2020-03-16 10:02:42

For effective matching of resources (e.g., taxis, food, bikes, shopping items) to customer demand, aggregation systems have been extremely successful. In aggregation systems, a central entity (e.g., Uber, Food Panda, Ofo) aggregates supply (e.g., drivers, delivery personnel) and matches demand to supply on a continuous basis (sequential decisions). Due to the objective of the central entity to maximize its profits, individual suppliers get sacrificed thereby creating incentive for individuals to leave the system. In this paper, we consider the problem of learning approximate equilibrium solutions (win-win solutions) in aggregation systems, so that individuals have an incentive to remain in the aggregation system. Unfortunately, such systems have thousands of agents and have to consider demand uncertainty and the underlying problem is a (Partially Observable) Stochastic Game. Given the significant complexity of learning or planning in a stochastic game, we make three key contributions: (a) To exploit infinitesimally small contribution of each agent and anonymity (reward and transitions between agents are dependent on agent counts) in interactions, we represent this as a Multi-Agent Reinforcement Learning (MARL) problem that builds on insights from non-atomic congestion games model; (b) We provide a novel variance reduction mechanism for moving joint solution towards Nash Equilibrium that exploits the infinitesimally small contribution of each agent; and finally (c) We provide detailed results on three different domains to demonstrate the utility of our approach in comparison to state-of-the-art methods.

Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous

Authors:Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust
Date:2020-03-15 19:49:20

Collaboration requires agents to align their goals on the fly. Underlying the human ability to align goals with other agents is their ability to predict the intentions of others and actively update their own plans. We propose hierarchical predictive planning (HPP), a model-based reinforcement learning method for decentralized multiagent rendezvous. Starting with pretrained, single-agent point to point navigation policies and using noisy, high-dimensional sensor inputs like lidar, we first learn via self-supervision motion predictions of all agents on the team. Next, HPP uses the prediction models to propose and evaluate navigation subgoals for completing the rendezvous task without explicit communication among agents. We evaluate HPP in a suite of unseen environments, with increasing complexity and numbers of obstacles. We show that HPP outperforms alternative reinforcement learning, path planning, and heuristic-based baselines on challenging, unseen environments. Experiments in the real world demonstrate successful transfer of the prediction models from sim to real world without any additional fine-tuning. Altogether, HPP removes the need for a centralized operator in multiagent systems by combining model-based RL and inference methods, enabling agents to dynamically align plans.

Robot Playing Kendama with Model-Based and Model-Free Reinforcement Learning

Authors:Shidi Li
Date:2020-03-15 04:17:03

Several model-based and model-free methods have been proposed for the robot trajectory learning task. Both approaches have their benefits and drawbacks. They can usually complement each other. Many research works are trying to integrate some model-based and model-free methods into one algorithm and perform well in simulators or quasi-static robot tasks. Difficulties still exist when algorithms are used in particular trajectory learning tasks. In this paper, we propose a robot trajectory learning framework for precise tasks with discontinuous dynamics and high speed. The trajectories learned from the human demonstration are optimized by DDP and PoWER successively. The framework is tested on the Kendama manipulation task, which can also be difficult for humans to achieve. The results show that our approach can plan the trajectories to successfully complete the task.

Transfer Reinforcement Learning under Unobserved Contextual Information

Authors:Yan Zhang, Michael M. Zavlanos
Date:2020-03-09 22:00:04

In this paper, we study a transfer reinforcement learning problem where the state transitions and rewards are affected by the environmental context. Specifically, we consider a demonstrator agent that has access to a context-aware policy and can generate transition and reward data based on that policy. These data constitute the experience of the demonstrator. Then, the goal is to transfer this experience, excluding the underlying contextual information, to a learner agent that does not have access to the environmental context, so that the learner can learn a control policy using fewer samples. It is well known that disregarding the causal effect of the contextual information can introduce bias in the transition and reward models estimated by the learner, resulting in a suboptimal learned policy. To address this challenge, we develop a method to obtain causal bounds on the transition and reward functions using the demonstrator's data, which we then use to obtain causal bounds on the value functions. Using these value function bounds, we propose new Q learning and UCB-Q learning algorithms that converge to the true value function without bias. We provide numerical experiments for robot motion planning problems that validate the proposed value function bounds and demonstrate that the proposed algorithms can effectively make use of the data from the demonstrator to accelerate the learning process of the learner.

A Multi-Agent Reinforcement Learning Approach For Safe and Efficient Behavior Planning Of Connected Autonomous Vehicles

Authors:Songyang Han, Shanglin Zhou, Jiangwei Wang, Lynn Pepin, Caiwen Ding, Jie Fu, Fei Miao
Date:2020-03-09 19:15:30

The recent advancements in wireless technology enable connected autonomous vehicles (CAVs) to gather information about their environment by vehicle-to-vehicle (V2V) communication. In this work, we design an information-sharing-based multi-agent reinforcement learning (MARL) framework for CAVs, to take advantage of the extra information when making decisions to improve traffic efficiency and safety. The safe actor-critic algorithm we propose has two new techniques: the truncated Q-function and safe action mapping. The truncated Q-function utilizes the shared information from neighboring CAVs such that the joint state and action spaces of the Q-function do not grow in our algorithm for a large-scale CAV system. We prove the bound of the approximation error between the truncated-Q and global Q-functions. The safe action mapping provides a provable safety guarantee for both the training and execution based on control barrier functions. Using the CARLA simulator for experiments, we show that our approach can improve the CAV system's efficiency in terms of average velocity and comfort under different CAV ratios and different traffic densities. We also show that our approach avoids the execution of unsafe actions and always maintains a safe distance from other vehicles. We construct an obstacle-at-corner scenario to show that the shared vision can help CAVs to observe obstacles earlier and take action to avoid traffic jams.

Learning Discrete State Abstractions With Deep Variational Inference

Authors:Ondrej Biza, Robert Platt, Jan-Willem van de Meent, Lawson L. S. Wong
Date:2020-03-09 17:58:27

Abstraction is crucial for effective sequential decision making in domains with large state spaces. In this work, we propose an information bottleneck method for learning approximate bisimulations, a type of state abstraction. We use a deep neural encoder to map states onto continuous embeddings. We map these embeddings onto a discrete representation using an action-conditioned hidden Markov model, which is trained end-to-end with the neural network. Our method is suited for environments with high-dimensional states and learns from a stream of experience collected by an agent acting in a Markov decision process. Through this learned discrete abstract model, we can efficiently plan for unseen goals in a multi-goal Reinforcement Learning setting. We test our method in simplified robotic manipulation domains with image states. We also compare it against previous model-based approaches to finding bisimulations in discrete grid-world-like environments. Source code is available at https://github.com/ondrejba/discrete_abstractions.

Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces

Authors:Ahmed Touati, Adrien Ali Taiga, Marc G. Bellemare
Date:2020-03-09 12:32:02

Despite the wealth of research into provably efficient reinforcement learning algorithms, most works focus on tabular representation and thus struggle to handle exponentially or infinitely large state-action spaces. In this paper, we consider episodic reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric that characterizes the proximity between different states and actions. We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space by zooming in more promising and frequently visited regions while carefully balancing the exploitation-exploration trade-off. We show that ZoomRL achieves a worst-case regret $\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space with respect to the metric. Moreover, our algorithm enjoys improved metric-dependent guarantees that reflect the geometry of the underlying space. Finally, we show that our algorithm is robust to small misspecification errors.

UAV Coverage Path Planning under Varying Power Constraints using Deep Reinforcement Learning

Authors:Mirco Theile, Harald Bayerlein, Richard Nai, David Gesbert, Marco Caccamo
Date:2020-03-05 13:43:47

Coverage path planning (CPP) is the task of designing a trajectory that enables a mobile agent to travel over every point of an area of interest. We propose a new method to control an unmanned aerial vehicle (UAV) carrying a camera on a CPP mission with random start positions and multiple options for landing positions in an environment containing no-fly zones. While numerous approaches have been proposed to solve similar CPP problems, we leverage end-to-end reinforcement learning (RL) to learn a control policy that generalizes over varying power constraints for the UAV. Despite recent improvements in battery technology, the maximum flying range of small UAVs is still a severe constraint, which is exacerbated by variations in the UAV's power consumption that are hard to predict. By using map-like input channels to feed spatial information through convolutional network layers to the agent, we are able to train a double deep Q-network (DDQN) to make control decisions for the UAV, balancing limited power budget and coverage goal. The proposed method can be applied to a wide variety of environments and harmonizes complex goal structures with system constraints.
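
The abstract names a double deep Q-network (DDQN) but does not spell out the update. As a minimal sketch of the standard Double DQN target (not the authors' code), the online network selects the next action and the target network evaluates it. The function name `double_dqn_targets` and the plain-list inputs are assumptions made for illustration; a real agent would obtain the Q-values from convolutional networks over the map channels.

```python
def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Compute Double DQN regression targets for a batch of transitions.

    rewards        : list of floats, one per transition
    dones          : list of bools, True if the episode ended at this step
    q_online_next  : per-transition list of Q-values from the online network at s'
    q_target_next  : per-transition list of Q-values from the target network at s'
    (All names here are hypothetical; they are not taken from the paper.)
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        if done:
            # No bootstrapping past a terminal state (e.g. after landing).
            targets.append(r)
            continue
        # Action selection uses the online network ...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ... while action evaluation uses the target network.
        targets.append(r + gamma * q_tg[best_action])
    return targets


if __name__ == "__main__":
    print(double_dqn_targets(
        rewards=[1.0, 0.5],
        dones=[False, True],
        q_online_next=[[0.2, 0.9], [0.1, 0.3]],
        q_target_next=[[0.5, 0.4], [1.0, 2.0]],
        gamma=0.9,
    ))
```

Decoupling selection from evaluation is what distinguishes Double DQN from plain DQN, which would take the maximum of the target network's own values.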

BARK: Open Behavior Benchmarking in Multi-Agent Environments

Authors:Julian Bernhard, Klemens Esterle, Patrick Hart, Tobias Kessler
Date:2020-03-05 13:32:43

Predicting and planning interactive behaviors in complex traffic situations presents a challenging task. Especially in scenarios involving multiple traffic participants that interact densely, autonomous vehicles still struggle to interpret situations and to eventually achieve their own mission goal. As driving tests are costly and challenging scenarios are hard to find and reproduce, simulation is widely used to develop, test, and benchmark behavior models. However, most simulations rely on datasets and simplistic behavior models for traffic participants and do not cover the full variety of real-world, interactive human behaviors. In this work, we introduce BARK, an open-source behavior benchmarking environment designed to mitigate the shortcomings stated above. In BARK, behavior models are (re-)used for planning, prediction, and simulation. A range of models is currently available, such as Monte-Carlo Tree Search and Reinforcement Learning-based behavior models. We use a public dataset and sampling-based scenario generation to show the inter-exchangeability of behavior models in BARK. We evaluate how well the models used cope with interactions and how robust they are towards exchanging behavior models. Our evaluation shows that BARK provides a suitable framework for a systematic development of behavior models.

Adaptive Online Distributed Optimal Control of Very-Large-Scale Robotic Systems

Authors:Pingping Zhu, Chang Liu, Silvia Ferrari
Date:2020-03-04 04:49:11

This paper presents an adaptive online distributed optimal control approach that is applicable to optimal planning for very-large-scale robotic systems in highly uncertain environments. This approach is developed based on the optimal mass transport theory. It is also viewed as an online reinforcement learning and approximate dynamic programming approach in the Wasserstein-GMM space, where a novel value functional is defined based on the probability density functions of robots and the time-varying obstacle map functions describing the changing environmental information. The proposed approach is demonstrated on the path planning problem of very-large-scale robotic systems where the approximated layout of obstacles in the workspace is incrementally updated by the observations of robots, and compared with some existing state-of-the-art approaches. The numerical simulation results show that the proposed approach outperforms these approaches in terms of the average traveling distance and the energy cost.

Efficient Exploration in Constrained Environments with Goal-Oriented Reference Path

Authors:Kei Ota, Yoko Sasaki, Devesh K. Jha, Yusuke Yoshiyasu, Asako Kanezaki
Date:2020-03-03 17:07:47

In this paper, we consider the problem of building learning agents that can efficiently learn to navigate in constrained environments. The main goal is to design agents that can efficiently learn to understand and generalize to different environments using high-dimensional inputs (a 2D map), while following feasible paths that avoid obstacles in obstacle-cluttered environments. To achieve this, we make use of traditional path planning algorithms, supervised learning, and reinforcement learning algorithms in a synergistic way. The key idea is to decouple the navigation problem into planning and control, the former of which is achieved by supervised learning whereas the latter is done by reinforcement learning. Specifically, we train a deep convolutional network that can predict collision-free paths based on a map of the environment; this is then used by a reinforcement learning algorithm to learn to closely follow the path. This allows the trained agent to achieve good generalization while learning faster. We test our proposed method in the recently proposed Safety Gym suite that allows testing of safety constraints during training of learning agents. We compare our proposed method with existing work and show that our method consistently improves the sample efficiency and generalization capability to novel environments.

Relevance-Guided Modeling of Object Dynamics for Reinforcement Learning

Authors:William Agnew, Pedro Domingos
Date:2020-03-03 08:18:49

Current deep reinforcement learning (RL) approaches incorporate minimal prior knowledge about the environment, limiting computational and sample efficiency. Objects provide a succinct and causal description of the world, and many recent works have proposed unsupervised object representation learning using priors and losses over static object properties like visual consistency. However, object dynamics and interactions are also critical cues for objectness. In this paper we propose a framework for reasoning about object dynamics and behavior to rapidly determine minimal and task-specific object representations. To demonstrate the need to reason over object behavior and dynamics, we introduce a suite of RGBD MuJoCo object collection and avoidance tasks that, while intuitive and visually simple, confound state-of-the-art unsupervised object representation learning algorithms. We also highlight the potential of this framework on several Atari games, using our object representation and standard RL and planning algorithms to learn dramatically faster than existing deep RL algorithms.

PPMC RL Training Algorithm: Rough Terrain Intelligent Robots through Reinforcement Learning

Authors:Tamir Blum, Kazuya Yoshida
Date:2020-03-02 10:14:52

Robots can now learn how to make decisions and control themselves, generalizing learned behaviors to unseen scenarios. In particular, AI-powered robots show promise in rough environments like the lunar surface, due to the environmental uncertainties. We address this critical generalization aspect for robot locomotion in rough terrain through a training algorithm we have created called the Path Planning and Motion Control (PPMC) Training Algorithm. This algorithm is coupled with any generic reinforcement learning algorithm to teach robots how to respond to user commands and to travel to designated locations on a single neural network. In this paper, we show that the algorithm works independent of the robot structure, demonstrating that it works on a wheeled rover in addition to the past results on a quadruped walking robot. Further, we take several big steps towards real world practicality by introducing a rough, highly uneven terrain. Critically, we show through experiments that the robot learns to generalize to new rough terrain maps, retaining a 100% success rate. To the best of our knowledge, this is the first paper to introduce a generic training algorithm teaching generalized PPMC in rough environments to any robot, with just the use of reinforcement learning.

PlaNet of the Bayesians: Reconsidering and Improving Deep Planning Network by Incorporating Bayesian Inference

Authors:Masashi Okada, Norio Kosaka, Tadahiro Taniguchi
Date:2020-03-01 00:46:36

In the present paper, we propose an extension of the Deep Planning Network (PlaNet), also referred to as PlaNet of the Bayesians (PlaNet-Bayes). There has been a growing demand for model predictive control (MPC) in partially observable environments in which complete information is unavailable because of, for example, lack of expensive sensors. PlaNet is a promising solution to realize such latent MPC, as it is used to train state-space models via model-based reinforcement learning (MBRL) and to conduct planning in the latent space. However, recent state-of-the-art strategies mentioned in MBRL literature, such as incorporating uncertainty into training and planning, have not been considered, significantly suppressing the training performance. The proposed extension is to make PlaNet uncertainty-aware on the basis of Bayesian inference, in which both model and action uncertainty are incorporated. Uncertainty in latent models is represented using a neural network ensemble to approximately infer model posteriors. The ensemble of optimal action candidates is also employed to capture multimodal uncertainty in the optimality. The concept of the action ensemble relies on a general variational inference MPC (VI-MPC) framework and its instance, probabilistic action ensemble with trajectory sampling (PaETS). In this paper, we extend VI-MPC and PaETS, which were originally introduced in previous literature, to address partially observable cases. We experimentally compare the performances on continuous control tasks, and conclude that our method can consistently improve the asymptotic performance compared with PlaNet.

Policy-Aware Model Learning for Policy Gradient Methods

Authors:Romina Abachi, Mohammad Ghavamzadeh, Amir-massoud Farahmand
Date:2020-02-28 19:18:18

This paper considers the problem of learning a model in model-based reinforcement learning (MBRL). We examine how the planning module of an MBRL algorithm uses the model, and propose that the model learning module should incorporate the way the planner is going to use the model. This is in contrast to conventional model learning approaches, such as those based on maximum likelihood estimate, that learn a predictive model of the environment without explicitly considering the interaction of the model and the planner. We focus on policy gradient type of planning algorithms and derive new loss functions for model learning that incorporate how the planner uses the model. We call this approach Policy-Aware Model Learning (PAML). We theoretically analyze a generic model-based policy gradient algorithm and provide a convergence guarantee for the optimized policy. We also empirically evaluate PAML on some benchmark problems, showing promising results.

Assembly robots with optimized control stiffness through reinforcement learning

Authors:Masahide Oikawa, Kyo Kutsuzawa, Sho Sakaino, Toshiaki Tsuji
Date:2020-02-27 15:54:43

There is an increased demand for task automation in robots. Contact-rich tasks, wherein multiple contact transitions occur in a series of operations, are extensively being studied to realize high accuracy. In this study, we propose a methodology that uses reinforcement learning (RL) to achieve high performance in robots for the execution of assembly tasks that require precise contact with objects without causing damage. The proposed method ensures the online generation of stiffness matrices that help improve the performance of local trajectory optimization. The method has an advantage of rapid response owing to short sampling time of the trajectory planning. The effectiveness of the method was verified via experiments involving two contact-rich tasks. The results indicate that the proposed method can be implemented in various contact-rich manipulations. A demonstration video shows the performance. (https://youtu.be/gxSCl7Tp4-0)

Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

Authors:Tomas Brazdil, Krishnendu Chatterjee, Petr Novotny, Jiri Vahala
Date:2020-02-27 13:36:36

Markov decision processes (MDPs) are the de facto framework for sequential decision making in the presence of stochastic uncertainty. A classical optimization criterion for MDPs is to maximize the expected discounted-sum payoff, which ignores low probability catastrophic events with highly negative impact on the system. On the other hand, risk-averse policies require the probability of undesirable events to be below a given threshold, but they do not account for optimization of the expected payoff. We consider MDPs with discounted-sum payoff with failure states which represent catastrophic outcomes. The objective of risk-constrained planning is to maximize the expected discounted-sum payoff among risk-averse policies that ensure the probability to encounter a failure state is below a desired threshold. Our main contribution is an efficient risk-constrained planning algorithm that combines UCT-like search with a predictor learned through interaction with the MDP (in the style of AlphaZero) and with a risk-constrained action selection via linear programming. We demonstrate the effectiveness of our approach with experiments on classical MDPs from the literature, including benchmarks with an order of 10^6 states.

Sub-Goal Trees -- a Framework for Goal-Based Reinforcement Learning

Authors:Tom Jurgenson, Or Avner, Edward Groshev, Aviv Tamar
Date:2020-02-27 12:32:13

Many AI problems, in robotics and other domains, are goal-based, essentially seeking trajectories leading to various goal states. Reinforcement learning (RL), building on Bellman's optimality equation, naturally optimizes for a single goal, yet can be made multi-goal by augmenting the state with the goal. Instead, we propose a new RL framework, derived from a dynamic programming equation for the all pairs shortest path (APSP) problem, which naturally solves multi-goal queries. We show that this approach has computational benefits for both standard and approximate dynamic programming. Interestingly, our formulation prescribes a novel protocol for computing a trajectory: instead of predicting the next state given its predecessor, as in standard RL, a goal-conditioned trajectory is constructed by first predicting an intermediate state between start and goal, partitioning the trajectory into two. Then, recursively, predicting intermediate points on each sub-segment, until a complete trajectory is obtained. We call this trajectory structure a sub-goal tree. Building on it, we additionally extend the policy gradient methodology to recursively predict sub-goals, resulting in novel goal-based algorithms. Finally, we apply our method to neural motion planning, where we demonstrate significant improvements compared to standard RL on navigating a 7-DoF robot arm between obstacles.
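
The trajectory protocol described above can be illustrated on a tiny tabular graph. This is a rough sketch under stated assumptions, not the paper's neural implementation: it assumes the all-pairs recursion takes the form V_k(s, g) = min_m [V_{k-1}(s, m) + V_{k-1}(m, g)] with V_0 given by one-step costs, and the names `sgt_levels` and `build_trajectory` are hypothetical.

```python
import math


def sgt_levels(cost, depth):
    """Return [V_0, ..., V_depth], where V_k[s][g] is the cheapest cost from
    s to g using at most 2**k steps. cost[s][g] is the one-step cost
    (0 on the diagonal, math.inf where no edge exists)."""
    n = len(cost)
    levels = [[row[:] for row in cost]]  # V_0 is the one-step cost table
    for _ in range(depth):
        prev = levels[-1]
        # Split every (s, g) query at the best intermediate state m.
        cur = [[min(prev[s][m] + prev[m][g] for m in range(n))
                for g in range(n)] for s in range(n)]
        levels.append(cur)
    return levels


def build_trajectory(levels, s, g, k):
    """Build a trajectory by predicting a midpoint between s and g,
    then recursing on each half (the sub-goal tree)."""
    if k == 0:
        return [s] if s == g else [s, g]
    prev = levels[k - 1]
    mid = min(range(len(prev)), key=lambda m: prev[s][m] + prev[m][g])
    left = build_trajectory(levels, s, mid, k - 1)
    right = build_trajectory(levels, mid, g, k - 1)
    return left + right[1:]  # drop the duplicated midpoint


if __name__ == "__main__":
    inf = math.inf
    # Chain 0 -> 1 -> 2 -> 3 with unit costs, plus an expensive shortcut 0 -> 3.
    cost = [
        [0, 1, inf, 10],
        [inf, 0, 1, inf],
        [inf, inf, 0, 1],
        [inf, inf, inf, 0],
    ]
    levels = sgt_levels(cost, depth=2)
    print(levels[2][0][3], build_trajectory(levels, 0, 3, 2))
```

Because each level doubles the reachable path length, a trajectory of T steps needs only about log2(T) levels of recursion, in contrast to the step-by-step rollout of standard RL.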

Plannable Approximations to MDP Homomorphisms: Equivariance under Actions

Authors:Elise van der Pol, Thomas Kipf, Frans A. Oliehoek, Max Welling
Date:2020-02-27 08:29:10

This work exploits action equivariance for representation learning in reinforcement learning. Equivariance under actions states that transitions in the input space are mirrored by equivalent transitions in latent space, while the map and transition functions should also commute. We introduce a contrastive loss function that enforces action equivariance on the learned representations. We prove that when our loss is zero, we have a homomorphism of a deterministic Markov Decision Process (MDP). Learning equivariant maps leads to structured latent spaces, allowing us to build a model on which we plan through value iteration. We show experimentally that for deterministic MDPs, the optimal policy in the abstract MDP can be successfully lifted to the original MDP. Moreover, the approach easily adapts to changes in the goal states. Empirically, we show that in such MDPs, we obtain better representations in fewer epochs compared to representation learning approaches using reconstructions, while generalizing better to new goals than model-free approaches.

SACBP: Belief Space Planning for Continuous-Time Dynamical Systems via Stochastic Sequential Action Control

Authors:Haruki Nishimura, Mac Schwager
Date:2020-02-26 20:12:30

We propose a novel belief space planning technique for continuous dynamics by viewing the belief system as a hybrid dynamical system with time-driven switching. Our approach is based on the perturbation theory of differential equations and extends Sequential Action Control to stochastic dynamics. The resulting algorithm, which we name SACBP, does not require discretization of spaces or time and synthesizes control signals in near real-time. SACBP is an anytime algorithm that can handle general parametric Bayesian filters under certain assumptions. We demonstrate the effectiveness of our approach in an active sensing scenario and a model-based Bayesian reinforcement learning problem. In these challenging problems, we show that the algorithm significantly outperforms other existing solution techniques including approximate dynamic programming and local trajectory optimization.

Learning Navigation Costs from Demonstration in Partially Observable Environments

Authors:Tianyu Wang, Vikas Dhiman, Nikolay Atanasov
Date:2020-02-26 17:15:10

This paper focuses on inverse reinforcement learning (IRL) to enable safe and efficient autonomous navigation in unknown partially observable environments. The objective is to infer a cost function that explains expert-demonstrated navigation behavior while relying only on the observations and state-control trajectory used by the expert. We develop a cost function representation composed of two parts: a probabilistic occupancy encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features. The representation parameters are optimized by differentiating the error between demonstrated controls and a control policy computed from the cost encoder. Such differentiation is typically computed by dynamic programming through the value function over the whole state space. We observe that this is inefficient in large partially observable environments because most states are unexplored. Instead, we rely on a closed-form subgradient of the cost-to-go obtained only over a subset of promising states via an efficient motion-planning algorithm such as A* or RRT. Our experiments show that our model exceeds the accuracy of baseline IRL algorithms in robot navigation tasks, while substantially improving the efficiency of training and test-time inference.

G-Learner and GIRL: Goal Based Wealth Management with Reinforcement Learning

Authors:Matthew Dixon, Igor Halperin
Date:2020-02-25 16:03:38

We present a reinforcement learning approach to goal based wealth management problems such as optimization of retirement plans or target dated funds. In such problems, an investor seeks to achieve a financial goal by making periodic investments in the portfolio while being employed, and periodically draws from the account when in retirement, in addition to the ability to re-balance the portfolio by selling and buying different assets (e.g. stocks). Instead of relying on a utility of consumption, we present G-Learner: a reinforcement learning algorithm that operates with explicitly defined one-step rewards, does not assume a data generation process, and is suitable for noisy data. Our approach is based on G-learning - a probabilistic extension of the Q-learning method of reinforcement learning. In this paper, we demonstrate how G-learning, when applied to a quadratic reward and Gaussian reference policy, gives an entropy-regulated Linear Quadratic Regulator (LQR). This critical insight provides a novel and computationally tractable tool for wealth management tasks which scales to high dimensional portfolios. In addition to the solution of the direct problem of G-learning, we also present a new algorithm, GIRL, that extends our goal-based G-learning approach to the setting of Inverse Reinforcement Learning (IRL) where rewards collected by the agent are not observed, and should instead be inferred. We demonstrate that GIRL can successfully learn the reward parameters of a G-Learner agent and thus imitate its behavior. Finally, we discuss potential applications of the G-Learner and GIRL algorithms for wealth management and robo-advising.

Near-optimal Regret Bounds for Stochastic Shortest Path

Authors:Alon Cohen, Haim Kaplan, Yishay Mansour, Aviv Rosenberg
Date:2020-02-23 09:10:14

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while reasoning about the problem's optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent's actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost---we give an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the set of states, $A$ is the set of actions and $K$ is the number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
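
As a reader's check on the two bounds quoted above (a short calculation, not a statement from the paper), dividing the upper bound by the lower bound shows the size of the remaining gap:

```latex
\frac{B_\star \, |S| \, \sqrt{|A| K}}{B_\star \, \sqrt{|S| \, |A| \, K}}
  = \frac{|S|}{\sqrt{|S|}}
  = \sqrt{|S|}
```

The stated upper and lower bounds therefore agree in their dependence on $B_\star$, $|A|$ and $K$, and differ by a factor of $\sqrt{|S|}$ plus the logarithmic terms hidden by $\widetilde{O}$.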

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Authors:Yaqi Duan, Mengdi Wang
Date:2020-02-21 19:20:57

This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted $\chi^2$-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.

Informative Path Planning for Mobile Sensing with Reinforcement Learning

Authors:Yongyong Wei, Rong Zheng
Date:2020-02-18 21:47:00

Large-scale spatial data such as air quality, thermal conditions and location signatures play a vital role in a variety of applications. Collecting such data manually can be tedious and labour intensive. With the advancement of robotic technologies, it is feasible to automate such tasks using mobile robots with sensing and navigation capabilities. However, due to limited battery lifetime and scarcity of charging stations, it is important to plan paths for the robots that maximize the utility of data collection, also known as the informative path planning (IPP) problem. In this paper, we propose a novel IPP algorithm using reinforcement learning (RL). A constrained exploration and exploitation strategy is designed to address the unique challenges of IPP, and is shown to have fast convergence and better optimality than a classical reinforcement learning approach. Extensive experiments using real-world measurement data demonstrate that the proposed algorithm outperforms state-of-the-art algorithms in most test cases. Interestingly, unlike existing solutions that have to be re-executed when any input parameter changes, our RL-based solution allows a degree of transferability across different problem instances.

Spatial Concept-Based Navigation with Human Speech Instructions via Probabilistic Inference on Bayesian Generative Model

Authors:Akira Taniguchi, Yoshinobu Hagiwara, Tadahiro Taniguchi, Tetsunari Inamura
Date:2020-02-18 05:35:29

Robots are required to not only learn spatial concepts autonomously but also utilize such knowledge for various tasks in a domestic environment. A spatial concept represents a multimodal place category acquired from the robot's spatial experience including vision, speech-language, and self-position. The aim of this study is to enable a mobile robot to perform navigational tasks with human speech instructions, such as "Go to the kitchen", via probabilistic inference on a Bayesian generative model using spatial concepts. Specifically, path planning was formalized as the maximization of probabilistic distribution on the path-trajectory under speech instruction, based on a control-as-inference framework. Furthermore, we described the relationship between probabilistic inference based on the Bayesian generative model and control problem including reinforcement learning. We demonstrated path planning based on human instruction using acquired spatial concepts to verify the usefulness of the proposed approach in the simulator and in real environments. Experimentally, places instructed by the user's speech commands showed high probability values, and the trajectory toward the target place was correctly estimated. Our approach, based on probabilistic inference concerning decision-making, can lead to further improvement in robot autonomy.

PDDLGym: Gym Environments from PDDL Problems

Authors:Tom Silver, Rohan Chitnis
Date:2020-02-15 19:10:21

We present PDDLGym, a framework that automatically constructs OpenAI Gym environments from PDDL domains and problems. Observations and actions in PDDLGym are relational, making the framework particularly well-suited for research in relational reinforcement learning and relational sequential decision-making. PDDLGym is also useful as a generic framework for rapidly building numerous, diverse benchmarks from a concise and familiar specification language. We discuss design decisions and implementation details, and also illustrate empirical variations between the 20 built-in environments in terms of planning and model-learning difficulty. We hope that PDDLGym will facilitate bridge-building between the reinforcement learning community (from which Gym emerged) and the AI planning community (which produced PDDL). We look forward to gathering feedback from all those interested and expanding the set of available environments and features accordingly. Code: https://github.com/tomsilver/pddlgym

Learning Functionally Decomposed Hierarchies for Continuous Control Tasks with Path Planning

Authors:Sammy Christen, Lukas Jendele, Emre Aksan, Otmar Hilliges
Date:2020-02-14 10:19:52

We present HiDe, a novel hierarchical reinforcement learning architecture that successfully solves long horizon control tasks and generalizes to unseen test scenarios. Functional decomposition between planning and low-level control is achieved by explicitly separating the state-action spaces across the hierarchy, which allows the integration of task-relevant knowledge per layer. We propose an RL-based planner to efficiently leverage the information in the planning layer of the hierarchy, while the control layer learns a goal-conditioned control policy. The hierarchy is trained jointly but allows for the modular transfer of policy layers across hierarchies of different agents. We experimentally show that our method generalizes across unseen test environments and can scale to 3x horizon length compared to both learning and non-learning based methods. We evaluate on complex continuous control tasks with sparse rewards, including navigation and robot manipulation.

Frequency-based Search-control in Dyna

Authors:Yangchen Pan, Jincheng Mei, Amir-massoud Farahmand
Date:2020-02-14 00:27:58

Model-based reinforcement learning has been empirically demonstrated as a successful strategy to improve sample efficiency. In particular, Dyna is an elegant model-based architecture integrating learning and planning that provides huge flexibility in using a model. One of the most important components in Dyna is called search-control, which refers to the process of generating states or state-action pairs from which we query the model to acquire simulated experiences. Search-control is critical in improving learning efficiency. In this work, we propose a simple and novel search-control strategy by searching high frequency regions of the value function. Our main intuition is built on the Shannon sampling theorem from signal processing, which indicates that a high frequency signal requires more samples to reconstruct. We empirically show that a high frequency function is more difficult to approximate. This suggests a search-control strategy: we should use states from high frequency regions of the value function to query the model to acquire more samples. We develop a simple strategy to locally measure the frequency of a function by gradient and Hessian norms, and provide theoretical justification for this approach. We then apply our strategy to search-control in Dyna, and conduct experiments to show its properties and effectiveness on benchmark domains.
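
As a rough illustration of measuring local frequency with gradient and Hessian norms, the following minimal sketch scores candidate states of a toy value function and keeps the highest-scoring ones as model queries. The finite-difference estimator, the score (squared gradient norm plus squared Frobenius norm of the Hessian), the function names, and the toy value function are all assumptions made for this example, not the authors' exact criterion.

```python
import math

def grad_and_hessian(f, x, h=1e-3):
    """Central finite-difference gradient and Hessian of f at x (a list of floats)."""
    n = len(x)

    def shifted(i, di, j=None, dj=0.0):
        y = list(x)
        y[i] += di
        if j is not None:
            y[j] += dj
        return f(y)

    grad = [(shifted(i, h) - shifted(i, -h)) / (2 * h) for i in range(n)]
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hess[i][j] = (shifted(i, h, j, h) - shifted(i, h, j, -h)
                          - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h * h)
    return grad, hess

def local_frequency_score(f, x):
    """Hypothetical score: squared gradient norm plus squared Frobenius Hessian norm."""
    grad, hess = grad_and_hessian(f, x)
    return sum(g * g for g in grad) + sum(v * v for row in hess for v in row)

def select_query_states(f, candidates, k):
    """Keep the k candidate states where the function appears most oscillatory."""
    return sorted(candidates, key=lambda s: local_frequency_score(f, s), reverse=True)[:k]

def toy_value(s):
    """Toy value function (assumption): slow oscillation for x < 0, fast for x >= 0."""
    x = s[0]
    return math.sin(x) if x < 0 else math.sin(8.0 * x)
```

Calling `select_query_states(toy_value, [[-2.0], [-1.0], [1.0], [2.0]], 2)` keeps the two states on the fast-oscillating side, which is where a function approximator would need more samples.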

Objective Mismatch in Model-based Reinforcement Learning

Authors:Nathan Lambert, Brandon Amos, Omry Yadan, Roberto Calandra
Date:2020-02-11 16:26:07

Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficiently learning control of continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, with little development of the general framework. In this paper, we identify a fundamental issue of the standard MBRL framework -- what we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model with respect to the likelihood of the one-step ahead prediction, and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate, and vice versa globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of one-step ahead predictions is not always correlated with control performance. This observation highlights a critical limitation in the MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion about other potential directions of research for addressing this issue.

Robot Navigation with Map-Based Deep Reinforcement Learning

Authors:Guangda Chen, Lifan Pan, Yu'an Chen, Pei Xu, Zhiqiang Wang, Peichen Wu, Jianmin Ji, Xiaoping Chen
Date:2020-02-11 12:41:01

This paper proposes an end-to-end deep reinforcement learning approach for mobile robot navigation with dynamic obstacle avoidance. Using experience collected in a simulation environment, a convolutional neural network (CNN) is trained to predict proper steering actions of a robot from its egocentric local occupancy maps, which accommodate various sensors and fusion algorithms. The trained neural network is then transferred and executed on a real-world mobile robot to guide its local path planning. The new approach is evaluated both qualitatively and quantitatively in simulation and real-world robot experiments. The results show that the map-based end-to-end navigation model is easy to deploy on a robotic platform, is robust to sensor noise, and outperforms other existing DRL-based models on many indicators.

Machine Learning Approaches For Motor Learning: A Short Review

Authors:Baptiste Caramiaux, Jules Françoise, Wanyu Liu, Téo Sanchez, Frédéric Bevilacqua
Date:2020-02-11 11:11:26

Machine learning approaches have seen considerable applications in human movement modeling, but remain limited for motor learning. Motor learning requires accounting for motor variability, and poses new challenges as the algorithms need to be able to differentiate between new movements and variation of known ones. In this short review, we outline existing machine learning models for motor learning and their adaptation capabilities. We identify and describe three types of adaptation: Parameter adaptation in probabilistic models, Transfer and meta-learning in deep neural networks, and Planning adaptation by reinforcement learning. To conclude, we discuss challenges for applying these models in the domain of motor learning support systems.

On Reward Shaping for Mobile Robot Navigation: A Reinforcement Learning and SLAM Based Approach

Authors:Nicolò Botteghi, Beril Sirmacek, Khaled A. A. Mustafa, Mannes Poel, Stefano Stramigioli
Date:2020-02-10 22:00:16

We present a map-less path planning algorithm based on Deep Reinforcement Learning (DRL) for mobile robots navigating in unknown environments that relies only on 40-dimensional raw laser data and odometry information. The planner is trained using a reward function shaped based on the online knowledge of the map of the training environment, obtained using a grid-based Rao-Blackwellized particle filter, in an attempt to enhance the obstacle awareness of the agent. The agent is trained in a complex simulated environment and evaluated in two unseen ones. We show that the policy trained using the introduced reward function not only outperforms standard reward functions in terms of convergence speed, by a reduction of 36.9% of the iteration steps, and reduction of the collision samples, but it also drastically improves the behaviour of the agent in unseen environments, respectively by 23% in a simpler workspace and by 45% in a more clustered one. Furthermore, the policy trained in the simulation environment can be directly and successfully transferred to the real robot. A video of our experiments can be found at: https://youtu.be/UEV7W6e6ZqI

Reward Tweaking: Maximizing the Total Reward While Planning for Short Horizons

Authors:Chen Tessler, Shie Mannor
Date:2020-02-09 09:50:07

In reinforcement learning, the discount factor $\gamma$ controls the agent's effective planning horizon. Traditionally, this parameter was considered part of the MDP; however, as deep reinforcement learning algorithms tend to become unstable when the effective planning horizon is long, recent works refer to $\gamma$ as a hyper-parameter -- thus changing the underlying MDP and potentially leading the agent towards sub-optimal behavior on the original task. In this work, we introduce reward tweaking. Reward tweaking learns a surrogate reward function $\tilde r$ for the discounted setting that induces optimal behavior on the original finite-horizon total reward task. Theoretically, we show that there exists a surrogate reward that leads to optimality in the original task and discuss the robustness of our approach. Additionally, we perform experiments in high-dimensional continuous control tasks and show that reward tweaking guides the agent towards better long-horizon returns although it plans for short horizons.
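
To make the horizon issue concrete, the sketch below uses the common rule of thumb that the effective horizon is about $1/(1-\gamma)$ and compares two reward sequences under the total-reward and discounted objectives. The function names and the two sequences are illustrative assumptions; the code only shows the mismatch that motivates the approach and does not implement reward tweaking itself.

```python
def effective_horizon(gamma):
    """Rule-of-thumb effective planning horizon for a discount factor gamma < 1."""
    return 1.0 / (1.0 - gamma)

def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Two hypothetical trajectories: a small reward now, or a larger one 50 steps later.
early = [1.0] + [0.0] * 50
late = [0.0] * 50 + [5.0]

# Under the original total-reward objective, the late trajectory is better.
total_prefers_late = sum(late) > sum(early)

# With gamma = 0.9 (effective horizon about 10 steps), the preference flips.
discounted_prefers_early = discounted_return(early, 0.9) > discounted_return(late, 0.9)
```

Both flags evaluate to true here, so an agent optimizing the discounted objective with a short horizon would pick the trajectory that is worse on the original task.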

Reinforcement-Learning based Portfolio Management with Augmented Asset Movement Prediction States

Authors:Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Jun Xiao, Bo Li
Date:2020-02-09 08:10:03

Portfolio management (PM) is a fundamental financial planning task that aims to achieve investment goals such as maximal profits or minimal risks. Its decision process involves continuous derivation of valuable information from various data sources and sequential decision optimization, which is a prospective research direction for reinforcement learning (RL). In this paper, we propose SARL, a novel State-Augmented RL framework for PM. Our framework aims to address two unique challenges in financial PM: (1) data heterogeneity -- the collected information for each asset is usually diverse, noisy and imbalanced (e.g., news articles); and (2) environment uncertainty -- the financial market is versatile and non-stationary. To incorporate heterogeneous data and enhance robustness against environment uncertainty, our SARL augments the asset information with their price movement prediction as additional states, where the prediction can be solely based on financial data (e.g., asset prices) or derived from alternative sources such as news. Experiments on two real-world datasets, (i) Bitcoin market and (ii) HighTech stock market with 7-year Reuters news articles, validate the effectiveness of SARL over existing PM approaches, both in terms of accumulated profits and risk-adjusted profits. Moreover, extensive simulations are conducted to demonstrate the importance of our proposed state augmentation, providing new insights and boosting performance significantly over standard RL-based PM method and other baselines.

Multi-task Reinforcement Learning with a Planning Quasi-Metric

Authors:Vincent Micheli, Karthigan Sinnathamby, François Fleuret
Date:2020-02-08 22:12:59

We introduce a new reinforcement learning approach combining a planning quasi-metric (PQM) that estimates the number of steps required to go from any state to another, with task-specific "aimers" that compute a target state to reach a given goal. This decomposition allows the sharing across tasks of a task-agnostic model of the quasi-metric that captures the environment's dynamics and can be learned in a dense and unsupervised manner. We achieve multiple-fold training speed-up compared to recently published methods on the standard bit-flip problem and in the MuJoCo robotic arm simulator.

Causally Correct Partial Models for Reinforcement Learning

Authors:Danilo J. Rezende, Ivo Danihelka, George Papamakarios, Nan Rosemary Ke, Ray Jiang, Theophane Weber, Karol Gregor, Hamza Merzic, Fabio Viola, Jane Wang, Jovana Mitrovic, Frederic Besse, Ioannis Antonoglou, Lars Buesing
Date:2020-02-07 15:18:15

In reinforcement learning, we can learn a model of future observations and rewards, and use it to plan the agent's next actions. However, jointly modeling future observations can be computationally expensive or even intractable if the observations are high-dimensional (e.g. images). For this reason, previous works have considered partial models, which model only part of the observation. In this paper, we show that partial models can be causally incorrect: they are confounded by the observations they don't model, and can therefore lead to incorrect planning. To address this, we introduce a general family of partial models that are provably causally correct, yet remain fast because they do not need to fully model future observations.

Reward-Free Exploration for Reinforcement Learning

Authors:Chi Jin, Akshay Krishnamurthy, Max Simchowitz, Tiancheng Yu
Date:2020-02-07 14:03:38

Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new "reward-free RL" framework. In the exploration phase, the agent first collects trajectories from an MDP $\mathcal{M}$ without a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies under $\mathcal{M}$ for a collection of given reward functions. This framework is particularly suitable when there are many reward functions of interest, or when the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration and returns $\epsilon$-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that visit each "significant" state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.

Automated Lane Change Strategy using Proximal Policy Optimization-based Deep Reinforcement Learning

Authors:Fei Ye, Xuxin Cheng, Pin Wang, Ching-Yao Chan, Jiucai Zhang
Date:2020-02-07 08:43:34

Lane-change maneuvers are commonly executed by drivers to follow a certain routing plan, overtake a slower vehicle, adapt to a merging lane ahead, etc. However, improper lane change behaviors can be a major cause of traffic flow disruptions and even crashes. While many rule-based methods have been proposed to solve lane change problems for autonomous driving, they tend to exhibit limited performance due to the uncertainty and complexity of the driving environment. Machine learning-based methods offer an alternative approach, as Deep reinforcement learning (DRL) has shown promising success in many application domains including robotic manipulation, navigation, and playing video games. However, applying DRL to autonomous driving still faces many practical challenges in terms of slow learning rates, sample inefficiency, and safety concerns. In this study, we propose an automated lane change strategy using proximal policy optimization-based deep reinforcement learning, which shows great advantages in learning efficiency while still maintaining stable performance. The trained agent is able to learn a smooth, safe, and efficient driving policy to make lane-change decisions (i.e. when and how) in a challenging situation such as dense traffic scenarios. The effectiveness of the proposed policy is validated by using metrics of task success rate and collision rate. The simulation results demonstrate the lane change maneuvers can be efficiently learned and executed in a safe, smooth, and efficient manner.

Deep Learning Tubes for Tube MPC

Authors:David D. Fan, Ali-akbar Agha-mohammadi, Evangelos A. Theodorou
Date:2020-02-05 00:32:18

Learning-based control aims to construct models of a system to use for planning or trajectory optimization, e.g. in model-based reinforcement learning. In order to obtain guarantees of safety in this context, uncertainty must be accurately quantified. This uncertainty may come from errors in learning (due to a lack of data, for example), or may be inherent to the system. Propagating uncertainty forward in learned dynamics models is a difficult problem. In this work we use deep learning to obtain expressive and flexible models of how distributions of trajectories behave, which we then use for nonlinear Model Predictive Control (MPC). We introduce a deep quantile regression framework for control that enforces probabilistic quantile bounds and quantifies epistemic uncertainty. Using our method we explore three different approaches for learning tubes that contain the possible trajectories of the system, and demonstrate how to use each of them in a Tube MPC scheme. We prove these schemes are recursively feasible and satisfy constraints with a desired margin of probability. We present experiments in simulation on a nonlinear quadrotor system, demonstrating the practical efficacy of these ideas.
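
Quantile regression is commonly trained with the pinball (quantile) loss, so a small sketch of that loss may help make the idea of quantile bounds concrete. It is an assumption that this standard loss matches the abstract's framework; the helper names and the brute-force search over constant predictors are purely illustrative, standing in for a learned network.

```python
def pinball_loss(y, q_pred, tau):
    """Pinball loss for quantile level tau in (0, 1)."""
    diff = y - q_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def best_constant_quantile(samples, tau):
    """Constant prediction (chosen among the samples) that minimises total pinball loss."""
    return min(samples, key=lambda q: sum(pinball_loss(y, q, tau) for y in samples))

def empirical_tube(samples, lower_tau=0.05, upper_tau=0.95):
    """Hypothetical one-dimensional 'tube': a lower and an upper quantile bound."""
    return (best_constant_quantile(samples, lower_tau),
            best_constant_quantile(samples, upper_tau))
```

For the samples 1.0, 2.0, ..., 99.0, minimising the loss at level 0.9 selects 90.0, and `empirical_tube` returns the interval (5.0, 95.0). In a control setting the same loss would be applied to network outputs conditioned on state and action rather than to constants.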

Integrating Deep Reinforcement Learning with Model-based Path Planners for Automated Driving

Authors:Ekim Yurtsever, Linda Capito, Keith Redmill, Umit Ozguner
Date:2020-02-02 17:10:19

Automated driving in urban settings is challenging. Human participant behavior is difficult to model, and conventional, rule-based Automated Driving Systems (ADSs) tend to fail when they face unmodeled dynamics. On the other hand, the more recent, end-to-end Deep Reinforcement Learning (DRL) based model-free ADSs have shown promising results. However, pure learning-based approaches lack the hard-coded safety measures of model-based controllers. Here we propose a hybrid approach for integrating a path planning pipe into a vision based DRL framework to alleviate the shortcomings of both worlds. In summary, the DRL agent is trained to follow the path planner's waypoints as close as possible. The agent learns this policy by interacting with the environment. The reward function contains two major terms: the penalty of straying away from the path planner and the penalty of having a collision. The latter has precedence in the form of having a significantly greater numerical value. Experimental results show that the proposed method can plan its path and navigate between randomly chosen origin-destination points in CARLA, a dynamic urban simulation environment. Our code is open-source and available online.
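
The two-term reward described above might look roughly like the following sketch. The function name, the use of distance to the nearest waypoint as the deviation measure, and the numerical weights are assumptions chosen only to show the collision term dominating; the abstract does not give the actual values.

```python
def hybrid_reward(distance_to_waypoint, collided,
                  path_weight=1.0, collision_penalty=100.0):
    """Illustrative reward: penalise deviation from the planner's waypoint,
    and penalise collisions with a much larger magnitude (assumed values)."""
    reward = -path_weight * distance_to_waypoint
    if collided:
        reward -= collision_penalty
    return reward
```

With these placeholder weights, `hybrid_reward(0.0, True)` is lower than `hybrid_reward(10.0, False)`, so a collision on the planned path is still worse than a large deviation without one.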

Survey of Deep Reinforcement Learning for Motion Planning of Autonomous Vehicles

Authors:Szilárd Aradi
Date:2020-01-30 09:47:22

Academic research in the field of autonomous vehicles has reached high popularity in recent years, covering topics such as sensor technologies, V2X communications, safety, security, decision making, control, and even legal and standardization rules. Besides classic control design approaches, Artificial Intelligence and Machine Learning methods are present in almost all of these fields. Another part of research focuses on different layers of Motion Planning, such as strategic decisions, trajectory planning, and control. A wide range of techniques in Machine Learning itself have been developed, and this article describes one of these fields, Deep Reinforcement Learning (DRL). The paper provides insight into the hierarchical motion planning problem and describes the basics of DRL. The main elements of designing such a system are the modeling of the environment, the modeling abstractions, the description of the state and the perception models, the appropriate rewarding, and the realization of the underlying neural network. The paper describes vehicle models, simulation possibilities and computational requirements. Strategic decisions on different layers and the observation models, e.g., continuous and discrete state representations, grid-based, and camera-based solutions are presented. The paper surveys the state-of-the-art solutions systematized by the different tasks and levels of autonomous driving, such as car-following, lane-keeping, trajectory following, merging, or driving in dense traffic. Finally, open questions and future challenges are discussed.

Path Planning for UAV-Mounted Mobile Edge Computing with Deep Reinforcement Learning

Authors:Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, F. Shu
Date:2020-01-28 11:22:03

In this letter, we study an unmanned aerial vehicle (UAV)-mounted mobile edge computing network, where the UAV executes computational tasks offloaded from mobile terminal users (TUs) and the motion of each TU follows a Gauss-Markov random model. To ensure the quality-of-service (QoS) of each TU, the UAV with limited energy dynamically plans its trajectory according to the locations of mobile TUs. Towards this end, we formulate the problem as a Markov decision process, wherein the UAV trajectory and UAV-TU association are modeled as the parameters to be optimized. To maximize the system reward and meet the QoS constraint, we develop a QoS-based action selection policy in the proposed algorithm based on double deep Q-network. Simulations show that the proposed algorithm converges more quickly and achieves a higher sum throughput than conventional algorithms.
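
For readers unfamiliar with double deep Q-networks, the sketch below shows the standard double DQN bootstrap target, in which the online network selects the next action and the target network evaluates it. Plain Python lists stand in for network outputs, and the function name and arguments are assumptions; the QoS-based action selection policy from the abstract is not reproduced here.

```python
def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Standard double DQN target for one transition.

    q_online_next / q_target_next: Q-values of every action in the next state,
    as estimated by the online and target networks (lists of floats here).
    """
    if done:
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * q_target_next[best_action]
```

Decoupling selection from evaluation is the usual motivation for double DQN, since taking the maximum over a single noisy estimate tends to overestimate action values.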

Towards Learning Multi-agent Negotiations via Self-Play

Authors:Yichuan Charlie Tang
Date:2020-01-28 08:37:33

Making sophisticated, robust, and safe sequential decisions is at the heart of intelligent systems. This is especially critical for planning in complex multi-agent environments, where agents need to anticipate other agents' intentions and possible future actions. Traditional methods formulate the problem as a Markov Decision Process, but the solutions often rely on various assumptions and become brittle when presented with corner cases. In contrast, deep reinforcement learning (Deep RL) has been very effective at finding policies by simultaneously exploring, interacting, and learning from environments. Leveraging the powerful Deep RL paradigm, we demonstrate that an iterative procedure of self-play can create progressively more diverse environments, leading to the learning of sophisticated and robust multi-agent policies. We demonstrate this in a challenging multi-agent simulation of merging traffic, where agents must interact and negotiate with others in order to successfully merge on or off the road. While the environment starts off simple, we increase its complexity by iteratively adding an increasingly diverse set of agents to the agent "zoo" as training progresses. Qualitatively, we find that through self-play, our policies automatically learn interesting behaviors such as defensive driving, overtaking, yielding, and the use of signal lights to communicate intentions to other agents. In addition, quantitatively, we show a dramatic improvement of the success rate of merging maneuvers from 63% to over 98%.

Reinforcement Learning-based Application Autoscaling in the Cloud: A Survey

Authors:Yisel Garí, David A. Monge, Elina Pacini, Cristian Mateos, Carlos García Garino
Date:2020-01-27 18:23:43

Reinforcement Learning (RL) has demonstrated a great potential for automatically solving decision-making problems in complex uncertain environments. RL proposes a computational approach that allows learning through interaction in an environment with stochastic behavior, where agents take actions to maximize some cumulative short-term and long-term rewards. Some of the most impressive results have been shown in Game Theory where agents exhibited superhuman performance in games like Go or Starcraft 2, which led to its gradual adoption in many other domains, including Cloud Computing. Therefore, RL appears as a promising approach for Autoscaling in Cloud since it is possible to learn transparent (with no human intervention), dynamic (no static plans), and adaptable (constantly updated) resource management policies to execute applications. These are three important distinctive aspects to consider in comparison with other widely used autoscaling policies that are defined in an ad-hoc way or statically computed as in solutions based on meta-heuristics. Autoscaling exploits the Cloud elasticity to optimize the execution of applications according to given optimization criteria, which demands to decide when and how to scale-up/down computational resources, and how to assign them to the upcoming processing workload. Such actions have to be taken considering that the Cloud is a dynamic and uncertain environment. Motivated by this, many works apply RL to the autoscaling problem in the Cloud. In this work, we survey exhaustively those proposals from major venues, and uniformly compare them based on a set of proposed taxonomies. We also discuss open problems and prospective research in the area.

Context-aware Distribution of Fog Applications Using Deep Reinforcement Learning

Authors:Nan Wang, Blesson Varghese
Date:2020-01-24 23:31:59

Fog computing is an emerging paradigm that aims to meet the increasing computation demands arising from the billions of devices connected to the Internet. Offloading services of an application from the Cloud to the edge of the network can improve the overall Quality-of-Service (QoS) of the application since it can process data closer to user devices. Diverse Fog nodes, ranging from Wi-Fi routers to mini-clouds with varying resource capabilities, make it challenging to determine which services of an application need to be offloaded. In this paper, a context-aware mechanism for distributing applications across the Cloud and the Fog is proposed. The mechanism dynamically generates (re)deployment plans for the application to maximise the performance efficiency of the application by taking the QoS and running costs into account. The mechanism relies on deep Q-networks to generate a distribution plan without prior knowledge of the available resources on the Fog node, the network condition and the application. The feasibility of the proposed context-aware distribution mechanism is demonstrated on two use-cases, namely a face detection application and a location-based mobile game. The benefits are increased utility of dynamic distribution in both use cases, when compared to a static distribution approach used in existing research.

EgoMap: Projective mapping and structured egocentric memory for Deep RL

Authors:Edward Beeching, Christian Wolf, Jilles Dibangoye, Olivier Simonin
Date:2020-01-24 09:59:59

Tasks involving localization, memorization and planning in partially observable 3D environments are an ongoing challenge in Deep Reinforcement Learning. We present EgoMap, a spatially structured neural memory architecture. EgoMap augments a deep reinforcement learning agent's performance in 3D environments on challenging tasks with multi-step objectives. The EgoMap architecture incorporates several inductive biases including a differentiable inverse projection of CNN feature vectors onto a top-down spatially structured map. The map is updated with ego-motion measurements through a differentiable affine transform. We show this architecture outperforms both standard recurrent agents and state of the art agents with structured memory. We demonstrate that incorporating these inductive biases into an agent's architecture allows for stable training with reward alone, circumventing the expense of acquiring and labelling expert trajectories. A detailed ablation study demonstrates the impact of key aspects of the architecture and through extensive qualitative analysis, we show how the agent exploits its structured internal memory to achieve higher performance.

The impact of analytical outage modeling on expansion planning problems in the area of power systems

Authors:S. Tsianikas, N. Yousefi, J. Zhou, D. W. Coit
Date:2020-01-23 21:25:43

Expansion planning problems refer to the monetary and unit investment needed for energy production or storage. An inherent element in these problems is stochasticity in various aspects, such as the generation output of the units, climate change, or the frequency and duration of grid outages. For the latter in particular, outage modeling must be considered carefully when designing systems with distributed generation at their core, such as microgrids. In most studies so far, a single statistical distribution is used, such as a Poisson Process. However, by taking a closer look at the real outage data provided by the state of NY, it is observed that the outages do not seem to come from the same distribution. In some years, there is a huge spike in the average duration per outage and this is because of catastrophic events. Therefore, in this study we propose and test an alternative modeling for outage events. This alternative scheme will be based on the premise that outages can be broadly classified into two categories: regular and severe. Under this taxonomy, it can still be assumed that each type of event follows a Poisson Process, but outages in general follow a Poisson Process that is in fact a superposition of these two types. A reinforcement learning approach is used to solve the expansion planning problem and real location-specific data are used. The results verify our initial hypothesis and show that the optimization results are significantly affected by the outage modeling. To sum up, accurately modeling grid outage events and directly measuring the reliability performance of an energy system during catastrophic failures could provide invaluable tools and insights for the best possible preparation for this type of outage.
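
The superposition idea can be illustrated with a short simulation: two independent Poisson processes, one for regular and one for severe outages, are generated separately and merged, and the merged stream behaves like a Poisson process whose rate is the sum of the two. The rates, horizon, and function names below are placeholders assumed for illustration and are not taken from the NY outage data.

```python
import random

def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a homogeneous Poisson process on [0, horizon]."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival times
        if t > horizon:
            return times
        times.append(t)

def simulate_outages(rate_regular, rate_severe, horizon, rng):
    """Merge independent regular and severe outage streams, sorted by time."""
    events = [(t, "regular") for t in poisson_arrivals(rate_regular, horizon, rng)]
    events += [(t, "severe") for t in poisson_arrivals(rate_severe, horizon, rng)]
    return sorted(events)

def mean_outages_per_horizon(rate_regular, rate_severe, horizon, runs, seed=0):
    """Monte Carlo estimate of the expected number of outages over the horizon."""
    rng = random.Random(seed)
    total = sum(len(simulate_outages(rate_regular, rate_severe, horizon, rng))
                for _ in range(runs))
    return total / runs
```

With assumed rates of 5 regular and 0.5 severe outages per year, the estimated yearly count should come out close to 5.5, and roughly one outage in eleven should be labelled severe.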

GLIB: Efficient Exploration for Relational Model-Based Reinforcement Learning via Goal-Literal Babbling

Authors:Rohan Chitnis, Tom Silver, Joshua Tenenbaum, Leslie Pack Kaelbling, Tomas Lozano-Perez
Date:2020-01-22 22:24:06

We address the problem of efficient exploration for transition model learning in the relational model-based reinforcement learning setting without extrinsic goals or rewards. Inspired by human curiosity, we propose goal-literal babbling (GLIB), a simple and general method for exploration in such problems. GLIB samples relational conjunctive goals that can be understood as specific, targeted effects that the agent would like to achieve in the world, and plans to achieve these goals using the transition model being learned. We provide theoretical guarantees showing that exploration with GLIB will converge almost surely to the ground truth model. Experimentally, we find GLIB to strongly outperform existing methods in both prediction and planning on a range of tasks, encompassing standard PDDL and PPDDL planning benchmarks and a robotic manipulation task implemented in the PyBullet physics simulator. Video: https://youtu.be/F6lmrPT6TOY Code: https://git.io/JIsTB

Machine Learning assisted Handover and Resource Management for Cellular Connected Drones

Authors:Amin Azari, Fayezeh Ghavimi, Mustafa Ozger, Riku Jantti, Cicek Cavdar
Date:2020-01-22 10:04:26

Enabling cellular connectivity for drones introduces a wide set of challenges and opportunities. Communication of cellular-connected drones is influenced by 3-dimensional mobility and line-of-sight channel characteristics, which result in a higher number of handovers with increasing altitude. Our cell planning simulations with coexisting aerial and terrestrial users indicate that severe interference from drones to base stations is a major challenge for uplink communications of terrestrial users. Here, we first present the major challenges in the coexistence of terrestrial and drone communications by considering real geographical network data for Stockholm. Then, we derive analytical models for the key performance indicators (KPIs), including communications delay and interference over cellular networks, and formulate the handover and radio resource management (H-RRM) optimization problem. Afterwards, we transform this problem into a machine learning problem and propose a deep reinforcement learning solution to the H-RRM problem. Finally, using simulation results, we present how the speed and altitude of drones, and the tolerable level of interference, shape the optimal H-RRM policy in the network. In particular, heat maps of handover decisions at different drone altitudes and speeds are presented, which motivate a revision of legacy handover schemes and a redefinition of cell boundaries in the sky.

Reinforcement Learning with Probabilistically Complete Exploration

Authors:Philippe Morere, Gilad Francis, Tom Blau, Fabio Ramos
Date:2020-01-20 02:11:24

Balancing exploration and exploitation remains a key challenge in reinforcement learning (RL). State-of-the-art RL algorithms suffer from high sample complexity, particularly in the sparse reward case, where they can do no better than to explore in all directions until the first positive rewards are found. To mitigate this, we propose Rapidly Randomly-exploring Reinforcement Learning (R3L). We formulate exploration as a search problem and leverage widely-used planning algorithms such as Rapidly-exploring Random Tree (RRT) to find initial solutions. These solutions are used as demonstrations to initialize a policy, then refined by a generic RL algorithm, leading to faster and more stable convergence. We provide theoretical guarantees of R3L exploration finding successful solutions, as well as bounds for its sampling complexity. We experimentally demonstrate the method outperforms classic and intrinsic exploration techniques, requiring only a fraction of exploration samples and achieving better asymptotic performance.

Learning Options from Demonstration using Skill Segmentation

Authors:Matthew Cockcroft, Shahil Mawjee, Steven James, Pravesh Ranchod
Date:2020-01-19 09:29:58

We present a method for learning options from segmented demonstration trajectories. The trajectories are first segmented into skills using nonparametric Bayesian clustering and a reward function for each segment is then learned using inverse reinforcement learning. From this, a set of inferred trajectories for the demonstration are generated. Option initiation sets and termination conditions are learned from these trajectories using the one-class support vector machine clustering algorithm. We demonstrate our method in the four rooms domain, where an agent is able to autonomously discover usable options from human demonstration. Our results show that these inferred options can then be used to improve learning and planning.

Multi-agent Motion Planning for Dense and Dynamic Environments via Deep Reinforcement Learning

Authors:Samaneh Hosseini Semnani, Hugh Liu, Michael Everett, Anton de Ruiter, Jonathan P. How
Date:2020-01-18 08:24:40

This paper introduces a hybrid algorithm of deep reinforcement learning (RL) and force-based motion planning (FMP) to solve the distributed motion planning problem in dense and dynamic environments. Individually, RL and FMP algorithms each have their own limitations: FMP is not able to produce time-optimal paths, and existing RL solutions are not able to produce collision-free paths in dense environments. Therefore, we first improve the performance of recent RL approaches by introducing a new reward function that not only eliminates the requirement of a preliminary supervised learning (SL) step but also decreases the chance of collision in crowded environments. This improves performance, but many failure cases remain. We therefore develop a hybrid approach that leverages the simpler FMP approach in stuck, simple, and high-risk cases, and continues using RL for normal cases in which FMP cannot produce an optimal path. We also extend the GA3C-CADRL algorithm to 3D environments. Simulation results show that the proposed algorithm outperforms both deep RL and FMP algorithms, producing up to 50% more successful scenarios than deep RL and up to 75% less extra time to reach the goal than FMP.

POPCORN: Partially Observed Prediction COnstrained ReiNforcement Learning

Authors:Joseph Futoma, Michael C. Hughes, Finale Doshi-Velez
Date:2020-01-13 01:55:50

Many medical decision-making tasks can be framed as partially observed Markov decision processes (POMDPs). However, prevailing two-stage approaches that first learn a POMDP and then solve it often fail because the model that best fits the data may not be well suited for planning. We introduce a new optimization objective that (a) produces both high-performing policies and high-quality generative models, even when some observations are irrelevant for planning, and (b) does so in batch off-policy settings that are typical in healthcare, when only retrospective data is available. We demonstrate our approach on synthetic examples and a challenging medical decision-making problem.

A storage expansion planning framework using reinforcement learning and simulation-based optimization

Authors:S. Tsianikas, N. Yousefi, J. Zhou, M. Rodgers, D. W. Coit
Date:2020-01-10 15:23:30

In the wake of the highly electrified future ahead of us, the role of energy storage is crucial wherever distributed generation is abundant, such as in microgrid settings. Given the variety of storage options that are becoming more and more economical, determining which type of storage technology to invest in, along with the appropriate timing and capacity becomes a critical research question. It is inevitable that these problems will continue to become increasingly relevant in the future and require strategic planning and holistic and modern frameworks in order to be solved. Reinforcement Learning algorithms have already proven to be successful in problems where sequential decision-making is inherent. In the operations planning area, these algorithms are already used but mostly in short-term problems with well-defined constraints. On the contrary, we expand and tailor these techniques to long-term planning by utilizing model-free algorithms combined with simulation-based models. A model and expansion plan have been developed to optimally determine microgrid designs as they evolve to dynamically react to changing conditions and to exploit energy storage capabilities. We show that it is possible to derive better engineering solutions that would point to the types of energy storage units which could be at the core of future microgrid applications. Another key finding is that the optimal storage capacity threshold for a system depends heavily on the price movements of the available storage units. By utilizing the proposed approaches, it is possible to model inherent problem uncertainties and optimize the whole streamline of sequential investment decision-making.

Deep Interactive Reinforcement Learning for Path Following of Autonomous Underwater Vehicle

Authors:Qilei Zhang, Jinying Lin, Qixin Sha, Bo He, Guangliang Li
Date:2020-01-10 09:22:39

Autonomous underwater vehicles (AUVs) play an increasingly important role in ocean exploration. Existing AUVs are usually not fully autonomous and are generally limited to pre-planned or pre-programmed tasks. Reinforcement learning (RL) and deep reinforcement learning have been introduced into AUV design and research to improve autonomy. However, these methods are still difficult to apply directly to actual AUV systems because of sparse rewards and low learning efficiency. In this paper, we propose a deep interactive reinforcement learning method for AUV path following that combines the advantages of deep reinforcement learning and interactive RL. In addition, since a human trainer cannot provide human rewards while the AUV is running in the ocean, and the AUV needs to adapt to a changing environment, we further propose a deep reinforcement learning method that learns from both human rewards and environmental rewards at the same time. We test our methods on two path-following tasks, straight-line and sinusoidal-curve following, in simulation on the Gazebo platform. Our experimental results show that with our proposed deep interactive RL method, the AUV converges faster than a DQN learner that learns only from environmental reward. Moreover, an AUV learning with our deep RL method from both human and environmental rewards achieves similar or even better performance than with the deep interactive RL method, and can adapt to the actual environment by further learning from environmental rewards.

What can robotics research learn from computer vision research?

Authors:Peter Corke, Feras Dayoub, David Hall, John Skinner, Niko Sünderhauf
Date:2020-01-08 04:32:10

The computer vision and robotics research communities are each strong. However, progress in computer vision has become turbo-charged in recent years due to big data, GPU computing, novel learning algorithms and a very effective research methodology. By comparison, progress in robotics seems slower. It is true that robotics came later to exploring the potential of learning -- the advantages over the well-established body of knowledge in dynamics, kinematics, planning and control are still being debated, although reinforcement learning seems to offer real potential. However, the rapid development of computer vision compared to robotics cannot be attributed only to the former's adoption of deep learning. In this paper, we argue that the gains in computer vision are due to research methodology -- evaluation under strict constraints versus experiments; bold numbers versus videos.

Intelligent Roundabout Insertion using Deep Reinforcement Learning

Authors:Alessandro Paolo Capasso, Giulio Bacchiani, Daniele Molinari
Date:2020-01-03 11:16:41

An important topic in autonomous driving research is the development of maneuver planning systems. Vehicles have to interact and negotiate with each other so that optimal choices, in terms of time and safety, are made. For this purpose, we present a maneuver planning module able to negotiate entry into busy roundabouts. The proposed module is based on a neural network trained to predict when and how to enter the roundabout throughout the whole duration of the maneuver. Our model is trained with a novel implementation of A3C, which we call Delayed A3C (D-A3C), in a synthetic environment where vehicles move in a realistic manner with interaction capabilities. In addition, the system is trained such that agents feature a unique tunable behavior, emulating real-world scenarios where drivers have their own driving styles. Similarly, the maneuver can be performed using different aggressiveness levels, which is particularly useful for managing busy scenarios where conservative rule-based policies would result in indefinite waits.

Trajectory Forecasts in Unknown Environments Conditioned on Grid-Based Plans

Authors:Nachiket Deo, Mohan M. Trivedi
Date:2020-01-03 06:12:26

We address the problem of forecasting pedestrian and vehicle trajectories in unknown environments, conditioned on their past motion and scene structure. Trajectory forecasting is a challenging problem due to the large variation in scene structure and the multimodal distribution of future trajectories. Unlike prior approaches that directly learn one-to-many mappings from observed context to multiple future trajectories, we propose to condition trajectory forecasts on plans sampled from a grid based policy learned using maximum entropy inverse reinforcement learning (MaxEnt IRL). We reformulate MaxEnt IRL to allow the policy to jointly infer plausible agent goals, and paths to those goals on a coarse 2-D grid defined over the scene. We propose an attention based trajectory generator that generates continuous valued future trajectories conditioned on state sequences sampled from the MaxEnt policy. Quantitative and qualitative evaluation on the publicly available Stanford drone and NuScenes datasets shows that our model generates trajectories that are diverse, representing the multimodal predictive distribution, and precise, conforming to the underlying scene structure over long prediction horizons.

Reinforcement Learning with Goal-Distance Gradient

Authors:Kai Jiang, XiaoLong Qin
Date:2020-01-01 02:37:34

Reinforcement learning usually uses feedback rewards from the environment to train agents. However, rewards in real environments are sparse, and some environments provide no rewards at all. Most current methods struggle to achieve good performance in sparse-reward or reward-free environments. Although using shaped rewards is effective for solving sparse-reward tasks, it is limited to specific problems, and learning is also susceptible to local optima. We propose a model-free method that does not rely on environmental rewards to solve the problem of sparse rewards in general environments. Our method uses the minimum number of transitions between states as a distance to replace environmental rewards, and proposes a goal-distance gradient to achieve policy improvement. We also introduce a bridge point planning method based on the characteristics of our method to improve exploration efficiency, thereby solving more complex tasks. Experiments show that our method performs better on sparse-reward and local-optimum problems in complex environments than previous work.
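
The notion of "minimum number of transitions between states" used as a distance can be made concrete on a small grid. The sketch below is only an illustration under strong assumptions, not the method of the paper: the grid, the four-connected reversible moves, and the function names are hypothetical. The distance is computed exactly by breadth-first search, whereas the paper learns it model-free.

```python
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))  # four-connected, reversible moves (assumed)

# Hypothetical grid: '.' is free space, '#' is a wall.
GRID = [
    "....",
    ".##.",
    "....",
]


def goal_distances(grid, goal):
    """Minimum number of transitions from every reachable cell to the goal (BFS)."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist


def descend(dist, state):
    """Follow the distance downhill: always step to the neighbour closest to the goal."""
    path = [state]
    while dist[state] > 0:
        r, c = state
        neighbours = [(r + dr, c + dc) for dr, dc in MOVES if (r + dr, c + dc) in dist]
        state = min(neighbours, key=dist.get)
        path.append(state)
    return path


dist = goal_distances(GRID, goal=(0, 3))
print(descend(dist, (2, 0)))
```

Unlike a sparse goal reward, this distance gives a useful signal in every reachable state, which is the motivation the abstract gives for replacing environmental rewards.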

Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning

Authors:Simone Parisi, Davide Tateo, Maximilian Hensel, Carlo D'Eramo, Jan Peters, Joni Pajarinen
Date:2020-01-01 01:01:15

Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent receives also rewards that create suboptimal modes of the objective function, it will likely prematurely stop exploring. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore

World Programs for Model-Based Learning and Planning in Compositional State and Action Spaces

Authors:Marwin H. S. Segler
Date:2019-12-30 17:03:16

Some of the most important tasks take place in environments which lack cheap and perfect simulators, thus hampering the application of model-free reinforcement learning (RL). While model-based RL aims to learn a dynamics model, in a more general case the learner does not know a priori what the action space is. Here we propose a formalism where the learner induces a world program by learning a dynamics model and the actions in graph-based compositional environments by observing state-state transition examples. Then, the learner can perform RL with the world program as the simulator for complex planning tasks. We highlight a recent application, and propose a challenge for the community to assess world program-based planning.

Pontryagin Differentiable Programming: An End-to-End Learning and Control Framework

Authors:Wanxin Jin, Zhaoran Wang, Zhuoran Yang, Shaoshuai Mou
Date:2019-12-30 15:35:43

This paper develops a Pontryagin Differentiable Programming (PDP) methodology, which establishes a unified framework to solve a broad class of learning and control tasks. PDP is distinguished from existing methods by two novel techniques: first, we differentiate through Pontryagin's Maximum Principle, which allows us to obtain the analytical derivative of a trajectory with respect to tunable parameters within an optimal control system, enabling end-to-end learning of dynamics, policies, and/or control objective functions; and second, we propose an auxiliary control system in the backward pass of the PDP framework, whose output is the analytical derivative of the original system's trajectory with respect to the parameters, and which can be iteratively solved using standard control tools. We investigate three learning modes of PDP: inverse reinforcement learning, system identification, and control/planning. We demonstrate the capability of PDP in each learning mode on different high-dimensional systems, including a multi-link robot arm, 6-DoF maneuvering quadrotor, and 6-DoF rocket powered landing.

Learning to Combat Compounding-Error in Model-Based Reinforcement Learning

Authors:Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, Martin Müller
Date:2019-12-24 04:51:47

Despite its potential to improve sample complexity versus model-free approaches, model-based reinforcement learning can fail catastrophically if the model is inaccurate. An algorithm should ideally be able to trust an imperfect model over a reasonably long planning horizon, and only rely on model-free updates when the model errors get infeasibly large. In this paper, we investigate techniques for choosing the planning horizon on a state-dependent basis, where a state's planning horizon is determined by the maximum cumulative model error around that state. We demonstrate that these state-dependent model errors can be learned with Temporal Difference methods, based on a novel approach of temporally decomposing the cumulative model errors. Experimental results show that the proposed method can successfully adapt the planning horizon to account for state-dependent model accuracy, significantly improving the efficiency of policy learning compared to model-based and model-free baselines.
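
The two ideas in this abstract, learning a cumulative model error with temporal-difference updates and using it to limit the planning horizon, can be sketched in tabular form. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the discount factor, the step size, and the simple threshold rule for choosing a horizon.

```python
def td_error_update(err_table, state, next_state, one_step_error, alpha=0.1, gamma=0.9):
    """TD(0)-style update of an estimated discounted cumulative model error.

    err_table maps state -> estimate; one_step_error is the model's observed
    prediction error on the transition state -> next_state (hypothetical setup).
    """
    target = one_step_error + gamma * err_table.get(next_state, 0.0)
    current = err_table.get(state, 0.0)
    err_table[state] = current + alpha * (target - current)
    return err_table[state]


def select_horizon(cumulative_errors, max_error):
    """Pick the longest rollout length whose estimated cumulative error is tolerable.

    cumulative_errors[h - 1] is the estimated error of an h-step model rollout
    from the current state. A result of 0 means: do not trust the model here and
    fall back to model-free updates.
    """
    horizon = 0
    for h, err in enumerate(cumulative_errors, start=1):
        if err > max_error:
            break
        horizon = h
    return horizon


table = {}
td_error_update(table, "s0", "s1", one_step_error=1.0)
print(table)
print(select_horizon([0.1, 0.3, 0.7, 1.5], max_error=0.5))
```

The threshold rule makes the horizon state-dependent: states where the model is accurate allow long rollouts, while states with large estimated error get short or no model-based planning.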

Uncertainty-sensitive Learning and Planning with Ensembles

Authors:Piotr Miłoś, Łukasz Kuciński, Konrad Czechowski, Piotr Kozakowski, Maciek Klimek
Date:2019-12-19 17:58:25

We propose a reinforcement learning framework for discrete environments in which an agent makes both strategic and tactical decisions. The former manifests itself through the use of a value function, while the latter is powered by a tree search planner. These tools complement each other. The planning module performs a local what-if analysis, which makes it possible to avoid tactical pitfalls and boost backups of the value function. The value function, being global in nature, compensates for the inherent locality of the planner. In order to further solidify this synergy, we introduce an exploration mechanism with two distinctive components: uncertainty modelling and risk measurement. To model the uncertainty we use value function ensembles, and to reflect risk we propose several functionals that summarize the uncertainty implied by the ensemble. We show that our method performs well on hard exploration environments: Deep-sea, toy Montezuma's Revenge, and Sokoban. In all cases, we obtain a speed-up in learning and a boost in performance.
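
As a rough illustration of how an ensemble of value estimates can be collapsed into a single risk-aware score, the sketch below uses a mean-plus-scaled-standard-deviation functional. This particular functional, the parameter name `kappa`, and the toy numbers are assumptions chosen for illustration; the abstract does not say which functionals the authors use.

```python
import statistics


def mean_plus_std(values, kappa=1.0):
    """Score an ensemble of value estimates.

    kappa > 0 rewards disagreement between ensemble members (optimistic, exploratory);
    kappa < 0 penalises it (risk-averse). Uses the population standard deviation.
    """
    return statistics.fmean(values) + kappa * statistics.pstdev(values)


def choose_action(ensemble_values, risk_fn):
    """Pick the action whose ensemble of value estimates scores highest under risk_fn."""
    return max(ensemble_values, key=lambda action: risk_fn(ensemble_values[action]))


# Hypothetical estimates from a three-member ensemble: both actions have the
# same mean value, but the members disagree strongly about "right".
ensemble_values = {
    "left": [1.0, 1.0, 1.0],
    "right": [0.0, 0.5, 2.5],
}

print(choose_action(ensemble_values, mean_plus_std))
print(choose_action(ensemble_values, lambda v: mean_plus_std(v, kappa=-1.0)))
```

With equal means, the sign of `kappa` alone decides whether the agent seeks out or avoids the action the ensemble is uncertain about, which is one simple way to separate uncertainty modelling from risk measurement.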

Deep Reinforcement Learning for Motion Planning of Mobile Robots

Authors:Leonid Butyrev, Thorsten Edelhäußer, Christopher Mutschler
Date:2019-12-19 15:12:39

This paper presents a novel motion and trajectory planning algorithm for nonholonomic mobile robots that uses recent advances in deep reinforcement learning. Starting from a random initial state, i.e., position, velocity and orientation, the robot reaches an arbitrary target state while taking both kinematic and dynamic constraints into account. Our deep reinforcement learning agent not only processes a continuous state space but also executes continuous actions, i.e., the acceleration of wheels and the adaptation of the steering angle. We evaluate our motion and trajectory planning on a mobile robot with a differential drive in a simulation environment.

Planning with Abstract Learned Models While Learning Transferable Subtasks

Authors:John Winder, Stephanie Milani, Matthew Landen, Erebus Oh, Shane Parr, Shawn Squire, Marie desJardins, Cynthia Matuszek
Date:2019-12-16 17:47:57

We introduce an algorithm for model-based hierarchical reinforcement learning to acquire self-contained transition and reward models suitable for probabilistic planning at multiple levels of abstraction. We call this framework Planning with Abstract Learned Models (PALM). By representing subtasks symbolically using a new formal structure, the lifted abstract Markov decision process (L-AMDP), PALM learns models that are independent and modular. Through our experiments, we show how PALM integrates planning and execution, facilitating a rapid and efficient learning of abstract, hierarchical models. We also demonstrate the increased potential for learned models to be transferred to new and related tasks.

Resolving Congestions in the Air Traffic Management Domain via Multiagent Reinforcement Learning Methods

Authors:Theocharis Kravaris, Christos Spatharis, Alevizos Bastas, George A. Vouros, Konstantinos Blekas, Gennady Andrienko, Natalia Andrienko, Jose Manuel Cordero Garcia
Date:2019-12-14 15:06:35

In this article, we report on the efficiency and effectiveness of multiagent reinforcement learning (MARL) methods for the computation of flight delays to resolve congestion problems in the Air Traffic Management (ATM) domain. Specifically, we aim to resolve cases where demand for airspace use exceeds capacity (demand-capacity problems) by imposing ground delays on flights at the pre-tactical stage of operations (i.e., a few days to a few hours before operation). Casting this into the multiagent domain, agents, representing flights, need to decide on their own delays w.r.t. their own preferences, having no information about others' payoffs, preferences and constraints, while they plan to execute their trajectories jointly with others, adhering to operational constraints. Specifically, we formalize the problem as a multiagent Markov Decision Process (MA-MDP) and we show that it can be considered as a Markov game in which interacting agents need to reach an equilibrium. What makes the problem more interesting is the dynamic setting in which agents operate, which is also due to the unforeseen, emergent effects of their decisions on the whole system. We propose collaborative multiagent reinforcement learning methods to resolve demand-capacity imbalances. An extensive experimental study on real-world cases shows the potential of the proposed approaches in resolving problems, while advanced visualizations provide detailed views towards understanding the quality of the solutions provided.

Long-Term Planning and Situational Awareness in OpenAI Five

Authors:Jonathan Raiman, Susan Zhang, Filip Wolski
Date:2019-12-13 21:49:30

Understanding how knowledge about the world is represented within model-free deep reinforcement learning methods is a major challenge given the black box nature of its learning process within high-dimensional observation and action spaces. AlphaStar and OpenAI Five have shown that agents can be trained without any explicit hierarchical macro-actions to reach superhuman skill in games that require taking thousands of actions before reaching the final goal. Assessing the agent's plans and game understanding becomes challenging given the lack of hierarchy or explicit representations of macro-actions in these models, coupled with the incomprehensible nature of the internal representations. In this paper, we study the distributed representations learned by OpenAI Five to investigate how game knowledge is gradually obtained over the course of training. We also introduce a general technique for learning a model from the agent's hidden states to identify the formation of plans and subgoals. We show that the agent can learn situational similarity across actions, and find evidence of planning towards accomplishing subgoals minutes before they are executed. We perform a qualitative analysis of these predictions during the games against the DotA 2 world champions OG in April 2019.

Learning Latent State Spaces for Planning through Reward Prediction

Authors:Aaron Havens, Yi Ouyang, Prabhat Nagarajan, Yasuhiro Fujita
Date:2019-12-09 17:32:51

Model-based reinforcement learning methods typically learn models for high-dimensional state spaces by aiming to reconstruct and predict the original observations. However, drawing inspiration from model-free reinforcement learning, we propose learning a latent dynamics model directly from rewards. In this work, we introduce a model-based planning framework which learns a latent reward prediction model and then plans in the latent state-space. The latent representation is learned exclusively from multi-step reward prediction which we show to be the only necessary information for successful planning. With this framework, we are able to benefit from the concise model-free representation, while still enjoying the data-efficiency of model-based algorithms. We demonstrate our framework in multi-pendulum and multi-cheetah environments where several pendulums or cheetahs are shown to the agent but only one of which produces rewards. In these environments, it is important for the agent to construct a concise latent representation to filter out irrelevant observations. We find that our method can successfully learn an accurate latent reward prediction model in the presence of the irrelevant information while existing model-based methods fail. Planning in the learned latent state-space shows strong performance and high sample efficiency over model-free and model-based baselines.

Data Collection versus Data Estimation: A Fundamental Trade-off in Dynamic Networks

Authors:Jalal Arabneydi, Amir G. Aghdam
Date:2019-12-09 11:07:15

An important question that often arises in the operation of networked systems is whether to collect the real-time data or to estimate them based on the previously collected data. Various factors should be taken into account such as how informative the data are at each time instant for state estimation, how costly and credible the collected data are, and how rapidly the data vary with time. The above question can be formulated as a dynamic decision making problem with imperfect information structure, where a decision maker wishes to find an efficient way to switch between data collection and data estimation while the quality of the estimation depends on the previously collected data (i.e., duality effect). In this paper, the evolution of the state of each node is modeled as an exchangeable Markov process for discrete features and equivariant linear system for continuous features, where the data of interest are defined in the former case as the empirical distribution of the states, and in the latter case as the weighted average of the states. When the data are collected, they may or may not be credible, according to a Bernoulli distribution. Based on a novel planning space, a Bellman equation is proposed to identify a near-optimal strategy. A reinforcement learning algorithm is developed for the case when the model is not known exactly, and its convergence to the near-optimal solution is shown subsequently. In addition, a certainty threshold is introduced that determines when data estimation is more desirable than data collection, as the number of nodes increases. For the special case of linear dynamics, a separation principle is constructed wherein the optimal estimate is computed by a Kalman-like filter, irrespective of the probability distribution of random variables...

Value-of-Information based Arbitration between Model-based and Model-free Control

Authors:Krishn Bera, Yash Mandilwar, Bapi Raju
Date:2019-12-08 07:26:33

There have been numerous attempts in explaining the general learning behaviours using model-based and model-free methods. While the model-based control is flexible yet computationally expensive in planning, the model-free control is quick but inflexible. The model-based control is therefore immune from reward devaluation and contingency degradation. Multiple arbitration schemes have been suggested to achieve the data efficiency and computational efficiency of model-based and model-free control respectively. In this context, we propose a quantitative 'value of information' based arbitration between both the controllers in order to establish a general computational framework for skill learning. The interacting model-based and model-free reinforcement learning processes are arbitrated using an uncertainty-based value of information. We further show that our algorithm performs better than Q-learning as well as Q-learning with experience replay.

Driving Style Encoder: Situational Reward Adaptation for General-Purpose Planning in Automated Driving

Authors:Sascha Rosbach, Vinit James, Simon Großjohann, Silviu Homoceanu, Xing Li, Stefan Roth
Date:2019-12-07 14:30:22

General-purpose planning algorithms for automated driving combine mission, behavior, and local motion planning. Such planning algorithms map features of the environment and driving kinematics into complex reward functions. To achieve this, planning experts often rely on linear reward functions. The specification and tuning of these reward functions is a tedious process and requires significant experience. Moreover, a manually designed linear reward function does not generalize across different driving situations. In this work, we propose a deep learning approach based on inverse reinforcement learning that generates situation-dependent reward functions. Our neural network provides a mapping between features and actions of sampled driving policies of a model-predictive control-based planner and predicts reward functions for upcoming planning cycles. In our evaluation, we compare the driving style of reward functions predicted by our deep network against clustered and linear reward functions. Our proposed deep learning approach outperforms clustered linear reward functions and is on par with linear reward functions with a-priori knowledge about the situation.

A pedestrian path-planning model in accordance with obstacle's danger with reinforcement learning

Authors:Thanh-Trung Trinh, Dinh-Minh Vu, Masaomi Kimura
Date:2019-12-06 01:40:43

Most microscopic pedestrian navigation models use the concept of "forces" applied to the pedestrian agents to replicate the navigation environment. While the approach could provide believable results in regular situations, it does not always resemble natural pedestrian navigation behaviour in many typical settings. In our research, we proposed a novel approach using reinforcement learning for simulation of pedestrian agent path planning and collision avoidance problem. The primary focus of this approach is using human perception of the environment and danger awareness of interferences. The implementation of our model has shown that the path planned by the agent shares many similarities with a human pedestrian in several aspects such as following common walking conventions and human behaviours.

Inter-Level Cooperation in Hierarchical Reinforcement Learning

Authors:Abdul Rahman Kreidieh, Glen Berseth, Brandon Trabucco, Samyak Parajuli, Sergey Levine, Alexandre M. Bayen
Date:2019-12-05 03:56:44

Hierarchies of temporally decoupled policies present a promising approach for enabling structured exploration in complex long-term planning problems. To fully achieve this approach an end-to-end training paradigm is needed. However, training these multi-level policies has had limited success due to challenges arising from interactions between the goal-assigning and goal-achieving levels within a hierarchy. In this article, we consider the policy optimization process as a multi-agent process. This allows us to draw on connections between communication and cooperation in multi-agent RL, and demonstrate the benefits of increased cooperation between sub-policies on the training performance of the overall policy. We introduce a simple yet effective technique for inducing inter-level cooperation by modifying the objective function and subsequent gradients of higher-level policies. Experimental results on a wide variety of simulated robotics and traffic control tasks demonstrate that inducing cooperation results in stronger performing policies and increased sample efficiency on a set of difficult long time horizon tasks. We also find that goal-conditioned policies trained using our method display better transfer to new tasks, highlighting the benefits of our method in learning task-agnostic lower-level behaviors. Videos and code are available at: https://sites.google.com/berkeley.edu/cooperative-hrl.

Safety Guarantees for Planning Based on Iterative Gaussian Processes

Authors:Kyriakos Polymenakos, Luca Laurenti, Andrea Patane, Jan-Peter Calliess, Luca Cardelli, Marta Kwiatkowska, Alessandro Abate, Stephen Roberts
Date:2019-11-29 21:13:05

Gaussian Processes (GPs) are widely employed in control and learning because of their principled treatment of uncertainty. However, tracking uncertainty for iterative, multi-step predictions in general leads to an analytically intractable problem. While approximation methods exist, they do not come with guarantees, making it difficult to estimate their reliability and to trust their predictions. In this work, we derive formal probability error bounds for iterative prediction and planning with GPs. Building on GP properties, we bound the probability that random trajectories lie in specific regions around the predicted values. Namely, given a tolerance $\epsilon > 0$, we compute regions around the predicted trajectory values, such that GP trajectories are guaranteed to lie inside them with probability at least $1-\epsilon$. We verify experimentally that our method tracks the predictive uncertainty correctly, even when current approximation techniques fail. Furthermore, we show how the proposed bounds can be employed within a safe reinforcement learning framework to verify the safety of candidate control policies, guiding the synthesis of provably safe controllers.

Join Query Optimization with Deep Reinforcement Learning Algorithms

Authors:Jonas Heitz, Kurt Stockinger
Date:2019-11-26 16:48:25

Join query optimization is a complex task and is central to the performance of query processing. In fact it belongs to the class of NP-hard problems. Traditional query optimizers use dynamic programming (DP) methods combined with a set of rules and restrictions to avoid exhaustive enumeration of all possible join orders. However, DP methods are very resource intensive. Moreover, given simplifying assumptions of attribute independence, traditional query optimizers rely on erroneous cost estimations, which can lead to suboptimal query plans. Recent success of deep reinforcement learning (DRL) creates new opportunities for the field of query optimization to tackle the above-mentioned problems. In this paper, we present our DRL-based Fully Observed Optimizer (FOOP) which is a generic query optimization framework that enables plugging in different machine learning algorithms. The main idea of FOOP is to use a data-adaptive learning query optimizer that avoids exhaustive enumerations of join orders and is thus significantly faster than traditional approaches based on dynamic programming. In particular, we evaluate various DRL-algorithms and show that Proximal Policy Optimization significantly outperforms Q-learning based algorithms. Finally we demonstrate how ensemble learning techniques combined with DRL can further improve the query optimizer.

Which Channel to Ask My Question? Personalized Customer Service Request Stream Routing using Deep Reinforcement Learning

Authors:Zining Liu, Chong Long, Xiaolu Lu, Zehong Hu, Jie Zhang, Yafang Wang
Date:2019-11-24 12:57:03

Customer services are critical to all companies, as they may directly connect to the brand reputation. Due to a great number of customers, e-commerce companies often employ multiple communication channels to answer customers' questions, for example, chatbot and hotline. On one hand, each channel has limited capacity to respond to customers' requests; on the other hand, customers have different preferences over these channels. The current production systems are mainly built based on business rules, which merely consider tradeoffs between resources and customers' satisfaction. To achieve the optimal tradeoff between resources and customers' satisfaction, we propose a new framework based on deep reinforcement learning, which directly takes both resources and user model into account. In addition to the framework, we also propose a new deep-reinforcement-learning based routing method: double dueling deep Q-learning with prioritized experience replay (PER-DoDDQN). We evaluate our proposed framework and method using both synthetic and real customer service log data from a large financial technology company. We show that our proposed deep-reinforcement-learning based framework is superior to the existing production system. Moreover, we also show our proposed PER-DoDDQN is better than all other deep Q-learning variants in practice, providing a better routing plan. These observations suggest that our proposed method can seek the trade-off where both channel resources and customers' satisfaction are optimal.
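
The three ingredients named in PER-DoDDQN (dueling heads, double Q-learning targets, prioritized replay) have well-known textbook forms, which the sketch below illustrates. It is not the authors' implementation: the function names, the two-action setup (think of the actions as the chatbot and hotline channels), and the values of `alpha` and `eps` are assumptions made for illustration.

```python
import numpy as np

def dueling_q(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.mean()

def double_q_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double Q-learning target: the online net picks the action,
    the target net evaluates it."""
    if done:
        return float(reward)
    best_action = int(np.argmax(q_online_next))
    return float(reward + gamma * q_target_next[best_action])

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritized replay: P(i) is proportional to (|delta_i| + eps)^alpha."""
    priorities = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    return priorities / priorities.sum()

# Hypothetical numbers for a two-channel routing decision.
q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
target = double_q_target(1.0, 0.5, q_online_next=[0.1, 0.9], q_target_next=[4.0, 2.0])
probs = per_probabilities([0.1, 2.0])
```

In the target computation the online estimates select action 1, so the target network's value for that action (2.0) is used rather than its own maximum (4.0); this decoupling is what double Q-learning uses to reduce overestimation.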

Planning with Goal-Conditioned Policies

Authors:Soroush Nasiriany, Vitchyr H. Pong, Steven Lin, Sergey Levine
Date:2019-11-19 18:25:22

Planning methods can solve temporally extended sequential decision making problems by composing simple behaviors. However, planning requires suitable abstractions for the states and transitions, which typically need to be designed by hand. In contrast, model-free reinforcement learning (RL) can acquire behaviors from low-level inputs directly, but often struggles with temporally extended tasks. Can we utilize reinforcement learning to automatically form the abstractions needed for planning, thus obtaining the best of both approaches? We show that goal-conditioned policies learned with RL can be incorporated into planning, so that a planner can focus on which states to reach, rather than how those states are reached. However, with complex state observations such as images, not all inputs represent valid states. We therefore also propose using a latent variable model to compactly represent the set of valid states for the planner, so that the policies provide an abstraction of actions, and the latent variable model provides an abstraction of states. We compare our method with planning-based and model-free methods and find that our method significantly outperforms prior work when evaluated on image-based robot navigation and manipulation tasks that require non-greedy, multi-staged behavior.

Gamma-Nets: Generalizing Value Estimation over Timescale

Authors:Craig Sherstan, Shibhansh Dohare, James MacGlashan, Johannes Günther, Patrick M. Pilarski
Date:2019-11-18 17:49:06

We present $\Gamma$-nets, a method for generalizing value function estimation over timescale. By using the timescale as one of the estimator's inputs we can estimate value for arbitrary timescales. As a result, the prediction target for any timescale is available and we are free to train on multiple timescales at each timestep. Here we empirically evaluate $\Gamma$-nets in the policy evaluation setting. We first demonstrate the approach on a square wave and then on a robot arm using linear function approximation. Next, we consider the deep reinforcement learning setting using several Atari video games. Our results show that $\Gamma$-nets can be effective for predicting arbitrary timescales, with only a small cost in accuracy as compared to learning estimators for fixed timescales. $\Gamma$-nets provide a method for compactly making predictions at many timescales without requiring a priori knowledge of the task, making it a valuable contribution to ongoing work on model-based planning, representation learning, and lifelong learning algorithms.
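
The core idea, feeding the timescale to the estimator and training on several timescales at every step, can be sketched with linear function approximation. The toy problem below is an assumption chosen for illustration (a single state that loops to itself with reward 1, so the true value is $1/(1-\gamma)$), as are the polynomial features of $\gamma$ and all hyperparameters; it is not the authors' setup.

```python
import numpy as np

def features(gamma):
    # Hypothetical features: a quadratic polynomial in the timescale gamma.
    return np.array([1.0, gamma, gamma ** 2])

def train(gammas=(0.0, 0.25, 0.5), sweeps=50000, lr=0.5, reward=1.0):
    """TD(0) on a one-state loop, updating every sampled timescale per step."""
    w = np.zeros(3)
    phis = [(g, features(g)) for g in gammas]
    for _ in range(sweeps):
        # One environment step: the same transition trains all timescales.
        for g, phi in phis:
            v = phi @ w
            td_error = reward + g * v - v  # next state equals current state
            w += lr * td_error * phi
    return w

def value(w, gamma):
    return float(features(gamma) @ w)

w = train()
```

After training, `value(w, g)` matches $1/(1-\gamma)$ at the three trained timescales, and the same weights can be queried at other values of `gamma`, where the quadratic features give only an approximation.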

IKEA Furniture Assembly Environment for Long-Horizon Complex Manipulation Tasks

Authors:Youngwoon Lee, Edward S. Hu, Zhengyu Yang, Alex Yin, Joseph J. Lim
Date:2019-11-17 14:32:20

The IKEA Furniture Assembly Environment is one of the first benchmarks for testing and accelerating the automation of complex manipulation tasks. The environment is designed to advance reinforcement learning from simple toy tasks to complex tasks requiring both long-term planning and sophisticated low-level control. Our environment supports over 80 different furniture models, Sawyer and Baxter robot simulation, and domain randomization. The IKEA Furniture Assembly Environment is a testbed for methods aiming to solve complex manipulation tasks. The environment is publicly available at https://clvrai.com/furniture

Automated Augmentation with Reinforcement Learning and GANs for Robust Identification of Traffic Signs using Front Camera Images

Authors:Sohini Roy Chowdhury, Lars Tornberg, Robin Halvfordsson, Jonatan Nordh, Adam Suhren Gustafsson, Joel Wall, Mattias Westerberg, Adam Wirehed, Louis Tilloy, Zhanying Hu, Haoyuan Tan, Meng Pan, Jonas Sjoberg
Date:2019-11-15 06:23:50

Traffic sign identification using camera images from vehicles plays a critical role in autonomous driving and path planning. However, the front camera images can be distorted due to blurriness, lighting variations and vandalism, which can lead to degradation of detection performance. As a solution, machine learning models must be trained with data from multiple domains, and collecting and labeling more data in each new domain is time consuming and expensive. In this work, we present an end-to-end framework to augment traffic sign training data using optimal reinforcement learning policies and a variety of Generative Adversarial Network (GAN) models, that can then be used to train traffic sign detector modules. Our automated augmenter enables learning from transformed nighttime, poor lighting, and varying degrees of occlusions using the LISA Traffic Sign and BDD-Nexar datasets. The proposed method enables mapping training data from one domain to another, thereby improving traffic sign detection precision/recall from 0.70/0.66 to 0.83/0.71 for nighttime images.

Efficient Planning under Partial Observability with Unnormalized Q Functions and Spectral Learning

Authors:Tianyu Li, Bogdan Mazoure, Doina Precup, Guillaume Rabusseau
Date:2019-11-12 16:56:37

Learning and planning in partially-observable domains is one of the most difficult problems in reinforcement learning. Traditional methods consider these two problems as independent, resulting in a classical two-stage paradigm: first learn the environment dynamics and then plan accordingly. This approach, however, disconnects the two problems and can consequently lead to algorithms that are sample inefficient and time consuming. In this paper, we propose a novel algorithm that combines learning and planning together. Our algorithm is closely related to the spectral learning algorithm for predictive state representations and offers appealing theoretical guarantees and time complexity. We empirically show on two domains that our approach is more sample and time efficient compared to classical methods.

Multi-Agent Connected Autonomous Driving using Deep Reinforcement Learning

Authors:Praveen Palanisamy
Date:2019-11-11 10:55:25

The capability to learn and adapt to changes in the driving environment is crucial for developing autonomous driving systems that are scalable beyond geo-fenced operational design domains. Deep Reinforcement Learning (RL) provides a promising and scalable framework for developing adaptive learning based solutions. Deep RL methods usually model the problem as a (Partially Observable) Markov Decision Process in which an agent acts in a stationary environment to learn an optimal behavior policy. However, driving involves complex interaction between multiple, intelligent (artificial or human) agents in a highly non-stationary environment. In this paper, we propose the use of Partially Observable Markov Games (POSG) for formulating the connected autonomous driving problems with realistic assumptions. We provide a taxonomy of multi-agent learning environments based on the nature of tasks, nature of agents and the nature of the environment to help in categorizing various autonomous driving problems that can be addressed under the proposed formulation. As our main contribution, we provide MACAD-Gym, a Multi-Agent Connected, Autonomous Driving agent learning platform for furthering research in this direction. Our MACAD-Gym platform provides an extensible set of Connected Autonomous Driving (CAD) simulation environments that enable the research and development of Deep RL-based integrated sensing, perception, planning and control algorithms for CAD systems with unlimited operational design domain under realistic, multi-agent settings. We also share the MACAD-Agents that were trained successfully using the MACAD-Gym platform to learn control policies for multiple vehicle agents in a partially observable, stop-sign controlled, 3-way urban intersection environment with raw (camera) sensor observations.

Value-Added Chemical Discovery Using Reinforcement Learning

Authors:Peihong Jiang, Hieu Doan, Sandeep Madireddy, Rajeev Surendran Assary, Prasanna Balaprakash
Date:2019-11-10 07:36:37

Computer-assisted synthesis planning aims to help chemists find better reaction pathways faster. Finding viable and short pathways from sugar molecules to value-added chemicals can be modeled as a retrosynthesis planning problem with a catalyst allowed. This is a crucial step in efficient biomass conversion. The traditional computational chemistry approach to identifying possible reaction pathways involves computing the reaction energies of hundreds of intermediates, which is a critical bottleneck in in silico reaction discovery. Deep reinforcement learning has shown in other domains that a well-trained agent with little or no prior human knowledge can surpass human performance. While some effort has been made to adapt machine learning techniques to the retrosynthesis planning problem, value-added chemical discovery presents unique challenges. Specifically, the reaction can occur in several different sites in a molecule, a subtle case that has never been treated in previous works. With a more versatile formulation of the problem as a Markov decision process, we address the problem using deep reinforcement learning techniques and present promising preliminary results.

Hierarchical Reinforcement Learning Method for Autonomous Vehicle Behavior Planning

Authors:Zhiqian Qiao, Zachariah Tyree, Priyantha Mudalige, Jeff Schneider, John M. Dolan
Date:2019-11-09 23:19:59

In this work, we propose a hierarchical reinforcement learning (HRL) structure which is capable of performing autonomous vehicle planning tasks in simulated environments with multiple sub-goals. In this hierarchical structure, the network is capable of 1) learning one task with multiple sub-goals simultaneously; 2) extracting attentions of states according to changing sub-goals during the learning process; 3) reusing the well-trained network of sub-goals for other similar tasks with the same sub-goals. The states are defined as processed observations which are transmitted from the perception system of the autonomous vehicle. A hybrid reward mechanism is designed for different hierarchical layers in the proposed HRL structure. Compared to traditional RL methods, our algorithm is more sample-efficient since its modular design allows reusing the policies of sub-goals across similar tasks. The results show that the proposed method converges to an optimal policy faster than traditional RL methods.

Robo-PlaNet: Learning to Poke in a Day

Authors:Maxime Chevalier-Boisvert, Guillaume Alain, Florian Golemo, Derek Nowrouzezahrai
Date:2019-11-09 02:05:18

Recently, the Deep Planning Network (PlaNet) approach was introduced as a model-based reinforcement learning method that learns environment dynamics directly from pixel observations. This architecture is useful for learning tasks in which either the agent does not have access to meaningful states (like position/velocity of robotic joints) or where the observed states significantly deviate from the physical state of the agent (which is commonly the case in low-cost robots in the form of backlash or noisy joint readings). PlaNet, by design, interleaves phases of training the dynamics model with phases of collecting more data on the target environment, leading to long training times. In this work, we introduce Robo-PlaNet, an asynchronous version of PlaNet. This algorithm consistently reaches higher performance in the same amount of time, which we demonstrate in both a simulated and a real robotic experiment.

Mapless Navigation among Dynamics with Social-safety-awareness: a reinforcement learning approach from 2D laser scans

Authors:Jun Jin, Nhat M. Nguyen, Nazmus Sakib, Daniel Graves, Hengshuai Yao, Martin Jagersand
Date:2019-11-08 06:29:31

We propose a method to tackle the problem of mapless collision-avoidance navigation where humans are present using 2D laser scans. Our proposed method uses ego-safety to measure collision from the robot's perspective and social-safety to measure the impact of our robot's actions on surrounding pedestrians. Specifically, the social-safety part predicts the intrusion impact of our robot's action into the interaction area with surrounding humans. We train the policy using reinforcement learning on a simple simulator and directly evaluate the learned policy in Gazebo and real robot tests. Experiments show the learned policy can be smoothly transferred without any fine tuning. We observe that our method demonstrates time-efficient path planning behavior with a high success rate in mapless navigation tasks. Furthermore, we test our method on the task of navigating among dynamic crowds, considering both low and high volume traffic. Our learned policy demonstrates cooperative behavior that actively drives our robot into traffic flows while showing respect to nearby pedestrians. Evaluation videos are at https://sites.google.com/view/ssw-batman

DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning

Authors:Bharathan Balaji, Sunil Mallya, Sahika Genc, Saurabh Gupta, Leo Dirac, Vineet Khare, Gourav Roy, Tao Sun, Yunzhe Tao, Brian Townsend, Eddie Calleja, Sunil Muralidhara, Dhanasekar Karuppasamy
Date:2019-11-05 01:40:42

DeepRacer is a platform for end-to-end experimentation with RL and can be used to systematically investigate the key challenges in developing intelligent control systems. Using the platform, we demonstrate how a 1/18th scale car can learn to drive autonomously using RL with a monocular camera. It is trained in simulation with no additional tuning in the physical world and demonstrates: 1) formulation and solution of a robust reinforcement learning algorithm, 2) narrowing the reality gap through joint perception and dynamics, 3) distributed on-demand compute architecture for training optimal policies, and 4) a robust evaluation method to identify when to stop training. It is the first successful large-scale deployment of deep reinforcement learning on a robotic control agent that uses only raw camera images as observations and a model-free learning method to perform robust path planning. We open source our code and video demo on GitHub: https://git.io/fjxoJ.

Explicit Explore-Exploit Algorithms in Continuous State Spaces

Authors:Mikael Henaff
Date:2019-11-01 23:58:05

We present a new model-based algorithm for reinforcement learning (RL) which consists of explicit exploration and exploitation phases, and is applicable in large or infinite state spaces. The algorithm maintains a set of dynamics models consistent with current experience and explores by finding policies which induce high disagreement between their state predictions. It then exploits using the refined set of models or experience gathered during exploration. We show that under realizability and optimal planning assumptions, our algorithm provably finds a near-optimal policy with a number of samples that is polynomial in a structural complexity measure which we show to be low in several natural settings. We then give a practical approximation using neural networks and demonstrate its performance and sample efficiency in practice.
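
The exploration signal described above, disagreement between the state predictions of several dynamics models, can be sketched in a few lines. This is a simplified one-step illustration under stated assumptions, not the paper's algorithm (which searches over policies): the two linear models, the summed-variance disagreement measure, and the function names are all hypothetical.

```python
import numpy as np

def disagreement(models, state, action):
    """Summed per-dimension variance of the models' next-state predictions."""
    preds = np.array([m(state, action) for m in models])
    return float(preds.var(axis=0).sum())

def most_informative_action(models, state, actions):
    """Pick the action on which the candidate models disagree the most."""
    return max(actions, key=lambda a: disagreement(models, state, a))

# Two hypothetical dynamics models, both consistent with data where action == 0.
model_a = lambda s, a: s + a
model_b = lambda s, a: s + 2.0 * a

state = np.array([0.0, 0.0])
chosen = most_informative_action([model_a, model_b], state, actions=[0.0, 1.0])
```

Both models predict the same next state for action 0.0, so trying it would not tell them apart; action 1.0 yields different predictions, and observing the real outcome would rule out at least one model.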

A Distributed Model-Free Algorithm for Multi-hop Ride-sharing using Deep Reinforcement Learning

Authors:Ashutosh Singh, Abubakr Alabbasi, Vaneet Aggarwal
Date:2019-10-30 17:40:32

The growth of autonomous vehicles, ridesharing systems, and self-driving technology will bring a shift in the way ride hailing platforms plan out their services. However, these advances in technology, coupled with road congestion, environmental concerns, fuel usage, vehicle emissions, and the high cost of vehicle usage, have brought more attention to better utilizing vehicles and their capacities. In this paper, we propose a novel multi-hop ride-sharing (MHRS) algorithm that uses deep reinforcement learning to learn optimal vehicle dispatch and matching decisions by interacting with the external environment. By allowing customers to transfer between vehicles, i.e., ride with one vehicle for some time and then transfer to another one, MHRS helps in attaining 30% lower cost and 20% more efficient utilization of fleets, as compared to the ride-sharing algorithms. This flexibility of the multi-hop feature gives a seamless experience to customers and ride-sharing companies, and thus improves ride-sharing services.

Learning Algorithmic Solutions to Symbolic Planning Tasks with a Neural Computer Architecture

Authors:Daniel Tanneberg, Elmar Rueckert, Jan Peters
Date:2019-10-30 17:02:13

A key feature of intelligent behavior is the ability to learn abstract strategies that transfer to unfamiliar problems. Therefore, we present a novel architecture, based on memory-augmented networks, that is inspired by the von Neumann and Harvard architectures of modern computers. This architecture enables the learning of abstract algorithmic solutions via Evolution Strategies in a reinforcement learning setting. Applied to Sokoban, sliding block puzzle and robotic manipulation tasks, we show that the architecture can learn algorithmic solutions with strong generalization and abstraction: scaling to arbitrary task configurations and complexities, and being independent of both the data representation and the task domain.

Navigation Agents for the Visually Impaired: A Sidewalk Simulator and Experiments

Authors:Martin Weiss, Simon Chamorro, Roger Girgis, Margaux Luck, Samira E. Kahou, Joseph P. Cohen, Derek Nowrouzezahrai, Doina Precup, Florian Golemo, Chris Pal
Date:2019-10-29 13:23:02

Millions of blind and visually-impaired (BVI) people navigate urban environments every day, using smartphones for high-level path-planning and white canes or guide dogs for local information. However, many BVI people still struggle to travel to new places. In our endeavor to create a navigation assistant for the BVI, we found that existing Reinforcement Learning (RL) environments were unsuitable for the task. This work introduces SEVN, a sidewalk simulation environment and a neural network-based approach to creating a navigation agent. SEVN contains panoramic images with labels for house numbers, doors, and street name signs, and formulations for several navigation tasks. We study the performance of an RL algorithm (PPO) in this setting. Our policy model fuses multi-modal observations in the form of variable resolution images, visible text, and simulated GPS data to navigate to a goal door. We hope that this dataset, simulator, and experimental results will provide a foundation for further research into the creation of agents that can assist members of the BVI community with outdoor navigation.

Entity Abstraction in Visual Model-Based Reinforcement Learning

Authors:Rishi Veerapaneni, John D. Co-Reyes, Michael Chang, Michael Janner, Chelsea Finn, Jiajun Wu, Joshua B. Tenenbaum, Sergey Levine
Date:2019-10-28 17:37:46

This paper tests the hypothesis that modeling a scene in terms of entities and their local interactions, as opposed to modeling the scene globally, provides a significant benefit in generalizing to physical tasks in a combinatorial space the learner has not encountered before. We present object-centric perception, prediction, and planning (OP3), which to the best of our knowledge is the first fully probabilistic entity-centric dynamic latent variable framework for model-based reinforcement learning that acquires entity representations from raw visual observations without supervision and uses them to predict and plan. OP3 enforces entity-abstraction -- symmetric processing of each entity representation with the same locally-scoped function -- which enables it to scale to model different numbers and configurations of objects from those in training. Our approach to solving the key technical challenge of grounding these entity representations to actual objects in the environment is to frame this variable binding problem as an inference problem, and we develop an interactive inference algorithm that uses temporal continuity and interactive feedback to bind information about object properties to the entity variables. On block-stacking tasks, OP3 generalizes to novel block configurations and more objects than observed during training, outperforming an oracle model that assumes access to object supervision and achieving two to three times better accuracy than a state-of-the-art video prediction model that does not exhibit entity abstraction.

Learning Q-network for Active Information Acquisition

Authors:Heejin Jeong, Brent Schlotfeldt, Hamed Hassani, Manfred Morari, Daniel D. Lee, George J. Pappas
Date:2019-10-23 18:21:16

In this paper, we propose a novel Reinforcement Learning approach for solving the Active Information Acquisition problem, which requires an agent to choose a sequence of actions in order to acquire information about a process of interest using on-board sensors. The classic challenges in the information acquisition problem are the dependence of a planning algorithm on known models and the difficulty of computing information-theoretic cost functions over arbitrary distributions. In contrast, the proposed framework of reinforcement learning does not require any knowledge on models and alleviates the problems during an extended training stage. It results in policies that are efficient to execute online and applicable for real-time control of robotic systems. Furthermore, the state-of-the-art planning methods are typically restricted to short horizons, which may become problematic with local minima. Reinforcement learning naturally handles the issue of planning horizon in information problems as it maximizes a discounted sum of rewards over a long finite or infinite time horizon. We discuss the potential benefits of the proposed framework and compare the performance of the novel algorithm to an existing information acquisition method for multi-target tracking scenarios.

Self-Supervised Sim-to-Real Adaptation for Visual Robotic Manipulation

Authors:Rae Jeong, Yusuf Aytar, David Khosid, Yuxiang Zhou, Jackie Kay, Thomas Lampe, Konstantinos Bousmalis, Francesco Nori
Date:2019-10-21 16:00:53

Collecting and automatically obtaining reward signals from real robotic visual data for the purposes of training reinforcement learning algorithms can be quite challenging and time-consuming. Methods for utilizing unlabeled data can have a huge potential to further accelerate robotic learning. We consider here the problem of performing manipulation tasks from pixels. In such tasks, choosing an appropriate state representation is crucial for planning and control. This is even more relevant with real images where noise, occlusions and resolution affect the accuracy and reliability of state estimation. In this work, we learn a latent state representation implicitly with deep reinforcement learning in simulation, and then adapt it to the real domain using unlabeled real robot data. We propose to do so by optimizing sequence-based self-supervised objectives. These exploit the temporal nature of robot experience, and can be common in both the simulated and real domains, without assuming any alignment of underlying states in simulated and unlabeled real images. We propose the Contrastive Forward Dynamics loss, which combines dynamics model learning with time-contrastive techniques. The learned state representation that results from our methods can be used to robustly solve a manipulation task in simulation and to successfully transfer the learned skill to a real system. We demonstrate the effectiveness of our approaches by training a vision-based reinforcement learning agent for cube stacking. Agents trained with our method, using only 5 hours of unlabeled real robot data for adaptation, show a clear improvement over domain randomization and standard visual domain adaptation techniques for sim-to-real transfer.

A Survey of Deep Learning Techniques for Autonomous Driving

Authors:Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, Gigel Macesanu
Date:2019-10-17 07:05:28

The last decade witnessed increasingly rapid progress in self-driving vehicle technology, mainly backed up by advances in the area of deep learning and artificial intelligence. The objective of this paper is to survey the current state-of-the-art on deep learning technologies used in autonomous driving. We start by presenting AI-based self-driving architectures, convolutional and recurrent neural networks, as well as the deep reinforcement learning paradigm. These methodologies form a base for the surveyed driving scene perception, path planning, behavior arbitration and motion control algorithms. We investigate both the modular perception-planning-action pipeline, where each module is built using deep learning methods, as well as End2End systems, which directly map sensory information to steering commands. Additionally, we tackle current challenges encountered in designing AI architectures for autonomous driving, such as their safety, training data sources and computational hardware. The comparison presented in this survey helps to gain insight into the strengths and limitations of deep learning and AI approaches for autonomous driving and assists with design choices.

On the Expressivity of Neural Networks for Deep Reinforcement Learning

Authors:Kefan Dong, Yuping Luo, Tengyu Ma
Date:2019-10-14 06:17:49

We compare the model-free reinforcement learning with the model-based approaches through the lens of the expressive power of neural networks for policies, $Q$-functions, and dynamics. We show, theoretically and empirically, that even for one-dimensional continuous state space, there are many MDPs whose optimal $Q$-functions and policies are much more complex than the dynamics. We hypothesize many real-world MDPs also have a similar property. For these MDPs, model-based planning is a favorable algorithm, because the resulting policies can approximate the optimal policy significantly better than a neural network parameterization can, and model-free or model-based policy optimization rely on policy parameterization. Motivated by the theory, we apply a simple multi-step model-based bootstrapping planner (BOOTS) to bootstrap a weak $Q$-function into a stronger policy. Empirical results show that applying BOOTS on top of model-based or model-free policy optimization algorithms at the test time improves the performance on MuJoCo benchmark tasks.
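
The idea of bootstrapping a weak $Q$-function into a stronger policy by multi-step lookahead through a dynamics model can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' BOOTS implementation: the five-state chain, the known deterministic `step` model, the all-zero `weak_q`, and exhaustive enumeration of action sequences.

```python
from itertools import product

GAMMA = 0.9
ACTIONS = (-1, +1)
GOAL = 4  # states are 0..4 on a chain

def step(state, action):
    """Hypothetical deterministic model: reward 1 on first entering GOAL."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if (nxt == GOAL and state != GOAL) else 0.0
    return nxt, reward

def weak_q(state, action):
    # A deliberately uninformative Q-function.
    return 0.0

def lookahead_action(state, horizon):
    """Return the first action of the best `horizon`-step plan, scoring each
    plan by its discounted model rewards plus the bootstrapped Q at the leaf."""
    best_value, best_first = float("-inf"), None
    for plan in product(ACTIONS, repeat=horizon):
        s, total, discount = state, 0.0, 1.0
        for a in plan:
            s, r = step(s, a)
            total += discount * r
            discount *= GAMMA
        total += discount * max(weak_q(s, a) for a in ACTIONS)
        if total > best_value:
            best_value, best_first = total, plan[0]
    return best_first
```

From state 2 the goal is two steps away, so a one-step lookahead with this uninformative `weak_q` sees no difference between actions, while `lookahead_action(2, horizon=2)` finds the plan that reaches the goal and moves right. The policy improves even though the Q-function itself is unchanged.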

Extracting Incentives from Black-Box Decisions

Authors:Yonadav Shavit, William S. Moses
Date:2019-10-13 01:17:29

An algorithmic decision-maker incentivizes people to act in certain ways to receive better decisions. These incentives can dramatically influence subjects' behaviors and lives, and it is important that both decision-makers and decision-recipients have clarity on which actions are incentivized by the chosen model. While for linear functions, the changes a subject is incentivized to make may be clear, we prove that for many non-linear functions (e.g. neural networks, random forests), classical methods for interpreting the behavior of models (e.g. input gradients) provide poor advice to individuals on which actions they should take. In this work, we propose a mathematical framework for understanding algorithmic incentives as the challenge of solving a Markov Decision Process, where the state includes the set of input features, and the reward is a function of the model's output. We can then leverage the many toolkits for solving MDPs (e.g. tree-based planning, reinforcement learning) to identify the optimal actions each individual is incentivized to take to improve their decision under a given model. We demonstrate the utility of our method by estimating the maximally-incentivized actions in two real-world settings: a recidivism risk predictor we train using ProPublica's COMPAS dataset, and an online credit scoring tool published by the Fair Isaac Corporation (FICO).

Regularizing Model-Based Planning with Energy-Based Models

Authors:Rinu Boney, Juho Kannala, Alexander Ilin
Date:2019-10-12 08:29:24

Model-based reinforcement learning could enable sample-efficient learning by quickly acquiring rich knowledge about the world and using it to improve behaviour without additional data. Learned dynamics models can be directly used for planning actions but this has been challenging because of inaccuracies in the learned models. In this paper, we focus on planning with learned dynamics models and propose to regularize it using energy estimates of state transitions in the environment. We visually demonstrate the effectiveness of the proposed method and show that off-policy training of an energy estimator can be effectively used to regularize planning with pre-trained dynamics models. Further, we demonstrate that the proposed method enables sample-efficient learning to achieve competitive performance in challenging continuous control tasks such as Half-cheetah and Ant in just a few minutes of experience.

Machine learning strategies for path-planning microswimmers in turbulent flows

Authors:Jaya Kumar Alageshan, Akhilesh Kumar Verma, Jérémie Bec, Rahul Pandit
Date:2019-10-03 21:47:08

We develop an adversarial-reinforcement learning scheme for microswimmers in statistically homogeneous and isotropic turbulent fluid flows, in both two (2D) and three dimensions (3D). We show that this scheme allows microswimmers to find non-trivial paths, which enable them to reach a target on average in less time than a naive microswimmer, which tries, at any instant of time and at a given position in space, to swim in the direction of the target. We use pseudospectral direct numerical simulations (DNSs) of the 2D and 3D (incompressible) Navier-Stokes equations to obtain the turbulent flows. We then introduce passive microswimmers that try to swim along a given direction in these flows; the microswimmers do not affect the flow, but they are advected by it.

Review of Learning-based Longitudinal Motion Planning for Autonomous Vehicles: Research Gaps between Self-driving and Traffic Congestion

Authors:Hao Zhou, Jorge Laval, Anye Zhou, Yu Wang, Wenchao Wu, Zhu Qing, Srinivas Peeta
Date:2019-10-02 19:19:48

Self-driving technology companies and the research community are accelerating their pace to use machine learning longitudinal motion planning (mMP) for autonomous vehicles (AVs). This paper reviews the current state of the art in mMP, with an exclusive focus on its impact on traffic congestion. We identify the availability of congestion scenarios in current datasets, and summarize the required features for training mMP. For learning methods, we survey the major methods in both imitation learning and non-imitation learning. We also highlight the emerging technologies adopted by some leading AV companies, e.g. Tesla, Waymo, and Comma.ai. We find that: i) the AV industry has been mostly focusing on the long tail problem related to safety and has overlooked the impact on traffic congestion, ii) the current public self-driving datasets have not included enough congestion scenarios, and mostly lack the necessary input features/output labels to train mMP, and iii) although the reinforcement learning (RL) approach can integrate congestion mitigation into the learning goal, the major mMP method adopted by industry is still behavior cloning (BC), whose capability to learn a congestion-mitigating mMP remains to be seen. Based on the review, the study identifies the research gaps in current mMP development. Some suggestions towards congestion mitigation for future mMP studies are proposed: i) enrich data collection to facilitate congestion learning, ii) incorporate non-imitation learning methods to combine traffic efficiency into a safety-oriented technical route, and iii) integrate domain knowledge from the traditional car following (CF) theory to improve the string stability of mMP.

End-to-End Motion Planning of Quadrotors Using Deep Reinforcement Learning

Authors:Efe Camci, Erdal Kayacan
Date:2019-09-30 11:31:59

In this work, a novel, end-to-end motion planning method is proposed for quadrotor navigation in cluttered environments. In contrast to conventional navigation algorithms, the proposed method circumvents the explicit sensing-reconstructing-planning pipeline. It uses raw depth images obtained from a front-facing camera and directly generates local motion plans in the form of smooth motion primitives that move a quadrotor to a goal by avoiding obstacles. Promising training and testing results are presented in both AirSim simulations and real flights with a DJI F330 quadrotor equipped with an Intel RealSense D435. The proposed system in action can be seen at https://youtu.be/pYvKhc8wrTM.

Relational Graph Learning for Crowd Navigation

Authors:Changan Chen, Sha Hu, Payam Nikdel, Greg Mori, Manolis Savva
Date:2019-09-28 22:31:46

We present a relational graph learning approach for robotic crowd navigation using model-based deep reinforcement learning that plans actions by looking into the future. Our approach reasons about the relations between all agents based on their latent features and uses a Graph Convolutional Network to encode higher-order interactions in each agent's state representation, which is subsequently leveraged for state prediction and value estimation. The ability to predict human motion allows us to perform multi-step lookahead planning, taking into account the temporal evolution of human crowds. We evaluate our approach against a state-of-the-art baseline for crowd navigation and ablations of our model to demonstrate that navigation with our approach is more efficient, results in fewer collisions, and avoids failure cases involving oscillatory and freezing behaviors.

Playing Atari Ball Games with Hierarchical Reinforcement Learning

Authors:Hua Huang, Adrian Barbu
Date:2019-09-27 02:09:34

Human beings are particularly good at reasoning and inference from just a few examples. When facing new tasks, humans will leverage knowledge and skills learned before, and quickly integrate them with the new task. In addition to learning by experimentation, humans also learn socio-culturally through instructions and learning by example. In this way humans can learn much faster compared with most current artificial intelligence algorithms in many tasks. In this paper, we test the idea of speeding up machine learning through social learning. We argue that in solving real-world problems, especially when the task is designed by humans, and/or for humans, there are typically instructions from user manuals and/or human experts which give guidelines on how to better accomplish the tasks. We argue that these instructions have tremendous value in designing a reinforcement learning system which can learn in a human fashion, and we test the idea by playing the Atari games Tennis and Pong. We experimentally demonstrate that the instructions provide key information about the task, which can be used to decompose the learning task into sub-systems and construct options for temporally extended planning, and dramatically accelerate the learning process.

Learning Generalizable Locomotion Skills with Hierarchical Reinforcement Learning

Authors:Tianyu Li, Nathan Lambert, Roberto Calandra, Franziska Meier, Akshara Rai
Date:2019-09-26 18:21:12

Learning to locomote to arbitrary goals on hardware remains a challenging problem for reinforcement learning. In this paper, we present a hierarchical learning framework that improves sample-efficiency and generalizability of locomotion skills on real-world robots. Our approach divides the problem of goal-oriented locomotion into two sub-problems: learning diverse primitive skills, and using model-based planning to sequence these skills. We parametrize our primitives as cyclic movements, improving sample-efficiency of learning on an 18-degree-of-freedom robot. Then, we learn coarse dynamics models over primitive cycles and use them in a model predictive control framework. This allows us to learn to walk to arbitrary goals up to 12m away, after about two hours of training from scratch on hardware. Our results on Daisy hexapod hardware and in simulation demonstrate the efficacy of our approach at reaching distant targets, in different environments and with sensory noise.

RLBench: The Robot Learning Benchmark & Learning Environment

Authors:Stephen James, Zicong Ma, David Rovick Arrojo, Andrew J. Davison
Date:2019-09-26 17:26:18

We present a challenging new benchmark and learning-environment for robot learning: RLBench. The benchmark features 100 completely unique, hand-designed tasks ranging in difficulty, from simple target reaching and door opening, to longer multi-stage tasks, such as opening an oven and placing a tray in it. We provide an array of both proprioceptive observations and visual observations, which include rgb, depth, and segmentation masks from an over-the-shoulder stereo camera and an eye-in-hand monocular camera. Uniquely, each task comes with an infinite supply of demos through the use of motion planners operating on a series of waypoints given during task creation time; enabling an exciting flurry of demonstration-based learning. RLBench has been designed with scalability in mind; new tasks, along with their motion-planned demos, can be easily created and then verified by a series of tools, allowing users to submit their own tasks to the RLBench task repository. This large-scale benchmark aims to accelerate progress in a number of vision-guided manipulation research areas, including: reinforcement learning, imitation learning, multi-task learning, geometric computer vision, and in particular, few-shot learning. With the benchmark's breadth of tasks and demonstrations, we propose the first large-scale few-shot challenge in robotics. We hope that the scale and diversity of RLBench offers unparalleled research opportunities in the robot learning community and beyond.

Harnessing Structures for Value-Based Planning and Reinforcement Learning

Authors:Yuzhe Yang, Guo Zhang, Zhi Xu, Dina Katabi
Date:2019-09-26 17:01:23

Value-based methods constitute a fundamental methodology in planning and deep reinforcement learning (RL). In this paper, we propose to exploit the underlying structures of the state-action value function, i.e., Q function, for both planning and deep RL. In particular, if the underlying system dynamics lead to some global structures of the Q function, one should be capable of inferring the function better by leveraging such structures. Specifically, we investigate the low-rank structure, which widely exists for big data matrices. We verify empirically the existence of low-rank Q functions in the context of control and deep RL tasks. As our key contribution, by leveraging Matrix Estimation (ME) techniques, we propose a general framework to exploit the underlying low-rank structure in Q functions. This leads to a more efficient planning procedure for classical control, and additionally, a simple scheme that can be applied to any value-based RL techniques to consistently achieve better performance on "low-rank" tasks. Extensive experiments on control tasks and Atari games confirm the efficacy of our approach. Code is available at https://github.com/YyzHarry/SV-RL.
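
To make the low-rank idea concrete, the following sketch fills in a partially observed Q-table with a simple iterative truncated-SVD imputation. This is a generic matrix-completion heuristic swapped in for illustration, not necessarily the ME solver used in the paper; the function name `complete_low_rank`, the synthetic rank-2 table, and the 70% observation rate are all assumptions.

```python
import numpy as np

def complete_low_rank(q_observed, mask, rank, n_iters=300):
    """Estimate a full Q-table from the entries where `mask` is True.

    Repeatedly projects onto rank-`rank` matrices via truncated SVD while
    keeping the observed entries fixed (requires n_iters >= 1).
    """
    estimate = np.where(mask, q_observed, 0.0)
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(estimate, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Observed entries are trusted; only the missing ones are imputed.
        estimate = np.where(mask, q_observed, low_rank)
    return low_rank


# Synthetic example: a 40-state, 20-action Q-table of rank 2 (hypothetical data).
rng = np.random.default_rng(0)
true_q = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 20))
mask = rng.random(true_q.shape) < 0.7          # roughly 70% of entries observed
estimated_q = complete_low_rank(true_q * mask, mask, rank=2)
relative_error = np.linalg.norm(estimated_q - true_q) / np.linalg.norm(true_q)
```

The design point is that only a subset of state-action values needs to be computed or sampled; the remaining entries are inferred from the global structure of the table.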

Visual Exploration and Energy-aware Path Planning via Reinforcement Learning

Authors:Amir Niaraki, Jeremy Roghair, Ali Jannesari
Date:2019-09-26 16:15:37

Visual exploration and smart data collection via autonomous vehicles is an attractive topic in various disciplines. Disturbances like wind significantly influence both the power consumption of the flying robots and the performance of the camera. We propose a reinforcement learning approach which combines the effects of the power consumption and the object detection modules to develop a policy for object detection in large areas with limited battery life. The learning model enables dynamic learning of the negative rewards of each action based on the drag forces that result from the motion of the flying robot with respect to the wind field. The algorithm is implemented in a near-real world simulation environment both for planar motion and for flight at different altitudes. The trained agent often performed a trade-off between detecting the objects with high accuracy and increasing the area coverage within its battery life. The developed exploration policy outperformed the complete coverage algorithm by minimizing the traveled path while finding the target objects. The performance of the algorithms under various wind fields was evaluated in planar and 3D motion. During an exploration task with sparsely distributed goals and within a UAV's battery life, the proposed architecture could detect more than twice the number of goal objects compared to the coverage path planning algorithm in a moderate wind field. In high wind intensities, the energy-aware algorithm could detect 4 times the number of goal objects when compared to its complete coverage counterpart.

A Layered Architecture for Active Perception: Image Classification using Deep Reinforcement Learning

Authors:Hossein K. Mousavi, Guangyi Liu, Weihang Yuan, Martin Takáč, Héctor Muñoz-Avila, Nader Motee
Date:2019-09-20 19:52:41

We propose a planning and perception mechanism for a robot (agent), that can only observe the underlying environment partially, in order to solve an image classification problem. A three-layer architecture is suggested that consists of a meta-layer that decides the intermediate goals, an action-layer that selects local actions as the agent navigates towards a goal, and a classification-layer that evaluates the reward and makes a prediction. We design and implement these layers using deep reinforcement learning. A generalized policy gradient algorithm is utilized to learn the parameters of these layers to maximize the expected reward. Our proposed methodology is tested on the MNIST dataset of handwritten digits, which provides us with a level of explainability while interpreting the agent's intermediate goals and course of action.

Reconnaissance and Planning algorithm for constrained MDP

Authors:Shin-ichi Maeda, Hayato Watahiki, Shintarou Okada, Masanori Koyama
Date:2019-09-20 14:44:36

Practical reinforcement learning problems are often formulated as constrained Markov decision process (CMDP) problems, in which the agent has to maximize the expected return while satisfying a set of prescribed safety constraints. In this study, we propose a novel simulator-based method to approximately solve a CMDP problem without making any compromise on the safety constraints. We achieve this by decomposing the CMDP into a pair of MDPs: a reconnaissance MDP (RMDP) and a planning MDP. The purpose of the reconnaissance MDP is to evaluate the set of actions that are safe, and the purpose of the planning MDP is to maximize the return while using the actions authorized by the reconnaissance MDP. The RMDP can define a set of safe policies for any given set of safety constraints, and this set of safe policies can be used to solve another CMDP problem with a different reward. Our method is not only computationally less demanding than the previous simulator-based approaches to CMDP, but also capable of finding a competitive reward-seeking policy in a high dimensional environment, including those involving multiple moving obstacles.
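
The two-stage decomposition can be sketched on a tiny tabular problem. The abstract does not say how safety is evaluated, so this sketch substitutes a deterministic reachability check (an action is unsafe if it makes reaching a hazard unavoidable); the toy MDP and the names `TRANSITIONS`, `HAZARDS`, `reconnaissance` and `plan` are hypothetical.

```python
# Hypothetical deterministic toy MDP: TRANSITIONS[state][action] = (next_state, reward).
TRANSITIONS = {
    "start": {"shortcut": ("ledge", 0.0), "detour": ("road", 0.0)},
    "ledge": {"jump": ("pit", 5.0)},
    "road": {"walk": ("goal", 1.0)},
    "pit": {"climb": ("goal", 0.0)},
    "goal": {},
}
HAZARDS = {"pit"}

def reconnaissance(transitions, hazards, n_sweeps=50):
    """Stage 1: threat[s][a] is 1.0 if taking a in s makes a hazard unavoidable."""
    doomed = {s: (1.0 if s in hazards else 0.0) for s in transitions}
    for _ in range(n_sweeps):
        for s, acts in transitions.items():
            if s not in hazards and acts:
                # A state is doomed only if every action leads to a doomed state.
                doomed[s] = min(doomed[nxt] for nxt, _ in acts.values())
    return {s: {a: doomed[nxt] for a, (nxt, _) in acts.items()}
            for s, acts in transitions.items()}

def plan(transitions, threat=None, gamma=0.9, n_sweeps=50):
    """Stage 2: value iteration, restricted to zero-threat actions if threat is given."""
    def allowed(s):
        return {a: out for a, out in transitions[s].items()
                if threat is None or threat[s][a] == 0.0}

    value = {s: 0.0 for s in transitions}
    for _ in range(n_sweeps):
        for s in transitions:
            acts = allowed(s)
            if acts:
                value[s] = max(r + gamma * value[nxt] for nxt, r in acts.values())

    policy = {}
    for s in transitions:
        acts = allowed(s)
        if acts:
            policy[s] = max(acts, key=lambda a: acts[a][1] + gamma * value[acts[a][0]])
    return policy

threat = reconnaissance(TRANSITIONS, HAZARDS)
unconstrained_policy = plan(TRANSITIONS)
safe_policy = plan(TRANSITIONS, threat)
```

Because `threat` depends only on the transitions and the hazard set, it can be reused unchanged when the rewards are replaced, which mirrors the reuse of the safe set described in the abstract.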

DeepGait: Planning and Control of Quadrupedal Gaits using Deep Reinforcement Learning

Authors:Vassilios Tsounis, Mitja Alge, Joonho Lee, Farbod Farshidian, Marco Hutter
Date:2019-09-18 12:36:58

This paper addresses the problem of legged locomotion in non-flat terrain. As legged robots such as quadrupeds are to be deployed in terrains with geometries which are difficult to model and predict, the need arises to equip them with the capability to generalize well to unforeseen situations. In this work, we propose a novel technique for training neural-network policies for terrain-aware locomotion, which combines state-of-the-art methods for model-based motion planning and reinforcement learning. Our approach is centered on formulating Markov decision processes using the evaluation of dynamic feasibility criteria in place of physical simulation. We thus employ policy-gradient methods to independently train policies which respectively plan and execute foothold and base motions in 3D environments using both proprioceptive and exteroceptive measurements. We apply our method within a challenging suite of simulated terrain scenarios which contain features such as narrow bridges, gaps and stepping-stones, and train policies which succeed in locomoting effectively in all cases.

A Human-Centered Data-Driven Planner-Actor-Critic Architecture via Logic Programming

Authors:Daoming Lyu, Fangkai Yang, Bo Liu, Steven Gustafson
Date:2019-09-18 07:06:06

Recent successes of Reinforcement Learning (RL) allow an agent to learn policies that surpass human experts, but RL suffers from being time-hungry and data-hungry. By contrast, human learning is significantly faster because prior and general knowledge and multiple information resources are utilized. In this paper, we propose a Planner-Actor-Critic architecture for huMAN-centered planning and learning (PACMAN), where an agent uses its prior, high-level, deterministic symbolic knowledge to plan for goal-directed actions, and also integrates the Actor-Critic algorithm of RL to fine-tune its behavior towards both environmental rewards and human feedback. This work is the first unified framework where knowledge-based planning, RL, and human teaching jointly contribute to the policy learning of an agent. Our experiments demonstrate that PACMAN leads to a significant jump-start at the early stage of learning, converges rapidly and with small variance, and is robust to inconsistent, infrequent, and misleading feedback.

Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning

Authors:Alper Kamil Bozkurt, Yu Wang, Michael M. Zavlanos, Miroslav Pajic
Date:2019-09-16 15:56:32

We present a reinforcement learning (RL) framework to synthesize a control policy from a given linear temporal logic (LTL) specification in an unknown stochastic environment that can be modeled as a Markov Decision Process (MDP). Specifically, we learn a policy that maximizes the probability of satisfying the LTL formula without learning the transition probabilities. We introduce a novel rewarding and path-dependent discounting mechanism based on the LTL formula such that (i) an optimal policy maximizing the total discounted reward effectively maximizes the probabilities of satisfying LTL objectives, and (ii) a model-free RL algorithm using these rewards and discount factors is guaranteed to converge to such a policy. Finally, we illustrate the applicability of our RL-based synthesis approach on two motion planning case studies.

Selective Network Discovery via Deep Reinforcement Learning on Embedded Spaces

Authors:Peter Morales, Rajmonda Sulo Caceres, Tina Eliassi-Rad
Date:2019-09-16 15:51:27

Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem in an incomplete network setting as a sequential decision making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called Network Actor Critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on several synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.

Dynamic Pricing and Fleet Management for Electric Autonomous Mobility on Demand Systems

Authors:Berkay Turan, Ramtin Pedarsani, Mahnoosh Alizadeh
Date:2019-09-16 03:06:47

The proliferation of ride sharing systems is a major driver in the advancement of autonomous and electric vehicle technologies. This paper considers the joint routing, battery charging, and pricing problem faced by a profit-maximizing transportation service provider that operates a fleet of autonomous electric vehicles. We first establish the static planning problem by considering time-invariant system parameters and determine the optimal static policy. While the static policy provides stability of customer queues waiting for rides even if we consider the system dynamics, we see that it is inefficient to utilize a static policy as it can lead to long wait times for customers and low profits. To accommodate the stochastic nature of trip demands, renewable energy availability, and electricity prices and to further optimally manage the autonomous fleet given the need to generate integer allocations, a real-time policy is required. The optimal real-time policy that executes actions based on full state information of the system is the solution of a complex dynamic program. However, we argue that it is intractable to exactly solve for the optimal policy using exact dynamic programming methods and therefore apply deep reinforcement learning to develop a near-optimal control policy. The two case studies we conducted in Manhattan and San Francisco demonstrate the efficacy of our real-time policy in terms of network stability and profits, while keeping the queue lengths up to 200 times less than the static policy.

Off-road Autonomous Vehicles Traversability Analysis and Trajectory Planning Based on Deep Inverse Reinforcement Learning

Authors:Zeyu Zhu, Nan Li, Ruoyu Sun, Huijing Zhao, Donghao Xu
Date:2019-09-16 02:46:02

Terrain traversability analysis is a fundamental issue in achieving robot autonomy in off-road environments. Geometry-based and appearance-based methods have been studied for decades, while behavior-based methods exploiting learning from demonstration (LfD) are a new trend. Behavior-based methods learn cost functions that guide trajectory planning in compliance with experts' demonstrations, which can be more scalable to various scenes and driving behaviors. This research proposes a method of off-road traversability analysis and trajectory planning using Deep Maximum Entropy Inverse Reinforcement Learning. To incorporate the vehicle's kinematics while solving the problem of exponential increase of state-space complexity, two convolutional neural networks, i.e., RL ConvNet and Svf ConvNet, are developed to encode kinematics into convolution kernels and achieve efficient forward reinforcement learning. We conduct experiments in off-road environments. Scene maps are generated using 3D LiDAR data, and expert demonstrations are either the vehicle's real driving trajectories at the scene or synthesized ones to represent specific behaviors such as crossing negative obstacles. Different cost functions for traversability analysis are learned and tested at various scenes for their capability in guiding the trajectory planning of different behaviors. We also demonstrate the performance and computation efficiency of the proposed method.

Model Based Planning with Energy Based Models

Authors:Yilun Du, Toru Lin, Igor Mordatch
Date:2019-09-15 20:28:03

Model-based planning holds great promise for improving both sample efficiency and generalization in reinforcement learning (RL). We show that energy-based models (EBMs) are a promising class of models to use for model-based planning. EBMs naturally support inference of intermediate states given start and goal state distributions. We provide an online algorithm to train EBMs while interacting with the environment, and show that EBMs allow for significantly better online learning than corresponding feed-forward networks. We further show that EBMs support maximum entropy state inference and are able to generate diverse state space plans. We show that inference purely in state space - without planning actions - allows for better generalization to previously unseen obstacles in the environment and prevents the planner from exploiting the dynamics model by applying uncharacteristic action sequences. Finally, we show that online EBM training naturally leads to intentionally planned state exploration which performs significantly better than random exploration.

Driving in Dense Traffic with Model-Free Reinforcement Learning

Authors:Dhruv Mauria Saxena, Sangjae Bae, Alireza Nakhaei, Kikuo Fujimura, Maxim Likhachev
Date:2019-09-15 01:59:10

Traditional planning and control methods could fail to find a feasible trajectory for an autonomous vehicle to execute amongst dense traffic on roads. This is because the obstacle-free volume in spacetime is very small in these scenarios for the vehicle to drive through. However, that does not mean the task is infeasible since human drivers are known to be able to drive amongst dense traffic by leveraging the cooperativeness of other drivers to open a gap. The traditional methods fail to take into account the fact that the actions taken by an agent affect the behaviour of other vehicles on the road. In this work, we rely on the ability of deep reinforcement learning to implicitly model such interactions and learn a continuous control policy over the action space of an autonomous vehicle. The application we consider requires our agent to negotiate and open a gap in the road in order to successfully merge or change lanes. Our policy learns to repeatedly probe into the target road lane while trying to find a safe spot to move into. We compare against two model-predictive control-based algorithms and show that our policy outperforms them in simulation.

Flight Controller Synthesis Via Deep Reinforcement Learning

Authors:William Koch
Date:2019-09-14 00:35:21

Traditional control methods are inadequate in many deployment settings involving control of Cyber-Physical Systems (CPS). In such settings, CPS controllers must operate and respond to unpredictable interactions, conditions, or failure modes. Dealing with such unpredictability requires the use of executive and cognitive control functions that allow for planning and reasoning. Motivated by the sport of drone racing, this dissertation addresses these concerns for state-of-the-art flight control by investigating the use of deep neural networks to bring essential elements of higher-level cognition for constructing low level flight controllers. This thesis reports on the development and release of an open source, full solution stack for building neuro-flight controllers. This stack consists of the methodology for constructing a multicopter digital twin for synthesizing the flight controller unique to a specific aircraft, a tuning framework for implementing training environments (GymFC), and a firmware for the world's first neural network supported flight controller (Neuroflight). GymFC's novel approach fuses together the digital twinning paradigm for flight control training to provide seamless transfer to hardware. Additionally, this thesis examines alternative reward system functions as well as changes to the software environment to bridge the gap between the simulation and real world deployment environments. Work summarized in this thesis demonstrates that reinforcement learning can be leveraged to train neural network controllers capable not only of maintaining stable flight, but also of performing precision aerobatic maneuvers in real world settings. As such, this work provides a foundation for developing the next generation of flight control systems.

Petri Net Machines for Human-Agent Interaction

Authors:Christian Dondrup, Ioannis Papaioannou, Oliver Lemon
Date:2019-09-13 12:31:39

Smart speakers and robots become ever more prevalent in our daily lives. These agents are able to execute a wide range of tasks and actions and, therefore, need systems to control their execution. Current state-of-the-art such as (deep) reinforcement learning, however, requires vast amounts of data for training which is often hard to come by when interacting with humans. To overcome this issue, most systems still rely on Finite State Machines. We introduce Petri Net Machines which present a formal definition for state machines based on Petri Nets that are able to execute concurrent actions reliably, execute and interleave several plans at the same time, and provide an easy to use modelling language. We show their workings based on the example of Human-Robot Interaction in a shopping mall.

HJB Optimal Feedback Control with Deep Differential Value Functions and Action Constraints

Authors:Michael Lutter, Boris Belousov, Kim Listmann, Debora Clever, Jan Peters
Date:2019-09-13 11:34:40

Learning optimal feedback control laws capable of executing optimal trajectories is essential for many robotic applications. Such policies can be learned using reinforcement learning or planned using optimal control. While reinforcement learning is sample inefficient, optimal control only plans an optimal trajectory from a specific starting configuration. In this paper we propose deep optimal feedback control to learn an optimal feedback policy rather than a single trajectory. By exploiting the inherent structure of the robot dynamics and strictly convex action cost, we can derive principled cost functions such that the optimal policy naturally obeys the action limits, is globally optimal and stable on the training domain given the optimal value function. The corresponding optimal value function is learned end-to-end by embedding a deep differential network in the Hamilton-Jacobi-Bellman differential equation and minimizing the error of this equality while simultaneously decreasing the discounting from short- to far-sighted to enable the learning. Our proposed approach enables us to learn an optimal feedback control law in continuous time that, in contrast to existing approaches, generates an optimal trajectory from any point in state-space without the need for replanning. The resulting approach is evaluated on non-linear systems and achieves optimal feedback control, where standard optimal control methods require frequent replanning.

MAT: Multi-Fingered Adaptive Tactile Grasping via Deep Reinforcement Learning

Authors:Bohan Wu, Iretiayo Akinola, Jacob Varley, Peter Allen
Date:2019-09-10 23:02:04

Vision-based grasping systems typically adopt an open-loop execution of a planned grasp. This policy can fail due to many reasons, including ubiquitous calibration error. Recovery from a failed grasp is further complicated by visual occlusion, as the hand is usually occluding the vision sensor as it attempts another open-loop regrasp. This work presents MAT, a tactile closed-loop method capable of realizing grasps provided by a coarse initial positioning of the hand above an object. Our algorithm is a deep reinforcement learning (RL) policy optimized through the clipped surrogate objective within a maximum entropy RL framework to balance exploitation and exploration. The method utilizes tactile and proprioceptive information to act through both fine finger motions and larger regrasp movements to execute stable grasps. A novel curriculum of action motion magnitude makes learning more tractable and helps turn common failure cases into successes. Careful selection of features that exhibit small sim-to-real gaps enables this tactile grasping policy, trained purely in simulation, to transfer well to real world environments without the need for additional learning. Experimentally, this methodology improves over a vision-only grasp success rate substantially on a multi-fingered robot hand. When this methodology is used to realize grasps from coarse initial positions provided by a vision-only planner, the system is made dramatically more robust to calibration errors in the camera-robot transform.

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Authors:Lior Shani, Yonathan Efroni, Shie Mannor
Date:2019-09-06 08:43:38

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.

How to Build User Simulators to Train RL-based Dialog Systems

Authors:Weiyan Shi, Kun Qian, Xuewei Wang, Zhou Yu
Date:2019-09-03 18:22:24

User simulators are essential for training reinforcement learning (RL) based dialog models. The performance of the simulator directly impacts the RL policy. However, building a good user simulator that models real user behaviors is challenging. We propose a method of standardizing user simulator building that can be used by the community to compare dialog system quality using the same set of user simulators fairly. We present implementations of six user simulators trained with different dialog planning and generation methods. We then calculate a set of automatic metrics to evaluate the quality of these simulators both directly and indirectly. We also ask human users to assess the simulators directly and indirectly by rating the simulated dialogs and interacting with the trained systems. This paper presents a comprehensive evaluation framework for user simulator study and provides a better understanding of the pros and cons of different user simulators, as well as their impacts on the trained systems.

OpenSpiel: A Framework for Reinforcement Learning in Games

Authors:Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, Jonah Ryan-Davis
Date:2019-08-26 03:31:35

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games. OpenSpiel supports n-player (single- and multi- agent) zero-sum, cooperative and general-sum, one-shot and sequential, strictly turn-taking and simultaneous-move, perfect and imperfect information games, as well as traditional multiagent environments such as (partially- and fully- observable) grid worlds and social dilemmas. OpenSpiel also includes tools to analyze learning dynamics and other common evaluation metrics. This document serves both as an overview of the code base and an introduction to the terminology, core concepts, and algorithms across the fields of reinforcement learning, computational game theory, and search.

Sample-efficient Deep Reinforcement Learning with Imaginary Rollouts for Human-Robot Interaction

Authors:Mohammad Thabet, Massimiliano Patacchiola, Angelo Cangelosi
Date:2019-08-15 13:56:12

Deep reinforcement learning has proven to be a great success in allowing agents to learn complex tasks. However, its application to actual robots can be prohibitively expensive. Furthermore, the unpredictability of human behavior in human-robot interaction tasks can hinder convergence to a good policy. In this paper, we present an architecture that allows agents to learn models of stochastic environments and use them to accelerate learning. We describe how an environment model can be learned online and used to generate synthetic transitions, as well as how an agent can leverage these synthetic data to accelerate learning. We validate our approach using an experiment in which a robotic arm has to complete a task composed of a series of actions based on human gestures. Results show that our approach leads to significantly faster learning, requiring much less interaction with the environment. Furthermore, we demonstrate how learned models can be used by a robot to produce optimal plans in real world applications.

Model-based Lookahead Reinforcement Learning

Authors:Zhang-Wei Hong, Joni Pajarinen, Jan Peters
Date:2019-08-15 04:10:13

Model-based Reinforcement Learning (MBRL) allows data-efficient learning which is required in real world applications such as robotics. However, despite the impressive data-efficiency, MBRL does not achieve the final performance of state-of-the-art Model-free Reinforcement Learning (MFRL) methods. We leverage the strengths of both realms and propose an approach that obtains high performance with a small amount of data. In particular, we combine MFRL and Model Predictive Control (MPC). While MFRL's strength in exploration allows us to train a better forward dynamics model for MPC, MPC improves the performance of the MFRL policy by sampling-based planning. The experimental results in standard continuous control benchmarks show that our approach can achieve MFRL's level of performance while being as data-efficient as MBRL.
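
The idea of improving a policy through sampling-based planning can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes scalar states and actions, and the names `plan_action`, `policy`, `model`, and `reward` are hypothetical placeholders for a model-free policy, a learned forward dynamics model, and a known reward function. Candidate action sequences are generated by adding Gaussian noise to the policy's actions, each sequence is rolled out through the model, and the first action of the highest-return rollout is executed.

```python
import random


def plan_action(state, policy, model, reward,
                horizon=5, n_samples=64, noise=0.1, rng=None):
    """Return the first action of the best noisy policy rollout.

    policy(s) -> action, model(s, a) -> next state, reward(s, a) -> float.
    All three callables are assumptions made for this illustration.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    best_return, best_first = float("-inf"), None
    for _ in range(n_samples):
        s, total, first = state, 0.0, None
        for _ in range(horizon):
            # Perturb the policy's proposal to explore nearby actions.
            a = policy(s) + rng.gauss(0.0, noise)
            if first is None:
                first = a
            total += reward(s, a)
            s = model(s, a)  # predicted next state from the dynamics model
        if total > best_return:
            best_return, best_first = total, first
    return best_first
```

In a receding-horizon loop, `plan_action` would be called at every control step and only the returned action applied before replanning.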

Superstition in the Network: Deep Reinforcement Learning Plays Deceptive Games

Authors:Philip Bontrager, Ahmed Khalifa, Damien Anderson, Matthew Stephenson, Christoph Salge, Julian Togelius
Date:2019-08-12 23:27:26

Deep reinforcement learning has learned to play many games well, but failed on others. To better characterize the modes and reasons of failure of deep reinforcement learners, we test the widely used Advantage Actor-Critic (A2C) algorithm on four deceptive games, which are specially designed to provide challenges to game-playing agents. These games are implemented in the General Video Game AI framework, which allows us to compare the behavior of reinforcement learning-based agents with planning agents based on tree search. We find that several of these games reliably deceive deep reinforcement learners, and that the resulting behavior highlights the shortcomings of the learning algorithm. The particular ways in which agents fail differ from how planning-based agents fail, further illuminating the character of these algorithms. We propose an initial typology of deceptions which could help us better understand pitfalls and failure modes of (deep) reinforcement learning.

Learning to combine primitive skills: A step towards versatile robotic manipulation

Authors:Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid
Date:2019-08-02 07:04:17

Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. Traditional task and motion planning (TAMP) methods can solve complex tasks but require full state observability and are not adapted to dynamic scene changes. Recent learning methods can operate directly on visual inputs but typically require many demonstrations and/or task-specific reward engineering. In this work we aim to overcome previous limitations and propose a reinforcement learning (RL) approach to task planning that learns to combine primitive skills. First, compared to previous learning methods, our approach requires neither intermediate rewards nor complete task demonstrations during training. Second, we demonstrate the versatility of our vision-based task planning in challenging settings with temporary occlusions and dynamic scene changes. Third, we propose an efficient training of basic skills from few synthetic demonstrations by exploring recent CNN architectures and data augmentation. Notably, while all of our policies are learned on visual inputs in simulated environments, we demonstrate the successful transfer and high success rates when applying such policies to manipulation tasks on a real UR5 robotic arm.

Learning When to Drive in Intersections by Combining Reinforcement Learning and Model Predictive Control

Authors:Tommy Tram, Ivo Batkovic, Mohammad Ali, Jonas Sjöberg
Date:2019-08-01 02:00:49

In this paper, we propose a decision making algorithm intended for automated vehicles that negotiate with other possibly non-automated vehicles in intersections. The decision algorithm is separated into two parts: a high-level decision module based on reinforcement learning, and a low-level planning module based on model predictive control. Traffic is simulated with numerous predefined driver behaviors and intentions, and the performance of the proposed decision algorithm was evaluated against another controller. The results show that the proposed decision algorithm yields shorter training episodes and an increased performance in success rate compared to the other controller.

Learning to Solve a Rubik's Cube with a Dexterous Hand

Authors:Tingguang Li, Weitao Xi, Meng Fang, Jia Xu, Max Qing-Hu Meng
Date:2019-07-26 06:09:22

We present a learning-based approach to solving a Rubik's cube with a multi-fingered dexterous hand. Despite the promising performance of dexterous in-hand manipulation, solving complex tasks which involve multiple steps and diverse internal object structure has remained an important, yet challenging task. In this paper, we tackle this challenge with a hierarchical deep reinforcement learning method, which separates planning and manipulation. A model-based cube solver finds an optimal move sequence for restoring the cube and a model-free cube operator controls all five fingers to execute each move step by step. To train our models, we build a high-fidelity simulator which manipulates a Rubik's Cube, an object containing high-dimensional state space, with a 24-DoF robot hand. Extensive experiments on 1400 randomly scrambled Rubik's cubes demonstrate the effectiveness of our method, achieving an average success rate of 90.3%.

Action Guidance with MCTS for Deep Reinforcement Learning

Authors:Bilal Kartal, Pablo Hernandez-Leal, Matthew E. Taylor
Date:2019-07-25 19:19:42

Deep reinforcement learning has achieved great successes in recent years; however, one main challenge is sample inefficiency. In this paper, we focus on how to use action guidance by means of a non-expert demonstrator to improve sample efficiency in a domain with sparse, delayed, and possibly deceptive rewards: the recently-proposed multi-agent benchmark of Pommerman. We propose a new framework where even a non-expert simulated demonstrator, e.g., planning algorithms such as Monte Carlo tree search with a small number of rollouts, can be integrated within asynchronous distributed deep reinforcement learning methods. Compared to a vanilla deep RL algorithm, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.

Learning Hybrid Object Kinematics for Efficient Hierarchical Planning Under Uncertainty

Authors:Ajinkya Jain, Scott Niekum
Date:2019-07-21 18:13:52

Sudden changes in the dynamics of robotic tasks, such as contact with an object or the latching of a door, are often viewed as inconvenient discontinuities that make manipulation difficult. However, when these transitions are well-understood, they can be leveraged to reduce uncertainty or aid manipulation---for example, wiggling a screw to determine if it is fully inserted or not. Current model-free reinforcement learning approaches require large amounts of data to learn to leverage such dynamics, scale poorly as problem complexity grows, and do not transfer well to significantly different problems. By contrast, hierarchical POMDP planning-based methods scale well via plan decomposition, work well on novel problems, and directly consider uncertainty, but often rely on precise hand-specified models and task decompositions. To combine the advantages of these opposing paradigms, we propose a new method, MICAH, which given unsegmented data of an object's motion under applied actions, (1) detects changepoints in the object motion model using action-conditional inference, (2) estimates the individual local motion models with their parameters, and (3) converts them into a hybrid automaton that is compatible with hierarchical POMDP planning. We show that model learning under MICAH is more accurate and robust to noise than prior approaches. Further, we combine MICAH with a hierarchical POMDP planner to demonstrate that the learned models are rich enough to be used for performing manipulation tasks under uncertainty that require the objects to be used in novel ways not encountered during training.

Learning High-Level Planning Symbols from Intrinsically Motivated Experience

Authors:Angelo Oddi, Riccardo Rasconi, Emilio Cartoni, Gabriele Sartor, Gianluca Baldassarre, Vieri Giuliano Santucci
Date:2019-07-18 22:42:35

In symbolic planning systems, the knowledge on the domain is commonly provided by an expert. Recently, an automatic abstraction procedure has been proposed in the literature to create a Planning Domain Definition Language (PDDL) representation, which is the most widely used input format for most off-the-shelf automated planners, starting from 'options', a data structure used to represent actions within the hierarchical reinforcement learning framework. We propose an architecture that potentially removes the need for human intervention. In particular, the architecture first acquires options in a fully autonomous fashion on the basis of open-ended learning, then builds a PDDL domain based on symbols and operators that can be used to accomplish user-defined goals through a standard PDDL planner. We start from an implementation of the above-mentioned procedure tested on a set of benchmark domains in which a humanoid robot can change the state of some objects through direct interaction with the environment. We then investigate some critical aspects of the information abstraction process that have been observed, and propose an extension that mitigates such criticalities, in particular by analysing the type of classifiers that allow a suitable grounding of symbols.

Learning Safe Unlabeled Multi-Robot Planning with Motion Constraints

Authors:Arbaaz Khan, Chi Zhang, Shuo Li, Jiayue Wu, Brent Schlotfeldt, Sarah Y. Tang, Alejandro Ribeiro, Osbert Bastani, Vijay Kumar
Date:2019-07-11 15:20:50

In this paper, we present a learning approach to goal assignment and trajectory planning for unlabeled robots operating in 2D, obstacle-filled workspaces. More specifically, we tackle the unlabeled multi-robot motion planning problem with motion constraints as a multi-agent reinforcement learning problem with some sparse global reward. In contrast with previous works, which formulate an entirely new hand-crafted optimization cost or trajectory generation algorithm for a different robot dynamic model, our framework is a general approach that is applicable to arbitrary robot models. Further, by using the velocity obstacle, we devise a smooth projection that guarantees collision free trajectories for all robots with respect to their neighbors and obstacles. The efficacy of our algorithm is demonstrated through varied simulations.

RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies

Authors:Hao-Tien Lewis Chiang, Jasmine Hsu, Marek Fiser, Lydia Tapia, Aleksandra Faust
Date:2019-07-10 15:36:03

This paper addresses two challenges facing sampling-based kinodynamic motion planning: a way to identify good candidate states for local transitions and the subsequent computationally intractable steering between these candidate states. Through the combination of sampling-based planning, a Rapidly Exploring Randomized Tree (RRT) and an efficient kinodynamic motion planner through machine learning, we propose an efficient solution to long-range planning for kinodynamic motion planning. First, we use deep reinforcement learning to learn an obstacle-avoiding policy that maps a robot's sensor observations to actions, which is used as a local planner during planning and as a controller during execution. Second, we train a reachability estimator in a supervised manner, which predicts the RL policy's time to reach a state in the presence of obstacles. Lastly, we introduce RL-RRT that uses the RL policy as a local planner, and the reachability estimator as the distance function to bias tree-growth towards promising regions. We evaluate our method on three kinodynamic systems, including physical robot experiments. Results across all three robots tested indicate that RL-RRT outperforms state of the art kinodynamic planners in efficiency, and also provides a shorter path finish time than a steering function free method. The learned local planner policy and accompanying reachability estimator demonstrate transferability to the previously unseen experimental environments, making RL-RRT fast because the expensive computations are replaced with simple neural network inference. Video: https://youtu.be/dDMVMTOI8KY

Deep Reinforcement-Learning-based Driving Policy for Autonomous Road Vehicles

Authors:Konstantinos Makantasis, Maria Kontorinaki, Ioannis Nikolos
Date:2019-07-10 11:44:09

In this work the problem of path planning for an autonomous vehicle that moves on a freeway is considered. The most common approaches that are used to address this problem are based on optimal control methods, which make assumptions about the model of the environment and the system dynamics. On the contrary, this work proposes the development of a driving policy based on reinforcement learning. In this way, the proposed driving policy makes minimal or no assumptions about the environment, since a priori knowledge about the system dynamics is not required. Driving scenarios where the road is occupied both by autonomous and manual driving vehicles are considered. To the best of our knowledge, this is one of the first approaches that propose a reinforcement learning driving policy for mixed driving environments. The derived reinforcement learning policy, firstly, is compared against an optimal policy derived via dynamic programming, and, secondly, its efficiency is evaluated under realistic scenarios generated by the established SUMO microscopic traffic flow simulator. Finally, some initial results regarding the effect of autonomous vehicles' behavior on the overall traffic flow are presented.

Deep Active Inference as Variational Policy Gradients

Authors:Beren Millidge
Date:2019-07-08 21:14:29

Active Inference is a theory of action arising from neuroscience which casts action and planning as a Bayesian inference problem to be solved by minimizing a single quantity - the variational free energy. Active Inference promises a unifying account of action and perception coupled with a biologically plausible process theory. Despite these potential advantages, current implementations of Active Inference can only handle small, discrete policy and state-spaces and typically require the environmental dynamics to be known. In this paper we propose a novel deep Active Inference algorithm which approximates key densities using deep neural networks as flexible function approximators, which enables Active Inference to scale to significantly larger and more complex tasks. We demonstrate our approach on a suite of OpenAI Gym benchmark tasks and obtain performance comparable with common reinforcement learning baselines. Moreover, our algorithm shows similarities with maximum entropy reinforcement learning and the policy gradients algorithm, which reveals interesting connections between the Active Inference framework and reinforcement learning.

Data Efficient Reinforcement Learning for Legged Robots

Authors:Yuxiang Yang, Ken Caluwaerts, Atil Iscen, Tingnan Zhang, Jie Tan, Vikas Sindhwani
Date:2019-07-08 13:43:06

We present a model-based framework for robot locomotion that achieves walking based on only 4.5 minutes (45,000 control steps) of data collected on a quadruped robot. To accurately model the robot's dynamics over a long horizon, we introduce a loss function that tracks the model's prediction over multiple timesteps. We adapt model predictive control to account for planning latency, which allows the learned model to be used for real time control. Additionally, to ensure safe exploration during model learning, we embed prior knowledge of leg trajectories into the action space. The resulting system achieves fast and robust locomotion. Unlike model-free methods, which optimize for a particular task, our planner can use the same learned dynamics for various tasks, simply by changing the reward function. To the best of our knowledge, our approach is more than an order of magnitude more sample efficient than current model-free methods.
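
A loss that tracks the model's prediction over multiple timesteps can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's loss: states and actions are plain lists of floats, `model(s, a)` is a hypothetical one-step predictor, and the error is an unweighted mean squared error. The key point is that the model's own prediction is fed back as the next input, so errors that compound over the horizon are penalized.

```python
def multi_step_loss(model, states, actions, horizon):
    """Mean squared error of open-loop rollouts of length `horizon`.

    states:  list of T+1 state vectors recorded on the system
    actions: list of T action vectors applied between those states
    model:   callable (state, action) -> predicted next state (assumed)
    """
    if horizon < 1 or horizon > len(actions):
        raise ValueError("horizon must be between 1 and len(actions)")
    errors = []
    for t in range(len(actions) - horizon + 1):
        s = states[t]  # start each rollout from a recorded state
        for k in range(horizon):
            s = model(s, actions[t + k])  # feed the prediction back in
            target = states[t + k + 1]
            sq = sum((p - q) ** 2 for p, q in zip(s, target))
            errors.append(sq / len(target))
    return sum(errors) / len(errors)
```

With `horizon=1` this reduces to the usual one-step prediction loss; larger horizons expose drift that a one-step loss would not see.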

Learning a Behavioral Repertoire from Demonstrations

Authors:Niels Justesen, Miguel Gonzalez Duque, Daniel Cabarcas Jaramillo, Jean-Baptiste Mouret, Sebastian Risi
Date:2019-07-05 23:08:08

Imitation Learning (IL) is a machine learning approach to learn a policy from a dataset of demonstrations. IL can be useful to kick-start learning before applying reinforcement learning (RL) but it can also be useful on its own, e.g. to learn to imitate human players in video games. However, a major limitation of current IL approaches is that they learn only a single "average" policy based on a dataset that possibly contains demonstrations of numerous different types of behaviors. In this paper, we propose a new approach called Behavioral Repertoire Imitation Learning (BRIL) that instead learns a repertoire of behaviors from a set of demonstrations by augmenting the state-action pairs with behavioral descriptions. The outcome of this approach is a single neural network policy conditioned on a behavior description that can be precisely modulated. We apply this approach to train a policy on 7,777 human replays to perform build-order planning in StarCraft II. Principal Component Analysis (PCA) is applied to construct a low-dimensional behavioral space from the high-dimensional army unit composition of each demonstration. The results demonstrate that the learned policy can be effectively manipulated to express distinct behaviors. Additionally, by applying the UCB1 algorithm, we are able to adapt the behavior of the policy - in-between games - to reach a performance beyond that of the traditional IL baseline approach.
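
The between-game adaptation step can be illustrated with the standard UCB1 bandit rule. The sketch below assumes a finite set of candidate behavior descriptions (the "arms") and a reward in [0, 1] per game, such as a win indicator; how the paper discretizes the behavioral space is not stated here, so that part is an assumption, and the function names are hypothetical.

```python
import math


def ucb1_select(counts, totals):
    """Return the index of the arm to play next under UCB1.

    counts[i] is how often arm i was played, totals[i] its summed reward.
    """
    # Play every arm once before using the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    scores = [
        totals[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def ucb1_update(counts, totals, arm, reward):
    """Record the reward observed after playing `arm`."""
    counts[arm] += 1
    totals[arm] += reward
```

Before each game, `ucb1_select` picks a behavior description to condition the policy on; after the game, `ucb1_update` records the outcome.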

Integration of Imitation Learning using GAIL and Reinforcement Learning using Task-achievement Rewards via Probabilistic Graphical Model

Authors:Akira Kinose, Tadahiro Taniguchi
Date:2019-07-03 21:38:48

Integration of reinforcement learning and imitation learning is an important problem that has been studied for a long time in the field of intelligent robotics. Reinforcement learning optimizes policies to maximize the cumulative reward, whereas imitation learning attempts to extract general knowledge about the trajectories demonstrated by experts, i.e., demonstrators. Because each of them has their own drawbacks, methods combining them and compensating for each set of drawbacks have been explored thus far. However, many of the methods are heuristic and do not have a solid theoretical basis. In this paper, we present a new theory for integrating reinforcement and imitation learning by extending the probabilistic generative model framework for reinforcement learning, plan by inference. We develop a new probabilistic graphical model for reinforcement learning with multiple types of rewards and a probabilistic graphical model for Markov decision processes with multiple optimality emissions (pMDP-MO). Furthermore, we demonstrate that the integrated learning method of reinforcement learning and imitation learning can be formulated as a probabilistic inference of policies on pMDP-MO by considering the output of the discriminator in generative adversarial imitation learning as an additional optimal emission observation. We adapt the generative adversarial imitation learning and task-achievement reward to our proposed framework, achieving significantly better performance than agents trained with reinforcement learning or imitation learning alone. Experiments demonstrate that our framework successfully integrates imitation and reinforcement learning even when the number of demonstrators is only a few.

Benchmarking Model-Based Reinforcement Learning

Authors:Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, Jimmy Ba
Date:2019-07-03 17:53:02

Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-sourced or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics bottleneck, the planning horizon dilemma, and the early-termination dilemma. Finally, to maximally facilitate future research on MBRL, we open-source our benchmark at http://www.cs.toronto.edu/~tingwuwang/mbrl.html.

Co-training for Policy Learning

Authors:Jialin Song, Ravi Lanka, Yisong Yue, Masahiro Ono
Date:2019-07-03 02:54:13

We study the problem of learning sequential decision-making policies in settings with multiple state-action representations. Such settings naturally arise in many domains, such as planning (e.g., multiple integer programming formulations) and various combinatorial optimization problems (e.g., those with both integer programming and graph-based formulations). Inspired by the classical co-training framework for classification, we study the problem of co-training for policy learning. We present sufficient conditions under which learning from two views can improve upon learning from a single view alone. Motivated by these theoretical insights, we present a meta-algorithm for co-training for sequential decision making. Our framework is compatible with both reinforcement learning and imitation learning. We validate the effectiveness of our approach across a wide range of tasks, including discrete/continuous control and combinatorial optimization.

Dynamics-Aware Unsupervised Discovery of Skills

Authors:Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, Karol Hausman
Date:2019-07-02 21:32:19

Conventionally, model-based reinforcement learning (MBRL) aims to learn a global model for the dynamics of the environment. A good model can potentially enable planning algorithms to generate a large variety of behaviors and solve diverse tasks. However, learning an accurate model for complex dynamical systems is difficult, and even then, the model might not generalize well outside the distribution of states on which it was trained. In this work, we combine model-based learning with model-free learning of primitives that make model-based planning easy. To that end, we aim to answer the question: how can we discover skills whose outcomes are easy to predict? We propose an unsupervised learning algorithm, Dynamics-Aware Discovery of Skills (DADS), which simultaneously discovers predictable behaviors and learns their dynamics. Our method can leverage continuous skill spaces, theoretically, allowing us to learn infinitely many behaviors even for high-dimensional state-spaces. We demonstrate that zero-shot planning in the learned latent space significantly outperforms standard MBRL and model-free goal-conditioned RL, can handle sparse-reward tasks, and substantially improves over prior hierarchical RL methods for unsupervised skill discovery.

From self-tuning regulators to reinforcement learning and back again

Authors:Nikolai Matni, Alexandre Proutiere, Anders Rantzer, Stephen Tu
Date:2019-06-27 00:01:54

Machine and reinforcement learning (RL) are increasingly being applied to plan and control the behavior of autonomous systems interacting with the physical world. Examples include self-driving vehicles, distributed sensor networks, and agile robots. However, when machine learning is to be applied in these new settings, the algorithms had better come with the same type of reliability, robustness, and safety bounds that are hallmarks of control theory, or failures could be catastrophic. Thus, as learning algorithms are increasingly and more aggressively deployed in safety critical settings, it is imperative that control theorists join the conversation. The goal of this tutorial paper is to provide a starting point for control theorists wishing to work on learning related problems, by covering recent advances bridging learning and control theory, and by placing these results within an appropriate historical context of system identification and adaptive control.

Cooperation-Aware Reinforcement Learning for Merging in Dense Traffic

Authors:Maxime Bouton, Alireza Nakhaei, Kikuo Fujimura, Mykel J. Kochenderfer
Date:2019-06-26 12:22:13

Decision making in dense traffic can be challenging for autonomous vehicles. An autonomous system only relying on predefined road priorities and considering other drivers as moving objects will cause the vehicle to freeze and fail the maneuver. Human drivers leverage the cooperation of other drivers to avoid such deadlock situations and convince others to change their behavior. Decision making algorithms must reason about the interaction with other drivers and anticipate a broad range of driver behaviors. In this work, we present a reinforcement learning approach to learn how to interact with drivers with different cooperation levels. We enhanced the performance of traditional reinforcement learning algorithms by maintaining a belief over the level of cooperation of other drivers. We show that our agent successfully learns how to navigate a dense merging scenario with fewer deadlocks than with online planning methods.
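
Maintaining a belief over another driver's cooperation level can be illustrated with a discrete Bayes update. The sketch below is not the paper's estimator: it assumes a small set of candidate cooperation levels in [0, 1] and a hypothetical observation model in which a driver with level c yields with probability c.

```python
def update_belief(belief, levels, yielded):
    """Bayes update of a discrete belief over cooperation levels.

    belief:  prior probabilities, one per entry in `levels`
    levels:  candidate cooperation levels in [0, 1]
    yielded: True if the driver was observed to yield
    Assumption for illustration: P(yield | level c) = c.
    """
    likelihood = [c if yielded else 1.0 - c for c in levels]
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    if total == 0.0:
        # Observation is impossible under every hypothesis; keep the prior.
        return list(belief)
    return [p / total for p in posterior]
```

The resulting belief vector could then be appended to the agent's observation so that the learned policy conditions on how cooperative each neighbor appears to be.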

NeuroTrajectory: A Neuroevolutionary Approach to Local State Trajectory Learning for Autonomous Vehicles

Authors:Sorin Grigorescu, Bogdan Trasnea, Liviu Marina, Andrei Vasilcoi, Tiberiu Cocias
Date:2019-06-26 11:05:18

Autonomous vehicles today are controlled either by sequences of decoupled perception-planning-action operations or by End2End or Deep Reinforcement Learning (DRL) systems. Current deep learning solutions for autonomous driving are subject to several limitations (e.g. they estimate driving actions through a direct mapping of sensors to actuators, or require complex reward shaping methods). Although the cost function used for training can aggregate multiple weighted objectives, the gradient descent step is computed by the backpropagation algorithm using a single-objective loss. To address these issues, we introduce NeuroTrajectory, which is a multi-objective neuroevolutionary approach to local state trajectory learning for autonomous driving, where the desired state trajectory of the ego-vehicle is estimated over a finite prediction horizon by a perception-planning deep neural network. In comparison to DRL methods, which predict optimal actions for the upcoming sampling time, we estimate a sequence of optimal states that can be used for motion control. We propose an approach which uses genetic algorithms for training a population of deep neural networks, where each network individual is evaluated based on a multi-objective fitness vector, with the purpose of establishing a so-called Pareto front of optimal deep neural networks. The performance of an individual is given by a fitness vector composed of three elements, which describe the vehicle's travel path, lateral velocity and longitudinal speed, respectively. The same network structure can be trained on synthetic, as well as on real-world data sequences. We have benchmarked our system against a baseline Dynamic Window Approach (DWA), as well as against an End2End supervised learning method.
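
As a rough illustration of the Pareto-front idea in the abstract above, the sketch below filters a population down to its non-dominated individuals. It assumes each fitness vector holds three error-like objectives where lower is better; the function names and the sample values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and strictly better in one.

    Assumption of this sketch: every objective is a cost, so lower is better.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(fitness: List[Sequence[float]]) -> List[int]:
    """Return the indices of individuals that no other individual dominates."""
    return [
        i
        for i, f in enumerate(fitness)
        if not any(dominates(g, f) for j, g in enumerate(fitness) if j != i)
    ]


# Hypothetical fitness vectors: (path error, lateral velocity error, speed error).
population_fitness = [
    (1.0, 2.0, 3.0),
    (2.0, 1.0, 3.0),
    (2.0, 2.0, 4.0),  # dominated by the first individual
    (1.0, 2.0, 3.5),  # dominated by the first individual
]
front = pareto_front(population_fitness)
```

The first two individuals trade off against each other, so neither dominates the other and both stay on the front. A genetic algorithm would typically select parents from such a front, though the selection scheme used in the paper is not described here.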

DynoPlan: Combining Motion Planning and Deep Neural Network based Controllers for Safe HRL

Authors:Daniel Angelov, Yordan Hristov, Subramanian Ramamoorthy
Date:2019-06-24 17:34:51

Many realistic robotics tasks are best solved compositionally, through control architectures that sequentially invoke primitives and achieve error correction through the use of loops and conditionals taking the system back to alternative earlier states. Recent end-to-end approaches to task learning attempt to directly learn a single controller that solves an entire task, but this has been difficult for complex control tasks that would have otherwise required a diversity of local primitive moves, and the resulting solutions are also not easy to inspect for plan monitoring purposes. In this work, we aim to bridge the gap between hand designed and learned controllers, by representing each as an option in a hybrid hierarchical Reinforcement Learning framework - DynoPlan. We extend the options framework by adding a dynamics model and the use of a nearness-to-goal heuristic, derived from demonstrations. This translates the optimization of a hierarchical policy controller to a problem of planning with a model predictive controller. By unrolling the dynamics of each option and assessing the expected value of each future state, we can create a simple switching controller for choosing the optimal policy within a constrained time horizon similarly to hill climbing heuristic search. The individual dynamics model allows each option to iterate and be activated independently of the specific underlying instantiation, thus allowing for a mix of motion planning and deep neural network based primitives. We can assess the safety regions of the resulting hybrid controller by investigating the initiation sets of the different options, and also by reasoning about the completeness and performance guarantees of the underpinning motion planners.

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

Authors:Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca D. Dragan
Date:2019-06-23 18:41:31

Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify what that is. Inverse Reinforcement Learning (IRL) enables us to infer reward functions from demonstrations, but it usually assumes that the expert is noisily optimal. Real people, on the other hand, often have systematic biases: risk-aversion, myopia, etc. One option is to try to characterize these biases and account for them explicitly during learning. But in the era of deep learning, a natural suggestion researchers make is to avoid mathematical models of human behavior that are fraught with specific assumptions, and instead use a purely data-driven approach. We decided to put this to the test -- rather than relying on assumptions about which specific bias the demonstrator has when planning, we instead learn the demonstrator's planning algorithm that they use to generate demonstrations, as a differentiable planner. Our exploration yielded mixed findings: on the one hand, learning the planner can lead to better reward inference than relying on the wrong assumption; on the other hand, this benefit is dwarfed by the loss we incur by going from an exact to a differentiable planner. This suggests that at least for the foreseeable future, agents need a middle ground between the flexibility of data-driven methods and the useful bias of known human biases. Code is available at https://tinyurl.com/learningbiases.

A Deep Reinforcement Learning Approach for Global Routing

Authors:Haiguang Liao, Wentai Zhang, Xuliang Dong, Barnabas Poczos, Kenji Shimada, Levent Burak Kara
Date:2019-06-20 19:07:01

Global routing has been a historically challenging problem in electronic circuit design, where the challenge is to connect a large and arbitrary number of circuit components with wires without violating the design rules for the printed circuit boards or integrated circuits. Similar routing problems also exist in the design of complex hydraulic systems, pipe systems and logistic networks. Existing solutions typically consist of greedy algorithms and hard-coded heuristics. As such, existing approaches suffer from a lack of model flexibility and non-optimum solutions. As an alternative approach, this work presents a deep reinforcement learning method for solving the global routing problem in a simulated environment. At the heart of the proposed method is deep reinforcement learning that enables an agent to produce an optimal policy for routing based on the variety of problems it is presented with, leveraging the conjoint optimization mechanism of deep reinforcement learning. The conjoint optimization mechanism is explained and demonstrated in detail; the best network structure and the parameters of the learned model are explored. Based on the fine-tuned model, routing solutions and rewards are presented and analyzed. The results indicate that the approach can outperform the benchmark method of a sequential A* method, suggesting a promising potential for deep reinforcement learning for global routing and other routing or path planning problems in general. Another major contribution of this work is the development of a global routing problem sets generator with the ability to generate parameterized global routing problem sets with different sizes and constraints, enabling evaluation of different routing algorithms and the generation of training datasets for future data-driven routing approaches.

Exploring Model-based Planning with Policy Networks

Authors:Tingwu Wang, Jimmy Ba
Date:2019-06-20 14:13:12

Model-based reinforcement learning (MBRL) with model-predictive control or online planning has shown great potential for locomotion control tasks in terms of both sample efficiency and asymptotic performance. Despite their initial successes, the existing planning methods search from candidate sequences randomly generated in the action space, which is inefficient in complex high-dimensional environments. In this paper, we propose a novel MBRL algorithm, model-based policy planning (POPLIN), that combines policy networks with online planning. More specifically, we formulate action planning at each time-step as an optimization problem using neural networks. We experiment with both optimization w.r.t. the action sequences initialized from the policy network, and also online optimization directly w.r.t. the parameters of the policy network. We show that POPLIN obtains state-of-the-art performance in the MuJoCo benchmarking environments, being about 3x more sample efficient than the state-of-the-art algorithms, such as PETS, TD3 and SAC. To explain the effectiveness of our algorithm, we show that the optimization surface in parameter space is smoother than in action space. Furthermore, we found the distilled policy network can be effectively applied without the expensive model predictive control during test time for some environments such as Cheetah. Code is released at https://github.com/WilsonWangTHU/POPLIN.

Calibrated Model-Based Deep Reinforcement Learning

Authors:Ali Malik, Volodymyr Kuleshov, Jiaming Song, Danny Nemer, Harlan Seymour, Stefano Ermon
Date:2019-06-19 19:10:26

Estimates of predictive uncertainty are important for accurate model-based planning and reinforcement learning. However, predictive uncertainties---especially ones derived from modern deep learning systems---can be inaccurate and impose a bottleneck on performance. This paper explores which uncertainties are needed for model-based reinforcement learning and argues that good uncertainties must be calibrated, i.e. their probabilities should match empirical frequencies of predicted events. We describe a simple way to augment any model-based reinforcement learning agent with a calibrated model and show that doing so consistently improves planning, sample complexity, and exploration. On the HalfCheetah MuJoCo task, our system achieves state-of-the-art performance using 50% fewer samples than the current leading approach. Our findings suggest that calibration can improve the performance of model-based reinforcement learning with minimal computational and implementation overhead.
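
The calibration criterion stated above (predicted probabilities should match empirical frequencies) can be checked with a simple diagnostic. The sketch below is only an illustration under stated assumptions: it assumes Gaussian predictive distributions and synthetic data, the name `calibration_curve` is hypothetical, and it does not reproduce the recalibration procedure of the paper.

```python
from statistics import NormalDist
from typing import List, Sequence


def calibration_curve(
    means: Sequence[float],
    stds: Sequence[float],
    targets: Sequence[float],
    levels: Sequence[float],
) -> List[float]:
    """For each level p, return the fraction of targets at or below the predicted p-quantile.

    A calibrated forecaster returns values close to the levels themselves.
    Assumption of this sketch: each prediction is a Gaussian N(mean, std^2).
    """
    # Evaluate each predictive CDF at the observed outcome.
    cdf_values = [NormalDist(m, s).cdf(y) for m, s, y in zip(means, stds, targets)]
    n = len(cdf_values)
    return [sum(c <= p for c in cdf_values) / n for p in levels]


# Synthetic outcomes placed at evenly spaced quantiles of a standard normal.
n = 100
targets = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
levels = [0.1, 0.5, 0.9]

# A model that predicts the true spread versus one that is overconfident.
calibrated = calibration_curve([0.0] * n, [1.0] * n, targets, levels)
overconfident = calibration_curve([0.0] * n, [0.5] * n, targets, levels)
```

For the model with the correct spread, the curve follows the requested levels. The overconfident model, whose standard deviation is too small, places too many outcomes in its tails, so its curve departs from the levels at 0.1 and 0.9.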

Learning to Plan Hierarchically from Curriculum

Authors:Philippe Morere, Lionel Ott, Fabio Ramos
Date:2019-06-18 04:31:25

We present a framework for learning to plan hierarchically in domains with unknown dynamics. We enhance planning performance by exploiting problem structure in several ways: (i) We simplify the search over plans by leveraging knowledge of skill objectives, (ii) Shorter plans are generated by enforcing aggressively hierarchical planning, (iii) We learn transition dynamics with sparse local models for better generalisation. Our framework decomposes transition dynamics into skill effects and success conditions, which allows fast planning by reasoning on effects, while learning conditions from interactions with the world. We propose a simple method for learning new abstract skills, using successful trajectories stemming from completing the goals of a curriculum. Learned skills are then refined to leverage other abstract skills and enhance subsequent planning. We show that both conditions and abstract skills can be learned simultaneously while planning, even in stochastic domains. Our method is validated in experiments of increasing complexity, with up to 2^100 states, showing superior planning to classic non-hierarchical planners or reinforcement learning methods. Applicability to real-world problems is demonstrated in a simulation-to-real transfer experiment on a robotic manipulator.

A Joint Planning and Learning Framework for Human-Aided Decision-Making

Authors:Daoming Lyu, Fangkai Yang, Bo Liu, Steven Gustafson
Date:2019-06-17 20:56:31

Conventional reinforcement learning (RL) allows an agent to learn policies via environmental rewards only, with a long and slow learning curve, especially at the beginning stage. On the contrary, human learning is usually much faster because prior and general knowledge and multiple information resources are utilized. In this paper, we propose a Planner-Actor-Critic architecture for huMAN-centered planning and learning (PACMAN), where an agent uses prior, high-level, deterministic symbolic knowledge to plan for goal-directed actions. PACMAN integrates the Actor-Critic algorithm of RL to fine-tune its behavior towards both environmental rewards and human feedback. To the best of our knowledge, this is the first unified framework where knowledge-based planning, RL, and human teaching jointly contribute to the policy learning of an agent. Our experiments demonstrate that PACMAN leads to a significant jump-start at the early stage of learning, converges rapidly and with small variance, and is robust to inconsistent, infrequent, and misleading feedback.

Reinforcement Learning with Non-uniform State Representations for Adaptive Search

Authors:Sandeep Manjanna, Herke van Hoof, Gregory Dudek
Date:2019-06-15 16:37:07

Efficient spatial exploration is a key aspect of search and rescue. In this paper, we present a search algorithm that generates efficient trajectories that optimize the rate at which probability mass is covered by a searcher. This should allow an autonomous vehicle to find one or more lost targets as rapidly as possible. We do this by performing non-uniform sampling of the search region. The path generated minimizes the expected time to locate the missing target by visiting high probability regions using non-myopic path generation based on reinforcement learning. We model the target probability distribution using a classic mixture of Gaussians model with means and mixture coefficients tuned according to the location and time of sightings of the lost target. Key features of our search algorithm are the ability to employ a very general non-deterministic action model and the ability to generate action plans for any new probability distribution using the parameters learned on other similar looking distributions. One of the key contributions of this paper is the use of non-uniform state aggregation for policy search in the context of robotics.

Sub-Goal Trees -- a Framework for Goal-Directed Trajectory Prediction and Optimization

Authors:Tom Jurgenson, Edward Groshev, Aviv Tamar
Date:2019-06-12 19:06:51

Many AI problems, in robotics and other domains, are goal-directed, essentially seeking a trajectory leading to some goal state. In such problems, the way we choose to represent a trajectory underlies algorithms for trajectory prediction and optimization. Interestingly, almost all prior work in imitation and reinforcement learning builds on a sequential trajectory representation -- calculating the next state in the trajectory given its predecessors. We propose a different perspective: a goal-conditioned trajectory can be represented by first selecting an intermediate state between start and goal, partitioning the trajectory into two, and then recursively predicting intermediate points on each sub-segment until a complete trajectory is obtained. We call this representation a sub-goal tree, and building on it, we develop new methods for trajectory prediction, learning, and optimization. We show that in a supervised learning setting, sub-goal trees better account for trajectory variability, and can predict trajectories exponentially faster at test time by leveraging a concurrent computation. Then, for optimization, we derive a new dynamic programming equation for sub-goal trees, and use it to develop new planning and reinforcement learning algorithms. These algorithms, which are not based on the standard Bellman equation, naturally account for hierarchical sub-goal structure in a task. Empirical results on motion planning domains show that the sub-goal tree framework significantly improves both accuracy and prediction time.
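
The recursive representation described above can be sketched in a few lines. This is a minimal illustration, assuming one-dimensional states and a hypothetical `linear_midpoint` function standing in for a learned sub-goal predictor; it is not the authors' implementation.

```python
from typing import Callable, List

State = float  # assumption of this sketch: states are scalars


def predict_trajectory(
    start: State,
    goal: State,
    predict_midpoint: Callable[[State, State], State],
    depth: int,
) -> List[State]:
    """Build a trajectory by recursively predicting a sub-goal between two states."""
    if depth == 0:
        return [start, goal]
    mid = predict_midpoint(start, goal)
    # The two halves are independent given `mid`, so they could be computed concurrently.
    left = predict_trajectory(start, mid, predict_midpoint, depth - 1)
    right = predict_trajectory(mid, goal, predict_midpoint, depth - 1)
    return left + right[1:]  # drop the duplicated midpoint


def linear_midpoint(start: State, goal: State) -> State:
    """Hypothetical stand-in for a learned sub-goal predictor."""
    return (start + goal) / 2.0


trajectory = predict_trajectory(0.0, 8.0, linear_midpoint, depth=3)
```

A tree of depth d yields 2^d segments, while each branch needs only d sequential predictions. That is the source of the exponential speed-up at test time when the branches are evaluated concurrently.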

Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Authors:Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
Date:2019-06-12 17:24:03

The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only needs to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment -- namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.
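
The graph-search step described above can be sketched as a shortest-path query over buffer states. This is a minimal illustration under stated assumptions: `distance` is a hypothetical stand-in for distances derived from a goal-conditioned value function, the `max_edge` cutoff for discarding long, unreliable edges is a choice made for this sketch, and the code is not the released SoRB implementation.

```python
import heapq
from typing import Callable, List, Optional, Sequence

State = float  # assumption of this sketch: states are scalars


def plan_waypoints(
    buffer: Sequence[State],
    start: State,
    goal: State,
    distance: Callable[[State, State], float],
    max_edge: float,
) -> Optional[List[State]]:
    """Dijkstra search over start, buffer states, and goal; returns the waypoint sequence."""
    nodes = [start] + list(buffer) + [goal]
    goal_idx = len(nodes) - 1
    best = {0: 0.0}      # cheapest known cost to each node index
    parent = {}          # back-pointers for path reconstruction
    frontier = [(0.0, 0)]
    while frontier:
        cost, i = heapq.heappop(frontier)
        if cost > best.get(i, float("inf")):
            continue  # stale queue entry
        if i == goal_idx:
            break
        for j, node in enumerate(nodes):
            if j == i:
                continue
            d = distance(nodes[i], node)
            if d > max_edge:
                continue  # treat long-range estimates as unreliable
            new_cost = cost + d
            if new_cost < best.get(j, float("inf")):
                best[j] = new_cost
                parent[j] = i
                heapq.heappush(frontier, (new_cost, j))
    if goal_idx not in best:
        return None  # goal not reachable through short edges
    path = [goal_idx]
    while path[-1] != 0:
        path.append(parent[path[-1]])
    return [nodes[i] for i in reversed(path)]


# Toy example: absolute difference plays the role of the learned distance.
waypoints = plan_waypoints([3.0, 1.0, 2.0], 0.0, 4.0, lambda a, b: abs(a - b), max_edge=1.0)
```

In this toy setting the goal is too far from the start for a single edge, so the search chains buffer states into intermediate subgoals, each of which a goal-conditioned policy would then be asked to reach.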

When to use parametric models in reinforcement learning?

Authors:Hado van Hasselt, Matteo Hessel, John Aslanides
Date:2019-06-12 16:57:00

We examine the question of when and how parametric models are most useful in reinforcement learning. In particular, we look at commonalities and differences between parametric models and experience replay. Replay-based learning algorithms share important traits with model-based approaches, including the ability to plan: to use more computation without additional data to improve predictions and behaviour. We discuss when to expect benefits from either approach, and interpret prior work in this context. We hypothesise that, under suitable conditions, replay-based algorithms should be competitive with or better than model-based algorithms if the model is used only to generate fictional transitions from observed states for an update rule that is otherwise model-free. We validated this hypothesis on Atari 2600 video games. The replay-based algorithm attained state-of-the-art data efficiency, improving over prior results with parametric models.

Optimizing city-scale traffic through modeling observations of vehicle movements

Authors:Fan Yang, Alina Vereshchaka, Bruno Lepri, Wen Dong
Date:2019-06-12 12:46:56

The capability of traffic-information systems to sense the movement of millions of users and offer trip plans through mobile phones has enabled a new way of optimizing city traffic dynamics, turning transportation big data into insights and actions in a closed-loop and evaluating this approach in the real world. Existing research has applied dynamic Bayesian networks and deep neural networks to make traffic predictions from floating car data, utilized dynamic programming and simulation approaches to identify how people normally travel with dynamic traffic assignment for policy research, and introduced Markov decision processes and reinforcement learning to optimally control traffic signals. However, none of these works utilized floating car data to suggest departure times and route choices in order to optimize city traffic dynamics. In this paper, we present a study showing that floating car data can lead to lower average trip time, higher on-time arrival ratio, and higher Charypar-Nagel score compared with how people normally travel. The study is based on optimizing a partially observable discrete-time decision process and is evaluated in one synthesized scenario, one partly synthesized scenario, and three real-world scenarios. This study points to the potential of a "living lab" approach where we learn, predict, and optimize behaviors in the real world.

Towards Big data processing in IoT: Path Planning and Resource Management of UAV Base Stations in Mobile-Edge Computing System

Authors:Shuo Wan, Jiaxun Lu, Pingyi Fan, Khaled B. Letaief
Date:2019-06-12 09:21:41

Heavy data loads and wide coverage ranges have always been crucial problems for online data processing in the internet of things (IoT). Recently, mobile-edge computing (MEC) and unmanned aerial vehicle base stations (UAV-BSs) have emerged as promising techniques in IoT. In this paper, we propose a three-layer online data processing network based on the MEC technique. On the bottom layer, raw data are generated by widely distributed sensors, which reflect local information. Above them, unmanned aerial vehicle base stations (UAV-BSs) are deployed as moving MEC servers, which collect data and conduct initial steps of data processing. On top of them, a center cloud receives processed results and conducts further evaluation. As this is an online data processing system, the edge nodes should stabilize delay to ensure data freshness. Furthermore, limited onboard energy poses constraints to edge processing capability. To smartly manage network resources for saving energy and stabilizing delay, we develop an online determination policy based on Lyapunov Optimization. In cases of low data rate, it tends to reduce edge processor frequency for saving energy. In the presence of high data rate, it will smartly allocate bandwidth for edge data offloading. Meanwhile, hovering UAV-BSs bring a large and flexible service coverage, which results in the problem of effective path planning. In this paper, we apply deep reinforcement learning and develop an online path planning algorithm. Taking observations of the surrounding environment as input, a CNN is trained to predict the reward of each action. Through simulations, we validate its effectiveness in enhancing service coverage. The result will contribute to big data processing in future IoT.

Reinforcement Learning for Integer Programming: Learning to Cut

Authors:Yunhao Tang, Shipra Agrawal, Yuri Faenza
Date:2019-06-11 23:14:46

Integer programming (IP) is a general optimization framework widely applicable to a variety of unstructured and structured problems arising in, e.g., scheduling, production planning, and graph optimization. As IP models many provably hard to solve problems, modern IP solvers rely on many heuristics. These heuristics are usually human-designed, and naturally prone to suboptimality. The goal of this work is to show that the performance of those solvers can be greatly enhanced using reinforcement learning (RL). In particular, we investigate a specific methodology for solving IPs, known as the Cutting Plane Method. This method is employed as a subroutine by all modern IP solvers. We present a deep RL formulation, network architecture, and algorithms for intelligent adaptive selection of cutting planes (aka cuts). Across a wide range of IP tasks, we show that the trained RL agent significantly outperforms human-designed heuristics, and effectively generalizes to 10X larger instances and across IP problem classes. The trained agent is also demonstrated to benefit the popular downstream application of cutting plane methods in Branch-and-Cut algorithm, which is the backbone of state-of-the-art commercial IP solvers.

Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal

Authors:Alekh Agarwal, Sham Kakade, Lin F. Yang
Date:2019-06-10 05:50:03

This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this work, we study the effectiveness of the most natural plug-in approach to model-based planning: we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP. We ask arguably the most basic and unresolved question in model based planning: is the naive "plug-in" approach, non-asymptotically, minimax optimal in the quality of the policy it finds, given a fixed sample size? Here, the non-asymptotic regime refers to when the sample size is sublinear in the model size. With access to a generative model, we resolve this question in the strongest possible sense: our main result shows that any high accuracy solution in the plug-in model constructed with $N$ samples, provides an $\epsilon$-optimal policy in the true underlying MDP (where $\epsilon$ is the minimax accuracy with $N$ samples at every state, action pair). In comparison, all prior (non-asymptotically) minimax optimal results use model free approaches, such as the Variance Reduced Q-value iteration algorithm (Sidford et al 2018), while the best known model-based results (e.g. Azar et al 2013) require larger sample sizes in their dependence on the planning horizon or the state space. Notably, we show that the model-based approach allows the use of any efficient planning algorithm in the empirical MDP, which simplifies algorithm design as this approach does not tie the algorithm to the sampling procedure. The core of our analysis is a novel "absorbing MDP" construction to address the statistical dependency issues that arise in the analysis of model-based planning approaches, a construction which may be helpful more generally.
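
The plug-in approach described above has two steps: estimate the transition model by counting samples from the generative model, then plan in the resulting empirical MDP. The sketch below illustrates this on a toy two-state MDP. The MDP, the function names, and the choice of value iteration as the planner are assumptions made for this illustration; the abstract notes that any efficient planner could be used.

```python
import random
from typing import Callable, List, Tuple

Matrix3 = List[List[List[float]]]  # indexed as P[state][action][next_state]


def estimate_transitions(
    sample_next_state: Callable[[int, int], int],
    n_states: int,
    n_actions: int,
    n_samples: int,
) -> Matrix3:
    """Maximum likelihood estimate: draw n_samples per (state, action) and normalize counts."""
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(s, a)] += 1
            row.append([c / n_samples for c in counts])
        p_hat.append(row)
    return p_hat


def q_value(P: Matrix3, R: List[List[float]], V: List[float], gamma: float, s: int, a: int) -> float:
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def value_iteration(
    P: Matrix3, R: List[List[float]], gamma: float, n_iters: int = 500
) -> Tuple[List[float], List[int]]:
    """Plan in the given (here: empirical) MDP and return values and a greedy policy."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(n_iters):
        V = [max(q_value(P, R, V, gamma, s, a) for a in range(n_actions)) for s in range(n_states)]
    policy = [
        max(range(n_actions), key=lambda a: q_value(P, R, V, gamma, s, a))
        for s in range(n_states)
    ]
    return V, policy


# Hypothetical toy MDP: state 1 is rewarding; action 1 leaves the current state.
TRUE_P = [[[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.9, 0.1]]]
REWARD = [[0.0, 0.0], [1.0, 1.0]]
rng = random.Random(0)


def generative_model(s: int, a: int) -> int:
    return rng.choices(range(2), weights=TRUE_P[s][a])[0]


p_hat = estimate_transitions(generative_model, n_states=2, n_actions=2, n_samples=200)
values, policy = value_iteration(p_hat, REWARD, gamma=0.9)
```

The sampling step and the planning step are decoupled here: `value_iteration` only sees `p_hat`, which mirrors the point that the plug-in approach does not tie the planner to the sampling procedure.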

Planning With Uncertain Specifications (PUnS)

Authors:Ankit Shah, Shen Li, Julie Shah
Date:2019-06-07 16:32:16

Reward engineering is crucial to high performance in reinforcement learning systems. Prior research into reward design has largely focused on Markovian functions representing the reward. While there has been research into expressing non-Markov rewards as linear temporal logic (LTL) formulas, this has focused on task specifications directly defined by the user. However, in many real-world applications, task specifications are ambiguous, and can only be expressed as a belief over LTL formulas. In this paper, we introduce planning with uncertain specifications (PUnS), a novel formulation that addresses the challenge posed by non-Markovian specifications expressed as beliefs over LTL formulas. We present four criteria that capture the semantics of satisfying a belief over specifications for different applications, and analyze the qualitative implications of these criteria within a synthetic domain. We demonstrate the existence of an equivalent Markov decision process (MDP) for any instance of PUnS. Finally, we demonstrate our approach on the real-world task of setting a dinner table automatically with a robot that inferred task specifications from human demonstrations.

Worst-Case Regret Bounds for Exploration via Randomized Value Functions

Authors:Daniel Russo
Date:2019-06-07 02:36:00

This paper studies a recent proposal to use randomized value functions to drive exploration in reinforcement learning. These randomized value functions are generated by injecting random noise into the training data, making the approach compatible with many popular methods for estimating parameterized value functions. By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, we show that planning with respect to these randomized value functions can induce provably efficient exploration.

Harnessing Reinforcement Learning for Neural Motion Planning

Authors:Tom Jurgenson, Aviv Tamar
Date:2019-06-01 12:19:37

Motion planning is an essential component in most of today's robotic applications. In this work, we consider the learning setting, where a set of solved motion planning problems is used to improve the efficiency of motion planning on different, yet similar problems. This setting is important in applications with rapidly changing environments such as in e-commerce, among others. We investigate a general deep learning based approach, where a neural network is trained to map an image of the domain, the current robot state, and a goal robot state to the next robot state in the plan. We focus on the learning algorithm, and compare supervised learning methods with reinforcement learning (RL) algorithms. We first establish that supervised learning approaches are inferior in their accuracy due to insufficient data on the boundary of the obstacles, an issue that RL methods mitigate by actively exploring the domain. We then propose a modification of the popular DDPG RL algorithm that is tailored to motion planning domains, by exploiting the known model in the problem and the set of solved plans in the data. We show that our algorithm, dubbed DDPG-MP, significantly improves the accuracy of the learned motion planning policy. Finally, we show that given enough training data, our method can plan significantly faster on novel domains than off-the-shelf sampling based motion planners. Results of our experiments are shown in https://youtu.be/wHQ4Y4mBRb8.

Combating the Compounding-Error Problem with a Multi-step Model

Authors:Kavosh Asadi, Dipendra Misra, Seungchan Kim, Michel L. Littman
Date:2019-05-30 21:30:29

Model-based reinforcement learning is an appealing framework for creating agents that learn, plan, and act in sequential environments. Model-based algorithms typically involve learning a transition model that takes a state and an action and outputs the next state---a one-step model. This model can be composed with itself to enable predicting multiple steps into the future, but one-step prediction errors can get magnified, leading to unacceptable inaccuracy. This compounding-error problem plagues planning and undermines model-based reinforcement learning. In this paper, we address the compounding-error problem by introducing a multi-step model that directly outputs the outcome of executing a sequence of actions. Novel theoretical and empirical results indicate that the multi-step model is more conducive to efficient value-function estimation, and it yields better action selection compared to the one-step model. These results make a strong case for using multi-step models in the context of model-based reinforcement learning.
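
The compounding-error problem described above can be seen in a toy setting. The sketch below assumes scalar linear dynamics and a learned one-step model whose gain is slightly wrong; both are hypothetical choices for illustration. It shows only the problem, not the multi-step model proposed in the paper.

```python
from typing import Callable, List


def rollout(x0: float, step: Callable[[float], float], horizon: int) -> List[float]:
    """Compose a one-step function with itself to predict `horizon` steps ahead."""
    states = [x0]
    for _ in range(horizon):
        states.append(step(states[-1]))
    return states


def true_step(x: float) -> float:
    return 1.10 * x  # hypothetical true dynamics


def model_step(x: float) -> float:
    return 1.12 * x  # hypothetical learned model with a small one-step error


horizon = 10
true_states = rollout(1.0, true_step, horizon)
model_states = rollout(1.0, model_step, horizon)

# Absolute prediction error at each step of the rollout.
errors = [abs(m - t) for m, t in zip(model_states, true_states)]
```

The one-step error is small, but each composed prediction starts from an already inaccurate state, so the gap to the true trajectory widens with the horizon.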

Learning Compositional Neural Programs with Recursive Tree Search and Planning

Authors:Thomas Pierrot, Guillaume Ligner, Scott Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, Nando de Freitas
Date:2019-05-30 10:08:00

We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaZero contributes powerful neural network guided search algorithms, which we augment with recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. The experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disks.

Learning Navigation Subroutines from Egocentric Videos

Authors:Ashish Kumar, Saurabh Gupta, Jitendra Malik
Date:2019-05-29 17:50:19

Planning at a higher level of abstraction instead of low-level torques improves the sample efficiency in reinforcement learning, and computational efficiency in classical planning. We propose a method to learn such hierarchical abstractions, or subroutines, from egocentric video data of experts performing tasks. We learn a self-supervised inverse model on small amounts of random interaction data to pseudo-label the expert egocentric videos with agent actions. Visuomotor subroutines are acquired from these pseudo-labeled videos by learning a latent intent-conditioned policy that predicts the inferred pseudo-actions from the corresponding image observations. We demonstrate our proposed approach in the context of navigation, and show that we can successfully learn consistent and diverse visuomotor subroutines from passive egocentric videos. We demonstrate the utility of our acquired visuomotor subroutines by using them as is for exploration, and as sub-policies in a hierarchical RL framework for reaching point goals and semantic goals. We also demonstrate the behavior of our subroutines in the real world, by deploying them on a real robotic platform. Project website: https://ashishkumar1993.github.io/subroutines/.

Learning NP-Hard Multi-Agent Assignment Planning using GNN: Inference on a Random Graph and Provable Auction-Fitted Q-learning

Authors:Hyunwook Kang, Taehwan Kwon, Jinkyoo Park, James R. Morrison
Date:2019-05-29 04:02:41

This paper explores the possibility of near-optimally solving multi-agent, multi-task NP-hard planning problems with time-dependent rewards using a learning-based algorithm. In particular, we consider a class of robot/machine scheduling problems called the multi-robot reward collection problem (MRRC). Such MRRC problems well model ride-sharing, pickup-and-delivery, and a variety of related problems. In representing the MRRC problem as a sequential decision-making problem, we observe that each state can be represented as an extension of probabilistic graphical models (PGMs), which we refer to as random PGMs. We then develop a mean-field inference method for random PGMs. We then propose (1) an order-transferable Q-function estimator and (2) an order-transferability-enabled auction to select a joint assignment in polynomial time. These result in a reinforcement learning framework with at least $1-1/e$ optimality. Experimental results on solving MRRC problems highlight the near-optimality and transferability of the proposed methods. We also consider identical parallel machine scheduling problems (IPMS) and minimax multiple traveling salesman problems (minimax-mTSP).

Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Authors:Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, Shie Mannor
Date:2019-05-27 22:22:49

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing \emph{full-planning} on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with \emph{greedy policies} -- acting by \emph{1-step planning} -- can achieve tight minimax performance in terms of regret, $\tilde{\mathcal{O}}(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of $S$. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
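
The idea of acting by 1-step planning can be sketched with real-time dynamic programming on a toy problem. This is a minimal illustration under stated assumptions, not the paper's algorithm: the model `P` and rewards `R` are taken as known and deterministic (the paper works with empirical models and optimism bonuses), and the three-state chain, `greedy_episode`, and the optimistic initialization `V[h][s] = H - h` are all hypothetical choices.

```python
def greedy_episode(P, R, V, s0, H):
    """One episode of RTDP: back up only the state actually visited."""
    s, trajectory = s0, []
    for h in range(H):
        # One-step lookahead at the current state only (cost ~ A * S),
        # instead of a full sweep over all S states.
        q = [R[s][a] + sum(p * V[h + 1][s2] for s2, p in P[s][a].items())
             for a in range(len(R[s]))]
        a = max(range(len(q)), key=q.__getitem__)
        V[h][s] = min(V[h][s], q[a])  # keep the optimistic value an upper bound
        trajectory.append((s, a))
        # Toy environment is deterministic: follow the most likely transition.
        s = max(P[s][a], key=P[s][a].get)
    return trajectory

# Hypothetical 3-state chain: action 0 stays, action 1 moves right.
# Reward 1 is earned only when acting in the last state.
S, H = 3, 3
P = [[{s: 1.0}, {min(s + 1, S - 1): 1.0}] for s in range(S)]
R = [[1.0 if s == S - 1 else 0.0] * 2 for s in range(S)]

# Optimistic initial values: at most one unit of reward per remaining step.
V = [[float(H - h)] * S for h in range(H + 1)]

for _ in range(10):
    last_trajectory = greedy_episode(P, R, V, s0=0, H=H)
```

Each episode touches only H states, whereas full planning would recompute values for all S states at every stage; this is where a saving on the order of S comes from in this sketch.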

Composing Task-Agnostic Policies with Deep Reinforcement Learning

Authors:Ahmed H. Qureshi, Jacob J. Johnson, Yuzhe Qin, Taylor Henderson, Byron Boots, Michael C. Yip
Date:2019-05-25 21:40:38

The composition of elementary behaviors to solve challenging transfer learning problems is one of the key elements in building intelligent machines. To date, there has been plenty of work on learning task-specific policies or skills but almost no focus on composing necessary, task-agnostic skills to find a solution to new problems. In this paper, we propose a novel deep reinforcement learning-based skill transfer and composition method that takes the agent's primitive policies to solve unseen tasks. We evaluate our method in difficult cases where training policy through standard reinforcement learning (RL) or even hierarchical RL is either not feasible or exhibits high sample complexity. We show that our method not only transfers skills to new problem settings but also solves the challenging environments requiring both task planning and motion control with high data efficiency.

Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

Authors:Lin F. Yang, Mengdi Wang
Date:2019-05-24 18:02:39

Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon $H$. In this paper, we propose an online RL algorithm, namely the MatrixRL, that leverages ideas from linear bandit to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound ${O}\big(H^2d\log T\sqrt{T}\big)$ where $d$ is the number of features. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized MatrixRL satisfies a regret bound ${O}\big(H^2\widetilde{d}\log T\sqrt{T}\big)$, where $\widetilde{d}$ is the effective dimension of the kernel space. To our best knowledge, for RL using features or kernels, our results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$.

Scene Induced Multi-Modal Trajectory Forecasting via Planning

Authors:Nachiket Deo, Mohan M. Trivedi
Date:2019-05-23 22:00:17

We address multi-modal trajectory forecasting of agents in unknown scenes by formulating it as a planning problem. We present an approach consisting of three models: a goal prediction model to identify potential goals of the agent, an inverse reinforcement learning model to plan optimal paths to each goal, and a trajectory generator to obtain future trajectories along the planned paths. Analysis of predictions on the Stanford drone dataset shows generalizability of our approach to novel scenes.

From semantics to execution: Integrating action planning with reinforcement learning for robotic causal problem-solving

Authors:Manfred Eppe, Phuong D. H. Nguyen, Stefan Wermter
Date:2019-05-23 14:34:38

Reinforcement learning is an appropriate and successful method to robustly perform low-level robot control under noisy conditions. Symbolic action planning is useful to resolve causal dependencies and to break a causally complex problem down into a sequence of simpler high-level actions. A problem with the integration of both approaches is that action planning is based on discrete high-level action- and state spaces, whereas reinforcement learning is usually driven by a continuous reward function. However, recent advances in reinforcement learning, specifically, universal value function approximators and hindsight experience replay, have focused on goal-independent methods based on sparse rewards. In this article, we build on these novel methods to facilitate the integration of action planning with reinforcement learning by exploiting the reward-sparsity as a bridge between the high-level and low-level state- and control spaces. As a result, we demonstrate that the integrated neuro-symbolic method is able to solve object manipulation problems that involve tool use and non-trivial causal dependencies under noisy conditions, exploiting both data and knowledge.

Deep Q-Learning with Q-Matrix Transfer Learning for Novel Fire Evacuation Environment

Authors:Jivitesh Sharma, Per-Arne Andersen, Ole-Chrisoffer Granmo, Morten Goodwin
Date:2019-05-23 14:15:51

We focus on the important problem of emergency evacuation, which could clearly benefit from reinforcement learning but has been largely unaddressed. Emergency evacuation is a complex task that is difficult to solve with reinforcement learning, since an emergency situation is highly dynamic, with many changing variables and complex constraints that make it difficult to train on. In this paper, we propose the first fire evacuation environment to train reinforcement learning agents for evacuation planning. The environment is modelled as a graph capturing the building structure. It includes realistic features like fire spread, uncertainty and bottlenecks. We have implemented the environment in the OpenAI gym format, to facilitate future research. We also propose a new reinforcement learning approach that entails pretraining the network weights of a DQN-based agent to incorporate information on the shortest path to the exit. We achieved this by using tabular Q-learning to learn the shortest path on the building model's graph. This information is transferred to the network by deliberately overfitting it on the Q-matrix. Then, the pretrained DQN model is trained on the fire evacuation environment to generate the optimal evacuation path under time-varying conditions. We perform comparisons of the proposed approach with state-of-the-art reinforcement learning algorithms like PPO, VPG, SARSA, A2C and ACKTR. The results show that our method is able to outperform state-of-the-art models by a huge margin, including the original DQN-based models. Finally, we test our model on a large and complex real building consisting of 91 rooms, with the possibility to move to any other room, hence giving 8281 actions. We use an attention-based mechanism to deal with large action spaces. Our model achieves near-optimal performance on the real-world emergency environment.
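
The step of learning shortest paths on a building graph with tabular Q-learning can be sketched as follows. This is a minimal illustration under assumptions of my own: the six-room `GRAPH`, the reward of -1 per move, a learning rate and discount of 1 (reasonable here only because the toy graph is deterministic and every room can reach the exit), and uniformly random exploration. None of these details are taken from the paper, and the transfer of the Q-matrix into a DQN is not shown.

```python
import random

# Hypothetical building graph: room -> adjacent rooms. Room 5 is the exit.
GRAPH = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2, 5], 4: [2, 5], 5: []}
EXIT = 5

def q_learning(graph, exit_room, episodes=2000, max_steps=50, seed=0):
    rng = random.Random(seed)
    # Q[s][a]: value of moving from room s to adjacent room a.
    q = {s: {a: 0.0 for a in nbrs} for s, nbrs in graph.items()}
    starts = [s for s in graph if s != exit_room]
    for _ in range(episodes):
        s = rng.choice(starts)
        for _ in range(max_steps):
            if s == exit_room:
                break
            a = rng.choice(graph[s])  # purely random exploration
            future = 0.0 if a == exit_room else max(q[a].values())
            q[s][a] = -1.0 + future   # Q-learning update with alpha = gamma = 1
            s = a
    return q

def greedy_path(q, start, exit_room, max_len=20):
    # Follow the highest-valued move until the exit is reached.
    path = [start]
    while path[-1] != exit_room and len(path) < max_len:
        s = path[-1]
        path.append(max(q[s], key=q[s].get))
    return path

q_table = q_learning(GRAPH, EXIT)
path = greedy_path(q_table, start=0, exit_room=EXIT)

# Action count quoted in the abstract: any of 91 rooms to any of 91 rooms.
n_actions = 91 * 91
```

In this toy graph every shortest route from room 0 to the exit takes three moves, so the greedy path visits four rooms. The action count of 8281 in the abstract is simply 91 squared.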

A Deep Reinforcement Learning Driving Policy for Autonomous Road Vehicles

Authors:Konstantinos Makantasis, Maria Kontorinaki, Ioannis Nikolos
Date:2019-05-22 09:56:07

This work regards our preliminary investigation on the problem of path planning for autonomous vehicles that move on a freeway. We approach this problem by proposing a driving policy based on Reinforcement Learning. The proposed policy makes minimal or no assumptions about the environment, since no a priori knowledge about the system dynamics is required. We compare the performance of the proposed policy against an optimal policy derived via Dynamic Programming and against manual driving simulated by SUMO traffic simulator.

Knowledge-Based Sequential Decision-Making Under Uncertainty

Authors:Daoming Lyu
Date:2019-05-16 20:56:03

Deep reinforcement learning (DRL) algorithms have achieved great success on sequential decision-making problems, yet they are criticized for their lack of data-efficiency and explainability. In particular, explainability of subtasks is critical in hierarchical decision-making since it enhances the transparency of black-box-style DRL methods and helps RL practitioners better understand the high-level behavior of the system. To improve the data-efficiency and explainability of DRL, declarative knowledge is introduced in this work and a novel algorithm is proposed by integrating DRL with symbolic planning. Experimental analysis on publicly available benchmarks validates the explainability of the subtasks and shows that our method can outperform the state-of-the-art approach in terms of data-efficiency.

Spatial Positioning Token (SPToken) for Smart Mobility

Authors:Roman Overko, Rodrigo H. Ordonez-Hurtado, Sergiy Zhuk, Pietro Ferraro, Andrew Cullen, Robert Shorten
Date:2019-05-16 10:45:18

We introduce a permissioned distributed ledger technology (DLT) design for crowdsourced smart mobility applications. This architecture is based on a directed acyclic graph architecture (similar to the IOTA tangle) and uses both Proof-of-Work and Proof-of-Position mechanisms to provide protection against spam attacks and malevolent actors. In addition to enabling individuals to retain ownership of their data and to monetize it, the architecture also is suitable for distributed privacy-preserving machine learning algorithms, is lightweight, and can be implemented in simple internet-of-things (IoT) devices. To demonstrate its efficacy, we apply this framework to reinforcement learning settings where a third party is interested in acquiring information from agents. In particular, one may be interested in sampling an unknown vehicular traffic flow in a city, using a DLT-type architecture and without perturbing the density, with the idea of realizing a set of virtual tokens as surrogates of real vehicles to explore geographical areas of interest. These tokens, whose authenticated position determines write access to the ledger, are thus used to emulate the probing actions of commanded (real) vehicles on a given planned route by "jumping" from a passing-by vehicle to another to complete the planned trajectory. Consequently, the environment stays unaffected (i.e., the autonomy of participating vehicles is not influenced by the algorithm), regardless of the number of emitted tokens. The design of such a DLT architecture is presented, and numerical results from large-scale simulations are provided to validate the proposed approach.

Synthesis of Provably Correct Autonomy Protocols for Shared Control

Authors:Murat Cubuktepe, Nils Jansen, Mohammed Alsiekh, Ufuk Topcu
Date:2019-05-15 23:46:58

We synthesize shared control protocols subject to probabilistic temporal logic specifications. More specifically, we develop a framework in which a human and an autonomy protocol can issue commands to carry out a certain task. We blend these commands into a joint input to a robot. We model the interaction between the human and the robot as a Markov decision process (MDP) that represents the shared control scenario. Using inverse reinforcement learning, we obtain an abstraction of the human's behavior and decisions. We use randomized strategies to account for randomness in human's decisions, caused by factors such as complexity of the task specifications or imperfect interfaces. We design the autonomy protocol to ensure that the resulting robot behavior satisfies given safety and performance specifications in probabilistic temporal logic. Additionally, the resulting strategies generate behavior as similar to the behavior induced by the human's commands as possible. We solve the underlying problem efficiently using quasiconvex programming. Case studies involving autonomous wheelchair navigation and unmanned aerial vehicle mission planning showcase the applicability of our approach.

Autonomous Penetration Testing using Reinforcement Learning

Authors:Jonathon Schwartz, Hanna Kurniawati
Date:2019-05-15 06:18:14

Penetration testing (pentesting) involves performing a controlled attack on a computer system in order to assess its security. Although an effective method for testing security, pentesting requires highly skilled practitioners, and currently there is a growing shortage of skilled cyber security professionals. One avenue for alleviating this problem is to automate the pentesting process using artificial intelligence techniques. Current approaches to automated pentesting have relied on model-based planning; however, the cyber security landscape is rapidly changing, making it a challenge to maintain up-to-date models of exploits. This project investigated the application of model-free Reinforcement Learning (RL) to automated pentesting. Model-free RL has the key advantage over model-based planning of not requiring a model of the environment, instead learning the best policy through interaction with the environment. We first designed and built a fast, low-compute simulator for training and testing autonomous pentesting agents. We did this by framing pentesting as a Markov Decision Process with the known configuration of the network as states, the available scans and exploits as actions, and the reward determined by the value of machines on the network. We then used this simulator to investigate the application of model-free RL to pentesting. We tested the standard Q-learning algorithm using both tabular and neural network based implementations. We found that within the simulated environment both tabular and neural network implementations were able to find optimal attack paths for a range of different network topologies and sizes without having a model of action behaviour. However, the implemented algorithms were only practical for smaller networks and numbers of actions. Further work is needed in developing scalable RL algorithms and testing these algorithms in larger and higher fidelity environments.

Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving

Authors:Carl-Johan Hoel, Katherine Driggs-Campbell, Krister Wolff, Leo Laine, Mykel J. Kochenderfer
Date:2019-05-06 12:50:14

Tactical decision making for autonomous driving is challenging due to the diversity of environments, the uncertainty in the sensor information, and the complex interaction with other road users. This paper introduces a general framework for tactical decision making, which combines the concepts of planning and learning, in the form of Monte Carlo tree search and deep reinforcement learning. The method is based on the AlphaGo Zero algorithm, which is extended to a domain with a continuous state space where self-play cannot be used. The framework is applied to two different highway driving cases in a simulated environment and it is shown to perform better than a commonly used baseline method. The strength of combining planning and learning is also illustrated by a comparison to using the Monte Carlo tree search or the neural network policy separately.

Deep Residual Reinforcement Learning

Authors:Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
Date:2019-05-03 08:38:35

We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMind Control Suite benchmark. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD($k$) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.
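
The difference between a residual update and the more common semi-gradient TD update can be sketched for a linear value function. This is a minimal illustration assuming V(s) = w · phi(s) and a single transition; the function names and numbers are hypothetical, and the paper's bidirectional target network is not implemented here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td_error(w, phi, phi_next, r, gamma):
    # Bellman residual for one transition under a linear value function.
    return r + gamma * dot(w, phi_next) - dot(w, phi)

def semi_gradient_update(w, phi, phi_next, r, gamma, alpha):
    # Treats the bootstrapped target as a constant: gradient flows through V(s) only.
    delta = td_error(w, phi, phi_next, r, gamma)
    return [wi + alpha * delta * p for wi, p in zip(w, phi)]

def residual_update(w, phi, phi_next, r, gamma, alpha):
    # Descends the squared Bellman residual: gradient flows through V(s) and V(s').
    delta = td_error(w, phi, phi_next, r, gamma)
    return [wi + alpha * delta * (p - gamma * pn)
            for wi, p, pn in zip(w, phi, phi_next)]

w0 = [0.0, 0.0]
phi, phi_next = [1.0, 0.0], [0.0, 1.0]
w_semi = semi_gradient_update(w0, phi, phi_next, r=1.0, gamma=0.9, alpha=0.5)
w_res = residual_update(w0, phi, phi_next, r=1.0, gamma=0.9, alpha=0.5)
```

On this transition the semi-gradient step changes only the weight of the current state's feature, while the residual step also pushes the next state's weight down, since lowering V(s') reduces the same residual.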

Behavior Planning of Autonomous Cars with Social Perception

Authors:Liting Sun, Wei Zhan, Ching-Yao Chan, Masayoshi Tomizuka
Date:2019-05-02 22:45:26

Autonomous cars have to navigate in dynamic environments which can be full of uncertainties. The uncertainties can come either from sensor limitations such as occlusions and limited sensor range, or from probabilistic prediction of other road participants, or from unknown social behavior in a new area. To safely and efficiently drive in the presence of these uncertainties, the decision-making and planning modules of autonomous cars should intelligently utilize all available information and appropriately tackle the uncertainties so that proper driving strategies can be generated. In this paper, we propose a social perception scheme which treats all road participants as distributed sensors in a sensor network. By observing the individual behaviors as well as the group behaviors, uncertainties of the three types can be updated uniformly in a belief space. The updated beliefs from the social perception are then explicitly incorporated into a probabilistic planning framework based on Model Predictive Control (MPC). The cost function of the MPC is learned via inverse reinforcement learning (IRL). Such an integrated probabilistic planning module with socially enhanced perception enables the autonomous vehicles to generate behaviors which are defensive but not overly conservative, and socially compatible. The effectiveness of the proposed framework is verified in simulation on a representative scenario with sensor occlusions.

Efficient Model-free Reinforcement Learning in Metric Spaces

Authors:Zhao Song, Wen Sun
Date:2019-05-01 20:10:24

Model-free Reinforcement Learning (RL) algorithms such as Q-learning [Watkins, Dayan 92] have been widely used in practice and can achieve human-level performance in applications such as video games [Mnih et al. 15]. Recently, equipped with the idea of optimism in the face of uncertainty, Q-learning algorithms [Jin, Allen-Zhu, Bubeck, Jordan 18] can be proven to be sample efficient for discrete tabular Markov Decision Processes (MDPs) which have a finite number of states and actions. In this work, we present an efficient model-free Q-learning based algorithm in MDPs with a natural metric on the state-action space--hence extending efficient model-free Q-learning algorithms to continuous state-action spaces. Compared to previous model-based RL algorithms for metric spaces [Kakade, Kearns, Langford 03], our algorithm does not require access to a black-box planning oracle.

Driving with Style: Inverse Reinforcement Learning in General-Purpose Planning for Automated Driving

Authors:Sascha Rosbach, Vinit James, Simon Großjohann, Silviu Homoceanu, Stefan Roth
Date:2019-05-01 09:18:47

Behavior and motion planning play an important role in automated driving. Traditionally, behavior planners instruct local motion planners with predefined behaviors. Due to the high scene complexity in urban environments, unpredictable situations may occur in which behavior planners fail to match predefined behavior templates. Recently, general-purpose planners have been introduced, combining behavior and local motion planning. These general-purpose planners allow behavior-aware motion planning given a single reward function. However, two challenges arise: First, this function has to map a complex feature space into rewards. Second, the reward function has to be manually tuned by an expert. Manually tuning this reward function becomes a tedious task. In this paper, we propose an approach that relies on human driving demonstrations to automatically tune reward functions. This study offers important insights into the driving style optimization of general-purpose planners with maximum entropy inverse reinforcement learning. We evaluate our approach based on the expected value difference between learned and demonstrated policies. Furthermore, we compare the similarity of human driven trajectories with optimal policies of our planner under learned and expert-tuned reward functions. Our experiments show that we are able to learn reward functions exceeding the level of manual expert tuning without prior domain knowledge.

Deep Neuroevolution of Recurrent and Discrete World Models

Authors:Sebastian Risi, Kenneth O. Stanley
Date:2019-04-28 10:00:59

Neural architectures inspired by our own human cognitive system, such as the recently introduced world models, have been shown to outperform traditional deep reinforcement learning (RL) methods in a variety of different domains. Instead of the relatively simple architectures employed in most RL experiments, world models rely on multiple different neural components that are responsible for visual information processing, memory, and decision-making. However, so far the components of these models have to be trained separately and through a variety of specialized training methods. This paper demonstrates the surprising finding that models with the same precise parts can be instead efficiently trained end-to-end through a genetic algorithm (GA), reaching a comparable performance to the original world model by solving a challenging car racing task. An analysis of the evolved visual and memory system indicates that they include a similar effective representation to the system trained through gradient descent. Additionally, in contrast to gradient descent methods that struggle with discrete variables, GAs also work directly with such representations, opening up opportunities for classical planning in latent space. This paper adds additional evidence on the effectiveness of deep neuroevolution for tasks that require the intricate orchestration of multiple components in complex heterogeneous architectures.
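
The kind of simple genetic algorithm referred to above can be sketched on a toy problem. This is a minimal illustration, assuming truncation selection, Gaussian mutation, and a single preserved elite; the three-parameter `fitness` function stands in for an episode return and is hypothetical. The paper evolves the weights of a full world model (visual, memory, and controller components), which is not reproduced here.

```python
import random

def fitness(params, target=(1.0, -2.0, 0.5)):
    # Toy stand-in for an episode return: higher is better, maximum is 0.
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def evolve(generations=50, pop_size=20, n_parents=5, sigma=0.1, seed=0):
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(3)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[:n_parents]          # truncation selection
        # Each child is a mutated copy of a randomly chosen parent.
        children = [[g + rng.gauss(0.0, sigma) for g in rng.choice(parents)]
                    for _ in range(pop_size - 1)]
        population = [ranked[0]] + children   # elitism: keep the best unchanged
    best = max(population, key=fitness)
    return best, history

best, history = evolve()
```

Because the elite individual is carried over unchanged, the best fitness per generation never decreases in this sketch. No gradients are computed, which is why the same loop applies unchanged to discrete parameters.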

How You Act Tells a Lot: Privacy-Leakage Attack on Deep Reinforcement Learning

Authors:Xinlei Pan, Weiyao Wang, Xiaoshuai Zhang, Bo Li, Jinfeng Yi, Dawn Song
Date:2019-04-24 21:41:04

Machine learning has been widely applied to various applications, some of which involve training with privacy-sensitive data. A modest number of data breaches have been studied, including credit card information in natural language data and identities from face dataset. However, most of these studies focus on supervised learning models. As deep reinforcement learning (DRL) has been deployed in a number of real-world systems, such as indoor robot navigation, whether trained DRL policies can leak private information requires in-depth study. To explore such privacy breaches in general, we mainly propose two methods: environment dynamics search via genetic algorithm and candidate inference based on shadow policies. We conduct extensive experiments to demonstrate such privacy vulnerabilities in DRL under various settings. We leverage the proposed algorithms to infer floor plans from some trained Grid World navigation DRL agents with LiDAR perception. The proposed algorithm can correctly infer most of the floor plans and reaches an average recovery rate of 95.83% using policy gradient trained agents. In addition, we are able to recover the robot configuration in continuous control environments and an autonomous driving simulator with high accuracy. To the best of our knowledge, this is the first work to investigate privacy leakage in DRL settings and we show that DRL-based agents do potentially leak privacy-sensitive information from the trained policies.

Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version

Authors:Erwan Lecarpentier, Emmanuel Rachelson
Date:2019-04-22 23:19:03

This work tackles the problem of robust zero-shot planning in non-stationary stochastic environments. We study Markov Decision Processes (MDPs) evolving over time and consider Model-Based Reinforcement Learning algorithms in this setting. We make two hypotheses: 1) the environment evolves continuously with a bounded evolution rate; 2) a current model is known at each decision epoch but not its evolution. Our contribution can be presented in four points. 1) we define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We introduce the notion of regular evolution by making a hypothesis of Lipschitz-Continuity on the transition and reward functions w.r.t. time; 2) we consider a planning agent using the current model of the environment but unaware of its future evolution. This leads us to consider a worst-case method where the environment is seen as an adversarial agent; 3) following this approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot Model-Based method similar to Minimax search; 4) we illustrate the benefits brought by RATS empirically and compare its performance with reference Model-Based algorithms.

The MineRL 2019 Competition on Sample Efficient Reinforcement Learning using Human Priors

Authors:William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, Phillip Wang
Date:2019-04-22 22:18:37

Though deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples. As state-of-the-art reinforcement learning (RL) systems require an exponentially increasing number of samples, their development is restricted to a continually shrinking segment of the AI community. Likewise, many of these systems cannot be applied to real-world problems, where environment samples are expensive. Resolution of these limitations requires new, sample-efficient methods. To facilitate research in this direction, we introduce the MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors. The primary goal of the competition is to foster the development of algorithms which can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments. To that end, we introduce: (1) the Minecraft ObtainDiamond task, a sequential decision-making environment requiring long-term planning, hierarchical control, and efficient exploration methods; and (2) the MineRL-v0 dataset, a large-scale collection of over 60 million state-action pairs of human demonstrations that can be resimulated into embodied trajectories with arbitrary modifications to game state and visuals. Participants will compete to develop systems which solve the ObtainDiamond task with a limited number of samples from the environment simulator, Malmo. The competition is structured into two rounds in which competitors are provided several paired versions of the dataset and environment with different game textures. At the end of each round, competitors will submit containerized versions of their learning algorithms, which will then be trained and evaluated from scratch on a hold-out dataset-environment pair for a total of 4 days on a prespecified hardware platform.

Improving Interactive Reinforcement Agent Planning with Human Demonstration

Authors:Guangliang Li, Randy Gomez, Keisuke Nakamura, Jinying Lin, Qilei Zhang, Bo He
Date:2019-04-18 07:45:36

TAMER has proven to be a powerful interactive reinforcement learning method for allowing ordinary people to teach and personalize autonomous agents' behavior by providing evaluative feedback. However, a TAMER agent planning with UCT---a Monte Carlo Tree Search strategy, can only update states along its path and might induce high learning cost especially for a physical robot. In this paper, we propose to drive the agent's exploration along the optimal path and reduce the learning cost by initializing the agent's reward function via inverse reinforcement learning from demonstration. We test our proposed method in the RL benchmark domain---Grid World---with different discounts on human reward. Our results show that learning from demonstration can allow a TAMER agent to learn a roughly optimal policy up to the deepest search and encourage the agent to explore along the optimal path. In addition, we find that learning from demonstration can improve the learning efficiency by reducing total feedback, the number of incorrect actions and increasing the ratio of correct actions to obtain an optimal policy, allowing a TAMER agent to converge faster.
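
UCT chooses which child to descend into at each tree node using the UCB1 rule, which can be sketched as below. This is a minimal illustration of the selection step only; `ucb1_select`, its argument layout, and the default exploration constant are assumptions made here, and the TAMER reward model and the demonstration-based initialization are not shown.

```python
import math

def ucb1_select(values, counts, c=math.sqrt(2)):
    """Return the index of the child to visit next.

    values[i]: mean return observed so far for child i.
    counts[i]: number of times child i has been visited.
    """
    # Visit every child once before applying the formula.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    # Mean value plus an exploration bonus that shrinks with visit count.
    scores = [v + c * math.sqrt(math.log(total) / n)
              for v, n in zip(values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

first = ucb1_select([0.5, 0.4], [10, 0])       # unvisited child goes first
exploit = ucb1_select([0.9, 0.1], [5, 5])      # equal counts: higher mean wins
explore = ucb1_select([0.5, 0.45], [100, 1])   # rarely visited child gets a large bonus
```

Only nodes along the selected path have their statistics updated after each simulation, which matches the abstract's remark that a UCT-based agent updates states only along its path.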

Learning to Navigate in Indoor Environments: from Memorizing to Reasoning

Authors:Liulong Ma, Yanjie Liu, Jiao Chen, Dong Jin
Date:2019-04-15 09:47:38

Autonomous navigation is an essential capability of smart mobility for mobile robots. Traditional methods require a map of the environment to plan a collision-free path in the workspace. Deep reinforcement learning (DRL) is a promising technique for autonomous navigation without a map: a deep neural network fits the mapping from observations to reasonable actions through exploration. The planner should not only memorize the trained targets but, more importantly, reason its way to unseen goals. We propose a new motion planner based on deep reinforcement learning that can reach new targets not seen during training in an indoor environment using only RGB images and odometry. The model has a stacked Long Short-Term Memory (LSTM) structure. Finally, experiments were conducted in both simulated and real environments. The source code is available at https://github.com/marooncn/navbot.

Safer Deep RL with Shallow MCTS: A Case Study in Pommerman

Authors:Bilal Kartal, Pablo Hernandez-Leal, Chao Gao, Matthew E. Taylor
Date:2019-04-10 14:34:40

Safe reinforcement learning has many variants and it is still an open research problem. Here, we focus on how to use action guidance by means of a non-expert demonstrator to avoid catastrophic events in a domain with sparse, delayed, and deceptive rewards: the recently-proposed multi-agent benchmark of Pommerman. This domain is very challenging for reinforcement learning (RL): past work has shown that model-free RL algorithms fail to achieve significant learning. In this paper, we shed light on the reasons behind this failure by exemplifying and analyzing the high rate of catastrophic events (i.e., suicides) that happen under random exploration in this domain. While model-free random exploration is typically futile, we propose a new framework where even a non-expert simulated demonstrator, e.g., planning algorithms such as Monte Carlo tree search with a small number of rollouts, can be integrated into asynchronous distributed deep reinforcement learning methods. Compared to vanilla deep RL algorithms, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.

Structured agents for physical construction

Authors:Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kimberly L. Stachenfeld, Pushmeet Kohli, Peter W. Battaglia, Jessica B. Hamrick
Date:2019-04-05 17:52:35

Physical construction---the ability to compose objects, subject to physical dynamics, to serve some function---is fundamental to human intelligence. We introduce a suite of challenging physical construction tasks inspired by how children play with blocks, such as matching a target configuration, stacking blocks to connect objects together, and creating shelter-like structures over target objects. We examine how a range of deep reinforcement learning agents fare on these challenges, and introduce several new approaches which provide superior performance. Our results show that agents which use structured representations (e.g., objects and scene graphs) and structured policies (e.g., object-centric actions) outperform those which use less structured representations, and generalize better beyond their training when asked to reason about larger scenes. Model-based agents which use Monte-Carlo Tree Search also outperform strictly model-free agents in our most challenging construction problems. We conclude that approaches which combine structured representations and reasoning with powerful learning are a key path toward agents that possess rich intuitive physics, scene understanding, and planning.

Can a Robot Become a Movie Director? Learning Artistic Principles for Aerial Cinematography

Authors:Mirko Gschwindt, Efe Camci, Rogerio Bonatti, Wenshan Wang, Erdal Kayacan, Sebastian Scherer
Date:2019-04-04 14:30:09

Aerial filming is constantly gaining importance due to the recent advances in drone technology. It invites many intriguing, unsolved problems at the intersection of aesthetical and scientific challenges. In this work, we propose a deep reinforcement learning agent which supervises motion planning of a filming drone by making desirable shot mode selections based on aesthetical values of video shots. Unlike most of the current state-of-the-art approaches that require explicit guidance by a human expert, our drone learns how to make favorable viewpoint selections by experience. We propose a learning scheme that exploits aesthetical features of retrospective shots in order to extract a desirable policy for better prospective shots. We train our agent in realistic AirSim simulations using both a hand-crafted reward function as well as reward from direct human input. We then deploy the same agent on a real DJI M210 drone in order to test the generalization capability of our approach to real world conditions. To evaluate the success of our approach in the end, we conduct a comprehensive user study in which participants rate the shot quality of our methods. Videos of the system in action can be seen at https://youtu.be/qmVw6mfyEmw.

Self-Adapting Goals Allow Transfer of Predictive Models to New Tasks

Authors:Kai Olav Ellefsen, Jim Torresen
Date:2019-04-04 09:52:18

A long-standing challenge in Reinforcement Learning is enabling agents to learn a model of their environment which can be transferred to solve other problems in a world with the same underlying rules. One reason this is difficult is the challenge of learning accurate models of an environment. If such a model is inaccurate, the agent's plans and actions will likely be sub-optimal and lead to the wrong outcomes. Recent progress in model-based reinforcement learning has improved the ability of agents to learn and use predictive models. In this paper, we extend a recent deep learning architecture which learns a predictive model of the environment that aims to predict only the value of a few key measurements, which are indicative of an agent's performance. Predicting only a few measurements rather than the entire future state of an environment makes it more feasible to learn a valuable predictive model. We extend this predictive model with a small, evolving neural network that suggests the best goals to pursue in the current state. We demonstrate that this allows the predictive model to transfer to new scenarios where goals are different, and that the adaptive goals can even adjust agent behavior on-line, changing its strategy to fit the current context.

Centerline Depth World Reinforcement Learning-based Left Atrial Appendage Orifice Localization

Authors:Walid Abdullah Al, Il Dong Yun, Eun Ju Chun
Date:2019-04-02 06:56:11

Left atrial appendage (LAA) closure (LAAC) is a minimally invasive implant-based method to prevent cardiovascular stroke in patients with non-valvular atrial fibrillation. Assessing the LAA orifice in preoperative CT angiography plays a crucial role in choosing an appropriate LAAC implant size and a proper C-arm angulation. However, accurate orifice localization is hard because of the high anatomic variation of the LAA and the unclear position and orientation of the orifice in available CT views. Deep localization models also yield high error in localizing the orifice in CT images because of the tiny structure of the orifice compared to the vastness of the CT image. In this paper, we propose a centerline depth-based reinforcement learning (RL) world for effective orifice localization in a small search space. In our scheme, an RL agent observes the centerline-to-surface distance and navigates through the LAA centerline to localize the orifice. Thus, the search space is significantly reduced, facilitating improved localization. The proposed formulation could result in high localization accuracy compared to the expert annotations in 98 CT images. Moreover, the localization process takes about 8 seconds, which is 18 times more efficient than the existing method. Therefore, this can be a useful aid to physicians during the preprocedural planning of LAAC.

Planning with Expectation Models

Authors:Yi Wan, Zaheer Abbas, Adam White, Martha White, Richard S. Sutton
Date:2019-04-02 03:25:25

Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
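
Claim 1) can be illustrated with a short derivation. The notation here is assumed for illustration and is not taken from the paper: $\mathbf{x}(s)$ is a state feature vector, $\mathbf{w}$ a weight vector, $\hat v$ the linear value estimate, and $\bar{\mathbf{x}}(s,a)$ the expected next feature vector that an expectation model would output.

```latex
\begin{align*}
\hat v(s) &= \mathbf{w}^\top \mathbf{x}(s), \\
\mathbb{E}\big[\hat v(S') \mid s, a\big]
  &= \mathbb{E}\big[\mathbf{w}^\top \mathbf{x}(S') \mid s, a\big]
   = \mathbf{w}^\top \mathbb{E}\big[\mathbf{x}(S') \mid s, a\big]
   = \mathbf{w}^\top \bar{\mathbf{x}}(s, a).
\end{align*}
```

Because expectation is linear, the expected next-state value under the full distribution depends on that distribution only through $\bar{\mathbf{x}}(s,a)$, so a value backup computed from the expectation model alone gives the same result. The equality generally fails when $\hat v$ is non-linear in the features.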

Regularizing Trajectory Optimization with Denoising Autoencoders

Authors:Rinu Boney, Norman Di Palo, Mathias Berglund, Alexander Ilin, Juho Kannala, Antti Rasmus, Harri Valpola
Date:2019-03-28 14:02:04

Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.

Inverse Optimal Planning for Air Traffic Control

Authors:Ekaterina Tolstaya, Alejandro Ribeiro, Vijay Kumar, Ashish Kapoor
Date:2019-03-25 18:00:17

We envision a system that concisely describes the rules of air traffic control, assists human operators and supports dense autonomous air traffic around commercial airports. We develop a method to learn the rules of air traffic control from real data as a cost function via maximum entropy inverse reinforcement learning. This cost function is used as a penalty for a search-based motion planning method that discretizes both the control and the state space. We illustrate the methodology by showing that our approach can learn to imitate the airport arrival routes and separation rules of dense commercial air traffic. The resulting trajectories are shown to be safe, feasible, and efficient.

Temporal Logic Guided Safe Reinforcement Learning Using Control Barrier Functions

Authors:Xiao Li, Calin Belta
Date:2019-03-23 21:29:49

Using reinforcement learning to learn control policies is a challenge when the task is complex with potentially long horizons. Ensuring adequate but safe exploration is also crucial for controlling physical systems. In this paper, we use temporal logic to facilitate specification and learning of complex tasks. We combine temporal logic with control Lyapunov functions to improve exploration. We incorporate control barrier functions to safeguard the exploration and deployment process. We develop a flexible and learnable system that allows users to specify task objectives and constraints in different forms and at various levels. The framework is also able to take advantage of known system dynamics and handle unknown environmental dynamics by integrating model-free learning with model-based planning.

DQN with model-based exploration: efficient learning on environments with sparse rewards

Authors:Stephen Zhen Gou, Yuyang Liu
Date:2019-03-22 01:41:50

We propose Deep Q-Networks (DQN) with model-based exploration, an algorithm combining both model-free and model-based approaches that explores better and learns environments with sparse rewards more efficiently. DQN is a general-purpose, model-free algorithm and has been proven to perform well in a variety of tasks including Atari 2600 games since it was first proposed by Mnih et al. However, like many other reinforcement learning (RL) algorithms, DQN suffers from poor sample efficiency when rewards are sparse in an environment. As a result, most of the transitions stored in the replay memory have no informative reward signal, and provide limited value to the convergence and training of the Q-Network. However, one insight is that these transitions can be used to learn the dynamics of the environment as a supervised learning problem. The transitions also provide information about the distribution of visited states. Our algorithm utilizes these two observations to perform one-step planning during exploration to pick an action that leads to states least likely to be seen, thus improving the performance of exploration. We demonstrate our agent's performance in two classic environments with sparse rewards in OpenAI gym: Mountain Car and Lunar Lander.
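
The exploration step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the density over visited states is a diagonal Gaussian, and `toy_dynamics` is a hand-written, hypothetical stand-in for the dynamics model that the paper learns from replay-memory transitions.

```python
import math


def fit_diagonal_gaussian(states):
    """Fit a per-dimension mean and variance to the visited states."""
    n, dim = len(states), len(states[0])
    mean = [sum(s[i] for s in states) / n for i in range(dim)]
    var = [
        max(sum((s[i] - mean[i]) ** 2 for s in states) / n, 1e-6)
        for i in range(dim)
    ]
    return mean, var


def gaussian_log_density(state, mean, var):
    """Log-density of a state under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(state, mean, var)
    )


def explore_action(state, actions, predict_next, mean, var):
    """One-step planning: choose the action whose predicted next state
    is least likely under the visited-state density."""
    return min(
        actions,
        key=lambda a: gaussian_log_density(predict_next(state, a), mean, var),
    )


def toy_dynamics(state, action):
    # Hypothetical stand-in for a learned model: a 1-D position shifted by the action.
    return [state[0] + action]


# States assumed to have been visited so far (illustrative data only).
visited = [[0.0], [0.1], [-0.1], [0.2], [0.0]]
mean, var = fit_diagonal_gaussian(visited)
chosen = explore_action([0.2], [-1, 0, 1], toy_dynamics, mean, var)
```

In this toy setup the visited states cluster near zero, so from position 0.2 the step to 1.2 is the least familiar outcome and is the one selected. A full agent would use this rule only on exploratory steps and act on its Q-values otherwise.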

Flying through a narrow gap using neural network: an end-to-end planning and control approach

Authors:Jiarong Lin, Luqi Wang, Fei Gao, Shaojie Shen, Fu Zhang
Date:2019-03-21 16:19:05

In this paper, we investigate the problem of enabling a drone to fly through a tilted narrow gap without a traditional planning and control pipeline. To this end, we propose an end-to-end policy network, which imitates the traditional pipeline and is fine-tuned using reinforcement learning. Unlike previous works which plan dynamically feasible trajectories using motion primitives and track the generated trajectory with a geometric controller, our proposed method is an end-to-end approach which takes the flight scenario as input and directly outputs thrust-attitude control commands for the quadrotor. Key contributions of our paper are: 1) presenting an imitate-reinforce training framework; 2) flying through a narrow gap using an end-to-end policy network, showing that a learning-based method can also address the highly dynamic control problem as the traditional pipeline does (see attached video: https://www.youtube.com/watch?v=jU1qRcLdjx0); 3) proposing a robust imitation of an optimal trajectory generator using multilayer perceptrons; and 4) showing how reinforcement learning can improve the performance of imitation learning, with the potential to achieve higher performance than the model-based method.

ToyArchitecture: Unsupervised Learning of Interpretable Models of the World

Authors:Jaroslav Vítků, Petr Dluhoš, Joseph Davidson, Matěj Nikl, Simon Andersson, Přemysl Paška, Jan Šinkora, Petr Hlubuček, Martin Stránský, Martin Hyben, Martin Poliak, Jan Feyereisl, Marek Rosa
Date:2019-03-20 23:07:12

Research in Artificial Intelligence (AI) has focused mostly on two extremes: either on small improvements in narrow AI domains, or on universal theoretical frameworks which are usually uncomputable, incompatible with theories of biological intelligence, or lack practical implementations. The goal of this work is to combine the main advantages of the two: to follow a big picture view, while providing a particular theory and its implementation. In contrast with purely theoretical approaches, the resulting architecture should be usable in realistic settings, but also form the core of a framework containing all the basic mechanisms, into which it should be easier to integrate additional required functionality. In this paper, we present a novel, purposely simple, and interpretable hierarchical architecture which combines multiple different mechanisms into one system: unsupervised learning of a model of the world, learning the influence of one's own actions on the world, model-based reinforcement learning, hierarchical planning and plan execution, and symbolic/sub-symbolic integration in general. The learned model is stored in the form of hierarchical representations with the following properties: 1) they are increasingly more abstract, but can retain details when needed, and 2) they are easy to manipulate in their local and symbolic-like form, thus also allowing one to observe the learning process at each level of abstraction. On all levels of the system, the representation of the data can be interpreted in both a symbolic and a sub-symbolic manner. This enables the architecture to learn efficiently using sub-symbolic methods and to employ symbolic inference.

Single-step Options for Adversary Driving

Authors:Nazmus Sakib, Hengshuai Yao, Hong Zhang, Shangling Jui
Date:2019-03-20 16:39:28

In this paper, we use reinforcement learning for safe driving in adversary settings. In our work, the knowledge in state-of-the-art planning methods is reused by single-step options whose action suggestions are compared in parallel with primitive actions. We show two advantages of doing so. First, training this reinforcement learning agent is easier and faster than training the primitive-action agent. Second, our new agent outperforms the primitive-action reinforcement learning agent, human testers, and the state-of-the-art planning methods that our agent queries as skill options.

Online Gaussian Process State-Space Model: Learning and Planning for Partially Observable Dynamical Systems

Authors:Soon-Seo Park, Young-Jin Park, Youngjae Min, Han-Lim Choi
Date:2019-03-14 13:45:58

This paper proposes an online learning method for the Gaussian process state-space model (GP-SSM). GP-SSM is a probabilistic representation learning scheme that represents unknown state transition and/or measurement models as Gaussian processes (GPs). While the majority of prior literature on learning GP-SSMs focuses on processing a given set of time series data, in most dynamical systems data may arrive and accumulate sequentially over time. Storing all such sequential data and updating the model over the entire data set incurs a large cost in space and time. To overcome this difficulty, we propose a practical method, termed onlineGPSSM, that incorporates stochastic variational inference (VI) and online VI with a novel formulation. The proposed method mitigates the computational complexity without catastrophic forgetting and also supports adaptation to changes in a system and/or real environments. Furthermore, we present an application of onlineGPSSM to the reinforcement learning (RL) of partially observable dynamical systems by integrating onlineGPSSM with Bayesian filtering and trajectory optimization algorithms. Numerical examples are presented to demonstrate the applicability of the proposed method.

Reinforcement Learning with Dynamic Boltzmann Softmax Updates

Authors:Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, Longbo Huang, Tie-Yan Liu
Date:2019-03-14 11:54:13

Value function estimation is an important task in reinforcement learning, i.e., prediction. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with the dynamic Boltzmann softmax (DBS) operator, which has good convergence properties in the settings of planning and learning. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function, which rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying dynamic Boltzmann softmax updates in deep Q-network, which outperforms DQN substantially in 40 out of 49 Atari games.
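
The idea of a Boltzmann softmax backup with a changing temperature can be sketched on a toy problem. This is a minimal illustration under stated assumptions: the two-state deterministic MDP and the schedule `beta_t = t` are invented for the example, and the abstract does not say which schedule the paper uses.

```python
import math


def boltzmann_softmax(values, beta):
    """Boltzmann softmax: sum_i x_i * exp(beta * x_i) / sum_i exp(beta * x_i)."""
    m = max(values)  # shift by the maximum for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def dbs_value_iteration(rewards, next_state, gamma, num_iters, beta_schedule):
    """Value iteration in which the max over actions is replaced by a
    Boltzmann softmax whose inverse temperature beta_t changes per iteration."""
    v = [0.0] * len(rewards)
    for t in range(1, num_iters + 1):
        beta = beta_schedule(t)
        v = [
            boltzmann_softmax(
                [r + gamma * v[next_state[s][a]] for a, r in enumerate(rewards[s])],
                beta,
            )
            for s in range(len(rewards))
        ]
    return v


# Toy deterministic MDP (assumed for illustration): rewards[s][a], next_state[s][a].
rewards = [[0.0, 1.0], [0.0, 2.0]]
next_state = [[0, 1], [0, 1]]
values = dbs_value_iteration(
    rewards, next_state, gamma=0.5, num_iters=200, beta_schedule=lambda t: float(t)
)
```

With `beta = 0` the operator is the plain mean of the action values, and as `beta` grows it approaches the max. Letting `beta_t` increase over iterations therefore moves the backup toward the standard Bellman optimality backup; in this toy MDP the optimal values are 3 and 4, and the iteration ends very close to them.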

VRKitchen: an Interactive 3D Virtual Environment for Task-oriented Learning

Authors:Xiaofeng Gao, Ran Gong, Tianmin Shu, Xu Xie, Shu Wang, Song-Chun Zhu
Date:2019-03-13 23:31:21

One of the main challenges of advancing task-oriented learning such as visual task planning and reinforcement learning is the lack of realistic and standardized environments for training and testing AI agents. Previously, researchers often relied on ad-hoc lab environments. There have been recent advances in virtual systems built with 3D physics engines and photo-realistic rendering for indoor and outdoor environments, but the embodied agents in those systems can only conduct simple interactions with the world (e.g., walking around, moving objects, etc.). Most of the existing systems also do not allow human participation in their simulated environments. In this work, we design and implement a virtual reality (VR) system, VRKitchen, with integrated functions which i) enable embodied agents powered by modern AI methods (e.g., planning, reinforcement learning, etc.) to perform complex tasks involving a wide range of fine-grained object manipulations in a realistic environment, and ii) allow human teachers to perform demonstrations to train agents (i.e., learning from demonstration). We also provide standardized evaluation benchmarks and data collection tools to facilitate a broad use in research on task-oriented learning and beyond.

Trajectory Optimization for Unknown Constrained Systems using Reinforcement Learning

Authors:Kei Ota, Devesh K. Jha, Tomoaki Oiki, Mamoru Miura, Takashi Nammoto, Daniel Nikovski, Toshisada Mariyama
Date:2019-03-13 23:07:29

In this paper, we propose a reinforcement learning-based algorithm for trajectory optimization for constrained dynamical systems. This problem is motivated by the fact that for most robotic systems, the dynamics may not always be known. Generating smooth, dynamically feasible trajectories could be difficult for such systems. Using sampling-based algorithms for motion planning may result in trajectories that are prone to undesirable control jumps. However, they can usually provide a good reference trajectory which a model-free reinforcement learning algorithm can then exploit by limiting the search domain and quickly finding a dynamically smooth trajectory. We use this idea to train a reinforcement learning agent to learn a dynamically smooth trajectory in a curriculum learning setting. Furthermore, for generalization, we parameterize the policies with goal locations, so that the agent can be trained for multiple goals simultaneously. We show results in both simulated environments and real experiments for a 6-DoF manipulator arm operated in position-controlled mode to validate the proposed idea. We compare the proposed ideas against a PID controller which is used to track a designed trajectory in configuration space. Our experiments show that our RL agent trained with a reference path outperformed a model-free PID controller of the type commonly used on many robotic platforms for trajectory tracking.

Learning to Paint With Model-based Deep Reinforcement Learning

Authors:Zhewei Huang, Wen Heng, Shuchang Zhou
Date:2019-03-11 16:21:46

We show how to teach machines to paint like human painters, who can use a small number of strokes to create fantastic paintings. By employing a neural renderer in model-based Deep Reinforcement Learning (DRL), our agents learn to determine the position and color of each stroke and make long-term plans to decompose texture-rich images into strokes. Experiments demonstrate that excellent visual effects can be achieved using hundreds of strokes. The training process does not require the experience of human painters or stroke tracking data. The code is available at https://github.com/hzwer/ICCV2019-LearningToPaint.

Learning Self-Game-Play Agents for Combinatorial Optimization Problems

Authors:Ruiyang Xu, Karl Lieberherr
Date:2019-03-08 21:38:33

Recent progress in reinforcement learning (RL) using self-game-play has shown remarkable performance on several board games (e.g., Chess and Go) as well as video games (e.g., Atari games and Dota2). It is plausible to consider that RL, starting from zero knowledge, might be able to gradually approximate a winning strategy after a certain amount of training. In this paper, we explore neural Monte-Carlo-Tree-Search (neural MCTS), an RL algorithm which has been applied successfully by DeepMind to play Go and Chess at a super-human level. We try to leverage the computational power of neural MCTS to solve a class of combinatorial optimization problems. Following the idea of Hintikka's Game-Theoretical Semantics, we propose the Zermelo Gamification (ZG) to transform specific combinatorial optimization problems into Zermelo games whose winning strategies correspond to the solutions of the original optimization problem. The ZG also provides a specially designed neural MCTS. We use a combinatorial planning problem for which the ground-truth policy is efficiently computable to demonstrate that ZG is promising.

Deep Active Localization

Authors:Sai Krishna, Keehong Seo, Dhaivat Bhatt, Vincent Mai, Krishna Murthy, Liam Paull
Date:2019-03-05 05:00:08

Active localization is the problem of generating robot actions that allow it to maximally disambiguate its pose within a reference map. Traditional approaches to this use an information-theoretic criterion for action selection and hand-crafted perceptual models. In this work we propose an end-to-end differentiable method for learning to take informative actions that is trainable entirely in simulation and then transferable to real robot hardware with zero refinement. The system is composed of two modules: a convolutional neural network for perception, and a deep reinforcement learned planning module. We introduce a multi-scale approach to the learned perceptual model since the accuracy needed to perform action selection with reinforcement learning is much less than the accuracy needed for robot control. We demonstrate that the resulting system outperforms using the traditional approach for either perception or planning. We also demonstrate our approach's robustness to different map configurations and other nuisance parameters through the use of domain randomization in training. The code is also compatible with the OpenAI gym framework, as well as the Gazebo simulator.

Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future

Authors:Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, Dhruv Batra
Date:2019-03-05 00:15:21

In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.

The StreetLearn Environment and Dataset

Authors:Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Malinowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell
Date:2019-03-04 16:21:22

Navigation is a rich and well-grounded problem domain that drives progress in many different areas of research: perception, planning, memory, exploration, and optimisation in particular. Historically these challenges have been separately considered and solutions built that rely on stationary datasets - for example, recorded trajectories through an environment. These datasets cannot be used for decision-making and reinforcement learning, however, and in general the perspective of navigation as an interactive learning task, where the actions and behaviours of a learning agent are learned simultaneously with the perception and planning, is relatively unsupported. Thus, existing navigation benchmarks generally rely on static datasets (Geiger et al., 2013; Kendall et al., 2015) or simulators (Beattie et al., 2016; Shah et al., 2018). To support and validate research in end-to-end navigation, we present StreetLearn: an interactive, first-person, partially-observed visual environment that uses Google Street View for its photographic content and broad coverage, and give performance baselines for a challenging goal-driven navigation task. The environment code, baseline agent code, and the dataset are available at http://streetlearn.cc

Learning To Follow Directions in Street View

Authors:Karl Moritz Hermann, Mateusz Malinowski, Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Raia Hadsell
Date:2019-03-01 16:50:02

Navigating and understanding the real world remains a key challenge in machine learning and inspires a great variety of research in areas such as language grounding, planning, navigation and computer vision. We propose an instruction-following task that requires all of the above, and which combines the practicality of simulated environments with the challenges of ambiguous, noisy real world data. StreetNav is built on top of Google Street View and provides visually accurate environments representing real places. Agents are given driving instructions which they must learn to interpret in order to successfully navigate in this environment. Since humans equipped with driving instructions can readily navigate in previously unseen cities, we set a high bar and test our trained agents for similar cognitive capabilities. Although deep reinforcement learning (RL) methods are frequently evaluated only on data that closely follow the training distribution, our dataset extends to multiple cities and has a clean train/test separation. This allows for thorough testing of generalisation ability. This paper presents the StreetNav environment and tasks, models that establish strong baselines, and extensive analysis of the task and the trained agents.

The Termination Critic

Authors:Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Remi Munos, Doina Precup
Date:2019-02-26 15:26:10

In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents. We propose an algorithm that focuses on the termination condition, as opposed to -- as is common -- the policy. The termination condition is usually trained to optimize a control objective: an option ought to terminate if another has better value. We offer a different, information-theoretic perspective, and propose that terminations should focus instead on the compressibility of the option's encoding -- arguably a key reason for using abstractions. To achieve this algorithmically, we leverage the classical options framework, and learn the option transition model as a "critic" for the termination condition. Using this model, we derive gradients that optimize the desired criteria. We show that the resulting options are non-trivial, intuitively meaningful, and useful for learning and planning.

Planning in Hierarchical Reinforcement Learning: Guarantees for Using Local Policies

Authors:Tom Zahavy, Avinatan Hasidim, Haim Kaplan, Yishay Mansour
Date:2019-02-26 15:04:18

We consider a setting of hierarchical reinforcement learning in which the reward is a sum of components. For each component we are given a policy that maximizes it, and our goal is to assemble a policy from the individual policies that maximizes the sum of the components. We provide theoretical guarantees for assembling such policies in deterministic MDPs with collectible rewards. Our approach builds on formulating this problem as a traveling salesman problem with discounted reward. We focus on local solutions, i.e., policies that only use information from the current state; thus, they are easy to implement and do not require substantial computational resources. We propose three local stochastic policies and prove that they guarantee better performance than any deterministic local policy in the worst case; experimental results suggest that they also perform better on average.

Unsupervised Grounding of Plannable First-Order Logic Representation from Images

Authors:Masataro Asai
Date:2019-02-21 15:16:38

Recently, there has been increasing interest in the Reinforcement Learning community in obtaining the relational structures of the environment. However, the resulting "relations" are not the discrete, logical predicates compatible with symbolic reasoning such as classical planning or goal recognition. Meanwhile, Latplan (Asai and Fukunaga 2018) bridged the gap between deep-learning perceptual systems and symbolic classical planners. One key component of the system is a neural network called the State AutoEncoder (SAE), which encodes an image-based input into a propositional representation compatible with classical planning. To get the best of both worlds, we propose the First-Order State AutoEncoder, an unsupervised architecture for grounding first-order logic predicates and facts. Each predicate models a relationship between objects by taking interpretable arguments and returning a propositional value. In experiments using 8-Puzzle and a photo-realistic Blocksworld environment, we show that (1) the resulting predicates capture interpretable relations (e.g., spatial), (2) they help obtain a compact, abstract model of the environment, and finally, (3) the resulting model is compatible with symbolic classical planning.

Network Offloading Policies for Cloud Robotics: a Learning-based Approach

Authors:Sandeep Chinchali, Apoorva Sharma, James Harrison, Amine Elhafsi, Daniel Kang, Evgenya Pergament, Eyal Cidon, Sachin Katti, Marco Pavone
Date:2019-02-15 06:34:31

Today's robotic systems are increasingly turning to computationally expensive models such as deep neural networks (DNNs) for tasks like localization, perception, planning, and object detection. However, resource-constrained robots, like low-power drones, often have insufficient on-board compute resources or power reserves to scalably run the most accurate, state-of-the-art neural network models. Cloud robotics allows mobile robots to offload compute to centralized servers if they are uncertain locally or want to run more accurate, compute-intensive models. However, cloud robotics comes with a key, often understated cost: communicating with the cloud over congested wireless networks may result in latency or loss of data. In fact, sending high data-rate video or LIDAR from multiple robots over congested networks can lead to prohibitive delay for real-time applications, which we measure experimentally. In this paper, we formulate a novel Robot Offloading Problem: how and when should robots offload sensing tasks, especially if they are uncertain, to improve accuracy while minimizing the cost of cloud communication? We formulate offloading as a sequential decision-making problem for robots, and propose a solution using deep reinforcement learning. In both simulations and hardware experiments using state-of-the-art vision DNNs, our offloading strategy improves vision task performance by 1.3-2.6x relative to benchmark offloading strategies, allowing robots the potential to significantly transcend their on-board sensing accuracy with limited cost of cloud communication.

Active Perception in Adversarial Scenarios using Maximum Entropy Deep Reinforcement Learning

Authors:Macheng Shen, Jonathan P How
Date:2019-02-14 23:44:22

We pose an active perception problem where an autonomous agent actively interacts with a second agent with potentially adversarial behaviors. Given the uncertainty in the intent of the other agent, the objective is to collect further evidence to help discriminate potential threats. The main technical challenges are the partial observability of the agent intent, the adversary modeling, and the corresponding uncertainty modeling. Note that an adversary agent may act to mislead the autonomous agent by using a deceptive strategy that is learned from past experiences. We propose an approach that combines belief space planning, generative adversary modeling, and maximum entropy reinforcement learning to obtain a stochastic belief space policy. By accounting for various adversarial behaviors in the simulation framework and minimizing the predictability of the autonomous agent's action, the resulting policy is more robust to unmodeled adversarial strategies. This improved robustness is empirically shown against an adversary that adapts to and exploits the autonomous agent's policy when compared with a standard Chance-Constraint Partially Observable Markov Decision Process robust approach.

Unsupervised Visuomotor Control through Distributional Planning Networks

Authors:Tianhe Yu, Gleb Shevchuk, Dorsa Sadigh, Chelsea Finn
Date:2019-02-14 18:54:54

While reinforcement learning (RL) has the potential to enable robots to autonomously acquire a wide range of skills, in practice, RL usually requires manual, per-task engineering of reward functions, especially in real world settings where aspects of the environment needed to compute progress are not directly accessible. To enable robots to autonomously learn skills, we instead consider the problem of reinforcement learning without access to rewards. We aim to learn an unsupervised embedding space under which the robot can measure progress towards a goal for itself. Our approach explicitly optimizes for a metric space under which action sequences that reach a particular state are optimal when the goal is the final state reached. This enables learning effective and control-centric representations that lead to more autonomous reinforcement learning algorithms. Our experiments on three simulated environments and two real-world manipulation problems show that our method can learn effective goal metrics from unlabeled interaction, and use the learned goal metrics for autonomous reinforcement learning.

WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving

Authors:Jaeyoung Lee, Aravind Balakrishnan, Ashish Gaurav, Krzysztof Czarnecki, Sean Sedwards
Date:2019-02-11 19:59:23

Machine learning can provide efficient solutions to the complex problems encountered in autonomous driving, but ensuring their safety remains a challenge. A number of authors have attempted to address this issue, but there are few publicly-available tools to adequately explore the trade-offs between functionality, scalability, and safety. We thus present WiseMove, a software framework to investigate safe deep reinforcement learning in the context of motion planning for autonomous driving. WiseMove adopts a modular learning architecture that suits our current research questions and can be adapted to new technologies and new questions. We present the details of WiseMove, demonstrate its use on a common traffic scenario, and describe how we use it in our ongoing safe learning research.

Visual search and recognition for robot task execution and monitoring

Authors:Lorenzo Mauro, Francesco Puja, Simone Grazioso, Valsamis Ntouskos, Marta Sanzari, Edoardo Alati, Fiora Pirri
Date:2019-02-07 22:35:51

Visual search for relevant targets in the environment is a crucial robot skill. We propose a preliminary framework for the execution monitor of a robot task, taking care of the robot's attitude toward visually searching the environment for targets involved in the task. Visual search is also relevant to recovering from a failure. The framework exploits deep reinforcement learning to acquire a "common sense" scene structure, and it takes advantage of a deep convolutional network to detect objects and the relevant relations holding between them. The framework builds on these methods to introduce vision-based execution monitoring, which uses classical planning as a backbone for task execution. Experiments show that with the proposed vision-based execution monitor the robot can complete simple tasks and can recover from failures autonomously.

Bayesian Reinforcement Learning via Deep, Sparse Sampling

Authors:Divya Grover, Debabrota Basu, Christos Dimitrakakis
Date:2019-02-07 14:52:37

We address the problem of Bayesian reinforcement learning using efficient model-based online planning. We propose an optimism-free Bayes-adaptive algorithm to induce deeper and sparser exploration with a theoretical bound on its performance relative to the Bayes optimal policy, with a lower computational complexity. The main novelty is the use of a candidate policy generator, to generate long-term options in the planning tree (over beliefs), which allows us to create much sparser and deeper trees. Experimental results on different environments show that in comparison to the state-of-the-art, our algorithm is both computationally more efficient, and obtains significantly higher reward in discrete environments.

Space Navigator: a Tool for the Optimization of Collision Avoidance Maneuvers

Authors:Leonid Gremyachikh, Dmitrii Dubov, Nikita Kazeev, Andrey Kulibaba, Andrey Skuratov, Anton Tereshkin, Andrey Ustyuzhanin, Lubov Shiryaeva, Sergej Shishkin
Date:2019-02-06 10:23:01

The number of space objects will grow several times over in a few years due to the planned launches of constellations of thousands of microsatellites. This leads to a significant increase in the threat of satellite collisions. Spacecraft must undertake collision avoidance maneuvers to mitigate the risk. According to publicly available information, conjunction events are now handled manually by operators on Earth. Manual maneuver planning requires qualified personnel and will be impractical for constellations of thousands of satellites. In this paper we propose a new modular autonomous collision avoidance system called "Space Navigator". It is based on a novel maneuver optimization approach that combines domain knowledge with Reinforcement Learning methods.

Separating value functions across time-scales

Authors:Joshua Romoff, Peter Henderson, Ahmed Touati, Emma Brunskill, Joelle Pineau, Yann Ollivier
Date:2019-02-05 19:45:08

In many finite horizon episodic reinforcement learning (RL) settings, it is desirable to optimize for the undiscounted return - in settings like Atari, for instance, the goal is to collect the most points while staying alive in the long run. Yet, it may be difficult (or even intractable) mathematically to learn with this target. As such, temporal discounting is often applied to optimize over a shorter effective planning horizon. This comes at the risk of potentially biasing the optimization target away from the undiscounted goal. In settings where this bias is unacceptable - where the system must optimize for longer horizons at higher discounts - the target of the value function approximator may increase in variance leading to difficulties in learning. We present an extension of temporal difference (TD) learning, which we call TD($\Delta$), that breaks down a value function into a series of components based on the differences between value functions with smaller discount factors. The separation of a longer horizon value function into these components has useful properties in scalability and performance. We discuss these properties and show theoretic and empirical improvements over standard TD learning in certain settings.
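
The decomposition described above can be illustrated with a small sketch. It is not the authors' learning algorithm: it only evaluates a fixed, made-up reward sequence at several discount factors and checks that the differences between successive value estimates sum back to the longest-horizon value. The names `gammas`, `values`, and `deltas` are assumptions chosen for this example.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a fixed reward sequence, computed backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def delta_decomposition(rewards, gammas):
    """Split the value at the largest discount into differences across smaller discounts.

    deltas[0] is the value at the smallest discount; deltas[z] is the value at
    gammas[z] minus the value at gammas[z - 1]. Their sum telescopes back to
    the value at the largest discount.
    """
    values = [discounted_return(rewards, g) for g in gammas]
    deltas = [values[0]]
    deltas += [values[z] - values[z - 1] for z in range(1, len(values))]
    return values, deltas


rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical episode
gammas = [0.0, 0.5, 0.9]         # increasing discount factors (assumed)
values, deltas = delta_decomposition(rewards, gammas)
reconstructed = sum(deltas)
print(values, deltas, reconstructed)
```

In the method itself each component would be learned with its own TD-style update; this sketch only shows why the components add up to the long-horizon value.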

Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning

Authors:Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, Wei Pan
Date:2019-01-26 13:08:08

Humans are capable of attributing latent mental contents such as beliefs or intentions to others. This social skill is critical in daily life for reasoning about the potential consequences of others' behaviors so as to plan ahead. It is known that humans use such reasoning ability recursively by considering what others believe about their own beliefs. In this paper, we start from level-$1$ recursion and introduce a probabilistic recursive reasoning (PR2) framework for multi-agent reinforcement learning. Our hypothesis is that it is beneficial for each agent to account for how the opponents would react to its future behaviors. Under the PR2 framework, we adopt variational Bayes methods to approximate the opponents' conditional policies, to which each agent finds the best response and then improves its own policy. We develop decentralized-training-decentralized-execution algorithms, namely PR2-Q and PR2-Actor-Critic, that are proved to converge in self-play scenarios when there exists one Nash equilibrium. Our methods are tested on both the matrix game and the differential game, which have a non-trivial equilibrium where common gradient-based methods fail to converge. Our experiments show that it is critical to reason about what the opponents believe about what the agent believes. We expect our work to contribute a new idea of modeling the opponents to the multi-agent reinforcement learning community.

Distributed Policy Iteration for Scalable Approximation of Cooperative Multi-Agent Policies

Authors:Thomy Phan, Kyrill Schmid, Lenz Belzner, Thomas Gabor, Sebastian Feld, Claudia Linnhoff-Popien
Date:2019-01-25 07:13:29

Decision making in multi-agent systems (MAS) is a great challenge due to enormous state and joint action spaces as well as uncertainty, making centralized control generally infeasible. Decentralized control offers better scalability and robustness but requires mechanisms to coordinate on joint tasks and to avoid conflicts. Common approaches to learn decentralized policies for cooperative MAS suffer from non-stationarity and lacking credit assignment, which can lead to unstable and uncoordinated behavior in complex environments. In this paper, we propose Strong Emergent Policy approximation (STEP), a scalable approach to learn strong decentralized policies for cooperative MAS with a distributed variant of policy iteration. For that, we use function approximation to learn from action recommendations of a decentralized multi-agent planning algorithm. STEP combines decentralized multi-agent planning with centralized learning, only requiring a generative model for distributed black box optimization. We experimentally evaluate STEP in two challenging and stochastic domains with large state and joint action spaces and show that STEP is able to learn stronger policies than standard multi-agent reinforcement learning algorithms, when combining multi-agent open-loop planning with centralized function approximation. The learned policies can be reintegrated into the multi-agent planning process to further improve performance.

Towards Physically Safe Reinforcement Learning under Supervision

Authors:Yinan Zhang, Devin Balkcom, Haoxiang Li
Date:2019-01-19 19:16:42

This paper addresses the question of how a previously available control policy $\pi_s$ can be used as a supervisor to more quickly and safely train a new learned control policy $\pi_L$ for a robot. A weighted average of the supervisor and learned policies is used during trials, with a heavier weight initially on the supervisor, in order to allow safe and useful physical trials while the learned policy is still ineffective. During the process, the weight is adjusted to favor the learned policy. As weights are adjusted, the learned network must compensate so as to give safe and reasonable outputs under the different weights. A pioneer network is introduced that pre-learns a policy that performs similarly to the current learned policy under the planned next step for new weights; this pioneer network then replaces the currently learned network in the next set of trials. Experiments in OpenAI Gym demonstrate the effectiveness of the proposed method.
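
A minimal sketch of the weighted-average idea follows, under stated assumptions: actions are scalars, the weight decays linearly over trials, and both policies are hypothetical stand-ins. The abstract does not specify the weight schedule, and the pioneer network is not modeled here.

```python
def blended_action(state, supervisor, learner, w):
    """Weighted average of supervisor and learned actions; w is the supervisor weight."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * supervisor(state) + (1.0 - w) * learner(state)


def linear_weight(trial, num_trials):
    """Assumed schedule: full supervisor weight at first, none at the last trial."""
    if num_trials <= 1:
        return 0.0
    return 1.0 - trial / (num_trials - 1)


def supervisor(state):
    """Hypothetical pre-existing controller (plays the role of pi_s)."""
    return -0.5 * state


def learner(state):
    """Hypothetical untrained policy (plays the role of pi_L) that outputs zero."""
    return 0.0


# Action taken in the same state over five trials as control shifts to the learner.
actions = [
    blended_action(2.0, supervisor, learner, linear_weight(k, 5))
    for k in range(5)
]
print(actions)
```

Early trials are dominated by the supervisor, so the executed actions stay reasonable even while the learned policy is still ineffective.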

Learning retrosynthetic planning through self-play

Authors:John S. Schreck, Connor W. Coley, Kyle J. M. Bishop
Date:2019-01-19 18:43:43

The problem of retrosynthetic planning can be framed as a one-player game, in which the chemist (or a computer program) works backwards from a molecular target to simpler starting materials through a series of choices regarding which reactions to perform. This game is challenging as the combinatorial space of possible choices is astronomical, and the value of each choice remains uncertain until the synthesis plan is completed and its cost evaluated. Here, we address this problem using deep reinforcement learning to identify policies that make (near) optimal reaction choices during each step of retrosynthetic planning. Using simulated experience or self-play, we train neural networks to estimate the expected synthesis cost or value of any given molecule based on a representation of its molecular structure. We show that learned policies based on this value network outperform heuristic approaches in synthesizing unfamiliar molecules from available starting materials using the fewest number of reactions. We discuss how the learned policies described here can be incorporated into existing synthesis planning tools and how they can be adapted to changes in the synthesis cost objective or material availability.

Theory of Minds: Understanding Behavior in Groups Through Inverse Planning

Authors:Michael Shum, Max Kleiman-Weiner, Michael L. Littman, Joshua B. Tenenbaum
Date:2019-01-18 04:50:08

Human social behavior is structured by relationships. We form teams, groups, tribes, and alliances at all scales of human life. These structures guide multi-agent cooperation and competition, but when we observe others these underlying relationships are typically unobservable and hence must be inferred. Humans make these inferences intuitively and flexibly, often making rapid generalizations about the latent relationships that underlie behavior from just sparse and noisy observations. Rapid and accurate inferences are important for determining who to cooperate with, who to compete with, and how to cooperate in order to compete. Towards the goal of building machine-learning algorithms with human-like social intelligence, we develop a generative model of multi-agent action understanding based on a novel representation for these latent relationships called Composable Team Hierarchies (CTH). This representation is grounded in the formalism of stochastic games and multi-agent reinforcement learning. We use CTH as a target for Bayesian inference yielding a new algorithm for understanding behavior in groups that can both infer hidden relationships as well as predict future actions for multiple agents interacting together. Our algorithm rapidly recovers an underlying causal model of how agents relate in spatial stochastic games from just a few observations. The patterns of inference made by this algorithm closely correspond with human judgments and the algorithm makes the same rapid generalizations that people do.

Multi-agent Reinforcement Learning Embedded Game for the Optimization of Building Energy Control and Power System Planning

Authors:Jun Hao
Date:2019-01-17 08:37:38

Most current game-theoretic demand-side management methods focus primarily on the scheduling of home appliances, and the related numerical experiments are analyzed under various scenarios to achieve the corresponding Nash equilibrium (NE) and optimal results. However, little work has been conducted for academic or commercial buildings. The methods for optimizing academic buildings are distinct from the optimal methods for home appliances. In this study, we propose a novel methodology to control the operation of heating, ventilation, and air conditioning (HVAC) systems. With the development of artificial intelligence and computer technologies, reinforcement learning (RL) can be implemented in multiple realistic scenarios and help people solve thousands of real-world problems. Reinforcement learning, which is considered the art of future AI, builds the bridge between agents and environments through Markov decision chains or neural networks and has seldom been used in power systems. The art of RL is that once the simulator for a specific environment is built, the algorithm can keep learning from the environment. Therefore, RL is capable of dealing with constantly changing simulator inputs such as power demand, the condition of the power system, and outdoor temperature. Compared with the existing distribution power system planning mechanisms and the related game-theoretical methodologies, our proposed algorithm can plan and optimize the hourly energy usage, and can work with even shorter time windows if needed.

GridSim: A Vehicle Kinematics Engine for Deep Neuroevolutionary Control in Autonomous Driving

Authors:Bogdan Trasnea, Andrei Vasilcoi, Claudiu Pozna, Sorin Grigorescu
Date:2019-01-16 09:43:39

Current state-of-the-art solutions for the control of an autonomous vehicle mainly use supervised end-to-end learning, or decoupled perception, planning, and action pipelines. Another possible solution is deep reinforcement learning, but such a method requires that the agent interact with its surroundings in a simulated environment. In this paper we introduce GridSim, an autonomous driving simulator engine running a car-like robot architecture to generate occupancy grids from simulated sensors. We use GridSim to study the performance of two deep learning approaches: deep reinforcement learning and driving behavioral learning through genetic algorithms. The deep network encodes the desired behavior in a two-element fitness function describing a maximum travel distance and a maximum forward speed, bounded to a specific interval. The algorithms are evaluated on simulated highways, curved roads, and inner-city scenarios, all including different driving limitations.

An investigation of model-free planning

Authors:Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap
Date:2019-01-11 11:42:51

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learn how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree-structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.

Learning Manipulation States and Actions for Efficient Non-prehensile Rearrangement Planning

Authors:Joshua A. Haustein, Isac Arnekvist, Johannes Stork, Kaiyu Hang, Danica Kragic
Date:2019-01-11 11:25:52

This paper addresses non-prehensile rearrangement planning problems where a robot is tasked to rearrange objects among obstacles on a planar surface. We present an efficient planning algorithm that is designed to impose few assumptions on the robot's non-prehensile manipulation abilities and is simple to adapt to different robot embodiments. For this, we combine sampling-based motion planning with reinforcement learning and generative modeling. Our algorithm explores the composite configuration space of objects and robot as a search over robot actions, forward simulated in a physics model. This search is guided by a generative model that provides robot states from which an object can be transported towards a desired state, and a learned policy that provides corresponding robot actions. As an efficient generative model, we apply Generative Adversarial Networks. We implement and evaluate our approach for robots endowed with configuration spaces in SE(2). We demonstrate empirically the efficacy of our algorithm design choices and observe more than 2x speedup in planning time on various test scenarios compared to a state-of-the-art approach.

A* Tree Search for Portfolio Management

Authors:Xiaojie Gao, Shikui Tu, Lei Xu
Date:2019-01-07 14:59:15

We propose a planning-based method to teach an agent to manage a portfolio from scratch. Our approach combines deep reinforcement learning techniques with search techniques, as in AlphaGo. By uniting the advantages of the A* search algorithm with Monte Carlo tree search, we arrive at a new algorithm named A* tree search, in which the best information is returned to guide the next search. Also, the expansion mode of the Monte Carlo tree is improved for higher utilization of the neural network. The suggested algorithm can also optimize a non-differentiable utility function by combinatorial search. This technique is then used in our trading system. The major component is a neural network that is trained on trading experiences from tree search and outputs prior probabilities that, in turn, guide the search by pruning away branches. Experimental results on simulated and real financial data verify the robustness of the proposed trading system, which produces better strategies than several approaches based on reinforcement learning.

Exploring applications of deep reinforcement learning for real-world autonomous driving systems

Authors:Victor Talpaert, Ibrahim Sobh, B Ravi Kiran, Patrick Mannion, Senthil Yogamani, Ahmad El-Sallab, Patrick Perez
Date:2019-01-06 13:02:46

Deep Reinforcement Learning (DRL) has become increasingly powerful in recent years, with notable achievements such as DeepMind's AlphaGo. It has been successfully deployed in commercial vehicles, for example in Mobileye's path planning system. However, the vast majority of work on DRL is focused on toy examples in controlled synthetic car simulator environments such as TORCS and CARLA. In general, DRL is still in its infancy in terms of usability in real-world applications. Our goal in this paper is to encourage real-world deployment of DRL in various autonomous driving (AD) applications. We first provide an overview of the tasks in autonomous driving systems, reinforcement learning algorithms, and applications of DRL to AD systems. We then discuss the challenges that must be addressed to enable further progress towards real-world deployment.

What Should I Do Now? Marrying Reinforcement Learning and Symbolic Planning

Authors:Daniel Gordon, Dieter Fox, Ali Farhadi
Date:2019-01-06 03:15:15

Long-term planning poses a major difficulty to many reinforcement learning algorithms. This problem becomes even more pronounced in dynamic visual environments. In this work we propose Hierarchical Planning and Reinforcement Learning (HIP-RL), a method for merging the benefits and capabilities of Symbolic Planning with the learning abilities of Deep Reinforcement Learning. We apply HIP-RL to the complex visual tasks of interactive question answering and visual semantic planning and achieve state-of-the-art results on three challenging datasets, all while taking fewer steps at test time and training in fewer iterations. Sample results can be found at youtu.be/0TtWJ_0mPfI

Human-Like Autonomous Car-Following Model with Deep Reinforcement Learning

Authors:Meixin Zhu, Xuesong Wang, Yinhai Wang
Date:2019-01-03 01:05:29

This study proposes a framework for human-like autonomous car-following planning based on deep reinforcement learning (deep RL). Historical driving data are fed into a simulation environment where an RL agent learns from trial-and-error interactions based on a reward function that signals how much the agent deviates from the empirical data. Through these interactions, an optimal policy, or car-following model, is obtained that maps in a human-like way from speed, relative speed between a lead and following vehicle, and inter-vehicle spacing to the acceleration of the following vehicle. The model can be continuously updated when more data are fed in. Two thousand car-following periods extracted from the 2015 Shanghai Naturalistic Driving Study were used to train the model and compare its performance with that of traditional and recent data-driven car-following models. As the results of this study show, a deep deterministic policy gradient car-following model that uses the disparity between simulated and observed speed as the reward function and considers a reaction delay of 1 s, denoted as DDPGvRT, can reproduce human-like car-following behavior with higher accuracy than traditional and recent data-driven car-following models. Specifically, the DDPGvRT model has a spacing validation error of 18% and a speed validation error of 5%, which are less than those of other models, including the intelligent driver model, models based on locally weighted regression, and conventional neural network-based models. Moreover, the DDPGvRT model demonstrates good generalization to various driving situations and can adapt to different drivers by continuously learning. This study demonstrates that reinforcement learning methodology can offer insight into driver behavior and can contribute to the development of human-like autonomous driving algorithms and traffic-flow models.
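
The reward and reaction-delay ideas can be sketched as follows. This is a hedged illustration, not the DDPGvRT implementation: the exact reward form (negative relative speed disparity), the 0.1 s step size, and all function names are assumptions made for this example.

```python
def speed_disparity_reward(v_sim, v_obs, eps=1e-6):
    """Assumed reward: negative relative gap between simulated and observed speed."""
    return -abs(v_sim - v_obs) / max(abs(v_obs), eps)


def delayed_observation(history, t, delay_steps):
    """State the agent acts on at step t when it reacts with a fixed delay."""
    return history[max(0, t - delay_steps)]


def step_speed(v, acceleration, dt):
    """Simple kinematic update for the following vehicle; speed cannot go negative."""
    return max(0.0, v + acceleration * dt)


dt = 0.1                        # assumed simulation step in seconds
delay_steps = round(1.0 / dt)   # a 1 s reaction delay expressed in steps

# Hypothetical observed speeds (m/s) of a human driver over 12 steps.
observed = [10.0 + 0.1 * k for k in range(12)]

perfect = speed_disparity_reward(10.0, 10.0)
off_by_ten_percent = speed_disparity_reward(9.0, 10.0)
seen_at_step_11 = delayed_observation(observed, 11, delay_steps)
next_speed = step_speed(10.0, -2.0, dt)
print(perfect, off_by_ten_percent, seen_at_step_11, next_speed)
```

A reward of this kind is largest (zero) when the simulated speed matches the human data exactly, which is what drives the learned policy toward human-like behavior.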

Dynamic Planning Networks

Authors:Norman Tasfi, Miriam Capretz
Date:2018-12-28 22:37:30

We introduce Dynamic Planning Networks (DPN), a novel architecture for deep reinforcement learning, that combines model-based and model-free aspects for online planning. Our architecture learns to dynamically construct plans using a learned state-transition model by selecting and traversing between simulated states and actions to maximize information before acting. In contrast to model-free methods, model-based planning lets the agent efficiently test action hypotheses without performing costly trial-and-error in the environment. DPN learns to efficiently form plans by expanding a single action-conditional state transition at a time instead of exhaustively evaluating each action, reducing the required number of state-transitions during planning by up to 96%. We observe various emergent planning patterns used to solve environments, including classical search methods such as breadth-first and depth-first search. DPN shows improved data efficiency, performance, and generalization to new and unseen domains in comparison to several baselines.

Vehicular Edge Computing via Deep Reinforcement Learning

Authors:Qi Qi, Zhanyu Ma
Date:2018-12-27 09:54:37

Smart vehicles form the Internet of Vehicles, which can execute various intelligent services. Although the computation capability of a vehicle is limited, multiple types of edge computing nodes provide heterogeneous resources for vehicular services. When offloading a complicated service to a vehicular edge computing node, the decision should consider numerous factors. Existing work on offloading decisions mostly formulates the decision as a resource scheduling problem with single or multiple objective functions and some constraints, and explores customized heuristic algorithms. However, offloading multiple data-dependent tasks in a service is a difficult decision, as an optimal solution must understand the resource requirements, the access network, the user mobility, and, importantly, the data dependency. Inspired by recent advances in machine learning, we propose a knowledge-driven (KD) service offloading decision framework for the Internet of Vehicles, which obtains the optimal policy directly from the environment. We formulate the offloading decision for multiple tasks in a service as a long-term planning problem, and explore recent deep reinforcement learning methods to obtain the optimal solution. Using the learned offloading knowledge, the framework considers the future data dependency of the following tasks when making a decision for the current task. Moreover, the framework supports pre-training at the powerful edge computing node and continual online learning while the vehicular service is executed, so that it can adapt to environment changes and learn policies that are sensible in hindsight. The simulation results show that the KD service offloading decision converges quickly, adapts to different conditions, and outperforms the greedy offloading decision algorithm.

Learning to Prevent Monocular SLAM Failure using Reinforcement Learning

Authors:Vignesh Prasad, Karmesh Yadav, Rohitashva Singh Saurabh, Swapnil Daga, Nahas Pareekutty, K. Madhava Krishna, Balaraman Ravindran, Brojeshwar Bhowmick
Date:2018-12-23 03:28:26

Monocular SLAM refers to using a single camera to estimate robot ego motion while building a map of the environment. While Monocular SLAM is a well studied problem, automating Monocular SLAM by integrating it with trajectory planning frameworks is particularly challenging. This paper presents a novel formulation based on Reinforcement Learning (RL) that generates fail safe trajectories wherein the SLAM generated outputs do not deviate largely from their true values. Quintessentially, the RL framework successfully learns the otherwise complex relation between perceptual inputs and motor actions and uses this knowledge to generate trajectories that do not cause failure of SLAM. We show systematically in simulations how the quality of the SLAM dramatically improves when trajectories are computed using RL. Our method scales effectively across Monocular SLAM frameworks in both simulation and in real world experiments with a mobile robot.

Escape Room: A Configurable Testbed for Hierarchical Reinforcement Learning

Authors:Jacob Menashe, Peter Stone
Date:2018-12-22 12:29:20

Recent successes in Reinforcement Learning have encouraged a fast-growing network of RL researchers and a number of breakthroughs in RL research. As the RL community and the body of RL work grow, so does the need for widely applicable benchmarks that can fairly and effectively evaluate a variety of RL algorithms. This need is particularly apparent in the realm of Hierarchical Reinforcement Learning (HRL). While many existing test domains may exhibit hierarchical action or state structures, modern RL algorithms still exhibit great difficulty in solving domains that necessitate hierarchical modeling and action planning, even when such domains are seemingly trivial. These difficulties highlight both the need for more focus on HRL algorithms themselves, and the need for new testbeds that will encourage and validate HRL research. Existing HRL testbeds exhibit a Goldilocks problem; they are often either too simple (e.g. Taxi) or too complex (e.g. Montezuma's Revenge from the Arcade Learning Environment). In this paper we present the Escape Room Domain (ERD), a new flexible, scalable, and fully implemented testing domain for HRL that bridges the "moderate complexity" gap left behind by existing alternatives. ERD is open-source and freely available through GitHub, and conforms to widely-used public testing interfaces for simple integration and testing with a variety of public RL agent implementations. We show that the ERD presents a suite of challenges with scalable difficulty to provide a smooth learning gradient from Taxi to the Arcade Learning Environment.

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

Authors:Linhai Xie, Sen Wang, Stefano Rosa, Andrew Markham, Niki Trigoni
Date:2018-12-12 16:56:51

Deep Reinforcement Learning (DRL) has been applied successfully to many robotic applications. However, the large number of trials needed for training is a key issue. Most existing techniques developed to improve training efficiency (e.g. imitation) target general tasks rather than being tailored to robot applications, which have specific context to benefit from. We propose a novel framework, Assisted Reinforcement Learning, where a classical controller (e.g. a PID controller) is used as an alternative, switchable policy to speed up training of DRL for local planning and navigation problems. The core idea is that the simple control law allows the robot to rapidly learn sensible primitives, like driving in a straight line, instead of relying on random exploration. As the actor network becomes more advanced, it can then take over to perform more complex actions, like obstacle avoidance. Eventually, the simple controller can be discarded entirely. We show that not only does this technique train faster, it is also less sensitive to the structure of the DRL network and consistently outperforms a standard Deep Deterministic Policy Gradient network. We demonstrate the results in both simulation and real-world experiments.

Mitigating Planner Overfitting in Model-Based Reinforcement Learning

Authors:Dilip Arumugam, David Abel, Kavosh Asadi, Nakul Gopalan, Christopher Grimm, Jun Ki Lee, Lucas Lehnert, Michael L. Littman
Date:2018-12-03 23:11:30

An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting" - aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments.

Resource Constrained Deep Reinforcement Learning

Authors:Abhinav Bhatia, Pradeep Varakantham, Akshat Kumar
Date:2018-12-03 08:34:36

In urban environments, supply resources have to be constantly matched to the "right" locations (where customer demand is present) so as to improve quality of life. For instance, ambulances have to be matched to base stations regularly so as to reduce response time for emergency incidents in EMS (Emergency Management Systems); vehicles (cars, bikes, scooters etc.) have to be matched to docking stations so as to reduce lost demand in shared mobility systems. Such problem domains are challenging owing to the demand uncertainty, combinatorial action spaces (due to allocation) and constraints on allocation of resources (e.g., total resources, minimum and maximum number of resources at locations and regions). Existing systems typically employ myopic and greedy optimization approaches to optimize allocation of supply resources to locations. Such approaches typically are unable to handle surges or variances in demand patterns well. Recent research has demonstrated the ability of Deep RL methods in adapting well to highly uncertain environments. However, existing Deep RL methods are unable to handle combinatorial action spaces and constraints on allocation of resources. To that end, we have developed three approaches on top of the well known actor critic approach, DDPG (Deep Deterministic Policy Gradient) that are able to handle constraints on resource allocation. More importantly, we demonstrate that they are able to outperform leading approaches on simulators validated on semi-real and real data sets.

Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL

Authors:Bilal Kartal, Pablo Hernandez-Leal, Matthew E. Taylor
Date:2018-11-30 20:37:17

Deep reinforcement learning (DRL) has achieved great successes in recent years with the help of novel methods and higher compute power. However, there are still several challenges to be addressed, such as convergence to locally optimal policies and long training times. In this paper, firstly, we augment the Asynchronous Advantage Actor-Critic (A3C) method with a novel self-supervised auxiliary task, i.e. Terminal Prediction, which measures temporal closeness to terminal states; we call the result A3C-TP. Secondly, we propose a new framework where planning algorithms such as Monte Carlo tree search or other sources of (simulated) demonstrators can be integrated into asynchronous distributed DRL methods. Compared to vanilla A3C, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.

PEARL: PrEference Appraisal Reinforcement Learning for Motion Planning

Authors:Aleksandra Faust, Hao-Tien Lewis Chiang, Lydia Tapia
Date:2018-11-30 07:35:41

Robot motion planning often requires finding trajectories that balance different user intents, or preferences. One of these preferences is usually arrival at the goal, while another might be obstacle avoidance. Here, we formalize these, and similar, tasks as preference balancing tasks (PBTs) on acceleration controlled robots, and propose a motion planning solution, PrEference Appraisal Reinforcement Learning (PEARL). PEARL uses reinforcement learning on a restricted training domain, combined with features engineered from user-given intents. PEARL's planner then generates trajectories in expanded domains for more complex problems. We present an adaptation for rejection of stochastic disturbances and offer in-depth analysis, including task completion conditions and behavior analysis when the conditions do not hold. PEARL is evaluated on five problems, two multi-agent obstacle avoidance tasks and three that stochastically disturb the system at run-time: 1) a multi-agent pursuit problem with 1000 pursuers, 2) robot navigation through 900 moving obstacles, which is trained in an environment with only 4 static obstacles, 3) aerial cargo delivery, 4) two robot rendezvous, and 5) flying inverted pendulum. Lastly, we evaluate the method on a physical quadrotor UAV robot with a suspended load influenced by a stochastic disturbance. The video, https://youtu.be/ZkFt1uY6vlw, contains the experiments and visualization of the simulations.

Intelligent Inverse Treatment Planning via Deep Reinforcement Learning, a Proof-of-Principle Study in High Dose-rate Brachytherapy for Cervical Cancer

Authors:Chenyang Shen, Yesenia Gonzalez, Peter Klages, Nan Qin, Hyunuk Jung, Liyuan Chen, Dan Nguyen, Steve B. Jiang, Xun Jia
Date:2018-11-25 21:41:31

Inverse treatment planning in radiation therapy is formulated as an optimization problem. The objective function and constraints consist of multiple terms designed for different clinical and practical considerations. Weighting factors of these terms are needed to define the optimization problem. While a treatment planning system can solve the optimization problem with given weights, adjusting the weights for high plan quality is performed by humans. The weight tuning task is labor intensive and time consuming, and it critically affects the final plan quality. An automatic weight-tuning approach is strongly desired. The weight tuning procedure is essentially a decision making problem. Motivated by the tremendous success of deep learning for decision making with human-level intelligence, we propose a novel framework to tune the weights in a human-like manner. Using treatment planning in high-dose-rate brachytherapy as an example, we develop a weight tuning policy network (WTPN) that observes dose volume histograms of a plan and outputs an action to adjust organ weights, similar to the behaviors of a human planner. We train the WTPN via end-to-end deep reinforcement learning. Experience replay is performed with the epsilon greedy algorithm. Then we apply the trained WTPN to guide treatment planning of testing patient cases. The trained WTPN successfully learns the treatment planning goals to guide the weight tuning process. On average, the quality score of plans generated under the WTPN's guidance is improved by ~8.5% compared to the initial plan with arbitrary weights, and by 10.7% compared to the plans generated by human planners. To our knowledge, this is the first tool to adjust weights for treatment planning in a human-like fashion based on learnt intelligence. The study demonstrates the potential feasibility of developing an intelligent treatment planning system via deep reinforcement learning.
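
The abstract above mentions the epsilon greedy algorithm. The following is a minimal sketch of epsilon-greedy action selection only, not the authors' WTPN training code; the function name, the Q-values, and the idea that each index stands for one weight-adjustment action are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for three weight-adjustment actions.
q = [0.1, 0.7, 0.3]
rng = random.Random(0)
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)    # always exploits
explore_action = epsilon_greedy(q, epsilon=1.0, rng=rng)   # always explores
```

With epsilon set to 0 the rule is purely greedy, and with epsilon set to 1 it is purely random; training setups commonly start with a larger value and reduce it over time.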

Planning in Dynamic Environments with Conditional Autoregressive Models

Authors:Johanna Hansen, Kyle Kastner, Aaron Courville, Gregory Dudek
Date:2018-11-25 21:10:10

We demonstrate the use of conditional autoregressive generative models (van den Oord et al., 2016a) over a discrete latent space (van den Oord et al., 2017b) for forward planning with MCTS. In order to test this method, we introduce a new environment featuring varying difficulty levels, along with moving goals and obstacles. The combination of high-quality frame generation and classical planning approaches nearly matches true environment performance for our task, demonstrating the usefulness of this method for model-based planning in dynamic environments.

Integrating Task-Motion Planning with Reinforcement Learning for Robust Decision Making in Mobile Robots

Authors:Yuqian Jiang, Fangkai Yang, Shiqi Zhang, Peter Stone
Date:2018-11-21 21:20:24

Task-motion planning (TMP) addresses the problem of efficiently generating executable and low-cost task plans in a discrete space such that the (initially unknown) action costs are determined by motion plans in a corresponding continuous space. However, a task-motion plan can be sensitive to unexpected domain uncertainty and changes, leading to suboptimal behaviors or execution failures. In this paper, we propose a novel framework, TMP-RL, which is an integration of TMP and reinforcement learning (RL) from the execution experience, to solve the problem of robust task-motion planning in dynamic and uncertain domains. TMP-RL features two nested planning-learning loops. In the inner TMP loop, the robot generates a low-cost, feasible task-motion plan by iteratively planning in the discrete space and updating relevant action costs evaluated by the motion planner in continuous space. In the outer loop, the plan is executed, and the robot learns from the execution experience via model-free RL, to further improve its task-motion plans. RL in the outer loop is more accurate to the current domain but also more expensive, and using less costly task and motion planning leads to a jump-start for learning in the real world. Our approach is evaluated on a mobile service robot conducting navigation tasks in an office area. Results show that the TMP-RL approach significantly improves adaptability and robustness (in comparison to TMP methods) and leads to rapid convergence (in comparison to task planning (TP)-RL methods). We also show that TMP-RL can reuse learned values to smoothly adapt to new scenarios during long-term deployments.

Reinforcement Learning and Inverse Reinforcement Learning with System 1 and System 2

Authors:Alexander Peysakhovich
Date:2018-11-19 22:36:53

Inferring a person's goal from their behavior is an important problem in applications of AI (e.g. automated assistants, recommender systems). The workhorse model for this task is the rational actor model - this amounts to assuming that people have stable reward functions, discount the future exponentially, and construct optimal plans. Under the rational actor assumption, techniques such as inverse reinforcement learning (IRL) can be used to infer a person's goals from their actions. A competing model is the dual-system model. Here decisions are the result of an interplay between a fast, automatic, heuristic-based system 1 and a slower, deliberate, calculating system 2. We generalize the dual-system framework to the case of Markov decision problems and show how to compute optimal plans for dual-system agents. We show that dual-system agents exhibit behaviors that are incompatible with the rational actor assumption. We show that naive applications of rational-actor IRL to the behavior of dual-system agents can generate wrong inferences about the agents' goals and suggest interventions that actually reduce the agent's overall utility. Finally, we adapt a simple IRL algorithm to correctly infer the goals of dual-system decision-makers. This allows us to make interventions that help, rather than hinder, the dual-system agent's ability to reach their true goals.

Chat More If You Like: Dynamic Cue Words Planning to Flow Longer Conversations

Authors:Lili Yao, Ruijian Xu, Chao Li, Dongyan Zhao, Rui Yan
Date:2018-11-19 11:54:25

Building an open-domain multi-turn conversation system is one of the most interesting and challenging tasks in Artificial Intelligence. Many research efforts have been dedicated to building such dialogue systems, yet few shed light on modeling the conversation flow in an ongoing dialogue. Moreover, it is common for people to talk about highly relevant aspects during a conversation, and the topics are coherent and drift naturally, which demonstrates the necessity of dialogue flow modeling. To this end, we present the multi-turn cue-words driven conversation system with reinforcement learning method (RLCw), which strives to select an adaptive cue word with the greatest future credit, and therefore improves the quality of generated responses. We introduce a new reward to measure the quality of cue words in terms of effectiveness and relevance. To further optimize the model for long-term conversations, a reinforcement approach is adopted in this paper. Experiments on a real-life dataset demonstrate that our model consistently outperforms a set of competitive baselines in terms of simulated turns, diversity and human evaluation.

Switch-based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning

Authors:Yuexin Wu, Xiujun Li, Jingjing Liu, Jianfeng Gao, Yiming Yang
Date:2018-11-19 08:23:34

Training task-completion dialogue agents with reinforcement learning usually requires a large number of real user experiences. The Dyna-Q algorithm extends Q-learning by integrating a world model, and thus can effectively boost training efficiency using simulated experiences generated by the world model. The effectiveness of Dyna-Q, however, depends on the quality of the world model - or implicitly, the pre-specified ratio of real vs. simulated experiences used for Q-learning. To this end, we extend the recently proposed Deep Dyna-Q (DDQ) framework by integrating a switcher that automatically determines whether to use a real or simulated experience for Q-learning. Furthermore, we explore the use of active learning for improving sample efficiency, by encouraging the world model to generate simulated experiences in regions of the state-action space that the agent has not (fully) explored. Our results show that by combining the switcher and active learning, the new framework, named Switch-based Active Deep Dyna-Q (Switch-DDQ), leads to significant improvement over DDQ and Q-learning baselines in both simulation and human evaluations.
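
To make the Dyna-Q idea described above concrete, here is a minimal tabular sketch: each real step updates Q and a learned model, and a fixed number of extra updates then replay transitions drawn from that model. This is plain Dyna-Q with a pre-specified planning ratio, not DDQ or Switch-DDQ (there is no switcher and no active learning). The `chain_step` environment, the function names, and all hyperparameters are hypothetical.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start, actions, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)   # Q[(state, action)]
    model = {}               # learned world model: (s, a) -> (reward, s', done)

    def update(s, a, r, s2, done):
        # Standard one-step Q-learning target.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                # Greedy action with random tie-breaking.
                best = max(q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if q[(s, b)] == best])
            r, s2, done = env_step(s, a)
            update(s, a, r, s2, done)            # learn from real experience
            model[(s, a)] = (r, s2, done)        # record it in the world model
            for _ in range(planning_steps):      # learn from simulated experience
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q

def chain_step(state, action):
    """Hypothetical 4-state chain: action 1 moves right, 0 moves left; reward 1 on reaching state 3."""
    nxt = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    done = nxt == 3
    return (1.0 if done else 0.0), nxt, done

q_table = dyna_q(chain_step, start=0, actions=[0, 1])
```

The `planning_steps` argument plays the role of the pre-specified ratio of simulated to real experience that the abstract identifies as a weakness; the switcher proposed in the paper is meant to replace that fixed setting.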

Large-scale Interactive Recommendation with Tree-structured Policy Gradient

Authors:Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, Yong Yu
Date:2018-11-14 15:53:25

Reinforcement learning (RL) has recently been introduced to interactive recommender systems (IRS) because of its nature of learning from dynamic interactions and planning for long-run performance. Because an IRS typically has thousands of items to recommend (i.e., thousands of actions), most existing RL-based methods fail to handle such a large discrete action space and thus become inefficient. Existing work that tries to deal with the large discrete action space problem by utilizing the deep deterministic policy gradient framework suffers from the inconsistency between the continuous action representation (the output of the actor network) and the real discrete action. To avoid such inconsistency and achieve high efficiency and recommendation effectiveness, in this paper, we propose a Tree-structured Policy Gradient Recommendation (TPGR) framework, where a balanced hierarchical clustering tree is built over the items and picking an item is formulated as seeking a path from the root to a certain leaf of the tree. Extensive experiments on carefully-designed environments based on two real-world datasets demonstrate that our model provides superior recommendation performance and significant efficiency improvement over state-of-the-art methods.
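
The root-to-leaf formulation above can be illustrated with a small sketch: each internal node holds a distribution over its children, an item is chosen by sampling one child per level, and the probability of an item is the product of the branch probabilities along its path. The tree, the logits, and the function names below are hypothetical; in TPGR the per-node distributions would come from learned policy networks, which this sketch replaces with fixed numbers.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical balanced tree over 4 items with branching factor 2.
TREE = {
    "root": {"logits": [0.2, -0.1], "children": ["n0", "n1"]},
    "n0": {"logits": [1.0, 0.0], "children": ["item_a", "item_b"]},
    "n1": {"logits": [0.0, 0.5], "children": ["item_c", "item_d"]},
}

def sample_item(tree, rng):
    """Walk from the root to a leaf, sampling one child at every internal node."""
    node, prob = "root", 1.0
    while node in tree:
        probs = softmax(tree[node]["logits"])
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        prob *= probs[idx]
        node = tree[node]["children"][idx]
    return node, prob

def item_probability(tree, item, node="root"):
    """Probability of reaching `item` from `node`: product of branch probabilities on its path."""
    if node == item:
        return 1.0
    if node not in tree:
        return 0.0
    probs = softmax(tree[node]["logits"])
    children = tree[node]["children"]
    return sum(p * item_probability(tree, item, c) for p, c in zip(probs, children))

item, path_prob = sample_item(TREE, random.Random(0))
```

With branching factor c and N items, one choice requires roughly log_c(N) decisions over c children each instead of a single decision over N actions, which is the source of the efficiency argument.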

Modular Architecture for StarCraft II with Deep Reinforcement Learning

Authors:Dennis Lee, Haoran Tang, Jeffrey O Zhang, Huazhe Xu, Trevor Darrell, Pieter Abbeel
Date:2018-11-08 17:13:50

We present a novel modular architecture for StarCraft II AI. The architecture splits responsibilities between multiple modules that each control one aspect of the game, such as build-order selection or tactics. A centralized scheduler reviews macros suggested by all modules and decides their order of execution. An updater keeps track of environment changes and instantiates macros into series of executable actions. Modules in this framework can be optimized independently or jointly via human design, planning, or reinforcement learning. We apply deep reinforcement learning techniques to training two out of six modules of a modular agent with self-play, achieving 94% or 87% win rates against the "Harder" (level 5) built-in Blizzard bot in Zerg vs. Zerg matches, with or without fog-of-war.

Combining Subgoal Graphs with Reinforcement Learning to Build a Rational Pathfinder

Authors:Junjie Zeng, Long Qin, Yue Hu, Cong Hu, Quanjun Yin
Date:2018-11-05 14:12:14

In this paper, we present a hierarchical path planning framework called SG-RL (subgoal graphs-reinforcement learning), to plan rational paths for agents maneuvering in continuous and uncertain environments. By "rational", we mean (1) efficient path planning that eliminates first-move lags; (2) collision-free, smooth paths that satisfy the agents' kinematic constraints. SG-RL works in a two-level manner. At the first level, SG-RL uses a geometric path-planning method, i.e., Simple Subgoal Graphs (SSG), to efficiently find optimal abstract paths, also called subgoal sequences. At the second level, SG-RL uses an RL method, i.e., Least-Squares Policy Iteration (LSPI), to learn near-optimal motion-planning policies which can generate kinematically feasible and collision-free trajectories between adjacent subgoals. The first advantage of the proposed method is that SSG can overcome the limitations of sparse rewards and local minima traps for RL agents; thus, LSPI can be used to generate paths in complex environments. The second advantage is that, when the environment changes slightly (e.g., unexpected obstacles appear), SG-RL does not need to reconstruct subgoal graphs and replan subgoal sequences using SSG, since LSPI can deal with uncertainties by exploiting its generalization ability to handle changes in environments. Simulation experiments in representative scenarios demonstrate that, compared with existing methods, SG-RL can work well on large-scale maps with relatively low action-switching frequencies and shorter path lengths, and SG-RL can deal with small changes in environments. We further demonstrate that the design of reward functions and the types of training environments are important factors for learning feasible policies.

Towards a Simple Approach to Multi-step Model-based Reinforcement Learning

Authors:Kavosh Asadi, Evan Cater, Dipendra Misra, Michael L. Littman
Date:2018-10-31 21:31:59

When environmental interaction is expensive, model-based reinforcement learning offers a solution by planning ahead and avoiding costly mistakes. Model-based agents typically learn a single-step transition model. In this paper, we propose a multi-step model that predicts the outcome of an action sequence with variable length. We show that this model is easy to learn, and that the model can make policy-conditional predictions. We report preliminary results that show a clear advantage for the multi-step model compared to its one-step counterpart.

SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning

Authors:Daoming Lyu, Fangkai Yang, Bo Liu, Steven Gustafson
Date:2018-10-31 19:56:06

Deep reinforcement learning (DRL) has gained great success by learning directly from high-dimensional sensory inputs, yet is notorious for its lack of interpretability. Interpretability of the subtasks is critical in hierarchical decision-making as it increases the transparency of the black-box-style DRL approach and helps RL practitioners to better understand the high-level behavior of the system. In this paper, we introduce symbolic planning into DRL and propose a framework of Symbolic Deep Reinforcement Learning (SDRL) that can handle both high-dimensional sensory inputs and symbolic planning. The task-level interpretability is enabled by relating symbolic actions to options. This framework features a planner -- controller -- meta-controller architecture, which takes charge of subtask scheduling, data-driven subtask learning, and subtask evaluation, respectively. The three components cross-fertilize each other and eventually converge to an optimal symbolic plan along with the learned subtasks, bringing together the advantages of long-term planning capability with symbolic knowledge and end-to-end reinforcement learning directly from a high-dimensional sensory input. Experimental results validate the interpretability of subtasks, along with improved data efficiency compared with state-of-the-art approaches.

Differentiable MPC for End-to-end Planning and Control

Authors:Brandon Amos, Ivan Dario Jimenez Rodriguez, Jacob Sacks, Byron Boots, J. Zico Kolter
Date:2018-10-31 16:46:38

We present foundations for using Model Predictive Control (MPC) as a differentiable policy class for reinforcement learning in continuous state and action spaces. This provides one way of leveraging and combining the advantages of model-free and model-based approaches. Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller. Using this strategy, we are able to learn the cost and dynamics of a controller via end-to-end learning. Our experiments focus on imitation learning in the pendulum and cartpole domains, where we learn the cost and dynamics terms of an MPC policy class. We show that our MPC policies are significantly more data-efficient than a generic neural network and that our method is superior to traditional system identification in a setting where the expert is unrealizable.

Model-Based Active Exploration

Authors:Pranav Shyam, Wojciech Jaśkowski, Faustino Gomez
Date:2018-10-29 14:43:48

Efficient exploration is an unsolved problem in Reinforcement Learning which is usually addressed by reactively rewarding the agent for fortuitously encountering novel situations. This paper introduces an efficient active exploration algorithm, Model-Based Active eXploration (MAX), which uses an ensemble of forward models to plan to observe novel events. This is carried out by optimizing agent behaviour with respect to a measure of novelty derived from the Bayesian perspective of exploration, which is estimated using the disagreement between the futures predicted by the ensemble members. We show empirically that in semi-random discrete environments where directed exploration is critical to make progress, MAX is at least an order of magnitude more efficient than strong baselines. MAX scales to high-dimensional continuous environments where it builds task-agnostic models that can be used for any downstream task.
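
The role of ensemble disagreement as a novelty signal can be sketched as follows. This is a simplified stand-in and not the MAX algorithm: it uses the plain variance of one-step predictions rather than the Bayesian-derived measure the abstract refers to, and it picks a single action greedily instead of planning. The toy forward models and all names are hypothetical.

```python
import statistics

def disagreement(ensemble, state, action):
    """Variance of the ensemble's next-state predictions, used as a novelty proxy."""
    preds = [model(state, action) for model in ensemble]
    return statistics.pvariance(preds)

def make_model(k):
    # Hypothetical 1-D forward model; members agree when state + action is near 0
    # and diverge as it moves away, mimicking less-explored regions.
    return lambda s, a: (s + a) + k * (s + a) ** 2

ensemble = [make_model(k) for k in (-0.1, 0.0, 0.1)]

def most_novel_action(ensemble, state, actions):
    """Choose the action whose predicted outcome the ensemble disagrees on most."""
    return max(actions, key=lambda a: disagreement(ensemble, state, a))

choice = most_novel_action(ensemble, state=1.0, actions=[-1.0, 0.0, 1.0])
```

Here the action 1.0 is selected because it leads furthest from the region where the three models agree; a planner in the spirit of the abstract would instead optimize such a novelty measure over multi-step futures.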

Learning Abstract Options

Authors:Matthew Riemer, Miao Liu, Gerald Tesauro
Date:2018-10-27 02:54:59

Building systems that autonomously create temporal abstractions from data is a key challenge in scaling learning and planning in reinforcement learning. One popular approach for addressing this challenge is the options framework (Sutton et al., 1999). However, only recently in (Bacon et al., 2017) was a policy gradient theorem derived for online learning of general purpose options in an end to end fashion. In this work, we extend previous work on this topic that only focuses on learning a two-level hierarchy including options and primitive actions to enable learning simultaneously at multiple resolutions in time. We achieve this by considering an arbitrarily deep hierarchy of options where high level temporally extended options are composed of lower level options with finer resolutions in time. We extend results from (Bacon et al., 2017) and derive policy gradient theorems for a deep hierarchy of options. Our proposed hierarchical option-critic architecture is capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals. Our empirical results in both discrete and continuous environments demonstrate the efficiency of our framework.

Efficient and Trustworthy Social Navigation Via Explicit and Implicit Robot-Human Communication

Authors:Yuhang Che, Allison M. Okamura, Dorsa Sadigh
Date:2018-10-26 23:38:20

In this paper, we present a planning framework that uses a combination of implicit (robot motion) and explicit (visual/audio/haptic feedback) communication during mobile robot navigation. First, we developed a model that approximates both continuous movements and discrete behavior modes in human navigation, considering the effects of implicit and explicit communication on human decision making. The model approximates the human as an optimal agent, with a reward function obtained through inverse reinforcement learning. Second, a planner uses this model to generate communicative actions that maximize the robot's transparency and efficiency. We implemented the planner on a mobile robot, using a wearable haptic device for explicit communication. In a user study of an indoor orthogonal crossing situation between a human and a robot, the robot was able to actively communicate its intent to users in order to avoid collisions and facilitate efficient trajectories. Results showed that the planner generated plans that were easier to understand, reduced users' effort, and increased users' trust in the robot, compared to simply performing collision avoidance. The key contribution of this work is the integration and analysis of explicit communication (together with implicit communication) for social navigation.

Transfer of Deep Reactive Policies for MDP Planning

Authors:Aniket Bajpai, Sankalp Garg, Mausam
Date:2018-10-26 18:28:42

Domain-independent probabilistic planners input an MDP description in a factored representation language such as PPDDL or RDDL, and exploit the specifics of the representation for faster planning. Traditional algorithms operate on each problem instance independently, and good methods for transferring experience from policies of other instances of a domain to a new instance do not exist. Recently, researchers have begun exploring the use of deep reactive policies, trained via deep reinforcement learning (RL), for MDP planning domains. One advantage of deep reactive policies is that they are more amenable to transfer learning. In this paper, we present the first domain-independent transfer algorithm for MDP planning domains expressed in an RDDL representation. Our architecture exploits the symbolic state configuration and transition function of the domain (available via RDDL) to learn a shared embedding space for states and state-action pairs for all problem instances of a domain. We then learn an RL agent in the embedding space, making a near zero-shot transfer possible, i.e., without much training on the new instance, and without using the domain simulator at all. Experiments on three different benchmark domains underscore the value of our transfer algorithm. Compared against planning from scratch, and a state-of-the-art RL transfer algorithm, our transfer solution has significantly superior learning curves.

Neural Modular Control for Embodied Question Answering

Authors:Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra
Date:2018-10-26 03:58:26

We present a modular approach for learning policies for navigation over long planning horizons from language input. Our hierarchical policy operates at multiple timescales, where the higher-level master policy proposes subgoals to be executed by specialized sub-policies. Our choice of subgoals is compositional and semantic, i.e. they can be sequentially combined in arbitrary orderings, and assume human-interpretable descriptions (e.g. 'exit room', 'find kitchen', 'find refrigerator', etc.). We use imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning. Independent reinforcement learning at each level of hierarchy enables sub-policies to adapt to consequences of their actions and recover from errors. Subsequent joint hierarchical training enables the master policy to adapt to the sub-policies. On the challenging EQA (Das et al., 2018) benchmark in House3D (Wu et al., 2018), requiring navigating diverse realistic indoor environments, our approach outperforms prior work by a significant margin, both in terms of navigation and question answering.

Sample-Efficient Learning of Nonprehensile Manipulation Policies via Physics-Based Informed State Distributions

Authors:Lerrel Pinto, Aditya Mandalika, Brian Hou, Siddhartha Srinivasa
Date:2018-10-24 23:49:58

This paper proposes a sample-efficient yet simple approach to learning closed-loop policies for nonprehensile manipulation. Although reinforcement learning (RL) can learn closed-loop policies without requiring access to underlying physics models, it suffers from poor sample complexity on challenging tasks. To overcome this problem, we leverage rearrangement planning to provide an informative physics-based prior on the environment's optimal state-visitation distribution. Specifically, we present a new technique, Learning with Planned Episodic Resets (LeaPER), that resets the environment's state to one informed by the prior during the learning phase. We experimentally show that LeaPER significantly outperforms traditional RL approaches by a factor of up to 5X on simulated rearrangement. Further, we relax dynamics from quasi-static to welded contacts to illustrate that LeaPER is robust to the use of simpler physics models. Finally, LeaPER's closed-loop policies significantly improve task success rates relative to both open-loop controls with a planned path or simple feedback controllers that track open-loop trajectories. We demonstrate the performance and behavior of LeaPER on a physical 7-DOF manipulator in https://youtu.be/feS-zFq6J1c.

Safe Reinforcement Learning with Model Uncertainty Estimates

Authors:Björn Lütjens, Michael Everett, Jonathan P. How
Date:2018-10-19 22:04:59

Many current autonomous systems are being designed with a strong reliance on black box predictions from deep neural networks (DNNs). However, DNNs tend to be overconfident in predictions on unseen data and can give unpredictable results for far-from-distribution test data. The importance of predictions that are robust to this distributional shift is evident for safety-critical applications, such as collision avoidance around pedestrians. Measures of model uncertainty can be used to identify unseen data, but the state-of-the-art extraction methods such as Bayesian neural networks are mostly intractable to compute. This paper uses MC-Dropout and Bootstrapping to give computationally tractable and parallelizable uncertainty estimates. The methods are embedded in a Safe Reinforcement Learning framework to form uncertainty-aware navigation around pedestrians. The result is a collision avoidance policy that knows what it does not know and cautiously avoids pedestrians that exhibit unseen behavior. The policy is demonstrated in simulation to be more robust to novel observations and take safer actions than an uncertainty-unaware baseline.
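
As a rough illustration of the bootstrapping idea mentioned above (not the authors' collision-avoidance system, and leaving MC-Dropout aside), the sketch below fits a small ensemble of linear models on bootstrap resamples and uses the spread of their predictions as an uncertainty estimate. The toy 1-D regression data and the names `fit_line`, `bootstrap_ensemble`, and `predict_with_uncertainty` are assumptions made for this example.

```python
import random
import statistics

def fit_line(points):
    """Least-squares fit y = a + b*x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0  # guard against a degenerate resample
    return mean_y - slope * mean_x, slope

def bootstrap_ensemble(points, n_models, rng):
    """Train one model per bootstrap resample (sampling with replacement)."""
    return [fit_line([rng.choice(points) for _ in points]) for _ in range(n_models)]

def predict_with_uncertainty(ensemble, x):
    """Return the ensemble mean and the standard deviation across members."""
    preds = [a + b * x for a, b in ensemble]
    return statistics.mean(preds), statistics.pstdev(preds)

# Hypothetical toy data: inputs lie in [0, 1], targets follow y = 2x plus noise.
rng = random.Random(0)
train = [(i / 29, 2.0 * (i / 29) + rng.gauss(0.0, 0.1)) for i in range(30)]
ensemble = bootstrap_ensemble(train, 50, rng)

_, std_in = predict_with_uncertainty(ensemble, 0.5)    # inside the training range
_, std_out = predict_with_uncertainty(ensemble, 10.0)  # far from the training data
```

In this toy setting the ensemble disagrees far more at the out-of-range input than at the in-range one, which is the kind of signal an uncertainty-aware policy could use to act more cautiously.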

Fast deep reinforcement learning using online adjustments from the past

Authors:Steven Hansen, Pablo Sprechmann, Alexander Pritzel, André Barreto, Charles Blundell
Date:2018-10-18 17:00:20

We propose Ephemeral Value Adjustments (EVA): a means of allowing deep reinforcement learning agents to rapidly adapt to experience in their replay buffer. EVA shifts the value predicted by a neural network with an estimate of the value function found by planning over experience tuples from the replay buffer near the current state. EVA combines a number of recent ideas around combining episodic memory-like structures into reinforcement learning agents: slot-based storage, content-based retrieval, and memory-based planning. We show that EVA is performant on a demonstration task and Atari games.

Integrating kinematics and environment context into deep inverse reinforcement learning for predicting off-road vehicle trajectories

Authors:Yanfu Zhang, Wenshan Wang, Rogerio Bonatti, Daniel Maturana, Sebastian Scherer
Date:2018-10-16 18:40:34

Predicting the motion of a mobile agent from a third-person perspective is an important component for many robotics applications, such as autonomous navigation and tracking. With accurate motion prediction of other agents, robots can plan for more intelligent behaviors to achieve specified objectives, instead of acting in a purely reactive way. Previous work addresses motion prediction by either only filtering kinematics, or using hand-designed and learned representations of the environment. Instead of separating kinematic and environmental context, we propose a novel approach to integrate both into an inverse reinforcement learning (IRL) framework for trajectory prediction. Instead of exponentially increasing the state-space complexity with kinematics, we propose a two-stage neural network architecture that considers motion and environment together to recover the reward function. The first-stage network learns feature representations of the environment using low-level LiDAR statistics and the second-stage network combines those learned features with kinematics data. We collected over 30 km of off-road driving data and validated experimentally that our method can effectively extract useful environmental and kinematic features. We generate accurate predictions of the distribution of future trajectories of the vehicle, encoding complex behaviors such as multi-modal distributions at road intersections, and even show different predictions at the same intersection depending on the vehicle's speed.

Incremental learning abstract discrete planning domains and mappings to continuous perceptions

Authors:Luciano Serafini, Paolo Traverso
Date:2018-10-16 15:53:22

Most of the works on planning and learning, e.g., planning by (model based) reinforcement learning, are based on two main assumptions: (i) the set of states of the planning domain is fixed; (ii) the mapping between the observations from the real world and the states is implicitly assumed or learned offline, and it is not part of the planning domain. Consequently, the focus is on learning the transitions between states. In this paper, we drop such assumptions. We provide a formal framework in which (i) the agent can learn dynamically new states of the planning domain; (ii) the mapping between abstract states and the perception from the real world, represented by continuous variables, is part of the planning domain; (iii) such mapping is learned and updated along the "life" of the agent. We define an algorithm that interleaves planning, acting, and learning, and allows the agent to update the planning domain depending on how much it trusts the model w.r.t. the new experiences learned by executing actions. We define a measure of coherence between the planning domain and the real world as perceived by the agent. We test our approach showing that the agent learns increasingly coherent models, and that the system can scale to deal with models with an order of $10^6$ states.

Factorized Machine Self-Confidence for Decision-Making Agents

Authors:Brett W Israelsen, Nisar R Ahmed, Eric Frew, Dale Lawrence, Brian Argrow
Date:2018-10-15 17:06:38

Algorithmic assurances from advanced autonomous systems assist human users in understanding, trusting, and using such systems appropriately. Designing these systems with the capacity of assessing their own capabilities is one approach to creating an algorithmic assurance. The idea of 'machine self-confidence' is introduced for autonomous systems. Using a factorization based framework for self-confidence assessment, one component of self-confidence, called 'solver-quality', is discussed in the context of Markov decision processes for autonomous systems. Markov decision processes underlie much of the theory of reinforcement learning, and are commonly used for planning and decision making under uncertainty in robotics and autonomous systems. A 'solver quality' metric is formally defined in the context of decision making algorithms based on Markov decision processes. A method for assessing solver quality is then derived, drawing inspiration from empirical hardness models. Finally, numerical experiments for an unmanned autonomous vehicle navigation problem under different solver, parameter, and environment conditions indicate that the self-confidence metric exhibits the desired properties. Discussion of results, and avenues for future investigation are included.

The Dreaming Variational Autoencoder for Reinforcement Learning Environments

Authors:Per-Arne Andersen, Morten Goodwin, Ole-Christoffer Granmo
Date:2018-10-02 08:31:39

Reinforcement learning has shown great potential in generalizing over raw sensory data using only a single neural network for value optimization. There are several challenges in the current state-of-the-art reinforcement learning algorithms that prevent them from converging towards the global optima. It is likely that the solution to these problems lies in short- and long-term planning, exploration and memory management for reinforcement learning algorithms. Games are often used to benchmark reinforcement learning algorithms as they provide a flexible, reproducible, and easy to control environment. Regardless, few games feature a state-space where results in exploration, memory, and planning are easily perceived. This paper presents The Dreaming Variational Autoencoder (DVAE), a neural network based generative modeling architecture for exploration in environments with sparse feedback. We further present Deep Maze, a novel and flexible maze engine that challenges DVAE in partial and fully-observable state-spaces, long-horizon tasks, and deterministic and stochastic problems. We show initial findings and encourage further work in reinforcement learning driven by generative exploration.

Interactive Agent Modeling by Learning to Probe

Authors:Tianmin Shu, Caiming Xiong, Ying Nian Wu, Song-Chun Zhu
Date:2018-10-01 02:55:07

The ability to model other agents, such as understanding their intentions and skills, is essential to an agent's interactions with other agents. Conventional agent modeling relies on passive observation from demonstrations. In this work, we propose an interactive agent modeling scheme enabled by encouraging an agent to learn to probe. In particular, the probing agent (i.e., a learner) learns to interact with the environment and with a target agent (i.e., a demonstrator) to maximize the change in the observed behaviors of that agent. Through probing, rich behaviors can be observed and are used for enhancing the agent modeling to learn a more accurate mind model of the target agent. Our framework consists of two learning processes: i) imitation learning for an approximated agent model and ii) pure curiosity-driven reinforcement learning for an efficient probing policy to discover new behaviors that otherwise cannot be observed. We have validated our approach in four different tasks. The experimental results suggest that the agent model learned by our approach i) generalizes better in novel scenarios than the ones learned by passive observation, random probing, and other curiosity-driven approaches do, and ii) can be used for enhancing performance in multiple applications including distilling optimal planning to a policy net, collaboration, and competition. A video demo is available at https://www.dropbox.com/s/8mz6rd3349tso67/Probing_Demo.mov?dl=0

Few-Shot Goal Inference for Visuomotor Learning and Planning

Authors:Annie Xie, Avi Singh, Sergey Levine, Chelsea Finn
Date:2018-09-30 22:57:58

Reinforcement learning and planning methods require an objective or reward function that encodes the desired behavior. Yet, in practice, there is a wide range of scenarios where an objective is difficult to provide programmatically, such as tasks with visual observations involving unknown object positions or deformable objects. In these cases, prior methods use engineered problem-specific solutions, e.g., by instrumenting the environment with additional sensors to measure a proxy for the objective. Such solutions require a significant engineering effort on a per-task basis, and make it impractical for robots to continuously learn complex skills outside of laboratory settings. We aim to find a more general and scalable solution for specifying goals for robot learning in unconstrained environments. To that end, we formulate the few-shot objective learning problem, where the goal is to learn a task objective from only a few example images of successful end states for that task. We propose a simple solution to this problem: meta-learn a classifier that can recognize new goals from a few examples. We show how this approach can be used with both model-free reinforcement learning and visual model-based planning and show results in three domains: rope manipulation from images in simulation, visual navigation in a simulated 3D environment, and object arrangement into user-specified configurations on a real robot.

Robot Representation and Reasoning with Knowledge from Reinforcement Learning

Authors:Keting Lu, Shiqi Zhang, Peter Stone, Xiaoping Chen
Date:2018-09-28 15:02:21

Reinforcement learning (RL) agents aim at learning by interacting with an environment, and are not designed for representing or reasoning with declarative knowledge. Knowledge representation and reasoning (KRR) paradigms are strong in declarative KRR tasks, but are ill-equipped to learn from such experiences. In this work, we integrate logical-probabilistic KRR with model-based RL, enabling agents to simultaneously reason with declarative knowledge and learn from interaction experiences. The knowledge from humans and RL is unified and used for dynamically computing task-specific planning models under potentially new environments. Experiments were conducted using a mobile robot working on dialog, navigation, and delivery tasks. Results show significant improvements, in comparison to existing model-based RL methods.

Learning and Planning with a Semantic Model

Authors:Yi Wu, Yuxin Wu, Aviv Tamar, Stuart Russell, Georgia Gkioxari, Yuandong Tian
Date:2018-09-28 03:30:37

Building deep reinforcement learning agents that can generalize and adapt to unseen environments remains a fundamental challenge for AI. This paper describes progress on this challenge in the context of man-made environments, which are visually diverse but contain intrinsic semantic regularities. We propose a hybrid model-based and model-free approach, LEArning and Planning with Semantics (LEAPS), consisting of a multi-target sub-policy that acts on visual inputs, and a Bayesian model over semantic structures. When placed in an unseen environment, the agent plans with the semantic model to make high-level decisions, proposes the next sub-target for the sub-policy to execute, and updates the semantic model based on new observations. We perform experiments in visual navigation tasks using House3D, a 3D environment that contains diverse human-designed indoor scenes with real-world objects. LEAPS outperforms strong baselines that do not explicitly plan using the semantic content.

Floyd-Warshall Reinforcement Learning: Learning from Past Experiences to Reach New Goals

Authors:Vikas Dhiman, Shurjo Banerjee, Jeffrey M. Siskind, Jason J. Corso
Date:2018-09-25 05:09:32

Consider multi-goal tasks that involve static environments and dynamic goals. Examples of such tasks, such as goal-directed navigation and pick-and-place in robotics, abound. Two types of Reinforcement Learning (RL) algorithms are used for such tasks: model-free or model-based. Each of these approaches has limitations. Model-free RL struggles to transfer learned information when the goal location changes, but achieves high asymptotic accuracy in single goal tasks. Model-based RL can transfer learned information to new goal locations by retaining the explicitly learned state-dynamics, but is limited by the fact that small errors in modelling these dynamics accumulate over long-term planning. In this work, we improve upon the limitations of model-free RL in multi-goal domains. We do this by adapting the Floyd-Warshall algorithm for RL and call the adaptation Floyd-Warshall RL (FWRL). The proposed algorithm learns a goal-conditioned action-value function by constraining the value of the optimal path between any two states to be greater than or equal to the value of paths via intermediary states. Experimentally, we show that FWRL is more sample-efficient and learns higher reward strategies in multi-goal tasks as compared to Q-learning, model-based RL and other relevant baselines in a tabular domain.
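
The path constraint described above can be sketched in tabular form. The example below is only an illustration under simplifying assumptions (deterministic transitions, a reward of -1 per step, and a table seeded with one-step experience); it is not the authors' implementation, and the name `fwrl_relax` and the four-state chain are hypothetical. It enforces Q[s][a][g] >= Q[s][a][k] + max_b Q[k][b][g] for every intermediate state k, in Floyd-Warshall order.

```python
import math

def fwrl_relax(q):
    """Enforce Q[s][a][g] >= Q[s][a][k] + max_b Q[k][b][g] for all s, a, k, g."""
    n_states = len(q)
    n_actions = len(q[0])
    for k in range(n_states):
        # Best value of reaching each goal g when starting from intermediate state k.
        best_from_k = [max(q[k][b][g] for b in range(n_actions)) for g in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                for g in range(n_states):
                    via_k = q[s][a][k] + best_from_k[g]
                    if via_k > q[s][a][g]:
                        q[s][a][g] = via_k
    return q

# Hypothetical 4-state chain 0-1-2-3 with actions 0 = left and 1 = right.
N_STATES, LEFT, RIGHT = 4, 0, 1
q = [[[-math.inf] * N_STATES for _ in range(2)] for _ in range(N_STATES)]
for s in range(N_STATES):
    if s > 0:
        q[s][LEFT][s - 1] = -1.0   # observed one-step transition, reward -1
    if s < N_STATES - 1:
        q[s][RIGHT][s + 1] = -1.0  # observed one-step transition, reward -1

fwrl_relax(q)
```

After relaxation, `q[0][RIGHT][3]` holds the value of the three-step path to state 3, even though no experience was collected with state 3 as an explicit goal from state 0. This is the kind of goal transfer the abstract refers to.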

Fast Motion Planning for High-DOF Robot Systems Using Hierarchical System Identification

Authors:Biao Jia, Zherong Pan, Dinesh Manocha
Date:2018-09-21 18:01:34

We present an efficient algorithm for motion planning and control of a robot system with a high number of degrees-of-freedom. These include high-DOF soft robots or an articulated robot interacting with a deformable environment. Our approach takes into account dynamics constraints and presents a novel technique to accelerate the forward dynamic computation using a data-driven method. We precompute the forward dynamic function of the robot system on a hierarchical adaptive grid. Furthermore, we exploit the properties of underactuated robot systems and perform these computations for a few DOFs. We provide error bounds for our approximate forward dynamics computation and use our approach for optimization-based motion planning and reinforcement-learning-based feedback control. Our formulation is used for motion planning of two high DOF robot systems: a high-DOF line-actuated elastic robot arm and an underwater swimming robot operating in water. As compared to prior techniques based on exact dynamic function computation, we observe one to two orders of magnitude improvement in performance.

Combined Reinforcement Learning via Abstract Representations

Authors:Vincent François-Lavet, Yoshua Bengio, Doina Precup, Joelle Pineau
Date:2018-09-12 15:12:49

In the quest for efficient and robust reinforcement learning methods, both model-free and model-based approaches offer advantages. In this paper we propose a new way of explicitly bridging both approaches via a shared low-dimensional learned encoding of the environment, meant to capture summarizing abstractions. We show that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space. In addition, this approach recovers a sufficient low-dimensional representation of the environment, which opens up new strategies for interpretable AI, exploration and transfer learning.

Towards a Fatality-Aware Benchmark of Probabilistic Reaction Prediction in Highly Interactive Driving Scenarios

Authors:Wei Zhan, Liting Sun, Yeping Hu, Jiachen Li, Masayoshi Tomizuka
Date:2018-09-10 17:48:58

Autonomous vehicles should be able to generate accurate probabilistic predictions for uncertain behavior of other road users. Moreover, reactive predictions are necessary in highly interactive driving scenarios to answer "what if I take this action in the future" for autonomous vehicles. There is no existing unified framework to homogenize the problem formulation, representation simplification, and evaluation metric for various prediction methods, such as probabilistic graphical models (PGM), neural networks (NN) and inverse reinforcement learning (IRL). In this paper, we formulate a probabilistic reaction prediction problem, and reveal the relationship between reaction and situation prediction problems. We employ prototype trajectories with designated motion patterns other than "intention" to homogenize the representation so that probabilities corresponding to each trajectory generated by different methods can be evaluated. We also discuss the reasons why "intention" is not suitable to serve as a motion indicator in highly interactive scenarios. We propose to use Brier score as the baseline metric for evaluation. In order to reveal the fatality of the consequences when the predictions are adopted by decision-making and planning, we propose a fatality-aware metric, which is a weighted Brier score based on the criticality of the trajectory pairs of the interacting entities. Conservatism and non-defensiveness are defined from the weighted Brier score to indicate the consequences caused by inaccurate predictions. Modified methods based on PGM, NN and IRL are provided to generate probabilistic reaction predictions in an exemplar scenario of nudging from a highway ramp. The results are evaluated by the baseline and proposed metrics to construct a mini benchmark. Analysis on the properties of each method is also provided by comparing the baseline and proposed metric scores.
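
To make the baseline metric concrete, the sketch below computes a Brier score for a single prediction over a set of prototype trajectories, plus a weighted variant. The per-trajectory weights are a hypothetical stand-in for the criticality weighting described above; the abstract does not specify how the authors derive their weights, so this is an assumption rather than their metric.

```python
def brier_score(probs, outcome_index, weights=None):
    """Sum of (optionally weighted) squared errors against the one-hot outcome."""
    if weights is None:
        weights = [1.0] * len(probs)
    total = 0.0
    for i, (p, w) in enumerate(zip(probs, weights)):
        target = 1.0 if i == outcome_index else 0.0
        total += w * (p - target) ** 2
    return total

# Hypothetical predicted probabilities over three prototype trajectories.
predicted = [0.7, 0.2, 0.1]

# Unweighted score when trajectory 0 is the one actually observed.
baseline = brier_score(predicted, outcome_index=0)

# Weighted score when the rarely predicted trajectory 2 occurs and is
# treated as five times more critical than the others (assumed weights).
weighted = brier_score(predicted, outcome_index=2, weights=[1.0, 1.0, 5.0])
```

A lower score means better calibrated probabilities. The weighted variant penalizes a miss on the critical trajectory much more heavily than the unweighted score would.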

Probabilistic Prediction of Interactive Driving Behavior via Hierarchical Inverse Reinforcement Learning

Authors:Liting Sun, Wei Zhan, Masayoshi Tomizuka
Date:2018-09-09 05:44:16

Autonomous vehicles (AVs) are on the road. To safely and efficiently interact with other road participants, AVs have to accurately predict the behavior of surrounding vehicles and plan accordingly. Such prediction should be probabilistic, to address the uncertainties in human behavior. Such prediction should also be interactive, since the distribution over all possible trajectories of the predicted vehicle depends not only on historical information, but also on future plans of other vehicles that interact with it. To achieve such interaction-aware predictions, we propose a probabilistic prediction approach based on hierarchical inverse reinforcement learning (IRL). First, we explicitly consider the hierarchical trajectory-generation process of human drivers involving both discrete and continuous driving decisions. Based on this, the distribution over all future trajectories of the predicted vehicle is formulated as a mixture of distributions partitioned by the discrete decisions. Then we apply IRL hierarchically to learn the distributions from real human demonstrations. A case study for the ramp-merging driving scenario is provided. The quantitative results show that the proposed approach can accurately predict both the discrete driving decisions such as yield or pass as well as the continuous trajectories.

How to Combine Tree-Search Methods in Reinforcement Learning

Authors:Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Date:2018-09-06 06:40:08

Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves while the information obtained at the root is not leveraged other than for updating the policy. Here, we question the potency of this approach. Namely, the latter procedure is non-contractive in general, and its convergence is not guaranteed. Our proposed enhancement is straightforward and simple: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a $\gamma^h$-contracting procedure, where $\gamma$ is the discount factor and $h$ is the tree depth. To establish our results, we first introduce a notion called multiple-step greedy consistency. We then provide convergence rates for two algorithmic instantiations of the above enhancement in the presence of noise injected to both the tree search stage and value estimation stage.

ExIt-OOS: Towards Learning from Planning in Imperfect Information Games

Authors:Andy Kitchen, Michela Benedetti
Date:2018-08-30 05:04:44

The current state of the art in playing many important perfect information games, including Chess and Go, combines planning and deep reinforcement learning with self-play. We extend this approach to imperfect information games and present ExIt-OOS, a novel approach to playing imperfect information games within the Expert Iteration framework and inspired by AlphaZero. We use Online Outcome Sampling, an online search algorithm for imperfect information games in place of MCTS. While training online, our neural strategy is used to improve the accuracy of playouts in OOS, allowing a learning and planning feedback loop for imperfect information games.

Deep RTS: A Game Environment for Deep Reinforcement Learning in Real-Time Strategy Games

Authors:Per-Arne Andersen, Morten Goodwin, Ole-Christoffer Granmo
Date:2018-08-15 10:30:41

Reinforcement learning (RL) is an area of research that has blossomed tremendously in recent years and has shown remarkable potential for artificial intelligence based opponents in computer games. This success is primarily due to the vast capabilities of convolutional neural networks, that can extract useful features from noisy and complex data. Games are excellent tools to test and push the boundaries of novel RL algorithms because they give valuable insight into how well an algorithm can perform in isolated environments without the real-life consequences. Real-time strategy games (RTS) is a genre that has tremendous complexity and challenges the player in short and long-term planning. There is much research that focuses on applied RL in RTS games, and novel advances are therefore anticipated in the not too distant future. However, there are to date few environments for testing RTS AIs. Environments in the literature are often either overly simplistic, such as microRTS, or complex and without the possibility for accelerated learning on consumer hardware like StarCraft II. This paper introduces the Deep RTS game environment for testing cutting-edge artificial intelligence algorithms for RTS games. Deep RTS is a high-performance RTS game made specifically for artificial intelligence research. It supports accelerated learning, meaning that it can learn at a magnitude of 50 000 times faster compared to existing RTS games. Deep RTS has a flexible configuration, enabling research in several different RTS scenarios, including partially observable state-spaces and map complexity. We show that Deep RTS lives up to our promises by comparing its performance with microRTS, ELF, and StarCraft II on high-end consumer hardware. Using Deep RTS, we show that a Deep Q-Network agent beats random-play agents over 70% of the time. Deep RTS is publicly available at https://github.com/cair/DeepRTS.

Learning to Optimize Join Queries With Deep Reinforcement Learning

Authors:Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, Ion Stoica
Date:2018-08-09 15:30:06

Exhaustive enumeration of all possible join orders is often avoided, and most optimizers leverage heuristics to prune the search space. The design and implementation of heuristics are well-understood when the cost model is roughly linear, and we find that these heuristics can be significantly suboptimal when there are non-linearities in cost. Ideally, instead of a fixed heuristic, we would want a strategy to guide the search space in a more data-driven way---tailoring the search to a specific dataset and query workload. Recognizing the link between classical Dynamic Programming enumeration methods and recent results in Reinforcement Learning (RL), we propose a new method for learning optimized join search strategies. We present our RL-based DQ optimizer, which currently optimizes select-project-join blocks. We implement three versions of DQ to illustrate the ease of integration into existing DBMSes: (1) A version built on top of Apache Calcite, (2) a version integrated into PostgreSQL, and (3) a version integrated into SparkSQL. Our extensive evaluation shows that DQ achieves plans with optimization costs and query execution times competitive with the native query optimizer in each system, but can execute significantly faster after learning (often by orders of magnitude).

Representational efficiency outweighs action efficiency in human program induction

Authors:Sophia Sanborn, David D. Bourgin, Michael Chang, Thomas L. Griffiths
Date:2018-07-18 20:20:40

The importance of hierarchically structured representations for tractable planning has long been acknowledged. However, the questions of how people discover such abstractions and how to define a set of optimal abstractions remain open. This problem has been explored in cognitive science in the problem solving literature and in computer science in hierarchical reinforcement learning. Here, we emphasize an algorithmic perspective on learning hierarchical representations in which the objective is to efficiently encode the structure of the problem, or, equivalently, to learn an algorithm with minimal length. We introduce a novel problem-solving paradigm that links problem solving and program induction under the Markov Decision Process (MDP) framework. Using this task, we target the question of whether humans discover hierarchical solutions by maximizing efficiency in number of actions they generate or by minimizing the complexity of the resulting representation and find evidence for the primacy of representational efficiency.

Safe Reinforcement Learning via Probabilistic Shields

Authors:Nils Jansen, Bettina Könighofer, Sebastian Junges, Alexandru C. Serban, Roderick Bloem
Date:2018-07-16 20:29:04

This paper targets the efficient construction of a safety shield for decision making in scenarios that incorporate uncertainty. Markov decision processes (MDPs) are prominent models to capture such planning problems. Reinforcement learning (RL) is a machine learning technique to determine near-optimal policies in MDPs that may be unknown prior to exploring the model. However, during exploration, RL is prone to induce behavior that is undesirable or not allowed in safety- or mission-critical contexts. We introduce the concept of a probabilistic shield that enables decision-making to adhere to safety constraints with high probability. In a separation of concerns, we employ formal verification to efficiently compute the probabilities of critical decisions within a safety-relevant fragment of the MDP. We use these results to realize a shield that is applied to an RL algorithm which then optimizes the actual performance objective. We discuss tradeoffs between sufficient progress in exploration of the environment and ensuring safety. In our experiments, we demonstrate on the arcade game PAC-MAN and on a case study involving service robots that the learning efficiency increases as the learning needs orders of magnitude fewer episodes.
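
A minimal sketch of how a shield might filter actions before the learner chooses among them. The absolute-threshold rule, the fallback to the safest action, and the names `shielded_actions` and `choose_action` are assumptions made for illustration. In the paper the safety probabilities come from formal verification; here they are simply given as a dictionary.

```python
def shielded_actions(safety_prob, threshold):
    """Return actions whose estimated safety probability meets the threshold."""
    allowed = [a for a, p in safety_prob.items() if p >= threshold]
    if not allowed:
        # Assumed fallback: if nothing passes, keep the safest available action(s)
        # so the agent is never left without a move.
        best = max(safety_prob.values())
        allowed = [a for a, p in safety_prob.items() if p == best]
    return allowed

def choose_action(q_values, safety_prob, threshold):
    """Greedy choice over Q-values, restricted to the shielded action set."""
    allowed = shielded_actions(safety_prob, threshold)
    return max(allowed, key=lambda a: q_values[a])

# Hypothetical values for one state.
q_values = {"left": 1.0, "right": 5.0, "stay": 0.5}
safety_prob = {"left": 0.99, "right": 0.40, "stay": 0.95}

strict = choose_action(q_values, safety_prob, threshold=0.9)      # "right" is blocked
permissive = choose_action(q_values, safety_prob, threshold=0.3)  # nothing is blocked
```

The threshold exposes the tradeoff the abstract mentions: a strict shield blocks the high-value but risky action, while a permissive one leaves the learner free to explore it.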

Exploring Hierarchy-Aware Inverse Reinforcement Learning

Authors:Chris Cundy, Daniel Filan
Date:2018-07-13 12:33:07

We introduce a new generative model for human planning under the Bayesian Inverse Reinforcement Learning (BIRL) framework which takes into account the fact that humans often plan using hierarchical strategies. We describe the Bayesian Inverse Hierarchical RL (BIHRL) algorithm for inferring the values of hierarchical planners, and use an illustrative toy model to show that BIHRL retains accuracy where standard BIRL fails. Furthermore, BIHRL is able to accurately predict the goals of 'Wikispeedia' game players, with inclusion of hierarchical structure in the model resulting in a large boost in accuracy. We show that BIHRL is able to significantly outperform BIRL even when we only have a weak prior on the hierarchical structure of the plans available to the agent, and discuss the significant challenges that remain for scaling up this framework to more realistic settings.

A Reinforcement Learning Approach to Jointly Adapt Vehicular Communications and Planning for Optimized Driving

Authors:Mayank K. Pal, Rupali Bhati, Anil Sharma, Sanjit K. Kaul, Saket Anand, P. B. Sujit
Date:2018-07-10 08:00:22

Our premise is that autonomous vehicles must optimize communications and motion planning jointly. Specifically, a vehicle must adapt its motion plan staying cognizant of communications rate related constraints and adapt the use of communications while being cognizant of motion planning related restrictions that may be imposed by the on-road environment. To this end, we formulate a reinforcement learning problem wherein an autonomous vehicle jointly chooses (a) a motion planning action that executes on-road and (b) a communications action of querying sensed information from the infrastructure. The goal is to optimize the driving utility of the autonomous vehicle. We apply the Q-learning algorithm to make the vehicle learn the optimal policy, which makes the optimal choice of planning and communications actions at any given time. We demonstrate the ability of the optimal policy to smartly adapt communications and planning actions, while achieving large driving utilities, using simulations.
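
The joint choice of a motion action and a communications action can be sketched as tabular Q-learning over the product of the two action sets. The action names, state labels, reward, and hyperparameters below are hypothetical placeholders; only the standard Q-learning update is taken as given.

```python
from collections import defaultdict
from itertools import product

# Hypothetical action sets; the joint action is a (motion, communications) pair.
MOTION_ACTIONS = ("keep_lane", "change_lane")
COMM_ACTIONS = ("no_query", "query_infrastructure")
JOINT_ACTIONS = list(product(MOTION_ACTIONS, COMM_ACTIONS))

def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in JOINT_ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

def greedy_action(q, state):
    """Pick the joint action with the highest current Q-value in this state."""
    return max(JOINT_ACTIONS, key=lambda a: q[(state, a)])

q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
joint = ("keep_lane", "query_infrastructure")
q_update(q, "s0", joint, reward=1.0, next_state="s1")
first = q[("s0", joint)]
q_update(q, "s0", joint, reward=1.0, next_state="s1")
second = q[("s0", joint)]
```

Treating the pair as a single action keeps the update unchanged and lets the learned policy trade off querying against maneuvering, at the cost of an action space that grows multiplicatively.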

Encoding Motion Primitives for Autonomous Vehicles using Virtual Velocity Constraints and Neural Network Scheduling

Authors:Mogens Graf Plessen
Date:2018-07-05 21:44:39

Within the context of trajectory planning for autonomous vehicles, this paper proposes methods for efficient encoding of motion primitives in neural networks on top of model-based and gradient-free reinforcement learning. Five core aspects are distinguished: system model, network architecture, training algorithm, training task selection, and hardware/software implementation. For the system model, a kinematic (3-states-2-controls) and a dynamic (16-states-2-controls) vehicle model are compared. For the network architecture, 3 feedforward structures are compared, including weighted skip connections. For the training algorithm, virtual velocity constraints and network scheduling are proposed. For the training tasks, different feature vector selections are discussed. For the implementation, aspects of gradient-free learning using 1 GPU and the associated handling of perturbation noise are discussed. The effects of the proposed methods are illustrated in experiments encoding up to 14625 motion primitives. The capabilities of tiny neural networks with as few as 10 scalar parameters when scheduled on vehicle velocity are emphasized.

Hierarchical Reinforcement Learning with Abductive Planning

Authors:Kazeto Yamamoto, Takashi Onishi, Yoshimasa Tsuruoka
Date:2018-06-28 06:56:19

One of the key challenges in applying reinforcement learning to real-life problems is that the amount of trial-and-error required to learn a good policy increases drastically as the task becomes complex. One potential solution to this problem is to combine reinforcement learning with automated symbolic planning and utilize prior knowledge on the domain. However, existing methods have limitations in their applicability and expressiveness. In this paper we propose a hierarchical reinforcement learning method based on abductive symbolic planning. The planner can deal with user-defined evaluation functions and is not based on the Herbrand theorem. Therefore it can utilize prior knowledge of the rewards and can work in a domain where the state space is unknown. We demonstrate empirically that our architecture significantly improves learning efficiency with respect to the number of training examples on the evaluation domain, in which the state space is unknown and there exist multiple goals.

Human-Interactive Subgoal Supervision for Efficient Inverse Reinforcement Learning

Authors:Xinlei Pan, Eshed Ohn-Bar, Nicholas Rhinehart, Yan Xu, Yilin Shen, Kris M. Kitani
Date:2018-06-22 03:24:00

Humans are able to understand and perform complex tasks by strategically structuring the tasks into incremental steps or subgoals. For a robot attempting to learn to perform a sequential task with critical subgoal states, such states can provide a natural opportunity for interaction with a human expert. This paper analyzes the benefit of incorporating a notion of subgoals into Inverse Reinforcement Learning (IRL) with a Human-In-The-Loop (HITL) framework. The learning process is interactive, with a human expert first providing input in the form of full demonstrations along with some subgoal states. These subgoal states define a set of subtasks for the learning agent to complete in order to achieve the final goal. The learning agent queries for partial demonstrations corresponding to each subtask as needed when the agent struggles with the subtask. The proposed Human Interactive IRL (HI-IRL) framework is evaluated on several discrete path-planning tasks. We demonstrate that subgoal-based interactive structuring of the learning task results in significantly more efficient learning, requiring only a fraction of the demonstration data needed for learning the underlying reward function with the baseline IRL model.

Improving width-based planning with compact policies

Authors:Miquel Junyent, Anders Jonsson, Vicenç Gómez
Date:2018-06-15 10:41:23

Optimal action selection in decision problems characterized by sparse, delayed rewards is still an open challenge. For these problems, current deep reinforcement learning methods require enormous amounts of data to learn controllers that reach human-level performance. In this work, we propose a method that interleaves planning and learning to address this issue. The planning step hinges on the Iterated-Width (IW) planner, a state-of-the-art planner that makes explicit use of the state representation to perform structured exploration. IW is able to scale up to problems independently of the size of the state space. From the state-actions visited by IW, the learning step estimates a compact policy, which in turn is used to guide the planning step. The type of exploration used by our method is radically different from the standard random exploration used in RL. We evaluate our method in simple problems where we show that it outperforms the state-of-the-art reinforcement learning algorithms A2C and Alpha Zero. Finally, we present preliminary results in a subset of the Atari games suite.

Surprising Negative Results for Generative Adversarial Tree Search

Authors:Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Zachary C Lipton, Animashree Anandkumar
Date:2018-06-15 01:35:03

While many recent advances in deep reinforcement learning (RL) rely on model-free methods, model-based approaches remain an alluring prospect for their potential to exploit unsupervised data to learn an environment model. In this work, we provide an extensive study on the design of deep generative models for RL environments and propose a sample-efficient and robust method to learn the model of Atari environments. We deploy this model and propose generative adversarial tree search (GATS), a deep RL algorithm that learns the environment model and implements Monte Carlo tree search (MCTS) on the learned model for planning. While MCTS on the learned model is computationally expensive, similar to AlphaGo, GATS follows depth-limited MCTS. GATS employs a deep Q network (DQN) and learns a Q-function to assign values to the leaves of the tree in MCTS. We theoretically analyze GATS vis-a-vis the bias-variance trade-off and show GATS is able to mitigate the worst-case error in the Q-estimate. While we were expecting GATS to enjoy better sample complexity and faster convergence to better policies, surprisingly, GATS fails to outperform DQN. We provide a study in which we show why depth-limited MCTS fails to perform desirably.

Deep Reinforcement Learning for Dynamic Urban Transportation Problems

Authors:Laura Schultz, Vadim Sokolov
Date:2018-06-14 00:24:49

We explore the use of deep learning and deep reinforcement learning for optimization problems in transportation. Many transportation system analysis tasks are formulated as optimization problems, such as optimal control problems in intelligent transportation systems and long-term urban planning. Often, transportation models used to represent the dynamics of a transportation system involve large data sets with complex input-output interactions and are difficult to use in the context of optimization. Deep learning metamodels can produce a lower-dimensional representation of those relations and allow optimization and reinforcement learning algorithms to be implemented efficiently. In particular, we develop deep learning models for calibrating transportation simulators and for reinforcement learning to solve the problem of optimal scheduling of travelers on the network.

Automatic View Planning with Multi-scale Deep Reinforcement Learning Agents

Authors:Amir Alansary, Loic Le Folgoc, Ghislain Vaillant, Ozan Oktay, Yuanwei Li, Wenjia Bai, Jonathan Passerat-Palmbach, Ricardo Guerrero, Konstantinos Kamnitsas, Benjamin Hou, Steven McDonagh, Ben Glocker, Bernhard Kainz, Daniel Rueckert
Date:2018-06-08 15:49:45

We propose a fully automatic method to find standardized view planes in 3D image acquisitions. Standard view images are important in clinical practice as they provide a means to perform biometric measurements from similar anatomical regions. These views are often constrained to the native orientation of a 3D image acquisition. Navigating through target anatomy to find the required view plane is tedious and operator-dependent. For this task, we employ a multi-scale reinforcement learning (RL) agent framework and extensively evaluate several Deep Q-Network (DQN) based strategies. RL enables a natural learning paradigm by interaction with the environment, which can be used to mimic experienced operators. We evaluate our results using the distance between the anatomical landmarks and detected planes, and the angles between their normal vector and target. The proposed algorithm is assessed on the mid-sagittal and anterior-posterior commissure planes of brain MRI, and the 4-chamber long-axis plane commonly used in cardiac MRI, achieving accuracy of 1.53mm, 1.98mm and 4.84mm, respectively.

Temporal Difference Variational Auto-Encoder

Authors:Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, Theophane Weber
Date:2018-06-08 12:10:58

To act and plan in complex environments, we posit that agents should have a mental simulator of the world with three characteristics: (a) it should build an abstract state representing the condition of the world; (b) it should form a belief which represents uncertainty on the world; (c) it should go beyond simple step-by-step simulation, and exhibit temporal abstraction. Motivated by the absence of a model satisfying all these requirements, we propose TD-VAE, a generative sequence model that learns representations containing explicit beliefs about states several steps into the future, and that can be rolled out directly without single-step transitions. TD-VAE is trained on pairs of temporally separated time points, using an analogue of temporal difference learning used in reinforcement learning.

Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

Authors:John D. Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, Sergey Levine
Date:2018-06-07 17:49:08

In this work, we take a representation learning perspective on hierarchical reinforcement learning, where the problem of learning lower layers in a hierarchy is transformed into the problem of learning trajectory-level generative models. We show that we can learn continuous latent representations of trajectories, which are effective in solving temporally extended and multi-stage problems. Our proposed model, SeCTAR, draws inspiration from variational autoencoders, and learns latent representations of trajectories. A key component of this method is to learn both a latent-conditioned policy and a latent-conditioned model which are consistent with each other. Given the same latent, the policy generates a trajectory which should match the trajectory predicted by the model. This model provides a built-in prediction mechanism, by predicting the outcome of closed loop policy behavior. We propose a novel algorithm for performing hierarchical RL with this model, combining model-based planning in the learned latent space with an unsupervised exploration objective. We show that our model is effective at reasoning over long horizons with sparse rewards for several simulated tasks, outperforming standard reinforcement learning methods and prior methods for hierarchical reasoning, model-based planning, and exploration.

Simplifying Reward Design through Divide-and-Conquer

Authors:Ellis Ratner, Dylan Hadfield-Menell, Anca D. Dragan
Date:2018-06-07 03:49:05

Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating. The reward needs to work across multiple different environments, and that often requires many iterations of tuning. We introduce a novel divide-and-conquer approach that enables the designer to specify a reward separately for each environment. By treating these separate reward functions as observations about the underlying true reward, we derive an approach to infer a common reward across all environments. We conduct user studies in an abstract grid world domain and in a motion planning domain for a 7-DOF manipulator that measure user effort and solution quality. We show that our method is faster, easier to use, and produces a higher quality solution than the typical method of designing a reward jointly across all environments. We additionally conduct a series of experiments that measure the sensitivity of these results to different properties of the reward design task, such as the number of environments, the number of feasible solutions per environment, and the fraction of the total features that vary within each environment. We find that independent reward design outperforms the standard, joint, reward design process but works best when the design problem can be divided into simpler subproblems.

Deep Reinforcement Learning for General Video Game AI

Authors:Ruben Rodriguez Torrado, Philip Bontrager, Julian Togelius, Jialin Liu, Diego Perez-Liebana
Date:2018-06-06 22:39:26

The General Video Game AI (GVGAI) competition and its associated software framework provide a way of benchmarking AI algorithms on a large number of games written in a domain-specific description language. While the competition has seen plenty of interest, it has so far focused on online planning, providing a forward model that allows the use of algorithms such as Monte Carlo Tree Search. In this paper, we describe how we interface GVGAI to the OpenAI Gym environment, a widely used way of connecting agents to reinforcement learning problems. Using this interface, we characterize how widely used implementations of several deep reinforcement learning algorithms fare on a number of GVGAI games. We further analyze the results to provide a first indication of the difficulty of these games relative to each other, and relative to those in the Arcade Learning Environment under similar conditions.

Relational Deep Reinforcement Learning

Authors:Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, Peter Battaglia
Date:2018-06-05 17:39:12

We introduce an approach for deep reinforcement learning (RL) that improves upon the efficiency, generalization capacity, and interpretability of conventional approaches through structured perception and relational reasoning. It uses self-attention to iteratively reason about the relations between entities in a scene and to guide a model-free policy. Our results show that in a novel navigation and planning task called Box-World, our agent finds interpretable solutions that improve upon baselines in terms of sample complexity, ability to generalize to more complex scenes than experienced during training, and overall performance. In the StarCraft II Learning Environment, our agent achieves state-of-the-art performance on six mini-games -- surpassing human grandmaster performance on four. By considering architectural inductive biases, our work opens new directions for overcoming important, but stubborn, challenges in deep RL.

The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces

Authors:G. Zacharias Holland, Erin J. Talvitie, Michael Bowling
Date:2018-06-05 17:31:02

Dyna is a fundamental approach to model-based reinforcement learning (MBRL) that interleaves planning, acting, and learning in an online setting. In the most typical application of Dyna, the dynamics model is used to generate one-step transitions from selected start states from the agent's history, which are used to update the agent's value function or policy as if they were real experiences. In this work, one-step Dyna was applied to several games from the Arcade Learning Environment (ALE). We found that the model-based updates offered surprisingly little benefit over simply performing more updates with the agent's existing experience, even when using a perfect model. We hypothesize that to get the most from planning, the model must be used to generate unfamiliar experience. To test this, we experimented with the "shape" of planning in multiple different concrete instantiations of Dyna, performing fewer, longer rollouts, rather than many short rollouts. We found that planning shape has a profound impact on the efficacy of Dyna for both perfect and learned models. In addition to these findings regarding Dyna in general, our results represent, to our knowledge, the first time that a learned dynamics model has been successfully used for planning in the ALE, suggesting that Dyna may be a viable approach to MBRL in the ALE and other high-dimensional problems.
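
The notion of planning "shape" can be sketched as splitting a fixed budget of model steps between the number of rollouts and their length. The chain model, policy, and budget below are hypothetical stand-ins, not the ALE setup used in the paper.

```python
import random


def plan_with_shape(model, policy, start_states, num_rollouts, rollout_len, rng):
    """Generate simulated transitions: num_rollouts rollouts of rollout_len steps each."""
    transitions = []
    for _ in range(num_rollouts):
        # Each rollout starts from a state drawn from the agent's history.
        state = rng.choice(start_states)
        for _ in range(rollout_len):
            action = policy(state)
            next_state, reward = model(state, action)
            transitions.append((state, action, reward, next_state))
            state = next_state
    return transitions


# Toy stand-ins (hypothetical): a 1-D chain where the action +1 moves right.
def toy_model(state, action):
    next_state = state + action
    return next_state, float(next_state == 10)


def toy_policy(state):
    return 1


rng = random.Random(0)
history = [0, 1, 2]   # states the agent has actually visited
budget = 100          # total model steps available for planning

# Same budget, two shapes: 100 one-step rollouts vs. 10 rollouts of 10 steps.
one_step = plan_with_shape(toy_model, toy_policy, history, budget, 1, rng)
long_rollouts = plan_with_shape(toy_model, toy_policy, history, budget // 10, 10, rng)
```

In this toy, one-step rollouts never leave the neighbourhood of visited states, while the longer rollouts reach states the agent has not experienced, which is the kind of unfamiliar experience the abstract hypothesizes planning needs.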

Adversarial Reinforcement Learning Framework for Benchmarking Collision Avoidance Mechanisms in Autonomous Vehicles

Authors:Vahid Behzadan, Arslan Munir
Date:2018-06-04 20:17:40

With the rapidly growing interest in autonomous navigation, the body of research on motion planning and collision avoidance techniques has enjoyed an accelerating rate of novel proposals and developments. However, the complexity of new techniques and their safety requirements render the bulk of current benchmarking frameworks inadequate, thus leaving the need for efficient comparison techniques unanswered. This work proposes a novel framework based on deep reinforcement learning for benchmarking the behavior of collision avoidance mechanisms under the worst-case scenario of dealing with an optimal adversarial agent, trained to drive the system into unsafe states. We describe the architecture and flow of this framework as a benchmarking solution, and demonstrate its efficacy via a practical case study of comparing the reliability of two collision avoidance mechanisms in response to intentional collision attempts.

Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling

Authors:Emilie Kaufmann, Wouter Koolen, Aurelien Garivier
Date:2018-06-04 06:37:22

Learning the minimum/maximum mean among a finite set of distributions is a fundamental sub-task in planning, game tree search and reinforcement learning. We formalize this learning task as the problem of sequentially testing how the minimum mean among a finite set of distributions compares to a given threshold. We develop refined non-asymptotic lower bounds, which show that optimality mandates very different sampling behavior for a low vs high true minimum. We show that Thompson Sampling and the intuitive Lower Confidence Bounds policy each nail only one of these cases. We develop a novel approach that we call Murphy Sampling. Even though it entertains exclusively low true minima, we prove that MS is optimal for both possibilities. We then design advanced self-normalized deviation inequalities, fueling more aggressive stopping rules. We complement our theoretical guarantees by experiments showing that MS works best in practice.

Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning

Authors:Kavosh Asadi, Evan Cater, Dipendra Misra, Michael L. Littman
Date:2018-06-01 21:54:18

Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. In this context, an important question is the loss function used for model learning, as varying the loss function can have a remarkable impact on the effectiveness of planning. Recently Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of the value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric. This equivalence improves our understanding of value-aware models, and also creates a theoretical foundation for applications of Wasserstein in model-based reinforcement learning.

Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning

Authors:Ramtin Keramati, Jay Whang, Patrick Cho, Emma Brunskill
Date:2018-06-01 02:54:06

Humans learn to play video games significantly faster than the state-of-the-art reinforcement learning (RL) algorithms. People seem to build simple models that are easy to learn to support planning and strategic exploration. Inspired by this, we investigate two issues in leveraging model-based RL for sample efficiency. First we investigate how to perform strategic exploration when exact planning is not feasible and empirically show that optimistic Monte Carlo Tree Search outperforms posterior sampling methods. Second we show how to learn simple deterministic models to support fast learning using object representation. We illustrate the benefit of these ideas by introducing a novel algorithm, Strategic Object Oriented Reinforcement Learning (SOORL), that outperforms state-of-the-art algorithms in the game of Pitfall! in less than 50 episodes.

Observe and Look Further: Achieving Consistent Performance on Atari

Authors:Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, Olivier Pietquin
Date:2018-05-29 17:19:59

Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently. In this paper, we propose an algorithm that addresses each of these challenges and is able to learn human-level policies on nearly all Atari games. A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of $\gamma = 0.999$ (instead of $\gamma = 0.99$), extending the effective planning horizon by an order of magnitude; and we ease the exploration problem by using human demonstrations that guide the agent towards rewarding states. When tested on a set of 42 Atari games, our algorithm exceeds the performance of an average human on 40 games using a common set of hyperparameters. Furthermore, it is the first deep RL algorithm to solve the first level of Montezuma's Revenge.
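
The "order of magnitude" claim follows from the common rule of thumb that a discount factor $\gamma$ corresponds to an effective horizon of roughly $1/(1-\gamma)$ steps. The helper below is a small illustrative check, assuming that rule of thumb; the function name is hypothetical.

```python
def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma) for a discount factor gamma in [0, 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


short = effective_horizon(0.99)   # about 100 steps
long = effective_horizon(0.999)   # about 1000 steps
ratio = long / short              # about 10, i.e. one order of magnitude
```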

The Actor Search Tree Critic (ASTC) for Off-Policy POMDP Learning in Medical Decision Making

Authors:Luchen Li, Matthieu Komorowski, Aldo A. Faisal
Date:2018-05-29 15:55:33

Off-policy reinforcement learning enables learning a near-optimal policy from suboptimal experience, thereby providing opportunities for artificial intelligence applications in healthcare. Previous works have mainly framed patient-clinician interactions as Markov decision processes, while true physiological states are not necessarily fully observable from clinical data. We capture this situation with a partially observable Markov decision process, in which an agent optimises its actions in a belief represented as a distribution of patient states inferred from individual history trajectories. A Gaussian mixture model is fitted to the observed data. Moreover, we take into account the fact that nuances in pharmaceutical dosage could presumably result in significantly different effects by modelling a continuous policy through a Gaussian approximator directly in the policy space, i.e. the actor. To address the challenge of the infinite number of possible belief states, which renders exact value iteration intractable, we evaluate and plan only for every encountered belief, through a heuristic search tree that tightly maintains lower and upper bounds on the true value of the belief. We further resort to function approximation to update the value bound estimates, i.e. the critic, so that the tree search can be improved through more compact bounds at the fringe nodes that are back-propagated to the root. Both actor and critic parameters are learned via gradient-based approaches. Our proposed policy, trained on real intensive care unit data, is capable of dictating dosing of vasopressors and intravenous fluids for sepsis patients that leads to the best patient outcomes.

Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning

Authors:Wen Sun, J. Andrew Bagnell, Byron Boots
Date:2018-05-29 04:24:17

In this paper, we propose to combine imitation and reinforcement learning via the idea of reward shaping using an oracle. We study the effectiveness of the near-optimal cost-to-go oracle on the planning horizon and demonstrate that the cost-to-go oracle shortens the learner's planning horizon as a function of its accuracy: a globally optimal oracle can shorten the planning horizon to one, leading to a one-step greedy Markov Decision Process which is much easier to optimize, while an oracle that is far away from optimality requires planning over a longer horizon to achieve near-optimal performance. Hence our new insight bridges the gap and interpolates between imitation learning and reinforcement learning. Motivated by the above-mentioned insights, we propose Truncated HORizon Policy Search (THOR), a method that focuses on searching for policies that maximize the total reshaped reward over a finite planning horizon when the oracle is sub-optimal. We experimentally demonstrate that a gradient-based implementation of THOR can achieve superior performance compared to RL baselines and IL baselines even when the oracle is sub-optimal.
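
The claim that an exact oracle reduces planning to a one-step greedy choice can be illustrated with generic potential-based reward shaping. This is a sketch under stated assumptions: the 5-state chain, its rewards, and the discount factor are hypothetical, and the shaping form used here is the textbook one rather than necessarily the exact formulation in THOR.

```python
GAMMA = 0.9
GOAL = 4  # hypothetical chain with states 0..4; reaching state 4 yields reward 1


def step(state, action):
    """Deterministic chain dynamics; action is -1 (left) or +1 (right)."""
    if state == GOAL:
        return GOAL, 0.0  # the goal is absorbing
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def oracle_value(state):
    """Exact optimal value on this chain: discounted reward of walking right."""
    if state == GOAL:
        return 0.0
    return GAMMA ** (GOAL - state - 1)


def shaped_reward(state, action, potential):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    next_state, reward = step(state, action)
    return reward + GAMMA * potential(next_state) - potential(state)


def greedy_one_step(state, potential):
    """Pick the action with the highest immediate shaped reward (horizon one)."""
    return max((-1, +1), key=lambda a: shaped_reward(state, a, potential))
```

With the exact oracle as the potential, the shaped reward of the optimal action is zero and every other action is negative, so looking one step ahead suffices; a less accurate potential would not give this guarantee and would call for a longer horizon.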

Value Propagation Networks

Authors:Nantas Nardelli, Gabriel Synnaeve, Zeming Lin, Pushmeet Kohli, Philip H. S. Torr, Nicolas Usunier
Date:2018-05-28 23:21:32

We present Value Propagation (VProp), a set of parameter-efficient differentiable planning modules built on Value Iteration which can successfully be trained using reinforcement learning to solve unseen tasks, have the capability to generalize to larger map sizes, and can learn to navigate in dynamic environments. We show that the modules enable learning to plan when the environment also includes stochastic elements, providing a cost-efficient learning system to build low-level size-invariant planners for a variety of interactive navigation problems. We evaluate on static and dynamic configurations of MazeBase grid-worlds, with randomly generated environments of several different sizes, and on a StarCraft navigation scenario, with more complex dynamics, and pixels as input.
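
For readers unfamiliar with the Value Iteration procedure that these modules build on, here is a minimal tabular sketch on a grid world. It is plain, non-differentiable value iteration, not VProp itself; the grid encoding, reward of 1 for entering the goal, and discount factor are assumptions made for illustration.

```python
def value_iteration(grid, goal, gamma=0.9, iters=50):
    """Tabular value iteration on a grid.

    grid: list of equal-length strings, '#' marks a wall and '.' a free cell.
    goal: (row, col) of the terminal goal cell; entering it yields reward 1.
    """
    rows, cols = len(grid), len(grid[0])
    values = [[0.0] * cols for _ in range(rows)]
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(iters):
        new_values = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] == '#' or (r, c) == goal:
                    continue  # walls and the terminal goal keep value 0
                best = 0.0
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == '#':
                        continue  # moves off the map or into walls are not allowed
                    reward = 1.0 if (nr, nc) == goal else 0.0
                    best = max(best, reward + gamma * values[nr][nc])
                new_values[r][c] = best
        values = new_values
    return values


example_grid = ["...",
                ".#.",
                "..."]
example_values = value_iteration(example_grid, goal=(0, 2))
```

Because each sweep only combines a cell's neighbours, the same update applies unchanged to maps of any size.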

Dyna Planning using a Feature Based Generative Model

Authors:Ryan Faulkner, Doina Precup
Date:2018-05-23 23:23:34

Dyna-style reinforcement learning is a powerful approach for problems where not much real data is available. The main idea is to supplement real trajectories, or sequences of sampled states over time, with simulated ones sampled from a learned model of the environment. However, in large state spaces, the problem of learning a good generative model of the environment has been open so far. We propose to use deep belief networks to learn an environment model for use in Dyna. We present our approach and validate it empirically on problems where the state observations consist of images. Our results demonstrate that using deep belief networks, which are full generative models, significantly outperforms the use of linear expectation models, proposed in Sutton et al. (2008).

Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation

Authors:Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, Xiaodong He
Date:2018-05-21 17:23:31

We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework, the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder. The high-level decoder constructs a plan by generating a semantic concept (i.e., topic) for each image in sequence. The low-level decoder generates a sentence for each image using a semantic compositional network, which effectively grounds the sentence generation conditioned on the topic. The two decoders are jointly trained end-to-end using reinforcement learning. We evaluate our model on the visual storytelling (VIST) dataset. Empirical results from both automatic and human evaluations demonstrate that the proposed hierarchically structured reinforced training achieves significantly better performance compared to a strong flat deep reinforcement learning baseline.

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior

Authors:Siddharth Reddy, Anca D. Dragan, Sergey Levine
Date:2018-05-21 12:15:34

Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules -- the dynamics -- governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user's internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.

A Lyapunov-based Approach to Safe Reinforcement Learning

Authors:Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh
Date:2018-05-20 05:12:04

In many real-world reinforcement learning (RL) problems, besides optimizing the main objective function, an agent must concurrently avoid violating a number of constraints. In particular, besides optimizing performance it is crucial to guarantee the safety of an agent during training as well as deployment (e.g. a robot should avoid taking actions - exploratory or not - which irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of constrained Markov decision problems (CMDPs), an extension of the standard Markov decision problems (MDPs) augmented with constraints on expected cumulative costs. Our approach hinges on a novel Lyapunov method. We define and present a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints. Leveraging these theoretical underpinnings, we show how to use the Lyapunov approach to systematically transform dynamic programming (DP) and RL algorithms into their safe counterparts. To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance.
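
For reference, a CMDP in its generic textbook form can be written as below. The symbols ($r$ for reward, $d$ for the constraint cost, $d_0$ for the cost budget, $\gamma$ for the discount factor) are notation assumed here for illustration; the paper may use a different but equivalent formulation, for example cost minimization.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, d(s_t, a_t) \right] \le d_0 .
```

Dropping the constraint recovers the standard MDP objective, which is the sense in which a CMDP extends an MDP.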

FollowNet: Robot Navigation by Following Natural Language Directions with Deep Reinforcement Learning

Authors:Pararth Shah, Marek Fiser, Aleksandra Faust, J. Chase Kew, Dilek Hakkani-Tur
Date:2018-05-16 06:29:18

Understanding and following directions provided by humans can enable robots to navigate effectively in unknown situations. We present FollowNet, an end-to-end differentiable neural architecture for learning multi-modal navigation policies. FollowNet maps natural language instructions as well as visual and depth inputs to locomotion primitives. FollowNet processes instructions using an attention mechanism conditioned on its visual and depth input to focus on the relevant parts of the command while performing the navigation task. Deep reinforcement learning (RL) with a sparse reward simultaneously learns the state representation, the attention function, and control policies. We evaluate our agent on a dataset of complex natural language directions that guide the agent through a rich and realistic dataset of simulated homes. We show that the FollowNet agent learns to execute previously unseen instructions described with a similar vocabulary, and successfully navigates along paths not encountered during training. The agent shows 30% improvement over a baseline model without the attention mechanism, with 52% success rate at novel instructions.

Generating Rescheduling Knowledge using Reinforcement Learning in a Cognitive Architecture

Authors:Jorge A. Palombarini, Juan Cruz Barsce, Ernesto C. Martínez
Date:2018-05-12 17:05:56

In order to reach higher degrees of flexibility, adaptability and autonomy in manufacturing systems, it is essential to develop new rescheduling methodologies which resort to cognitive capabilities, similar to those found in human beings. Artificial cognition is important for designing planning and control systems that generate and represent knowledge about heuristics for repair-based scheduling. Rescheduling knowledge in the form of decision rules is used to deal with unforeseen events and disturbances reactively in real time, and take advantage of the ability to act interactively with the user to counteract the effects of disruptions. In this work, to achieve the aforementioned goals, a novel approach to generate rescheduling knowledge in the form of dynamic first-order logical rules is proposed. The proposed approach is based on the integration of reinforcement learning with artificial cognitive capabilities involving perception and reasoning/learning skills embedded in the Soar cognitive architecture. An industrial example is discussed showing that the approach enables the scheduling system to assess its operational range in an autonomic way, and to acquire experience through intensive simulation while performing repair tasks.

Task Transfer by Preference-Based Cost Learning

Authors:Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu
Date:2018-05-12 09:08:14

The goal of task transfer in reinforcement learning is migrating the action policy of an agent to the target task from the source task. Given their successes on robotic action planning, current methods mostly rely on two requirements: exactly-relevant expert demonstrations or an explicitly-coded cost function on the target task, both of which, however, are inconvenient to obtain in practice. In this paper, we relax these two strong conditions by developing a novel task transfer framework where the expert preference is applied as guidance. In particular, we alternate the following two steps: Firstly, we let experts apply pre-defined preference rules to select related expert demonstrations for the target task. Secondly, based on the selection result, we learn the target cost function and trajectory distribution simultaneously via enhanced Adversarial MaxEnt IRL and generate more trajectories from the learned target distribution for the next preference selection. A theoretical analysis of the distribution learning and convergence of the proposed algorithm is provided. Extensive simulations on several benchmarks have been conducted to further verify the effectiveness of the proposed method.

Learning Coordinated Tasks using Reinforcement Learning in Humanoids

Authors:S Phaniteja, Parijat Dewangan, Pooja Guhan, K Madhava Krishna, Abhishek Sarkar
Date:2018-05-09 15:21:09

With the advent of artificial intelligence and machine learning, humanoid robots are made to learn a variety of skills which humans possess. One of the fundamental skills which humans use in day-to-day activities is performing tasks with coordination between both hands. In the case of humanoids, learning such skills requires optimal motion planning, which includes avoiding collisions with the surroundings. In this paper, we propose a framework to learn coordinated tasks in cluttered environments based on DiGrad - a multi-task reinforcement learning algorithm for continuous action-spaces. Further, we propose an algorithm to smooth the joint space trajectories obtained by the proposed framework in order to reduce the noise instilled during training. The proposed framework was tested on a 27 degrees of freedom (DoF) humanoid with an articulated torso performing a coordinated object-reaching task with both hands in four different environments with varying levels of difficulty. It is observed that the humanoid is able to plan collision-free trajectories in real-time. Simulation results also reveal the usefulness of the articulated torso for performing tasks which require coordination between both arms.

Planning and Learning with Stochastic Action Sets

Authors:Craig Boutilier, Alon Cohen, Amit Daniely, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov, Dale Schuurmans
Date:2018-05-07 06:48:41

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution.
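
The idea of Q-learning with sampled action sets can be illustrated with a small tabular sketch. This is not the authors' code: the function names, the independent-availability sampler, and the choice to resample until the action set is non-empty are all assumptions made for illustration. The only change from ordinary Q-learning is that the bootstrap maximum ranges over the actions that were actually available in the next state.

```python
import random
from collections import defaultdict


def sample_available(actions, p_available, rng):
    """Draw a realized action set: each action is available independently.

    Resampling until the set is non-empty is an assumption of this sketch.
    """
    if all(p_available[a] <= 0.0 for a in actions):
        raise ValueError("at least one action needs positive availability")
    while True:
        avail = [a for a in actions if rng.random() < p_available[a]]
        if avail:
            return avail


def sas_q_update(Q, s, a, r, s_next, avail_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning step; the max ranges only over the sampled set."""
    target = r + gamma * max(Q[(s_next, b)] for b in avail_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
Q[("s1", "x")], Q[("s1", "y")] = 1.0, 5.0
# Only "x" is available in s1, so the larger value of "y" is ignored.
print(sas_q_update(Q, "s0", "x", 0.0, "s1", ["x"], alpha=0.5))
```

In the usage lines, the update bootstraps from the value of "x" alone because "y" was not in the realized action set for the next state.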

Motion Planning Among Dynamic, Decision-Making Agents with Deep Reinforcement Learning

Authors:Michael Everett, Yu Fan Chen, Jonathan P. How
Date:2018-05-04 22:45:08

Robots that navigate among pedestrians use collision avoidance algorithms to enable safe and efficient operation. Recent works present deep reinforcement learning as a framework to model the complex interactions and cooperation. However, they are implemented using key assumptions about other agents' behavior that deviate from reality as the number of agents in the environment increases. This work extends our previous approach to develop an algorithm that learns collision avoidance among a variety of types of dynamic agents without assuming they follow any particular behavior rules. This work also introduces a strategy using LSTM that enables the algorithm to use observations of an arbitrary number of other agents, instead of previous methods that have a fixed observation size. The proposed algorithm outperforms our previous approach in simulation as the number of agents increases, and the algorithm is demonstrated on a fully autonomous robotic vehicle traveling at human walking speed, without the use of a 3D Lidar.

Decoupling Dynamics and Reward for Transfer Learning

Authors:Amy Zhang, Harsh Satija, Joelle Pineau
Date:2018-04-27 21:16:40

Current reinforcement learning (RL) methods can successfully learn single tasks but often generalize poorly to modest perturbations in task domain or training procedure. In this work, we present a decoupled learning strategy for RL that creates a shared representation space where knowledge can be robustly transferred. We separate learning the task representation, the forward dynamics, the inverse dynamics and the reward function of the domain, and show that this decoupling improves performance within the task, transfers well to changes in dynamics and reward, and can be effectively used for online planning. Empirical results show good performance in both continuous and discrete RL domains.

Action Categorization for Computationally Improved Task Learning and Planning

Authors:Lakshmi Nair, Sonia Chernova
Date:2018-04-26 02:10:22

This paper explores the problem of task learning and planning, contributing the Action-Category Representation (ACR) to improve computational performance of both Planning and Reinforcement Learning (RL). ACR is an algorithm-agnostic, abstract data representation that maps objects to action categories (groups of actions), inspired by the psychological concept of action codes. We validate our approach in StarCraft and Lightworld domains; our results demonstrate several benefits of ACR relating to improved computational performance of planning and RL, by reducing the action space for the agent.

Generative Temporal Models with Spatial Memory for Partially Observed Environments

Authors:Marco Fraccaro, Danilo Jimenez Rezende, Yori Zwols, Alexander Pritzel, S. M. Ali Eslami, Fabio Viola
Date:2018-04-25 07:40:37

In model-based reinforcement learning, generative and temporal models of environments can be leveraged to boost agent performance, either by tuning the agent's representations during training or via use as part of an explicit planning mechanism. However, their application in practice has been limited to simplistic environments, due to the difficulty of training such models in larger, potentially partially-observed and 3D environments. In this work we introduce a novel action-conditioned generative model of such challenging environments. The model features a non-parametric spatial memory system in which we store learned, disentangled representations of the environment. Low-dimensional spatial updates are computed using a state-space model that makes use of knowledge on the prior dynamics of the moving agent, and high-dimensional visual observations are modelled with a Variational Auto-Encoder. The result is a scalable architecture capable of performing coherent predictions over hundreds of time steps across a range of partially observed 2D and 3D environments.

Crawling in Rogue's dungeons with (partitioned) A3C

Authors:Andrea Asperti, Daniele Cortesi, Francesco Sovrano
Date:2018-04-23 19:59:51

Rogue is a famous dungeon-crawling video game of the 1980s, the ancestor of its genre. Rogue-like games are known for the necessity to explore partially observable and always different randomly-generated labyrinths, preventing any form of level replay. As such, they serve as a very natural and challenging task for reinforcement learning, requiring the acquisition of complex, non-reactive behaviors involving memory and planning. In this article we show how, exploiting a version of A3C partitioned on different situations, the agent is able to reach the stairs and descend to the next level in 98% of cases.

Subgoal Discovery for Hierarchical Dialogue Policy Learning

Authors:Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, Tony Jebara
Date:2018-04-20 23:06:44

Developing agents to engage in complex goal-oriented dialogues is challenging partly because the main learning signals are very sparse in long conversations. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given successful example dialogues, we propose the Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a multi-level policy by hierarchical reinforcement learning. We demonstrate our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that our approach performs competitively against a state-of-the-art method that requires human-defined subgoals. Moreover, we show that the learned subgoals are often human comprehensible.

PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making

Authors:Fangkai Yang, Daoming Lyu, Bo Liu, Steven Gustafson
Date:2018-04-20 18:16:43

Reinforcement learning and symbolic planning have both been used to build intelligent autonomous agents. Reinforcement learning relies on learning from interactions with the real world, which often requires an infeasibly large amount of experience. Symbolic planning relies on manually crafted symbolic knowledge, which may not be robust to domain uncertainties and changes. In this paper we present a unified framework, PEORL, that integrates symbolic planning with hierarchical reinforcement learning (HRL) to cope with decision-making in a dynamic environment with uncertainties. Symbolic plans are used to guide the agent's task execution and learning, and the learned experience is fed back to symbolic knowledge to improve planning. This method leads to rapid policy search and robust symbolic plans in complex domains. The framework is tested on benchmark domains of HRL.

Leveraging Statistical Multi-Agent Online Planning with Emergent Value Function Approximation

Authors:Thomy Phan, Lenz Belzner, Thomas Gabor, Kyrill Schmid
Date:2018-04-17 15:10:44

Making decisions is a great challenge in distributed autonomous environments due to enormous state spaces and uncertainty. Many online planning algorithms rely on statistical sampling to avoid searching the whole state space, while still being able to make acceptable decisions. However, planning often has to be performed under strict computational constraints making online planning in multi-agent systems highly limited, which could lead to poor system performance, especially in stochastic domains. In this paper, we propose Emergent Value function Approximation for Distributed Environments (EVADE), an approach to integrate global experience into multi-agent online planning in stochastic domains to consider global effects during local planning. For this purpose, a value function is approximated online based on the emergent system behaviour by using methods of reinforcement learning. We empirically evaluated EVADE with two statistical multi-agent online planning algorithms in a highly complex and stochastic smart factory environment, where multiple agents need to process various items at a shared set of machines. Our experiments show that EVADE can effectively improve the performance of multi-agent online planning while offering efficiency w.r.t. the breadth and depth of the planning process.

Optimizing Query Evaluations using Reinforcement Learning for Web Search

Authors:Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, Saurabh Tiwary
Date:2018-04-12 10:22:28

In web search, typically a candidate generation step selects a small set of documents---from collections containing as many as billions of web pages---that are subsequently ranked and pruned before being presented to the user. In Bing, the candidate generation involves scanning the index using statically designed match plans that prescribe sequences of different match criteria and stopping conditions. In this work, we pose match planning as a reinforcement learning task and observe up to 20% reduction in index blocks accessed, with small or no degradation in the quality of the candidate sets.

Policy Gradient With Value Function Approximation For Collective Multiagent Planning

Authors:Duc Thien Nguyen, Akshat Kumar, Hoong Chuin Lau
Date:2018-04-09 09:45:29

Decentralized (PO)MDPs provide an expressive framework for sequential decision making in a multiagent system. Given their computational complexity, recent research has focused on tractable yet practical subclasses of Dec-POMDPs. We address such a subclass called CDEC-POMDP where the collective behavior of a population of agents affects the joint-reward and environment dynamics. Our main contribution is an actor-critic (AC) reinforcement learning method for optimizing CDEC-POMDP policies. Vanilla AC has slow convergence for larger problems. To address this, we show how a particular decomposition of the approximate action-value function over agents leads to effective updates, and also derive a new way to train the critic based on local reward signals. Comparisons on a synthetic benchmark and a real-world taxi fleet optimization problem show that our new AC approach provides better quality solutions than previous best approaches.

Hierarchical Modular Reinforcement Learning Method and Knowledge Acquisition of State-Action Rule for Multi-target Problem

Authors:Takumi Ichimura, Daisuke Igaue
Date:2018-04-08 14:39:13

Hierarchical Modular Reinforcement Learning (HMRL) consists of two-layered learning, where Profit Sharing plans a prey position in the higher layer and Q-learning trains the state-actions toward the target in the lower layer. In this paper, we extend HMRL to the multi-target problem by taking the distance between targets into consideration. A function called the `AT field' estimates the interest for an agent according to the distance between two agents and the advantage or disadvantage of the other agent. Moreover, knowledge related to state-action rules is extracted by C4.5, and the action in a given situation is decided using the acquired knowledge. To verify the effectiveness of the proposed method, some experimental results are reported.

Universal Planning Networks

Authors:Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, Chelsea Finn
Date:2018-04-02 17:51:53

A key challenge in complex visuomotor control is learning abstract representations that are effective for specifying goals, planning, and generalization. To this end, we introduce universal planning networks (UPN). UPNs embed differentiable planning within a goal-directed policy. This planning computation unrolls a forward model in a latent space and infers an optimal action plan through gradient descent trajectory optimization. The plan-by-gradient-descent process and its underlying representations are learned end-to-end to directly optimize a supervised imitation learning objective. We find that the representations learned are not only effective for goal-directed visual imitation via gradient-based trajectory optimization, but can also provide a metric for specifying goals using images. The learned representations can be leveraged to specify distance-based rewards to reach new target states for model-free reinforcement learning, resulting in substantially more effective learning when solving new tasks described via image-based goals. We were able to achieve successful transfer of visuomotor planning strategies across robots with significantly different morphologies and actuation capabilities.

Learning to Run challenge: Synthesizing physiologically accurate motion using deep reinforcement learning

Authors:Łukasz Kidziński, Sharada P. Mohanty, Carmichael Ong, Jennifer L. Hicks, Sean F. Carroll, Sergey Levine, Marcel Salathé, Scott L. Delp
Date:2018-03-31 17:56:28

Synthesizing physiologically-accurate human movement in a variety of conditions can help practitioners plan surgeries, design experiments, or prototype assistive devices in simulated environments, reducing time and costs and improving treatment outcomes. Because of the large and complex solution spaces of biomechanical models, current methods are constrained to specific movements and models, requiring careful design of a controller and hindering many possible applications. We sought to discover if modern optimization methods efficiently explore these complex spaces. To do this, we posed the problem as a competition in which participants were tasked with developing a controller to enable a physiologically-based human model to navigate a complex obstacle course as quickly as possible, without using any experimental data. They were provided with a human musculoskeletal model and a physics-based simulation environment. In this paper, we discuss the design of the competition, technical difficulties, results, and analysis of the top controllers. The challenge proved that deep reinforcement learning techniques, despite their high computational cost, can be successfully employed as an optimization method for synthesizing physiologically feasible motion in high-dimensional biomechanical systems.

Deep Reinforcement Learning with Model Learning and Monte Carlo Tree Search in Minecraft

Authors:Stephan Alaniz
Date:2018-03-22 16:53:34

Deep reinforcement learning has been successfully applied to several visual-input tasks using model-free methods. In this paper, we propose a model-based approach that combines learning a DNN-based transition model with Monte Carlo tree search to solve a block-placing task in Minecraft. Our learned transition model predicts the next frame and the rewards one step ahead given the last four frames of the agent's first-person-view image and the current action. Then a Monte Carlo tree search algorithm uses this model to plan the best sequence of actions for the agent to perform. On the proposed task in Minecraft, our model-based approach reaches performance comparable to that of the Deep Q-Network, but learns faster and is thus more sample-efficient in training.

DOP: Deep Optimistic Planning with Approximate Value Function Evaluation

Authors:Francesco Riccio, Roberto Capobianco, Daniele Nardi
Date:2018-03-22 14:59:16

Research on reinforcement learning has demonstrated promising results in manifold applications and domains. Still, efficiently learning effective robot behaviors is very difficult, due to unstructured scenarios, high uncertainties, and large state dimensionality (e.g. multi-agent systems or hyper-redundant robots). To alleviate this problem, we present DOP, a deep model-based reinforcement learning algorithm, which exploits action values to both (1) guide the exploration of the state space and (2) plan effective policies. Specifically, we exploit deep neural networks to learn Q-functions that are used to attack the curse of dimensionality during a Monte-Carlo tree search. Our algorithm, in fact, constructs upper confidence bounds on the learned value function to select actions optimistically. We implement and evaluate DOP on different scenarios: (1) a cooperative navigation problem, (2) a fetching task for a 7-DOF KUKA robot, and (3) a human-robot handover with a humanoid robot (both in simulation and real). The obtained results show the effectiveness of DOP in the chosen applications, where action values drive the exploration and reduce the computational demand of the planning process while achieving good performance.
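
Optimistic action selection from learned action values can be sketched with the standard UCB1 rule. This is an assumption made for illustration: the abstract does not specify the exact bound DOP constructs, and the names `ucb_select`, `q_values`, `visit_counts`, and the exploration constant `c` are hypothetical.

```python
import math


def ucb_select(q_values, visit_counts, c=1.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln(N) / n(a)).

    q_values: dict mapping action -> learned value estimate.
    visit_counts: dict mapping action -> number of times it was tried.
    Unvisited actions are selected first (their bonus is unbounded).
    """
    total = sum(visit_counts.get(a, 0) for a in q_values)
    best_action, best_score = None, -math.inf
    for action, q in q_values.items():
        n = visit_counts.get(action, 0)
        if n == 0:
            return action
        score = q + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


q = {"left": 1.0, "right": 0.9}
counts = {"left": 100, "right": 1}
# The rarely tried action gets a large exploration bonus.
print(ucb_select(q, counts, c=1.0))
# With c=0 the rule reduces to greedy selection on the learned values.
print(ucb_select(q, counts, c=0.0))
```

Inside a tree search, a rule of this shape would be applied at each node, with the learned Q-function supplying `q_values` in place of rollout averages.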

Planning with a Receding Horizon for Manipulation in Clutter using a Learned Value Function

Authors:Wissam Bejjani, Rafael Papallas, Matteo Leonetti, Mehmet R. Dogar
Date:2018-03-21 19:37:18

Manipulation in clutter requires solving complex sequential decision making problems in an environment rich with physical interactions. The transfer of motion planning solutions from simulation to the real world, in open-loop, suffers from the inherent uncertainty in modelling real world physics. We propose interleaving planning and execution in real-time, in a closed-loop setting, using a Receding Horizon Planner (RHP) for pushing manipulation in clutter. In this context, we address the problem of finding a suitable value function based heuristic for efficient planning, and for estimating the cost-to-go from the horizon to the goal. We estimate such a value function first by using plans generated by an existing sampling-based planner. Then, we further optimize the value function through reinforcement learning. We evaluate our approach and compare it to state-of-the-art planning techniques for manipulation in clutter. We conduct experiments in simulation with artificially injected uncertainty on the physics parameters, as well as in real world tasks of manipulation in clutter. We show that this approach enables the robot to react to the uncertain dynamics of the real world effectively.

Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation

Authors:Xin Wang, Wenhan Xiong, Hongmin Wang, William Yang Wang
Date:2018-03-21 03:21:38

Existing research studies on vision and language grounding for robot navigation focus on improving model-free deep reinforcement learning (DRL) models in synthetic environments. However, model-free DRL models do not consider the dynamics in the real-world environments, and they often fail to generalize to new scenes. In this paper, we take a radical approach to bridge the gap between synthetic studies and real-world practices---We propose a novel, planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task. Our look-ahead module tightly integrates a look-ahead policy model with an environment model that predicts the next state and the reward. Experimental results suggest that our proposed method significantly outperforms the baselines and achieves the best on the real-world Room-to-Room dataset. Moreover, our scalable method is more generalizable when transferring to unseen environments.

Learning Robotic Assembly from CAD

Authors:Garrett Thomas, Melissa Chien, Aviv Tamar, Juan Aparicio Ojea, Pieter Abbeel
Date:2018-03-20 20:16:18

In this work, motivated by recent manufacturing trends, we investigate autonomous robotic assembly. Industrial assembly tasks require contact-rich manipulation skills, which are challenging to acquire using classical control and motion planning approaches. Consequently, robot controllers for assembly domains are presently engineered to solve a particular task, and cannot easily handle variations in the product or environment. Reinforcement learning (RL) is a promising approach for autonomously acquiring robot skills that involve contact-rich dynamics. However, RL relies on random exploration for learning a control policy, which requires many robot executions, and often gets trapped in locally suboptimal solutions. Instead, we posit that prior knowledge, when available, can improve RL performance. We exploit the fact that in modern assembly domains, geometric information about the task is readily available via the CAD design files. We propose to leverage this prior knowledge by guiding RL along a geometric motion plan, calculated using the CAD data. We show that our approach effectively improves over traditional control approaches for tracking the motion plan, and can solve assembly tasks that require high precision, even without accurate state estimation. In addition, we propose a neural network architecture that can learn to track the motion plan, and generalize the assembly controller to changes in the object positions.

Transferable Pedestrian Motion Prediction Models at Intersections

Authors:Macheng Shen, Golnaz Habibi, Jonathan P. How
Date:2018-03-15 23:58:19

One desirable capability of autonomous cars is to accurately predict pedestrian motion near intersections for safe and efficient trajectory planning. We are interested in developing transfer learning algorithms that can be trained on the pedestrian trajectories collected at one intersection and yet still provide accurate predictions of the trajectories at another, previously unseen intersection. We first discuss feature selection for transferable pedestrian motion models in general. Following this discussion, we develop a transferable pedestrian motion prediction algorithm based on Inverse Reinforcement Learning (IRL) that infers pedestrian intentions and predicts future trajectories based on the observed trajectory. We evaluate our algorithm on a dataset collected at two intersections, training at one intersection and testing at the other. As a baseline, we use the accuracy of augmented semi-nonnegative sparse coding (ASNSC), trained and tested at the same intersection. The results show that the proposed algorithm improves on the baseline accuracy by 40% in the non-transfer task, and 16% in the transfer task.

Rearrangement with Nonprehensile Manipulation Using Deep Reinforcement Learning

Authors:Weihao Yuan, Johannes A. Stork, Danica Kragic, Michael Y. Wang, Kaiyu Hang
Date:2018-03-15 14:00:24

Rearranging objects on a tabletop surface by means of nonprehensile manipulation is a task which requires skillful interaction with the physical world. Usually, this is achieved by precisely modeling physical properties of the objects, robot, and the environment for explicit planning. In contrast, as explicitly modeling the physical environment is not always feasible and involves various uncertainties, we learn a nonprehensile rearrangement strategy with deep reinforcement learning based on only visual feedback. For this, we model the task with rewards and train a deep Q-network. Our potential field-based heuristic exploration strategy reduces the number of collisions which lead to suboptimal outcomes, and we actively balance the training set to avoid bias towards poor examples. Our training process leads to quicker learning and better performance on the task as compared to uniform exploration and standard experience replay. We demonstrate empirical evidence from simulation that our method leads to a success rate of 85%, show that our system can cope with sudden changes of the environment, and compare our performance with human level performance.

The 2017 AIBIRDS Competition

Authors:Matthew Stephenson, Jochen Renz, Xiaoyu Ge, Peng Zhang
Date:2018-03-14 07:53:31

This paper presents an overview of the sixth AIBIRDS competition, held at the 26th International Joint Conference on Artificial Intelligence. This competition tasked participants with developing an intelligent agent which can play the physics-based puzzle game Angry Birds. This game uses a sophisticated physics engine that requires agents to reason and predict the outcome of actions with only limited environmental information. Agents entered into this competition were required to solve a wide assortment of previously unseen levels within a set time limit. The physical reasoning and planning required to solve these levels are very similar to those of many real-world problems. This year's competition featured some of the best agents developed so far and even included several new AI techniques such as deep reinforcement learning. Within this paper we describe the framework, rules, submitted agents and results for this competition. We also provide some background information on related work and other video game AI competitions, as well as discussing some potential ideas for future AIBIRDS competitions and agent improvements.

Hierarchical Reinforcement Learning: Approximating Optimal Discounted TSP Using Local Policies

Authors:Tom Zahavy, Avinatan Hasidim, Haim Kaplan, Yishay Mansour
Date:2018-03-13 08:13:11

In this work, we provide theoretical guarantees for reward decomposition in deterministic MDPs. Reward decomposition is a special case of Hierarchical Reinforcement Learning, that allows one to learn many policies in parallel and combine them into a composite solution. Our approach builds on mapping this problem into a Reward Discounted Traveling Salesman Problem, and then deriving approximate solutions for it. In particular, we focus on approximate solutions that are local, i.e., solutions that only observe information about the current state. Local policies are easy to implement and do not require substantial computational resources as they do not perform planning. While local deterministic policies, like Nearest Neighbor, are being used in practice for hierarchical reinforcement learning, we propose three stochastic policies that guarantee better performance than any deterministic policy.
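
The Nearest Neighbor policy mentioned above as a deterministic local baseline can be sketched as follows. The setup is an assumption made for illustration: a symmetric distance matrix `dist`, a unit reward collected on first arrival at each node other than the start, and a discount `gamma` applied per unit of travel time. The stochastic policies proposed in the paper are not reproduced here.

```python
def nearest_neighbor_tour(dist, start=0):
    """Greedy local policy: always move to the closest unvisited node."""
    unvisited = set(range(len(dist))) - {start}
    tour, current = [start], start
    while unvisited:
        # Ties are broken by node index so the policy is deterministic.
        nxt = min(unvisited, key=lambda j: (dist[current][j], j))
        unvisited.remove(nxt)
        tour.append(nxt)
        current = nxt
    return tour


def discounted_return(dist, tour, gamma):
    """Sum of gamma ** arrival_time over the nodes visited after the start."""
    elapsed, total = 0, 0.0
    for u, v in zip(tour, tour[1:]):
        elapsed += dist[u][v]
        total += gamma ** elapsed
    return total


dist = [
    [0, 1, 3],
    [1, 0, 1],
    [3, 1, 0],
]
tour = nearest_neighbor_tour(dist, start=0)
print(tour, discounted_return(dist, tour, gamma=0.5))
```

The policy is local in the sense used in the abstract: each step depends only on the current node and the set of remaining rewards, with no lookahead.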

Extracting Action Sequences from Texts Based on Deep Reinforcement Learning

Authors:Wenfeng Feng, Hankz Hankui Zhuo, Subbarao Kambhampati
Date:2018-03-07 13:13:16

Extracting action sequences from natural language texts is challenging, as it requires commonsense inferences based on world knowledge. Although there has been work on extracting action scripts, instructions, navigation actions, etc., they require that either the set of candidate actions be provided in advance, or that action descriptions are restricted to a specific form, e.g., description templates. In this paper, we aim to extract action sequences from texts in free natural language, i.e., without any restricted templates, provided the candidate set of actions is unknown. We propose to extract action sequences from texts based on the deep reinforcement learning framework. Specifically, we view "selecting" or "eliminating" words from texts as "actions", and the texts associated with actions as "states". We then build Q-networks to learn the policy of extracting actions and extract plans from the labeled texts. We demonstrate the effectiveness of our approach on several datasets with comparison to state-of-the-art approaches, including online experiments interacting with humans.

Intent-aware Multi-agent Reinforcement Learning

Authors:Siyuan Qi, Song-Chun Zhu
Date:2018-03-06 04:53:50

This paper proposes an intent-aware multi-agent planning framework as well as a learning algorithm. Under this framework, an agent plans in the goal space to maximize the expected utility. The planning process takes the belief of other agents' intents into consideration. Instead of formulating the learning problem as a partially observable Markov decision process (POMDP), we propose a simple but effective linear function approximation of the utility function. It is based on the observation that, for humans, other people's intents influence our utility for a goal. The proposed framework has several major advantages: i) it is computationally feasible and guaranteed to converge; ii) it can easily integrate existing intent prediction and low-level planning algorithms; iii) it does not suffer from sparse feedback in the action space. We evaluate our algorithm on a real-world problem that is non-episodic and in which the number of agents and goals can vary over time. Our algorithm is trained in a scene in which aerial robots and humans interact, and tested in a novel scene with a different environment. Experimental results show that our algorithm achieves the best performance and human-like behaviors emerge during the dynamic process.

Inverse Reinforcement Learning via Nonparametric Spatio-Temporal Subgoal Modeling

Authors:Adrian Šošić, Elmar Rueckert, Jan Peters, Abdelhak M. Zoubir, Heinz Koeppl
Date:2018-03-01 15:31:28

Advances in the field of inverse reinforcement learning (IRL) have led to sophisticated inference frameworks that relax the original modeling assumption of observing an agent behavior that reflects only a single intention. Instead of learning a global behavioral model, recent IRL methods divide the demonstration data into parts, to account for the fact that different trajectories may correspond to different intentions, e.g., because they were generated by different domain experts. In this work, we go one step further: using the intuitive concept of subgoals, we build upon the premise that even a single trajectory can be explained more efficiently locally within a certain context than globally, enabling a more compact representation of the observed behavior. Based on this assumption, we build an implicit intentional model of the agent's goals to forecast its behavior in unobserved situations. The result is an integrated Bayesian prediction framework that significantly outperforms existing IRL solutions and provides smooth policy estimates consistent with the expert's plan. Most notably, our framework naturally handles situations where the intentions of the agent change over time and classical IRL algorithms fail. In addition, due to its probabilistic nature, the model can be straightforwardly applied in active learning scenarios to guide the demonstration process of the expert.

Learning Human-Aware Path Planning with Fully Convolutional Networks

Authors:Noé Pérez-Higueras, Fernando Caballero, Luis Merino
Date:2018-03-01 15:08:14

This work presents an approach to learn path planning for robot social navigation by demonstration. We make use of Fully Convolutional Neural Networks (FCNs) to learn from expert's path demonstrations a map that marks a feasible path to the goal as a classification problem. The use of FCNs allows us to overcome the problem of manually designing/identifying the cost-map and relevant features for the task of robot navigation. The method makes use of optimal Rapidly-exploring Random Tree planner (RRT*) to overcome eventual errors in the path prediction; the FCNs prediction is used as cost-map and also to partially bias the sampling of the configuration space, leading the planner to behave similarly to the learned expert behavior. The approach is evaluated in experiments with real trajectories and compared with Inverse Reinforcement Learning algorithms that use RRT* as underlying planner.

Q-CP: Learning Action Values for Cooperative Planning

Authors:Francesco Riccio, Roberto Capobianco, Daniele Nardi
Date:2018-03-01 10:53:04

Research on multi-robot systems has demonstrated promising results in manifold applications and domains. Still, efficiently learning effective robot behaviors is very difficult, due to unstructured scenarios, high uncertainties, and large state dimensionality (e.g., hyper-redundant robots and groups of robots). To alleviate this problem, we present Q-CP, a cooperative model-based reinforcement learning algorithm, which exploits action values to both (1) guide the exploration of the state space and (2) generate effective policies. Specifically, we exploit Q-learning to attack the curse-of-dimensionality in the iterations of a Monte-Carlo Tree Search. We implement and evaluate Q-CP on different stochastic cooperative (general-sum) games: (1) a simple cooperative navigation problem among 3 robots, (2) a cooperation scenario between a pair of KUKA YouBots performing hand-overs, and (3) a coordination task between two mobile robots entering a door. The obtained results show the effectiveness of Q-CP in the chosen applications, where action values drive the exploration and reduce the computational demand of the planning process while achieving good performance.

Deep Reinforcement Learning for Join Order Enumeration

Authors:Ryan Marcus, Olga Papaemmanouil
Date:2018-02-28 20:00:33

Join order selection plays a significant role in query performance. However, modern query optimizers typically employ static join enumeration algorithms that do not receive any feedback about the quality of the resulting plan. Hence, optimizers often repeatedly choose the same bad plan, as they do not have a mechanism for "learning from their mistakes". In this paper, we argue that existing deep reinforcement learning techniques can be applied to address this challenge. These techniques, powered by artificial neural networks, can automatically improve decision making by incorporating feedback from their successes and failures. Towards this goal, we present ReJOIN, a proof-of-concept join enumerator, and present preliminary results indicating that ReJOIN can match or outperform the PostgreSQL optimizer in terms of plan quality and join enumeration efficiency.

Cuttlefish: A Lightweight Primitive for Adaptive Query Processing

Authors:Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, Johannes Gehrke
Date:2018-02-26 06:50:43

Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefish, a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques. We prototype Cuttlefish in Apache Spark and adaptively choose operators for image convolution, regular expression matching, and relational joins. Our experiments show Cuttlefish-based adaptive convolution and regular expression operators can reach 72-99% of the throughput of an all-knowing oracle that always selects the optimal algorithm, even when individual physical operators are up to 105x slower than the optimal. Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x compared with Spark SQL's query optimizer.
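
The explore/exploit loop described above can be sketched as a small bandit tuner. This is an illustration only: the abstract does not state which bandit strategy Cuttlefish uses, so epsilon-greedy is swapped in here for simplicity, and the class name, method names, and cost model are all hypothetical rather than part of any Cuttlefish or Spark API.

```python
import random


class OperatorTuner:
    """Epsilon-greedy choice among candidate physical operators (illustrative)."""

    def __init__(self, num_operators, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * num_operators       # times each operator was run
        self.mean_cost = [0.0] * num_operators  # running mean of observed cost

    def choose(self):
        # Run every operator once before trusting any estimate.
        for index, count in enumerate(self.counts):
            if count == 0:
                return index
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # Exploit: pick the operator with the lowest mean cost so far.
        return min(range(len(self.counts)), key=lambda i: self.mean_cost[i])

    def observe(self, index, cost):
        # Incremental update of the running mean cost (e.g. seconds per batch).
        self.counts[index] += 1
        self.mean_cost[index] += (cost - self.mean_cost[index]) / self.counts[index]


def run_rounds(tuner, cost_of, rounds):
    """Drive the tuner for a number of rounds; cost_of(index) returns a cost."""
    for _ in range(rounds):
        index = tuner.choose()
        tuner.observe(index, cost_of(index))
    return tuner.counts
```

In a real system cost_of would time one execution of the chosen operator on the next batch of tuples; in this sketch it can be any function from an operator index to a number, so the tuner can be exercised without a query engine.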

Learning to Gather without Communication

Authors:El Mahdi El Mhamdi, Rachid Guerraoui, Alexandre Maurer, Vladislav Tempez
Date:2018-02-21 22:26:21

A standard belief on emerging collective behavior is that it emerges from simple individual rules. Most of the mathematical research on such collective behavior starts from imperative individual rules, like always go to the center. But how could an (optimal) individual rule emerge during a short period within the group lifetime, especially if communication is not available? We argue that such rules can actually emerge in a group in a short span of time via collective (multi-agent) reinforcement learning, i.e., learning via rewards and punishments. We consider the gathering problem: several agents (social animals, swarming robots...) must gather around the same position, which is not determined in advance. They must do so without communication on their planned decision, just by looking at the position of other agents. We present the first experimental evidence that a gathering behavior can be learned without communication in a partially observable environment. The learned behavior has the same properties as a self-stabilizing distributed algorithm, as processes can gather from any initial state (and thus tolerate any transient failure). Besides, we show that it is possible to tolerate the brutal loss of up to 90% of agents without significant impact on the behavior.

Human-in-the-Loop Mixed-Initiative Control under Temporal Tasks

Authors:Meng Guo, Sofie Andersson, Dimos V. Dimarogonas
Date:2018-02-19 20:37:32

This paper considers the motion control and task planning problem of mobile robots under complex high-level tasks and human initiatives. The assigned task is specified as Linear Temporal Logic (LTL) formulas that consist of hard and soft constraints. The human initiative influences the robot autonomy in two explicit ways: with additive terms in the continuous controller and with contingent task assignments. We propose an online coordination scheme that encapsulates (i) a mixed-initiative continuous controller that ensures all-time safety despite possible human errors, (ii) a plan adaptation scheme that accommodates new features discovered in the workspace and short-term tasks assigned by the operator during run time, and (iii) an iterative inverse reinforcement learning (IRL) algorithm that allows the robot to asymptotically learn the human preference on the parameters during the plan synthesis. The results are demonstrated by both realistic human-in-the-loop simulations and experiments.

Efficient Model-Based Deep Reinforcement Learning with Variational State Tabulation

Authors:Dane Corneil, Wulfram Gerstner, Johanni Brea
Date:2018-02-12 19:38:44

Modern reinforcement learning algorithms reach super-human performance on many board and video games, but they are sample inefficient, i.e. they typically require significantly more playing experience than humans to reach an equal performance level. To improve sample efficiency, an agent may build a model of the environment and use planning methods to update its policy. In this article we introduce Variational State Tabulation (VaST), which maps an environment with a high-dimensional state space (e.g. the space of visual inputs) to an abstract tabular model. Prioritized sweeping with small backups, a highly efficient planning method, can then be used to update state-action values. We show how VaST can rapidly learn to maximize reward in tasks like 3D navigation and efficiently adapt to sudden changes in rewards or transition probabilities.

Learning and Querying Fast Generative Models for Reinforcement Learning

Authors:Lars Buesing, Theophane Weber, Sebastien Racaniere, S. M. Ali Eslami, Danilo Rezende, David P. Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, Daan Wierstra
Date:2018-02-08 18:54:44

A key challenge in model-based reinforcement learning (RL) is to synthesize computationally efficient and accurate environment models. We show that carefully designed generative models that learn and operate on compact state representations, so-called state-space models, substantially reduce the computational costs for predicting outcomes of sequences of actions. Extensive experiments establish that state-space models accurately capture the dynamics of Atari games from the Arcade Learning Environment from raw pixels. The computational speed-up of state-space models while maintaining high accuracy makes their application in RL feasible: We demonstrate that agents which query these models for decision making outperform strong model-free baselines on the game MSPACMAN, demonstrating the potential of using learned environment models for planning.

Learning Symmetric and Low-energy Locomotion

Authors:Wenhao Yu, Greg Turk, C. Karen Liu
Date:2018-01-24 17:37:35

Learning locomotion skills is a challenging problem. To generate realistic and smooth locomotion, existing methods use motion capture, finite state machines or morphology-specific knowledge to guide the motion generation algorithms. Deep reinforcement learning (DRL) is a promising approach for the automatic creation of locomotion control. Indeed, a standard benchmark for DRL is to automatically create a running controller for a biped character from a simple reward function. Although several different DRL algorithms can successfully create a running controller, the resulting motions usually look nothing like a real runner. This paper takes a minimalist learning approach to the locomotion problem, without the use of motion examples, finite state machines, or morphology-specific knowledge. We introduce two modifications to the DRL approach that, when used together, produce locomotion behaviors that are symmetric, low-energy, and much closer to that of a real person. First, we introduce a new term to the loss function (not the reward function) that encourages symmetric actions. Second, we introduce a new curriculum learning method that provides modulated physical assistance to help the character with left/right balance and forward movement. The algorithm automatically computes appropriate assistance to the character and gradually relaxes this assistance, so that eventually the character learns to move entirely without help. Because our method does not make use of motion capture data, it can be applied to a variety of character morphologies. We demonstrate locomotion controllers for the lower half of a biped, a full humanoid, a quadruped, and a hexapod. Our results show that learned policies are able to produce symmetric, low-energy gaits. In addition, speed-appropriate gait patterns emerge without any guidance from motion examples or contact planning.

Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning

Authors:Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, Shang-Yu Su
Date:2018-01-18 18:57:33

Training a task-completion dialogue agent via reinforcement learning (RL) is costly because it requires many interactions with real users. One common alternative is to use a user simulator. However, a user simulator usually lacks the language complexity of human interlocutors and the biases in its design may tend to degrade the agent. To address these issues, we present Deep Dyna-Q, which to our knowledge is the first deep RL framework that integrates planning for task-completion dialogue policy learning. We incorporate into the dialogue agent a model of the environment, referred to as the world model, to mimic real user response and generate simulated experience. During dialogue policy learning, the world model is constantly updated with real user experience to approach real user behavior, and in turn, the dialogue agent is optimized using both real experience and simulated experience. The effectiveness of our approach is demonstrated on a movie-ticket booking task in both simulated and human-in-the-loop settings.
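
The interleaving of real experience, world-model learning, and planning can be sketched with classic tabular Dyna-Q. This is not the Deep Dyna-Q implementation: the paper uses neural networks for both the dialogue policy and the world model, whereas the sketch below assumes a tiny deterministic environment, a lookup-table model, and hypothetical names (dyna_q, chain_step) chosen for illustration.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start_state, actions, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Dyna-Q: learn from real steps, then replay simulated ones."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> estimated value
    model = {}              # (state, action) -> (reward, next_state, done)

    def greedy(state):
        best = max(q[(state, a)] for a in actions)
        return rng.choice([a for a in actions if q[(state, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        state = start_state
        for _ in range(max_steps):
            explore = rng.random() < epsilon
            action = rng.choice(actions) if explore else greedy(state)
            reward, next_state, done = env_step(state, action)   # real experience
            update(state, action, reward, next_state, done)      # direct RL
            model[(state, action)] = (reward, next_state, done)  # model learning
            for _ in range(planning_steps):                      # planning
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            state = next_state
            if done:
                break
    return q


def chain_step(state, action):
    """Toy stand-in for a user: a 5-state chain with reward 1 at state 4."""
    next_state = min(max(state + action, 0), 4)
    done = next_state == 4
    return (1.0 if done else 0.0), next_state, done
```

Calling dyna_q(chain_step, 0, [-1, 1]) returns a Q-table for the toy chain. The planning loop is where simulated experience substitutes for additional real interactions, which is the cost the abstract aims to reduce.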

Cellular-Connected UAVs over 5G: Deep Reinforcement Learning for Interference Management

Authors:Ursula Challita, Walid Saad, Christian Bettstetter
Date:2018-01-16 22:35:55

In this paper, an interference-aware path planning scheme for a network of cellular-connected unmanned aerial vehicles (UAVs) is proposed. In particular, each UAV aims at achieving a tradeoff between maximizing energy efficiency and minimizing both wireless latency and the interference level caused on the ground network along its path. The problem is cast as a dynamic game among UAVs. To solve this game, a deep reinforcement learning algorithm, based on echo state network (ESN) cells, is proposed. The introduced deep ESN architecture is trained to allow each UAV to map each observation of the network state to an action, with the goal of minimizing a sequence of time-dependent utility functions. Each UAV uses ESN to learn its optimal path, transmission power level, and cell association vector at different locations along its path. The proposed algorithm is shown to reach a subgame perfect Nash equilibrium (SPNE) upon convergence. Moreover, an upper and lower bound for the altitude of the UAVs is derived thus reducing the computational complexity of the proposed algorithm. Simulation results show that the proposed scheme achieves better wireless latency per UAV and rate per ground user (UE) while requiring a number of steps that is comparable to a heuristic baseline that considers moving via the shortest distance towards the corresponding destinations. The results also show that the optimal altitude of the UAVs varies based on the ground network density and the UE data rate requirements and plays a vital role in minimizing the interference level on the ground UEs as well as the wireless transmission delay of the UAV.

Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator

Authors:Maryam Fazel, Rong Ge, Sham M. Kakade, Mehran Mesbahi
Date:2018-01-15 21:40:50

Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model; 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; 3) they inherently allow for richly parameterized policies. A notable drawback is that even in the most basic continuous control problem (that of linear quadratic regulators), these methods must solve a non-convex optimization problem, where little is understood about their efficiency from both computational and statistical perspectives. In contrast, system identification and model-based planning in optimal control theory have a much more solid theoretical footing, where much is known with regard to their computational and statistical properties. This work bridges this gap, showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem-dependent quantities) with regard to their sample and computational complexities.
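
The convergence claim can be illustrated on a scalar LQR instance. The sketch below is a toy under stated assumptions, not the paper's algorithm or analysis: the system x_{t+1} = a x_t + b u_t with feedback u_t = -k x_t and stage cost q x_t^2 + r u_t^2 is assumed, the cost is evaluated in closed form rather than from sampled rollouts, the gradient is approximated by central differences, and all function names and parameter values are hypothetical. The gain found by gradient descent is compared against the gain obtained from the Riccati recursion.

```python
def lqr_cost(k, a=1.2, b=1.0, q=1.0, r=1.0, x0=1.0):
    """Infinite-horizon cost of the linear policy u = -k x on a scalar system."""
    closed = a - b * k                 # closed-loop dynamics x' = closed * x
    if abs(closed) >= 1.0:
        return float("inf")            # unstable gain: the cost diverges
    # sum_t (q + r k^2) x_t^2 with x_t = closed^t * x0 is a geometric series.
    return (q + r * k * k) * x0 * x0 / (1.0 - closed * closed)


def policy_gradient_descent(k0, step=0.1, iters=200, eps=1e-5):
    """Gradient descent on the gain k, using a finite-difference gradient."""
    k = k0                             # k0 must be a stabilizing gain
    for _ in range(iters):
        grad = (lqr_cost(k + eps) - lqr_cost(k - eps)) / (2.0 * eps)
        k -= step * grad
    return k


def riccati_gain(a=1.2, b=1.0, q=1.0, r=1.0, iters=1000):
    """Optimal gain from the scalar discrete-time Riccati recursion."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)
```

Starting from the stabilizing gain k0 = 1.0, policy_gradient_descent(1.0) and riccati_gain() agree to several decimal places on this instance, even though the cost is not convex in k over the whole stabilizing interval.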

DeepTraffic: Crowdsourced Hyperparameter Tuning of Deep Reinforcement Learning Systems for Multi-Agent Dense Traffic Navigation

Authors:Lex Fridman, Jack Terwilliger, Benedikt Jenik
Date:2018-01-09 05:56:15

We present a traffic simulation named DeepTraffic where the planning systems for a subset of the vehicles are handled by a neural network as part of a model-free, off-policy reinforcement learning process. The primary goal of DeepTraffic is to make the hands-on study of deep reinforcement learning accessible to thousands of students, educators, and researchers in order to inspire and fuel the exploration and evaluation of deep Q-learning network variants and hyperparameter configurations through large-scale, open competition. This paper investigates the crowd-sourced hyperparameter tuning of the policy network that resulted from the first iteration of the DeepTraffic competition where thousands of participants actively searched through the hyperparameter space.

Multiagent-based Participatory Urban Simulation through Inverse Reinforcement Learning

Authors:Soma Suzuki
Date:2017-12-21 11:41:13

The multiagent-based participatory simulation features prominently in urban planning as the acquired model is considered as the hybrid system of the domain and the local knowledge. However, the key problem of generating realistic agents for particular social phenomena invariably remains. The existing models have attempted to dictate the factors involving human behavior, which appeared to be intractable. In this paper, Inverse Reinforcement Learning (IRL) is introduced to address this problem. IRL is developed for computational modeling of human behavior and has achieved great successes in robotics, psychology and machine learning. The possibilities presented by this new style of modeling are drawn out as conclusions, and the relative challenges with this modeling are highlighted.

Hierarchical Text Generation and Planning for Strategic Dialogue

Authors:Denis Yarats, Mike Lewis
Date:2017-12-15 21:33:07

End-to-end models for goal-orientated dialogue are challenging to train, because linguistic and strategic aspects are entangled in latent state vectors. We introduce an approach to learning representations of messages in dialogues by maximizing the likelihood of subsequent sentences and actions, which decouples the semantics of the dialogue utterance from its linguistic realization. We then use these latent sentence representations for hierarchical language generation, planning and reinforcement learning. Experiments show that our approach increases the end-task reward achieved by the model, improves the effectiveness of long-term planning using rollouts, and allows self-play reinforcement learning to improve decision making without diverging from human language. Our hierarchical latent-variable model outperforms previous work both linguistically and strategically.

Occam's razor is insufficient to infer the preferences of irrational agents

Authors:Stuart Armstrong, Sören Mindermann
Date:2017-12-15 19:05:01

Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments. This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple 'normative' assumptions, which cannot be deduced exclusively from observations.

AI2-THOR: An Interactive 3D Environment for Visual AI

Authors:Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, Ali Farhadi
Date:2017-12-14 23:17:24

We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.

IQA: Visual Question Answering in Interactive Environments

Authors:Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi
Date:2017-12-09 00:13:59

We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects (code and dataset available at https://github.com/danielgordon10/thor-iqa-cvpr-2018). IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98

A Novel Model for Arbitration between Planning and Habitual Control Systems

Authors:Farzaneh S. Fard, Thomas P. Trappenberg
Date:2017-12-06 23:33:40

It is well established that human decision making and instrumental control use multiple systems, some of which use habitual action selection and some of which require deliberate planning. Deliberate planning systems use predictions of action-outcomes using an internal model of the agent's environment, while habitual action selection systems learn to automate by repeating previously rewarded actions. Habitual control is computationally efficient but may be inflexible in changing environments. Conversely, deliberate planning may be computationally expensive, but flexible in dynamic environments. This paper proposes a general architecture comprising both control paradigms by introducing an arbitrator that controls which subsystem is used at any time. This system is implemented for a target-reaching task with a simulated two-joint robotic arm that comprises a supervised internal model and deep reinforcement learning. Through permutation of target-reaching conditions, we demonstrate that the proposed model is capable of rapidly learning kinematics of the system without a priori knowledge, and is robust to (A) changing environmental reward and kinematics, and (B) occluded vision. The arbitrator model is compared to exclusive deliberate planning with the internal model and exclusive habitual control instances of the model. The results show how such a model can harness the benefits of both systems, using fast decisions in reliable circumstances while optimizing performance in changing environments. In addition, the proposed model learns very fast. Finally, the system which includes internal models is able to reach the target under visual occlusion, while the pure habitual system is unable to operate sufficiently under such conditions.

Plan, Attend, Generate: Planning for Sequence-to-Sequence Models

Authors:Francis Dutil, Caglar Gulcehre, Adam Trischler, Yoshua Bengio
Date:2017-11-28 18:50:05

We investigate the integration of a planning mechanism into sequence-to-sequence models using attention. We develop a model which can plan ahead in the future when it computes its alignments between input and output sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the recently proposed strategic attentive reader and writer (STRAW) model for Reinforcement Learning. Our proposed model is end-to-end trainable using primarily differentiable operations. We show that it outperforms a strong baseline on character-level translation tasks from WMT'15, the algorithmic task of finding Eulerian circuits of graphs, and question generation from the text. Our analysis demonstrates that the model computes qualitatively intuitive alignments, converges faster than the baselines, and achieves superior performance with fewer parameters.

Hierarchical Policy Search via Return-Weighted Density Estimation

Authors:Takayuki Osa, Masashi Sugiyama
Date:2017-11-28 08:30:11

Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL). Hierarchical RL (HRL) tackles this problem by learning a hierarchical policy, where multiple option policies are in charge of different strategies corresponding to modes of a reward function and a gating policy selects the best option for a given context. Although HRL has been demonstrated to be promising, current state-of-the-art methods still cannot perform well in complex real-world problems due to the difficulty of identifying modes of the reward function. In this paper, we propose a novel method called hierarchical policy search via return-weighted density estimation (HPSDE), which can efficiently identify the modes through density estimation with return-weighted importance sampling. Our proposed method finds option policies corresponding to the modes of the return function and automatically determines the number and the location of option policies, which significantly reduces the burden of hyper-parameter tuning. Through experiments, we demonstrate that the proposed HPSDE successfully learns option policies corresponding to modes of the return function and that it can be successfully applied to a challenging motion planning problem of a redundant robotic manipulator.

Situationally Aware Options

Authors:Daniel J. Mankowitz, Aviv Tamar, Shie Mannor
Date:2017-11-20 08:11:12

Hierarchical abstractions, also known as options, are a type of temporally extended action (Sutton et al. 1999) that enables a reinforcement learning agent to plan at a higher level, abstracting away from the lower-level details. In this work, we learn reusable options whose parameters can vary, encouraging different behaviors, based on the current situation. In principle, these behaviors can include vigor, defence or even risk-averseness. These are some examples of what we refer to in the broader context as Situational Awareness (SA). We incorporate SA, in the form of vigor, into hierarchical RL by defining and learning situationally aware options in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP). This is achieved using our Situationally Aware oPtions (SAP) policy gradient algorithm which comes with a theoretical convergence guarantee. We learn reusable options in different scenarios in a RoboCup soccer domain (i.e., winning/losing). These options learn to execute with different levels of vigor resulting in human-like behaviours such as 'time-wasting' in the winning scenario. We show the potential of the agent to exit bad local optima using reusable options in RoboCup. Finally, using SAP, the agent mitigates feature-based model misspecification in a Bottomless Pit of Death domain.

Hindsight policy gradients

Authors:Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, Juergen Schmidhuber
Date:2017-11-16 10:05:31

A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enable sample efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this paper, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.
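
The hindsight idea described above can be illustrated with a small sketch. It shows only the relabeling step (re-scoring one trajectory as if each state it reached had been the intended goal), not the policy gradient estimator proposed in the paper. The grid states, the `sparse_reward` function, and the helper name `relabel_with_hindsight` are assumptions made for illustration.

```python
from typing import List, Tuple

State = Tuple[int, int]


def sparse_reward(state: State, goal: State) -> float:
    """Hypothetical sparse reward: 1.0 only when the state equals the goal."""
    return 1.0 if state == goal else 0.0


def relabel_with_hindsight(visited: List[State]) -> List[Tuple[State, List[float]]]:
    """Treat every state actually reached as an alternative goal and
    recompute the reward sequence of the same trajectory for that goal."""
    relabeled = []
    for alt_goal in dict.fromkeys(visited):  # unique states, order preserved
        rewards = [sparse_reward(s, alt_goal) for s in visited]
        relabeled.append((alt_goal, rewards))
    return relabeled


# A trajectory that aimed for (3, 3) but never reached it.
trajectory = [(0, 0), (0, 1), (1, 1)]
original_rewards = [sparse_reward(s, (3, 3)) for s in trajectory]
hindsight = relabel_with_hindsight(trajectory)
```

Under the original goal the trajectory carries no reward signal, while each relabeled copy contains at least one non-zero reward, which is the information a hindsight method can exploit.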

TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning

Authors:Gregory Farquhar, Tim Rocktäschel, Maximilian Igl, Shimon Whiteson
Date:2017-10-31 11:54:35

Combining deep model-free reinforcement learning with on-line planning is a promising approach to building on the successes of deep RL. On-line planning with look-ahead trees has proven successful in environments where transition models are known a priori. However, in complex environments where transition models need to be learned from data, the deficiencies of learned models have limited their utility for planning. To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values. We also propose ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network. Both approaches are trained end-to-end, such that the learned model is optimised for its actual use in the tree. We show that TreeQN and ATreeC outperform n-step DQN and A2C on a box-pushing task, as well as n-step DQN and value prediction networks (Oh et al. 2017) on multiple Atari games. Furthermore, we present ablation studies that demonstrate the effect of different auxiliary losses on learning transition models.
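
The recursive tree backup mentioned above can be sketched on a toy problem. This is a simplified illustration under stated assumptions: the transition, reward, and value functions are hand-written stand-ins for learned networks, the state is a plain integer rather than a learned abstract state, and a plain max backup is used, which may differ from the exact aggregation in the paper.

```python
from typing import Callable, List, Sequence


def tree_backup_q(
    state: int,
    actions: Sequence[int],
    transition: Callable[[int, int], int],
    reward: Callable[[int, int], float],
    value: Callable[[int], float],
    depth: int,
    gamma: float = 0.99,
) -> List[float]:
    """Estimate one Q-value per action by expanding a look-ahead tree
    to the given depth and backing up rewards and leaf values."""
    q_values = []
    for action in actions:
        next_state = transition(state, action)
        r = reward(state, action)
        if depth == 1:
            backup = value(next_state)  # leaf: use the state-value estimate
        else:
            backup = max(
                tree_backup_q(next_state, actions, transition, reward,
                              value, depth - 1, gamma)
            )
        q_values.append(r + gamma * backup)
    return q_values


# Toy stand-ins (assumptions): move left/right on a line, reward on reaching 2.
actions = (-1, +1)
q = tree_backup_q(
    state=0,
    actions=actions,
    transition=lambda s, a: s + a,
    reward=lambda s, a: 1.0 if s + a == 2 else 0.0,
    value=lambda s: 0.0,
    depth=2,
    gamma=0.5,
)
```

With depth 2, moving right from state 0 is credited with the discounted reward that becomes reachable one step later, so the estimate for `+1` exceeds the estimate for `-1`.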

Eigenoption Discovery through the Deep Successor Representation

Authors:Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, Murray Campbell
Date:2017-10-30 17:36:19

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.

PRM-RL: Long-range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning

Authors:Aleksandra Faust, Oscar Ramirez, Marek Fiser, Kenneth Oslund, Anthony Francis, James Davidson, Lydia Tapia
Date:2017-10-11 07:19:17

We present PRM-RL, a hierarchical method for long-range navigation task completion that combines sampling-based path planning with reinforcement learning (RL). The RL agents learn short-range, point-to-point navigation policies that capture robot dynamics and task constraints without knowledge of the large-scale topology. Next, the sampling-based planners provide roadmaps which connect robot configurations that can be successfully navigated by the RL agent. The same RL agents are used to control the robot under the direction of the planning, enabling long-range navigation. We use Probabilistic Roadmaps (PRMs) as the sampling-based planner. The RL agents are constructed using feature-based and deep neural net policies in continuous state and action spaces. We evaluate PRM-RL, both in simulation and on-robot, on two navigation tasks with non-trivial robot dynamics: end-to-end differential drive indoor navigation in office environments, and aerial cargo delivery in urban environments with load displacement constraints. Our results show improvement in task completion over both RL agents on their own and traditional sampling-based planners. In the indoor navigation task, PRM-RL successfully completes up to 215 m long trajectories under noisy sensor conditions, and the aerial cargo delivery completes flights over 1000 m without violating the task constraints in an environment 63 million times larger than used in training.
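
The roadmap construction described above, in which two configurations are connected only if the short-range agent can navigate between them, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `can_navigate` is a hypothetical stand-in for rolling out the learned policy, the node positions are fixed rather than sampled, and Dijkstra's algorithm is an assumed choice for querying the roadmap.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]


def build_roadmap(nodes: List[Point],
                  can_navigate: Callable[[Point, Point], bool],
                  max_edge_length: float) -> Dict[int, List[Tuple[int, float]]]:
    """Connect nearby nodes only when the short-range agent succeeds."""
    edges: Dict[int, List[Tuple[int, float]]] = {i: [] for i in range(len(nodes))}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            d = math.dist(nodes[i], nodes[j])
            if d <= max_edge_length and can_navigate(nodes[i], nodes[j]):
                edges[i].append((j, d))
                edges[j].append((i, d))
    return edges


def shortest_path(edges: Dict[int, List[Tuple[int, float]]],
                  start: int, goal: int) -> List[int]:
    """Dijkstra search over the roadmap; returns [] if goal is unreachable."""
    best = {start: 0.0}
    parent: Dict[int, int] = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, d in edges[node]:
            new_cost = cost + d
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(queue, (new_cost, nxt))
    return []


# Hypothetical stand-in for the learned policy: one segment is not navigable.
blocked = {((1.0, 0.0), (2.0, 0.0))}
can_navigate = lambda a, b: (a, b) not in blocked and (b, a) not in blocked

nodes = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
roadmap = build_roadmap(nodes, can_navigate, max_edge_length=1.5)
path = shortest_path(roadmap, start=0, goal=2)
```

Because the direct segment from node 1 to node 2 is rejected by the stand-in agent, the returned waypoint sequence detours through node 3; in the full method each consecutive pair of waypoints would be executed by the same short-range agent.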

Meta Inverse Reinforcement Learning via Maximum Reward Sharing for Human Motion Analysis

Authors:Kun Li, Joel W. Burdick
Date:2017-10-07 20:22:32

This work handles the inverse reinforcement learning (IRL) problem where only a small number of demonstrations are available from a demonstrator for each high-dimensional task, insufficient to estimate an accurate reward function. Observing that each demonstrator has an inherent reward for each state and the task-specific behaviors mainly depend on a small number of key states, we propose a meta IRL algorithm that first models the reward function for each task as a distribution conditioned on a baseline reward function shared by all tasks and dependent only on the demonstrator, and then finds the most likely reward function in the distribution that explains the task-specific behaviors. We test the method in a simulated environment on path planning tasks with limited demonstrations, and show that the accuracy of the learned reward function is significantly improved. We also apply the method to analyze the motion of a patient under rehabilitation.

Deep Abstract Q-Networks

Authors:Melrose Roderick, Christopher Grimm, Stefanie Tellex
Date:2017-10-02 02:17:09

We examine the problem of learning and planning on high-dimensional domains with long horizons and sparse rewards. Recent approaches have shown great successes in many Atari 2600 domains. However, domains with long horizons and sparse rewards, such as Montezuma's Revenge and Venture, remain challenging for existing methods. Methods using abstraction (Dietterich 2000; Sutton, Precup, and Singh 1999) have been shown to be useful in tackling long-horizon problems. We combine recent techniques of deep reinforcement learning with existing model-based approaches using an expert-provided state abstraction. We construct toy domains that elucidate the problem of long horizons, sparse rewards and high-dimensional inputs, and show that our algorithm significantly outperforms previous methods on these domains. Our abstraction-based approach outperforms Deep Q-Networks (Mnih et al. 2015) on Montezuma's Revenge and Venture, and exhibits backtracking behavior that is absent from previous methods.

Self-supervised Deep Reinforcement Learning with Generalized Computation Graphs for Robot Navigation

Authors:Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, Sergey Levine
Date:2017-09-29 16:47:14

Enabling robots to autonomously navigate complex environments is essential for real-world deployment. Prior methods approach this problem by having the robot maintain an internal map of the world, and then use a localization and planning method to navigate through the internal map. However, these approaches often include a variety of assumptions, are computationally intensive, and do not learn from failures. In contrast, learning-based methods improve as the robot acts in the environment, but are difficult to deploy in the real world due to their high sample complexity. To address the need to learn complex policies with few samples, we propose a generalized computation graph that subsumes value-based model-free methods and model-based methods, with specific instantiations interpolating between model-free and model-based. We then instantiate this graph to form a navigation model that learns from raw images and is sample efficient. Our simulated car experiments explore the design decisions of our navigation model, and show our approach outperforms single-step and $N$-step double Q-learning. We also evaluate our approach on a real-world RC car and show it can learn to navigate through a complex indoor environment with a few hours of fully autonomous, self-supervised training. Videos of the experiments and code can be found at github.com/gkahn13/gcg

Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning

Authors:Pinxin Long, Tingxiang Fan, Xinyi Liao, Wenxi Liu, Hao Zhang, Jia Pan
Date:2017-09-28 17:44:09

Developing a safe and efficient collision avoidance policy for multiple robots is challenging in decentralized scenarios where each robot generates its paths without observing other robots' states and intents. While other distributed multi-robot collision avoidance systems exist, they often require extracting agent-level features to plan a local collision-free action, which can be computationally prohibitive and not robust. More importantly, in practice the performance of these methods is much lower than that of their centralized counterparts. We present a decentralized sensor-level collision avoidance policy for multi-robot systems, which directly maps raw sensor measurements to an agent's steering commands in terms of movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to find an optimal policy which is trained over a large number of robots on rich, complex environments simultaneously using a policy gradient based reinforcement learning algorithm. We validate the learned sensor-level collision avoidance policy in a variety of simulated scenarios with thorough performance evaluations and show that the final learned policy is able to find time efficient, collision-free paths for a large-scale robot system. We also demonstrate that the learned policy can be well generalized to new scenarios that do not appear in the entire training period, including navigating a heterogeneous group of robots and a large-scale scenario with 100 robots. Videos are available at https://sites.google.com/view/drlmaca

The detour problem in a stochastic environment: Tolman revisited

Authors:Pegah Fakhari, Arash Khodadadi, Jerome Busemeyer
Date:2017-09-27 23:22:06

We designed a grid world task to study human planning and re-planning behavior in an unknown stochastic environment. In our grid world, participants were asked to travel from a random starting point to a random goal position while maximizing their reward. Because they were not familiar with the environment, they needed to learn its characteristics from experience to plan optimally. Later in the task, we randomly blocked the optimal path to investigate whether and how people adjust their original plans to find a detour. To this end, we developed and compared 12 different models. These models differed in how they learned and represented the environment and how they planned to reach the goal. The majority of our participants were able to plan optimally. We also showed that people were capable of revising their plans when an unexpected event occurred. The result from the model comparison showed that the model-based reinforcement learning approach provided the best account for the data and outperformed heuristics in explaining the behavioral data in the re-planning trials.

Autonomous Waypoint Generation with Safety Guarantees: On-Line Motion Planning in Unknown Environments

Authors:Sanjeev Sharma
Date:2017-09-02 08:53:00

On-line motion planning in unknown environments is a challenging problem as it requires (i) ensuring collision avoidance and (ii) minimizing the motion time, while continuously predicting where to go next. Previous approaches to on-line motion planning assume that a rough map of the environment is available, thereby simplifying the problem. This paper presents a reactive on-line motion planner, Robust Autonomous Waypoint generation (RAW), for mobile robots navigating in unknown and unstructured environments. RAW generates a locally maximal ellipsoid around the robot, using semi-definite programming, such that the surrounding obstacles lie outside the ellipsoid. A reinforcement learning agent then generates a local waypoint in the robot's field of view, inside the ellipsoid. The robot navigates to the waypoint and the process iterates until it reaches the goal. By following the waypoints the robot navigates through a sequence of overlapping ellipsoids, and avoids collision. The robot's safety is guaranteed theoretically and the claims are validated through rigorous numerical experiments in four different experimental setups. Near-optimality is shown empirically by comparing RAW trajectories with the global optimal trajectories.

Learning to Price with Reference Effects

Authors:Abbas Kazerouni, Benjamin Van Roy
Date:2017-08-29 20:40:10

As a firm varies the price of a product, consumers exhibit reference effects, making purchase decisions based not only on the prevailing price but also the product's price history. We consider the problem of learning such behavioral patterns as a monopolist releases, markets, and prices products. This context calls for pricing decisions that intelligently trade off between maximizing revenue generated by a current product and probing to gain information for future benefit. Due to dependence on price history, realized demand can reflect delayed consequences of earlier pricing decisions. As such, inference entails attribution of outcomes to prior decisions and effective exploration requires planning price sequences that yield informative future outcomes. Despite the considerable complexity of this problem, we offer a tractable systematic approach. In particular, we frame the problem as one of reinforcement learning and leverage Thompson sampling. We also establish a regret bound that provides graceful guarantees on how performance improves as data is gathered and how this depends on the complexity of the demand model. We illustrate merits of the approach through simulations.
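
The exploration mechanism named above, Thompson sampling, can be illustrated with a deliberately simplified sketch. It treats each price on a small grid as an independent arm with a Beta-Bernoulli purchase model and ignores reference effects and price history entirely, so it is not the demand model or algorithm analysed in the paper. The price grid, purchase probabilities, and function name are assumptions for illustration.

```python
import random
from typing import List, Tuple


def thompson_pricing(prices: List[float],
                     true_buy_prob: List[float],
                     rounds: int,
                     seed: int = 0) -> Tuple[List[int], List[int], float]:
    """Beta-Bernoulli Thompson sampling over a fixed price grid.

    Each round: sample a purchase probability per price from its posterior,
    post the price with the highest sampled expected revenue, observe a
    purchase or no purchase, and update that price's posterior counts.
    """
    rng = random.Random(seed)
    successes = [1] * len(prices)  # Beta(1, 1) prior for every price
    failures = [1] * len(prices)
    revenue = 0.0
    for _ in range(rounds):
        sampled = [rng.betavariate(successes[i], failures[i])
                   for i in range(len(prices))]
        choice = max(range(len(prices)), key=lambda i: prices[i] * sampled[i])
        bought = rng.random() < true_buy_prob[choice]  # simulated customer
        if bought:
            successes[choice] += 1
            revenue += prices[choice]
        else:
            failures[choice] += 1
    return successes, failures, revenue


# Hypothetical market: higher prices are bought less often.
successes, failures, revenue = thompson_pricing(
    prices=[1.0, 2.0, 3.0], true_buy_prob=[0.9, 0.5, 0.1], rounds=200)
```

Sampling from the posterior, rather than using its mean, is what produces the trade-off between earning revenue now and probing uncertain prices for future benefit.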

Hierarchical Subtask Discovery With Non-Negative Matrix Factorization

Authors:Adam C. Earle, Andrew M. Saxe, Benjamin Rosman
Date:2017-08-01 18:19:40

Hierarchical reinforcement learning methods offer a powerful means of planning flexible behavior in complicated domains. However, learning an appropriate hierarchical decomposition of a domain into subtasks remains a substantial challenge. We present a novel algorithm for subtask discovery, based on the recently introduced multitask linearly-solvable Markov decision process (MLMDP) framework. The MLMDP can perform never-before-seen tasks by representing them as a linear combination of a previously learned basis set of tasks. In this setting, the subtask discovery problem can naturally be posed as finding an optimal low-rank approximation of the set of tasks the agent will face in a domain. We use non-negative matrix factorization to discover this minimal basis set of tasks, and show that the technique learns intuitive decompositions in a variety of domains. Our method has several qualitatively desirable features: it is not limited to learning subtasks with single goal states, instead learning distributed patterns of preferred states; it learns qualitatively different hierarchical decompositions in the same domain depending on the ensemble of tasks the agent will face; and it may be straightforwardly iterated to obtain deeper hierarchical decompositions.

Grounding Language for Transfer in Deep Reinforcement Learning

Authors:Karthik Narasimhan, Regina Barzilay, Tommi Jaakkola
Date:2017-08-01 02:20:00

In this paper, we explore the utilization of natural language to drive transfer for reinforcement learning (RL). Despite the widespread application of deep RL techniques, learning generalized policy representations that work across domains remains a challenging problem. We demonstrate that textual descriptions of environments provide a compact intermediate channel to facilitate effective policy transfer. Specifically, by learning to ground the meaning of text to the dynamics of the environment such as transitions and rewards, an autonomous agent can effectively bootstrap policy learning on a new domain given its description. We employ a model-based RL approach consisting of a differentiable planning module, a model-free component and a factorized state representation to effectively use entity descriptions. Our model outperforms prior work on both transfer and multi-task scenarios in a variety of different environments. For instance, we achieve up to 14% and 11.5% absolute improvement over previously existing models in terms of average and initial rewards, respectively.

Pragmatic-Pedagogic Value Alignment

Authors:Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, Anca D. Dragan
Date:2017-07-20 03:07:19

As intelligent systems gain autonomy and capability, it becomes vital to ensure that their objectives match those of their human users; this is known as the value-alignment problem. In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go. We argue that a meaningful solution to value alignment must combine multi-agent decision theory with rich mathematical models of human cognition, enabling robots to tap into people's natural collaborative capabilities. We present a solution to the cooperative inverse reinforcement learning (CIRL) dynamic game based on well-established cognitive models of decision making and theory of mind. The solution captures a key reciprocity relation: the human will not plan her actions in isolation, but rather reason pedagogically about how the robot might learn from them; the robot, in turn, can anticipate this and interpret the human's actions pragmatically. To our knowledge, this work constitutes the first formal analysis of value alignment grounded in empirically validated cognitive models.

Imagination-Augmented Agents for Deep Reinforcement Learning

Authors:Théophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, Daan Wierstra
Date:2017-07-19 17:12:56

We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.

On-line Building Energy Optimization using Deep Reinforcement Learning

Authors:Elena Mocanu, Decebal Constantin Mocanu, Phuong H. Nguyen, Antonio Liotta, Michael E. Webber, Madeleine Gibescu, J. G. Slootweg
Date:2017-07-18 22:00:53

Unprecedentedly high volumes of data are becoming available with the growth of the advanced metering infrastructure. These are expected to benefit planning and operation of the future power system, and to help customers transition from a passive to an active role. In this paper, we explore for the first time in the smart grid context the benefits of using Deep Reinforcement Learning, a hybrid class of methods that combines Reinforcement Learning with Deep Learning, to perform on-line optimization of schedules for building energy management systems. The learning procedure was explored using two methods, Deep Q-learning and Deep Policy Gradient, both of them being extended to perform multiple actions simultaneously. The proposed approach was validated on the large-scale Pecan Street Inc. database. This high-dimensional database includes information about photovoltaic power generation, electric vehicles, and building appliances. Moreover, these on-line energy scheduling strategies could be used to provide real-time feedback to consumers to encourage more efficient use of electricity.

Value Prediction Network

Authors:Junhyuk Oh, Satinder Singh, Honglak Lee
Date:2017-07-11 23:32:36

This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future values (discounted sum of rewards) rather than of future observations. Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation.

Path Integral Networks: End-to-End Differentiable Optimal Control

Authors:Masashi Okada, Luca Rigazio, Takenobu Aoshima
Date:2017-06-29 07:13:15

In this paper, we introduce Path Integral Networks (PI-Net), a recurrent network representation of the Path Integral optimal control algorithm. The network includes both system dynamics and cost models, used for optimal control based planning. PI-Net is fully differentiable, learning both dynamics and cost models end-to-end by back-propagation and stochastic gradient descent. Because of this, PI-Net can learn to plan. PI-Net has several advantages: it can generalize to unseen states thanks to planning, it can be applied to continuous control tasks, and it allows for a wide variety of learning schemes, including imitation and reinforcement learning. Preliminary experimental results show that PI-Net, trained by imitation learning, can mimic control demonstrations for two simulated problems: a linear system and a pendulum swing-up problem. We also show that PI-Net is able to learn dynamics and cost models latent in the demonstrations.

Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control

Authors:Sanket Kamthe, Marc Peter Deisenroth
Date:2017-06-20 14:44:25

Trial-and-error based reinforcement learning (RL) has seen rapid advancements in recent times, especially with the advent of deep neural networks. However, the majority of autonomous RL algorithms require a large number of interactions with the environment. A large number of interactions may be impractical in many real-world applications, such as robotics, and many practical systems have to obey limitations in the form of state space or control constraints. To reduce the number of system interactions while simultaneously handling constraints, we propose a model-based RL framework based on probabilistic Model Predictive Control (MPC). In particular, we propose to learn a probabilistic transition model using Gaussian Processes (GPs) to incorporate model uncertainty into long-term predictions, thereby reducing the impact of model errors. We then use MPC to find a control sequence that minimises the expected long-term cost. We provide theoretical guarantees for first-order optimality in the GP-based transition models with deterministic approximate inference for long-term planning. We demonstrate that our approach not only achieves state-of-the-art data efficiency, but is also a principled way for RL in constrained environments.
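
The receding-horizon loop at the core of MPC can be sketched on a toy system. This is a minimal illustration under strong assumptions: the model is a known deterministic function rather than a learned Gaussian Process, the control constraint is expressed by a small finite set of admissible controls, and control sequences are enumerated exhaustively rather than optimised with gradients. All names are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence


def mpc_step(state: int,
             model: Callable[[int, int], int],
             cost: Callable[[int, int], float],
             controls: Sequence[int],
             horizon: int) -> int:
    """Evaluate every admissible control sequence over the horizon with the
    model and return only the first control of the cheapest sequence."""
    best_seq, best_cost = None, float("inf")
    for seq in product(controls, repeat=horizon):
        s, total = state, 0.0
        for u in seq:
            s = model(s, u)        # predicted next state
            total += cost(s, u)    # accumulated predicted cost
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq[0]


# Toy system (assumption): integrator dynamics, quadratic state cost,
# controls restricted to {-1, 0, +1}.
model = lambda s, u: s + u
cost = lambda s, u: float(s * s)
controls = (-1, 0, 1)

state = 3
trajectory = [state]
for _ in range(5):
    u = mpc_step(state, model, cost, controls, horizon=3)
    state = model(state, u)  # apply the first control, then re-plan
    trajectory.append(state)
```

Re-planning from the newly observed state at every step is what lets MPC correct for model error; in the probabilistic setting of the abstract, the predicted cost would be an expectation under the learned model's uncertainty.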

Pedestrian Prediction by Planning using Deep Neural Networks

Authors:Eike Rehder, Florian Wirth, Martin Lauer, Christoph Stiller
Date:2017-06-19 12:40:30

Accurate traffic participant prediction is the prerequisite for collision avoidance of autonomous vehicles. In this work, we predict pedestrians by emulating their own motion planning. From online observations, we infer a mixture density function for possible destinations. We use this result as the goal states of a planning stage that performs motion prediction based on common behavior patterns. The entire system is modeled as one monolithic neural network and trained via inverse reinforcement learning. Experimental validation on real-world data shows the system's ability to predict both destinations and trajectories accurately.

Reinforcement Learning with Budget-Constrained Nonparametric Function Approximation for Opportunistic Spectrum Access

Authors:Theodoros Tsiligkaridis, David Romero
Date:2017-06-14 15:44:52

Opportunistic spectrum access is one of the emerging techniques for maximizing throughput in congested bands and is enabled by predicting idle slots in spectrum. We propose a kernel-based reinforcement learning approach coupled with a novel budget-constrained sparsification technique that efficiently captures the environment to find the best channel access actions. This approach allows learning and planning over the intrinsic state-action space and extends well to large state spaces. We apply our methods to evaluate coexistence of a reinforcement learning-based radio with a multi-channel adversarial radio and a single-channel CSMA-CA radio. Numerical experiments show the performance gains over carrier-sense systems.

Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics

Authors:Ken Kansky, Tom Silver, David A. Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, Dileep George
Date:2017-06-14 05:11:08

The recent adaptation of deep neural network-based methods to reinforcement learning and planning domains has yielded remarkable progress on individual tasks. Nonetheless, progress on task-to-task transfer remains limited. In pursuit of efficient and robust generalization, we introduce the Schema Network, an object-oriented generative physics simulator capable of disentangling multiple causes of events and reasoning backward through causes to achieve goals. The richly structured architecture of the Schema Network can learn the dynamics of an environment directly from data. We compare Schema Networks with Asynchronous Advantage Actor-Critic and Progressive Networks on a suite of Breakout variations, reporting results on training efficiency and zero-shot generalization, consistently demonstrating faster, more robust learning and better transfer. We argue that generalizing from limited data and learning causal relationships are essential abilities on the path toward generally intelligent systems.

Meta learning Framework for Automated Driving

Authors:Ahmad El Sallab, Mahmoud Saeed, Omar Abdel Tawab, Mohammed Abdou
Date:2017-06-11 12:32:30

The success of automated driving deployment depends heavily on the ability to develop an efficient and safe driving policy. The problem is well formulated under the framework of optimal control as a cost optimization problem. Model-based solutions using traditional planning are efficient, but require knowledge of the environment model. On the other hand, model-free solutions suffer from sample inefficiency and require too many interactions with the environment, which is infeasible in practice. Methods under the Reinforcement Learning framework usually require the notion of a reward function, which is not available in the real world. Imitation learning helps in improving sample efficiency by introducing prior knowledge obtained from the demonstrated behavior, at the risk of exact behavior cloning without generalizing to unseen environments. In this paper we propose a Meta learning framework, based on data set aggregation, to improve generalization of imitation learning algorithms. Under the proposed framework, we propose MetaDAgger, a novel algorithm which tackles the generalization issues in traditional imitation learning. We use The Open Race Car Simulator (TORCS) to test our algorithm. Results on unseen test tracks show significant improvement over traditional imitation learning algorithms, improving the learning time and sample efficiency at the same time. The results are also supported by visualization of the learnt features to prove generalization of the captured details.

Fine-grained acceleration control for autonomous intersection management using deep reinforcement learning

Authors:Hamid Mirzaei, Tony Givargis
Date:2017-05-30 02:04:29

Recent advances in combining deep learning and Reinforcement Learning have shown a promising path for designing new control agents that can learn optimal policies for challenging control tasks. These new methods address the main limitations of conventional Reinforcement Learning methods such as customized feature engineering and small action/state space dimension requirements. In this paper, we leverage one of the state-of-the-art Reinforcement Learning methods, known as Trust Region Policy Optimization, to tackle intersection management for autonomous vehicles. We show that using this method, we can perform fine-grained acceleration control of autonomous vehicles in a grid street plan to achieve a global design objective.

Thinking Fast and Slow with Deep Learning and Tree Search

Authors:Thomas Anthony, Zheng Tian, David Barber
Date:2017-05-23 17:48:51

Sequential decision making problems, such as structured prediction, robotic control, and game playing, require a combination of planning policies and generalisation of those plans. In this paper, we present Expert Iteration (ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. Subsequently, tree search is improved by using the neural network policy to guide search, increasing the strength of new plans. In contrast, standard deep Reinforcement Learning algorithms rely on a neural network not only to generalise plans, but to discover them too. We show that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MoHex 1.0, the most recent Olympiad Champion player to be publicly released.

Visual Semantic Planning using Deep Successor Representations

Authors:Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, Ali Farhadi
Date:2017-05-23 05:22:47

A crucial capability of real-world intelligent agents is their ability to plan a sequence of actions to achieve their goals in the visual world. In this work, we address the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state. Doing so entails knowledge about objects and their affordances, as well as actions and their preconditions and effects. We propose learning these through interacting with a visual and dynamic environment. Our proposed solution involves bootstrapping reinforcement learning with imitation learning. To ensure cross-task generalization, we develop a deep predictive model based on successor representations. Our experimental results show near-optimal results across a wide range of tasks in the challenging THOR environment.

Experimental results : Reinforcement Learning of POMDPs using Spectral Methods

Authors:Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
Date:2017-05-07 02:49:10

We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm running through epochs; in each epoch, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the epoch, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.

Learning Multimodal Transition Dynamics for Model-Based Reinforcement Learning

Authors:Thomas M. Moerland, Joost Broekens, Catholijn M. Jonker
Date:2017-05-01 11:06:04

In this paper we study how to learn stochastic, multimodal transition dynamics in reinforcement learning (RL) tasks. We focus on evaluating transition function estimation, while we defer planning over this model to future work. Stochasticity is a fundamental property of many task environments. However, discriminative function approximators have difficulty estimating multimodal stochasticity. In contrast, deep generative models do capture complex high-dimensional outcome distributions. First we discuss why, amongst such models, conditional variational inference (VI) is theoretically most appealing for model-based RL. Subsequently, we compare different VI models on their ability to learn complex stochasticity on simulated functions, as well as on a typical RL gridworld with multimodal dynamics. Results show VI successfully predicts multimodal outcomes, but also robustly ignores these for deterministic parts of the transition dynamics. In summary, we show a robust method to learn multimodal transitions using function approximation, which is a key preliminary for model-based RL in stochastic domains.

Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

Authors:Dipendra Misra, John Langford, Yoav Artzi
Date:2017-04-28 03:12:57

We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.

Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning

Authors:Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, Kam-Fai Wong
Date:2017-04-10 23:24:46

Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks. For example, the agent needs to reserve a hotel and book a flight so that enough time is left for the commute between arrival and hotel check-in. This paper addresses this challenge by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales. The dialogue manager consists of: (1) a top-level dialogue policy that selects among subtasks or options, (2) a low-level dialogue policy that selects primitive actions to complete the subtask given by the top-level policy, and (3) a global state tracker that helps ensure all cross-subtask constraints are satisfied. Experiments on a travel planning task with simulated and real users show that our approach leads to significant improvements over three baselines, two based on handcrafted rules and the other based on flat deep reinforcement learning.

Multi-Advisor Reinforcement Learning

Authors:Romain Laroche, Mehdi Fatemi, Joshua Romoff, Harm van Seijen
Date:2017-04-03 18:37:12

We consider tackling a single-agent RL problem by distributing it to $n$ learners. These learners, called advisors, endeavour to solve the problem from a different focus. Their advice, taking the form of action values, is then communicated to an aggregator, which is in control of the system. We show that the local planning method for the advisors is critical and that none of the ones found in the literature is flawless: the egocentric planning overestimates values of states where the other advisors disagree, and the agnostic planning is inefficient around danger zones. We introduce a novel approach called empathic and discuss its theoretical aspects. We empirically examine and validate our theoretical findings on a fruit collection task.

Socially Aware Motion Planning with Deep Reinforcement Learning

Authors:Yu Fan Chen, Michael Everett, Miao Liu, Jonathan P. How
Date:2017-03-26 19:39:50

For robotic vehicles to navigate safely and efficiently in pedestrian-rich environments, it is important to model subtle human behaviors and navigation rules (e.g., passing on the right). However, while instinctive to humans, socially compliant navigation is still difficult to quantify due to the stochasticity in people's behaviors. Existing works are mostly focused on using feature-matching techniques to describe and imitate human paths, but often do not generalize well since the feature values can vary from person to person, and even run to run. This work notes that while it is challenging to directly specify the details of what to do (precise mechanisms of human navigation), it is straightforward to specify what not to do (violations of social norms). Specifically, using deep reinforcement learning, this work develops a time-efficient navigation policy that respects common social norms. The proposed method is shown to enable fully autonomous navigation of a robotic vehicle moving at human walking speed in an environment with many pedestrians.

Combining Neural Networks and Tree Search for Task and Motion Planning in Challenging Environments

Authors:Chris Paxton, Vasumathi Raman, Gregory D. Hager, Marin Kobilarov
Date:2017-03-22 23:46:51

We consider task and motion planning in complex dynamic environments for problems expressed in terms of a set of Linear Temporal Logic (LTL) constraints, and a reward function. We propose a methodology based on reinforcement learning that employs deep neural networks to learn low-level control policies as well as task-level option policies. A major challenge in this setting, both for neural network approaches and classical planning, is the need to explore future worlds of a complex and interactive environment. To this end, we integrate Monte Carlo Tree Search with hierarchical neural net control policies trained on expressive LTL specifications. This paper investigates the ability of neural networks to learn both LTL constraints and control policies in order to generate task plans in complex environments. We demonstrate our approach in a simulated autonomous driving setting, where a vehicle must drive down a road in traffic, avoid collisions, and navigate an intersection, all while obeying given rules of the road.

Multi-Timescale, Gradient Descent, Temporal Difference Learning with Linear Options

Authors:Peeyush Kumar, Doina Precup
Date:2017-03-19 17:31:13

Deliberating on large or continuous state spaces has been a long-standing challenge in reinforcement learning. Temporal abstraction has somewhat made this possible, but efficiently planning using temporal abstraction still remains an issue. Moreover, using spatial abstractions to learn policies for various situations at once while using temporal abstraction models is an open problem. We propose here an efficient algorithm which is convergent under linear function approximation while planning using temporally abstract actions. We show how this algorithm can be used along with randomly generated option models over multiple time scales to plan for agents which need to act in real time. Using these randomly generated option models over multiple time scales is shown to reduce the number of decision epochs required to solve the given task, hence effectively reducing the time needed for deliberation.

End-to-end optimization of goal-driven and visually grounded dialogue systems

Authors:Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, Olivier Pietquin
Date:2017-03-15 23:34:20

End-to-end design of dialogue systems has recently become a popular research topic thanks to powerful tools such as encoder-decoder architectures for sequence-to-sequence learning. Yet, most current approaches cast human-machine dialogue management as a supervised learning problem, aiming at predicting the next utterance of a participant given the full history of the dialogue. This vision is too simplistic to render the intrinsic planning problem inherent to dialogue as well as its grounded nature, making the context of a dialogue larger than the sole history. This is why only chit-chat and question answering tasks have been addressed so far using end-to-end architectures. In this paper, we introduce a Deep Reinforcement Learning method to optimize visually grounded task-oriented dialogues, based on the policy gradient algorithm. This approach is tested on a dataset of 120k dialogues collected through Mechanical Turk and provides encouraging results at solving both the problem of generating natural dialogues and the task of discovering a specific object in a complex picture.

Tactics of Adversarial Attack on Deep Reinforcement Learning Agents

Authors:Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, Min Sun
Date:2017-03-08 04:39:34

We introduce two tactics to attack agents trained by deep reinforcement learning algorithms using adversarial examples, namely the strategically-timed attack and the enchanting attack. In the strategically-timed attack, the adversary aims at minimizing the agent's reward by only attacking the agent at a small subset of time steps in an episode. Limiting the attack activity to this subset helps prevent detection of the attack by the agent. We propose a novel method to determine when an adversarial example should be crafted and applied. In the enchanting attack, the adversary aims at luring the agent to a designated target state. This is achieved by combining a generative model and a planning algorithm: while the generative model predicts the future states, the planning algorithm generates a preferred sequence of actions for luring the agent. A sequence of adversarial examples is then crafted to lure the agent to take the preferred sequence of actions. We apply the two tactics to agents trained by state-of-the-art deep reinforcement learning algorithms, including DQN and A3C. In 5 Atari games, our strategically-timed attack reduces as much reward as the uniform attack (i.e., attacking at every time step) does by attacking the agent 4 times less often. Our enchanting attack lures the agent toward designated target states with a more than 70% success rate. Videos are available at http://yenchenlin.me/adversarial_attack_RL/

Functions that Emerge through End-to-End Reinforcement Learning - The Direction for Artificial General Intelligence -

Authors:Katsunari Shibata
Date:2017-03-07 06:51:19

Recently, triggered by the impressive results in TV games and the game of Go by Google DeepMind, end-to-end reinforcement learning (RL) has been attracting attention. Although it is little known, the author's group has advocated this framework for around 20 years and has already shown various functions that emerge in a neural network (NN) through RL. In this paper, they are introduced again. The "Function Modularization" approach has penetrated deeply and subconsciously. The inputs and outputs for a learning system can be raw sensor signals and motor commands. The "state space" or "action space" generally used in RL shows the existence of functional modules. That has limited reinforcement learning to learning only for the action-planning module. In order to extend reinforcement learning to learning of the entire function on a huge degree of freedom of a massively parallel learning system and to explain or develop human-like intelligence, the author believes that end-to-end RL from sensors to motors using a recurrent NN (RNN) is an essential key. Especially in the higher functions, this approach is very effective because it is free from the need to decide their inputs and outputs. The functions that we have confirmed to emerge through RL using an NN cover a broad range, from real robot learning with raw camera pixel inputs to acquisition of dynamic functions in an RNN. Those are (1) image recognition, (2) color constancy (optical illusion), (3) sensor motion (active recognition), (4) hand-eye coordination and hand reaching movement, (5) explanation of brain activities, (6) communication, (7) knowledge transfer, (8) memory, (9) selective attention, (10) prediction, and (11) exploration. End-to-end RL enables the emergence of very flexible comprehensive functions that consider many things in parallel, although it is difficult to give the boundary of each function clearly.

Generalised Discount Functions applied to a Monte-Carlo AImu Implementation

Authors:Sean Lamont, John Aslanides, Jan Leike, Marcus Hutter
Date:2017-03-03 23:25:38

In recent years, work has been done to develop the theory of General Reinforcement Learning (GRL). However, there are few examples demonstrating these results in a concrete way. In particular, there are no examples demonstrating the known results regarding generalised discounting. We have added to the GRL simulation platform AIXIjs the functionality to assign an agent arbitrary discount functions, and an environment which can be used to determine the effect of discounting on an agent's policy. Using this, we investigate how geometric, hyperbolic and power discounting affect an informed agent in a simple MDP. We experimentally reproduce a number of theoretical results, and discuss some related subtleties. It was found that the agent's behaviour followed what is expected theoretically, assuming appropriate parameters were chosen for the Monte-Carlo Tree Search (MCTS) planning algorithm.
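
The three discount families named above can be sketched in a few lines. This is a minimal illustration only: the functional forms and parameter names (gamma, kappa, beta) are commonly used textbook choices assumed here, not taken from AIXIjs, whose parameterisation may differ. Time steps are indexed from 1 so that power discounting is well defined.

```python
def geometric(k, gamma=0.95):
    """Geometric discount gamma**k (assumed form, 0 < gamma < 1)."""
    return gamma ** k


def hyperbolic(k, kappa=1.0, beta=1.0):
    """Hyperbolic discount 1 / (1 + kappa*k)**beta (assumed form)."""
    return 1.0 / (1.0 + kappa * k) ** beta


def power(k, beta=1.5):
    """Power discount k**(-beta), defined for k >= 1 (assumed form)."""
    return float(k) ** (-beta)


def discounted_return(rewards, discount):
    """Sum of discount(k) * r_k, with time steps k = 1, 2, ..."""
    return sum(discount(k) * r for k, r in enumerate(rewards, start=1))


if __name__ == "__main__":
    rewards = [1.0] * 10  # hypothetical constant reward stream
    for name, fn in [("geometric", geometric),
                     ("hyperbolic", hyperbolic),
                     ("power", power)]:
        print(name, discounted_return(rewards, fn))
```

Swapping the `discount` argument is enough to compare how much weight each family places on late rewards relative to early ones.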

Deep Reinforcement Learning: An Overview

Authors:Yuxi Li
Date:2017-01-25 11:52:11

We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, Industry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions. Please see Deep Reinforcement Learning, arXiv:1810.06339, for a significant update.

Near Optimal Behavior via Approximate State Abstraction

Authors:David Abel, D. Ellis Hershkowitz, Michael L. Littman
Date:2017-01-15 21:24:45

The combinatorial explosion that plagues planning and reinforcement learning (RL) algorithms can be moderated using state abstraction. Prohibitively large task representations can be condensed such that essential information is preserved, and consequently, solutions are tractably computable. However, exact abstractions, which treat only fully-identical situations as equivalent, fail to present opportunities for abstraction in environments where no two situations are exactly alike. In this work, we investigate approximate state abstractions, which treat nearly-identical situations as equivalent. We present theoretical guarantees of the quality of behaviors derived from four types of approximate abstractions. Additionally, we empirically demonstrate that approximate abstractions lead to reduction in task complexity and bounded loss of optimality of behavior in a variety of environments.
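
As a rough illustration of treating nearly-identical situations as equivalent, the sketch below groups states whose action values fall into the same bucket of width epsilon. The bucketing criterion, the function name `abstract_states`, and the toy Q-values are assumptions made for this example; the abstract does not specify which four abstraction types the paper analyses.

```python
import math
from collections import defaultdict


def abstract_states(q_values, epsilon):
    """Group states whose per-action values share a bucket of width epsilon.

    q_values maps each state to a tuple of action values (hypothetical input).
    Two states in the same group differ by less than epsilon on every action.
    """
    clusters = defaultdict(list)
    for state, qs in q_values.items():
        key = tuple(math.floor(q / epsilon) for q in qs)
        clusters[key].append(state)
    return list(clusters.values())


if __name__ == "__main__":
    # Hypothetical action values for three states and two actions.
    q = {"s0": (1.0, 0.5), "s1": (1.1, 0.6), "s2": (2.0, 0.1)}
    print(abstract_states(q, epsilon=0.25))
```

With epsilon close to zero every state ends up in its own group, which corresponds to the exact-abstraction case the abstract describes as too restrictive.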

Reinforcement Learning via Recurrent Convolutional Neural Networks

Authors:Tanmay Shankar, Santosha K. Dwivedy, Prithwijit Guha
Date:2017-01-09 23:36:05

Deep Reinforcement Learning has enabled the learning of policies for complex tasks in partially observable environments, without explicitly learning the underlying model of the tasks. While such model-free methods achieve considerable performance, they often ignore the structure of the task. We present a natural representation of Reinforcement Learning (RL) problems using Recurrent Convolutional Neural Networks (RCNNs), to better exploit this inherent structure. We define 3 such RCNNs, whose forward passes execute an efficient Value Iteration, propagate beliefs of state in partially observable environments, and choose optimal actions respectively. Backpropagating gradients through these RCNNs allows the system to explicitly learn the Transition Model and Reward Function associated with the underlying MDP, serving as an elegant alternative to classical model-based RL. We evaluate the proposed algorithms in simulation, considering a robot planning problem. We demonstrate the capability of our framework to reduce the cost of replanning, learn accurate MDP models, and finally re-plan with learnt models to achieve near-optimal policies.
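
For readers unfamiliar with the Value Iteration that the first RCNN is said to execute, here is a plain tabular version. It is a generic textbook sketch, not the paper's convolutional formulation; the dictionary layout of `transitions` and `rewards` and the two-state example MDP are assumptions made for illustration.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Tabular value iteration.

    transitions[s][a] is a list of (probability, next_state) pairs and
    rewards[s][a] is the immediate reward (hypothetical data layout).
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            # Bellman optimality backup for state s.
            best = max(
                rewards[s][a]
                + gamma * sum(p * values[s2] for p, s2 in transitions[s][a])
                for a in transitions[s]
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


if __name__ == "__main__":
    # Hypothetical MDP: "go" moves from start to an absorbing goal for reward 1.
    transitions = {
        "start": {"go": [(1.0, "goal")], "stay": [(1.0, "start")]},
        "goal": {"stay": [(1.0, "goal")]},
    }
    rewards = {"start": {"go": 1.0, "stay": 0.0}, "goal": {"stay": 0.0}}
    print(value_iteration(transitions, rewards))
```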

Self-Correcting Models for Model-Based Reinforcement Learning

Authors:Erik Talvitie
Date:2016-12-19 01:09:23

When an agent cannot represent a perfectly accurate model of its environment's dynamics, model-based reinforcement learning (MBRL) can fail catastrophically. Planning involves composing the predictions of the model; when flawed predictions are composed, even minor errors can compound and render the model useless for planning. Hallucinated Replay (Talvitie 2014) trains the model to "correct" itself when it produces errors, substantially improving MBRL with flawed models. This paper theoretically analyzes this approach, illuminates settings in which it is likely to be effective or ineffective, and presents a novel error bound, showing that a model's ability to self-correct is more tightly related to MBRL performance than one-step prediction error. These results inspire an MBRL algorithm for deterministic MDPs with performance guarantees that are robust to model class limitations.

An Alternative Softmax Operator for Reinforcement Learning

Authors:Kavosh Asadi, Michael L. Littman
Date:2016-12-16 20:49:35

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion ensuring a convergent behavior in learning and planning. We introduce a variant of SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
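
The claim that a softmax operator sits between the maximum and the average can be checked numerically. The sketch below implements the Boltzmann softmax operator and, for comparison, a log-mean-exp operator. The abstract does not name or define its alternative operator, so the log-mean-exp form and the parameter names `beta` and `omega` are assumptions used purely for illustration.

```python
import math


def boltzmann_softmax(values, beta):
    """Boltzmann-weighted average: sum_i x_i * exp(beta*x_i) / sum_j exp(beta*x_j)."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def log_mean_exp(values, omega):
    """log(mean(exp(omega * x))) / omega, for omega > 0 (assumed alternative)."""
    m = max(values)
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]  # hypothetical action values
    for temp in (0.01, 1.0, 50.0):
        print(temp, boltzmann_softmax(q, temp), log_mean_exp(q, temp))
```

For small parameter values both operators approach the mean of the inputs, and for large values both approach the maximum.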

Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments

Authors:Jingwei Zhang, Jost Tobias Springenberg, Joschka Boedecker, Wolfram Burgard
Date:2016-12-16 16:15:26

In this paper we consider the problem of robot navigation in simple maze-like environments where the robot has to rely on its onboard sensors to perform the navigation task. In particular, we are interested in solutions to this problem that do not require localization, mapping or planning. Additionally, we require that our solution can quickly adapt to new situations (e.g., changing navigation goals and environments). To meet these criteria we frame this problem as a sequence of related reinforcement learning tasks. We propose a successor feature based deep reinforcement learning algorithm that can learn to transfer knowledge from previously mastered navigation tasks to new problem instances. Our algorithm substantially decreases the required learning time after the first task instance has been solved, which makes it easily adaptable to changing environments. We validate our method in both simulated and real robot experiments with a Robotino and compare it to a set of baseline methods including classical planning-based navigation.

Incorporating Human Domain Knowledge into Large Scale Cost Function Learning

Authors:Markus Wulfmeier, Dushyant Rao, Ingmar Posner
Date:2016-12-13 18:56:03

Recent advances have shown the capability of Fully Convolutional Neural Networks (FCN) to model cost functions for motion planning in the context of learning driving preferences purely based on demonstration data from human drivers. While pure learning from demonstrations in the framework of Inverse Reinforcement Learning (IRL) is a promising approach, we can benefit from well informed human priors and incorporate them into the learning process. Our work achieves this by pretraining a model to regress to a manual cost function and refining it based on Maximum Entropy Deep Inverse Reinforcement Learning. When injecting prior knowledge as pretraining for the network, we achieve higher robustness, more visually distinct obstacle boundaries, and the ability to capture instances of obstacles that elude models that purely learn from demonstration data. Furthermore, by exploiting these human priors, the resulting model can more accurately handle corner cases that are scarcely seen in the demonstration data, such as stairs, slopes, and underpasses.

Playing Doom with SLAM-Augmented Deep Reinforcement Learning

Authors:Shehroze Bhatti, Alban Desmaison, Ondrej Miksik, Nantas Nardelli, N. Siddharth, Philip H. S. Torr
Date:2016-12-01 18:54:51

A number of recent approaches to policy learning in 2D game domains have been successful going directly from raw input images to actions. However, when employed in complex 3D environments, they typically suffer from challenges related to partial observability, combinatorial exploration spaces, path planning, and a scarcity of rewarding scenarios. Inspired by prior work in human cognition that indicates how humans employ a variety of semantic concepts and abstractions (object categories, localisation, etc.) to reason about the world, we build an agent-model that incorporates such abstractions into its policy-learning framework. We augment the raw image input to a Deep Q-Learning Network (DQN), by adding details of objects and structural elements encountered, along with the agent's localisation. The different components are automatically extracted and composed into a topological representation using on-the-fly object detection and 3D-scene reconstruction. We evaluate the efficacy of our approach in Doom, a 3D first-person combat game that exhibits a number of challenges discussed, and show that our augmented framework consistently learns better, more effective policies.

CAD2RL: Real Single-Image Flight without a Single Real Image

Authors:Fereshteh Sadeghi, Sergey Levine
Date:2016-11-13 23:08:42

Deep reinforcement learning has emerged as a promising and powerful technique for automatically acquiring control policies that can process raw sensory inputs, such as images, and perform complex behaviors. However, extending deep RL to real-world robotic tasks has proven challenging, particularly in safety-critical domains such as autonomous flight, where a trial-and-error learning process is often impractical. In this paper, we explore the following question: can we train vision-based navigation policies entirely in simulation, and then transfer them into the real world to achieve real-world flight without a single real training image? We propose a learning method that we call CAD$^2$RL, which can be used to perform collision-free indoor flight in the real world while being trained entirely on 3D CAD models. Our method uses single RGB images from a monocular camera, without needing to explicitly reconstruct the 3D geometry of the environment or perform explicit motion planning. Our learned collision avoidance policy is represented by a deep convolutional neural network that directly processes raw monocular images and outputs velocity commands. This policy is trained entirely on simulated images, with a Monte Carlo policy evaluation algorithm that directly optimizes the network's ability to produce collision-free flight. By highly randomizing the rendering settings for our simulated training set, we show that we can train a policy that generalizes to the real world, without requiring the simulator to be particularly realistic or high-fidelity. We evaluate our method by flying a real quadrotor through indoor environments, and further evaluate the design choices in our simulator through a series of ablation studies on depth prediction. For supplementary video see: https://youtu.be/nXBWmzFrj5s

Estimating Dynamic Treatment Regimes in Mobile Health Using V-learning

Authors:Daniel J. Luckett, Eric B. Laber, Anna R. Kahkoska, David M. Maahs, Elizabeth Mayer-Davis, Michael R. Kosorok
Date:2016-11-10 22:04:13

The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best healthcare possible for each patient. Mobile technologies have an important role to play in this vision as they offer a means to monitor a patient's health status in real-time and subsequently to deliver interventions if, when, and in the dose that they are needed. Dynamic treatment regimes formalize individualized treatment plans as sequences of decision rules, one per stage of clinical intervention, that map current patient information to a recommended treatment. However, existing methods for estimating optimal dynamic treatment regimes are designed for a small number of fixed decision points occurring on a coarse time-scale. We propose a new reinforcement learning method for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting. The proposed method accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications. We show the proposed estimators are consistent and asymptotically normal under mild conditions. The proposed methods are applied to estimate an optimal dynamic treatment regime for controlling blood glucose levels in patients with type 1 diabetes.

A Reinforcement Learning Approach to the View Planning Problem

Authors:Mustafa Devrim Kaba, Mustafa Gokhan Uzunbas, Ser Nam Lim
Date:2016-10-19 20:29:20

We present a Reinforcement Learning (RL) solution to the view planning problem (VPP), which generates a sequence of view points that are capable of sensing all accessible areas of a given object represented as a 3D model. In doing so, the goal is to minimize the number of view points, making the VPP a class of set covering optimization problems (SCOP). The SCOP is NP-hard, and the inapproximability results tell us that the greedy algorithm provides the best approximation that runs in polynomial time. In order to find a solution that is better than the greedy algorithm, (i) we introduce a novel score function by exploiting the geometry of the 3D model, (ii) we model an intuitive human approach to VPP using this score function, and (iii) we cast VPP as a Markovian Decision Process (MDP), and solve the MDP in an RL framework using well-known RL algorithms. In particular, we use SARSA, Watkins-Q and TD with function approximation to solve the MDP. We compare the results of our method with the baseline greedy algorithm on an extensive set of test objects, and show that we can out-perform the baseline in almost all cases.
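
The greedy baseline for set covering mentioned above can be sketched as follows. This is the standard greedy heuristic, not the paper's RL method; the view names and the surface-patch identifiers in the example are hypothetical.

```python
def greedy_set_cover(universe, candidate_views):
    """Repeatedly pick the view that covers the most still-uncovered elements.

    candidate_views maps a view name to the set of elements it can sense.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(candidate_views,
                   key=lambda name: len(candidate_views[name] & uncovered))
        gain = candidate_views[best] & uncovered
        if not gain:
            raise ValueError("remaining elements cannot be covered by any view")
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    # Hypothetical surface patches 1..5 and the patches visible from each view.
    views = {"v1": {1, 2, 3}, "v2": {3, 4}, "v3": {4, 5}, "v4": {1, 5}}
    print(greedy_set_cover(range(1, 6), views))
```

Each iteration maximises only the immediate coverage gain, which is the short-sightedness that a learned sequential policy could in principle improve on.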

Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model

Authors:Paul Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, Wojciech Zaremba
Date:2016-10-11 20:24:31

Developing control policies in simulation is often more practical and safer than directly running experiments in the real world. This applies to policies obtained from planning and optimization, and even more so to policies obtained from reinforcement learning, which is often very data demanding. However, a policy that succeeds in simulation often doesn't work when deployed on a real robot. Nevertheless, often the overall gist of what the policy does in simulation remains valid in the real world. In this paper we investigate such settings, where the sequence of states traversed in simulation remains reasonable for the real world, even if the details of the controls are not, as could be the case when the key differences lie in detailed friction, contact, mass and geometry properties. During execution, at each time step our approach computes what the simulation-based control policy would do, but then, rather than executing these controls on the real robot, our approach computes what the simulation expects the resulting next state(s) will be, and then relies on a learned deep inverse dynamics model to decide which real-world action is most suitable to achieve those next states. Deep models are only as good as their training data, and we also propose an approach for data collection to (incrementally) learn the deep inverse dynamics model. Our experiments show that our approach compares favorably with various baselines that have been developed for dealing with simulation to real world model discrepancy, including output error control and Gaussian dynamics adaptation.

Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Authors:Shai Shalev-Shwartz, Shaked Shammah, Amnon Shashua
Date:2016-10-11 12:09:03

Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns and while pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield a too simplistic policy. Moreover, one must account for unexpected behavior of other drivers and pedestrians while at the same time not being too defensive, so that normal traffic flow is maintained. In this paper we apply deep reinforcement learning to the problem of forming long term driving strategies. We note that there are two major challenges that make autonomous driving different from other robotic tasks. The first is the necessity of ensuring functional safety, something that machine learning has difficulty with given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfort of driving, while hard constraints guarantee the safety of driving. Third, we introduce a hierarchical temporal abstraction we call an "Option Graph" with a gating mechanism that significantly reduces the effective horizon and thereby reduces the variance of the gradient estimation even further.

Situational Awareness by Risk-Conscious Skills

Authors:Daniel J. Mankowitz, Aviv Tamar, Shie Mannor
Date:2016-10-10 11:01:32

Hierarchical Reinforcement Learning has been previously shown to speed up the convergence rate of RL planning algorithms as well as mitigate feature-based model misspecification (Mankowitz et al. 2016a,b, Bacon 2015). To do so, it utilizes hierarchical abstractions, also known as skills -- a type of temporally extended action (Sutton et al. 1999) to plan at a higher level, abstracting away from the lower-level details. We incorporate risk sensitivity, also referred to as Situational Awareness (SA), into hierarchical RL for the first time by defining and learning risk aware skills in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP). This is achieved using our novel Situational Awareness by Risk-Conscious Skills (SARiCoS) algorithm which comes with a theoretical convergence guarantee. We show in a RoboCup soccer domain that the learned risk aware skills exhibit complex human behaviors such as 'time-wasting' in a soccer game. In addition, the learned risk aware skills are able to mitigate reward-based model misspecification.

Deep Visual Foresight for Planning Robot Motion

Authors:Chelsea Finn, Sergey Levine
Date:2016-10-03 19:54:17

A key challenge in scaling up robot learning to many skills and environments is removing the need for human supervision, so that robots can collect their own data and improve their own performance without being limited by the cost of requesting human feedback. Model-based reinforcement learning holds the promise of enabling an agent to learn to predict the effects of its actions, which could provide flexible predictive models for a wide range of tasks and environments, without detailed human supervision. We develop a method for combining deep action-conditioned video prediction models with model-predictive control that uses entirely unlabeled training data. Our approach does not require a calibrated camera, an instrumented training set-up, nor precise sensing and actuation. Our results show that our method enables a real robot to perform nonprehensile manipulation -- pushing objects -- and can handle novel objects not seen during training.

Principled Option Learning in Markov Decision Processes

Authors:Roy Fox, Michal Moshkovitz, Naftali Tishby
Date:2016-09-18 18:19:02

It is well known that options can make planning more efficient, among their many benefits. Thus far, algorithms for autonomously discovering a set of useful options were heuristic. Naturally, a principled way of finding a set of useful options may be more promising and insightful. In this paper we suggest a mathematical characterization of good sets of options using tools from information theory. This characterization enables us to find conditions for a set of options to be optimal and to derive an algorithm that outputs a useful set of options. We illustrate the proposed algorithm in simulation.

The Option-Critic Architecture

Authors:Pierre-Luc Bacon, Jean Harb, Doina Precup
Date:2016-09-16 17:05:55

Temporal abstraction is key to scaling up learning and planning in reinforcement learning. While planning with temporally extended actions is well understood, creating such abstractions autonomously from data has remained challenging. We tackle this problem in the framework of options [Sutton, Precup & Singh, 1999; Precup, 2000]. We derive policy gradient theorems for options and propose a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals. Experimental results in both discrete and continuous environments showcase the flexibility and efficiency of the framework.

Open Problem: Approximate Planning of POMDPs in the class of Memoryless Policies

Authors:Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
Date:2016-08-17 15:20:35

Planning plays an important role in the broad class of decision theory. Planning has drawn much attention in recent work in the robotics and sequential decision making areas. Recently, Reinforcement Learning (RL), as an agent-environment interaction problem, has brought further attention to planning methods. Generally in RL, one can assume a generative model, e.g. graphical models, for the environment, and then the task for the RL agent is to learn the model parameters and find the optimal strategy based on these learnt parameters. Based on environment behavior, the agent can assume various types of generative models, e.g. Multi Armed Bandit for a static environment, or Markov Decision Process (MDP) for a dynamic environment. The advantage of these popular models is their simplicity, which results in tractable methods of learning the parameters and finding the optimal policy. The drawback of these models is again their simplicity: these models usually underfit and underestimate the actual environment behavior. For example, in robotics, the agent usually has noisy observations of the environment inner state and MDP is not a suitable model. More complex models like Partially Observable Markov Decision Process (POMDP) can compensate for this drawback. Fitting this model to the environment, where the partial observation is given to the agent, generally gives dramatic performance improvement, sometimes unbounded improvement, compared to MDP. In general, finding the optimal policy for the POMDP model is computationally intractable and fully non convex, even for the class of memoryless policies. The open problem is to come up with a method to find an exact or an approximate optimal stochastic memoryless policy for POMDP models.

Learning to Prevent Monocular SLAM Failure using Reinforcement Learning

Authors:Vignesh Prasad, Karmesh Yadav, Rohitashva Singh Saurabh, Swapnil Daga, Nahas Pareekutty, K. Madhava Krishna, Balaraman Ravindran, Brojeshwar Bhowmick
Date:2016-07-26 06:53:38

Monocular SLAM refers to using a single camera to estimate robot ego motion while building a map of the environment. While Monocular SLAM is a well studied problem, automating Monocular SLAM by integrating it with trajectory planning frameworks is particularly challenging. This paper presents a novel formulation based on Reinforcement Learning (RL) that generates fail safe trajectories wherein the SLAM generated outputs do not deviate largely from their true values. Quintessentially, the RL framework successfully learns the otherwise complex relation between perceptual inputs and motor actions and uses this knowledge to generate trajectories that do not cause failure of SLAM. We show systematically in simulations how the quality of the SLAM dramatically improves when trajectories are computed using RL. Our method scales effectively across Monocular SLAM frameworks in both simulation and in real world experiments with a mobile robot.

Strategic Attentive Writer for Learning Macro-Actions

Authors:Alexander Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, Koray Kavukcuoglu
Date:2016-06-15 09:28:52

We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner by purely interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous subsequences by learning for how long the plan can be committed to - i.e. followed without re-planning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally abstracted macro-actions of varying lengths that are solely learnt from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e.g. Ms. Pacman and Frostbite). It is at the same time a general algorithm that can be applied on any sequence data. To that end, we also show that when trained on a text prediction task, STRAW naturally predicts frequent n-grams (instead of macro-actions), demonstrating the generality of the approach.

Natural Language Generation as Planning under Uncertainty Using Reinforcement Learning

Authors:Verena Rieser, Oliver Lemon
Date:2016-06-15 09:05:56

We present and evaluate a new model for Natural Language Generation (NLG) in Spoken Dialogue Systems, based on statistical planning, given noisy feedback from the current generation context (e.g. a user and a surface realiser). We study its use in a standard NLG problem: how to present information (in this case a set of search results) to users, given the complex trade-offs between utterance length, amount of information conveyed, and cognitive load. We set these trade-offs by analysing existing MATCH data. We then train an NLG policy using Reinforcement Learning (RL), which adapts its behaviour to noisy feedback from the current generation context. This policy is compared to several baselines derived from previous work in this area. The learned policy significantly outperforms all the prior approaches.

Model-Free Imitation Learning with Policy Optimization

Authors:Jonathan Ho, Jayesh K. Gupta, Stefano Ermon
Date:2016-05-26 23:43:32

In imitation learning, an agent learns how to behave in an environment with an unknown cost function by mimicking expert demonstrations. Existing imitation learning algorithms typically involve solving a sequence of planning or reinforcement learning problems. Such algorithms are therefore not directly applicable to large, high-dimensional environments, and their performance can significantly degrade if the planning problems are not solved to optimality. Under the apprenticeship learning formalism, we develop alternative model-free algorithms for finding a parameterized stochastic policy that performs at least as well as an expert policy on an unknown cost function, based on sample trajectories from the expert. Our approach, based on policy gradients, scales to large continuous environments with guaranteed convergence to local minima.

A Reinforcement Learning System to Encourage Physical Activity in Diabetes Patients

Authors:Irit Hochberg, Guy Feraru, Mark Kozdoba, Shie Mannor, Moshe Tennenholtz, Elad Yom-Tov
Date:2016-05-13 07:25:14

Regular physical activity is known to be beneficial to people suffering from diabetes type 2. Nevertheless, most such people are sedentary. Smartphones create new possibilities for helping people to adhere to their physical activity goals, through continuous monitoring and communication, coupled with personalized feedback. We provided 27 sedentary diabetes type 2 patients with a smartphone-based pedometer and a personal plan for physical activity. Patients were sent SMS messages to encourage physical activity between once a day and once per week. Messages were personalized through a Reinforcement Learning (RL) algorithm which optimized messages to improve each participant's compliance with the activity regimen. The RL algorithm was compared to a static policy for sending messages and to weekly reminders. Our results show that participants who received messages generated by the RL algorithm increased the amount of activity and pace of walking, while the control group patients did not. Patients assigned to the RL algorithm group experienced a superior reduction in blood glucose levels (HbA1c) compared to control policies, and longer participation caused greater reductions in blood glucose levels. The learning algorithm improved gradually in predicting which messages would lead participants to exercise. Our results suggest that a mobile phone application coupled with a learning algorithm can improve adherence to exercise in diabetic patients. As a learning algorithm is automated, and delivers personalized messages, it could be used in large populations of diabetic patients to improve health and glycemic control. Our results can be expanded to other areas where computer-led health coaching of humans may have a positive impact.

HIRL: Hierarchical Inverse Reinforcement Learning for Long-Horizon Tasks with Delayed Rewards

Authors:Sanjay Krishnan, Animesh Garg, Richard Liaw, Lauren Miller, Florian T. Pokorny, Ken Goldberg
Date:2016-04-21 22:14:11

Reinforcement Learning (RL) struggles in problems with delayed rewards, and one approach is to segment the task into sub-tasks with incremental rewards. We propose a framework called Hierarchical Inverse Reinforcement Learning (HIRL), which is a model for learning sub-task structure from demonstrations. HIRL decomposes the task into sub-tasks based on transitions that are consistent across demonstrations. These transitions are defined as changes in local linearity with respect to a kernel function. Then, HIRL uses the inferred structure to learn reward functions local to the sub-tasks but also handle any global dependencies such as sequentiality. We have evaluated HIRL on several standard RL benchmarks: Parallel Parking with noisy dynamics, Two-Link Pendulum, 2D Noisy Motion Planning, and a Pinball environment. In the parallel parking task, we find that rewards constructed with HIRL converge to a policy with an 80% success rate in 32% fewer time-steps than those constructed with Maximum Entropy Inverse RL (MaxEnt IRL), and with partial state observation, the policies learned with IRL fail to achieve this accuracy while HIRL still converges. We further find that the rewards learned with HIRL are robust to environment noise where they can tolerate 1 standard deviation of random perturbation in the poses of the environment obstacles while maintaining roughly the same convergence rate. We find that HIRL rewards can converge up to 6x faster than rewards constructed with IRL.

Intelligent Agent-Based Stimulation for Testing Robotic Software in Human-Robot Interactions

Authors:Dejanira Araiza-Illan, Anthony G. Pipe, Kerstin Eder
Date:2016-04-19 10:33:34

The challenges of robotic software testing extend beyond conventional software testing. Valid, realistic and interesting tests need to be generated for multiple programs and hardware running concurrently, deployed into dynamic environments with people. We investigate the use of Belief-Desire-Intention (BDI) agents as models for test generation, in the domain of human-robot interaction (HRI) in simulations. These models provide rational agency, causality, and a reasoning mechanism for planning, which emulate both intelligent and adaptive robots, as well as smart testing environments directed by humans. We introduce reinforcement learning (RL) to automate the exploration of the BDI models using a reward function based on coverage feedback. Our approach is evaluated using a collaborative manufacture example, where the robotic software under test is stimulated indirectly via a simulated human co-worker. We conclude that BDI agents provide intuitive models for test generation in the HRI domain. Our results demonstrate that RL can fully automate BDI model exploration, leading to very effective coverage-directed test generation.

Hierarchical Linearly-Solvable Markov Decision Problems

Authors:Anders Jonsson, Vicenç Gómez
Date:2016-03-10 13:50:31

We present a hierarchical reinforcement learning framework that formulates each task in the hierarchy as a special type of Markov decision process for which the Bellman equation is linear and has an analytical solution. Problems of this type, called linearly-solvable MDPs (LMDPs), have interesting properties that can be exploited in a hierarchical setting, such as efficient learning of the optimal value function or task compositionality. The proposed hierarchical approach can also be seen as a novel alternative to solving LMDPs with large state spaces. We derive a hierarchical version of the so-called Z-learning algorithm that learns different tasks simultaneously and show empirically that it significantly outperforms the state-of-the-art learning methods in two classical hierarchical reinforcement learning domains: the taxi domain and an autonomous guided vehicle task.
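
To make the "linear Bellman equation" concrete, here is a minimal sketch of a flat (non-hierarchical) first-exit LMDP. It assumes the standard formulation from the LMDP literature, which the abstract does not spell out: with desirability z(x) = exp(-v(x)), interior states satisfy z(x) = exp(-q(x)) * sum over x' of p(x'|x) z(x'), where p is the passive dynamics and q the state cost. The chain example and the names `solve_lmdp` and `optimal_transitions` are hypothetical. The sketch iterates the linear map with a known model; it does not implement Z-learning or the paper's hierarchical version.

```python
import math

def solve_lmdp(passive, state_cost, terminal, iters=200):
    """Iterate the linear Bellman equation z = exp(-q) * (P z) on interior states."""
    n = len(passive)
    z = [1.0] * n
    for t in terminal:
        z[t] = math.exp(-state_cost[t])   # boundary condition at terminal states
    for _ in range(iters):
        new_z = list(z)
        for x in range(n):
            if x in terminal:
                continue
            expected = sum(p * z[y] for y, p in enumerate(passive[x]))
            new_z[x] = math.exp(-state_cost[x]) * expected
        z = new_z
    return z

def optimal_transitions(passive, z, x):
    """Optimal controlled dynamics: u*(x'|x) is proportional to p(x'|x) * z(x')."""
    weights = [p * zy for p, zy in zip(passive[x], z)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 4-state chain; state 3 is the terminal goal.
passive = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
state_cost = [1.0, 1.0, 1.0, 0.0]

z = solve_lmdp(passive, state_cost, terminal={3})
values = [-math.log(zx) for zx in z]              # optimal cost-to-go v = -log z
policy_at_2 = optimal_transitions(passive, z, 2)  # biased toward the goal
```

Because the update is linear in z, the same solution could also be obtained by solving a linear system or an eigenvector problem instead of iterating.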

Reinforcement Learning of POMDPs using Spectral Methods

Authors:Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
Date:2016-02-25 01:25:36

We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm running through episodes; in each episode we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.

POMDP-lite for Robust Robot Planning under Uncertainty

Authors:Min Chen, Emilio Frazzoli, David Hsu, Wee Sun Lee
Date:2016-02-16 00:47:08

The partially observable Markov decision process (POMDP) provides a principled general model for planning under uncertainty. However, solving a general POMDP is computationally intractable in the worst case. This paper introduces POMDP-lite, a subclass of POMDPs in which the hidden state variables are constant or only change deterministically. We show that a POMDP-lite is equivalent to a set of fully observable Markov decision processes indexed by a hidden parameter and is useful for modeling a variety of interesting robotic tasks. We develop a simple model-based Bayesian reinforcement learning algorithm to solve POMDP-lite models. The algorithm performs well on large-scale POMDP-lite models with up to $10^{20}$ states and outperforms the state-of-the-art general-purpose POMDP algorithms. We further show that the algorithm is near-Bayesian-optimal under suitable conditions.

Value Iteration Networks

Authors:Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, Pieter Abbeel
Date:2016-02-09 05:44:36

We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.
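
For orientation, the classical value-iteration computation that a VIN approximates can be sketched on a small grid. This is plain tabular value iteration, not the VIN architecture: there is no learning and no neural network here, and the grid layout, reward convention, and function name `value_iteration_grid` are assumptions made for illustration. Each sweep combines a cell's neighbours locally and then takes a max over actions, which is the structure that lends itself to a convolution followed by a max over channels.

```python
def value_iteration_grid(rows, cols, goal, obstacles, gamma=0.9, sweeps=50):
    """Tabular value iteration on a 4-connected grid with a terminal goal cell.

    Entering the goal yields reward 1; every other move yields reward 0.
    Moves into walls or obstacles leave the agent where it is.
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    value = [[0.0] * cols for _ in range(rows)]
    for _ in range(sweeps):
        new_value = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if (r, c) == goal or (r, c) in obstacles:
                    continue  # terminal and blocked cells keep value 0
                q_values = []
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    off_grid = not (0 <= nr < rows and 0 <= nc < cols)
                    if off_grid or (nr, nc) in obstacles:
                        nr, nc = r, c
                    reward = 1.0 if (nr, nc) == goal else 0.0
                    q_values.append(reward + gamma * value[nr][nc])
                new_value[r][c] = max(q_values)  # max over the action "channels"
        value = new_value
    return value

# Hypothetical 3x3 grid: goal in the top-right corner, obstacle in the centre.
value = value_iteration_grid(3, 3, goal=(0, 2), obstacles={(1, 1)})
```

With this reward convention a cell's value is gamma raised to one less than its shortest-path distance to the goal, so following increasing values leads around the obstacle to the goal.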

Information-Theoretic Bounded Rationality

Authors:Pedro A. Ortega, Daniel A. Braun, Justin Dyer, Kee-Eung Kim, Naftali Tishby
Date:2015-12-21 19:58:46

Bounded rationality, that is, decision-making and planning under resource limitations, is widely regarded as an important open problem in artificial intelligence, reinforcement learning, computational neuroscience and economics. This paper offers a consolidated presentation of a theory of bounded rationality based on information-theoretic ideas. We provide a conceptual justification for using the free energy functional as the objective function for characterizing bounded-rational decisions. This functional possesses three crucial properties: it controls the size of the solution space; it has Monte Carlo planners that are exact, yet bypass the need for exhaustive search; and it captures model uncertainty arising from lack of evidence or from interacting with other agents having unknown intentions. We discuss the single-step decision-making case, and show how to extend it to sequential decisions using equivalence transformations. This extension yields a very general class of decision problems that encompass classical decision rules (e.g. EXPECTIMAX and MINIMAX) as limit cases, as well as trust- and risk-sensitive planning.
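
The single-step case can be illustrated numerically. The sketch below assumes the usual form of the free energy objective from this line of work, which the abstract does not state explicitly: a prior q over choices, utilities U, and an inverse temperature beta, with optimal choice distribution p(x) proportional to q(x) exp(beta U(x)) and certainty equivalent (1/beta) log sum_x q(x) exp(beta U(x)). The function name and the example numbers are hypothetical.

```python
import math

def bounded_rational_choice(utilities, prior, beta):
    """Return (policy, certainty_equivalent) for a nonzero inverse temperature.

    Assumes every prior probability is strictly positive.
    Uses a log-sum-exp shift for numerical stability.
    """
    scaled = [beta * u + math.log(q) for u, q in zip(utilities, prior)]
    shift = max(scaled)
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    policy = [w / total for w in weights]
    log_partition = shift + math.log(total)
    return policy, log_partition / beta

utilities = [1.0, 2.0, 4.0]
prior = [1 / 3, 1 / 3, 1 / 3]

_, value_max = bounded_rational_choice(utilities, prior, beta=50.0)    # near max U
_, value_min = bounded_rational_choice(utilities, prior, beta=-50.0)   # near min U
policy_mid, value_mean = bounded_rational_choice(utilities, prior, beta=1e-3)  # near E_q[U]
```

Large positive beta approaches maximization, large negative beta approaches the pessimistic minimum, and beta near zero approaches the expectation under the prior, which is the sense in which maximizing, minimizing, and expectation rules arise as limit cases.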

On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models

Authors:Juergen Schmidhuber
Date:2015-11-30 11:35:26

This paper addresses the general problem of reinforcement learning (RL) in partially observable environments. In 2013, our large RL recurrent neural networks (RNNs) learned from scratch to drive simulated cars from high-dimensional video input. However, real brains are more powerful in many ways. In particular, they learn a predictive model of their initially unknown environment, and somehow use it for abstract (e.g., hierarchical) planning and reasoning. Guided by algorithmic information theory, we describe RNN-based AIs (RNNAIs) designed to do the same. Such an RNNAI can be trained on never-ending sequences of tasks, some of them provided by the user, others invented by the RNNAI itself in a curious, playful fashion, to improve its RNN-based world model. Unlike our previous model-building RNN-based RL machines dating back to 1990, the RNNAI learns to actively query its model for abstract reasoning and planning and decision making, essentially "learning to think." The basic ideas of this report can be applied to many other cases where one RNN-like system exploits the algorithmic information content of another. They are taken from a grant proposal submitted in Fall 2014, and also explain concepts such as "mirror neurons." Experimental results will be described in separate papers.

MazeBase: A Sandbox for Learning from Games

Authors:Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, Rob Fergus
Date:2015-11-23 20:23:53

This paper introduces MazeBase: an environment for simple 2D games, designed as a sandbox for machine learning approaches to reasoning and planning. Within it, we create 10 simple games embodying a range of algorithmic tasks (e.g. if-then statements or set negation). A variety of neural models (fully connected, convolutional network, memory network) are deployed via reinforcement learning on these games, with and without a procedurally generated curriculum. Despite the tasks' simplicity, the performance of the models is far from optimal, suggesting directions for future development. We also demonstrate the versatility of MazeBase by using it to emulate small combat scenarios from StarCraft. Models trained on the MazeBase version can be directly applied to StarCraft, where they consistently beat the in-game AI.

One-Shot Learning of Manipulation Skills with Online Dynamics Adaptation and Neural Network Priors

Authors:Justin Fu, Sergey Levine, Pieter Abbeel
Date:2015-09-23 04:19:14

One of the key challenges in applying reinforcement learning to complex robotic control tasks is the need to gather large amounts of experience in order to find an effective policy for the task at hand. Model-based reinforcement learning can achieve good sample efficiency, but requires the ability to learn a model of the dynamics that is good enough to learn an effective policy. In this work, we develop a model-based reinforcement learning algorithm that combines prior knowledge from previous tasks with online adaptation of the dynamics model. These two ingredients enable highly sample-efficient learning even in regimes where estimating the true dynamics is very difficult, since the online model adaptation allows the method to locally compensate for unmodeled variation in the dynamics. We encode the prior experience into a neural network dynamics model, adapt it online by progressively refitting a local linear model of the dynamics, and use model predictive control to plan under these dynamics. Our experimental results show that this approach can be used to solve a variety of complex robotic manipulation tasks in just a single attempt, using prior data from other manipulation behaviors.

Continuous control with deep reinforcement learning

Authors:Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
Date:2015-09-09 23:01:36

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

Learning Efficient Representations for Reinforcement Learning

Authors:Yanping Huang
Date:2015-08-28 06:01:56

Markov decision processes (MDPs) are a well studied framework for solving sequential decision making problems under uncertainty. Exact methods for solving MDPs based on dynamic programming such as policy iteration and value iteration are effective on small problems. In problems with a large discrete state space or with continuous state spaces, a compact representation is essential for providing efficient approximate solutions to MDPs. Commonly used approximation algorithms involve constructing basis functions for projecting the value function onto a low dimensional subspace, and building a factored or hierarchical graphical model to decompose the transition and reward functions. However, hand-coding a good compact representation for a given reinforcement learning (RL) task can be quite difficult and time consuming. Recent approaches have attempted to automatically discover efficient representations for RL. In this thesis proposal, we discuss the problems of automatically constructing structured kernels for kernel based RL, a popular approach to learning non-parametric approximations for value functions. We explore a space of kernel structures which are built compositionally from base kernels using a context-free grammar. We examine a greedy algorithm for searching over the structure space. To demonstrate how the learned structure can represent and approximate the original RL problem in terms of compactness and efficiency, we plan to evaluate our method on a synthetic problem and compare it to other RL baselines.

Bootstrapping Skills

Authors:Daniel J. Mankowitz, Timothy A. Mann, Shie Mannor
Date:2015-06-11 11:06:40

The monolithic approach to policy representation in Markov Decision Processes (MDPs) looks for a single policy that can be represented as a function from states to actions. For the monolithic approach to succeed (and this is not always possible), a complex feature representation is often necessary since the policy is a complex object that has to prescribe what actions to take all over the state space. This is especially true in large domains with complicated dynamics. It is also computationally inefficient to both learn and plan in MDPs using a complex monolithic approach. We present a different approach where we restrict the policy space to policies that can be represented as combinations of simpler, parameterized skills---a type of temporally extended action, with a simple policy representation. We introduce Learning Skills via Bootstrapping (LSB) that can use a broad family of Reinforcement Learning (RL) algorithms as a "black box" to iteratively learn parametrized skills. Initially, the learned skills are short-sighted but each iteration of the algorithm allows the skills to bootstrap off one another, improving each skill in the process. We prove that this bootstrapping process returns a near-optimal policy. Furthermore, our experiments demonstrate that LSB can solve MDPs that, given the same representational power, could not be solved by a monolithic approach. Thus, planning with learned skills results in better policies without requiring complex policy representations.

Correct-by-synthesis reinforcement learning with temporal logic constraints

Authors:Min Wen, Ruediger Ehlers, Ufuk Topcu
Date:2015-03-05 21:23:45

We consider a problem on the synthesis of reactive controllers that optimize some a priori unknown performance criterion while interacting with an uncontrolled environment such that the system satisfies a given temporal logic specification. We decouple the problem into two subproblems. First, we extract a (maximally) permissive strategy for the system, which encodes multiple (possibly all) ways in which the system can react to the adversarial environment and satisfy the specifications. Then, we quantify the a priori unknown performance criterion as a (still unknown) reward function and compute an optimal strategy for the system within the operating envelope allowed by the permissive strategy by using the so-called maximin-Q learning algorithm. We establish both correctness (with respect to the temporal logic specifications) and optimality (with respect to the a priori unknown performance criterion) of this two-step technique for a fragment of temporal logic specifications. For specifications beyond this fragment, correctness can still be preserved, but the learned strategy may be sub-optimal. We present an algorithm for the overall problem, and demonstrate its use and computational requirements on a set of robot motion planning examples.

Gaussian Processes for Data-Efficient Learning in Robotics and Control

Authors:Marc Peter Deisenroth, Dieter Fox, Carl Edward Rasmussen
Date:2015-02-10 11:09:38

Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning reduces the amount of engineering knowledge that is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-specific knowledge in the form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.

The hippocampal-striatal circuit for goal-directed and habitual choice

Authors:Fabian Chersi
Date:2014-12-09 00:19:25

It is now widely accepted that one of the roles of the hippocampus is to maintain episodic spatial representations, while parallel striatal pathways contribute to both declarative and procedural value computations by encoding different input-specific outcome predictions. In this paper we investigate the use of these brain mechanisms for action selection, linking them to model-based and model-free controllers for decision making. To this end we propose a biologically inspired computational model that embodies these theories and explains the functioning of the hippocampal-striatal circuit in a rat navigation task. Its main characteristic is to allow the cooperation of habitual and goal-directed behaviors, with the hippocampus primarily involved in encoding spatial information and simulating possible navigation paths, and the ventral and dorsal striatum involved in learning stimulus-response behaviors and evaluating the reward expectancies associated with predicted locations and sensed stimuli, respectively. The architecture we present employs an unsupervised reinforcement learning rule for the hippocampal-striatal network that is able to build a representation of the environment in which rewarding sites and informative landmarks produce value gradients that are used for planning and decision making. Additionally, it utilizes an arbitration mechanism that balances between exploitation, i.e. stimulus-response behaviors, and mental exploration, i.e. motor imagery processes, based on the intensity and the variability of the responses of striatal neurons. We interpret these results in light of recent experimental data that show anticipatory activations in hippocampal and striatal areas.

Scalable Planning and Learning for Multiagent POMDPs: Extended Version

Authors:Christopher Amato, Frans A. Oliehoek
Date:2014-04-04 03:02:44

Online, sample-based planning algorithms for POMDPs have shown great promise in scaling to problems with large state spaces, but they become intractable for large action and observation spaces. This is particularly problematic in multiagent POMDPs, where the action and observation spaces grow exponentially with the number of agents. To combat this intractability, we propose a novel scalable approach based on sample-based planning and factored value functions that exploits structure present in many multiagent settings. This approach applies not only in the planning case, but also in the Bayesian reinforcement learning setting. Experimental results show that we are able to provide high quality solutions to large multiagent planning and learning problems.

Better Optimism By Bayes: Adaptive Planning with Rich Models

Authors:Arthur Guez, David Silver, Peter Dayan
Date:2014-02-09 15:38:57

The computational costs of inference and planning have confined Bayesian model-based reinforcement learning to one of two dismal fates: powerful Bayes-adaptive planning but only for simplistic models, or powerful, Bayesian non-parametric models but using simple, myopic planning strategies such as Thompson sampling. We ask whether it is feasible and truly beneficial to combine rich probabilistic models with a closer approximation to fully Bayesian planning. First, we use a collection of counterexamples to show formal problems with the over-optimism inherent in Thompson sampling. Then we leverage state-of-the-art techniques in efficient Bayes-adaptive planning and non-parametric Bayesian methods to perform qualitatively better than both existing conventional algorithms and Thompson sampling on two contextual bandit-like problems.

Non-Deterministic Policies in Markovian Decision Processes

Authors:Mahdi Milani Fard, Joelle Pineau
Date:2014-01-16 05:09:10

Markovian processes have long been used to model stochastic environments. Reinforcement learning has emerged as a framework to solve sequential planning and decision-making problems in such environments. In recent years, attempts were made to apply methods from reinforcement learning to construct decision support systems for action selection in Markovian environments. Although conventional methods in reinforcement learning have proved to be useful in problems concerning sequential decision-making, they cannot be applied in their current form to decision support systems, such as those in medical domains, as they suggest policies that are often highly prescriptive and leave little room for the user's input. Without the ability to provide flexible guidelines, it is unlikely that these methods can gain ground with users of such systems. This paper introduces the new concept of non-deterministic policies to allow more flexibility in the user's decision-making process, while constraining decisions to remain near optimal solutions. We provide two algorithms to compute non-deterministic policies in discrete domains. We study the output and running time of these methods on a set of synthetic and real-world problems. In an experiment with human subjects, we show that humans assisted by hints based on non-deterministic policies outperform both human-only and computer-only agents in a web navigation task.

Learning Partially Observable Deterministic Action Models

Authors:Eyal Amir, Allen Chang
Date:2014-01-15 04:52:56

We present exact algorithms for identifying the effects and preconditions of deterministic actions in dynamic partially observable domains. They apply when one does not know the action model (the way actions affect the world) of a domain and must learn it from partial observations over time. Such scenarios are common in real-world applications. They are challenging for AI tasks because traditional domain structures that underlie tractability (e.g., conditional independence) fail there (e.g., world features become correlated). Our work departs from traditional assumptions about partial observations and action models. In particular, it focuses on problems in which actions are deterministic and of simple logical structure, and observation models have all features observed with some frequency. We obtain tractable algorithms for the modified problem for such domains. Our algorithms take sequences of partial observations over time as input, and output deterministic action models that could have led to those observations. The algorithms output all or one of those models (depending on our choice), and are exact in that no model is misclassified given the observations. Our algorithms take polynomial time in the number of time steps and state features for some traditional action classes examined in the AI-planning literature, e.g., STRIPS actions. In contrast, traditional approaches for HMMs and Reinforcement Learning are inexact and exponentially intractable for such domains. Our experiments verify the theoretical tractability guarantees, and show that we identify action models exactly. Several applications in planning, autonomous exploration, and adventure-game playing already use these results. They are also promising for probabilistic settings, partially observable reinforcement learning, and diagnosis.

Efficient Learning and Planning with Compressed Predictive States

Authors:William L. Hamilton, Mahdi Milani Fard, Joelle Pineau
Date:2013-12-01 23:17:06

Predictive state representations (PSRs) offer an expressive framework for modelling partially observable systems. By compactly representing systems as functions of observable quantities, the PSR learning approach avoids using local-minima prone expectation-maximization and instead employs a globally optimal moment-based algorithm. Moreover, since PSRs do not require a predetermined latent state structure as an input, they offer an attractive framework for model-based reinforcement learning when agents must plan without a priori access to a system model. Unfortunately, the expressiveness of PSRs comes with significant computational cost, and this cost is a major factor inhibiting the use of PSRs in applications. In order to alleviate this shortcoming, we introduce the notion of compressed PSRs (CPSRs). The CPSR learning approach combines recent advancements in dimensionality reduction, incremental matrix decomposition, and compressed sensing. We show how this approach provides a principled avenue for learning accurate approximations of PSRs, drastically reducing the computational costs associated with learning while also providing effective regularization. Going further, we propose a planning framework which exploits these learned models. And we show that this approach facilitates model-learning and planning in large complex partially observable domains, a task that is infeasible without the principled use of compression.

Scaling Up Robust MDPs by Reinforcement Learning

Authors:Aviv Tamar, Huan Xu, Shie Mannor
Date:2013-06-26 09:52:51

We consider large-scale Markov decision processes (MDPs) with parameter uncertainty, under the robust MDP paradigm. Previous studies showed that robust MDPs, based on a minimax approach to handle uncertainty, can be solved using dynamic programming for small to medium sized problems. However, due to the "curse of dimensionality", MDPs that model real-life problems are typically prohibitively large for such approaches. In this work we employ a reinforcement learning approach to tackle this planning problem: we develop a robust approximate dynamic programming method based on a projected fixed point equation to approximately solve large scale robust MDPs. We show that the proposed method provably succeeds under certain technical conditions, and demonstrate its effectiveness through simulation of an option pricing problem. To the best of our knowledge, this is the first attempt to scale up the robust MDPs paradigm.
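To make the minimax idea concrete, here is a minimal sketch of the exact robust dynamic programming baseline that the abstract describes as feasible only for small problems. It is not the paper's approximate method. The finite set of candidate transition models, the per-(state, action) choice of the worst model, and the toy two-state problem are all assumptions made for illustration.

```python
import numpy as np


def robust_value_iteration(rewards, models, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Minimax value iteration over a finite set of candidate models.

    rewards: array of shape (S, A).
    models:  array of shape (K, S, A, S); models[k, s, a] is a distribution
             over next states under candidate model k (assumed uncertainty set).
    """
    values = np.zeros(rewards.shape[0])
    q_values = rewards.copy()
    for _ in range(max_iter):
        # Expected next-state value under every candidate model: shape (K, S, A).
        expected_next = models @ values
        # The adversary picks the worst model separately for each (state, action).
        worst_case = expected_next.min(axis=0)
        q_values = rewards + gamma * worst_case
        new_values = q_values.max(axis=1)
        converged = np.max(np.abs(new_values - values)) < tol
        values = new_values
        if converged:
            break
    return values, q_values.argmax(axis=1)


# Hypothetical toy problem: in state 0 the "safe" action 0 pays 1 and stays put;
# the "risky" action 1 pays 0 and reaches the rewarding absorbing state 1 with
# probability 0.9 under the nominal model but never under the adversarial one.
rewards = np.array([[1.0, 0.0], [2.0, 2.0]])
nominal = np.array([[[1.0, 0.0], [0.1, 0.9]], [[0.0, 1.0], [0.0, 1.0]]])
adversarial = np.array([[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]])
models = np.stack([nominal, adversarial])

robust_values, robust_policy = robust_value_iteration(rewards, models)
```

In this toy problem the robust policy prefers the safe action in state 0, whereas planning with the nominal model alone would prefer the risky one.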

Non Deterministic Logic Programs

Authors:Emad Saad
Date:2013-04-26 13:55:05

Non deterministic applications arise in many domains, including stochastic optimization, multi-objective optimization, stochastic planning, contingent stochastic planning, reinforcement learning, reinforcement learning in partially observable Markov decision processes, and conditional planning. We present a logic programming framework called non deterministic logic programs, along with a declarative semantics and fixpoint semantics, to allow representing and reasoning about inherently non deterministic real-world applications. The language of the non deterministic logic programs framework is extended with non-monotonic negation, and two alternative semantics are defined: the stable non deterministic model semantics and the well-founded non deterministic model semantics; their relationship is also studied. These semantics subsume the deterministic stable model semantics and the deterministic well-founded semantics of deterministic normal logic programs, and they reduce to the semantics of deterministic definite logic programs without negation. We show the application of the non deterministic logic programs framework to a conditional planning problem.

Model-based Bayesian Reinforcement Learning for Dialogue Management

Authors:Pierre Lison
Date:2013-04-05 20:47:02

Reinforcement learning methods are increasingly used to optimise dialogue policies from experience. Most current techniques are model-free: they directly estimate the utility of various actions, without explicit model of the interaction dynamics. In this paper, we investigate an alternative strategy grounded in model-based Bayesian reinforcement learning. Bayesian inference is used to maintain a posterior distribution over the model parameters, reflecting the model uncertainty. This parameter distribution is gradually refined as more data is collected and simultaneously used to plan the agent's actions. Within this learning framework, we carried out experiments with two alternative formalisations of the transition model, one encoded with standard multinomial distributions, and one structured with probabilistic rules. We demonstrate the potential of our approach with empirical results on a user simulator constructed from Wizard-of-Oz data in a human-robot interaction scenario. The results illustrate in particular the benefits of capturing prior domain knowledge with high-level rules.

A Greedy Approximation of Bayesian Reinforcement Learning with Probably Optimistic Transition Model

Authors:Kenji Kawaguchi, Mauricio Araya
Date:2013-03-13 14:06:21

Bayesian Reinforcement Learning (RL) is capable of not only incorporating domain knowledge, but also solving the exploration-exploitation dilemma in a natural way. As Bayesian RL is intractable except for special cases, previous work has proposed several approximation methods. However, these methods are usually too sensitive to parameter values, and finding an acceptable parameter setting is practically impossible in many applications. In this paper, we propose a new algorithm that greedily approximates Bayesian RL to achieve robustness in parameter space. We show that for a desired learning behavior, our proposed algorithm has a polynomial sample complexity that is lower than those of existing algorithms. We also demonstrate that the proposed algorithm naturally outperforms other existing algorithms when the prior distributions are not significantly misleading. On the other hand, the proposed algorithm cannot handle greatly misspecified priors as well as the other algorithms can. This is a natural consequence of the fact that the proposed algorithm is greedier than the other algorithms. Accordingly, we discuss a way to select an appropriate algorithm for different tasks based on the algorithms' greediness. We also introduce a new way of simplifying Bayesian planning, based on which future work would be able to derive new algorithms.

Toggling a Genetic Switch Using Reinforcement Learning

Authors:Aivar Sootla, Natalja Strelkowa, Damien Ernst, Mauricio Barahona, Guy-Bart Stan
Date:2013-03-12 15:34:41

In this paper, we consider the problem of optimal exogenous control of gene regulatory networks. Our approach consists in adapting an established reinforcement learning algorithm called the fitted Q iteration. This algorithm infers the control law directly from the measurements of the system's response to external control inputs without the use of a mathematical model of the system. The measurement data set can either be collected from wet-lab experiments or artificially created by computer simulations of dynamical models of the system. The algorithm is applicable to a wide range of biological systems due to its ability to deal with nonlinear and stochastic system dynamics. To illustrate the application of the algorithm to a gene regulatory network, the regulation of the toggle switch system is considered. The control objective of this problem is to drive the concentrations of two specific proteins to a target region in the state space.
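The batch, model-free flavour of fitted Q iteration can be sketched as follows. This is a simplified illustration, not the authors' setup: the regression step is replaced by a per-(state, action) average of targets (a tabular stand-in for a real function approximator), and the four-transition data set with a "target region" state 2 is hypothetical.

```python
from collections import defaultdict


def fitted_q_iteration(transitions, actions, gamma=0.9, iterations=50):
    """Fitted Q iteration on a fixed batch of (state, action, reward, next_state).

    Each iteration builds regression targets from the previous Q estimate and
    refits Q to them. Here "fitting" is averaging the targets per
    (state, action) pair, which is an assumption made to keep the sketch small.
    """
    q = defaultdict(float)  # unseen pairs (e.g. terminal states) default to 0
    for _ in range(iterations):
        target_sums = defaultdict(float)
        target_counts = defaultdict(int)
        for state, action, reward, next_state in transitions:
            best_next = max(q[(next_state, a)] for a in actions)
            target_sums[(state, action)] += reward + gamma * best_next
            target_counts[(state, action)] += 1
        q = defaultdict(
            float,
            {key: target_sums[key] / target_counts[key] for key in target_sums},
        )
    return q


# Hypothetical measurements: "up" moves towards the target region (state 2),
# "down" moves back to state 0; only entering the target region is rewarded.
batch = [
    (0, "up", 0.0, 1),
    (0, "down", 0.0, 0),
    (1, "up", 1.0, 2),
    (1, "down", 0.0, 0),
]
q_table = fitted_q_iteration(batch, actions=("up", "down"))
greedy_action = {s: max(("up", "down"), key=lambda a: q_table[(s, a)]) for s in (0, 1)}
```

The control law is then read off greedily from the fitted Q function, without ever building a model of the system dynamics.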

On the Complexity of Solving Markov Decision Problems

Authors:Michael L. Littman, Thomas L. Dean, Leslie Pack Kaelbling
Date:2013-02-20 15:22:36

Markov decision problems (MDPs) provide the foundations for a number of problems of interest to AI researchers studying automated planning and reinforcement learning. In this paper, we summarize results regarding the complexity of solving MDPs and the running time of MDP solution algorithms. We argue that, although MDPs can be solved efficiently in theory, more study is needed to reveal practical algorithms for solving large problems quickly. To encourage future research, we sketch some alternative methods of analysis that rely on the structure of MDPs.

Probabilistic Exploration in Planning while Learning

Authors:Grigoris I. Karakoulas
Date:2013-02-20 15:22:12

Sequential decision tasks with incomplete information are characterized by the exploration problem; namely the trade-off between further exploration for learning more about the environment and immediate exploitation of the accrued information for decision-making. Within artificial intelligence, there has been an increasing interest in studying planning-while-learning algorithms for these decision tasks. In this paper we focus on the exploration problem in reinforcement learning and Q-learning in particular. The existing exploration strategies for Q-learning are of a heuristic nature and they exhibit limited scalability in tasks with large (or infinite) state and action spaces. Efficient experimentation is needed for resolving uncertainties when possible plans are compared (i.e. exploration). The experimentation should be sufficient for selecting with statistical significance a locally optimal plan (i.e. exploitation). For this purpose, we develop a probabilistic hill-climbing algorithm that uses a statistical selection procedure to decide how much exploration is needed for selecting a plan which is, with arbitrarily high probability, arbitrarily close to a locally optimal one. Due to its generality the algorithm can be employed for the exploration strategy of robust Q-learning. An experiment on a relatively complex control task shows that the proposed exploration strategy performs better than a typical exploration strategy.

Behavior Pattern Recognition using A New Representation Model

Authors:Qifeng Qiao, Peter A. Beling
Date:2013-01-16 09:01:47

We study the use of inverse reinforcement learning (IRL) as a tool for the recognition of agents' behavior on the basis of observation of their sequential decision behavior interacting with the environment. We model the problem faced by the agents as a Markov decision process (MDP) and model the observed behavior of the agents in terms of forward planning for the MDP. We use IRL to learn reward functions and then use these reward functions as the basis for clustering or classification models. Experimental studies with GridWorld, a navigation problem, and the secretary problem, an optimal stopping problem, suggest reward vectors found from IRL can be a good basis for behavior pattern recognition problems. Empirical comparisons of our method with several existing IRL algorithms and with direct methods that use feature statistics observed in state-action space suggest it may be superior for recognition problems.

Planning by Prioritized Sweeping with Small Backups

Authors:Harm van Seijen, Richard S. Sutton
Date:2013-01-10 21:54:42

Efficient planning plays a crucial role in model-based reinforcement learning. Traditionally, the main planning operation is a full backup based on the current estimates of the successor states. Consequently, its computation time is proportional to the number of successor states. In this paper, we introduce a new planning backup that uses only the current value of a single successor state and has a computation time independent of the number of successor states. This new backup, which we call a small backup, opens the door to a new class of model-based reinforcement learning methods that exhibit much finer control over their planning process than traditional methods. We empirically demonstrate that this increased flexibility allows for more efficient planning by showing that an implementation of prioritized sweeping based on small backups achieves a substantial performance improvement over classical implementations.
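The contrast between a full backup and a small backup can be sketched for the policy evaluation case. The bookkeeping below (storing, for each state, the successor value that was last used) is one way to realise a backup that touches a single successor; the class name, data layout, and toy numbers are assumptions for illustration rather than the paper's implementation.

```python
class SmallBackupEvaluator:
    """Policy evaluation with small backups (illustrative sketch)."""

    def __init__(self, rewards, transitions, gamma):
        # transitions[s] maps each successor state to its probability.
        self.rewards = list(rewards)
        self.transitions = transitions
        self.gamma = gamma
        # Successor values last used in each state's estimate, all 0 at start.
        self.snapshots = [{nxt: 0.0 for nxt in succ} for succ in transitions]
        # With all snapshots at 0, the consistent initial estimate is the reward.
        self.values = [float(r) for r in rewards]

    def full_backup_value(self, state):
        """Classical backup: cost proportional to the number of successors."""
        expected = sum(
            prob * self.values[nxt] for nxt, prob in self.transitions[state].items()
        )
        return self.rewards[state] + self.gamma * expected

    def small_backup(self, state, successor):
        """Update using one successor only: cost independent of the fan-out."""
        change = self.values[successor] - self.snapshots[state][successor]
        self.values[state] += self.gamma * self.transitions[state][successor] * change
        self.snapshots[state][successor] = self.values[successor]


evaluator = SmallBackupEvaluator(
    rewards=[1.0, 2.0, 4.0],
    transitions=[{1: 0.5, 2: 0.5}, {2: 1.0}, {2: 1.0}],
    gamma=0.5,
)
evaluator.small_backup(0, 1)
evaluator.small_backup(0, 2)
```

For a state without a self-loop, refreshing every successor once with small backups yields the same value as a single full backup, but the work can be split up and scheduled one successor at a time, for example by a priority queue.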

Reinforcement Learning with Partially Known World Dynamics

Authors:Christian R. Shelton
Date:2012-12-12 15:58:25

Reinforcement learning would enjoy better success on real-world problems if domain knowledge could be imparted to the algorithm by the modelers. Most problems have both hidden state and unknown dynamics. Partially observable Markov decision processes (POMDPs) allow for the modeling of both. Unfortunately, they do not provide a natural framework in which to specify knowledge about the domain dynamics. The designer must either admit to knowing nothing about the dynamics or completely specify the dynamics (thereby turning it into a planning problem). We propose a new framework called a partially known Markov decision process (PKMDP) which allows the designer to specify known dynamics while still leaving portions of the environment's dynamics unknown. The model represents not only the environment dynamics but also the agent's knowledge of the dynamics. We present a reinforcement learning algorithm for this model based on importance sampling. The algorithm incorporates planning based on the known dynamics and learning about the unknown dynamics. Our results clearly demonstrate the ability to add domain knowledge and the resulting benefits for learning.

The Arcade Learning Environment: An Evaluation Platform for General Agents

Authors:Marc G. Bellemare, Yavar Naddaf, Joel Veness, Michael Bowling
Date:2012-07-19 15:33:25

In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.

Policy Gradients with Variance Related Risk Criteria

Authors:Dotan Di Castro, Aviv Tamar, Shie Mannor
Date:2012-06-27 19:59:59

Managing risk in dynamic decision problems is of cardinal importance in many fields such as finance and process control. The most common approach to defining risk is through various variance related criteria such as the Sharpe Ratio or the standard deviation adjusted reward. It is known that optimizing many of the variance related risk criteria is NP-hard. In this paper we devise a framework for local policy gradient style algorithms for reinforcement learning for variance related criteria. Our starting point is a new formula for the variance of the cost-to-go in episodic tasks. Using this formula we develop policy gradient algorithms for criteria that involve both the expected cost and the variance of the cost. We prove the convergence of these algorithms to local minima and demonstrate their applicability in a portfolio planning problem.
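As a small worked example of the two criteria named above, the sketch below estimates them from a sample of episodic returns. The abstract is phrased in terms of costs; the use of returns, the omission of a risk-free rate in the Sharpe ratio, the population standard deviation, and the `risk_weight` parameter are assumptions made for illustration.

```python
import numpy as np


def risk_criteria(returns, risk_weight=0.5):
    """Empirical variance-related criteria for a sample of episodic returns."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    std = returns.std()  # population standard deviation (assumed convention)
    if std == 0.0:
        raise ValueError("Sharpe ratio is undefined for zero-variance returns")
    return {
        # Mean return per unit of standard deviation (no risk-free rate here).
        "sharpe_ratio": mean / std,
        # Mean return penalised by a multiple of its standard deviation.
        "std_adjusted_reward": mean - risk_weight * std,
    }


criteria = risk_criteria([1.0, 3.0], risk_weight=0.5)
```

Both criteria trade expected performance against variability, which is why they cannot be optimised by maximising the expected return alone.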

Chi-square Tests Driven Method for Learning the Structure of Factored MDPs

Authors:Thomas Degris, Olivier Sigaud, Pierre-Henri Wuillemin
Date:2012-06-27 16:20:30

SDYNA is a general framework designed to address large stochastic reinforcement learning problems. Unlike previous model-based methods in FMDPs, it incrementally learns the structure and the parameters of an RL problem using supervised learning techniques. Then, it integrates decision-theoretic planning algorithms based on FMDPs to compute its policy. SPITI is an instantiation of SDYNA that exploits ITI, an incremental decision tree algorithm, to learn the reward function and the Dynamic Bayesian Networks with local structures representing the transition function of the problem. These representations are used by an incremental version of the Structured Value Iteration algorithm. In order to learn the structure, SPITI uses Chi-Square tests to detect the independence between two probability distributions. Thus, we study the relation between the threshold used in the Chi-Square test, the size of the model built and the relative error of the value function of the induced policy with respect to the optimal value. We show that, on stochastic problems, one can tune the threshold so as to generate both a compact model and an efficient policy. Then, we show that SPITI, while keeping its model compact, uses the generalization property of its learning method to perform better than a stochastic classical tabular algorithm in a large RL problem with an unknown structure. We also introduce a new measure based on Chi-Square to qualify the accuracy of the model learned by SPITI. We qualitatively show that the generalization property in SPITI within the FMDP framework may prevent an exponential growth of the time required to learn the structure of large stochastic RL problems.

Policy Iteration for Relational MDPs

Authors:Chenggang Wang, Roni Khardon
Date:2012-06-20 15:16:29

Relational Markov Decision Processes are a useful abstraction for complex reinforcement learning problems and stochastic planning problems. Recent work developed representation schemes and algorithms for planning in such problems using the value iteration algorithm. However, exact versions of more complex algorithms, including policy iteration, have not been developed or analyzed. The paper investigates this potential and makes several contributions. First we observe two anomalies for relational representations showing that the value of some policies is not well defined or cannot be calculated for restricted representation schemes used in the literature. On the other hand, we develop a variant of policy iteration that can get around these anomalies. The algorithm includes an aspect of policy improvement in the process of policy evaluation and thus differs from the original algorithm. We show that despite this difference the algorithm converges to the optimal policy.

Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping

Authors:Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling
Date:2012-06-13 15:45:04

We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.
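A minimal sketch of the planning step in the policy evaluation setting may help. It assumes a linear model given as a matrix `F` (predicting the expected next feature vector) and a vector `b` (predicting the expected reward), and applies a TD(0)-style update to imagined transitions. The unit-basis features, the fixed rather than learned model, and the two-state chain are assumptions chosen so that the limit can be checked against the exact solution.

```python
import numpy as np


def linear_dyna_planning(F, b, features, alpha=0.5, gamma=0.9, sweeps=500):
    """Dyna-style planning with a linear model (policy evaluation sketch).

    F @ phi approximates the expected next feature vector and b @ phi the
    expected reward; each planning step is a TD(0) update on an imagined
    transition generated from that model.
    """
    theta = np.zeros(F.shape[0])
    for _ in range(sweeps):
        for phi in features:
            imagined_reward = b @ phi
            imagined_next = F @ phi
            td_error = imagined_reward + gamma * (theta @ imagined_next) - theta @ phi
            theta += alpha * td_error * phi
    return theta


# Hypothetical two-state chain with tabular (unit-basis) features:
# state 0 pays 1 and moves to the absorbing zero-reward state 1 half the time.
P = np.array([[0.5, 0.5], [0.0, 1.0]])  # P[s, s'] under the evaluated policy
r = np.array([1.0, 0.0])
F = P.T                 # column s of F is the expected next feature from state s
b = r.copy()
features = np.eye(2)    # planning is driven by unit basis vectors

theta = linear_dyna_planning(F, b, features)
exact = np.linalg.solve(np.eye(2) - 0.9 * P, r)
```

With tabular features and an exact model, the planning fixed point coincides with the true value function, which is also the least-squares solution in this special case.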

Model-Based Bayesian Reinforcement Learning in Large Structured Domains

Authors:Stephane Ross, Joelle Pineau
Date:2012-06-13 15:43:32

Model-based Bayesian reinforcement learning has generated significant interest in the AI community as it provides an elegant solution to the optimal exploration-exploitation tradeoff in classical reinforcement learning. Unfortunately, the applicability of this type of approach has been limited to small domains due to the high complexity of reasoning about the joint posterior over model parameters. In this paper, we consider the use of factored representations combined with online planning techniques, to improve scalability of these methods. The main contribution of this paper is a Bayesian framework for learning the structure and parameters of a dynamical system, while also simultaneously planning a (near-)optimal sequence of actions.

Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Authors:Arthur Guez, David Silver, Peter Dayan
Date:2012-05-14 17:20:29

Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems -- because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.

Seeing the Forest Despite the Trees: Large Scale Spatial-Temporal Decision Making

Authors:Mark Crowley, John Nelson, David L Poole
Date:2012-05-09 15:08:18

We introduce a challenging real-world planning problem where actions must be taken at each location in a spatial area at each point in time. We use forestry planning as the motivating application. In Large Scale Spatial-Temporal (LSST) planning problems, the state and action spaces are defined as the cross-products of many local state and action spaces spread over a large spatial area such as a city or forest. These problems possess state uncertainty, have complex utility functions involving spatial constraints and we generally must rely on simulations rather than an explicit transition model. We define LSST problems as reinforcement learning problems and present a solution using policy gradients. We compare two different policy formulations: an explicit policy that identifies each location in space and the action to take there; and an abstract policy that defines the proportion of actions to take across all locations in space. We show that the abstract policy is more robust and achieves higher rewards with far fewer parameters than the elementary policy. This abstract policy is also a better fit to the properties that practitioners in LSST problem domains require for such methods to be widely useful.

Variance-Based Rewards for Approximate Bayesian Reinforcement Learning

Authors:Jonathan Sorg, Satinder Singh, Richard L. Lewis
Date:2012-03-15 11:17:56

The explore-exploit dilemma is one of the central challenges in Reinforcement Learning (RL). Bayesian RL solves the dilemma by providing the agent with information in the form of a prior distribution over environments; however, full Bayesian planning is intractable. Planning with the mean MDP is a common myopic approximation of Bayesian planning. We derive a novel reward bonus that is a function of the posterior distribution over environments, which, when added to the reward in planning with the mean MDP, results in an agent which explores efficiently and effectively. Although our method is similar to existing methods when given an uninformative or unstructured prior, unlike existing methods, our method can exploit structured priors. We prove that our method results in a polynomial sample complexity and empirically demonstrate its advantages in a structured exploration task.

Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search

Authors:John Asmuth, Michael L. Littman
Date:2012-02-14 16:41:17

Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.

A Real-Time Model-Based Reinforcement Learning Architecture for Robot Control

Authors:Todd Hester, Michael Quinlan, Peter Stone
Date:2011-05-09 18:17:20

Reinforcement Learning (RL) is a method for learning decision-making tasks that could enable robots to learn and adapt to their situation on-line. For an RL algorithm to be practical for robotic control tasks, it must learn in very few actions, while continually taking those actions in real-time. Existing model-based RL methods learn in relatively few actions, but typically take too much time between each action for practical on-line learning. In this paper, we present a novel parallel architecture for model-based RL that runs in real-time by 1) taking advantage of sample-based approximate planning methods and 2) parallelizing the acting, model learning, and planning processes such that the acting process is sufficiently fast for typical robot control cycles. We demonstrate that algorithms using this architecture perform nearly as well as methods using the typical sequential architecture when both are given unlimited time, and greatly out-perform these methods on tasks that require real-time actions such as controlling an autonomous vehicle.

Dyna-H: a heuristic planning reinforcement learning algorithm applied to role-playing-game strategy decision systems

Authors:Matilde Santos, Jose Antonio Martin H., Victoria Lopez, Guillermo Botella
Date:2011-01-20 19:51:58

In a Role-Playing Game, finding optimal trajectories is one of the most important tasks. In fact, the strategy decision system becomes a key component of a game engine. Determining the way in which decisions are taken (online, batch or simulated) and the resources consumed in decision making (e.g. execution time, memory) will influence, to a major degree, the game performance. When classical search algorithms such as A* can be used, they are the very first option. Nevertheless, such methods rely on precise and complete models of the search space, and there are many interesting scenarios where their application is not possible. In those cases, model-free methods for sequential decision making under uncertainty are the best choice. In this paper, we propose a heuristic planning strategy to incorporate the ability of heuristic search in path-finding into a Dyna agent. The proposed Dyna-H algorithm, as A* does, selects branches more likely to produce outcomes than other branches. In addition, it has the advantages of being a model-free online reinforcement learning algorithm. The proposal was evaluated against the one-step Q-Learning and Dyna-Q algorithms, obtaining excellent experimental results: Dyna-H significantly outperforms both methods in all experiments. We also suggest a functional analogy between the proposed sampling from worst trajectories heuristic and the role of dreams (e.g. nightmares) in human behavior.

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Authors:Christos Dimitrakakis
Date:2009-12-26 16:32:46

There has been a lot of recent work on Bayesian methods for reinforcement learning exhibiting near-optimal online performance. The main obstacle facing such methods is that in most problems of interest, the optimal solution involves planning in an infinitely large tree. However, it is possible to obtain stochastic lower and upper bounds on the value of each tree node. This enables us to use stochastic branch and bound algorithms to search the tree efficiently. This paper proposes two such algorithms and examines their complexity in this setting.

Tree Exploration for Bayesian RL Exploration

Authors:Christos Dimitrakakis
Date:2009-02-02 22:37:23

Research in reinforcement learning has produced algorithms for optimal decision making under uncertainty that fall within two main types. The first employs a Bayesian framework, where optimality improves with increased computational time. This is because the resulting planning task takes the form of a dynamic programming problem on a belief tree with an infinite number of states. The second type employs relatively simple algorithms that are shown to suffer small regret within a distribution-free framework. This paper presents a lower bound and a high-probability upper bound on the optimal value function for the nodes in the Bayesian belief tree, which are analogous to similar bounds in POMDPs. The bounds are then used to create more efficient strategies for exploring the tree. The resulting algorithms are compared with the distribution-free algorithm UCB1, as well as a simpler baseline algorithm, on multi-armed bandit problems.
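
The distribution-free comparison algorithm mentioned here, UCB1, can be sketched in a few lines. This is the textbook index policy (empirical mean plus sqrt(2 ln t / n_i)) on a simulated Bernoulli bandit; the function name `ucb1`, the arm means, and the horizon are illustrative assumptions, and the sketch does not reproduce the paper's belief-tree bounds.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return the pull count per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: play every arm once
        else:
            # Optimism in the face of uncertainty: mean plus confidence radius.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

For example, `ucb1([0.2, 0.8], 2000)` returns a two-element list of pull counts in which the second arm should dominate.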

Nearly optimal exploration-exploitation decision thresholds

Authors:Christos Dimitrakakis
Date:2006-04-05 10:29:48

While in general trading off exploration and exploitation in reinforcement learning is hard, under some formulations relatively simple solutions exist. In this paper, we first derive upper bounds for the utility of selecting different actions in the multi-armed bandit setting. Unlike the common statistical upper confidence bounds, these explicitly link the planning horizon, uncertainty and the need for exploration. The resulting algorithm can be seen as a generalisation of the classical Thompson sampling algorithm. We experimentally test these algorithms, as well as $\epsilon$-greedy and the value of perfect information heuristics. Finally, we also introduce the idea of bagging for reinforcement learning. By employing a version of online bootstrapping, we can efficiently sample from an approximate posterior distribution.
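
As a point of reference, classical Thompson sampling for Bernoulli arms, which the abstract says its algorithm generalises, can be sketched as follows. The uniform Beta(1, 1) prior, the function name `thompson_bernoulli`, and the simulated arm means are assumptions of this example; the paper's horizon-dependent thresholds and bootstrap sampler are not reproduced.

```python
import random


def thompson_bernoulli(arm_means, horizon, seed=0):
    """Thompson sampling with Beta(1, 1) priors; return the pull count per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    wins = [0] * k
    losses = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior ...
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(k)]
        # ... and act greedily with respect to the sampled means.
        arm = max(range(k), key=samples.__getitem__)
        if rng.random() < arm_means[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[i] + losses[i] for i in range(k)]
```

Exploration here comes entirely from posterior randomness: an arm with few observations has a wide posterior and is therefore still sampled occasionally.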

Artificial Intelligence and Systems Theory: Applied to Cooperative Robots

Authors:Pedro U. Lima, Luis M. M. Custodio
Date:2004-11-08 20:41:44

This paper describes an approach to the design of a population of cooperative robots based on concepts borrowed from Systems Theory and Artificial Intelligence. The research has been developed under the SocRob project, carried out by the Intelligent Systems Laboratory at the Institute for Systems and Robotics - Instituto Superior Tecnico (ISR/IST) in Lisbon. The acronym of the project stands both for "Society of Robots" and "Soccer Robots", the case study where we are testing our population of robots. Designing soccer robots is a very challenging problem, where the robots must act not only to shoot a ball towards the goal, but also to detect and avoid static (walls, stopped robots) and dynamic (moving robots) obstacles. Furthermore, they must cooperate to defeat an opposing team. Our past and current research in soccer robotics includes cooperative sensor fusion for world modeling, object recognition and tracking, robot navigation, multi-robot distributed task planning and coordination, including cooperative reinforcement learning in cooperative and adversarial environments, and behavior-based architectures for real time task execution of cooperating robot teams.

Learning for Adaptive Real-time Search

Authors:Vadim Bulitko
Date:2004-07-06 22:18:25

Real-time heuristic search is a popular model of acting and learning in intelligent autonomous agents. Learning real-time search agents improve their performance over time by acquiring and refining a value function guiding the application of their actions. As computing the perfect value function is typically intractable, a heuristic approximation is acquired instead. Most studies of learning in real-time search (and reinforcement learning) assume that a simple value-function-greedy policy is used to select actions. This is in contrast to practice, where high performance is usually attained by interleaving planning and acting via a lookahead search of a non-trivial depth. In this paper, we take a step toward bridging this gap and propose a novel algorithm that (i) learns a heuristic function to be used specifically with a lookahead-based policy, (ii) selects the lookahead depth adaptively in each state, and (iii) gives the user control over the trade-off between exploration and exploitation. We extensively evaluate the algorithm in the sliding tile puzzle testbed, comparing it to the classical LRTA* and the more recent weighted LRTA*, bounded LRTA*, and FALCONS. Improvements of 5- to 30-fold in convergence speed are observed.
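
The classical LRTA* baseline named above can be sketched with a fixed depth-1 lookahead. This is not the adaptive-depth algorithm the abstract proposes: the open grid, unit edge costs, the zero initial heuristic, and the helper names `lrta_star_trial`, `grid_neighbors`, and `solve` are all assumptions of the example.

```python
def lrta_star_trial(neighbors, h, start, goal):
    """One LRTA* trial (depth-1 lookahead, unit edge costs).

    Updates the heuristic table h in place and returns (path, number of updates).
    """
    path, updates, s = [start], 0, start
    while s != goal:
        # Lookahead: f(n) = edge cost (1) + current heuristic of the successor.
        best = min(neighbors(s), key=lambda n: 1 + h.get(n, 0))
        f_best = 1 + h.get(best, 0)
        if f_best > h.get(s, 0):
            h[s] = f_best  # learning step: raise h(s) to the best lookahead value
            updates += 1
        s = best           # acting step: move to the most promising successor
        path.append(s)
    return path, updates


def grid_neighbors(size):
    """4-connected neighbours on an open size x size grid (hypothetical domain)."""
    def neighbors(cell):
        x, y = cell
        candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        return [(i, j) for i, j in candidates if 0 <= i < size and 0 <= j < size]
    return neighbors


def solve(size=3):
    """Repeat trials from the same start until a trial changes no heuristic value."""
    h = {}  # missing entries are read as 0, an admissible initial heuristic
    start, goal = (0, 0), (size - 1, size - 1)
    neighbors = grid_neighbors(size)
    while True:
        path, updates = lrta_star_trial(neighbors, h, start, goal)
        if updates == 0:
            return path, h
```

The number of trials that `solve` needs before a trial leaves the heuristic unchanged is one simple measure of the convergence speed that the abstract reports improving.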

Temporal plannability by variance of the episode length

Authors:Balint Takacs, Istvan Szita, Andras Lorincz
Date:2003-01-09 12:39:03

Optimization of decision problems in stochastic environments is usually concerned with maximizing the probability of achieving the goal and minimizing the expected episode length. For interacting agents in time-critical applications, learning whether subtasks (events) or the full task can be scheduled is an additional relevant issue. Moreover, there exist highly stochastic problems where the actual trajectories vary greatly from episode to episode, but completing the task takes almost the same amount of time. Identifying sub-problems of this nature may help, e.g., in planning, scheduling and segmenting Markov decision processes. In this work, formulae for the average duration as well as the standard deviation of the duration of events are derived. The emerging Bellman-type equation is a simple extension of Sobel's work (1982). Methods of dynamic programming as well as methods of reinforcement learning can be applied to our extension. A computer demonstration on a toy problem serves to highlight the principle.
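
To make the idea of a Bellman-type recursion for duration statistics concrete, the sketch below evaluates the mean and standard deviation of the time to reach a goal in a fixed Markov chain. It uses the standard first- and second-moment recursions m1(s) = 1 + E[m1(s')] and m2(s) = 1 + 2 E[m1(s')] + E[m2(s')], which are not necessarily the exact formulae of the paper; the function name `episode_length_moments` and the toy chain are assumptions.

```python
import math


def episode_length_moments(transitions, goal, sweeps=2000):
    """Mean and standard deviation of the number of steps needed to reach `goal`.

    `transitions[s]` maps each successor state to its probability under a
    fixed policy; the goal is absorbing with duration zero.
    """
    m1 = {s: 0.0 for s in transitions}  # first moment  E[T | s]
    m2 = {s: 0.0 for s in transitions}  # second moment E[T^2 | s]
    for _ in range(sweeps):
        for s, successors in transitions.items():
            if s == goal:
                continue
            e1 = sum(p * m1[s2] for s2, p in successors.items())
            e2 = sum(p * m2[s2] for s2, p in successors.items())
            # T = 1 + T', so E[T] = 1 + E[T'] and E[T^2] = 1 + 2 E[T'] + E[T'^2].
            m1[s] = 1.0 + e1
            m2[s] = 1.0 + 2.0 * e1 + e2
    std = {s: math.sqrt(max(0.0, m2[s] - m1[s] ** 2)) for s in transitions}
    return m1, std
```

For the one-state chain `{"s": {"s": 0.5, "g": 0.5}, "g": {}}` with goal `"g"`, the duration is geometric with success probability 0.5, so the mean is 2 and the standard deviation is sqrt(2).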

Searching for Plannable Domains can Speed up Reinforcement Learning

Authors:Istvan Szita, Balint Takacs, Andras Lorincz
Date:2002-12-10 22:15:25

Reinforcement learning (RL) involves sequential decision making in uncertain environments. The aim of the decision-making agent is to maximize the benefit of acting in its environment over an extended period of time. Finding an optimal policy in RL may be very slow. To speed up learning, one often-used solution is the integration of planning, for example, Sutton's Dyna algorithm, or various other methods using macro-actions. Here we suggest separating the plannable, i.e., close-to-deterministic, parts of the world and focusing planning efforts on this domain. A novel reinforcement learning method called plannable RL (pRL) is proposed here. pRL builds a simple model, which is used to search for macro-actions. The simplicity of the model makes planning computationally inexpensive. It is shown that pRL finds an optimal policy, and that plannable macro-actions found by pRL are near-optimal. As a result, it is unnecessary to try large numbers of macro-actions, which enables fast learning. The utility of pRL is demonstrated by computer simulations.