Recommendation: π π₯ π π₯
To-Do (Reading) List: π§ π¦
TOC
- READING LIST
- Emp. & ASP
- Meta-RL
- HRL
- SKILLS
- Control as Inference
- State Abstraction, Representation Learning
- Mutual Information
- DR (Domain Randomization) & sim2real
- Transfer: Generalization & Adaptation (Dynamics)
- IL (IRL)
- Offline RL
- Exploration
- Causal Inference
- Supervised RL & Goal-conditioned Policy
- Goal-relabeling & Self-imitation
- Model-based RL & world models
- Training RL & Just Fast & Embedding? & OPE(DICE)
- MARL
- Constrained RL
- Distributional RL
- Continual Learning
- Self-paced & Curriculum RL
- Foundation models
- Quadruped
- Optimization
- Galaxy Forest
- Aha
-
Empowerment — An Introduction https://arxiv.org/pdf/1310.1863.pdf π
-
Keep your options open: An information-based driving principle for sensorimotor systems
-
It measures the capacity of the agent to influence the world in a way that this influence is perceivable via the agent's sensors.
-
Concretely, we define empowerment as the maximum amount of information that an agent could send from its actuators to its sensors via the environment, reducing in the simplest case to the channel capacity of the external information channel from the agent's actuators to its sensors.
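A minimal sketch of what that definition computes in the discrete case (my own illustration, not code from the paper): the empowerment of a state is the channel capacity max_{p(a)} I(A; S') of the actuator-to-sensor channel, which the classic Blahut-Arimoto iteration can compute for a small tabular transition matrix.

```python
import numpy as np

def empowerment(P, n_iters=200, eps=1e-12):
    """P[a, s_next] = probability of observing s_next after taking action a from a fixed state."""
    n_actions = P.shape[0]
    p_a = np.full(n_actions, 1.0 / n_actions)            # source distribution over actions
    for _ in range(n_iters):
        q = p_a[:, None] * P                              # posterior q(a | s') ∝ p(a) P(s' | a)
        q /= q.sum(axis=0, keepdims=True) + eps
        log_w = (P * np.log(q + eps)).sum(axis=1)         # Blahut-Arimoto update of p(a)
        p_a = np.exp(log_w - log_w.max())
        p_a /= p_a.sum()
    q = p_a[:, None] * P
    q /= q.sum(axis=0, keepdims=True) + eps
    # empowerment = resulting mutual information I(A; S') in nats
    return float((p_a[:, None] * P * np.log((q + eps) / (p_a[:, None] + eps))).sum())

# two perfectly distinguishable actions -> empowerment log(2) nats
print(empowerment(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])))
```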
-
An individual agent or an agent population can attempt and explore only a small fraction of possible behaviors during its lifetime.
-
universal & local
-
-
What is intrinsic motivation? A typology of computational approaches
-
Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning[2015] πβ
We focussed specifically on intrinsic motivation with a reward measure known as empowerment, which requires at its core the efficient computation of the mutual information.
-
Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning
-
A survey on intrinsic motivation in reinforcement learning https://arxiv.org/abs/1908.06976 π π π π₯ π₯ π¦ π§ π§
-
Efficient Exploration via State Marginal Matching https://arxiv.org/pdf/1906.05274.pdf π
-
Empowerment: A Universal Agent-Centric Measure of Control https://uhra.herts.ac.uk/bitstream/handle/2299/1114/901241.pdf?sequence=1&isAllowed=y
πΉ On Learning Intrinsic Rewards for Policy Gradient Methods π₯
The policy-gradient updates the policy parameters to optimize the sum of the extrinsic and intrinsic rewards, while simultaneously our method updates the intrinsic reward parameters to optimize the extrinsic rewards achieved by the policy.
πΉ Adversarial Intrinsic Motivation for Reinforcement Learning π§
πΉ Evaluating Agents without Rewards πΆ
We retrospectively compute potential objectives on pre-collected datasets of agent behavior, rather than optimizing them online, and compare them by analyzing their correlations.
πΉ LEARNING ALTRUISTIC BEHAVIOURS IN REINFORCEMENT LEARNING WITHOUT EXTERNAL REWARDS π₯
We propose an altruistic agent that learns to increase the choices another agent has by preferring to maximize the number of states that the other agent can reach in its future.
πΉ Entropic Desired Dynamics for Intrinsic Control π₯
EDDICT: By situating these latent codes in a globally consistent coordinate system, we show that agents can reliably reach more states in the long term while still optimizing a local objective.
-
SMiRL: Surprise Minimizing Reinforcement Learning in Dynamic Environments https://openreview.net/pdf?id=cPZOyoDloxl π₯ π₯ π π₯ π₯ π§
In the real world, natural forces and other agents already offer unending novelty. The second law of thermodynamics stipulates ever-increasing entropy, and therefore perpetual novelty, without even requiring any active intervention.
πΉ Unsupervised Skill Discovery with Bottleneck Option Learning π π₯
On top of the linearization of environments that promotes more varied and distant state transitions, IBOL enables the discovery of diverse skills.
πΉ Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions ββ Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions πΆ β
πΉ The Viable System Model - Stafford Beer
πΉ Reinforcement Learning Generalization with Surprise Minimization πΆ β
πΉ TERRAIN RL SIMULATOR Github
πΉ POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning πΆ
It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior.
πΉ WALK THE RANDOM WALK: LEARNING TO DISCOVER AND REACH GOALS WITHOUT SUPERVISION πΆ
We use random walk to train a reachability network that predicts the similarity between two states. This reachability network is then used in building goal memory containing past observations that are diverse and well-balanced. Finally, we train a goal-conditioned policy network with goals sampled from the goal memory and reward it by the reachability network and the goal memory.
we introduce BRAXLINES, a toolkit for fast and interactive RL-driven behavior generation beyond simple reward maximization that includes COMPOSER, a programmatic API for generating continuous control environments, and a set of stable and well-tested baselines for two families of algorithms, mutual information maximization (MI-MAX) and divergence minimization (D-MIN), supporting unsupervised skill learning and distribution sketching as other modes of behavior specification.
πΉ Open-Ended Reinforcement Learning with Neural Reward Functions πΆ
We propose a different approach that uses reward functions encoded by neural networks. These are trained iteratively to reward more complex behavior.
πΉ URLB: Unsupervised Reinforcement Learning Benchmark πΆ
URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards.
-
ASP: ASYMMETRIC SELF-PLAY
πΉ INTRINSIC MOTIVATION AND AUTOMATIC CURRICULA VIA ASYMMETRIC SELF-PLAY https://arxiv.org/pdf/1703.05407.pdf [the original ASP paper] π₯ π₯ π
πΉ Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards [ASP] π₯ π π
Our method introduces an auxiliary distance-based reward based on pairs of rollouts to encourage diverse exploration. This approach effectively prevents learning dynamics from stabilizing around local optima induced by the naive distance-to-goal reward shaping and enables policies to efficiently solve sparse reward tasks.
πΉ Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm π§
πΉ Learning Goal Embeddings via Self-Play for Hierarchical Reinforcement Learning [ASP] π₯ π
πΉ Generating Automatic Curricula via Self-Supervised Active Domain Randomization [ASP]
πΉ ASYMMETRIC SELF-PLAY FOR AUTOMATIC GOAL DISCOVERY IN ROBOTIC MANIPULATION πΆ [ASP]
πΉ Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration π₯ π₯ β β
We introduce IMAGINE, an intrinsically motivated deep reinforcement learning architecture that models this ability. Such imaginative agents, like children, benefit from the guidance of a social peer who provides language descriptions. To take advantage of goal imagination, agents must be able to leverage these descriptions to interpret their imagined out-of-distribution goals.
-
PBRL (Population Based); Quality-Diversity (QD);
πΉ Effective Diversity in Population Based Reinforcement Learning πΆ
Diversity via Determinants (DvD)
πΉ "Other-Play" for Zero-Shot Coordination
zero-shot coordination: constructing AI agents that can coordinate with novel partners they have not seen before. Other-Play (OP) enhances self-play by looking for more robust strategies, exploiting the presence of known symmetries in the underlying problem.
πΉ Trajectory Diversity for Zero-Shot Coordination
TrajeDi: a differentiable objective for generating diverse reinforcement learning policies.
πΉ Illuminating search spaces by mapping elites π π₯
Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) algorithm illuminates search spaces, allowing researchers to understand how interesting attributes of solutions combine to affect performance, either positively or, equally of interest, negatively.
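A toy sketch of the MAP-Elites loop on a made-up 4-D search space (the objective, behaviour descriptor, and grid size below are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
fitness  = lambda x: -float(np.sum(x ** 2))                               # toy objective to maximize
behavior = lambda x: tuple(np.clip(((x[:2] + 1) * 5).astype(int), 0, 9))  # 10x10 descriptor grid

archive = {}                                                    # cell -> (fitness, elite solution)
for it in range(10_000):
    if archive and it > 100:
        _, parent = list(archive.values())[rng.integers(len(archive))]
        x = parent + rng.normal(0.0, 0.1, size=4)               # mutate a randomly chosen elite
    else:
        x = rng.uniform(-1.0, 1.0, size=4)                      # random bootstrap phase
    cell, f = behavior(x), fitness(x)
    if cell not in archive or f > archive[cell][0]:             # keep only each cell's best ("elite")
        archive[cell] = (f, x)

print(len(archive), "cells illuminated; best fitness", max(f for f, _ in archive.values()))
```

The archive then holds one high-performing solution per behaviour cell, which is the "illumination" of the search space the paper refers to.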
πΉ Differentiable Quality Diversity π
We present the differentiable quality diversity (DQD) problem, a special case of QD, where both the objective and measure functions are first order differentiable.
πΉ Accelerated Quality-Diversity through Massive Parallelism πΆ
We show that QD algorithms are ideal candidates to take advantage of progress in hardware acceleration. We demonstrate that QD algorithms can scale with massive parallelism to be run at interactive timescales without any significant effect on the performance.
πΉ Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization π₯
qd-pg: The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that exploits information at the time-step level to drive policies towards more diversity in a sample efficient manner.
πΉ Discovering Diverse Nearly Optimal Policies with Successor Features π₯ π₯
we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set.
πΉ Continual Auxiliary Task Learning π
we investigate a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions to improve those auxiliary predictions.
πΉ Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games π
we summarize previous concepts of diversity and work towards offering a unified measure of diversity in multi-agent open-ended learning to include all elements in Markov games, based on both Behavioral Diversity (BD) and Response Diversity (RD).
πΉ CONTINUOUSLY DISCOVERING NOVEL STRATEGIES VIA REWARD-SWITCHING POLICY OPTIMIZATION π₯ π
RSPO: When a sampled trajectory is sufficiently distinct, RSPO performs standard policy optimization with extrinsic rewards. For trajectories with high likelihood under existing policies, RSPO utilizes an intrinsic diversity reward to promote exploration.
πΉ DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization π₯
we formalize our algorithm as the combination of a diversity-constrained optimization problem and an extrinsic-reward constrained optimization problem.
πΉ POPULATION-GUIDED PARALLEL POLICY SEARCH FOR REINFORCEMENT LEARNING π₯ π
P3S: The key point is that the information of the best policy is fused in a soft manner by constructing an augmented loss function for policy update to enlarge the overall search region by the multiple learners.
πΉ Periodic Intra-Ensemble Knowledge Distillation for Reinforcement Learning πΆ
PIEKD is a learning framework that uses an ensemble of policies to act in the environment while periodically sharing knowledge amongst policies in the ensemble through knowledge distillation.
πΉ Cooperative Heterogeneous Deep Reinforcement Learning π₯
CHDRL: Global agents are off-policy agents that can utilize experiences from the other agents. Local agents are either on-policy agents or population-based evolutionary algorithms (EAs) agents that can explore the local area effectively.
πΉ General Characterization of Agents by States they Visit π
Behavioural characterizations: adopt Gaussian mixture models (GMMs).
πΉ Improving Policy Optimization with Generalist-Specialist Learning πΆ
GSL: we first train a generalist on all environment variations; when it fails to improve, we launch a large population of specialists with weights cloned from the generalist, each trained to master a selected small subset of variations. We finally resume the training of the generalist with auxiliary rewards induced by demonstrations of all specialists.
πΉ Diversity Can Be Transferred: Output Diversification for White- and Black-box Attacks π₯ π π
Output Diversified Sampling (ODS): a novel sampling strategy that attempts to maximize diversity in the target model's outputs among the generated samples.
πΉ Diversity Matters When Learning From Ensembles π₯
Our key assumption is that a distilled model should absorb as much function diversity inside the ensemble as possible.
πΉ Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation π₯
we propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers, but then those subnetworks are properly averaged for inference, giving a single student network with no additional inference cost. We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student.
πΉ Efficient Exploration using Model-Based Quality-Diversity with Gradients π₯
DA-QD-ext and GDA-QD: extends existing QD methods to use gradients for efficient exploitation and leverage perturbations in imagination for efficient exploration.
πΉ DISCOVERING UNSUPERVISED BEHAVIOURS FROM FULL-STATE TRAJECTORIES π
AURORA: a Quality-Diversity algorithm that autonomously finds behavioural characterisations.
-
A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms 2020 https://arxiv.org/pdf/1901.10912.pdf Yoshua Bengio π π₯ π₯ π π π¦ [contrative loss on causal mechanisms?]
We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately.
-
Causal Reasoning from Meta-reinforcement Learning 2019 π πΆ
-
Discovering Reinforcement Learning Algorithms https://arxiv.org/pdf/2007.08794.pdf π
This paper introduces a new meta-learning approach that discovers an entire update rule which includes both βwhat to predictβ (e.g. value functions) and βhow to learn from itβ (e.g. bootstrapping) by interacting with a set of environments.
-
Meta
πΉ Discovering Reinforcement Learning Algorithms Attempts to discover the full update rule π β
πΉ What Can Learned Intrinsic Rewards Capture? How/What value function/policy network π
β lifetime return: A finite sequence of agent-environment interactions until the end of training, as defined by the agent designer, which can consist of multiple episodes.
πΉ Discovery of Useful Questions as Auxiliary Tasks π
β Related work is good! (Prior work on auxiliary tasks in RL + GVF) π₯ π
πΉ Meta-Gradient Reinforcement Learning discount factor + bootstrapped factor π¦ β
πΉ BEYOND EXPONENTIALLY DISCOUNTED SUM: AUTOMATIC LEARNING OF RETURN FUNCTION πΆ
We research how to modify the form of the return function to enhance the learning towards the optimal policy. We propose to use a general mathematical form for return function, and employ meta-learning to learn the optimal return function in an end-to-end manner.
πΉ Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks π₯ π π₯
MAML: In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task.
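A compact sketch of that inner/outer loop (supervised-regression flavour for brevity; `tasks` is an assumed list of (support, query) tensor batches and `model` a small torch module, so this illustrates the idea rather than the paper's RL implementation):

```python
import torch
import torch.nn.functional as F

def maml_meta_loss(model, tasks, inner_lr=1e-2):
    """MAML meta-objective: loss after a single inner gradient step, averaged over tasks."""
    meta_loss = 0.0
    for (x_support, y_support), (x_query, y_query) in tasks:
        params = dict(model.named_parameters())
        # inner step: adapt a functional copy of the parameters on the support set
        inner = F.mse_loss(torch.func.functional_call(model, params, x_support), y_support)
        grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
        adapted = {name: p - inner_lr * g for (name, p), g in zip(params.items(), grads)}
        # outer objective: post-adaptation performance on the query set
        meta_loss = meta_loss + F.mse_loss(torch.func.functional_call(model, adapted, x_query), y_query)
    return meta_loss / len(tasks)
```

Calling `maml_meta_loss(...).backward()` differentiates through the inner update, which is what trains the initialization to be adaptable in one or a few gradient steps.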
πΉ BERT Learns to Teach: Knowledge Distillation with Meta Learning π
MetaDistill: We show the teacher network can learn to better transfer knowledge to the student network (i.e., learning to teach) with the feedback from the performance of the distilled student network in a meta learning framework.
πΉ Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables π₯ π₯
PEARL: Current methods rely heavily on on-policy experience, limiting their sample efficiency. They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness on sparse reward problems. We address these challenges by developing an offpolicy meta-RL algorithm that disentangles task inference and control.
πΉ Guided Meta-Policy Search π π₯ π
GMPS: We propose to learn a RL procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. Since the central meta-learner learns by imitating the solutions to the individual tasks, it can accommodate either the standard meta-RL problem setting, or a hybrid setting where some or all tasks are provided with example demonstrations.
πΉ CoMPS: Continual Meta Policy Search π₯
CoMPS continuously repeats two subroutines: learning a new task using RL and using the experience from RL to perform completely offline meta-learning to prepare for subsequent task learning.
πΉ Bootstrapped Meta-Learning π₯ π
We propose an algorithm that tackles these issues by letting the metalearner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under a chosen (pseudo-)metric.
πΉ Taming MAML: Efficient Unbiased Meta-Reinforcement Learning π π₯
TMAML: that adds control variates into gradient estimation via automatic differentiation. TMAML improves the quality of gradient estimation by reducing variance without introducing bias.
πΉ NoRML: No-Reward Meta Learning π
NoRML: The key insight underlying NoRML is that we can simultaneously learn the meta-policy and the advantage function used for adapting the meta-policy, optimizing for the ability to effectively adapt to varying dynamics.
πΉ SKILL-BASED META-REINFORCEMENT LEARNING π π₯
we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose learned skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task.
πΉ Offline Meta Learning of Exploration π§
We take a Bayesian RL (BRL) view, and seek to learn a Bayes-optimal policy from the offline data. Building on the recent VariBAD BRL approach, we develop an off-policy BRL method that learns to plan an exploration strategy based on an adaptive neural belief estimate.
-
Unsupervised Meta-Learning for Reinforcement Learning https://arxiv.org/pdf/1806.04640.pdf [Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, Sergey Levine] π π
Meta-RL shifts the human burden from algorithm to task design. In contrast, our work deals with the RL setting, where the environment dynamics provides a rich inductive bias that our meta-learner can exploit.
πΉ UNSUPERVISED LEARNING VIA META-LEARNING π βWe construct tasks from unlabeled data in an automatic way and run meta-learning over the constructed tasks.
πΉ Unsupervised Curricula for Visual Meta-Reinforcement Learning [Allan JabriΞ±; Kyle Hsu] π π§ π π₯
Yet, the aforementioned relation between skill acquisition and meta-learning suggests that they should not be treated separately.
However, relying solely on discriminability becomes problematic in environments with high-dimensional (image-based) observation spaces as it results in an issue akin to mode-collapse in the task space. This problem is further complicated in the setting we propose to study, wherein the policy data distribution is that of a meta-learner rather than a contextual policy. We will see that this can be ameliorated by specifying a hybrid discriminative-generative model for parameterizing the task distribution.
We, rather, will tolerate lossy representations as long as they capture discriminative features useful for stimulus-reward association.
πΉ On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning πΆ
Conclusion: multi-task pretraining with fine-tuning on new tasks performs equally as well, or better, than meta-RL.
πΉ Enhanced Meta Reinforcement Learning using Demonstrations in Sparse Reward Environments π₯
(EMRLD) that combines RL-based policy improvement and behavior cloning from demonstrations for task-specific adaptation.
-
Asymmetric Distribution Measure for Few-shot Learning https://arxiv.org/pdf/2002.00153.pdf π
feature representations and relation measure.
-
latent models
πΉ MELD: Meta-Reinforcement Learning from Images via Latent State Models π π β β
we leverage the perspective of meta-learning as task inference to show that latent state models can also perform meta-learning given an appropriately defined observation space.
πΉ Explore then Execute: Adapting without Rewards via Factorized Meta-Reinforcement Learning π π
based on identifying key information in the environment, independent of how this information will exactly be used to solve the task. By decoupling exploration from task execution, DREAM explores and consequently adapts to new environments, requiring no reward signal when the task is specified via an instruction.
-
model identification and experience relabeling (MIER)
πΉ Meta-Reinforcement Learning Robust to Distributional Shift via Model Identification and Experience Relabeling π π₯ β β
Our method is based on a simple insight: we recognize that dynamics models can be adapted efficiently and consistently with off-policy data, more easily than policies and value functions. These dynamics models can then be used to continue training policies and value functions for out-of-distribution tasks without using meta-reinforcement learning at all, by generating synthetic experience for the new task.
πΉ Distributionally Adaptive Meta Reinforcement Learning π₯
DiAMetR: Our framework centers on an adaptive approach to distributional robustness that trains a population of meta-policies to be robust to varying levels of distribution shift. When evaluated on a potentially shifted test-time distribution of tasks, this allows us to choose the meta-policy with the most appropriate level of robustness, and use it to perform fast adaptation.
πΉ PaCo: Parameter-Compositional Multi-Task Reinforcement Learning π₯
A policy subspace represented by a set of parameters is learned. Policies for all the single tasks lie in this subspace and can be composed by interpolating with the learned set.
-
SUB-POLICY ADAPTATION FOR HIERARCHICAL REINFORCEMENT LEARNING https://arxiv.org/pdf/1906.05862.pdf π
πΉ STOCHASTIC NEURAL NETWORKS FOR HIERARCHICAL REINFORCEMENT LEARNING
-
HIERARCHICAL RL USING AN ENSEMBLE OF PROPRIOCEPTIVE PERIODIC POLICIES https://openreview.net/pdf?id=SJz1x20cFQ π
-
LEARNING TEMPORAL ABSTRACTION WITH INFORMATION-THEORETIC CONSTRAINTS FOR HIERARCHICAL REINFORCEMENT LEARNING https://openreview.net/pdf?id=HkeUDCNFPS π₯ π
we maximize the mutual information between the latent variables and the state changes.
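A hedged sketch of that kind of objective (a DIAYN-style variational lower bound applied to state changes, my simplification rather than the paper's exact formulation): train a discriminator q_phi(z | s' - s) and reward the low-level policy with log q_phi(z | s' - s) - log p(z).

```python
import torch
import torch.nn as nn

class DeltaSkillDiscriminator(nn.Module):
    def __init__(self, state_dim, n_skills, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_skills))

    def forward(self, s, s_next):
        return self.net(s_next - s)      # logits over the latent variable z, conditioned on the state change

def intrinsic_reward(disc, s, s_next, z, n_skills):
    """Variational lower bound on I(z; s'-s): log q(z | s'-s) - log p(z), with a uniform prior p(z)."""
    log_q = torch.log_softmax(disc(s, s_next), dim=-1)
    log_q_z = log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1)
    return log_q_z + torch.log(torch.tensor(float(n_skills)))
```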
πΉ Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards π₯
HAAR: We propose an HRL framework which sets auxiliary rewards for low-level skill training based on the advantage function of the high-level policy.
-
learning representation
πΉ LEARNING SUBGOAL REPRESENTATIONS WITH SLOW DYNAMICS π π§ π₯ ββ β β
Observing that the high-level agent operates at an abstract temporal scale, we propose a slowness objective to effectively learn the subgoal representation (i.e., the high-level action space). We provide a theoretical grounding for the slowness objective. β
πΉ ACTIVE HIERARCHICAL EXPLORATION WITH STABLE SUBGOAL REPRESENTATION LEARNING π
HESS: We propose a novel regularization that contributes to both stable and efficient subgoal representation learning.
-
meta; skills
πΉ LEARNING TRANSFERABLE MOTOR SKILLS WITH HIERARCHICAL LATENT MIXTURE POLICIES π₯
our method exploits a three-level hierarchy of both discrete and continuous latent variables, to capture a set of high-level behaviours while allowing for variance in how they are executed.
πΉ Hierarchical Planning Through Goal-Conditioned Offline Reinforcement Learning π₯
HiGoC: The low-level policy is trained via offline RL. We improve the offline training to deal with out-of-distribution goals by a perturbed goal sampling process. The high-level planner selects intermediate sub-goals by taking advantages of model-based planning methods.
πΉ Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks π π₯
EMBR learns and plans using a learned model, critic, and success classifier, where the success classifier serves both as a reward function for RL and as a grounding mechanism to continuously detect if the robot should retry a skill when unsuccessful or under perturbations.
πΉ Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space π
PTP: first, a high-level planner that sets intermediate subgoals using conditional subgoal generators in the latent space for a low-level model-free policy; second, a hybrid approach which first pre-trains both the conditional subgoal generator and the policy on previously collected data through offline reinforcement learning, and then fine-tunes the policy online.
-
Latent Space Policies for Hierarchical Reinforcement Learning 2018
-
EPISODIC CURIOSITY THROUGH REACHABILITY [reward design]
In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. Such bonus is summed up with the real task reward — making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. π§ To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is done based on how many environment steps it takes to reach the current observation from those in memory — which incorporates rich information about environment dynamics. This allows us to overcome the known "couch-potato" issues of prior work — when the agent finds a way to instantly gratify itself by exploiting actions which lead to hardly predictable consequences.
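Roughly, the bonus computation reads like the sketch below (my simplification; `reachability_net(o1, o2)` is an assumed callable returning the predicted probability that o2 is reachable from o1 within a few steps):

```python
import numpy as np

class EpisodicCuriosity:
    def __init__(self, reachability_net, novelty_threshold=0.5, alpha=1.0, beta=0.5, capacity=200):
        self.net, self.threshold = reachability_net, novelty_threshold
        self.alpha, self.beta, self.capacity = alpha, beta, capacity
        self.memory = []                                      # episodic memory of observations

    def bonus(self, obs):
        if not self.memory:
            self.memory.append(obs)
            return 0.0
        # "similarity" of obs to memory = how easily it can be reached from stored observations
        reach = np.array([self.net(m, obs) for m in self.memory])
        similarity = float(np.percentile(reach, 90))          # robust aggregation over the memory
        b = self.alpha * (self.beta - similarity)             # reward only hard-to-reach (novel) observations
        if similarity < self.threshold and len(self.memory) < self.capacity:
            self.memory.append(obs)                           # store observations judged novel
        return b
```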
-
Combining Skills & KL regularized expected reward objective
πΉ INFOBOT: TRANSFER AND EXPLORATION VIA THE INFORMATION BOTTLENECK π₯ π₯ β β
By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state.
πΉ THE VARIATIONAL BANDWIDTH BOTTLENECK: STOCHASTIC EVALUATION ON AN INFORMATION BUDGET π₯ π₯ π
we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not.
πΉ the option keyboard Combing Skills in Reinforcement Learning
We argue that a more robust way of combining skills is to do so directly in the goal space, using pseudo-rewards or cumulants. If we associate each skill with a cumulant, we can combine the former by manipulating the latter. This allows us to go beyond the direct prescription of behaviors, working instead in the space of intentions. π
Others: 1. in the space of policies -- over actions; 2. manipulating the corresponding parameters.
πΉ Scaling simulation-to-real transfer by learning composable robot skills π₯ π π₯
we first use simulation to jointly learn a policy for a set of low-level skills, and a "skill embedding" parameterization which can be used to compose them.
πΉ LEARNING AN EMBEDDING SPACE FOR TRANSFERABLE ROBOT SKILLS π₯ π
our method is able to learn the skill embedding distributions, which enables interpolation between different skills as well as discovering the number of distinct skills necessary to accomplish a set of tasks.
πΉ CoMic: Complementary Task Learning & Mimicry for Reusable Skills π₯ π₯ β β
We study the problem of learning reusable humanoid skills by imitating motion capture data and joint training with complementary tasks. Related work is good!
πΉ Learning to combine primitive skills: A step towards versatile robotic manipulation π β
RL(high-level) + IM (low-level)
πΉ COMPOSABLE SEMI-PARAMETRIC MODELLING FOR LONG-RANGE MOTION GENERATION π β
Our proposed method learns to model the motion of human by combining the complementary strengths of both non-parametric techniques and parametric ones. Good EXPERIMENTS!
πΉ LEARNING TO COORDINATE MANIPULATION SKILLS VIA SKILL BEHAVIOR DIVERSIFICATION π₯ π β β
Our method consists of two parts: (1) acquiring primitive skills with diverse behaviors by mutual information maximization, and (2) learning a meta policy that selects a skill for each end-effector and coordinates the chosen skills by controlling the behavior of each skill. Related work is good!
πΉ Information asymmetry in KL-regularized RL π₯ π₯ π β β β
In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviours that help the policy learn faster.
πΉ Exploiting Hierarchy for Learning and Transfer in KL-regularized RL π π₯ π₯ π§
The KL-regularized expected reward objective constitutes a convenient tool to this end. It introduces an additional component, a default or prior behavior, which can be learned alongside the policy and as such partially transforms the reinforcement learning problem into one of behavior modelling. In this work we consider the implications of this framework in case where both the policy and default behavior are augmented with latent variables. We discuss how the resulting hierarchical structures can be exploited to implement different inductive biases and how the resulting modular structures can be exploited for transfer. Good Writing / Related-work! π
πΉ CompILE: Compositional Imitation Learning and Execution πΆ β
CompILE can successfully discover sub-tasks and their boundaries in an imitation learning setting.
πΉ Bayesian Nonparametrics for Offline Skill Discovery π₯
We propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations.
πΉ Strategic Attentive Writer for Learning Macro-Actions πΆ β
πΉ Synthesizing Programs for Images using Reinforced Adversarial Learning πΆ RL render RENDERS β
πΉ Neural Task Graphs: Generalizing to Unseen Tasks from a Single Video Demonstration π₯ π β β
The NTG networks consist of a generator that produces the conjugate task graph as the intermediate representation, and an execution engine that executes the graph by localizing node and deciding the edge transition in the task graph based on the current visual observation.
πΉ Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives π₯ π
each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world.
πΉ COMPOSING TASK-AGNOSTIC POLICIES WITH DEEP REINFORCEMENT LEARNING π β
πΉ DISCOVERING A SET OF POLICIES FOR THE WORST CASE REWARD π
the problem we are solving can be seen as the definition and discovery of lower-level policies that will lead to a robust hierarchical agent.
πΉ CONSTRUCTING A GOOD BEHAVIOR BASIS FOR TRANSFER USING GENERALIZED POLICY UPDATES π₯
We show theoretically that, under certain assumptions, having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance on all possible downstream tasks which are typically more complex than the ones on which the agent was trained.
πΉ ASPiRe: Adaptive Skill Priors for Reinforcement Learning π₯ π₯
ASPiRe includes Adaptive Weight Module (AWM) that learns to infer an adaptive weight assignment between different skill priors and uses them to guide policy learning for downstream tasks via weighted Kullback-Leibler divergences.
-
Acquiring Diverse Robot Skills via Maximum Entropy Deep Reinforcement Learning [Tuomas Haarnoja, UCB] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-176.pdf π₯ π₯ π¦ π¦
-
One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL π β
-
Evaluating Agents without Rewards π β π¦ β
-
Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills π π₯ π π₯ β
It should not aim for states where it has the most control according to its current abilities, but for states where it expects it will achieve the most control after learning.
πΉ Ensemble and Auxiliary Tasks for Data-Efficient Deep Reinforcement Learning πΆ β
we study the effects of ensemble and auxiliary tasks when combined with the deep Q-learning algorithm.
πΉ Unsupervised Skill-Discovery and Skill-Learning in Minecraft πΆ β
πΉ Variational Empowerment as Representation Learning for Goal-Based Reinforcement Learning π§ β
πΉ LEARNING MORE SKILLS THROUGH OPTIMISTIC EXPLORATION π
DISDAIN (discriminator disagreement intrinsic reward): we derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement.
πΉ Deep Reinforcement Learning at the Edge of the Statistical Precipice
πΉ LIPSCHITZ-CONSTRAINED UNSUPERVISED SKILL DISCOVERY π₯
We propose Lipschitz-constrained Skill Discovery (LSD), which encourages the agent to discover more diverse, dynamic, and far-reaching skills. LSD encourages the agent to prefer skills with larger traveled distances, unlike previous MI-based methods
πΉ THE INFORMATION GEOMETRY OF UNSUPERVISED REINFORCEMENT LEARNING π
we show that unsupervised skill discovery algorithms based on MI maximization do not learn skills that are optimal for every possible reward function. However, we show that the distribution over skills provides an optimal initialization minimizing regret against adversarially-chosen reward functions, assuming a certain type of adaptation procedure.
πΉ Unsupervised Reinforcement Learning in Multiple Environments π
we foster an exploration strategy that is sensitive to the most adverse cases within the class. Hence, we cast the exploration problem as the maximization of the mean of a critical percentile of the state visitation entropy induced by the exploration strategy over the class of environments.
πΉ CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery πΆ
CIC utilizes contrastive learning between state-transitions and skills to learn behavior embeddings and maximizes the entropy of these embeddings as an intrinsic reward to encourage behavioral diversity.
πΉ SKILL-BASED REINFORCEMENT LEARNING WITH INTRINSIC REWARD MATCHING π₯ πΆ
Intrinsic Reward Matching (IRM): We propose to leverage the skill discriminator to match the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample-efficiency.
ReSkill: accelerating exploration in the skill space using state-conditioned generative models to directly bias the high-level agent towards only sampling skills relevant to a given state based on prior experience.
πΉ Where To Start? Transferring Simple Skills to Complex Environments π₯
we introduce an affordance model based on a graph representation of an environment, which is optimised during deployment to find suitable robot configurations to start a skill from, such that the skill can be executed without any collisions.
πΉ Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review π₯ π₯ π π β
Graphical model for control as inference (Decision Making Problem and Terminology; The Graphical Model; Policy Search as Probabilistic Inference; Which Objective Does This Inference Procedure Optimize; Alternative Model Formulations);
Variational Inference and Stochastic Dynamics (Maximum Entropy RL with Fixed Dynamics; Connection to Structured VI);
Approximate Inference with Function Approximation (Maximum Entropy PG; Maximum Entropy AC Algorithms)
πΉ On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference π¦ β
emphasizes that MaxEnt RL can be viewed as minimizing a KL divergence.
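Spelled out (the standard identity behind this view, in my notation): with the "optimal" trajectory distribution and the policy's trajectory distribution defined as below, the entropy-regularized return differs from a negative KL only by a constant.

```latex
p^*(\tau) \;\propto\; p(s_1)\prod_t p(s_{t+1}\mid s_t,a_t)\,
           \exp\Big(\tfrac{1}{\alpha}\textstyle\sum_t r(s_t,a_t)\Big),
\qquad
\pi(\tau) \;=\; p(s_1)\prod_t p(s_{t+1}\mid s_t,a_t)\,\pi(a_t\mid s_t)

D_{\mathrm{KL}}\big(\pi(\tau)\,\|\,p^*(\tau)\big)
  \;=\; -\tfrac{1}{\alpha}\,\mathbb{E}_{\pi}\Big[\textstyle\sum_t
        \big(r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big] \;+\; \mathrm{const}
```

So minimizing this KL over the policy is exactly MaxEnt RL with temperature α.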
πΉ Iterative Inference Models Iterative Amortized Inference π π β β
Latent Variable Models & Variational Inference & Variational Expectation Maximization (EM) & Inference Models
πΉ MAKING SENSE OF REINFORCEMENT LEARNING AND PROBABILISTIC INFERENCE β π¦ β
πΉ Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model π π₯ β β
The main contribution of this work is a novel and principled approach that integrates learning stochastic sequential models and RL into a single method, performing RL in the model's learned latent space. By formalizing the problem as a control as inference problem within a POMDP, we show that variational inference leads to the objective of our SLAC algorithm.
πΉ On the Design of Variational RL Algorithms π π π₯ β Good design choices. π¦ π» β
Identify several settings that have not yet been fully explored, and we discuss general directions for improving these algorithms: VI details; (non-)Parametric; Uniform/Learned Prior.
πΉ VIREL: A Variational Inference Framework for Reinforcement Learning π β
existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, for example, the lack of mode capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used.
πΉ MAXIMUM A POSTERIORI POLICY OPTIMISATION π₯ π π₯
MPO based on coordinate ascent on a relative entropy objective. We show that several existing methods can directly be related to our derivation.
πΉ V-MPO: ON-POLICY MAXIMUM A POSTERIORI POLICY OPTIMIZATION FOR DISCRETE AND CONTINUOUS CONTROL π π₯ π π₯ β β β β
adapts Maximum a Posteriori Policy Optimization to the on-policy setting.
πΉ SOFT Q-LEARNING WITH MUTUAL-INFORMATION REGULARIZATION π π₯ π
In this paper, we propose a theoretically motivated framework that dynamically weights the importance of actions by using the mutual information. In particular, we express the RL problem as an inference problem where the prior probability distribution over actions is subject to optimization.
β
- Action and Perception as Divergence Minimization π₯ π β π» [the art of design] π¦ β
Representation learning for control based on bisimulation does not depend on reconstruction, but aims to group states based on their behavioral similarity in the MDP. (lil-log) π¦
πΉ Equivalence Notions and Model Minimization in Markov Decision Processes http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.2493&rep=rep1&type=pdf : refers to an equivalence relation between two states with similar long-term behavior. π
BISIMULATION METRICS FOR CONTINUOUS MARKOV DECISION PROCESSES
πΉ DeepMDP: Learning Continuous Latent Space Models for Representation Learning https://arxiv.org/pdf/1906.02736.pdf simplifies high-dimensional observations in RL tasks and learns a latent space model via minimizing two losses: prediction of rewards and prediction of the distribution over next latent states. π£ π π π₯ π£ π₯
πΉ DARLA: Improving Zero-Shot Transfer in Reinforcement Learning π₯
We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA's vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts - even with no access to the target domain.
πΉ DBC: Learning Invariant Representations for Reinforcement Learning without Reconstruction π₯ π₯ π₯
Our method trains encoders such that distances in latent space equal bisimulation distances in state space. PSM: r(s,a) ---> pi(a|s)
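A hedged sketch of that training signal (a DBC-style encoder loss in my own simplified form; a latent dynamics model predicting diagonal-Gaussian next-latents is assumed):

```python
import torch
import torch.nn.functional as F

def bisimulation_loss(z, reward, next_mu, next_sigma, gamma=0.99):
    """z: (B, D) encoded states; reward: (B,); next_mu/next_sigma: predicted next-latent Gaussian params."""
    perm = torch.randperm(z.size(0))                       # compare each sample with a random partner
    z2, r2 = z[perm], reward[perm]
    mu2, sigma2 = next_mu[perm], next_sigma[perm]
    latent_dist = (z - z2).abs().sum(dim=-1)               # L1 distance between latent codes
    # 2-Wasserstein distance between the two predicted diagonal-Gaussian next-latent distributions
    w2 = torch.sqrt(((next_mu - mu2) ** 2 + (next_sigma - sigma2) ** 2).sum(dim=-1) + 1e-8)
    target = (reward - r2).abs() + gamma * w2              # bisimulation-style distance target
    return F.mse_loss(latent_dist, target.detach())        # make latent distances match it
```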
πΉ Towards Robust Bisimulation Metric Learning π π₯
we generalize value function approximation bounds for on-policy bisimulation metrics to non-optimal policies and approximate environment dynamics. Our theoretical results help us identify embedding pathologies that may occur in practical use. In particular, we find that these issues stem from an underconstrained dynamics model and an unstable dependence of the embedding norm on the reward signal in environments with sparse rewards.
πΉ TASK-INDUCED REPRESENTATION LEARNING π
We formalize the problem of task-induced representation learning (TARP), which aims to leverage such task information in offline experience from prior tasks for learning compact representations that focus on modelling only task-relevant aspects.
πΉ LEARNING GENERALIZABLE REPRESENTATIONS FOR REINFORCEMENT LEARNING VIA ADAPTIVE METALEARNER OF BEHAVIORAL SIMILARITIES π π₯
AMBS (Adaptive Meta-learner of Behavioral Similarities): A pair of meta-learners is developed, one of which quantifies reward similarity and the other of which quantifies dynamics similarity over the correspondingly decomposed embeddings. The meta-learners are self-learned to update the state embeddings by approximating two disjoint terms in the on-policy bisimulation metric.
πΉ LEARNING INVARIANT FEATURE SPACES TO TRANSFER SKILLS WITH REINFORCEMENT LEARNING https://arxiv.org/pdf/1703.02949.pdf π₯ π
differ in state-space, action-space, and dynamics.
Our method uses the skills that were learned by both agents to train invariant feature spaces that can then be used to transfer other skills from one agent to another.
πΉ UIUC: CS 598 Statistical Reinforcement Learning (S19) Nan Jiang ππ¦ π¦
πΉ CONTRASTIVE BEHAVIORAL SIMILARITY EMBEDDINGS FOR GENERALIZATION IN REINFORCEMENT LEARNING π₯ π¦ β
β π π β Representation learning. β π π β
πΉ Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels? π₯
we fail to find a single self-supervised loss or a combination of multiple SSL methods that consistently improve RL under the existing joint learning framework with image augmentation.
πΉ CURL: Contrastive Unsupervised Representations for Reinforcement Learning π₯ π π§ β β
πΉ Denoised MDPs: Learning World Models Better Than the World Itself π₯ π
This framework clarifies the kinds of information (controllable and reward-relevant) removed by various prior work on representation learning in reinforcement learning (RL), and leads to our proposed approach of learning a Denoised MDP that explicitly factors out certain noise distractors.
πΉ MASTERING VISUAL CONTINUOUS CONTROL: IMPROVED DATA-AUGMENTED REINFORCEMENT LEARNING π₯
DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels.
πΉ Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation π
By only applying augmentation in Q-value estimation of the current state, without augmenting Q-targets used for bootstrapping, SVEA circumvents erroneous bootstrapping caused by data augmentation.
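The summary above translates into a critic update of roughly this shape (a sketch under my assumptions; `augment`, `critic`, `critic_target`, and `policy` are placeholders, not the authors' code):

```python
import torch
import torch.nn.functional as F

def svea_critic_loss(critic, critic_target, policy, augment,
                     obs, action, reward, next_obs, gamma=0.99, alpha=0.5, beta=0.5):
    with torch.no_grad():                                # bootstrap target built from CLEAN observations only
        target_q = reward + gamma * critic_target(next_obs, policy(next_obs))
    q_clean = critic(obs, action)                        # clean branch
    q_aug = critic(augment(obs), action)                 # augmented branch regressed to the same target
    return alpha * F.mse_loss(q_clean, target_q) + beta * F.mse_loss(q_aug, target_q)
```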
πΉ Sim-to-Real via Sim-to-Sim: Data-efficient Robotic Grasping via Randomized-to-Canonical Adaptation Networks π π₯
Our method learns to translate randomized rendered images into their equivalent non-randomized, canonical versions. This in turn allows for real images to also be translated into canonical sim images.
πΉ Time-contrastive networks: Self-supervised learning from video π₯ β
πΉ Data-Efficient Reinforcement Learning with Self-Predictive Representations π₯ β β β
SPR:
πΉ Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning π₯ π₯
VCR: Instead of aligning this imagined state with a real state returned by the environment, VCR applies a Q-value head on both states and obtains two distributions of action values. Then a distance is computed and minimized to force the imagined state to produce a similar action value prediction as that by the real state.
πΉ Intrinsically Motivated Self-supervised Learning in Reinforcement Learning π π₯ β
employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). Decomposition and Interpretation of Contrastive Loss.
πΉ INFORMATION PRIORITIZATION THROUGH EMPOWERMENT IN VISUAL MODEL-BASED RL π π₯
InfoPower: We propose a modified objective for model-based RL that, in combination with mutual information maximization, allows us to learn representations and dynamics for visual model-based RL without reconstruction in a way that explicitly prioritizes functionally relevant factors.
πΉ PlayVirtual: Augmenting Cycle-Consistent Virtual Trajectories for Reinforcement Learning π
PlayVirtual predicts future states in a latent space based on the current state and action by a dynamics model and then predicts the previous states by a backward dynamics model, which forms a trajectory cycle. Based on this, we augment the actions to generate a large amount of virtual state-action trajectories.
πΉ Masked World Models for Visual Control πΆ
We train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder.
πΉ EMI: Exploration with Mutual Information π
We propose EMI, which is an exploration method that constructs embedding representation of states and actions that does not rely on generative decoding of the full observation but extracts predictive signals that can be used to guide exploration based on forward prediction in the representation space.
πΉ Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning π π₯ π
The forward prediction encourages the agent state to move away from collapsing in order to accurately predict future random projections of observations. Similarly, the reverse prediction encourages the latent observation away from collapsing in order to accurately predict the random projection of a full history. As we continue to train forward and reverse predictions, this seems to result in a virtuous cycle that continuously enriches both representations.
πΉ Unsupervised Domain Adaptation with Shared Latent Dynamics for Reinforcement Learning π
The model achieves the alignment between the latent codes via learning shared dynamics for different environments and matching marginal distributions of latent codes.
πΉ RETURN-BASED CONTRASTIVE REPRESENTATION LEARNING FOR REINFORCEMENT LEARNING π π₯
Our auxiliary loss is theoretically justified to learn representations that capture the structure of a new form of state-action abstraction, under which state-action pairs with similar return distributions are aggregated together. Related work: AUXILIARY TASK + ABSTRACTION.
πΉ Representation Matters: Offline Pretraining for Sequential Decision Making π
πΉ SELF-SUPERVISED POLICY ADAPTATION DURING DEPLOYMENT π π₯ β
Test-time training (TTT): our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards.
πΉ Test-Time Training with Masked Autoencoders π₯
Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem.
πΉ MEMO: Test Time Robustness via Adaptation and Augmentation π₯
MEMO: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model's average, or marginal, output distribution across the augmentations. Intuitively, this objective encourages the model to make the same prediction across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions.
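As a sketch (my own minimal version, assuming a classifier `model`, a stochastic `augment` function, and an `optimizer` over all model parameters):

```python
import torch

def memo_adapt(model, x, optimizer, augment, n_aug=32, steps=1):
    """Adapt on a single test input by minimizing the entropy of the marginal prediction over augmentations."""
    model.train()
    for _ in range(steps):
        probs = torch.stack([torch.softmax(model(augment(x)), dim=-1) for _ in range(n_aug)])
        marginal = probs.mean(dim=0)                                   # average (marginal) output distribution
        entropy = -(marginal * torch.log(marginal + 1e-8)).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return model
```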
πΉ What Makes for Good Views for Contrastive Learning? π π₯ π₯ π
we should reduce the mutual information (MI) between views while keeping task-relevant information intact.
πΉ SELF-SUPERVISED LEARNING FROM A MULTI-VIEW PERSPECTIVE π π₯ β β
Demystifying Self-Supervised Learning: An Information-Theoretical Framework.
πΉCONTRASTIVE BEHAVIORAL SIMILARITY EMBEDDINGS FOR GENERALIZATION IN REINFORCEMENT LEARNING π π
policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar.
We propose a new theoretically-motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using generalized value functions.
πΉ Invariant Causal Prediction for Block MDPs π¦
State Abstractions and Bisimulation; Causal Inference Using Invariant Prediction;
πΉ Learning Domain Invariant Representations in Goal-conditioned Block MDPs
πΉ CAUSAL INFERENCE Q-NETWORK: TOWARD RESILIENT REINFORCEMENT LEARNING π
In this paper, we consider a resilient DRL framework with observational interferences.
πΉ Decoupling Value and Policy for Generalization in Reinforcement Learning π π₯ β
Invariant Decoupled Advantage Actor-Critic (IDAAC): First, IDAAC decouples the optimization of the policy and value function, using separate networks to model them. Second, it introduces an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment.
πΉ Robust Deep Reinforcement Learning against Adversarial Perturbations on State Obs π₯ π π§ β β
We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled policy regularization which can be applied to a large family of DRL algorithms.
πΉ Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning π¦ β
πΉ ROBUST REINFORCEMENT LEARNING ON STATE OBSERVATIONS WITH LEARNED OPTIMAL ADVERSARY
πΉ Loss is its own Reward: Self-Supervision for Reinforcement Learning π π₯ β
To augment reward, we consider a range of selfsupervised tasks that incorporate states, actions, and successors to provide auxiliary losses.
πΉ Unsupervised Learning of Visual 3D Keypoints for Control π π₯
Motivation: most of these representations, whether structured or unstructured, are learned in a 2D space even though the control tasks are usually performed in a 3D environment.
πΉ Which Mutual-Information Representation Learning Objectives are Sufficient for Control? π π₯ π
we formalize the sufficiency of a state representation for learning and representing the optimal policy, and study several popular mutual-information based objectives through this lens. β
πΉ Towards a Unified Theory of State Abstraction for MDPs π π₯π π§ β β β
We provide a unified treatment of state abstraction for Markov decision processes. We study five particular abstraction schemes.
πΉ Learning State Abstractions for Transfer in Continuous Control π₯ β
Our main contribution is a learning algorithm that abstracts a continuous state-space into a discrete one. We transfer this learned representation to unseen problems to enable effective learning.
πΉ Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised Deep Reinforcement Learning π π₯
we contribute a new multi-modal deep latent state-space model, trained using a mutual information lower-bound.
πΉ LEARNING ACTIONABLE REPRESENTATIONS WITH GOAL-CONDITIONED POLICIES π π β β
Aim to capture those factors of variation that are important for decision making — that are "actionable." These representations are aware of the dynamics of the environment, and capture only the elements of the observation that are necessary for decision making rather than all factors of variation.
πΉ Adaptive Auxiliary Task Weighting for Reinforcement Learning π
Dynamically combines different auxiliary tasks to speed up training for reinforcement learning: Our method is based on the idea that auxiliary tasks should provide gradient directions that, in the long term, help to decrease the loss of the main task.
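One simple instantiation of that idea (a hedged sketch, not the paper's exact update rule): weight each auxiliary gradient by its non-negative cosine similarity with the main-task gradient before applying it.

```python
import torch
import torch.nn.functional as F

def apply_weighted_gradients(model, main_loss, aux_losses):
    params = [p for p in model.parameters() if p.requires_grad]
    g_main = torch.autograd.grad(main_loss, params, retain_graph=True)
    flat_main = torch.cat([g.flatten() for g in g_main])
    combined = [g.clone() for g in g_main]
    for aux_loss in aux_losses:
        g_aux = torch.autograd.grad(aux_loss, params, retain_graph=True)
        flat_aux = torch.cat([g.flatten() for g in g_aux])
        # keep an auxiliary gradient only to the extent it points the same way as the main one
        w = torch.clamp(F.cosine_similarity(flat_main, flat_aux, dim=0), min=0.0)
        combined = [c + w * g for c, g in zip(combined, g_aux)]
    for p, g in zip(params, combined):
        p.grad = g                                   # then call optimizer.step() as usual
```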
πΉ Scalable methods for computing state similarity in deterministic Markov Decision Processes
Computing and approximating bisimulation metrics in large deterministic MDPs.
πΉ Value Preserving State-Action Abstractions π
We proved which state-action abstractions are guaranteed to preserve representation of high value policies. To do so, we introduced φ-relative options, a simple but expressive formalism for combining state abstractions with options.
πΉ Learning Markov State Abstractions for Deep Reinforcement Learning π π π§
We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions.
πΉ Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL π π§
CTRL: We posit that a superior encoder for zero-shot generalization in RL can be trained by using solely an auxiliary SSL objective if the training process encourages the encoder to map behaviorally similar observations to similar representations.
πΉ Jointly-Learned State-Action Embedding for Efficient Reinforcement Learning πΆ
We establish the theoretical foundations for the validity of training a rl agent using embedded states and actions. We then propose a new approach for jointly learning embeddings for states and actions that combines model-free and model-based rl.
πΉ Metrics and continuity in reinforcement learning π
We introduce a unified formalism for defining these topologies through the lens of metrics. We establish a hierarchy amongst these metrics and demonstrate their theoretical implications on the Markov Decision Process specifying the rl problem.
πΉ Environment Shaping in Reinforcement Learning using State Abstraction π¦ π
Our key idea is to compress the environment's large state space with noisy signals to an abstracted space, and to use this abstraction in creating smoother and more effective feedback signals for the agent. We study the theoretical underpinnings of our abstraction-based environment shaping, and show that the agent's policy learnt in the shaped environment preserves near-optimal behavior in the original environment.
πΉ A RELATIONAL INTERVENTION APPROACH FOR UNSUPERVISED DYNAMICS GENERALIZATION IN MODEL-BASED REINFORCEMENT LEARNING π₯ π
Because environments are not labelled, the extracted information inevitably contains redundant information unrelated to the dynamics in transition segments and thus fails to maintain a crucial property of Z: Z should be similar in the same environment and dissimilar in different ones. We introduce an interventional prediction module to estimate the probability of two estimated z_i, z_j belonging to the same environment.
πΉ Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL π₯
We propose Cross Trajectory Representation Learning (CTRL), a method that runs within an RL agent and conditions its encoder to recognize behavioral similarity in observations by applying a novel SSL objective to pairs of trajectories from the agent's policies.
πΉ Bayesian Imitation Learning for End-to-End Mobile Manipulation π
We show that using the Variational Information Bottleneck to regularize convolutional neural networks improves generalization to held-out domains and reduces the sim-to-real gap in a sensor-agnostic manner. As a side effect, the learned embeddings also provide useful estimates of model uncertainty for each sensor.
πΉ Control-Aware Representations for Model-based Reinforcement Learning π π π₯ π₯
CARL: How to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control: We first formulate a learning controllable embedding (LCE) model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that has close connection to the prediction, consistency, and curvature (PCC) principle for representation learning.
πΉ Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images π₯
E2C: Embed to Control (E2C) consists of a deep generative model, belonging to the family of variational autoencoders, that learns to generate image trajectories from a latent space in which the dynamics is constrained to be locally linear.
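"Locally linear" here means the latent transition is constrained to the form below (standard E2C-style notation; A, B, o are network outputs that depend on the current latent code):

```latex
\hat z_{t+1} \;=\; A(z_t)\,z_t \;+\; B(z_t)\,u_t \;+\; o(z_t)
```

which is what allows iLQR/iLQG-style planners to be run directly in the learned latent space.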
πΉ Robust Locally-Linear Controllable Embedding π₯
RCE: We propose a principled variational approximation of the embedding posterior that takes the future observation into account and thus makes the variational approximation more robust against the noise.
πΉ SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning πΆ
SOLAR: we present a method for learning representations that are suitable for iterative model-based policy improvement.
πΉ DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION π
Dreamer: learns long-horizon behaviors by latent imagination, predicting both actions and state values.
πΉ Learning Task Informed Abstractions πΆ
Task Informed Abstractions (TIA) explicitly separate reward-correlated visual features from distractors.
πΉ PREDICTION, CONSISTENCY, CURVATURE: REPRESENTATION LEARNING FOR LOCALLY-LINEAR CONTROL π π₯ π
PCC: We propose the Prediction, Consistency, and Curvature (PCC) framework for learning a latent space that is amenable to locally-linear control (LLC) algorithms and show that the elements of PCC arise systematically from bounding the suboptimality of the solution of the LLC algorithm in the latent space.
πΉ Predictive Coding for Locally-Linear Control π₯ π
PC3: we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. We then use predictive coding to develop a decoder-free LCE model whose latent dynamics are amenable to locally-linear control.
πΉ Robust Predictable Control π₯ π π§
RPC: Our objective differs from prior work by compressing sequences of observations, resulting in a method that jointly trains a policy and a model to be self-consistent.
πΉ Representation Gap in Deep Reinforcement Learning π
We propose Policy Optimization from Preventing Representation Overlaps (POPRO), which regularizes the policy evaluation phase by encouraging the representation of the action-value function to differ from that of its target.
πΉ TRANSFER RL ACROSS OBSERVATION FEATURE SPACES VIA MODEL-BASED REGULARIZATION π₯ π₯
We propose to learn a latent dynamics model in the source task and transfer the model to the target task to facilitate representation learning (+ theoretical analysis).
πΉ Sample-Efficient Reinforcement Learning in the Presence of Exogenous Information π₯
ExoMDP: the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner's actions, but evolves in an arbitrary, temporally correlated fashion.
πΉ Stabilizing Off-Policy Deep Reinforcement Learning from Pixels π
A-LIX: [poster]
πΉ Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning π
we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled representations using the sequential nature of RL observations.
πΉ R3M: A Universal Visual Representation for Robot Manipulation
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks.
πΉ PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning π₯ π
We propose a multi-task inverse reinforcement learning (IRL) algorithm, called inverse temporal difference learning (ITD), that learns shared state features, alongside peragent successor features and preference vectors, purely from demonstrations without reward labels. We further ...
πΉ LOOK WHERE YOU LOOK! SALIENCY-GUIDED Q-NETWORKS FOR VISUAL RL TASKS π₯ π
SGQN: a good visual policy should be able to identify which pixels are important for its decision, and preserve this identification of important sources of information across images.
πΉ Improving Deep Learning Interpretability by Saliency Guided Training π₯ π
Saliency Guided Training: Our saliency guided training procedure iteratively masks features with small and potentially noisy gradients while maximizing the similarity of model outputs for both masked and unmasked inputs.
We hypothesize that adversarial training can eliminate shortcut features whereas saliency guided training can filter out non-relevant features; both are nuisance features accounting for the performance degradation on OOD test sets.
πΉ VISFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives π₯ π
VISFIS: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility).
πΉ Concept Embedding Models π
CEM: we propose Concept Embedding Models, a novel family of concept bottleneck models which goes beyond the current accuracy-vs-interpretability trade-off by learning interpretable highdimensional concept representations.
πΉ Invariance Through Latent Alignment π₯
ILA performs unsupervised adaptation at deployment-time by matching the distribution of latent features on the target domain to the agentβs prior experience, without relying on paired data.
πΉ Does Zero-Shot Reinforcement Learning Exist? π₯ π π§
Strategies for approximate zero-shot RL have been suggested using successor features (SFs) or forward-backward (FB) representations, but testing has been limited.
πΉ Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint π§
πΉ Learning One Representation to Optimize All Rewards π₯ π π₯
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori.
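A hedged sketch of the forward-backward factorization (the symbols F, B, ρ, M and the reward embedding z_r follow the common presentation of FB representations and may differ in detail from the paper):

```latex
% Successor measure factorized into forward and backward maps; a reward given
% a posteriori is turned into a task embedding z_r, which indexes the policy.
M^{\pi_z}(s_0, a_0, \mathrm{d}s') \;\approx\; F(s_0, a_0, z)^{\top} B(s')\, \rho(\mathrm{d}s'),
\qquad
z_r = \mathbb{E}_{s \sim \rho}\!\left[ r(s)\, B(s) \right],
\qquad
\pi_z(s) = \arg\max_a F(s, a, z)^{\top} z .
```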
πΉ TOWARDS UNIVERSAL VISUAL REWARD AND REPRESENTATION VIA VALUE-IMPLICIT PRE-TRAINING π π₯ π π₯ π§
Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks.
πΉ VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning π§
πΉ CLOUD: Contrastive Learning of Unsupervised Dynamics π
CLOUD: we train a forward dynamics model and an inverse dynamics model in the feature space of states and actions with data collected from random exploration.
πΉ Object-Category Aware Reinforcement Learning π₯ π
OCARL consists of three parts: (1) a category-aware unsupervised object discovery (category-aware UOD) module, (2) an object-category aware perception (OCAP) module, and (3) an object-centric modular reasoning (OCMR) module.
πΉ MINE: Mutual Information Neural Estimation ππ§ π₯ f-gan & mine π¦
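For reference, the Donsker-Varadhan lower bound that MINE optimizes with a neural critic T_θ (a standard statement written from memory, not copied from the paper):

```latex
% MI is lower-bounded by the gap between the critic's mean under the joint
% distribution and the log-mean-exp under the product of marginals.
I(X;Z) \;\ge\; \sup_{\theta}\;
\mathbb{E}_{p(x,z)}\!\left[T_\theta(x,z)\right]
\;-\; \log \mathbb{E}_{p(x)p(z)}\!\left[e^{T_\theta(x,z)}\right].
```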
πΉ IMPROVING MUTUAL INFORMATION ESTIMATION WITH ANNEALED AND ENERGY-BASED BOUNDS
Multi-Sample Annealed Importance Sampling (AIS):
πΉ C-MI-GAN : Estimation of Conditional Mutual Information Using MinMax Formulation π π₯ β β
πΉ Deep InfoMax: LEARNING DEEP REPRESENTATIONS BY MUTUAL INFORMATION ESTIMATION AND MAXIMIZATION ππ§ β
πΉ ON MUTUAL INFORMATION MAXIMIZATION FOR REPRESENTATION LEARNING π¦ π β
πΉ Deep Reinforcement and InfoMax Learning π¦ π π π§
Our work is based on the hypothesis that a model-free agent whose representations are predictive of properties of future states (beyond expected rewards) will be more capable of solving and adapting to new RL problems, and in a way, incorporate aspects of model-based learning.
πΉ Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning π
New: Unpacking Information Bottlenecks: Surrogate Objectives for Deep Learning
πΉ Opening the black box of Deep Neural Networks via Information
β β UYANG:
πΉ βSelf-Supervised Representation Learning From Multi-Domain Data, π₯ π π
The proposed mutual information constraints encourage the neural network to extract common invariant information across domains and to preserve the peculiar information of each domain simultaneously. We adopt tractable upper and lower bounds of mutual information to make the proposed constraints solvable.
πΉ βUnsupervised Domain Adaptation via Regularized Conditional Alignment, π₯ π
Joint alignment ensures that not only the marginal distributions of the domains are aligned, but the labels as well.
πΉ βDomain Adaptation with Conditional Distribution Matching and Generalized Label Shift, π₯ π₯ π₯ π¦
In this paper, we extend a recent upper-bound on the performance of adversarial domain adaptation to multi-class classification and more general discriminators. We then propose generalized label shift (GLS) as a way to improve robustness against mismatched label distributions. GLS states that, conditioned on the label, there exists a representation of the input that is invariant between the source and target domains.
πΉ βLearning to Learn with Variational Information Bottleneck for Domain Generalization,
Through episodic training, MetaVIB learns to gradually narrow domain gaps to establish domain-invariant representations, while simultaneously maximizing prediction accuracy.
πΉ βDeep Domain Generalization via Conditional Invariant Adversarial Networks, π
πΉ βOn Learning Invariant Representation for Domain Adaptation π₯ π₯ π¦
πΉ βGENERALIZING ACROSS DOMAINS VIA CROSS-GRADIENT TRAINING π₯ π π
In contrast, in our setting, we wish to avoid any such explicit domain representation, appealing instead to the power of deep networks to discover implicit features. We also argue that even if such overfitting could be avoided, we do not necessarily want to wipe out domain signals, if it helps in-domain test instances.
πΉ βIn Search of Lost Domain Generalization πΆ
πΉ DIRL: Domain-Invariant Representation Learning for Sim-to-Real Transfer π¦ β
β β self-supervised learning
πΉ Bootstrap Your Own Latent A New Approach to Self-Supervised Learning π₯ π
Related work is good! β β
πΉ Model-Based Relative Entropy Stochastic Search
MORE:
πΉ Efficient Gradient-Free Variational Inference using Policy Search
VIPS: Our method establishes information-geometric trust regions to ensure efficient exploration of the sampling space and stability of the GMM updates, allowing for efficient estimation of multi-variate Gaussian variational distributions.
πΉ EXPECTED INFORMATION MAXIMIZATION USING THE I-PROJECTION FOR MIXTURE DENSITY ESTIMATION π₯
EIM: we present a new algorithm called Expected Information Maximization (EIM) for computing the I-projection solely based on samples for general latent variable models.
πΉ An Information-theoretic Approach to Distribution Shifts π§
πΉ An Asymmetric Contrastive Loss for Handling Imbalanced Datasets π₯ π
we propose the asymmetric focal contrastive loss (AFCL) as a further generalization of both ACL and focal contrastive loss (FCL).
πΉ Active Domain Randomization http://proceedings.mlr.press/v100/mehta20a/mehta20a.pdf π₯ π₯ π₯
Our method looks for the most informative environment variations within the given randomization ranges by leveraging the discrepancies of policy rollouts in randomized and reference environment instances. We find that training more frequently on these instances leads to better overall agent generalization.
Domain Randomization; Stein Variational Policy Gradient;
Bhairav Mehta On Learning and Generalization in Unstructured Task Spaces π¦ π¦
πΉ VADRA: Visual Adversarial Domain Randomization and Augmentation π₯ π generative + learner
πΉ Which Training Methods for GANs do actually Converge? π π§ ODE: GAN
πΉ Robust Adversarial Reinforcement Learning πΆ β
Robust Adversarial Reinforcement Learning (RARL), jointly trains a pair of agents, a protagonist and an adversary, where the protagonist learns to fulfil the original task goals while being robust to the disruptions generated by its adversary.
πΉ Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience πΆ β
πΉ POLICY TRANSFER WITH STRATEGY OPTIMIZATION πΆ β
The key idea is that, instead of learning a single policy in the simulation, we simultaneously learn a family of policies that exhibit different behaviors. When tested in the target environment, we directly search for the best policy in the family based on the task performance, without the need to identify the dynamic parameters.
πΉ https://lilianweng.github.io/lil-log/2019/05/05/domain-randomization.html π¦
πΉ THE INGREDIENTS OF REAL-WORLD ROBOTIC REINFORCEMENT LEARNING πΆ β
πΉ ROBUST REINFORCEMENT LEARNING ON STATE OBSERVATIONS WITH LEARNED OPTIMAL ADVERSARY π
To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework.
πΉ SELF-SUPERVISED POLICY ADAPTATION DURING DEPLOYMENT π π₯
Our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards.
πΉ SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning π
identifying a hybrid physics simulator to match the simulated trajectories to the ones from the target domain, using a learned discriminative loss to address the limitations associated with manual loss design. Our hybrid simulator combines neural networks and traditional physics simulation to balance expressiveness and generalizability, and alleviates the need for a carefully selected parameter set in System ID.
πΉ Generalization of Reinforcement Learning with Policy-Aware Adversarial Data Augmentation πΆ
Our proposed method adversarially generates new trajectory data based on the policy gradient objective and aims to more effectively increase the RL agent's generalization ability with the policy-aware data augmentation.
πΉ Understanding Domain Randomization for Sim-to-real Transfer π π§
We provide sharp bounds on the sim-to-real gap: the difference between the value of the policy returned by domain randomization and the value of an optimal policy for the real world.
π³οΈ see more robustness in model-based setting
πΉ EPOPT: LEARNING ROBUST NEURAL NETWORK POLICIES USING MODEL ENSEMBLES πΆ
Our method provides for training of robust policies, and supports an adversarial training regime designed to provide good direct-transfer performance. We also describe how our approach can be combined with Bayesian model adaptation to adapt the source domain ensemble to a target domain using a small amount of target domain experience.
πΉ Action Robust Reinforcement Learning and Applications in Continuous Control π₯ π§
We have presented two new criteria for robustness, the Probabilistic and Noisy action Robust MDP, each related to real-world scenarios of uncertainty, and discussed the theoretical differences between both approaches.
πΉ Robust Policy Learning over Multiple Uncertainty Sets π§
System Identification and Risk-Sensitive Adaptation (SIRSA):
πΉ βSim: DIFFERENTIABLE SIMULATION FOR SYSTEM IDENTIFICATION AND VISUOMOTOR CONTROL π π§
πΉ RISP: RENDERING-INVARIANT STATE PREDICTOR WITH DIFFERENTIABLE SIMULATION AND RENDERING FOR CROSS-DOMAIN PARAMETER ESTIMATION π₯ π π
This work considers identifying parameters characterizing a physical system's dynamic motion directly from a video whose rendering configurations are inaccessible. Our core idea is to train a rendering-invariant state-prediction (RISP) network that transforms image differences into state differences independent of rendering configurations.
πΉ Sim and Real: Better Together π₯ π§
By separating the rate of collecting samples from each environment and the rate of choosing samples for the optimization process, we were able to achieve a significant reduction in the number of real-environment samples, compared to the common strategy of using the same rate for both collection and optimization phases.
πΉ Online Robust Reinforcement Learning with Model Uncertainty π
We develop a sample-based approach to estimate the unknown uncertainty set, and design a robust Q-learning algorithm (tabular case) and a robust TDC algorithm (function approximation setting).
πΉ Robust Deep Reinforcement Learning through Adversarial Loss π₯ π₯
RADIAL-RL: Construct a strict upper bound of the perturbed standard loss; design a regularizer to minimize the overlap between output bounds of actions with large differences in outcome.
πΉ Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum π§
πΉ Robust Reinforcement Learning using Offline Data π₯ π
This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm.
β
πΉ Automatic Data Augmentation for Generalization in Deep Reinforcement Learning π π β β β
Across different visual inputs (with the same semantics), dynamics, or other environment structures
πΉ Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels π
πΉ Fast Adaptation to New Environments via Policy-Dynamics Value Functions π₯ π₯ π β
PD-VF explicitly estimates the cumulative reward in a space of policies and environments.
πΉ Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers π₯ π₯ π π§
DARC: The main contribution of this work is an algorithm for domain adaptation to dynamics changes in RL, based on the idea of compensating for differences in dynamics by modifying the reward function. This algorithm does not need to estimate transition probabilities, but rather modifies the reward function using a pair of classifiers.
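A minimal sketch of the classifier-based reward correction described above; `clf_sas` and `clf_sa` are hypothetical callables returning p(target | ...) from two learned domain classifiers, not names from the authors' code:

```python
import numpy as np

def darc_reward_correction(clf_sas, clf_sa, s, a, s_next, eps=1e-8):
    """Delta reward added to the source (simulator) reward:
       log p(target|s,a,s') - log p(source|s,a,s')
     - [log p(target|s,a)   - log p(source|s,a)],
    with both probabilities coming from learned domain classifiers."""
    p_sas = clf_sas(s, a, s_next)   # p(target domain | s, a, s')
    p_sa = clf_sa(s, a)             # p(target domain | s, a)
    return (np.log(p_sas + eps) - np.log(1.0 - p_sas + eps)
            - np.log(p_sa + eps) + np.log(1.0 - p_sa + eps))
```

The corrected reward r(s, a) + Δr(s, a, s') can then be fed to any standard RL algorithm trained on simulator rollouts.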
πΉ DARA: DYNAMICS-AWARE REWARD AUGMENTATION IN OFFLINE REINFORCEMENT LEARNING π₯
πΉ When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning π₯
H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset.
πΉ DOMAIN TRANSFER WITH LARGE DYNAMICS SHIFT IN OFFLINE REINFORCEMENT LEARNING π
the source data will play two roles. One is to serve as augmentation data by compensating for the difference in dynamics with modified reward. Another is to form prior knowledge for the behaviour policy to collect a small amount of new data in the target domain safely and efficiently.
πΉ TARGETED ENVIRONMENT DESIGN FROM OFFLINE DATA π π₯
OTED: which automatically learns a distribution over simulator parameters to match a provided offline dataset, and then uses the learned simulator to train an RL agent in standard online fashion.
This paper considers learning a predictive model to address the missing parameters in sequential decision problems.
-
general domain adaptation (DA) = importance weighting + domain-agnostic features
-
DA in RL = system identification + domain randomization + observation adaptation π
- formulates control as a problem of probabilistic inference π§
πΉ Unsupervised Domain Adaptation with Dynamics Aware Rewards in Reinforcement Learning π π₯ π
DARS: we introduce a KL regularized objective to encourage emergence of skills, rewarding the agent for both discovering skills and aligning its behaviors respecting dynamics shifts.
πΉ Mutual Alignment Transfer Learning π π₯ β β
The developed approach harnesses auxiliary rewards to guide the exploration for the real world agent based on the proficiency of the agent in simulation and vice versa.
πΉ SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning [real dog] π π₯ β β
a framework to tackle domain adaptation by identifying a hybrid physics simulator to match the simulated trajectories to the ones from the target domain, using a learned discriminative loss to address the limitations associated with manual loss design.
πΉ Disentangled Skill Embeddings for Reinforcement Learning π₯ π π₯ π₯ β β β β
We have developed a multi-task framework from a variational inference perspective that is able to learn latent spaces that generalize to unseen tasks where the dynamics and reward can change independently.
πΉ Transfer Learning in Deep Reinforcement Learning: A Survey π¦ β
Evaluation metrics: Mastery and Generalization.
TRANSFER LEARNING APPROACHES: Reward Shaping; Learning from Demonstrations; Policy Transfer (Transfer Learning via Policy Distillation, Transfer Learning via Policy Reuse); Inter-Task Mapping; Representation Transfer(Reusing Representations, Disentangling Representations);
πΉ Provably Efficient Model-based Policy Adaptation π π₯ π π§ π β
We prove that the approach learns policies in the target environment that can recover trajectories from the source environment, and establish the rate of convergence in general settings.
β reward shaping
-
πΉ Useful Policy Invariant Shaping from Arbitrary Advice π β
-
Action
πΉ Generalization to New Actions in Reinforcement Learning π
We propose a two-stage framework where the agent first infers action representations from action information acquired separately from the task. A policy flexible to varying action sets is then trained with generalization objectives.
πΉ Policy Transfer across Visual and Dynamics Domain Gaps via Iterative Grounding π π₯
alternates between (1) directly minimizing both visual and dynamics domain gaps by grounding the source env in the target env domains, and (2) training a policy on the grounded source env.
πΉ Learning Agile Robotic Locomotion Skills by Imitating Animals π π₯ π β
We show that by leveraging reference motion data, a single learning-based approach is able to automatically synthesize controllers for a diverse repertoire of behaviors for legged robots. By incorporating sample-efficient domain adaptation techniques into the training process, our system is able to learn adaptive policies in simulation that can then be quickly adapted for real-world deployment.
πΉ RMA: Rapid Motor Adaptation for Legged Robots π
The robot achieves this high success rate despite never having seen unstable or sinking ground, obstructive vegetation or stairs during training. All deployment results are with the same policy without any simulation calibration, or real-world fine-tuning. β β
πΉ A System for General In-Hand Object Re-Orientation π
We present a simple model-free framework (teacher-student distillation) that can learn to reorient objects with both the hand facing upwards and downwards. + DAgger
πΉ LEARNING VISION-GUIDED QUADRUPEDAL LOCOMOTION END-TO-END WITH CROSS-MODAL TRANSFORMERS π₯
LocoTransformer: We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs.
πΉ Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability π₯
we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP.
πΉ Learning quadrupedal locomotion over challenging terrain π₯ π
We present a novel solution to incorporating proprioceptive feedback in locomotion control and demonstrate remarkable zero-shot generalization from simulation to natural environments.
πΉ Rma: Rapid motor adaptation for legged robots π₯ π
RMA consists of two components: a base policy and an adaptation module. The combination of these components enables the robot to adapt to novel situations in fractions of a second. RMA is trained completely in simulation without using any domain knowledge like reference trajectories or predefined foot trajectory generators and is deployed on the A1 robot without any fine-tuning.
πΉ Fast Adaptation to New Environments via Policy-Dynamics Value Functions π₯
PD-VF: explicitly estimates the cumulative reward in a space of policies and environments.
πΉ PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations π₯ π
In offline phase, the environment representation and policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual information optimization to make them more decoupled and complete.
State-Conservative Policy Optimization (SCPO) reduces the disturbance in transition dynamics to that in state space and then approximates it by a simple gradient-based regularizer.
πΉ LEARNING A SUBSPACE OF POLICIES FOR ONLINE ADAPTATION IN REINFORCEMENT LEARNING πΆ
LoP does not need any particular tuning or definition of additional architectures to handle diversity, which is a critical aspect in the online adaptation setting where hyper-parameter tuning is impossible or at least very difficult.
πΉ ADAPT-TO-LEARN: POLICY TRANSFER IN REINFORCEMENT LEARNING π π β
New: Adaptive Policy Transfer in Reinforcement Learning
adapt the source policy to learn to solve a target task with significant transition differences and uncertainties.
πΉ Unsupervised Domain Adaptation with Dynamics Aware Rewards in Reinforcement Learning π₯ π
DARS: We propose an unsupervised domain adaptation method to identify and acquire skills across dynamics. We introduce a KL regularized objective to encourage emergence of skills, rewarding the agent for both discovering skills and aligning its behaviors respecting dynamics shifts.
πΉ SINGLE EPISODE POLICY TRANSFER IN REINFORCEMENT LEARNING π₯ π β β
Our key idea of optimized probing for accelerated latent variable inference is to train a dedicated probe policy π_φ(a|s) to generate a dataset D of short trajectories at the beginning of all training episodes, such that the VAE's performance on D is optimized.
πΉ VARIBAD: A VERY GOOD METHOD FOR BAYES-ADAPTIVE DEEP RL VIA META-LEARNING π₯ π
we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection.
πΉ Dynamical Variational Autoencoders: A Comprehensive Review π¦ π¦ β β
πΉ Dynamics Generalization via Information Bottleneck in Deep Reinforcement Learningβ π₯ β β
In particular, we show that the poor generalization in unseen tasks is due to the DNNs memorizing environment observations, rather than extracting the relevant information for a task. To prevent this, we impose communication constraints as an information bottleneck between the agent and the environment.
πΉ UNIVERSAL AGENT FOR DISENTANGLING ENVIRONMENTS AND TASKS π₯ π
The environment-specific unit handles how to move from one state to the target state; and the task-specific unit plans for the next target state given a specific task.
πΉ Decoupling Dynamics and Reward for Transfer Learning π
We separate learning the task representation, the forward dynamics, the inverse dynamics and the reward function of the domain.
πΉ Neural Dynamic Policies for End-to-End Sensorimotor Learning π₯ π
We propose Neural Dynamic Policies (NDPs) that make predictions in trajectory distribution space as opposed to raw control spaces. [see Abstract!] Similar in spirit to UNIVERSAL AGENT.
πΉ Accelerating Reinforcement Learning with Learned Skill Priors π π₯
We propose a deep latent variable model that jointly learns an embedding space of skills and the skill prior from offline agent experience. We then extend common maximumentropy RL approaches to use skill priors to guide downstream learning.
πΉ Mutual Alignment Transfer Learning π π₯
The developed approach harnesses auxiliary rewards to guide the exploration for the real world agent based on the proficiency of the agent in simulation and vice versa.
πΉ LEARNING CROSS-DOMAIN CORRESPONDENCE FOR CONTROL WITH DYNAMICS CYCLE-CONSISTENCY π π π₯
In this paper, we propose to learn correspondence across such domains, emphasizing differing modalities (vision and internal state), physics parameters (mass and friction), and morphologies (number of limbs). Importantly, correspondences are learned using unpaired and randomly collected data from the two domains. We propose dynamics cycles that align dynamic robotic behavior across two domains using a cycle consistency constraint.
πΉ Hierarchically Decoupled Imitation for Morphological Transfer πΆ
incentivizing a complex agent's low-level to imitate a simpler agent's low-level significantly improves zero-shot high-level transfer; KL-regularized training of the high level stabilizes learning and prevents mode collapse.
πΉ Improving Generalization in Reinforcement Learning with Mixture Regularization π
these approaches only locally perturb the observations regardless of the training environments, showing limited effectiveness on enhancing the data diversity and the generalization performance.
πΉ AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning π₯ π§
we characterize a minimal set of representations, including both domain-specific factors and domain-shared state representations, that suffice for reliable and low-cost transfer.
πΉ A GENERAL THEORY OF RELATIVITY IN REINFORCEMENT LEARNING π₯ π
The proposed theory deeply investigates the connection between any two cumulative expected returns defined on different policies and environment dynamics: Relative Policy Optimization (RPO) updates the policy using the relative policy gradient to transfer the policy evaluated in one environment to maximize the return in another, while Relative Transition Optimization (RTO) updates the parameterized dynamics model (if one exists) using the relative transition gradient to reduce the gap between the dynamics of the two environments.
πΉ COPA: CERTIFYING ROBUST POLICIES FOR OFFLINE REINFORCEMENT LEARNING AGAINST POISONING ATTACKS π
We focus on certifying the robustness of offline RL in the presence of poisoning attacks, where a subset of training trajectories could be arbitrarily manipulated. We propose the first certification framework, COPA to certify the number of poisoning trajectories that can be tolerated regarding different certification criteria.
πΉ CROP: CERTIFYING ROBUST POLICIES FOR REINFORCEMENT LEARNING THROUGH FUNCTIONAL SMOOTHING π₯
We propose two particular types of robustness certification criteria: robustness of per-state actions and lower bound of cumulative rewards.
πΉ Learning Action Translator for Meta Reinforcement Learning on Sparse-Reward Tasks π₯ π₯
MCAT: we propose to learn an action translator among multiple training tasks. The objective function forces the translated action to behave on the target task similarly to the source action on the source task. We consider the policy transfer for any pair of source and target tasks in the training task distribution.
πΉ AACC: Asymmetric Actor-Critic in Contextual Reinforcement Learning π₯
the critic is trained with environmental factors and observations, while the actor only gets the observation as input.
πΉ Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification π₯
Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min optimization problem using a simultaneous gradient ascent descent approach.
β Multi-task
πΉ Multi-Task Reinforcement Learning without Interference π₯
We develop two general approaches that change the multi-task optimization landscape to alleviate conflicting gradients across tasks, one architectural and one algorithmic, preventing gradients for different tasks from interfering with one another.
πΉ Multi-Task Reinforcement Learning with Soft Modularization πΆ
Given a base policy network, we design a routing network which estimates different routing strategies to reconfigure the base network for each task.
πΉ Multi-task Batch Reinforcement Learning with Metric Learning π₯
MBML: Because the different datasets may have state-action distributions with large divergence, the task inference module can learn to ignore the rewards and spuriously correlate only state-action pairs to the task identity, leading to poor test time performance. To robustify task inference, we propose a novel application of the triplet loss.
πΉ MULTI-BATCH REINFORCEMENT LEARNING VIA SAMPLE TRANSFER AND IMITATION LEARNING πΆ
BAIL+ and MBAIL
πΉ Knowledge Transfer in Multi-Task Deep Reinforcement Learning for Continuous Control πΆ
KTM-DRL enables a single multi-task agent to leverage the offline knowledge transfer, the online learning, and the hierarchical experience replay for achieving expert-level performance in multiple different continuous control tasks.
πΉ Multi-Task Reinforcement Learning with Context-based Representations π₯
CARE: We posit that an efficient approach to knowledge transfer is through the use of multiple context-dependent, composable representations shared across a family of tasks. Metadata can help to learn interpretable representations and provide the context to inform which representations to compose and how to compose them.
πΉ CARL: A Benchmark for Contextual and Adaptive Reinforcement Learning π
We propose CARL, a collection of well-known RL environments extended to contextual RL problems to study generalization.
We propose SwitchTT, a multi-task extension to Trajectory Transformer but enhanced with two striking features: (i) exploiting a sparsely activated model to reduce computation cost in multitask offline model learning and (ii) adopting a distributional trajectory value estimator that improves policy performance, especially in sparse reward settings.
πΉ Efficient Planning in a Compact Latent Action Space πΆ
TAP avoids planning step-by-step in a high-dimensional continuous action space but instead looks for the optimal latent code sequences by beam search.
πΉ MULTI-CRITIC ACTOR LEARNING: TEACHING RL POLICIES TO ACT WITH STYLE π
Multi-Critic Actor Learning (MultiCriticAL) proposes instead maintaining separate critics for each task being trained while training a single multi-task actor.
πΉ Investigating Generalisation in Continuous Deep Reinforcement Learning
πΉ Evolution Gym: A Large-Scale Benchmark for Evolving Soft Robots π₯
πΉ Beyond Tabula Rasa: Reincarnating Reinforcement Learning π
As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent.
πΉ A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning π π₯
DAgger (Dataset Aggregation): trains a deterministic policy that achieves good performance guarantees under its induced distribution of states.
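A minimal DAgger loop sketch under simplifying assumptions (`env`, `expert_action`, and `fit_policy` are placeholder interfaces, and `env.step` is assumed to return `(state, done)`):

```python
def dagger(env, expert_action, fit_policy, iterations=10, horizon=200):
    """Roll out the current policy, label visited states with expert actions,
    aggregate the dataset, and retrain a supervised policy each iteration."""
    dataset = []                     # aggregated (state, expert_action) pairs
    policy = expert_action           # iteration 0: follow the expert
    for _ in range(iterations):
        state = env.reset()
        for _ in range(horizon):
            action = policy(state)                          # visit the learner's own states
            dataset.append((state, expert_action(state)))   # but store the expert's labels
            state, done = env.step(action)
            if done:
                break
        policy = fit_policy(dataset)                        # supervised learning on all data
    return policy
```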
πΉ Multifidelity Reinforcement Learning with Control Variates π₯
MFMCRL: a multifidelity estimator that exploits the cross-correlations between the low- and high-fidelity returns is proposed to reduce the variance in the estimation of the state-action value function.
πΉ Robust Trajectory Prediction against Adversarial Attacks π
we propose an adversarial training framework with three main components, including (1) a deterministic attack for the inner maximization process of the adversarial training, (2) additional regularization terms for stable outer minimization of adversarial training, and (3) a domain-specific augmentation strategy to achieve a better performance trade-off on clean and adversarial data.
πΉ Model-based Trajectory Stitching for Improved Offline Reinforcement Learning π₯
TS: A stitching event consists of a transition between a pair of observed states through a synthetic and highly probable action.
πΉ BATS: Best Action Trajectory Stitching π₯
BATS: we narrow the pool of candidate stitches to those that are both feasible and impactful.
πΉ TRANSFER RL VIA THE UNDO MAPS FORMALISM π₯
TvD: characterizing the discrepancy in environments by means of (potentially complex) transformation between their state spaces, and thus posing the problem of transfer as learning to undo this transformation.
πΉ Provably Sample-Efficient RL with Side Information about Latent Dynamics
TASID:
β β β Out-of-Distribution (OOD) Generalization Modularity--->Generalization
πΉ Invariant Risk Minimization Introduction is good! π π₯ π₯ slide information theoretic view
To learn invariances across environments, find a data representation such that the optimal classifier on top of that representation matches for all environments.
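The practical IRMv1 objective usually quoted for this idea, where Φ is the shared representation, R^e the per-environment risk, and the penalty is the gradient norm at a fixed dummy classifier w = 1.0 (written from the standard presentation, not verified against the paper's exact notation):

```latex
\min_{\Phi}\; \sum_{e \in \mathcal{E}_{tr}}
\Big[\, R^{e}(\Phi)
\;+\; \lambda\, \big\| \nabla_{w \,|\, w=1.0}\, R^{e}(w \cdot \Phi) \big\|^{2} \Big].
```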
πΉ Out-of-Distribution Generalization via Risk Extrapolation π π₯
REx can be viewed as encouraging robustness over affine combinations of training risks, by encouraging strict equality between training risks.
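The variance-penalized (V-REx) form of this idea, with R^e the risk on training environment e and β controlling the strength of extrapolation (stated here as a sketch):

```latex
\min_{\theta}\;\; \sum_{e=1}^{m} R^{e}(\theta)
\;+\; \beta\, \mathrm{Var}\!\big(\{ R^{1}(\theta), \dots, R^{m}(\theta) \}\big).
```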
πΉ OUT-OF-DISTRIBUTION GENERALIZATION ANALYSIS VIA INFLUENCE FUNCTION π₯
if a learnt model fθ̂ manages to simultaneously achieve a small Vγ̂|θ̂ and high accuracy over E_test, it should have good OOD accuracy.
πΉ EMPIRICAL OR INVARIANT RISK MINIMIZATION? A SAMPLE COMPLEXITY PERSPECTIVE π§ β
πΉ Invariant Rationalization π π₯ π₯
MMI can be problematic because it picks up spurious correlations between the input features and the output. Instead, we introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments.
πΉ Invariant Risk Minimization Games π§ β
β β
β influence function
-
πΉ Understanding Black-box Predictions via Influence Functions π₯ π§
Upweighting a training point; Perturbing a training input; Efficiently calculating influence π§ ;
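The classical influence-function expression these notes refer to (Koh & Liang's up-weighting form; H is the empirical Hessian of the training loss at the optimum θ̂):

```latex
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}}) \;=\;
-\,\nabla_\theta L(z_{\mathrm{test}}, \hat{\theta})^{\top}\,
H_{\hat{\theta}}^{-1}\,
\nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \tfrac{1}{n}\sum_{i=1}^{n} \nabla^{2}_{\theta} L(z_i, \hat{\theta}).
```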
-
πΉ INFLUENCE FUNCTIONS IN DEEP LEARNING ARE FRAGILE π π₯ π
non-convexity of the loss function --- different initializations; parameters might be very large --- substantial Taylor approximation error of the loss function; computationally very expensive --- approximate inverse-Hessian vector product techniques which might be erroneous; different architectures can have different loss landscape geometries near the optimal model parameters, leading to varying influence estimates.
-
πΉ On the Accuracy of Influence Functions for Measuring Group Effects π β
when measuring the change in test prediction or test loss, influence is additive.
β do-calculus ---> causal inference (Interventions) ---> counterfactuals
see inFERENCe's blog π π₯ π₯ the intervention conditional p(y|do(X=x^)) is the average of counterfactuals over the observable population.
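In symbols, the statement above (averaging unit-level counterfactuals over the exogenous background u) reads roughly as:

```latex
p\big(y \mid do(X=\hat{x})\big) \;=\; \int p\big(Y_{X=\hat{x}} = y \mid u\big)\, p(u)\, \mathrm{d}u .
```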
-
πΉ Soft-Robust Actor-Critic Policy-Gradient π β
Robust RL has shown that by considering the worst case scenario, robust policies can be overly conservative. Soft-Robust Actor Critic (SR-AC) learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies.
πΉ A Game-Theoretic Perspective of Generalization in Reinforcement Learning π₯ π₯
We propose a game-theoretic framework for generalization in reinforcement learning, named GiRL, in which an RL agent is trained against an adversary over a set of tasks; the adversary can manipulate the distributions over tasks within a given threshold.
πΉ UNSUPERVISED TASK CLUSTERING FOR MULTI-TASK REINFORCEMENT LEARNING π π₯
EM-Task-Clustering: We propose a general approach to automatically cluster together similar tasks during training. Our method, inspired by the expectation-maximization algorithm, succeeds at finding clusters of related tasks and uses these to improve sample complexity.
πΉ Learning Dynamics and Generalization in Reinforcement Learning π₯
TD learning dynamics discourage interference, and that while this may have a beneficial effect on stability during training, it can reduce the ability of the network to generalize to new observations.
-
Inverse RL & Apprenticeship Learning, PPT-levine(1π 2), Medium(1 2),
πΉ Apprenticeship Learning via Inverse Reinforcement Learning π; Maximum Entropy Inverse Reinforcement Learning; Maximum Entropy Deep Inverse Reinforcement Learning
πΉ Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization π₯ π
πΉ A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models π π¦ π¦ zhihu π β
πΉ Generative Adversarial Imitation Learning π₯ π π₯ π¦ zhihu π§ β
IRL is a dual of an occupancy-measure matching problem; the induced optimal policy is the primal optimum.
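For reference, the GAIL saddle-point objective that this duality underlies (standard form; sign and placement conventions for the discriminator D vary across papers):

```latex
\min_{\pi}\, \max_{D}\;\;
\mathbb{E}_{\pi}\!\left[\log D(s,a)\right]
+ \mathbb{E}_{\pi_E}\!\left[\log\big(1 - D(s,a)\big)\right]
- \lambda\, H(\pi).
```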
πΉ Visual Adversarial Imitation Learning using Variational Models π₯ π
V-MAIL: learns a model of the environment, which serves as a strong self-supervision signal for visual representation learning and mitigates distribution shift by enabling synthetic on-policy rollouts using the model.
πΉ Latent Policies for Adversarial Imitation Learning π₯
LAPAL: We use an action encoder-decoder model to obtain a low-dimensional latent action space and train a LAtent Policy using Adversarial imitation Learning (LAPAL).
πΉ LEARNING ROBUST REWARDS WITH ADVERSARIAL INVERSE REINFORCEMENT LEARNING π₯π π₯ π¦
AIRL: Part of the challenge is that IRL is an ill-defined problem, since there are many optimal policies that can explain a set of demonstrations, and many rewards that can explain an optimal policy. The maximum entropy (MaxEnt) IRL framework introduced by Ziebart et al. (2008) handles the former ambiguity, but the latter ambiguity means that IRL algorithms have difficulty distinguishing the true reward functions from those shaped by the environment dynamics (THE REWARD AMBIGUITY PROBLEM). -- DISENTANGLING REWARDS FROM DYNAMICS.
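The AIRL discriminator and the disentangled (reward-plus-shaping) structure it imposes, as usually written (a sketch of the standard presentation):

```latex
D_\theta(s,a,s') \;=\; \frac{\exp f_\theta(s,a,s')}{\exp f_\theta(s,a,s') + \pi(a \mid s)},
\qquad
f_\theta(s,a,s') \;=\; g_\theta(s,a) \;+\; \gamma\, h_\phi(s') \;-\; h_\phi(s).
```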
πΉ Adversarially Robust Imitation Learning π
ARIL: physical attack; sensory attack.
πΉ Robust Imitation of Diverse Behaviors π₯
VAE+GAN: a new version of GAIL that (1) is much more robust than the purely-supervised controller, especially with few demonstrations, and (2) avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not.
πΉ OFF-POLICY ADVERSARIAL INVERSE REINFORCEMENT LEARNING
πΉ A Primer on Maximum Causal Entropy Inverse Reinforcement Learning π§ π¦
πΉ ADVERSARIAL IMITATION VIA VARIATIONAL INVERSE REINFORCEMENT LEARNING π π₯ π§
Our method simultaneously learns empowerment through variational information maximization along with the reward and policy under the adversarial learning formulation.
πΉ A Divergence Minimization Perspective on Imitation Learning Methods π¦ π π β β
State-Marginal Matching. We present a unified probabilistic perspective on IL algorithms based on divergence minimization.
πΉ f-IRL: Inverse Reinforcement Learning via State Marginal Matching π¦
πΉ Imitation Learning as f-Divergence Minimization π¦ β
πΉ Offline Imitation Learning with a Misspecified Simulator π β
learn a policy π given a few expert demonstrations and a simulator with misspecified dynamics.
πΉ Inverse Constrained Reinforcement Learning π₯ π₯
The main task ("do this") is often quite easy to encode in the form of a simple nominal reward function. In this work, we focus on learning the constraint part ("do not do that") from provided expert demonstrations and using it in conjunction with the nominal reward function to train RL agents.
πΉ PRIMAL WASSERSTEIN IMITATION LEARNING π₯ π
We present Imitation Learning as a distribution matching problem and introduce a reward function which is based on an upper bound of the Wasserstein distance between the state-action distributions of the agent and the expert.
πΉ Robust Inverse Reinforcement Learning under Transition Dynamics Mismatch π₯
We consider the Maximum Causal Entropy (MCE) IRL learner model and provide a tight upper bound on the learner's performance degradation based on the ℓ1-distance between the transition dynamics of the expert and the learner.
πΉ XIRL: Cross-embodiment Inverse Reinforcement Learning
leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences.
πΉ Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency π₯ π
Deterministic and Discriminative Imitation (D2-Imitation) operates by first partitioning samples into two replay buffers and then learning a deterministic policy via off-policy reinforcement learning.
πΉ Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations π₯
We propose Self-Adaptive Imitation Learning (SAIL) that can achieve (near) optimal performance given only a limited number of sub-optimal demonstrations for highly challenging sparse reward tasks. reward = log pi_1/pi_2;
πΉ Model-Based Imitation Learning Using Entropy Regularization of Model and Policy π π₯ π
MB-ERIL: A policy discriminator distinguishes the actions generated by a robot from expert ones, and a model discriminator distinguishes the counterfactual state transitions generated by the model from the actual ones.
πΉ Robust Imitation Learning against Variations in Environment Dynamics π π₯ π
RIME: Our framework effectively deals with environments with varying dynamics by imitating multiple experts in sampled environment dynamics, enhancing robustness to general variations in environment dynamics.
πΉ Learning Multi-Task Transferable Rewards via Variational Inverse Reinforcement Learning π₯ π₯
Our proposed method derives the variational lower bound of the situational mutual information to optimize it. We simultaneously learn the transferable multi-task reward function and policy by adding an induced term to the objective function.
β
πΉ Disagreement-Regularized Imitation Learning
πΉ Intrinsic Reward Driven Imitation Learning via Generative Model π π₯
Combines a backward action encoding and a forward dynamics model into one generative solution. Moreover, our model generates a family of intrinsic rewards, enabling the imitation agent to do sampling-based self-supervised exploration in the environment. Outperforms the expert.
πΉ REGULARIZED INVERSE REINFORCEMENT LEARNING π¦
πΉ Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition π variational inverse control with events (VICE), which generalizes inverse reinforcement learning methods π π§ β β
πΉ Meta-Inverse Reinforcement Learning with Probabilistic Context Variables π§
we propose a deep latent variable model that is capable of learning rewards from demonstrations of distinct but related tasks in an unsupervised way. Critically, our model can infer rewards for new, structurally-similar tasks from a single demonstration.
πΉ Domain Adaptive Imitation Learning π₯ π π
In the alignment step we execute a novel unsupervised MDP alignment algorithm, GAMA, to learn state and action correspondences from unpaired, unaligned demonstrations. In the adaptation step we leverage the correspondences to zero-shot imitate tasks across domains.
πΉ ADAIL: Adaptive Adversarial Imitation Learning π π₯ π
the discriminator may either simply use the embodiment or dynamics to infer whether it is evaluating expert behavior, and as a consequence fails to provide a meaningful reward signal. we condition our policy on a learned dynamics embedding and we employ a domain-adversarial loss to learn a dynamics-invariant discriminator.
πΉ Generative Adversarial Imitation from Observation π¦ π π₯ π β β β
From a high-level perspective, in imitation from observation, the goal is to enable the agent to extract what the task is by observing some state sequences. GAIfO
πΉ MobILE: Model-Based Imitation Learning From Observation Alone π π₯ π
Imitation Learning from Observation Alone (ILFO). MobILE involves carefully trading off strategic exploration against imitation - this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution matching imitation learning (IL) framework.
πΉ IMITATION LEARNING FROM OBSERVATIONS UNDER TRANSITION MODEL DISPARITY π₯ π π₯
AILO: We consider ILO where the expert and the learner agents operate in different environments (dynamics). We propose AILO, which trains an intermediary policy in the learner environment and uses it as a surrogate expert for the learner.
πΉ Robust Learning from Observation with Model Misspecification π₯ π
Robust-GAILfO: We discuss how our method addresses the dynamics mismatch issue by exploiting the equivalence between the robust MDP formulation and the two-player Markov game.
πΉ Learn what matters: cross-domain imitation learning with task-relevant embeddings π₯ π
UDIL: unsupervised cross-domain adversarial imitation learning. We jointly train the learner agent's policy and learn a mapping between the learner and expert domains with adversarial training. We effect this by using a mutual information criterion to find an embedding of the expert's state space that contains task-relevant information and is invariant to domain specifics.
πΉ CROSS-DOMAIN IMITATION LEARNING VIA OPTIMAL TRANSPORT π₯
We propose Gromov-Wasserstein Imitation Learning (GWIL), a method for cross-domain imitation that uses the Gromov Wasserstein distance to align and compare states between the different spaces of the agents.
πΉ An Imitation from Observation Approach to Transfer Learning with Dynamics Mismatch π π₯ π β
learning the grounded action transformation can be seen as an IfO problem; GARAT: learn an action transformation policy for transfer learning with dynamics mismatch. We focus on the paradigm of simulator grounding, which modifies the source environment's dynamics to more closely match the target environment dynamics using a relatively small amount of target environment data.
πΉ HYAR: ADDRESSING DISCRETE-CONTINUOUS ACTION REINFORCEMENT LEARNING VIA HYBRID ACTION REPRESENTATION
We propose Hybrid Action Representation (HyAR) to learn a compact and decodable latent representation space for the original hybrid action space.
πΉ STATE ALIGNMENT-BASED IMITATION LEARNING π π₯ β
Consider an imitation learning problem that the imitator and the expert have different dynamics models. The state alignment comes from both local and global perspectives and we combine them into a reinforcement learning framework by a regularized policy update objective. ifo
πΉ Strictly Batch Imitation Learning by Energy-based Distribution Matching π₯ π₯ π¦ π
EDM: By identifying parameterizations of the (discriminative) model of a policy with the (generative) energy function for state distributions, EDM yields a simple but effective solution that equivalently minimizes a divergence between the occupancy measure for the demonstrator and a model thereof for the imitator.
πΉ SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards π π₯ π β
SQIL is equivalent to a variant of behavioral cloning (BC) that uses regularization to overcome state distribution shift. We accomplish this by giving the agent a constant reward of r = +1 for matching the demonstrated action in a demonstrated state, and a constant reward of r = 0 for all other behavior.
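A minimal sketch of the constant-reward relabeling described above; `demo_transitions` and `agent_transitions` are assumed to be lists of (s, a, s') tuples, and the resulting buffer would be consumed by an off-the-shelf soft-Q or SAC learner:

```python
def build_sqil_buffer(demo_transitions, agent_transitions):
    """SQIL-style relabeling: r = +1 on demonstrated (s, a), r = 0 elsewhere."""
    buffer = []
    for (s, a, s_next) in demo_transitions:
        buffer.append((s, a, 1.0, s_next))   # demonstrated state-action: reward +1
    for (s, a, s_next) in agent_transitions:
        buffer.append((s, a, 0.0, s_next))   # all other behavior: reward 0
    return buffer
```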
πΉ IQ-Learn: Inverse soft-Q Learning for Imitation π π₯ π π§
We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function, implicitly representing both reward and policy.
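The inverse soft Bellman relation that lets a single Q implicitly encode the reward (with the soft value V^π; a standard statement, not copied from the paper):

```latex
r(s,a) \;=\; Q(s,a) \;-\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ V^{\pi}(s') \right],
\qquad
V^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi}\!\left[ Q(s,a) - \log \pi(a \mid s) \right].
```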
πΉ LS-IQ: IMPLICIT REWARD REGULARIZATION FOR INVERSE REINFORCEMENT LEARNING
πΉ Boosted and Reward-regularized Classification for Apprenticeship Learning π₯ β β
MultiClass Classification and the Large Margin Approach.
πΉ IMITATION LEARNING VIA OFF-POLICY DISTRIBUTION MATCHING π π₯ π₯ π β
These prior distribution matching approaches possess two limitations (On-policy; Separate RL optimization). ---> OFF-POLICY FORMULATION OF THE KL-DIVERGENCE. ---> VALUEDICE: IMITATION LEARNING WITH IMPLICIT REWARDS. (OPE)
πΉ SCALABLE BAYESIAN INVERSE REINFORCEMENT LEARNING π₯ π§
AVRIL: jointly learning an approximate posterior distribution over the reward that scales to arbitrarily complicated state spaces alongside an appropriate policy in a completely offline manner through a variational approach to said latent reward.
πΉ TRANSFERABLE REWARD LEARNING BY DYNAMICS-AGNOSTIC DISCRIMINATOR ENSEMBLE π₯
DARL: learns a dynamics-agnostic discriminator on a latent space mapped from the original state-action space. To reduce the reliance of the discriminator on policies, the reward function is represented as an ensemble of the discriminators during training.
πΉ Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement π π₯ π₯ β
The gap between LfD and LfO actually lies in the disagreement of inverse dynamics models between the imitator and the expert, if following the modeling approach of GAIL. ifo IDDM
πΉ Off-Policy Imitation Learning from Observations π π¦ π₯ π₯ β
OPOLO (Off POlicy Learning from Observations)! ifo // lfo // ope // mode-covering (Forward Distribution Matching) // mode-seeking // dice // LfD // LfO β
πΉ Imitation Learning by State-Only Distribution Matching π₯ π₯
LfO: We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
πΉ Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization π₯ π
We propose Decoupled Policy Optimization (DePO) for transferable state-only imitation learning, which decouples the state-to-action mapping policy into a state-to-state mapping state planner and a state-pair-to-action mapping inverse dynamics model. [poster]
πΉ IMITATION LEARNING BY REINFORCEMENT LEARNING π₯ π₯
We show that, for deterministic experts, imitation learning can be done by reduction to reinforcement learning with a stationary reward.
πΉ Imitation by Predicting Observations π₯ π
LfO: FORM ("Future Observation Reward Model") is derived from an inverse RL objective and imitates using a model of expert behavior learned by generative modelling of the expert's observations, without needing ground truth actions.
πΉ AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control πΆ
we presented an adversarial learning system for physics-based character animation that enables characters to imitate diverse behaviors from large unstructured datasets, without the need for motion planners or other mechanisms for clip selection.
πΉ ARC - Actor Residual Critic for Adversarial Imitation Learning πΆ
We leverage the differentiability property of the AIL reward function and formulate a class of Actor Residual Critic (ARC) RL algorithms that draw a parallel to the standard AC algorithms in RL and use a residual critic (the C function) to approximate only the discounted future return (excluding the immediate reward).
πΉ AUTO-ENCODING INVERSE REINFORCEMENT LEARNING π
AEIRL: utilizes the reconstruction error of an auto-encoder as the learning signal, which provides more information for optimizing policies, compared to the binary logistic loss.
πΉ Auto-Encoding Adversarial Imitation Learning
AEAIL:
πΉ Reinforced Imitation Learning by Free Energy Principle π§
πΉ Error Bounds of Imitating Policies and Environments π π¦
πΉ What Matters for Adversarial Imitation Learning?
πΉ Distributionally Robust Imitation Learning π π₯ π§
This paper studies Distributionally Robust Imitation Learning (DROIL) and establishes a close connection between DROIL and Maximum Entropy Inverse Reinforcement Learning.
πΉ Provable Representation Learning for Imitation Learning via Bi-level Optimization
πΉ Provable Representation Learning for Imitation with Contrastive Fourier Features π₯ π₯
We derive a representation learning objective that provides an upper bound on the performance difference between the target policy and a low-dimensional policy trained with max-likelihood, and this bound is tight regardless of whether the target policy itself exhibits low-dimensional structure.
πΉ TRAIL: NEAR-OPTIMAL IMITATION LEARNING WITH SUBOPTIMAL DATA π₯ π
TRAIL (Transition-Reparametrized Actions for Imitation Learning): We present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data.
πΉ Imitation Learning via Differentiable Physics π₯
ILD: incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning.
πΉ Of Moments and Matching: A Game-Theoretic Framework for Closing the Imitation Gap π
AdVIL, AdRIL, and DAeQuIL:
πΉ Generalizable Imitation Learning from Observation via Inferring Goal Proximity π π₯
we learn a goal proximity function (task progress) and utilize it as a dense reward for policy learning.
πΉ Show me the Way: Intrinsic Motivation from Demonstrations πΆ
extracting an intrinsic bonus from the demonstrations.
πΉ Out-of-Dynamics Imitation Learning from Multimodal Demonstrations π₯
OOD-IL enables imitation learning to utilize demonstrations from a wide range of demonstrators but introduces a new challenge: some demonstrations cannot be achieved by the imitator due to the different dynamics. We develop a better transferability measurement.
πΉ Imitating Latent Policies from Observation π₯ π π₯
ILPO: We introduce a method that characterizes the causal effects of latent actions on observations while simultaneously predicting their likelihood. We then outline an action alignment procedure that leverages a small amount of environment interactions to determine a mapping between the latent and real-world actions.
πΉ Latent Policies for Adversarial Imitation Learning π
We use an action encoder-decoder model to obtain a low-dimensional latent action space and train a LAtent Policy using Adversarial imitation Learning (LAPAL).
πΉ A Ranking Game for Imitation Learning
πΉ Recent Advances in Imitation Learning from Observation
πΉ PREFERENCES IMPLICIT IN THE STATE OF THE WORLD π₯ π§
RLSP: we identify the state of the world at initialization as a source of information about human preferences, and leverage this insight to derive an algorithm, Reward Learning by Simulating the Past (RLSP), which infers the reward from the initial state using Maximum Causal Entropy IRL.
πΉ Population-Guided Imitation Learning π
πΉ Towards Learning to Imitate from a Single Video Demonstration π π₯
using contrastive training to learn a reward function comparing an agent's behaviour with a single demonstration.
πΉ Concurrent Training Improves the Performance of Behavioral Cloning from Observation π₯
BCO* (behavioral cloning from observation)
πΉ Identifiability and Generalizability from Multiple Experts in Inverse Reinforcement Learning
Reward Identifiability
πΉ Improving Policy Learning via Language Dynamics Distillation
LDD: pretrains a model to predict environment dynamics given demonstrations with language descriptions, and then fine-tunes these language-aware pretrained representations via reinforcement learning.
πΉ LEARNING CONTROL BY ITERATIVE INVERSION π₯
Iterative Inversion (IT-IN): Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories (without actions), and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise.
πΉ CEIP: Combining Explicit and Implicit Priors for Reinforcement Learning with Demonstrations π
CEIP exploits multiple implicit priors in the form of normalizing flows in parallel to form a single complex prior. Moreover, CEIP uses an effective explicit retrieval and push-forward mechanism to condition the implicit priors.
πΉ Planning for Sample Efficient Imitation Learning π₯ π₯
EfficientImitate (EI): we show that the two seemingly incompatible classes of imitation algorithms (BC and AIL) can be naturally unified under our framework, enjoying the benefits of both.
πΉ Robust Imitation via Mirror Descent Inverse Reinforcement Learning
MD-AIRL:
πΉ Learning and Retrieval from Prior Data for Skill-based Imitation Learning π₯
Skill-Augmented Imitation Learning with prior Retrieval (SAILOR)
πΉ LS-IQ: IMPLICIT REWARD REGULARIZATION FOR INVERSE REINFORCEMENT LEARNING
-
Adding Noise
πΉ Learning from Suboptimal Demonstration via Self-Supervised Reward Regression π π₯
Recent attempts to learn from sub-optimal demonstration leverage pairwise rankings and follow the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations by developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function.
πΉ Robust Imitation Learning from Noisy Demonstrations π₯ π
In this paper, we first theoretically show that robust imitation learning can be achieved by optimizing a classification risk with a symmetric loss. Based on this theoretical finding, we then propose a new imitation learning method that optimizes the classification risk by effectively combining pseudo-labeling with co-training.
πΉ Imitation Learning from Imperfect Demonstration π₯ π π β
a novel approach that utilizes confidence scores, which describe the quality of demonstrations. two-step importance weighting imitation learning (2IWIL) and generative adversarial imitation learning with imperfect demonstration and confidence (IC-GAIL), based on the idea of reweighting.
πΉ Variational Imitation Learning with Diverse-quality Demonstrations π₯ π§
VILD: We show that simple quality-estimation approaches might fail due to compounding error, and fix this issue by jointly estimating both the quality and reward using a variational approach.
πΉ BEHAVIORAL CLONING FROM NOISY DEMONSTRATIONS π π¦
we propose an imitation learning algorithm to address the problem without any environment interactions and annotations associated with the non-optimal demonstrations.
πΉ Robust Imitation Learning from Corrupted Demonstrations π₯ π
We propose a novel robust algorithm by minimizing a Median-of-Means (MOM) objective which guarantees the accurate estimation of policy, even in the presence of constant fraction of outliers.
πΉ Confidence-Aware Imitation Learning from Demonstrations with Varying Optimality π π₯ π
CAIL: learns a well-performing policy from confidence-reweighted demonstrations, while using an outer loss to track the performance of our model and to learn the confidence.
πΉ Imitation Learning by Estimating Expertise of Demonstrators π₯ π
ILEED: We develop and optimize a joint model over a learned policy and expertise levels of the demonstrators. This enables our model to learn from the optimal behavior and filter out the suboptimal behavior of each demonstrator.
πΉ Learning to Weight Imperfect Demonstrations π
We provide a rigorous mathematical analysis, showing that the weights of demonstrations can be exactly determined by combining the discriminator and agent policy in GAIL.
πΉ Robust Adversarial Imitation Learning via Adaptively-Selected Demonstrations π₯
SAIL: good demonstrations can be adaptively selected for training while bad demonstrations are abandoned.
πΉ Policy Learning Using Weak Supervision π π₯
PeerRL: We treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreements).
πΉ Rethinking Importance Weighting for Transfer Learning π
We review recent advances based on joint and dynamic importance predictor estimation. Furthermore, we introduce a method of causal mechanism transfer that incorporates causal structure in TL.
πΉ Inverse Decision Modeling: Learning Interpretable Representations of Behavior π₯ π
We develop an expressive, unifying perspective on inverse decision modeling: a framework for learning parameterized representations of sequential decision behavior.
πΉ Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning
DAC: To address reward bias, we propose a simple mechanism whereby the rewards for absorbing states are also learned; to improve sample efficiency, we perform off-policy training.
πΉ Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations π
T-REX: a reward learning technique for high-dimensional tasks that can learn to extrapolate intent from suboptimal ranked demonstrations.
πΉ Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations π
D-REX: a ranking-based reward learning algorithm that does not require ranked demonstrations; instead, it injects noise into a policy learned through behavioral cloning to automatically generate ranked demonstrations.
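A minimal sketch of the D-REX ranking trick (not the authors' code): roll out a cloned policy under increasing amounts of injected noise and treat noisier rollouts as lower-ranked, yielding preference pairs for a T-REX-style reward learner. The `env` and `bc_policy` objects and the classic Gym-style interface are assumptions.

```python
import numpy as np

def rollout(env, bc_policy, epsilon, horizon=200):
    """Roll out the cloned policy with epsilon-greedy exploration noise
    (a classic Gym-style env interface is assumed)."""
    obs = env.reset()
    traj = []
    for _ in range(horizon):
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # injected noise
        else:
            action = bc_policy(obs)
        traj.append((obs, action))
        obs, _, done, _ = env.step(action)
        if done:
            break
    return traj

def auto_ranked_demos(env, bc_policy, noise_levels=(1.0, 0.75, 0.5, 0.25, 0.0), per_level=5):
    """D-REX assumption: more injected noise => (on average) a worse trajectory.
    Returns trajectories ordered from worst (most noise) to best (least noise)."""
    ranked = []
    for eps in noise_levels:
        ranked.extend(rollout(env, bc_policy, eps) for _ in range(per_level))
    return ranked  # feed pairs (tau_i, tau_j) with i < j into a preference-based reward loss
```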
πΉ DART: Noise Injection for Robust Imitation Learning π₯
We propose an off-policy approach that injects noise into the supervisor's policy while demonstrating. This forces the supervisor to demonstrate how to recover from errors. We propose a new algorithm, DART (Disturbances for Augmenting Robot Trajectories), that collects demonstrations with injected noise, and optimizes the noise level to approximate the error of the robot's trained policy during data collection.
πΉ Bayesian Inverse Reinforcement Learning
πΉ Deep Bayesian Reward Learning from Preferences
B-REX: Our approach uses successor feature representations and preferences over demonstrations to efficiently generate samples from the posterior distribution over the demonstrator's reward function without requiring an MDP solver.
πΉ Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences π₯
Bayesian REX (B-REX)
πΉ Asking Easy Questions: A User-Friendly Approach to Active Reward Learning π₯ π
we explore an information gain formulation for optimally selecting questions that naturally account for the human's ability to answer. Our approach identifies questions that optimize the trade-off between robot and human uncertainty, and determines when these questions become redundant or costly. + Volume Removal Solution
πΉ Few-Shot Preference Learning for Human-in-the-Loop RL π₯ π₯
We pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries.
πΉ A Ranking Game for Imitation Learning π₯ π₯
The ranking game additionally affords a broader perspective of imitation, going beyond using only expert demonstrations, and utilizing rankings/preferences over suboptimal behaviors.
πΉ Learning Multimodal Rewards from Rankings π₯
We formulate the multimodal reward learning as a mixture learning problem and develop a novel ranking-based learning approach, where the experts are only required to rank a given set of trajectories.
πΉ Semi-Supervised Imitation Learning of Team Policies from Suboptimal Demonstrations
BTIL:
πΉ Learning Reward Functions from Scale Feedback π
Instead of a strict question on which of the two proposed trajectories the user prefers, we allow for more nuanced feedback using a slider bar.
πΉ Interactive Learning from Policy-Dependent Human Feedback
COACH:
πΉ Towards Sample-efficient Apprenticeship Learning from Suboptimal Demonstration πΆ
SSRR, S3RR: noise-performance curve fitting --> regresses a reward function of trajectory states and actions.
πΉ BASIS FOR INTENTIONS: EFFICIENT INVERSE REINFORCEMENT LEARNING USING PAST EXPERIENCE π₯ π
BASIS, which leverages multi-task RL pre-training and successor features to allow an agent to build a strong basis for intentions that spans the space of possible goals in a given domain.
πΉ POSITIVE-UNLABELED REWARD LEARNING π₯ π
PURL: we connect these two classes of reward learning methods (GAIL, SL) to positive-unlabeled (PU) learning, and we show that by applying a large-scale PU learning algorithm to the reward learning problem, we can address both the reward under- and over-estimation problems simultaneously.
πΉ Combating False Negatives in Adversarial Imitation Learning π
Fake Conditioning
πΉ Task-Relevant Adversarial Imitation Learning π₯ π₯
TRAIL proposes to constrain the GAIL discriminator such that it is not able to distinguish between certain, preselected expert and agent observations which do not contain task behavior.
πΉ Environment Design for Inverse Reinforcement Learning π₯ π₯
We formalise a framework for this environment design process in which learner and expert repeatedly interact, and construct algorithms that actively seek information about the rewards by carefully curating environments for the human to demonstrate the task in.
πΉ Reward Identification in Inverse Reinforcement Learning
πΉ Identifiability in inverse reinforcement learning
-
Multiple-Intent
πΉ LiMIIRL: Lightweight Multiple-Intent Inverse Reinforcement Learning π₯
Multiple-Intent Inverse Reinforcement Learning (MI-IRL) seeks to find a reward function ensemble to rationalize demonstrations of different but unlabelled intents. Within the popular expectation maximization (EM) framework for learning probabilistic MI-IRL models, we present a warm-start strategy based on up-front clustering of the demonstrations in feature space.
-
Meta IRL
πΉ Meta-Inverse Reinforcement Learning with Probabilistic Context Variables π₯
PEMIRL: we propose a deep latent variable model that is capable of learning rewards from demonstrations of distinct but related tasks in an unsupervised way.
-
LfL
πΉ Inverse Reinforcement Learning from a Gradient-based Learner π π₯
LOGEL: the goal is to recover the reward function being optimized by an agent, given a sequence of policies produced during learning.
-
RL From Preferences
πΉ Deep Reinforcement Learning from Human Preferences π₯
We explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments.
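A minimal sketch of the preference-learning objective used in this line of work, assuming a hypothetical `reward_net` torch module over per-step features: predicted segment returns are compared through a Bradley-Terry model and fit with cross-entropy against the human label.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_net, seg_a, seg_b, pref):
    """seg_a, seg_b: (T, feature_dim) tensors for two trajectory segments.
    pref: 1.0 if the human prefers segment A, 0.0 if segment B."""
    ret_a = reward_net(seg_a).sum()  # predicted return of segment A
    ret_b = reward_net(seg_b).sum()  # predicted return of segment B
    # Bradley-Terry: P(A > B) = exp(ret_a) / (exp(ret_a) + exp(ret_b))
    logits = torch.stack([ret_a, ret_b])
    target = torch.tensor([pref, 1.0 - pref])
    return -(target * F.log_softmax(logits, dim=0)).sum()
```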
πΉ Reward learning from human preferences and demonstrations in Atari πΆ
We combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences.
We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning.
πΉ Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback πΆ
We introduce Skill Preferences (SkiP), an algorithm that incorporates human feedback to extract skills from (noisy) offline data and utilize those skills to solve downstream tasks.
We focus on the task of learning from feedback, in which the human trainer not only gives binary evaluative "good" or "bad" feedback for queried state-action pairs, but also provides a visual explanation by annotating relevant features in images. We then propose EXPAND (EXPlanation AugmeNted feeDback) to encourage the model to encode task-relevant features.
πΉ Offline Preference-Based Apprenticeship Learning πΆ
OPAL: Given a database consisting of trajectories without reward labels, we query an expert for preference labels over trajectory segments from the database, learn a reward function from preferences, and then perform offline RL using rewards provided by the learned reward function.
πΉ Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data π π₯ π
This paper introduces a novel strategy called Adaptive Pseudo Augmentation (APA) to encourage healthy competition between the generator and the discriminator. APA alleviates overfitting by employing the generator itself to augment the real data distribution with generated images, which deceives the discriminator adaptively.
πΉ B-Pref: Benchmarking Preference-Based Reinforcement Learning π
We introduce B-Pref: a benchmark specially designed for preference-based RL.
πΉ Batch Reinforcement Learning from Crowds πΆ
This paper tackles a critical challenge that emerged when collecting data from non-expert humans: the noise in preferences.
πΉ Dueling RL: Reinforcement Learning with Trajectory Preferences
πΉ SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation, where we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
πΉ Teachable Reinforcement Learning via Advice Distillation πΆ
We propose a new supervision paradigm for interactive learning based on "teachable" decision-making systems that learn from structured advice provided by an external teacher.
πΉ ReIL: A Framework for Reinforced Intervention-based Imitation Learning π
We introduce Reinforced Intervention-based Learning (ReIL), a framework consisting of a general intervention-based learning algorithm and a multi-task imitation learning model aimed at enabling non-expert users to train agents in real environments with little supervision or fine-tuning.
πΉ Learning to summarize from human feedback πΆ
πΉ MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning πΆ
Through maintaining a distribution over scalarization weights, our approach is able to interactively tune a deep RL agent towards a variety of preferences, while eliminating the need for computing multiple policies.
πΉ Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning π
REED: iterates between encoding environment dynamics in a state-action representation via a self-supervised temporal consistency task, and bootstrapping the preference-based reward function from the state-action representation.
πΉ Reinforcement Learning from Diverse Human Preferences π
The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution.
-
Reward Comparison; PBRS (potential-based reward shaping)
πΉ QUANTIFYING DIFFERENCES IN REWARD FUNCTIONS π₯ π
We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without a policy optimization step.
πΉ DYNAMICS-AWARE COMPARISON OF LEARNED REWARD FUNCTIONS π π
DARD uses an approximate transition model of the environment to transform reward functions into a form that allows for comparisons that are invariant to reward shaping while only evaluating reward functions on transitions close to their training distribution.
πΉ Preprocessing Reward Functions for Interpretability π π₯
We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions, which are then visualized.
πΉ Understanding Learned Reward Functions πΆ
We have explored the use of saliency maps and counterfactuals to understand learned reward functions.
πΉ Explicable Reward Design for Reinforcement Learning Agents π π§
EXPRD allows us to appropriately balance informativeness and sparseness while guaranteeing that an optimal policy induced by the function belongs to a set of target policies. EXPRD builds upon an informativeness criterion that captures the (sub-)optimality of target policies at different time horizons from any given starting state.
πΉ Automatic shaping and decomposition of reward functions
πΉ Dynamic Potential-Based Reward Shaping π
We have proven that a dynamic potential function can be used to shape an agent without altering its optimal policy.
πΉ Expressing Arbitrary Reward Functions as Potential-Based Advice π₯
DPBA: Potential-based reward shaping is a way to provide the agent with a specific form of additional reward, with the guarantee of policy invariance. In this work we give a novel way to incorporate an arbitrary reward function with the same guarantee, by implicitly translating it into the specific form of dynamic advice potentials, which are maintained as an auxiliary value function learnt at the same time.
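For reference, the basic potential-based shaping term that these works build on, as a small sketch (the potential function `phi` is user-supplied; a dynamic variant simply lets the potential also depend on time, keeping timestamps consistent across both terms):

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    """Add F = gamma * phi(s') - phi(s) to the environment reward; this leaves
    the optimal policy unchanged (the policy-invariance guarantee of PBRS)."""
    phi_next = 0.0 if done else phi(s_next)  # common convention: terminal potential is 0
    return r + gamma * phi_next - phi(s)
```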
πΉ Useful Policy Invariant Shaping from Arbitrary Advice π₯ π
PIES biases the agent's policy toward the advice at the start of the learning, when the agent is the most in need of guidance. Over time, PIES gradually decays this bias to zero, ensuring policy invariance.
πΉ Policy Transfer using Reward Shaping π
We presented a novel approach to policy transfer, encoding the transferred policy as a dynamic potential-based reward shaping function, benefiting from all the theory behind reward shaping.
πΉ Reward prediction for representation learning and reward shaping π
Using our representation for preprocessing high-dimensional observations, as well as using the predictor for reward shaping.
-
Inverse constrain learning (ICL)
πΉ Learning Soft Constraints From Constrained Expert Demonstrations π₯
We consider the setting where the reward function is given and the constraints are unknown, and propose a method that is able to recover these constraints satisfactorily from the expert data.
-
Delayed reward
πΉ
πΉ Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems β π₯ π₯ β π§ β
Offline RL with dynamic programming: distributional shift; policy constraints; uncertainty estimation; conservative Q-learning and Pessimistic Value-function;
https://danieltakeshi.github.io/2020/06/28/offline-rl/ π¦
https://ai.googleblog.com/2020/08/tackling-open-challenges-in-offline.html π¦
https://sites.google.com/view/offlinerltutorial-neurips2020/home π¦
πΉ D4RL: DATASETS FOR DEEP DATA-DRIVEN REINFORCEMENT LEARNING π π
examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multitask datasets where an agent performs different tasks in the same environment, and datasets collected with mixtures of policies.
πΉ d3rlpy: An Offline Deep Reinforcement Learning Library π
πΉ A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems π§
πΉ An Optimistic Perspective on Offline Reinforcement Learning π β
To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates.
πΉ OPAL: OFFLINE PRIMITIVE DISCOVERY FOR ACCELERATING OFFLINE REINFORCEMENT LEARNING π₯
When presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Offline unsupervised RL.
πΉ Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization π π π₯
πΉ TOWARDS DEPLOYMENT-EFFICIENT REINFORCEMENT LEARNING: LOWER BOUND AND OPTIMALITY π
We propose such a formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal deployment complexity, whereas in each deployment the policy can sample a large batch of data.
πΉ MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning π₯ π
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection. During each offline training session, we bootstrap the policy update by quantifying the amount of uncertainty within our collected data.
πΉ BENCHMARKS FOR DEEP OFF-POLICY EVALUATION π
πΉ KEEP DOING WHAT WORKED: BEHAVIOR MODELLING PRIORS FOR OFFLINE REINFORCEMENT LEARNING π π₯ π₯
It admits the use of data generated by arbitrary behavior policies and uses a learned prior, the advantage-weighted behavior model (ABM), to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.
extrapolation or bootstrapping errors: (Fujimoto et al., 2018; Kumar et al., 2019)
πΉ Off-Policy Deep Reinforcement Learning without Exploration π₯ π₯ π β β β
BCQ: We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
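A stripped-down sketch of the batch-constrained action selection behind BCQ, under stated assumptions: `behavior_vae` is a hypothetical conditional generative model of dataset actions and `q_net` a critic; the full method additionally trains a perturbation network and clipped double Q-functions.

```python
import torch

@torch.no_grad()
def batch_constrained_action(state, behavior_vae, q_net, num_candidates=10):
    """Pick the highest-Q action among candidates sampled near the data support."""
    states = state.unsqueeze(0).repeat(num_candidates, 1)   # (N, state_dim)
    candidates = behavior_vae.sample(states)                # actions the batch could plausibly contain
    q_values = q_net(states, candidates).squeeze(-1)        # (N,)
    return candidates[q_values.argmax()]
```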
πΉ Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction π₯ π₯ π§ β
BEAR: We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator.
πΉ Conservative Q-Learning for Offline Reinforcement Learning π π₯ π π¦ π₯
conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
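A toy sketch of a conservative penalty in the spirit of CQL (not the authors' implementation): Q-values on actions proposed by the current policy are pushed down while Q-values on dataset actions are pushed up, and this term is added to the usual Bellman loss. `q_net` and `policy` are hypothetical torch modules.

```python
import torch

def conservative_penalty(q_net, policy, states, dataset_actions, num_samples=10, alpha=1.0):
    # Sample actions from the current policy as a stand-in for the logsumexp over the action space.
    sampled = [policy.sample(states) for _ in range(num_samples)]          # each (B, act_dim)
    q_sampled = torch.stack([q_net(states, a) for a in sampled], dim=0)    # (N, B, 1)
    q_data = q_net(states, dataset_actions)                                # (B, 1)
    # Push down out-of-dataset Q, push up in-dataset Q.
    return alpha * (torch.logsumexp(q_sampled, dim=0).mean() - q_data.mean())
```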
πΉ Mildly Conservative Q-Learning for Offline Reinforcement Learning π π₯ π
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
πΉ Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning π π₯ π
We show that naïve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions. We thus develop a simple yet effective algorithm, Constraints Penalized Q-Learning (CPQ), to solve the problem.
πΉ Conservative Offline Distributional Reinforcement Learning π¦
CODAC:
πΉ OFFLINE REINFORCEMENT LEARNING HANDS-ON
πΉ Supervised Off-Policy Ranking π
SOPR: aims to rank a set of target policies based on supervised learning by leveraging off-policy data and policies with known performance. [poster]
πΉ Conservative Data Sharing for Multi-Task Offline Reinforcement Learning π₯ π π₯
Conservative data sharing (CDS): We develop a simple technique for data-sharing in multi-task offline RL that routes data based on the improvement over the task-specific data.
πΉ UNCERTAINTY-BASED MULTI-TASK DATA SHARING FOR OFFLINE REINFORCEMENT LEARNING π₯ π₯ π
UTDS: suboptimality gap of UTDS is related to the expected uncertainty of the shared dataset. (CDS)
πΉ Data Sharing without Rewards in Multi-Task Offline Reinforcement Learning
Conservative unsupervised data sharing (CUDS): under a binary-reward assumption, simply utilizing data from other tasks with constant reward labels can not only provide substantial improvement over only using the single-task data and previously proposed success classifiers, but it can also reach comparable performance to baselines that take advantage of the oracle multi-task reward information.
πΉ How to Leverage Unlabeled Data in Offline Reinforcement Learning π₯ π
We provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels.
πΉ PROVABLE UNSUPERVISED DATA SHARING FOR OFFLINE REINFORCEMENT LEARNING π₯ π
PDS utilizes additional penalties upon the reward function learned from labeled data to avoid potential overestimation of the reward.
πΉ Is Pessimism Provably Efficient for Offline RL? π₯ π π₯
Pessimistic value iteration algorithm (PEVI): incorporates a penalty function (pessimism) into the value iteration algorithm. The penalty function simply flips the sign of the bonus function (optimism) for promoting exploration in online RL. We decompose the suboptimality of any policy into three sources: the spurious correlation, intrinsic uncertainty, and optimization error.
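A tabular sketch of the flipped-bonus idea, assuming count-based uncertainty (`beta / sqrt(count)` is just one common choice, not necessarily the penalty used in the paper):

```python
import numpy as np

def pessimistic_value_iteration(P_hat, R_hat, counts, gamma=0.99, beta=1.0, iters=200):
    """P_hat: (S, A, S) estimated transitions; R_hat: (S, A) estimated rewards;
    counts: (S, A) visitation counts in the offline dataset."""
    penalty = beta / np.sqrt(np.maximum(counts, 1))   # large where the dataset is thin
    V = np.zeros(R_hat.shape[0])
    for _ in range(iters):
        Q = R_hat + gamma * (P_hat @ V) - penalty     # subtract (rather than add) the bonus
        V = np.clip(Q.max(axis=1), 0.0, None)         # keep values in a sane range
    return Q, V
```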
πΉ PESSIMISTIC MODEL-BASED OFFLINE REINFORCEMENT LEARNING UNDER PARTIAL COVERAGE π
Constrained Pessimistic Policy Optimization (CPPO): We study model-based offline RL with function approximation under partial coverage. We show that for the model-based setting, realizability in function class and partial coverage together are enough to learn a policy that is comparable to any policies covered by the offline distribution.
πΉ Corruption-Robust Offline Reinforcement Learning π
πΉ Bellman-consistent Pessimism for Offline Reinforcement Learning π₯ π
We introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations.
πΉ Provably Good Batch Reinforcement Learning Without Great Exploration π₯ π₯
We show that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees. In certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.
πΉ Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity π₯ π§
LCB-Q (value iteration with lower confidence bounds): We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single policy concentrability assumption which does not require the full coverage of the state-action space.
πΉ Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning π₯ π§
This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" μ that is close to the optimal policy π* in a certain sense.
πΉ Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
πΉ Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
Pessimistic Actor Critic for Learning without Exploration (PACLE)
πΉ WHEN SHOULD OFFLINE REINFORCEMENT LEARNING BE PREFERRED OVER BEHAVIORAL CLONING? π π
under what environment and dataset conditions can an offline RL method outperform BC with an equal amount of expert data, even when BC is a natural choice? [Should I Run Offline Reinforcement Learning or Behavioral Cloning?]
πΉ PESSIMISTIC BOOTSTRAPPING FOR UNCERTAINTY-DRIVEN OFFLINE REINFORCEMENT LEARNING π π₯
PBRL: We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty.
πΉ UNCERTAINTY REGULARIZED POLICY LEARNING FOR OFFLINE REINFORCEMENT LEARNING π
Uncertainty Regularized Policy Learning (URPL): URPL adds an uncertainty regularization term in the policy learning objective to enforce to learn a more stable policy under the offline setting. Moreover, we further use the uncertainty regularization term as a surrogate metric indicating the potential performance of a policy.
πΉ Model-Based Offline Meta-Reinforcement Learning with Regularization π π₯ π
We explore model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions.
πΉ BATCH REINFORCEMENT LEARNING THROUGH CONTINUATION METHOD π₯
We propose a simple yet effective approach, soft policy iteration algorithm through continuation method to alleviate two challenges in policy optimization under batch reinforcement learning: (1) highly non-smooth objective function which is difficult to optimize (2) high variance in value estimates.
πΉ SCORE: SPURIOUS CORRELATION REDUCTION FOR OFFLINE REINFORCEMENT LEARNING π π₯ π₯
We propose a practical and theoretically guaranteed algorithm SCORE that reduces spurious correlations by incorporating an uncertainty penalty into policy evaluation. We show that this is consistent with the pessimism principle studied in theory, and the proposed algorithm converges to the optimal policy with a sublinear rate under mild assumptions.
Our proposed MSG algorithm advocates for using independently learned ensembles, without sharing of target values, and this important design decision is supported by empirical evidence.
πΉ S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning πΆ
utilizes data augmentations from states to learn value functions that are better at generalizing and extrapolating when deployed in the environment.
πΉ Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills π π₯ π₯
learning a functional understanding of the environment by learning to reach any goal state in a given dataset. We employ goal-conditioned Qlearning with hindsight relabeling and develop several techniques that enable training in a particularly challenging offline setting.
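A minimal sketch of hindsight goal relabeling, the core data trick such goal-conditioned offline methods rely on; `trajectory` is assumed to be a list of (state, action, next_state) tuples, and a real implementation would replace the identity check with a distance threshold on continuous states.

```python
import random

def relabel_with_hindsight(trajectory, reward_on_reach=1.0):
    """Treat a future state of the same trajectory as the goal the transition was pursuing."""
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        goal = random.choice(trajectory[t:])[2]  # an achieved future state
        reward = reward_on_reach if goal is s_next else 0.0
        relabeled.append((s, a, s_next, goal, reward))
    return relabeled
```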
πΉ Behavior Regularized Offline Reinforcement Learning π₯ π₯ π
we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.
πΉ BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning π₯
We improved the behavior regularized offline RL by proposing a low-variance upper bound of the KL divergence estimator to reduce variance and gradient penalized policy evaluation such that the learned Q functions are guaranteed to converge.
πΉ Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble πΆ
we observe that state-action distribution shift may lead to severe bootstrap error during fine-tuning, which destroys the good initial policy obtained via offline RL.
πΉ Experience Replay with Likelihood-free Importance Weights π₯ π
To balance bias (from off-policy experiences) and variance (from on-policy experiences), we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the learned ratios as the prioritization weights.
πΉ MOORe: Model-based Offline-to-Online Reinforcement Learning π₯
employs a prioritized sampling scheme that can dynamically adjust the offline and online data for smooth and efficient online adaptation of the policy.
πΉ Offline Meta-Reinforcement Learning with Online Self-Supervision π π₯
Unlike the online setting, the adaptation and exploration strategies cannot effectively adapt to each other, resulting in poor performance. we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any ground truth reward labels, to bridge this distribution shift problem.
πΉ Offline Meta-Reinforcement Learning with Advantage Weighting π₯ β
Targeting the offline meta-RL setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW), an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training.
πΉ Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning π₯ π₯
CORRO: which decreases the influence of behavior policies on task representations while supporting tasks that differ in reward function and transition dynamics.
πΉ AWAC: Accelerating Online Reinforcement Learning with Offline Datasets π π₯ π
we systematically analyze why this problem (offline + online) is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies.
πΉ Guiding Online Reinforcement Learning with Action-Free Offline Pretraining πΆ
AF-Guide consists of an Action-Free Decision Transformer (AFDT), implementing a variant of Upside-Down Reinforcement Learning, which learns to plan the next states from the offline dataset, and a Guided Soft Actor-Critic (Guided SAC) that learns online with guidance from AFDT.
πΉ Critic Regularized Regression π π₯ π β β
CRR: Our algorithm can be seen as a form of filtered behavioral cloning where data is selected based on information contained in the policy's Q-function; we do not rely on observed returns for advantage estimation.
πΉ Exponentially Weighted Imitation Learning for Batched Historical Data π π₯ π β β
MARWIL: we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action space.
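A minimal sketch of exponentially advantage-weighted behavior cloning, the common core of MARWIL-style (and AWAC/CRR-style) updates; `policy` is a hypothetical torch module whose distribution returns per-sample log-probabilities, advantages are assumed precomputed, and the temperature convention (divide vs. multiply) varies across papers.

```python
import torch

def advantage_weighted_bc_loss(policy, states, actions, advantages, beta=1.0, max_weight=20.0):
    weights = torch.clamp(torch.exp(advantages / beta), max=max_weight)  # clipped for stability
    log_probs = policy(states).log_prob(actions)                         # per-sample log-likelihood
    return -(weights.detach() * log_probs).mean()
```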
πΉ BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning π π β
BAIL learns a V function, uses the V function to select actions it believes to be high-performing, and then uses those actions to train a policy network using imitation learning.
πΉ Offline RL Without Off-Policy Evaluation π§ π₯
a unified algorithmic template for offline RL algorithms as offline approximate modified policy iteration.
πΉ MODEL-BASED OFFLINE PLANNING π₯
MBOP: Learning dynamics, action priors, and values; MBOP-Policy; MBOP-Trajopt.
πΉ Model-Based Offline Planning with Trajectory Pruning π π₯
MOPP: MOPP avoids over-restrictive planning while enabling offline learning by encouraging more aggressive trajectory rollout guided by the learned behavior policy, and prunes out problematic trajectories by evaluating the uncertainty of the dynamics model.
πΉ Model-based Offline Policy Optimization with Distribution Correcting Regularization π π
DROP (density ratio regularized offline policy learning) estimates the density ratio between model-rollouts distribution and offline data distribution via the DICE framework, and then regularizes the model predicted rewards with the ratio for pessimistic policy learning.
πΉ A Minimalist Approach to Offline Reinforcement Learning π π₯ π β β β
We find that we can match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data.
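The "minimalist" actor objective can be written in a few lines; this is a sketch with placeholder `actor`/`critic` modules, following the paper's normalization of the Q-term by its average magnitude.

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, states, dataset_actions, alpha=2.5):
    pi = actor(states)
    q = critic(states, pi)
    lmbda = alpha / q.abs().mean().detach()   # scale so the RL and BC terms are comparable
    return -lmbda * q.mean() + F.mse_loss(pi, dataset_actions)
```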
πΉ POPO: Pessimistic Offline Policy Optimization π β
Distributional value functions.
πΉ Offline Reinforcement Learning as Anti-Exploration π π₯ β
The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it for exploration.
πΉ MOPO: Model-based Offline Policy Optimization π π₯ π π₯
we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data.
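A sketch of the reward penalty applied to synthetic model rollouts, assuming a learned dynamics ensemble; the ensemble-disagreement proxy below is one common uncertainty estimate, not necessarily the exact estimator from the paper.

```python
import numpy as np

def penalized_reward(reward, next_state_means, lam=1.0):
    """next_state_means: (ensemble_size, state_dim) predictions for the same (s, a)."""
    uncertainty = np.linalg.norm(next_state_means.std(axis=0))  # disagreement across members
    return reward - lam * uncertainty                           # r_tilde = r - lambda * u(s, a)
```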
πΉ Domain Generalization for Robust Model-Based Offline Reinforcement Learning πΆ
DIMORL: Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain.
πΉ MOReL: Model-Based Offline Reinforcement Learning π π₯ π§
This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP.
πΉ COMBO: Conservative Offline Model-Based Policy Optimization π₯ π π₯ β β β
This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation.
πΉ Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning π π₯
SDM-GAN: we regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process. [ppt]
πΉ HYBRID VALUE ESTIMATION FOR OFF-POLICY EVALUATION AND OFFLINE REINFORCEMENT LEARNING π
We propose Hybrid Value Estimation (HVE) to perform a more accurate value function estimation in the offline setting. It automatically adjusts the step length parameter to get a bias-variance trade-off.
πΉ DROMO: Distributionally Robust Offline Model-based Policy Optimization π₯
To extend the basic idea of regularization without uncertainty quantification, we propose distributionally robust offline model-based policy optimization (DROMO), which leverages the ideas in distributionally robust optimization to penalize a broader range of out-of-distribution state-action pairs beyond the standard empirical out-of-distribution Q-value minimization.
πΉ Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL π
MABE: By adaptive behavioral prior, we mean a policy that approximates the behavior in the offline dataset while giving more importance to trajectories with high rewards.
πΉ Offline Reinforcement Learning with Fisher Divergence Critic Regularization π₯ π π§
We propose using a gradient penalty regularizer for the offset term and demonstrate its equivalence to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature.
πΉ Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning π π₯ β β
UWAC: an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
πΉ EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL π₯ π π§
By introducing the Expected-Max Q-Learning operator, we present a novel theoretical setup that takes into account the proposal distribution µ(a|s) and the number of action samples N, and hence more closely matches the resulting practical algorithm.
πΉ Lyapunov Density Models: Constraining Distribution Shift in Learning-Based Control π₯ π
We presented Lyapunov density models (LDMs), a tool that can ensure that an agent remains within the distribution of the training data.
πΉ OFFLINE REINFORCEMENT LEARNING WITH IN-SAMPLE Q-LEARNING π₯ π₯
We presented Implicit Q-Learning (IQL), a general algorithm for offline RL that completely avoids any queries to values of out-of-sample actions during training while still enabling multi-step dynamic programming, by adopting expectile regression. [old]
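The expectile loss at the heart of IQL, as a short sketch: an asymmetric squared error that, for tau > 0.5, regresses V toward an upper expectile of Q over in-dataset actions, approximating a maximum without ever querying out-of-sample actions.

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())  # tau where diff >= 0, (1 - tau) otherwise
    return (weight * diff.pow(2)).mean()
```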
πΉ Continuous Doubly Constrained Batch Reinforcement Learning π₯ π₯
CDC: The first regularizer combats the extra-overestimation bias in regions that are out-of-distribution. The second regularizer is designed to hedge against the adverse effects of policy updates that severely diverge from the behavior policy.
πΉ Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning π π§
ICQ: we propose a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates the extrapolation error by only trusting the state-action pairs given in the dataset for value estimation.
πΉ Offline Model-based Adaptable Policy Learning π π₯ π
MAPLE tries to model all possible transition dynamics in the out-of-support regions. A context encoder RNN is trained to produce latent codes given the episode history, and the encoder and policy are jointly optimized to maximize average performance across a large ensemble of pretrained dynamics models.
πΉ Supported Policy Optimization for Offline Reinforcement Learning πΆ
We present Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of the behavior policy.
πΉ Weighted model estimation for offline model-based reinforcement learning π₯
This paper considers weighting with the state-action distribution ratio of offline data and simulated future data, which can be estimated relatively easily by standard density ratio estimation techniques for supervised learning.
πΉ Batch Reinforcement Learning with Hyperparameter Gradients π π₯ π π₯
BOPAH: Unlike prior work where this trade-off is controlled by hand-tuned hyperparameters (in a generalized KL-regularized RL framework), we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that uses a gradient-based optimization of the hyperparameter using held-out data.
πΉ OFFLINE REINFORCEMENT LEARNING WITH VALUE-BASED EPISODIC MEMORY π π§ π₯
We present a new offline V -learning method, EVL (expectile V -learning), and a novel offline RL framework, VEM (Value-based Episodic Memory). EVL learns the value function through the trade-offs between imitation learning and optimal value learning. VEM uses a memory-based planning scheme to enhance advantage estimation and conduct policy learning in a regression manner. IQL
πΉ Offline Reinforcement Learning with Soft Behavior Regularization π₯ π
Soft Behavior-regularized Actor Critic (SBAC): we design a new behavior regularization scheme for offline RL that enables policy improvement guarantee and state-dependent policy regularization.
πΉ Offline Reinforcement Learning with Pseudometric Learning π π₯ π
In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to enforce the policy to visit state-action pairs close to the support of logged transitions. In this work, we propose an iterative procedure to learn a pseudometric (closely related to bisimulation metrics) from logged transitions, and use it to define this notion of closeness.
πΉ Offline Reinforcement Learning with Reverse Model-based Imagination π₯ π₯
Reverse Offline Model-based Imagination (ROMI): We learn a reverse dynamics model in conjunction with a novel reverse policy, which can generate rollouts leading to the target goal states within the offline dataset.
πΉ Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble π π₯
SAC-N: we propose an uncertainty-based model-free offline RL method that effectively quantifies the uncertainty of the Q-value estimates by an ensemble of Q-function networks and does not require any estimation or sampling of the data distribution.
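A sketch of the SAC-N style target, where pessimism comes entirely from the minimum over a large critic ensemble; `q_ensemble` is a hypothetical list of critics and `policy` is assumed to return a distribution whose log_prob is a per-sample scalar.

```python
import torch

@torch.no_grad()
def ensemble_min_target(q_ensemble, policy, rewards, next_states, dones, gamma=0.99, alpha=0.2):
    dist = policy(next_states)
    next_actions = dist.sample()
    log_prob = dist.log_prob(next_actions)
    q_next = torch.stack([q(next_states, next_actions).squeeze(-1) for q in q_ensemble])
    min_q = q_next.min(dim=0).values              # clipped double-Q generalized to N critics
    return rewards + gamma * (1.0 - dones) * (min_q - alpha * log_prob)
```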
πΉ Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
πΉ ROBUST OFFLINE REINFORCEMENT LEARNING FROM LOW-QUALITY DATA πΆ
AdaPT: we propose an Adaptive Policy constrainT (AdaPT) method, which allows effective exploration on out-of-distribution actions by imposing an adaptive constraint on the learned policy.
πΉ Regularized Behavior Value Estimation π π₯ π₯
R-BVE uses a ranking regularisation term that favours actions in the dataset that lead to successful outcomes. CRR \ MPO.
πΉ Active Offline Policy Selection π π β β
Gaussian process over policy values; Kernel; Active offline policy selection with Bayesian optimization. We proposed a BO solution that integrates OPE estimates with evaluations obtained by interacting with the environment.
πΉ Offline Policy Selection under Uncertainty π¦ β
πΉ Offline Learning from Demonstrations and Unlabeled Experience πΆ π π₯
We proposed offline reinforced imitation learning (ORIL) to enable learning from both demonstrations and a large unlabeled set of experiences without reward annotations.
πΉ Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations π₯ π π₯
DWBC: We introduce an additional discriminator to distinguish expert and non-expert data and propose a cooperation strategy to boost the performance of both tasks. This results in a new policy learning objective, and surprisingly, we find it is equivalent to a generalized BC objective, where the outputs of the discriminator serve as the weights of the BC loss function.
πΉ Discriminator-Guided Model-Based Offline Imitation Learning π₯
(DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations.
πΉ CLARE: CONSERVATIVE MODEL-BASED REWARD LEARNING FOR OFFLINE INVERSE REINFORCEMENT LEARNING π π§
solves offline IRL efficiently via integrating βconservatismβ into a learned reward function and utilizing an estimated dynamics model.
πΉ Offline Preference-Based Apprenticeship Learning π β
OPAL: Given a database consisting of trajectories without reward labels, we query an expert for preference labels over trajectory segments from the database, learn a reward function from preferences, and then perform offline RL using rewards provided by the learned reward function.
πΉ Semi-supervised reward learning for offline reinforcement learning πΆ β
We train a reward function on a pre-recorded dataset, use it to label the data and do offline RL.
πΉ LEARNING VALUE FUNCTIONS FROM UNDIRECTED STATE-ONLY EXPERIENCE π₯
This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e. (s, s', r) tuples).
πΉ Offline Inverse Reinforcement Learning
πΉ Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment π π₯ π
We augment a learned dynamics model with simple transformations that seek to capture potential changes in physical properties of the robot, leading to more robust policies.
πΉ SEMI-PARAMETRIC TOPOLOGICAL MEMORY FOR NAVIGATION
πΉ Mapping State Space using Landmarks for Universal Goal Reaching
πΉ Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
πΉ Hallucinative Topological Memory for Zero-Shot Visual Planning
πΉ Sparse Graphical Memory for Robust Planning π
SGM: aggregates states according to a novel two-way consistency objective, adapting classic state aggregation criteria to goal-conditioned RL: two states are redundant when they are interchangeable both as goals and as starting states.
πΉ Plan2vec: Unsupervised Representation Learning by Latent Plans πΆ
Plan2vec constructs a weighted graph on an image dataset using near-neighbor distances, and then extrapolates this local metric to a global embedding by distilling a path integral over planned paths.
πΉ World Model as a Graph: Learning Latent Landmarks for Planning π
L3P: We devise a novel algorithm to learn latent landmarks that are scattered (in terms of reachability) across the goal space as the nodes on the graph.
πΉ Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning π₯
VMG: we design a graph-structured world model in offline reinforcement learning by building a directed-graph-based Markov decision process (MDP) with rewards allocated to each directed edge as an abstraction of the original continuous environment.
πΉ GOAL-CONDITIONED BATCH REINFORCEMENT LEARNING FOR ROTATION INVARIANT LOCOMOTION πΆ
πΉ Offline Meta-Reinforcement Learning for Industrial Insertion πΆ
We introduced an offline meta-RL algorithm, ODA, that can meta-learn an adaptive policy from offline data, quickly adapt based on a small number of user-provided demonstrations for a new task, and then further adapt through online finetuning.
πΉ Scaling data-driven robotics with reward sketching and batch reinforcement learning
πΉ OFFLINE RL WITH RESOURCE CONSTRAINED ONLINE DEPLOYMENT π
Resource-constrained setting: We highlight the performance gap between policies trained using the full offline dataset and policies trained using limited features.
πΉ Reinforcement Learning from Imperfect Demonstrations π₯
We propose Normalized Actor-Critic (NAC) that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment.
πΉ Curriculum Offline Imitating Learning π₯
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return, and improves the current policy along curriculum stages.
πΉ Dealing with the Unknown: Pessimistic Offline Reinforcement Learning π₯
PessORL: penalize high values at unseen states in the dataset, and to cancel the penalization at in-distribution states.
πΉ Adversarially Trained Actor Critic for Offline Reinforcement Learning π₯ π
We propose Adversarially Trained Actor Critic (ATAC) based on a two-player Stackelberg game framing of offline RL: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. [robust policy improvement] [POSTER]
πΉ RVS: WHAT IS ESSENTIAL FOR OFFLINE RL VIA SUPERVISED LEARNING? πΆ
Simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. [THE ESSENTIAL ELEMENTS OF OFFLINE RL VIA SUPERVISED LEARNING]
πΉ IMPLICIT OFFLINE REINFORCEMENT LEARNING VIA SUPERVISED LEARNING πΆ
IRvS: Implicit Behavior Cloning
πΉ Contrastive Learning as Goal-Conditioned Reinforcement Learning π π₯ π
instead of adding representation learning parts to an existing RL algorithm, we show (contrastive) representation learning methods can be cast as RL algorithms in their own right.
πΉ When does return-conditioned supervised learning work for offline reinforcement learning? π₯
We find that RCSL (return-conditioned SL) returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms.
πΉ Implicit Behavioral Cloning π₯ π π₯
In this paper we showed that reformulating supervised imitation learning as a conditional energy-based modeling problem, with inference-time implicit regression, often greatly outperforms traditional explicit policy baselines.
πΉ Implicit Two-Tower Policies π
Implicit Two-Tower (ITT) policies, where the actions are chosen based on the attention scores of their learnable latent representations with those of the input states.
πΉ Latent-Variable Advantage-Weighted Policy Optimization for Offline RL πΆ
LAPO: we study an offline RL setup for learning from heterogeneous datasets where trajectories are collected using policies with different purposes, leading to a multi-modal data distribution.
πΉ AW-Opt: Learning Robotic Skills with Imitation and Reinforcement at Scale π
Our aim is to test the scalability of prior IL + RL algorithms and devise a system based on detailed empirical experimentation that combines existing components in the most effective and scalable way.
πΉ Offline RL Policies Should be Trained to be Adaptive π₯ π
APE-V: optimal policies for offline RL must be adaptive, depending not just on the current state but rather all the transitions seen so far during evaluation.
πΉ Deconfounded Imitation Learning π₯
We then introduce an algorithm for deconfounded imitation learning, which trains an inference model jointly with a latent-conditional policy. At test time, the agent alternates between updating its belief over the latent and acting under the belief.
πΉ Distance-Sensitive Offline Reinforcement Learning π π₯ π
We propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution.
πΉ RORL: Robust Offline Reinforcement Learning via Conservative Smoothing π π₯
We explicitly introduce regularization on the policy and the value function for states near the dataset and additional conservative value estimation on these OOD states.
πΉ On the Role of Discount Factor in Offline Reinforcement Learning π₯ π₯
This paper examines two distinct effects of discount factor in offline RL with theoretical analysis, namely the regularization effect and the pessimism effect.
πΉ LEARNING PSEUDOMETRIC-BASED ACTION REPRESENTATIONS FOR OFFLINE REINFORCEMENT LEARNING π₯ π
BMA: This paper proposes an action representation learning framework for offline RL based on a pseudometric, which measures both the behavioral relation and the data-distributional relation between actions.
πΉ PLAS: Latent Action Space for Offline Reinforcement Learning π
We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement (OOD action) is naturally satisfied.
πΉ LET OFFLINE RL FLOW: TRAINING CONSERVATIVE AGENTS IN THE LATENT SPACE OF NORMALIZING FLOWS πΆ
CNF: we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. diffusion + RL
πΉ Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations
πΉ BEHAVIOR PRIOR REPRESENTATION LEARNING FOR OFFLINE REINFORCEMENT LEARNING π₯
BPR: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm.
πΉ S2P: State-conditioned Image Synthesis for Data Augmentation in Offline Reinforcement Learning π
we first propose a generative model, S2P (State2Pixel), which synthesizes the raw pixels of the agent from its corresponding state. It enables bridging the gap between the state and the image domain in RL algorithms, and virtually exploring unseen image distributions via model-based transitions in the state space.
πΉ AGENT-CONTROLLER REPRESENTATIONS: PRINCIPLED OFFLINE RL WITH RICH EXOGENOUS INFORMATION π₯
we propose to use multi-step inverse models, which have seen a great deal of interest in the RL theory community, to learn Agent-Controller Representations for Offline-RL (ACRO).
πΉ Back to the Manifold: Recovering from Out-of-Distribution States π₯ π
We alleviate the distributional shift at the deployment time by introducing a recovery policy that brings the agent back to the training manifold whenever it steps out of the in-distribution states, e.g., due to an external perturbation.
πΉ State Deviation Correction for Offline Reinforcement Learning π₯
SDC: We first perturb the states sampled from the logged dataset, then simulate noisy next states on the basis of a dynamics model and the policy. We then train the policy to minimize the distances between the noisy next states and the offline dataset.
πΉ A Policy-Guided Imitation Approach for Offline Reinforcement Learning π₯ π₯
POR: During training, the guide-policy and execute-policy are learned using only data from the dataset, in a supervised and decoupled manner. During evaluation, the guide-policy guides the execute-policy by telling where it should go so that the reward can be maximized.
πΉ OFFLINE REINFORCEMENT LEARNING WITH ADAPTIVE BEHAVIOR REGULARIZATION π₯ π π₯
ABR: a novel offline RL algorithm that achieves an adaptive balance between cloning and improving over the behavior policy. By simply adding a sample-based regularizer to the Bellman backup, we construct an adaptively regularized objective for the policy improvement, which implicitly estimates the probability density of the behavior policy.
πΉ Dual Generator Offline Reinforcement Learning π₯ π
DASCO: training two generators: one that maximizes return, with the other capturing the "remainder" of the data distribution in the offline dataset, such that the mixture of the two is close to the behavior policy.
πΉ Boosting Offline Reinforcement Learning via Data Rebalancing πΆ
ReD (Return-based Data Rebalance)
πΉ Behaviour Discriminator: A Simple Data Filtering Method to Improve Offline Policy Learning πΆ
We propose a behaviour discriminator (BD) concept, a novel and simple data filtering approach based on semi-supervised learning, which can accurately discern expert data from a mixed-quality dataset.
πΉ Robust Imitation of a Few Demonstrations with a Backwards Model π₯
BMIL: We train a generative backwards dynamics model and generate short imagined trajectories from states in the demonstrations. By imitating both demonstrations and these model rollouts, the agent learns the demonstrated paths and how to get back onto these paths.
πΉ FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED ROBOT DATA π₯
we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification.
πΉ CONFIDENCE-CONDITIONED VALUE FUNCTIONS FOR OFFLINE REINFORCEMENT LEARNING π₯ π π₯
CCVL: we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for the confidence level using the history of observations thus far.
πΉ Designing an Offline Reinforcement Learning Objective from Scratch π₯ π
DOS: We leverage the contrastive learning framework to design a scoring metric that gives high scores to policies that imitate the actions yielding relatively high returns while avoiding those yielding relatively low returns.
πΉ πΉ πΉ πΉ πΉ πΉ
β Designs from Data | offline MBO
πΉ Designs from Data: Offline Black-Box Optimization via Conservative Training see here
πΉ OFFLINE MODEL-BASED OPTIMIZATION VIA NORMALIZED MAXIMUM LIKELIHOOD ESTIMATION π π§
we consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points. The proposed normalized maximum likelihood approach provides a principled way to handle uncertainty and out-of-distribution inputs.
πΉ Model Inversion Networks for Model-Based Optimization π₯
MINs: This work addresses data-driven optimization problems, where the goal is to find an input that maximizes an unknown score or reward function given access to a dataset of inputs with corresponding scores.
πΉ RoMA: Robust Model Adaptation for Offline Model-based Optimization π
RoMA consists of two steps: (a) a pre-training strategy to robustly train the proxy model and (b) a novel adaptation procedure of the proxy model to have robust estimates for a specific set of candidate solutions.
πΉ Conservative Objective Models for Effective Offline Model-Based Optimization π₯ π
COMs: We propose conservative objective models (COMs), a method that learns a model of the objective function which lower bounds the actual value of the ground-truth objective on out-of-distribution inputs and uses it for optimization.
πΉ DATA-DRIVEN OFFLINE OPTIMIZATION FOR ARCHITECTING HARDWARE ACCELERATORS π
PRIME: we develop such a data-driven offline optimization method for designing hardware accelerators. PRIME learns a conservative, robust estimate of the desired cost function, utilizes infeasible points and optimizes the design against this estimate without any additional simulator queries during optimization.
πΉ Conditioning by adaptive sampling for robust design π
πΉ DESIGN-BENCH: BENCHMARKS FOR DATA-DRIVEN OFFLINE MODEL-BASED OPTIMIZATION π₯
Design-Bench, a benchmark for offline MBO with a unified evaluation protocol and reference implementations of recent methods.
πΉ User-Interactive Offline Reinforcement Learning π₯ π₯
LION: We propose an algorithm that allows the user to tune this hyperparameter (the proximity of the learned policy to the original policy) at runtime, thereby overcoming both of the above-mentioned issues simultaneously.
πΉ Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning πΆ
We compare model-free, model-based, as well as hybrid offline RL approaches on various industrial benchmark (IB) datasets to test the algorithms in settings closer to real world problems, including complex noise and partially observable states.
πΉ Autofocused oracles for model-based design π₯ π₯
we now reformulate the MBD problem as a non-zero-sum game, which suggests an algorithmic strategy for iteratively updating the oracle within any MBO algorithm
πΉ Data-Driven Offline Decision-Making via Invariant Representation Learning π₯ π π₯
IOM: addresses distributional shift by enforcing invariance between the learned representations of the training dataset and optimized decisions.
πΉ Towards good validation metrics for generative models in offline model-based optimisation π
we propose a principled evaluation framework for model-based optimisation to measure how well a generative model can extrapolate.
πΉ πΉ πΉ πΉ πΉ πΉ
πΉ The Challenges of Exploration for Offline Reinforcement Learning πΆ
With Explore2Offline, we propose to evaluate the quality of collected data by transferring the collected data and inferring policies with reward relabelling and standard offline RL algorithms
πΉ RISK-AVERSE OFFLINE REINFORCEMENT LEARNING π π₯
we present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting.
πΉ REVISITING DESIGN CHOICES IN OFFLINE MODEL-BASED REINFORCEMENT LEARNING π₯
we compare these heuristics (for model uncertainty), and design novel protocols to investigate their interaction with other hyperparameters, such as the number of models, or imaginary rollout horizon. Using these insights, we show that selecting these key hyperparameters using Bayesian Optimization produces superior configurations.
πΉ Latent Plans for Task-Agnostic Offline Reinforcement Learning π
TACO-RL: we combine a low-level policy that learns latent skills via imitation learning and a high-level policy learned from offline reinforcement learning for skill-chaining the latent behavior priors.
πΉ DR3: VALUE-BASED DEEP REINFORCEMENT LEARNING REQUIRES EXPLICIT REGULARIZATION π π₯ π π₯
Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive βaliasingβ.
πΉ ACTOR-CRITIC ALIGNMENT FOR OFFLINE-TO-ONLINE REINFORCEMENT LEARNING π
ACA: discarding Q-values learned offline as a means to combat distribution shift in offline-to-online RL
πΉ CONTEXTUAL TRANSFORMER FOR OFFLINE REINFORCEMENT LEARNING π₯
we explore how prompts can help sequence-modeling based offline-RL algorithms; we then extend the framework to the meta-RL setting and propose Contextual Meta Transformer (CMT).
πΉ HYPER-DECISION TRANSFORMER FOR EFFICIENT ONLINE POLICY ADAPTATION πΆ
HDT: augment the base DT with an adaptation module, whose parameters are initialized by a hyper-network. When encountering unseen tasks, the hyper-network takes a handful of demonstrations as inputs and initializes the adaptation module accordingly.
πΉ ACQL: AN ADAPTIVE CONSERVATIVE Q-LEARNING FRAMEWORK FOR OFFLINE REINFORCEMENT LEARNING π π₯ π₯
two weight functions, corresponding to the out-of-distribution (OOD) actions and actions in the dataset, are introduced to adaptively shape the Q-function.
πΉ ENTROPY-REGULARIZED MODEL-BASED OFFLINE REINFORCEMENT LEARNING π π₯ π
EMO: we devise a hybrid loss function that minimizes the negative log-likelihood of the model on the distribution of the offline data while maximizing the entropy in areas where the support of the data is minimal or absent.
πΉ OPTIMAL TRANSPORT FOR OFFLINE IMITATION LEARNING πΆ
OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration to obtain a similarity measure that can be interpreted as a reward, which can then be used by an offline RL algorithm to learn the policy.
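A rough sketch of this kind of OT-based reward labelling, assuming the POT package (`pip install pot`); the cost normalization and reward scaling below are illustrative choices, not the paper's exact recipe.
```python
# Label an unlabelled trajectory with rewards derived from its entropic-OT
# alignment to a single expert demonstration.
import numpy as np
import ot  # Python Optimal Transport

def ot_rewards(traj_states, expert_states, reg=0.1):
    n, m = len(traj_states), len(expert_states)
    M = ot.dist(traj_states, expert_states, metric='euclidean')  # pairwise cost matrix
    M = M / M.max()                                   # normalize costs for a stable Sinkhorn solve
    T = ot.sinkhorn(ot.unif(n), ot.unif(m), M, reg)   # entropic OT coupling
    step_cost = (T * M).sum(axis=1)                   # transport cost attributed to each step
    return -step_cost * n                             # cheaper alignment => higher reward

traj = np.random.randn(50, 11)      # toy unlabelled trajectory (50 steps, 11-dim states)
expert = np.random.randn(40, 11)    # toy expert demonstration
print(ot_rewards(traj, expert)[:5])
```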
πΉ MIND THE GAP: OFFLINE POLICY OPTIMIZATION FOR IMPERFECT REWARDS π₯ π
RGM: the upper layer optimizes a reward correction term that performs state-action visitation distribution matching w.r.t. a small set of expert data; and the lower layer solves a pessimistic RL problem with the corrected rewards. DICE
πΉ OFFLINE IMITATION LEARNING BY CONTROLLING THE EFFECTIVE PLANNING HORIZON π₯ π§
IGI: we analyze the effect of controlling the discount factor on offline IL and argue that the discount factor can act as a regularizer that prevents the sampling error of the supplementary dataset from hurting performance.
πΉ MUTUAL INFORMATION REGULARIZED OFFLINE REINFORCEMENT LEARNING π π₯
MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
πΉ ON THE IMPORTANCE OF THE POLICY STRUCTURE IN OFFLINE REINFORCEMENT LEARNING πΆ π
V2AE (Value-Weighted Variational Auto-Encoder): The V2AE algorithm can be interpreted as an approach that divides the state-action space by learning the discrete latent variable and learns the corresponding sub-policies in each region.
πΉ CURIOSITY-DRIVEN UNSUPERVISED DATA COLLECTION FOR OFFLINE REINFORCEMENT LEARNING πΆ
CUDC:
πΉ DISCOVERING GENERALIZABLE MULTI-AGENT COORDINATION SKILLS FROM MULTI-TASK OFFLINE DATA π
ODIS: first extracts task-invariant coordination skills from offline multi-task data and learns to delineate different agent behaviors with the discovered coordination skills. Then we train a coordination policy to choose optimal coordination skills with the centralized training and decentralized execution paradigm.
πΉ SKILL DISCOVERY DECISION TRANSFORMER πΆ
We proposed Skill DT, a variant of Generalized DT, to explore the capabilities of offline skill discovery with sequence modelling.
πΉ HARNESSING MIXED OFFLINE REINFORCEMENT LEARNING DATASETS VIA TRAJECTORY WEIGHTING πΆ π
We show that state-of-the-art offline RL algorithms are overly constrained in mixed datasets with high RPSV (return positive-sided variance) and under-utilize the minority data.
πΉ EFFICIENT OFFLINE POLICY OPTIMIZATION WITH A LEARNED MODEL π₯
ROSMO: Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvements are towards the direction that maximizes the estimated advantage with regularization of the dataset. (MuZero Unplugged)
πΉ CONSERWEIGHTIVE BEHAVIORAL CLONING FOR RELIABLE OFFLINE REINFORCEMENT LEARNING πΆ
ConserWeightive Behavioral Cloning (CWBC): trajectory weighting and conservative regularization.
πΉ TAMING POLICY CONSTRAINED OFFLINE REINFORCEMENT LEARNING FOR NON-EXPERT DEMONSTRATIONS π₯ π
we first introduce gradient penalty over the learned value function to tackle the exploding Q-function gradients induced by the failed closeness constraint on non-expert states. + critic weighted constraint relaxation.
πΉ POLICY EXPANSION FOR BRIDGING OFFLINE-TO-ONLINE REINFORCEMENT LEARNING π π₯
PEX: After learning the offline policy, we use it as one candidate policy in a policy set, and further learn another policy that will be responsible for further learning as an expansion to the policy set. The two policies will be composed in an adaptive manner for interacting with the environment.
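A minimal sketch of the expansion idea as I read it: a frozen offline policy and a new online policy each propose an action, and the two proposals are composed adaptively via a Boltzmann distribution over their Q-values. Every component below is a toy stand-in.
```python
# Adaptive composition of a frozen offline policy and a new online policy.
import torch
import torch.nn.functional as F

def compose_action(state, pi_offline, pi_new, q_fn, temperature=1.0):
    """Sample which proposal to execute from a softmax over their Q-values,
    so the better-looking policy is picked more often at each state."""
    candidates = torch.stack([pi_offline(state), pi_new(state)], dim=0)  # (2, act_dim)
    q_vals = torch.stack([q_fn(state, a) for a in candidates])           # (2,)
    probs = F.softmax(q_vals / temperature, dim=0)
    idx = torch.multinomial(probs, 1).item()
    return candidates[idx]

# toy components for a smoke test
pi_offline = lambda s: torch.tanh(s[:3])
pi_new     = lambda s: torch.tanh(-s[:3])
q_fn       = lambda s, a: s.sum() + a.sum()
print(compose_action(torch.randn(8), pi_offline, pi_new, q_fn))
```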
πΉ WHEN DATA GEOMETRY MEETS DEEP FUNCTION: GENERALIZING OFFLINE REINFORCEMENT LEARNING π₯ π
DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution.
πΉ THE IN-SAMPLE SOFTMAX FOR OFFLINE REINFORCEMENT LEARNING π₯ π₯
In-Sample Actor-Critic: policy optimization using the in-sample softmax.
πΉ IN-SAMPLE ACTOR CRITIC FOR OFFLINE REINFORCEMENT LEARNING π₯
In-sample Actor Critic (IAC): conduct in-sample learning by sampling-importance resampling.
πΉ OFFLINE Q-LEARNING ON DIVERSE MULTI-TASK DATA BOTH SCALES AND GENERALIZES
This work shows that offline Q-learning can scale to high-capacity models trained on large, diverse datasets.
πΉ PRE-TRAINING FOR ROBOTS: LEVERAGING DIVERSE MULTITASK DATA VIA OFFLINE RL
PTR: a framework based on offline RL that attempts to effectively learn new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task.
πΉ DEEP AUTOREGRESSIVE DENSITY NETS VS NEURAL ENSEMBLES FOR MODEL-BASED OFFLINE REINFORCEMENT LEARNING
we ask which dynamics models, estimating their own uncertainty, work best for conservatism-based model-based offline RL algorithms.
πΉ SPARSE Q-LEARNING: OFFLINE REINFORCEMENT LEARNING WITH IMPLICIT VALUE REGULARIZATION π π₯ π π₯
Implicit Value Regularization (IVR) framework + Sparse Q-learning (SQL).
πΉ EXTREME Q-LEARNING: MAXENT RL WITHOUT ENTROPY π π§
Using EVT, we derive our Extreme Q-Learning framework and consequently online and offline MaxEnt Q-learning algorithms, that do not explicitly require access to a policy or its entropy.
πΉ IS CONDITIONAL GENERATIVE MODELING ALL YOU NEED FOR DECISION-MAKING? πΆ
Decision Diffuser: a conditional generative model for sequential decision making.
πΉ SPRINT: SCALABLE SEMANTIC POLICY PRETRAINING VIA LANGUAGE INSTRUCTION RELABELING πΆ
πΉ Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models π₯
DIAL: we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data and then train language-conditioned policies on the augmented datasets.
πΉ PSEUDOMETRIC GUIDED ONLINE QUERY AND UPDATE FOR OFFLINE REINFORCEMENT LEARNING π₯
PGO2 has a structural design between the Q-neural network and the Siamese network, which guarantees simultaneous Q-network updating and pseudometric learning, promoting Q-network fine-tuning. In the inference phase, PGO2 solves convex optimizations to identify optimal query actions.
πΉ LIGHTWEIGHT UNCERTAINTY FOR OFFLINE REINFORCEMENT LEARNING VIA BAYESIAN POSTERIOR π₯
we propose a lightweight uncertainty quantifier based on approximate Bayesian inference in the last layer of the Q-network, which estimates the Bayesian posterior with minimal parameters in addition to the ordinary Q-network. Moreover, to avoid mode collapse in OOD samples and improve diversity in the Q-posterior, we introduce a repulsive force for OOD predictions in training.
πΉ Q-ENSEMBLE FOR OFFLINE RL: DON'T SCALE THE ENSEMBLE, SCALE THE BATCH SIZE
πΉ EFFECTIVE OFFLINE REINFORCEMENT LEARNING VIA CONSERVATIVE STATE VALUE ESTIMATION
CSVE:
πΉ CONTRASTIVE VALUE LEARNING: IMPLICIT MODELS FOR SIMPLE OFFLINE RL π₯ π π₯
CVL: learn a different type of model for offline RL, a model which (1) will not require predicting high-dimensional observations and (2) can be directly used to estimate Q-values without requiring either model-based rollouts or model-free temporal difference learning.
πΉ FINE-TUNING OFFLINE POLICIES WITH OPTIMISTIC ACTION SELECTION π₯
O3F: A key insight of our method is that we collect optimistic data without changing the training objective. To collect such exploratory data, we aim to use the knowledge embedded in the Q-function to direct exploration, i.e., selecting actions that are estimated to be better than the ones given by the policy.
πΉ SEMI-SUPERVISED OFFLINE REINFORCEMENT LEARNING WITH ACTION-FREE TRAJECTORIES πΆ
SS-ORL contains three simple and scalable steps: (1) train a multi-transition inverse dynamics model on labelled data, which predicts actions based on transition sequences, (2) fill in proxy-actions for unlabelled data, and finally (3) train an offline RL agent on the combined dataset.
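A compact sketch of the three-step recipe, using a single-transition inverse dynamics model for brevity (the paper uses multi-transition sequences); all data and networks are toy stand-ins.
```python
# (1) train an inverse dynamics model on labelled data, (2) pseudo-label the
# action-free data, (3) hand the combined dataset to any offline RL learner.
import torch
import torch.nn as nn

S, A = 11, 3
idm = nn.Sequential(nn.Linear(2 * S, 128), nn.ReLU(), nn.Linear(128, A))
opt = torch.optim.Adam(idm.parameters(), lr=1e-3)

# (1) fit the inverse dynamics model on the action-labelled subset
s, s_next, a = torch.randn(512, S), torch.randn(512, S), torch.randn(512, A)
for _ in range(200):
    loss = nn.functional.mse_loss(idm(torch.cat([s, s_next], -1)), a)
    opt.zero_grad(); loss.backward(); opt.step()

# (2) fill in proxy actions for the action-free trajectories
u, u_next = torch.randn(2048, S), torch.randn(2048, S)
with torch.no_grad():
    proxy_a = idm(torch.cat([u, u_next], -1))

# (3) merge labelled and pseudo-labelled data for the offline RL stage
full_s = torch.cat([s, u]); full_a = torch.cat([a, proxy_a])
print(full_s.shape, full_a.shape)   # dataset ready for e.g. TD3+BC / CQL
```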
πΉ A CONNECTION BETWEEN ONE-STEP RL AND CRITIC REGULARIZATION IN REINFORCEMENT LEARNING π π₯ π π₯
applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL.
πΉ DICHOTOMY OF CONTROL: SEPARATING WHAT YOU CAN CONTROL FROM WHAT YOU CANNOT π₯
DoC: conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment.
πΉ CORRECTING DATA DISTRIBUTION MISMATCH IN OFFLINE META-REINFORCEMENT LEARNING WITH FEW-SHOT ONLINE ADAPTATION π π§
GCC: To align adaptation context with the meta-training distribution, GCC utilizes greedy task inference, which diversely samples βtask hypothesesβ and selects a hypothesis with the highest return to update the belief
πΉ OFFLINE REINFORCEMENT LEARNING FROM HETEROSKEDASTIC DATA VIA SUPPORT CONSTRAINTS π π π₯
CQL (ReDS): the learned policy should be free to choose per state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy.
πΉ OFFLINE REINFORCEMENT LEARNING VIA WEIGHTED f-DIVERGENCE π₯ π₯
DICE: we presented DICE via weighted f-divergence, a framework to control the degree of regularization on each state-action by adopting weight k to f-divergence.
πΉ Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning πΆ
We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability.
πΉ HYBRID RL: USING BOTH OFFLINE AND ONLINE DATA CAN MAKE RL EFFICIENT π₯
Hybrid Q-Learning or Hy-Q: we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank.
πΉ UniMASK: Unified Inference in Sequential Decision Problems π₯
We show that a single UniMASK model is often capable of carrying out many tasks with performance similar to or better than single-task models.
πΉ OFFLINE REINFORCEMENT LEARNING WITH CLOSEDFORM POLICY IMPROVEMENT OPERATORS π₯ π
The behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. As practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture, giving rise to a closed-form policy improvement operator.
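A one-Gaussian simplification of what a closed-form-style improvement step could look like: linearize Q around the behavior mean and take a single covariance-scaled gradient step. The critic and behavior statistics below are assumptions for illustration, not the paper's operator.
```python
# Single Taylor/gradient step on Q starting from the behavior mean, scaled by
# the behavior variance and a trust-region coefficient `alpha`.
import torch

def improved_action(state, behavior_mean, behavior_var, q_fn, alpha=0.1):
    a = behavior_mean(state).clone().requires_grad_(True)
    q = q_fn(state, a)
    (grad_a,) = torch.autograd.grad(q, a)        # linearize Q around the behavior mean
    return behavior_mean(state) + alpha * behavior_var(state) * grad_a

# toy stand-ins
behavior_mean = lambda s: torch.tanh(s[:4])
behavior_var  = lambda s: torch.ones(4) * 0.2
q_fn          = lambda s, a: -(a - 0.5).pow(2).sum() + s.sum()
print(improved_action(torch.randn(9), behavior_mean, behavior_var, q_fn))
```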
πΉ STATE-AWARE PROXIMAL PESSIMISTIC ALGORITHMS FOR OFFLINE REINFORCEMENT LEARNING π§
State-Aware Conservative Q-Learning (SA-CQL):
πΉ IN-SAMPLE ACTOR CRITIC FOR OFFLINE REINFORCEMENT LEARNING π₯
IAC: utilizes sampling-importance resampling to execute in-sample policy evaluation. IAC only uses the target Q-values of the actions in the dataset to evaluate the trained policy, thus avoiding extrapolation error.
πΉ Future-conditioned Unsupervised Pretraining for Decision Transformer π₯ π
PDT: this feature can be easily incorporated into a return-conditioned framework for online finetuning, by assigning return values to possible futures and sampling future embeddings based on their respective values.
-
Exploration Strategies in Deep Reinforcement Learning [chinese] π¦ π₯ π₯ π₯
πΉ VIME: Variational Information Maximizing Exploration π π π§ β βBNN
the agent should take actions that maximize the reduction in uncertainty about the dynamics.
πΉ Self-Supervised Exploration via Disagreement π
an ensemble of dynamics models and incentivize the agent to explore such that the disagreement of those ensembles is maximized.
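A minimal numpy sketch of the disagreement bonus: the intrinsic reward is the variance of next-state predictions across an ensemble (random linear maps stand in for trained dynamics models here).
```python
import numpy as np

rng = np.random.default_rng(0)
S, A, K = 8, 2, 5                                    # state dim, action dim, ensemble size
ensemble = [rng.normal(size=(S + A, S)) for _ in range(K)]

def intrinsic_reward(state, action):
    x = np.concatenate([state, action])
    preds = np.stack([x @ W for W in ensemble])      # (K, S) next-state predictions
    return preds.var(axis=0).mean()                  # high disagreement => high bonus

print(intrinsic_reward(rng.normal(size=S), rng.normal(size=A)))
```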
πΉ DORA THE EXPLORER: DIRECTED OUTREACHING REINFORCEMENT ACTION-SELECTION π₯ π π§
We propose E-values, a generalization of counters that can be used to evaluate the propagating exploratory value over state-action trajectories. [The Hebrew University of Jerusalem] π
πΉ EXPLORATION BY RANDOM NETWORK DISTILLATION π₯ π medium π β
based on random network distillation (RND) bonus
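A minimal sketch of an RND-style bonus: a predictor network is trained to match a frozen, randomly initialized target network, so the prediction error stays high only on unvisited states. Sizes and the training loop are toy choices.
```python
import torch
import torch.nn as nn

S, D = 8, 16
target = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, D))
for p in target.parameters():
    p.requires_grad_(False)                       # the target stays fixed forever
predictor = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, D))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def rnd_bonus(states):
    return (predictor(states) - target(states)).pow(2).mean(dim=-1)

# training on visited states shrinks the bonus exactly where the agent has been
visited = torch.randn(256, S)
for _ in range(100):
    loss = rnd_bonus(visited).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(rnd_bonus(torch.randn(4, S)))               # unvisited states keep a higher bonus
```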
πΉ Randomized Prior Functions for Deep Reinforcement Learning π₯ π₯ π¦ β β β
πΉ Large-Scale Study of Curiosity-Driven Learning π β
πΉ NEVER GIVE UP: LEARNING DIRECTED EXPLORATION STRATEGIES π π
episodic memory based intrinsic reward using k-nearest neighbors; self-supervised inverse dynamics model; Universal Value Function Approximators; different degrees of exploration/exploitation; distributed RL;
πΉ Self-Imitation Learning via Trajectory-Conditioned Policy for Hard-Exploration Tasks π¦ β
πΉ Planning to Explore via Self-Supervised World Models π₯ π₯ π β β βExperiment is good!
a self-supervised reinforcement learning agent that tackles both these challenges through a new approach to self-supervised exploration and fast adaptation to new tasks, which need not be known during exploration. Unlike prior methods which retrospectively compute the novelty of observations after the agent has already reached them, our agent acts efficiently by leveraging planning to seek out expected future novelty.
πΉ BYOL-Explore: Exploration by Bootstrapped Prediction π₯
BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all together by optimizing a single prediction loss in the latent space with no additional auxiliary objective.
πΉ Efficient Exploration via State Marginal Matching π₯ π π§ π₯ β
our work unifies prior exploration methods as performing approximate distribution matching, and explains how state distribution matching can be performed properly
πΉ hard exploration
β Provably efficient RL with Rich Observations via Latent State Decoding π β
Block MDP:
β Provably Efficient Exploration for RL with Unsupervised Learning π β
πΉ Learning latent state representation for speeding up exploration π
Prior experience on separate but related tasks help learn representations of the state which are effective at predicting instantaneous rewards.
πΉ Self-Imitation Learning [reward shaping] π π₯
exploiting past good experiences can indirectly drive deep exploration. we consider exploiting what the agent has experienced, but has not yet learned. Related work: Exploration; Episodic control; Experience replay; Experience replay for actor-critic; Connection between policy gradient and Q-learning; Learning from imperfect demonstrations.
πΉ Generative Adversarial Self-Imitation Learning [reward shaping] π₯ β
GASIL focuses on reproducing past good trajectories, which can potentially make long-term credit assignment easier when rewards are sparse and delayed.
πΉ Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration π₯ π§
To take advantage of the previous sample distribution from the replay buffer for sample-efficient exploration, we propose sample-aware entropy regularization, which maximizes the entropy of the weighted sum of the policy action distribution and the sample action distribution from the replay buffer.
πΉ LEARNING SELF-IMITATING DIVERSE POLICIES π π₯ π₯ β
We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that with Jensen-Shannon divergence, this divergence minimization problem can be reduced into a policy-gradient algorithm with shaped rewards learned from experience replays. One approach to achieve better exploration in challenging cases like above is to simultaneously learn multiple diverse policies and enforce them to explore different parts of the high dimensional space.
πΉ Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm 2016
πΉ ADVERSARIALLY GUIDED ACTOR-CRITIC π π₯ π₯
While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary predictions.
πΉ Diversity-Driven Exploration Strategy for Deep Reinforcement Learning π π₯ β β
adding a distance measure regularization to the loss function,
πΉ Provably Efficient Maximum Entropy Exploration π
πΉ Reward-Free Exploration for Reinforcement Learning π π₯ π₯
How can we efficiently explore an environment without using any reward information? In the exploration phase, the agent first collects trajectories from an MDP M without a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies for M for a collection of given reward functions.
πΉ Rethinking Exploration for Sample-Efficient Policy Learning π β
BBE: bias with finite samples, slow adaptation to decaying bonuses, and lack of optimism on unseen transitions ---> UFO, produces policies that are Unbiased with finite samples, Fast-adapting as the exploration bonus changes, and Optimistic with respect to new transitions.
πΉ Provably Efficient Exploration in Policy Optimization π
design a provably efficient policy optimization algorithm that incorporates exploration.
πΉ Dynamic Bottleneck for Robust Self-Supervised Exploration π π₯
We propose a Dynamic Bottleneck (DB) model, which attains a dynamics-relevant representation based on the information-bottleneck principle. Based on the DB model, we further propose DB-bonus, which encourages the agent to explore state-action pairs with high information gain.
πΉ Principled Exploration via Optimistic Bootstrapping and Backward Induction π π₯
We propose a principled exploration method for DRL through Optimistic Bootstrapping and Backward Induction (OB2I). OB2I constructs a general-purpose UCB-bonus through non-parametric bootstrap in DRL. The UCB-bonus estimates the epistemic uncertainty of state-action pairs for optimistic exploration.
πΉ A Max-Min Entropy Framework for Reinforcement Learning π π§
The proposed max-min entropy framework aims to learn to visit states with low entropy and maximize the entropy of these low-entropy states to promote exploration.
πΉ Exploration in Deep Reinforcement Learning: A Comprehensive Survey π¦
πΉ HYPERDQN: A RANDOMIZED EXPLORATION METHOD FOR DEEP REINFORCEMENT LEARNING π§ π
We present a practical exploration method to address the limitations of RLSVI and BootDQN.
πΉ Rapid Exploration for Open-World Navigation with Latent Goal Models π
RECON: We use an information bottleneck to regularize the learned policy, giving us (i) a compact visual representation of goals, (ii) improved generalization capabilities, and (iii) a mechanism for sampling feasible goals for exploration.
πΉ Better Exploration with Optimistic Actor-Critic π
OAC: we propose Optimistic Actor-Critic, which approximates a lower and upper confidence bound on the state-action value function. This allows us to apply the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation.
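A simplified numpy sketch of the optimism/pessimism split: an epistemic upper bound (mean plus a multiple of the std over critics) steers exploration, while the pessimistic lower bound would be used for actor training. The critic ensemble here is a random stand-in, and OAC itself uses a gradient-based shift rather than this grid search.
```python
import numpy as np

rng = np.random.default_rng(1)
K, S_DIM = 5, 6
W = rng.normal(size=(K, S_DIM))                       # toy weights for K bootstrapped critics

def critic_ensemble(state, action):
    # K toy Q-estimates that disagree more for actions far from each critic's preference
    return W @ state - (action - np.linspace(-0.2, 0.2, K)) ** 2

def bounds(state, action, beta=1.0):
    q = critic_ensemble(state, action)
    return q.mean() - beta * q.std(), q.mean() + beta * q.std()   # (pessimistic, optimistic)

s = rng.normal(size=S_DIM)
candidates = np.linspace(-1.0, 1.0, 21)
upper = [bounds(s, a)[1] for a in candidates]
print(candidates[int(np.argmax(upper))])              # optimistic (exploration) action
```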
πΉ Guided Exploration in Reinforcement Learning via Monte Carlo Critic Optimization πΆ
An ensemble of Monte Carlo Critics that provides exploratory direction is presented as a controller.
πΉ Tactical Optimism and Pessimism for Deep Reinforcement Learning π₯
TOP: we propose the use of an adaptive approach in which the degree of optimism or pessimism is adjusted dynamically during training. As a consequence of this approach, the optimal degree of optimism can vary across tasks and over the course of a single training run as the model improves.
πΉ Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction π
We adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
πΉ A Unified Framework for Conservative Exploration
πΉ A RISK-SENSITIVE POLICY GRADIENT METHOD
πΉ Policy Gradient for Coherent Risk Measures
πΉ Wasserstein Unsupervised Reinforcement Learning π
By maximizing Wasserstein distance, the agents equipped with different policies may drive themselves to enter different areas of the state space and keep as "far" as possible from each other to earn greater diversity.
πΉ Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning π₯
we propose a novel Meta-RL framework called CCM (Contrastive learning augmented Context-based Meta-RL). We first focus on the contrastive nature behind different tasks and leverage it to train a compact and sufficient context encoder. Further, we train a separate exploration policy and theoretically derive a new information-gain-based objective which aims to collect informative trajectories in a few steps.
πΉ Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices π₯ π
DREAM: We learn an exploitation policy without the need for exploration, by conditioning on a learned representation of the problem ID, which provides task-relevant information. We apply an information bottleneck to this representation to encourage discarding of any information not required by the exploitation policy (i.e., task-irrelevant information). Then, we learn an exploration policy to only discover task-relevant information by training it to produce trajectories containing the same information as the learned ID representation.
πΉ MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration π₯
We explicitly model the problem of exploration policy learning, and propose a novel empowerment-driven exploration objective, which aims at maximizing the agent's information gain about the current task.
πΉ REWARDING EPISODIC VISITATION DISCREPANCY FOR EXPLORATION IN REINFORCEMENT LEARNING πΆ
REVD provides intrinsic rewards by evaluating the Renyi divergence-based visitation discrepancy between episodes.
πΉ Redeeming Intrinsic Rewards via Constrained Optimization π₯ π₯
EIPO: automatically tunes the importance of the intrinsic reward: it suppresses the intrinsic reward when exploration is unnecessary and increases it when exploration is required.
πΉ Curiosity in Hindsight π₯ π
BYOL-Hindsight: Our key idea is to learn representations of the future that capture precisely the unpredictable aspects of each outcome, not any more, not any less, which we use as additional input for predictions, such that intrinsic rewards do vanish in the limit.
πΉ CIM: Constrained Intrinsic Motivation for Sparse-Reward Continuous Control π₯
CIM: leverage readily attainable task priors to construct a constrained intrinsic objective, and at the same time, exploit the Lagrangian method to adaptively balance the intrinsic and extrinsic objectives via a simultaneous-maximization framework.
-
Causal inference [ see more in OOD & inFERENCe's blog ]
πΉ
-
reasoning
πΉ CAUSAL DISCOVERY WITH REINFORCEMENT LEARNING πΆ β
πΉ DEEP REINFORCEMENT LEARNING WITH CAUSALITY-BASED INTRINSIC REWARD π
The proposed algorithm learns a graph to encode the environmental structure by calculating Average Causal Effect (ACE) between different categories of entities, and an intrinsic reward is given to encourage the agent to interact more with entities belonging to top-ranked categories, which significantly boosts policy learning.
πΉ Causal Confusion in Imitation Learning π π§
propose a solution to combat it through targeted interventions (either environment interaction or expert queries) to determine the correct causal model.
πΉ Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation π₯ π₯
CTRL: To address the issues of mechanism heterogeneity and related data scarcity, we propose a data-efficient RL algorithm that exploits structural causal models (SCMs) to model the state dynamics, which are estimated by leveraging both commonalities and differences across subjects.
πΉ Causal Dynamics Learning for Task-Independent State Abstraction π₯
CDL: first learns a theoretically proved causal dynamics model that removes unnecessary dependencies between state variables and the action, thus generalizing well to unseen states. A state abstraction can then be derived from the learned dynamics.
-
cv
πΉ OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization π₯ π
Evaluate OoD generalization algorithms comprehensively on two types of datasets, one dominated by diversity shift and the other dominated by correlation shift. β
πΉ LEARNING TO REACH GOALS VIA ITERATED SUPERVISED LEARNING πΆ π β
GCSL: an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. see more in RVS and https://www.youtube.com/watch?v=sVPm7zOrBxM&ab_channel=RAIL
πΉ RETHINKING GOAL-CONDITIONED SUPERVISED LEARNING AND ITS CONNECTION TO OFFLINE RL π₯ π π
We propose Weighted GCSL (WGCSL), in which we introduce an advanced compound weight consisting of three parts (1) discounted weight for goal relabeling, (2) goal-conditioned exponential advantage weight, and (3) best advantage weight.
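A schematic of how such a compound weight could be assembled on a relabelled transition; the advantage estimate, threshold, and the small weight assigned to non-best transitions are placeholders rather than the paper's exact values.
```python
import numpy as np

def wgcsl_weight(t, t_prime, advantage, adv_threshold, gamma=0.98, clip=10.0):
    w_discount = gamma ** (t_prime - t)                      # (1) discounted relabeling weight
    w_adv      = np.exp(np.clip(advantage, -clip, clip))     # (2) goal-conditioned exp-advantage weight
    w_best     = 1.0 if advantage >= adv_threshold else 0.05 # (3) best-advantage weight (placeholder values)
    return w_discount * w_adv * w_best

def weighted_bc_loss(log_probs, weights):
    """Weighted supervised (behavioral-cloning) objective on a relabelled batch."""
    return -(np.asarray(weights) * np.asarray(log_probs)).mean()

w = wgcsl_weight(t=3, t_prime=10, advantage=0.4, adv_threshold=0.1)
print(w, weighted_bc_loss(log_probs=[-1.2, -0.7], weights=[w, 1.0]))
```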
πΉ Learning Latent Plans from Play π₯
Play-GCBC; Play-LMP; To learn control from play, we introduce Play-LMP, a self-supervised method that learns to organize play behaviors in a latent space, then reuse them at test time to achieve specific goals.
πΉ Reward-Conditioned Policies π π₯ π β
Non-expert trajectories collected from suboptimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory. Any experience collected by an agent can be used as optimal supervision when conditioned on the quality of a policy.
πΉ Training Agents using Upside-Down Reinforcement Learning π₯
UDRL: The goal of learning is no longer to maximize returns in expectation, but to learn to follow commands that may take various forms such as "achieve total reward R in next T time steps" or "reach state S in fewer than T time steps".
πΉ All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL π
Given the increased interest in the RL-as-SL paradigm, this work aims to construct a more general purpose agent/learning algorithm, but with more concrete implementation details and links to existing RL concepts than prior work.
πΉ Hierarchical Reinforcement Learning With Timed Subgoals
πΉ DEEP IMITATIVE MODELS FOR FLEXIBLE INFERENCE, PLANNING, AND CONTROL π π₯
We propose "Imitative Models" to combine the benefits of IL and goal-directed planning. Imitative Models are probabilistic predictive models of desirable behavior able to plan interpretable expert-like trajectories to achieve specified goals.
πΉ ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints
πΉ Simplifying Deep Reinforcement Learning via Self-Supervision πΆ
SSRL: We demonstrate that, without policy gradient or value estimation, an iterative procedure of "labeling" data and supervised regression is sufficient to drive stable policy improvement.
πΉ Search on the Replay Buffer: Bridging Planning and Reinforcement Learning π₯ π β β
combines the strengths of planning and reinforcement learning
πΉ Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning π₯ π
PAIR: In the online phase, we perform RL training and collect rollout data while in the offline phase, we perform SL on those successful trajectories from the dataset. Task reduction.
πΉ SOLVING COMPOSITIONAL REINFORCEMENT LEARNING PROBLEMS VIA TASK REDUCTION π₯ π
SIR: Task reduction tackles a hard-to-solve task by actively reducing it to an easier task whose solution is known by the RL agent.
πΉ DYNAMICAL DISTANCE LEARNING FOR SEMI-SUPERVISED AND UNSUPERVISED SKILL DISCOVERY
dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other states
πΉ Contextual Imagined Goals for Self-Supervised Robotic Learning π ββ β β
using the context-conditioned generative model to set goals that are appropriate to the current scene.
πΉ Reverse Curriculum Generation for Reinforcement Learning π π₯ β
Finding the optimal start-state distribution. Our method automatically generates a curriculum of start states that adapts to the agent's performance, leading to efficient training on goal-oriented tasks.
πΉ Goal-Aware Prediction: Learning to Model What Matters π π₯ π₯ Introduction is good!
we propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space, resulting in a learning objective that more closely matches the downstream task.
πΉ C-LEARNING: LEARNING TO ACHIEVE GOALS VIA RECURSIVE CLASSIFICATION π π¦ π π₯
This Q-function is not useful for predicting or controlling the future state distribution. Fundamentally, this problem arises because the relationship between the reward function, the Q function, and the future state distribution in prior work remains unclear. π» [DIAYN?]
on-policy ---> off-policy ---> goal-conditioned.
πΉ LEARNING TO UNDERSTAND GOAL SPECIFICATIONS BY MODELLING REWARD π π β
ADVERSARIAL GOAL-INDUCED LEARNING FROM EXAMPLES
A framework within which instruction-conditional RL agents are trained using rewards obtained not from the environment, but from reward models which are jointly trained from expert examples.
πΉ Intrinsically Motivated Goal-Conditioned Reinforcement Learning: a Short Survey π π π§
This paper proposes a typology of these methods [intrinsically motivated processes (IMP): knowledge-based IMP + competence-based IMP; goal-conditioned RL agents] at the intersection of deep RL and developmental approaches, surveys recent approaches and discusses future avenues.
SEE: Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration
πΉ Self-supervised Learning of Distance Functions for Goal-Conditioned Reinforcement Learning
πΉ PARROT: DATA-DRIVEN BEHAVIORAL PRIORS FOR REINFORCEMENT LEARNING π
We propose a method for pre-training behavioral priors that can capture complex input-output relationships observed in successful trials from a wide range of previously seen tasks.
π» see model-based ddl
πΉ LEARNING WHAT TO DO BY SIMULATING THE PAST π β
we propose the Deep Reward Learning by Simulating the Past (Deep RLSP) algorithm.
πΉ Weakly-Supervised Reinforcement Learning for Controllable Behavior π
two phase approach that learns a disentangled representation, and then uses it to guide exploration, propose goals, and inform a distance metric.
πΉ Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification π π₯
we derive a method based on recursive classification that eschews auxiliary reward functions and instead directly learns a value function from transitions and successful outcomes.
πΉ [C-learning: Learning to achieve goals via recursive classification]
πΉ Example-Based Offline Reinforcement Learning without Rewards
πΉ Outcome-Driven Reinforcement Learning via Variational Inference π π₯ π§
by framing the problem of achieving desired outcomes as variational inference, we can derive an off-policy RL algorithm, a reward function learnable from environment interactions, and a novel Bellman backup that contains a state-action dependent dynamic discount factor for the reward and bootstrap.
πΉ Discovering Diverse Solutions in Deep Reinforcement Learning π β
learn infinitely many solutions by training a policy conditioned on a continuous or discrete low-dimensional latent variable.
πΉ Goal-Conditioned Reinforcement Learning with Imagined Subgoals π π₯ π β β β
This high-level policy predicts intermediate states halfway to the goal using the value function as a reachability metric. We don't require the policy to reach these subgoals explicitly. Instead, we use them to define a prior policy, and incorporate this prior into a KL-regularized policy optimization scheme to speed up and regularize learning.
πΉ Goal-Space Planning with Subgoal Models π π₯
Goal-Space Planning (GSP): the key idea is to plan in a much smaller space of subgoals, constraining background planning to a set of (abstract) subgoals, learning only local, subgoal-conditioned models, and using these (high-level) subgoal values to update state values.
πΉ Discovering Generalizable Skills via Automated Generation of Diverse Tasks πΆ
As opposed to prior work on unsupervised discovery of skills which incentivizes the skills to produce different outcomes in the same environment, our method pairs each skill with a unique task produced by a trainable task generator. Procedural content generation (PCG).
πΉ Unbiased Methods for Multi-Goal RL π π π§
First, we vindicate HER by proving that it is actually unbiased in deterministic environments, such as many optimal control settings. Next, for stochastic environments in continuous spaces, we tackle sparse rewards by directly taking the infinitely sparse reward limit.
πΉ Goal-Aware Cross-Entropy for Multi-Target Reinforcement Learning π
GACE: a goal-aware cross-entropy loss that can be utilized in a self-supervised way using auto-labeled goal states alongside reinforcement learning.
πΉ DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies π₯
Contextual policies provide this capability in principle, but the representation of the context determines the degree of generalization and expressivity. Categorical contexts preclude generalization to entirely new tasks. Goal-conditioned policies may enable some generalization, but cannot capture all tasks that might be desired.
πΉ Demonstration-Conditioned Reinforcement Learning for Few-Shot Imitation π₯
Given a training set consisting of demonstrations, reward functions and transition distributions for multiple tasks, the idea is to define a policy that takes demonstrations and current state as inputs, and to train this policy to maximize the average of the cumulative reward over the set of training tasks.
πΉ C-LEARNING: HORIZON-AWARE CUMULATIVE ACCESSIBILITY ESTIMATION π π§
we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon.
πΉ C-PLANNING: AN AUTOMATIC CURRICULUM FOR LEARNING GOAL-REACHING TASKS π₯
Frame the learning of the goal-conditioned policies as expectation maximization: the E-step corresponds to planning an optimal sequence of waypoints using graph search, while the M-step aims to learn a goal-conditioned policy to reach those waypoints.
πΉ Imitating Past Successes can be Very Suboptimal π π₯ π
we prove that existing outcome-conditioned imitation learning methods do not necessarily improve the policy; rather, in some settings they can decrease the expected reward. Nonetheless, we show that a simple modification results in a method that does guarantee policy improvement, under some assumptions.
πΉ Bisimulation Makes Analogies in Goal-Conditioned Reinforcement Learning π₯ π
We propose a new form of state abstraction called goal-conditioned bisimulation that captures functional equivariance, allowing for the reuse of skills to achieve new goals.
πΉ Goal-Conditioned Q-Learning as Knowledge Distillation π₯ π
ReenGAGE: the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals.
++DATA++
πΉ Connecting the Dots Between MLE and RL for Sequence Prediction π
A rich set of other algorithms such as RAML, SPG, and data noising, have also been developed from different perspectives. This paper establishes a formal connection between these algorithms. We present a generalized entropy regularized policy optimization formulation, and show that the apparently distinct algorithms can all be reformulated as special instances of the framework, with the only difference being the configurations of a reward function and a couple of hyperparameters.
πΉ Learning Data Manipulation for Augmentation and Weighting π π₯
We have developed a new method of learning different data manipulation schemes with the same single algorithm. Different manipulation schemes reduce to just different parameterization of the data reward function. The manipulation parameters are trained jointly with the target model parameters. (Equivalence between Data and Reward, Gradient-based Reward Learning)
πΉ Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
HIPI: MaxEnt RL and MaxEnt inverse RL optimize the same multi-task RL objective with respect to trajectories and tasks, respectively.
πΉ HINDSIGHT FORESIGHT RELABELING FOR META-REINFORCEMENT LEARNING π₯ π
Hindsight Foresight Relabeling (HFR): We construct a relabeling distribution using the combination of hindsight, which is used to relabel trajectories using reward functions from the training task distribution, and foresight, which takes the relabeled trajectories and computes the utility of each trajectory for each task.
πΉ Generalized Hindsight for Reinforcement Learning π
Generalized Hindsight: an approximate inverse reinforcement learning technique for relabeling behaviors with the right tasks.
πΉ GENERALIZED DECISION TRANSFORMER FOR OFFLINE HINDSIGHT INFORMATION MATCHING π π₯
We present Generalized Decision Transformer (GDT) for solving any HIM (hindsight information matching) problem, and show how different choices for the feature function and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future.
πΉ Hindsight; Curriculum-guided Hindsight Experience Replay; COMPETITIVE EXPERIENCE REPLAY π₯; Energy-Based Hindsight Experience Prioritization; DHER: HINDSIGHT EXPERIENCE REPLAY FOR DYNAMIC GOALS
πΉ Diversity-based Trajectory and Goal Selection with Hindsight Experience Replay π
DTGSH: 1) a diversity-based trajectory selection module to sample valuable trajectories for the further goal selection; 2) a diversity-based goal selection module to select transitions with diverse goal states from the previously selected trajectories.
πΉ Exploration via Hindsight Goal Generation π π₯ β
a novel algorithmic framework that generates valuable hindsight goals which are easy for an agent to achieve in the short term and also have the potential to guide the agent to reach the actual goal in the long term.
πΉ UNDERSTANDING HINDSIGHT GOAL RELABELING REQUIRES RETHINKING DIVERGENCE MINIMIZATION π π₯ π π§
we develop a unified objective for goal-reaching that explains such a connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles.
πΉ CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning π β
This paper proposes CURIOUS, an algorithm that leverages 1) a modular Universal Value Function Approximator with hindsight learning to achieve a diversity of goals of different kinds within a unique policy and 2) an automated curriculum learning mechanism that biases the attention of the agent towards goals maximizing the absolute learning progress.
πΉ Hindsight Generative Adversarial Imitation Learning π₯
achieving imitation learning without the need for demonstrations. [see self-imitation learning]
πΉ MHER: Model-based Hindsight Experience Replay π
Replacing original goals with virtual goals generated from interaction with a trained dynamics model.
πΉ Policy Continuation with Hindsight Inverse Dynamics π π₯ β β
This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay.
πΉ USHER: Unbiased Sampling for Hindsight Experience Replay π₯ π
We propose an asymptotically unbiased importance-sampling-based algorithm to address this problem without sacrificing performance on deterministic environments.
πΉ Experience Replay Optimization π π₯
Self-imitation; experience replay: we propose a novel experience replay optimization (ERO) framework which alternately updates two policies: the agent policy, and the replay policy. The agent is updated to maximize the cumulative reward based on the replayed data, while the replay policy is updated to provide the agent with the most useful experiences.
πΉ MODEL-AUGMENTED PRIORITIZED EXPERIENCE REPLAY πΆ
We propose a novel experience replay method, which we call model-augmented priority experience replay (MaPER), that employs new learnable features driven from components in model-based RL (MbRL) to calculate the scores on experiences.
πΉ TOPOLOGICAL EXPERIENCE REPLAY π₯
TER: If the data sampling strategy ignores the precision of Q-value estimate of the next state, it can lead to useless and often incorrect updates to the Q-values.
πΉ Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing π
Our key idea is to express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside a memory buffer, and a separate expectation over trajectories outside of the buffer.
πΉ RETRIEVAL-AUGMENTED REINFORCEMENT LEARNING π
We augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
πΉ VARIATIONAL ORACLE GUIDING FOR REINFORCEMENT LEARNING π₯
Variational latent oracle guiding (VLOG): An important but under-explored aspect is how to leverage oracle observation (the information that is invisible during online decision making, but is available during offline training) to facilitate learning.
πΉ WISH YOU WERE HERE: HINDSIGHT GOAL SELECTION FOR LONG-HORIZON DEXTEROUS MANIPULATION πΆ
We extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations.
πΉ Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL πΆ
HTR: we present a formulation of hindsight relabeling for meta-RL, which relabels experience during meta-training to enable learning to learn entirely using sparse reward.
πΉ Remember and Forget for Experience Replay π
ReF-ER (1) skips gradients computed from experiences that are too unlikely with the current policy and (2) regulates policy changes within a trust region of the replayed behaviors.
πΉ BENCHMARKING SAMPLE SELECTION STRATEGIES FOR BATCH REINFORCEMENT LEARNING π
We compare six variants of PER (temporal-difference error, n-step return, self-imitation learning objective, pseudo-count, uncertainty, and likelihood) based on various heuristic priority metrics that focus on different aspects of the offline learning setting.
πΉ An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay π₯
We show that any loss function evaluated with non-uniformly sampled data can be transformed into another uniformly sampled loss function with the same expected gradient.
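A quick numerical illustration of the claim: sampling index i with probability p_i and using an unweighted per-sample gradient has the same expectation as sampling uniformly and weighting each gradient by N*p_i.
```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=N)                          # toy "transitions"
theta = 0.3
grad = lambda xi: 2 * (theta - xi)              # per-sample gradient of (theta - x_i)^2

p = rng.random(N); p /= p.sum()                 # arbitrary non-uniform sampling distribution

expected_grad_nonuniform = np.sum(p * grad(x))          # E_{i~p}[g_i]
expected_grad_weighted   = np.mean(N * p * grad(x))     # E_{i~Uniform}[N * p_i * g_i]
print(expected_grad_nonuniform, expected_grad_weighted) # identical up to float error
```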
πΉ Self-Imitation Learning via Generalized Lower Bound Q-learning π₯
To provide a formal motivation for the potential performance gains provided by self-imitation learning, we show that n-step lower bound Q-learning achieves a trade-off between fixed point bias and contraction rate, drawing close connections to the popular uncorrected n-step Q-learning.
πΉ Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target πΆ
we combine the n-step action-value algorithms Retrace, Q-learning, Tree Backup, Sarsa, and Q(σ) with an architecture analogous to DQN. It suggests that off-policy correction is not always necessary for learning from samples from the experience replay buffer.
πΉ Adaptive Trade-Offs in Off-Policy Learning π₯
We take a unifying view of this space of algorithms (off-policy learning algorithms ), and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, and contraction rate.
-
Imitation Learning (See Upper)
πΉ To Follow or not to Follow: Selective Imitation Learning from Observations π β
imitating every step in the demonstration often becomes infeasible when the learner and its environment are different from the demonstration.
-
reward function
πΉ QUANTIFYING DIFFERENCES IN REWARD FUNCTIONS π π¦ β β
we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without training a policy. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy.
πΉ IN-CONTEXT REINFORCEMENT LEARNING WITH ALGORITHM DISTILLATION π₯ π₯
We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model.
πΉ A SURVEY ON MODEL-BASED REINFORCEMENT LEARNING
πΉ Learning Latent Dynamics for Planning from Pixels ββ π¦ π¦ β
πΉ DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION π¦ β
πΉ CONTRASTIVE LEARNING OF STRUCTURED WORLD MODELS π₯ π β β
πΉ Learning Predictive Models From Observation and Interaction π₯ βrelated work is good!
By combining interaction and observation data, our model is able to learn to generate predictions for complex tasks and new environments without costly expert demonstrations.
πΉ medium Tutorial on Model-Based Methods in Reinforcement Learning (icml2020) π¦ β
rail Model-Based Reinforcement Learning: Theory and Practice π¦ ββ β β
πΉ What can I do here? A Theory of Affordances in Reinforcement Learning π π§
The term "affordances" describes the fact that certain states enable an agent to do certain actions, in the context of embodied agents. In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes.
πΉ When to Trust Your Model: Model-Based Policy Optimization π₯ π π§ π₯ β
MBPO: we study the role of model usage in policy optimization both theoretically and empirically.
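A toy sketch of the branched-rollout mechanism that MBPO builds on: start short model rollouts from states already in the real replay buffer and collect them into a separate model buffer, so model error has little room to compound. Model, policy, and reward below are random stand-ins.
```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 6, 2
real_states = rng.normal(size=(500, S))                 # states already in the real buffer

model  = lambda s, a: s + 0.1 * rng.normal(size=S)      # learned dynamics stand-in
policy = lambda s: np.tanh(s[:A])                       # current policy stand-in
reward = lambda s, a: -np.linalg.norm(s)                # reward model stand-in

def branched_rollouts(k=3, n_starts=64):
    model_buffer = []
    starts = real_states[rng.choice(len(real_states), n_starts)]
    for s in starts:                                    # short branches limit compounding model error
        for _ in range(k):
            a = policy(s)
            s_next = model(s, a)
            model_buffer.append((s, a, reward(s, a), s_next))
            s = s_next
    return model_buffer

print(len(branched_rollouts()))                         # 64 starts * 3 steps = 192 model transitions
```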
πΉ Visual Foresight: Model-based deep reinforcement learning for vision-based robotic control π
We presented an algorithm that leverages self-supervision from visual prediction to learn a deep dynamics model on images, and show how it can be embedded into a planning framework to solve a variety of robotic control tasks.
πΉ LEARNING STATE REPRESENTATIONS VIA RETRACING IN REINFORCEMENT LEARNING π
CCWM: a self-supervised instantiation of "learning via retracing" for joint representation learning and generative model learning under the model-based RL setting.
πΉ Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization π₯ π π₯
we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. β
πΉ Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning π π₯ β β
Learning a global model that can generalize across different dynamics is a challenging task. The intuition is that the true context of the underlying MDP can be captured from recent experiences. To tackle this problem, we decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it.
πΉ Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning π β
The main idea is updating the most accurate prediction head to specialize each head in certain environments with similar dynamics, i.e., clustering environments.
πΉ Optimism is All You Need: Model-Based Imitation Learning From Observation Alone π¦ β
πΉ PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals π
train an ensemble of conditional generative models (GANs) to generate plausible trajectories that lead the agent from its current state towards a specified goal. We then combine these imagined trajectories into a novel planning algorithm in order to achieve the desired goal as efficiently as possible.
πΉ MODEL-ENSEMBLE TRUST-REGION POLICY OPTIMIZATION π
we propose to use an ensemble of models to maintain the model uncertainty and regularize the learning process.
πΉ Sample Efficient Reinforcement Learning via Model-Ensemble Exploration and Exploitation πΆ
MEEE, a model-ensemble method that consists of optimistic exploration and weighted exploitation.
πΉ Regularizing Model-Based Planning with Energy-Based Models π π₯
We focus on planning with learned dynamics models and propose to regularize it using energy estimates of state transitions in the environment. ---> probabilistic ensembles with trajectory sampling (PETS), DAE regularization;
πΉ Model-Based Planning with Energy-Based Models π₯
We show that energy-based models (EBMs) are a promising class of models to use for model-based planning. EBMs naturally support inference of intermediate states given start and goal state distributions.
πΉ Can Autonomous Vehicles Identify, Recover From, and Adapt to Distribution Shifts? π
RIP: Our method can detect and recover from some distribution shifts, reducing the overconfident and catastrophic extrapolations in OOD scenes.
πΉ Model-Based Reinforcement Learning via Latent-Space Collocation π₯
LatCo: It is easier to solve long-horizon tasks by planning sequences of states rather than just actions, as the effects of actions greatly compound over time and are harder to optimize.
πΉ Reinforcement Learning with Action-Free Pre-Training from Videos πΆ
APV: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations for efficiently learning action-conditional world models on unseen environments.
πΉ Regularizing Trajectory Optimization with Denoising Autoencoders π₯
The idea is that we want to reward familiar trajectories and penalize unfamiliar ones because the model is likely to make larger errors for the unfamiliar ones.
πΉ Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning π π₯ π§
BIRD: our basic idea is to leverage information from real trajectories to endow policy improvement on imaginations with awareness of discrepancy between imagination and reality.
πΉ ON-POLICY MODEL ERRORS IN REINFORCEMENT LEARNING π
We present on-policy corrections (OPC) that combines real world data and a learned model in order to get the best of both worlds. The core idea is to exploit the real world data for on policy predictions and use the learned model only to generalize to different actions.
πΉ ALGORITHMIC FRAMEWORK FOR MODEL-BASED DEEP REINFORCEMENT LEARNING WITH THEORETICAL GUARANTEES π π π§ π¦ π₯
SLBO: We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and then maximizes the lower bound jointly over the policy and the model.
πΉ Model-Augmented Q-Learning πΆ
We propose to estimate not only the Q-values but also both the transition and the reward with a shared network. We further utilize the estimated reward from the model estimators for Q-learning, which promotes interaction between the estimators.
πΉ Monotonic Robust Policy Optimization with Model Discrepancy π π π₯
We propose a robust policy optimization approach, named MRPO, for improving both the average and worst-case performance of policies. We theoretically derived a lower bound for the worst-case performance of a given policy over all environments, and formulated an optimization problem to optimize the policy and sampling distribution together, subject to constraints that bounded the update step in policy optimization and statistical distance between the worst and average case environments.
πΉ Policy Gradient Method For Robust Reinforcement Learning
πΉ Trust the Model When It Is Confident: Masked Model-based Actor-Critic π π
We derive a general performance bound for model-based RL and theoretically show that the divergence between the return in the model rollouts and that in the real environment can be reduced with restricted model usage.
πΉ MBDP: A Model-based Approach to Achieve both Robustness and Sample Efficiency via Double Dropout Planning π π₯
MBDP: Model-Based Double-dropout Planning (MBDP) consists of two kinds of dropout mechanisms, where the rollout-dropout aims to improve the robustness with a small cost of sample efficiency, while the model-dropout is designed to compensate for the lost efficiency at a slight expense of robustness.
πΉ PILCO (probabilistic inference for learning control) Deep PILCO π
πΉ Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models π
Employing uncertainty-aware dynamics models: we propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation.
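A minimal sketch of the trajectory-sampling idea (the model, reward, and function names are placeholders, not the authors' code): each particle is propagated by a randomly chosen member of a probabilistic ensemble, and a candidate action sequence is scored by the average return over particles.

```python
import numpy as np

def evaluate_action_sequence(ensemble, reward_fn, s0, actions, n_particles=20, rng=None):
    """Score one candidate action sequence with trajectory sampling (a PETS-style sketch).

    ensemble  : list of models, each model(s, a, rng) -> sampled next state
    reward_fn : reward_fn(s, a) -> scalar reward
    s0        : initial state, shape (state_dim,)
    actions   : planned sequence, shape (horizon, action_dim)
    """
    rng = rng or np.random.default_rng()
    returns = np.zeros(n_particles)
    for p in range(n_particles):
        model = ensemble[rng.integers(len(ensemble))]  # one ensemble member per particle
        s = s0.copy()
        for a in actions:
            returns[p] += reward_fn(s, a)
            s = model(s, a, rng)                       # stochastic next-state sample
    return returns.mean()                              # a risk-sensitive statistic could be used instead
```

In PETS this evaluator sits inside a CEM loop over open-loop action sequences; a CEM sketch appears under the zero-order planning entries below.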
πΉ Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control π π₯
POLO utilizes a global value function approximation scheme, a local trajectory optimization subroutine, and an optimistic exploration scheme.
πΉ Learning Off-Policy with Online Planning π₯ π
LOOP: We provide a theoretical analysis of this method, suggesting a tradeoff between model errors and value function errors, and empirically demonstrate this tradeoff to be beneficial in deep reinforcement learning. (H-step lookahead policies.)
πΉ Calibrated Model-Based Deep Reinforcement Learning π₯
This paper explores which uncertainties are needed for model-based reinforcement learning and argues that good uncertainties must be calibrated, i.e. their probabilities should match empirical frequencies of predicted events.
πΉ Model Imitation for Model-Based Reinforcement Learning π π
We propose to learn the transition model by matching the distributions of multi-step rollouts sampled from the transition model and the real ones via WGAN. We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one.
πΉ Model-based Policy Optimization with Unsupervised Model Adaptation π₯
We derive a lower bound of the expected return, which inspires a bound maximization algorithm by aligning the simulated and real data distributions. To this end, we propose a novel model-based rl framework AMPO, which introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data.
πΉ Bidirectional Model-based Policy Optimization π π₯ π₯
We propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions: Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization.
πΉ Backward Imitation and Forward Reinforcement Learning via Bi-directional Model Rollouts π₯
BIFRL: the agent treats backward rollout traces as expert demonstrations for the imitation of excellent behaviors, and then collects forward rollout transitions for policy reinforcement.
πΉ Self-Consistent Models and Values π₯
We investigate a way of augmenting model-based RL, by additionally encouraging a learned model and value function to be jointly self-consistent.
πΉ MODEL-AUGMENTED ACTOR-CRITIC: BACKPROPAGATING THROUGH PATHS π₯ π
MAAC: We exploit the fact that the learned simulator is differentiable and optimize the policy with the analytical gradient. The objective is theoretically analyzed in terms of the model and value error, and we derive a policy improvement expression with respect to those terms.
πΉ How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization π₯
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning, leading to a critic tailored for policy improvement.
πΉ Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning
πΉ Discriminator Augmented Model-Based Reinforcement Learning π π
Our approach trains a discriminative model to assess the quality of sampled transitions during planning, and upweight or downweight value estimates computed from high and low quality samples, respectively. We can learn biased dynamics models with advantageous properties, such as reduced value estimation variance during planning.
πΉ Variational Model-based Policy Optimization π π₯ π π§
Jointly learn and improve model and policy using a universal objective function: We propose model-based and model-free policy iteration (actor-critic) style algorithms for the E-step and show how the variational distribution learned by them can be used to optimize the M-step in a fully model-based fashion.
πΉ Model-Based Reinforcement Learning via Imagination with Derived Memory π₯
IDM: It enables the agent to learn policy from enriched diverse imagination with prediction-reliability weight, thus improving sample efficiency and policy robustness
πΉ MISMATCHED NO MORE: JOINT MODEL-POLICY OPTIMIZATION FOR MODEL-BASED RL π₯ π₯
We propose a single objective for jointly training the model and the policy, such that updates to either component increases a lower bound on expected return.
πΉ Operator Splitting Value Iteration π₯ π π₯
Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough.
πΉ Model-Based Reinforcement Learning via Meta-Policy Optimization π₯ π
MB-MPO: Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step, which foregoes the strong reliance on accurate learned dynamics models.
πΉ A RELATIONAL INTERVENTION APPROACH FOR UNSUPERVISED DYNAMICS GENERALIZATION IN MODELBASED REINFORCEMENT LEARNING π π₯
We propose an intervention module to identify the probability that two estimated factors belong to the same environment, and a relational head to cluster those estimated Ẑ's that are likely to come from the same environment, thus reducing redundant information unrelated to the environment.
πΉ Value-Aware Loss Function for Model-based Reinforcement Learning π
Estimating a generative model that minimizes a probabilistic loss, such as the log-loss, is an overkill because it does not take into account the underlying structure of decision problem and the RL algorithm that intends to solve it. We introduce a loss function that takes the structure of the value function into account.
πΉ Iterative Value-Aware Model Learning π π₯
Iterative VAML benefits from the structure of how planning is performed (i.e., through approximate value iteration) to devise a simpler optimization problem.
πΉ Configurable Markov Decision Processes π π₯ π§
In Conf-MDPs the environment dynamics can be partially modified to improve the performance of the learning agent.
πΉ Bridging Worlds in Reinforcement Learning with Model-Advantage π
we show relationships between the proposed model advantage and generalization in RL β using which we provide guarantees on the gap in performance of an agent in new environments.
πΉ Model-Advantage Optimization for Model-Based Reinforcement Learning π π₯ π§
a novel value-aware objective that is an upper bound on the absolute performance difference of a policy across two models.
πΉ Policy-Aware Model Learning for Policy Gradient Methods π₯ π
Decision-Aware Model Learning: We focus on policy gradient planning algorithms and derive new loss functions for model learning that incorporate how the planner uses the model.
πΉ Gradient-Aware Model-Based Policy Search π₯ π
Beyond Maximum Likelihood Model Estimation in Model-based Policy Search ppt
πΉ Model-Based Reinforcement Learning with Value-Targeted Regression π
πΉ Decision-Aware Model Learning for Actor-Critic Methods: When Theory Does Not Meet Practice πΆ
we show empirically that combining Actor-Critic and value-aware model learning can be quite difficult and that naive approaches such as maximum likelihood estimation often achieve superior performance with less computational cost.
πΉ The Value Equivalence Principle for Model-Based Reinforcement Learning π π§
We introduced the principle of value equivalence: two models are value equivalent with respect to a set of functions and a set of policies if they yield the same updates of the former on the latter. Value equivalence formalizes the notion that models should be tailored to their future use and provides a mechanism to incorporate such knowledge into the model learning process.
πΉ Proper Value Equivalence π¦
We start by generalizing the concept of VE to order-k counterparts defined with respect to k applications of the Bellman operator. This leads to a family of VE classes that increase in size as k → ∞. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE.
πΉ Minimax Model Learning π π§ π₯
our approach allows for greater robustness under model misspecification or distribution shift induced by learning/evaluating policies that are distinct from the data-generating policy.
πΉ On Effective Scheduling of Model-based Reinforcement Learning π₯
AutoMBPO: we aim to investigate how to appropriately schedule these hyperparameters, i.e., real data ratio, model training frequency, policy training iteration, and rollout length, to achieve optimal performance of Dyna-style MBRL algorithms.
πΉ Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy π
Policy-adaptation Model-based Actor-Critic (PMAC), which learns a policy-adapted dynamics model based on a policy-adaptation mechanism. This mechanism dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy.
πΉ When to Update Your Model: Constrained Model-based Reinforcement Learning π₯ π π§
CMLO: learning models from a dynamically-varying number of explorations benefits the eventual return.
We use an ensemble of neural networks with different initializations to tackle epistemic and aleatoric uncertainty issues faced during environment model learning.
πΉ DREAMERPRO: RECONSTRUCTION-FREE MODEL-BASED REINFORCEMENT LEARNING WITH PROTOTYPICAL REPRESENTATIONS
combining the prototypical representation learning with temporal dynamics learning.
ProtoCAD: extracts useful contextual information with the help of the prototypes clustered over batch and benefits model-based RL in two folds: 1) It utilizes a temporally consistent prototypical regularizer; 2) A context representation is designed and can significantly improve the dynamics generalization ability.
We hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples.
β Zero-Order Trajectory Optimizers / Planning
πΉ Sample-efficient Cross-Entropy Method for Real-time Planning π§
Cross-Entropy Method (CEM):
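A minimal CEM planning loop, assuming some scalar score_fn over action sequences (for instance the ensemble evaluator sketched in the PETS entry above); all names here are illustrative, not from the paper.

```python
import numpy as np

def cem_plan(score_fn, horizon, action_dim, iters=5, pop=500, elite_frac=0.1, rng=None):
    """Cross-Entropy Method over open-loop action sequences.

    score_fn : callable mapping an (horizon, action_dim) action sequence to a scalar return
    """
    rng = rng or np.random.default_rng()
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, horizon, action_dim))
        scores = np.array([score_fn(seq) for seq in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]              # keep the best sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # refit the sampling distribution
    return mean[0]  # execute only the first action (MPC style), then replan
```

Executing only the first action and replanning at every step gives the usual MPC behavior; the entry above is about making exactly this loop sample-efficient enough for real-time control.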
πΉ Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers π§
Adaptive Policy EXtraction (APEX):
πΉ MODEL-BASED VISUAL PLANNING WITH SELF-SUPERVISED FUNCTIONAL DISTANCES π
We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model as well as a dynamical distance function learned using model-free rl. Related work!
β model-based offline
πΉ Representation Balancing MDPs for Off-Policy Policy Evaluation π§ β
πΉ REPRESENTATION BALANCING OFFLINE MODEL-BASED REINFORCEMENT LEARNING π§ β
πΉ Skill-based Model-based Reinforcement Learning π
SkiMo: that enables planning in the skill space using a skill dynamics model, which directly predicts the skill outcomes, rather than predicting all small details in the intermediate states, step by step.
πΉ MODEL-BASED REINFORCEMENT LEARNING WITH MULTI-STEP PLAN VALUE ESTIMATION πΆ
MPPVE: We employ the multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of action plans at a given state, and updates the policy by directly computing the multi-step policy gradient via plan value estimation.
πΉ Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning π§
CDPO:
πΉ Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination
CABI: generates reliable samples and can be combined with any model-free offline RL method
πΉ VARIATIONAL LATENT BRANCHING MODEL FOR OFF-POLICY EVALUATION πΆ
VLBM: tries to accurately capture the dynamics of the underlying environments from offline training data that provide only limited coverage of the state and action space; used for model-based OPE.
πΉ CONSERVATIVE BAYESIAN MODEL-BASED VALUE EXPANSION FOR OFFLINE POLICY OPTIMIZATION π₯ π
CBOP: that trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate
πΉ LATENT VARIABLE REPRESENTATION FOR REINFORCEMENT LEARNING π§
LV-Rep
πΉ PESSIMISTIC MODEL-BASED ACTOR-CRITIC FOR OFFLINE REINFORCEMENT LEARNING: THEORY AND ALGORITHMS π§
πΉ MODEM: ACCELERATING VISUAL MODEL-BASED REINFORCEMENT LEARNING WITH DEMONSTRATIONS π₯
We identify key ingredients for leveraging demonstrations in model learning β policy pretraining, targeted exploration, and oversampling of demonstration data β which forms the three phases of our model-based RL framework.
πΉ Reinforcement Learning: Theory and Algorithms π π¦
πΉ Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning πΆ β
πΉ Predictive Information Accelerates Learning in RL π β
We train Soft Actor-Critic (SAC) agents from pixels with an auxiliary task that learns a compressed representation of the predictive information of the RL environment dynamics using a contrastive version of the Conditional Entropy Bottleneck (CEB) objective.
πΉ Speeding up Reinforcement Learning with Learned Models π¦ β
πΉ DYNAMICS-AWARE EMBEDDINGS π β
A forward prediction objective for simultaneously learning embeddings of states and action sequences.
πΉ DIVIDE-AND-CONQUER REINFORCEMENT LEARNING π
we develop a novel algorithm that instead partitions the initial state space into βslicesβ, and optimizes an ensemble of policies, each on a different slice.
πΉ Continual Learning of Control Primitives: Skill Discovery via Reset-Games π π₯
We do this by exploiting the insight that the need to βreset" an agent to a broad set of initial states for a learning task provides a natural setting to learn a diverse set of βreset-skills".
πΉ DIFFERENTIABLE TRUST REGION LAYERS FOR DEEP REINFORCEMENT LEARNING π π§
We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. Related work is good!
πΉ BENCHMARKS FOR DEEP OFF-POLICY EVALUATION π π₯ π§
DOPE is designed to measure the performance of OPE methods by 1) evaluating on challenging control tasks with properties known to be difficult for OPE methods, but which occur in real-world scenarios, 2) evaluating across a range of policies with different values, to directly measure performance on policy evaluation, ranking and selection, and 3) evaluating in ideal and adversarial settings in terms of dataset coverage and support.
πΉ Universal Off-Policy Evaluation π
We take the first steps towards a universal off-policy estimator (UnO) that estimates and bounds the entire distribution of returns, and then derives estimates and simultaneous bounds for all parameters of interest.
πΉ Trajectory-Based Off-Policy Deep Reinforcement Learning π π₯ π§ β
Incorporation of previous rollouts via importance sampling greatly improves data-efficiency, whilst stochastic optimization schemes facilitate the escape from local optima.
πΉ Off-Policy Policy Gradient with State Distribution Correction π§
πΉ DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections π₯ π π₯
Off-Policy Policy Evaluation (OPE) ---> Learning Stationary Distribution Corrections ---> Off-Policy Estimation with Multiple Unknown Behavior Policies. DualDICE estimates the discounted stationary distribution corrections.
πΉ AlgaeDICE: Policy Gradient from Arbitrary Experience π π π§ β β β
We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. ALgorithm for policy Gradient from Arbitrary Experience via DICE (AlgaeDICE).
πΉ GENDICE: GENERALIZED OFFLINE ESTIMATION OF STATIONARY VALUES π π₯ π₯ ββ β β
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions, derived from fundamental properties of the stationary distribution, and exploiting constraint reformulations based on variational divergence minimization.
πΉ GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values π β
πΉ Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation π π₯ π§ β
The key idea is to apply importance sampling on the average visitation distribution of single steps of state-action pairs, instead of the much higher dimensional distribution of whole trajectories.
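Schematically (notation mine), the estimator replaces per-trajectory products of importance ratios with a single per-transition weight given by a stationary density ratio:

$$
\hat{R}(\pi) \;\propto\; \frac{1}{n}\sum_{i=1}^{n} w(s_i,a_i)\, r_i,
\qquad
w(s,a) \;=\; \frac{d^{\pi}(s,a)}{d^{\mu}(s,a)},
$$

where d^π and d^μ are the (discounted) state-action visitation distributions of the target and behavior policies, and the proportionality hides the 1/(1-γ) normalization in the discounted case. The DICE estimators listed in this section are different ways of estimating w without knowing μ.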
πΉ Off-Policy Evaluation via the Regularized Lagrangian π₯ π π§ β
we unify these estimators (DICE) as regularized Lagrangians of the same linear program.
πΉ OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation π π
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms.
πΉ SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching
πΉ DEMODICE: OFFLINE IMITATION LEARNING WITH SUPPLEMENTARY IMPERFECT DEMONSTRATIONS π π₯ π
An algorithm for offline IL from expert and imperfect demonstrations that achieves state-of-the-art performance on various offline IL tasks.
πΉ OFF-POLICY CORRECTION FOR ACTOR-CRITIC ALGORITHMS IN DEEP REINFORCEMENT LEARNING πΆ
AC-Off-POC: Through a novel discrepancy measure computed by the agentβs most recent action decisions on the states of the randomly sampled batch of transitions, the approach does not require actual or estimated action probabilities for any policy and offers an adequate one-step importance sampling.
πΉ A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation π₯ π₯
We bridge the gap between MIS and deep RL by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep RL methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains.
πΉ Policy-Adaptive Estimator Selection for Off-Policy Evaluation π₯ π₯
PAS-IF: synthesizes appropriate subpopulations by minimizing the squared distance between the importance ratio induced by the true evaluation policy and that induced by the pseudo evaluation policy (in OPE), which we call the importance fitting step.
πΉ How Far Iβll Go: Offline Goal-Conditioned Reinforcement Learning via f-Advantage Regression π
Goal-conditioned f-Advantage Regression (GoFAR), a novel regressionbased offline GCRL algorithm derived from a state-occupancy matching perspective; the key intuition is that the goal-reaching task can be formulated as a stateoccupancy matching problem between a dynamics-abiding imitator agent and an expert agent that directly teleports to the goal.
πΉ Minimax Weight and Q-Function Learning for Off-Policy Evaluation π₯ π§ β β
Minimax Weight Learning (MWL); Minimax Q-Function Learning. Doubly Robust Extension and Sample Complexity of MWL & MQL.
πΉ Minimax Value Interval for Off-Policy Evaluation and Policy Optimization π π₯ π π§
we derive the minimax value intervals by slightly altering the derivation of two recent methods [1], one of βweight-learningβ style (Sec. 4.1) and one of βvalue-learningβ style (Sec. 4.2), and show that under certain conditions, they merge into a single unified value interval whose validity only relies on either Q or W being well-specified (Sec. 4.3).
πΉ Reinforcement Learning via Fenchel-Rockafellar Duality π₯ π π§ β β β
Policy Evaluation: LP form of Q ---> policy evaluation via the Lagrangian ---> change the problem before applying duality (constant function, f-divergence, Fenchel-Rockafellar duality); Policy Optimization: policy gradient ---> offline policy gradient via the Lagrangian ---> Fenchel-Rockafellar duality for the regularized optimization (regularization with the KL divergence) ---> imitation learning; RL with the LP form of V: max-likelihood policy learning ---> policy evaluation with the V-LP; Undiscounted Settings.
πΉ ADVANTAGE-WEIGHTED REGRESSION: SIMPLE AND SCALABLE OFF-POLICY REINFORCEMENT LEARNING π π₯ π₯
Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. [see MPO]
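Written out schematically (β is a temperature, R a Monte-Carlo or TD(λ) return computed from replay data), the two supervised steps are:

$$
\mathcal{L}_{V} = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[(R_{s,a}-V_{\phi}(s))^{2}\big],
\qquad
\mathcal{L}_{\pi} = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\log\pi_{\theta}(a\mid s)\,\exp\!\Big(\tfrac{1}{\beta}\big(R_{s,a}-V_{\phi}(s)\big)\Big)\Big].
$$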
πΉ Relative Entropy Policy Search π₯ β
REPS: it allows an exact policy update and may use data generated while following an unknown policy to generate a new, better policy.
πΉ Overcoming Exploration in Reinforcement Learning with Demonstrations π₯
We present a system to utilize demonstrations along with reinforcement learning to solve complicated multi-step tasks. Q-Filter. BC.
πΉ Fitted Q-iteration by Advantage Weighted Regression π π₯ π₯ β
we show that by using a soft-greedy action selection, the policy improvement step used in FQI can be simplified to an inexpensive advantage weighted regression. <--- handles greedy action selection in continuous action spaces.
πΉ Q-Value Weighted Regression: Reinforcement Learning with Limited Data π₯ π
QWR: We replace the value function critic of AWR with a Q-value function. AWR --> QWR.
πΉ SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning π₯ π
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration. [Rainbow]
πΉ Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research π
πΈ Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks π₯
we propose MeanQ, a simple ensemble method that estimates target values as ensemble means.
πΉ Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective π
To understand an off-policy actor-critic algorithm, we show the policy evaluation error on the expected distribution of transitions decomposes into the Bellman error, the bias from policy mismatch, and the variance from sampling.
πΉ SOPE: Spectrum of Off-Policy Estimators π π₯ π
Combining Trajectory-Based and Density-Based Importance Sampling: We present a new perspective in off-policy evaluation connecting two popular estimators, PDIS and SIS, and show that PDIS and SIS lie as endpoints on the Spectrum of Off-Policy Estimators SOPEn which interpolates between them.
πΉ Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators π₯ π π§
Generalized Bellman operators: Qπ(λ), Tree-Backup(λ) (henceforth denoted TB(λ)), Retrace(λ), and Q-trace.
πΉ Efficient Continuous Control with Double Actors and Regularized Critics π₯
DARC: We show that double actors help relieve overestimation bias in DDPG if built upon single critic, and underestimation bias in TD3 if built upon double critics. (they enhance the exploration ability of the agent.)
πΉ A Unified Off-Policy Evaluation Approach for General Value Function π§
GenTD:
Model Selection:
πΉ Pessimistic Model Selection for Offline Deep Reinforcement Learning π₯ π§
We propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a provably effective framework for finding the best policy among a set of candidate models.
πΉ REVISITING BELLMAN ERRORS FOR OFFLINE MODEL SELECTION π₯
Supervised Bellman Validation (SBV)
πΉ PARAMETER-BASED VALUE FUNCTIONS π β
Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters.
πΉ Reinforcement Learning without Ground-Truth State π
relabeling the original goal with the achieved goal to obtain positive rewards
πΉ Ecological Reinforcement Learning π
πΉ Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning π β
πΉ Taylor Expansion Policy Optimization π π₯ π π§ β
a policy optimization formalism that generalizes prior work (e.g., TRPO) as a firstorder special case. We also show that Taylor expansions intimately relate to off-policy evaluation.
πΉ Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning π β
Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning.
πΉ Deep Reinforcement Learning with Robust and Smooth Policy π
Motivated by the fact that many environments with continuous state space have smooth transitions, we propose to learn a smooth policy that behaves smoothly with respect to states. We develop a new framework β Smooth Regularized Reinforcement Learning (SR2L), where the policy is trained with smoothness-inducing regularization.
πΉ If MaxEnt RL is the Answer, What is the Question? π π₯ π
πΉ Maximum Entropy RL (Provably) Solves Some Robust RL Problems π₯ π
Our main contribution is a set of proofs showing that standard MaxEnt RL optimizes lower bounds on several possible robust objectives, reflecting a degree of robustness to changes in the dynamics and to certain changes in the reward.
πΉ Your Policy Regularizer is Secretly an Adversary π¦
πΉ Estimating Q(s, s') with Deep Deterministic Dynamics Gradients π π₯
We highlight the benefits of this approach in terms of value function transfer, learning within redundant action spaces, and learning off-policy from state observations generated by sub-optimal or completely random policies.
πΉ RANDOMIZED ENSEMBLED DOUBLE Q-LEARNING: LEARNING FAST WITHOUT A MODEL π₯
REDQ: (i) an Update-To-Data (UTD) ratio >> 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble.
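A hedged sketch of ingredient (iii), the in-target minimization (the critic and SAC-style policy interfaces are assumptions, not the paper's code):

```python
import random
import torch

@torch.no_grad()
def redq_target(q_targets, policy, reward, next_obs, done,
                gamma=0.99, alpha=0.2, m=2):
    """Compute a REDQ-style critic target.

    q_targets : list of target critics, each q(next_obs, act) -> (batch, 1) values
    policy    : policy(next_obs) -> (action, log_prob), as in SAC
    """
    act, logp = policy(next_obs)
    subset = random.sample(q_targets, m)                    # in-target minimization over a random subset
    q_min = torch.min(torch.stack([q(next_obs, act) for q in subset], dim=0), dim=0).values
    return reward + gamma * (1.0 - done) * (q_min - alpha * logp)
```

The remaining ingredients are the UTD ratio (many such critic updates per environment step) and averaging all ensemble critics when computing the actor objective.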
πΉ DROPOUT Q-FUNCTIONS FOR DOUBLY EFFICIENT REINFORCEMENT LEARNING πΆ
To make REDQ more computationally efficient, we propose a method of improving computational efficiency called Dr.Q, which is a variant of REDQ that uses a small ensemble of dropout Q-functions.
πΉ Disentangling Dynamics and Returns: Value Function Decomposition with Future Prediction π β
we propose a two-step understanding of value estimation from the perspective of future prediction, through decomposing the value function into a reward-independent future dynamics part and a policy-independent trajectory return part.
πΉ DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
πΉ Regret Minimization Experience Replay in Off-Policy Reinforcement Learning π₯ π
ReMERN and ReMERT: We start from the regret minimization objective, and obtain an optimal prioritization strategy for Bellman update that can directly maximize the return of the policy. The theory suggests that data with higher hindsight TD error, better on-policiness and more accurate Q value should be assigned with higher weights during sampling.
πΉ Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift π
Existing off-policy gradient based methods do not correct for the state distribution mismatch, and in this work we show that instead of computing the ratio over state distributions, we can instead minimize the KL between the target and behaviour state distributions to account for the state distribution shift in off-policy learning.
πΉ Fast Efficient Hyperparameter Tuning for Policy Gradient Methods πΆ
Hyperparameter Optimisation on the Fly (HOOF): The main idea is to use existing trajectories sampled by the policy gradient method to optimise a one-step improvement objective, yielding a sample- and computationally-efficient algorithm that is easy to implement.
πΉ REWARD SHIFTING FOR OPTIMISTIC EXPLORATION AND CONSERVATIVE EXPLOITATION πΆ
We bring the key insight that a positive reward shifting leads to conservative exploitation, while a negative reward shifting leads to curiosity-driven exploration.
πΉ Exploiting Reward Shifting in Value-Based Deep RL
πΉ Heuristic-Guided Reinforcement Learning π₯ π
HuRL: We show how heuristic-guided RL induces a much shorter-horizon subproblem that provably solves the original task. Our framework can be viewed as a horizon-based regularization for controlling bias and variance in RL under a finite interaction budget.
πΉ Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning π₯ π§
Our results provide strong evidence for our hypothesis that large differences in action-gap sizes are detrimental to the performance of approximate RL.
πΉ ORCHESTRATED VALUE MAPPING FOR REINFORCEMENT LEARNING π₯
We present a general convergent class of reinforcement learning algorithms that is founded on two distinct principles: (1) mapping value estimates to a different space using arbitrary functions from a broad class, and (2) linearly decomposing the reward signal into multiple channels.
πΉ Discount Factor as a Regularizer in Reinforcement Learning π₯ π
We show an explicit equivalence between using a reduced discount factor and adding an explicit regularization term to the algorithmβs loss.
πΉ Learning to Score Behaviors for Guided Policy Optimization π₯ π§
We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors.
πΉ Dual Policy Distillation π
DPD: a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment and extract knowledge from each other to enhance their learning.
πΉ Jump-Start Reinforcement Learning π
JSRL: an algorithm that employs two policies to solve tasks: a guide-policy, and an exploration-policy. By using the guide-policy to form a curriculum of starting states for the exploration-policy, we are able to efficiently improve performance on a set of simulated robotic tasks.
πΉ Distilling Policy Distillation π π
We sought to highlight some of the strengths, weaknesses, and potential mathematical inconsistencies in different variants of distillation used for policy knowledge transfer in reinforcement learning.
πΉ Regularized Policies are Reward Robust π₯ π§ β
we find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward.
πΉ Reinforcement Learning as One Big Sequence Modeling Problem π π₯ π§ β β β
Addressing RL as a sequence modeling problem significantly simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL.
πΉ Decision Transformer: Reinforcement Learning via Sequence Modeling π₯ π
πΉ How Crucial is Transformer in Decision Transformer? πΆ
These results suggest that the strength of the Decision Transformer for continuous control tasks may lie in the overall sequential modeling architecture and not in the Transformer per se.
πΉ Prompting Decision Transformer for Few-Shot Policy Generalization π₯ π
We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL.
πΉ Bootstrapped Transformer for Offline Reinforcement Learning πΆ
Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. CABI (Double Check Your State Before Trusting It)
πΉ On-Policy Deep Reinforcement Learning for the Average-Reward Criterion π π₯
By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemenyβs constant.
πΉ Average-Reward Reinforcement Learning with Trust Region Methods π π₯
Firstly, we develop a unified trust region theory with discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived with the Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves the value estimation with a novel technique named Average Value Constraint.
πΉ Trust Region Policy Optimization π π₯ π₯ β β
πΉ Benchmarking Deep Reinforcement Learning for Continuous Control π π₯ β β
πΉ P3O: Policy-on Policy-off Policy Optimization π₯
This paper develops a simple alg. named P3O that interleaves offpolicy updates with on-policy updates.
πΉ Policy Gradients Incorporating the Future π₯
we consider the problem of incorporating information from the entire trajectory in model-free online and offline RL algorithms, enabling an agent to use information about the future to accelerate and improve its learning.
πΉ Generalizable Episodic Memory for Deep Reinforcement Learning π π₯
Generalizable Episodic Memory: We propose Generalizable Episodic Memory (GEM), which effectively organizes the state-action values of episodic memory in a generalizable manner and supports implicit planning on memorized trajectories.
πΉ Generalized Proximal Policy Optimization with Sample Reuse π π₯ π
GePPO: We combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization.
πΉ Zeroth-Order Supervised Policy Improvement π₯ π₯ β
The policy learning of ZOSPI has two steps: 1) it samples actions and evaluates them with a learned value estimator, and 2) it learns to perform the action with the highest value through supervised learning.
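A hedged sketch of one such policy update (the sampler, critic, and policy interfaces are placeholders):

```python
import torch

def zospi_policy_update(policy, q_fn, obs, n_samples=50, action_dim=6, action_scale=1.0):
    """Zeroth-order policy improvement: regress the policy onto the best sampled action."""
    batch = obs.shape[0]
    # 1) sample candidate actions (here uniformly; perturbations around the policy also work)
    cand = (torch.rand(batch, n_samples, action_dim) * 2 - 1) * action_scale
    obs_rep = obs.unsqueeze(1).expand(-1, n_samples, -1)
    with torch.no_grad():
        q = q_fn(obs_rep.reshape(-1, obs.shape[-1]),
                 cand.reshape(-1, action_dim)).reshape(batch, n_samples)
    best = cand[torch.arange(batch), q.argmax(dim=1)]       # highest-value sampled action per state
    # 2) supervised regression of the (deterministic) policy toward the selected actions
    return ((policy(obs) - best.detach()) ** 2).mean()
```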
πΉ SAMPLE EFFICIENT ACTOR-CRITIC WITH EXPERIENCE REPLAY π π₯
including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization.
πΉ Safe and efficient off-policy reinforcement learning π₯ π§ β β
Retrace(λ): low variance, safe, and efficient.
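The operator, up to notational details, for behavior policy μ and target policy π:

$$
\mathcal{R}Q(s,a) = Q(s,a) + \mathbb{E}_{\mu}\Big[\sum_{t\ge 0}\gamma^{t}\Big(\prod_{u=1}^{t} c_{u}\Big)\big(r_{t} + \gamma\,\mathbb{E}_{\pi}Q(s_{t+1},\cdot) - Q(s_{t},a_{t})\big)\Big],
\qquad
c_{u} = \lambda\min\!\Big(1,\ \frac{\pi(a_{u}\mid s_{u})}{\mu(a_{u}\mid s_{u})}\Big).
$$

Truncating the importance ratios at 1 keeps the variance low while preserving safety (contraction for arbitrary μ) and efficiency (full use of near-on-policy traces).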
πΉ Relative Entropy Regularized Policy Iteration π₯ π₯
The algorithm alternates between Q-value estimation, local policy improvement, and parametric policy fitting; hard constraints control the rate of change of the policy, and a decoupled update for the mean and covariance of a Gaussian policy avoids premature convergence. [see MPO]
πΉ Q-Learning for Continuous Actions with Cross-Entropy Guided Policies π
Our approach trains the Q-function using iterative sampling with the Cross-Entropy Method (CEM), while training a policy network to imitate CEMβs sampling behavior.
πΉ SUPERVISED POLICY UPDATE FOR DEEP REINFORCEMENT LEARNING π π₯ π₯ β β β
FORWARD AGGREGATE AND DISAGGREGATE KL CONSTRAINTS; BACKWARD KL CONSTRAINT; L CONSTRAINT;
πΉ Maximizing Ensemble Diversity in Deep Q-Learning πΆ
Reducing overestimation bias by increasing representation dissimilarity in ensemble based deep q-learning.
πΉ Value-driven Hindsight Modelling π
we propose to learn what to model in a way that can directly help value prediction.
πΉ Dual Policy Iteration π π₯
DPI: We present and analyze Dual Policy Iterationβa framework that alternatively computes a non-reactive policy via more advanced and systematic search, and updates a reactive policy via imitating the non-reactive one. [MPO, AWR]
πΉ Regret Minimization for Partially Observable Deep Reinforcement Learning π β
πΉ THE IMPORTANCE OF PESSIMISM IN FIXED-DATASET POLICY OPTIMIZATION π π₯ π» π¦ β
Algs can follow the pessimism principle, which states that we should choose the policy which acts optimally in the worst possible world. We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy, and derive families of algorithms which follow this principle.
πΉ Bridging the Gap Between Value and Policy Based Reinforcement Learning π₯ π₯ π β
we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces.
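The d-step soft consistency that PCL drives to zero (schematic notation; τ is the entropy temperature):

$$
C(s_{t:t+d}) = -V_{\phi}(s_{t}) + \gamma^{d} V_{\phi}(s_{t+d}) + \sum_{i=0}^{d-1}\gamma^{i}\big(r_{t+i} - \tau\log\pi_{\theta}(a_{t+i}\mid s_{t+i})\big),
$$

with both θ and φ trained to minimize C² over on-policy and replayed sub-trajectories.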
πΉ Equivalence Between Policy Gradients and Soft Q-Learning π π§ β
The soft Q-learning loss gradient can be interpreted as a policy gradient term plus a baseline-error-gradient term, corresponding to policy gradient instantiations such as A3C.
πΉ An operator view of policy gradient methods π₯
We use this framework to introduce operator-based versions of well-known policy gradient methods.
πΉ MAXIMUM REWARD FORMULATION IN REINFORCEMENT LEARNING π§
We formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence.
πΉ Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error π₯
The magnitude of the Bellman error is smaller for biased value functions due to cancellations caused from both sides of the Bellman equation. The relationship between Bellman error and value error is broken if the dataset is missing relevant transitions.
πΉ CONVERGENT AND EFFICIENT DEEP Q NETWORK ALGORITHM π
We show that DQN can indeed diverge and cease to operate in realistic settings. we propose a convergent DQN (C-DQN) that is guaranteed to converge.
πΉ LEARNING SYNTHETIC ENVIRONMENTS AND REWARD NETWORKS FOR REINFORCEMENT LEARNING π
We use bi-level optimization to evolve SEs and RNs: the inner loop trains the RL agent, and the outer loop trains the parameters of the SE / RN via an evolution strategy.
πΉ IS HIGH VARIANCE UNAVOIDABLE IN RL? A CASE STUDY IN CONTINUOUS CONTROL
πΉ Reinforcement Learning with a Terminator π₯ π₯
We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer.
πΉ Truly Deterministic Policy Optimization π π
We proposed a deterministic policy gradient method (TDPO) based on the use of a deterministic Vine (DeVine) gradient estimator and the Wasserstein metric. We proved monotonic payoff guarantees for our method, and defined a novel surrogate for policy optimization.
πΉ Automated Reinforcement Learning (AutoRL): A Survey and Open Problems π¦
πΉ CGAR: Critic Guided Action Redistribution in Reinforcement Leaning πΆ
the Q value predicted by the critic is a better signal to redistribute the action originally sampled from the policy distribution predicted by the actor.
πΉ Value Function Decomposition for Iterative Design of Reinforcement Learning Agents πΆ
SAC-D: we introduce SAC-D, a variant of soft actor-critic (SAC) adapted for value decomposition. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component's effect on agent decision-making.
πΉ Emphatic Algorithms for Deep Reinforcement Learning π₯
πΉ Off-Policy Evaluation for Large Action Spaces via Embeddings
we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. [poster]
a model-based offline RL approach which first learns a personalized simulator for each agent by collectively using the historical trajectories across all agents, prior to learning a policy.
πΉ Gradient Temporal-Difference Learning with Regularized Corrections π₯
πΉ The Primacy Bias in Deep Reinforcement Learning π₯
"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light wonβt come in." [poster]
πΉ Memory-Constrained Policy Optimization πΆ
In addition to using the proximity of one single old policy as the first trust region as done by prior works, we propose to form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
πΉ A Temporal-Difference Approach to Policy Gradient Estimation π₯ π
TDRC: By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way. [poster]
πΉ gamma-models: Generative Temporal Difference Learning for Infinite-Horizon Prediction π₯ π π₯
Our goal is to make long-horizon predictions without the need to repeatedly apply a single-step model.
πΉ Generalised Policy Improvement with Geometric Policy Composition π π§
GGPI:
πΉ Taylor Expansions of Discount Factors π π§
We study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors.
πΉ Learning Retrospective Knowledge with Reverse Reinforcement Learning
Since such questions (how much fuel do we expect a car to have given it is at B at time t?) emphasize the influence of possible past events on the present, we refer to their answers as retrospective knowledge. We show how to represent retrospective knowledge with Reverse GVFs, which are trained via Reverse RL. [see GenTD]
We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the lambda-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledgeβwith a parameter lambda capturing how much to rely on each.
πΉ A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms π₯
We then propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective (gamma < 1) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
πΉ An Analytical Update Rule for General Policy Optimization π π₯ π₯
The contributions of this paper include: (1) a new theoretical result that tightens existing bounds for local policy search using trust-region methods; (2) a closed-form update rule for general stochastic policies with monotonic improvement guarantee; [poster]
πΉ Deep Reinforcement Learning at the Edge of the Statistical Precipice π₯ π
With the aim of increasing the field's confidence in results reported with only a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. [rliable]
πΉ Safe Policy Improvement Approaches and their Limitations
SPIBB
πΉ BSAC: Bayesian Strategy Network Based Soft Actor-Critic in Deep Reinforcement Learning πΆ
BSAC: organizes several sub-policies as a joint policy.
πΉ Collect & Infer - a fresh look at data-efficient Reinforcement Learning π
Collect and Infer, which explicitly models RL as two separate but interconnected processes, concerned with data collection and knowledge inference respectively.
πΉ A DATASET PERSPECTIVE ON OFFLINE REINFORCEMENT LEARNING π π₯
we define characteristics of behavioral policies as exploratory for yielding high expected information in their interaction with the Markov Decision Process (MDP) and as exploitative for having high expected return. Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning
πΉ Distributional Actor-Critic Ensemble for Uncertainty-Aware Continuous Control π₯
UA-DDPG: It exploits epistemic uncertainty to accelerate exploration and aleatoric uncertainty to learn a risk-sensitive policy (also known as risk-averse RL, safe RL, and conservative RL).
πΉ On the Reuse Bias in Off-Policy Reinforcement Learning π₯ π
BIRIS: We further provide a high-probability upper bound of the Reuse Bias, and show that controlling one term of the upper bound can control the Reuse Bias by introducing the concept of stability for off-policy algorithms
πΉ Low-Rank Modular Reinforcement Learning via Muscle Synergy π₯
SOLAR: exploits the redundant nature of DoF in robot control. Actuators are grouped into synergies by an unsupervised learning method, and a synergy action is learned to control multiple actuators in synchrony. In this way, we achieve a low-rank control at the synergy level.
πΉ INAPPLICABLE ACTIONS LEARNING FOR KNOWLEDGE TRANSFER IN REINFORCEMENT LEARNING π₯
SDAS-MDP: Knowing this information (inapplicable actions) can help reduce the sample complexity of RL algorithms by masking the inapplicable actions from the policy distribution to only explore actions relevant to finding an optimal policy.
πΉ Rethinking Value Function Learning for Generalization in Reinforcement Learning πΆ
Dynamics-aware Delayed-Critic Policy Gradient (DDCPG): a policy gradient algorithm that implicitly penalizes value estimates by optimizing the value network less frequently with more training data than the policy network.
πΉ VARIATIONAL LATENT BRANCHING MODEL FOR OFF-POLICY EVALUATION π π₯
VLBM leverages and extends the variational inference framework with recurrent state alignment (RSA), which is designed to capture as much information as possible from the limited training data by smoothing out the information flow between the variational (encoding) and generative (decoding) parts of VLBM. Moreover, we also introduce a branching architecture to improve the model's robustness against randomly initialized model weights.
-
A Survey on Transfer Learning for Multiagent Reinforcement Learning Systems π π¦
πΉ Counterfactual Multi-Agent Policy Gradients π₯
COMA: to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agentβs action, while keeping the other agentsβ actions fixed.
πΉ Value-Decomposition Networks For Cooperative Multi-Agent Learning π
VDN: aims to learn an optimal linear value decomposition from the team reward signal, by back-propagating the total Q gradient through deep neural networks representing the individual component value functions.
πΉ QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning π₯ π
QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations.
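A compact sketch of the monotonic mixing network (a paraphrase of the idea, not the reference implementation): hypernetworks map the global state to mixing weights whose absolute value enforces ∂Q_tot/∂Q_a ≥ 0 for every agent a.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing of per-agent Q-values into a joint Q (QMIX-style sketch)."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # hypernetworks: mixing weights are generated from the global state, then made non-negative
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)  # >= 0 => monotone
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        h = nn.functional.elu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        q_tot = torch.bmm(h, w2) + self.hyper_b2(state).view(b, 1, 1)
        return q_tot.view(b, 1)  # joint Q, monotone in each agent's Q
```

Monotonicity is what lets the decentralized per-agent argmax recover the greedy joint action of Q_tot.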
πΉ Best Possible Q-Learning π
BQL: Best Possible Operator
-
INTRINSIC REWARD
πΉ Hierarchical Cooperative Multi-Agent Reinforcement Learning with Skill Discovery π₯ π β β
The set of low-level skills emerges from an intrinsic reward that solely promotes the decodability of latent skill variables from the trajectory of a low-level skill, without the need for hand-crafted rewards for each skill.
πΉ The Emergence of Individuality in Multi-Agent Reinforcement Learning π₯ π₯ π
EOI learns a probabilistic classifier that predicts a probability distribution over agents given their observation and gives each agent an intrinsic reward of being correctly predicted by the classifier.
πΉ A Maximum Mutual Information Framework for Multi-Agent Reinforcement Learning π π§ β β
introducing a latent variable to induce nonzero mutual information between actions.
πΉ MASER: Multi-Agent Reinforcement Learning with Subgoals Generated from Experience Replay Buffer π π₯ π₯
MASER automatically generates proper subgoals for multiple agents from the experience replay buffer by considering both individual Q-values and the total Q-value. MASER designs an individual intrinsic reward for each agent based on an actionable representation relevant to Q-learning.
-
Learning Latent Representations
πΉ Learning Latent Representations to Influence Multi-Agent Interaction π β
We propose a reinforcement learningbased framework for learning latent representations of an agentβs policy, where the ego agent identifies the relationship between its behavior and the other agentβs future strategy. The ego agent then leverages these latent dynamics to influence the other agent, purposely guiding them towards policies suitable for co-adaptation.
-
communicative partially-observable stochastic game (Comm-POSG)
πΉ TRUST REGION POLICY OPTIMISATION IN MULTI-AGENT REINFORCEMENT LEARNING π
We extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms.
πΉ Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis π π₯
MOHBA: We introduce a model-agnostic method for discovery of behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels.
πΉ Discovered Policy Optimisation π₯
we explore the Mirror Learning space by meta-learning a βdriftβ function. We refer to the result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO).
πΉ A Unified View of Entropy-Regularized Markov Decision Processes π₯ π§ β
using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations
πΉ A Theory of Regularized Markov Decision Processes ππ π₯ π§
We introduce a general theory of regularized MDPs, where the usual Bellman evaluation operator is modified by either a fixed convex function or a Bregman divergence between consecutive policies. We show how many (variations of) existing algorithms can be derived from this general algorithmic scheme, and also analyze and discuss the related propagation of errors.
πΉ Mirror Learning: A Unifying Framework of Policy Optimisation π πΆ
we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO.
πΉ Munchausen Reinforcement Learning π β π₯ π π§
Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution stands in a very simple idea: adding the scaled log-policy to the immediate reward.
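A hedged numpy sketch of my reading of the Munchausen-DQN target: the scaled (and clipped) log-policy of the taken action is added to the immediate reward, on top of the usual soft value of the next state. Hyperparameter values (`tau`, `alpha`, `l0`) are illustrative, not the paper's exact settings.

```python
import numpy as np

def softmax_policy(q, tau):
    z = q / tau
    z -= z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def munchausen_target(r, a, q_curr, q_next, gamma=0.99,
                      tau=0.03, alpha=0.9, l0=-1.0):
    """r: rewards (batch,); a: taken actions (batch,);
    q_curr / q_next: target-network Q-values at s and s' (batch, n_actions)."""
    pi_curr = softmax_policy(q_curr, tau)                    # pi(.|s)
    pi_next = softmax_policy(q_next, tau)                    # pi(.|s')
    log_pi_a = np.log(pi_curr[np.arange(len(a)), a] + 1e-8)
    bonus = alpha * np.clip(tau * log_pi_a, l0, 0.0)         # Munchausen term
    soft_value = (pi_next * (q_next - tau * np.log(pi_next + 1e-8))).sum(-1)
    return r + bonus + gamma * soft_value
```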
πΉ Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning π π₯ π π₯ π₯ π§ β
Convex conjugacy for KL and entropy regularization; 1) Mirror Descent MPI: SAC, Soft Q-learning, Softmax DQN, mellowmax policy, TRPO, MPO, DPP, CVI π§; 2) Dual Averaging MPI π§.
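The recurring object across these papers is the KL/entropy-regularized greedy step; writing it out (my rendering of the standard form, with KL weight λ and entropy weight τ):

```latex
% \pi_{k+1} = \arg\max_{\pi} \; \langle \pi, Q_k \rangle
%             - \lambda\,\mathrm{KL}(\pi \,\|\, \pi_k) + \tau\,\mathcal{H}(\pi),
% whose closed form averages past Q-values through the previous policy:
\pi_{k+1}(a \mid s) \;\propto\;
\pi_k(a \mid s)^{\frac{\lambda}{\lambda+\tau}}\,
\exp\!\left(\frac{Q_k(s,a)}{\lambda+\tau}\right)
```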
πΉ Proximal Iteration for Deep Reinforcement Learning π₯
Our contribution is to employ Proximal Iteration for optimization in deep RL.
πΉ Theoretical Analysis of Efficiency and Robustness of Softmax and Gap-Increasing Operators in Reinforcement Learning π π§
We propose and analyze conservative value iteration (CVI), which unifies value iteration, soft value iteration, advantage learning, and dynamic policy programming.
πΉ Momentum in Reinforcement Learning π π₯
We derive Momentum Value Iteration (MoVI), a variation of Value iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations.
πΉ Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning π π₯ π
we propose a novel algorithm, Geometric Value Iteration (GVI), that features a dynamic error-aware KL coefficient design with the aim of mitigating the impact of errors on performance. Our experiments demonstrate that GVI can effectively exploit the trade-off between learning speed and robustness over uniform averaging of a constant KL coefficient.
πΉ Near Optimal Policy Optimization via REPS π π§
Relative entropy policy search (REPS)
πΉ On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations π₯ π§
we show that KL-regularized reinforcement learning with behavioral reference policies derived from expert demonstrations can suffer from pathological training dynamics that can lead to slow, unstable, and suboptimal online learning.
πΉ ON COVARIATE SHIFT OF LATENT CONFOUNDERS IN IMITATION AND REINFORCEMENT LEARNING π π§
We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning.
πΉ Constrained Policy Optimization π π₯ π₯ π β
We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration.
πΉ Reward Constrained Policy Optimization π π₯ β β
we present a novel multi-timescale approach for constrained policy optimization, called "Reward Constrained Policy Optimization" (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one.
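A hedged sketch of the multi-timescale idea as I understand it: the policy is trained on a penalized reward r - λc (fast timescale), while the Lagrange multiplier λ ascends on the observed constraint violation (slow timescale). `policy_update` and the batch layout are illustrative placeholders, not the paper's code.

```python
def rcpo_step(policy_update, batch, lam, constraint_limit, lam_lr=1e-3):
    """policy_update: callable that performs one policy-gradient step given
    (batch, rewards); batch: dict with 'rewards' and 'costs' lists."""
    # fast timescale: ordinary policy-gradient step on penalized rewards
    penalized = [r - lam * c for r, c in zip(batch["rewards"], batch["costs"])]
    policy_update(batch, penalized)

    # slow timescale: projected gradient ascent on the multiplier
    avg_cost = sum(batch["costs"]) / len(batch["costs"])
    return max(0.0, lam + lam_lr * (avg_cost - constraint_limit))
```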
πΉ PROJECTION-BASED CONSTRAINED POLICY OPTIMIZATION π π₯ β β
the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set.
πΉ First Order Constrained Optimization in Policy Space π₯ π
Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. FOCOPS then projects the update policy back into the parametric policy space.
πΉ Reinforcement Learning with Convex Constraints π₯ π π§
we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks, specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set.
πΉ Batch Policy Learning under Constraints π π§ β β
propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines.
πΉ A Primal-Dual Approach to Constrained Markov Decision Processes π π§
πΉ Reward is enough for convex MDPs π₯ π π₯ π§ β
It is easy to see that Convex MDPs in which goals are expressed as convex functions of stationary distributions cannot, in general, be formulated in this manner (maximising a cumulative reward).
πΉ Challenging Common Assumptions in Convex Reinforcement Learning π π₯
We show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error.
πΉ DENSITY CONSTRAINED REINFORCEMENT LEARNING π π β
We prove the duality between the density function and Q function in CRL and use it to develop an effective primal-dual algorithm to solve density constrained reinforcement learning problems.
πΉ Control Regularization for Reduced Variance Reinforcement Learning π₯ π
CORERL: we regularize the behavior of the deep policy to be similar to a policy prior, i.e., we regularize in function space. We show that functional reg. yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off.
πΉ REGULARIZATION MATTERS IN POLICY OPTIMIZATION - AN EMPIRICAL STUDY ON CONTINUOUS CONTROL πΆ
We present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks.
πΉ REINFORCEMENT LEARNING WITH SPARSE REWARDS USING GUIDANCE FROM OFFLINE DEMONSTRATION π₯ π π₯
The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data.
πΉ MIRROR DESCENT POLICY OPTIMIZATION π π₯
We derive on-policy and off-policy variants of MDPO (mirror descent policy optimization), while emphasizing important design choices motivated by the existing theory of MD in RL.
πΉ BREGMAN GRADIENT POLICY OPTIMIZATION π₯ π₯
We propose a Bregman gradient policy optimization (BGPO) algorithm based on both the basic momentum technique and mirror descent iteration.
πΉ Safe Policy Improvement by Minimizing Robust Baseline Regret [see more in offline_rl]
πΉ Safe Policy Improvement with Baseline Bootstrapping π π₯ π
Our approach, called SPI with Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows paradigm: it bootstraps the trained policy with the baseline when the uncertainty is high.
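A minimal sketch of the Πb-SPIBB projection as I read it: on state-action pairs with fewer than `n_min` observations in the batch, the new policy copies the baseline; the remaining probability mass is allocated greedily (w.r.t. the learned Q) among well-observed actions. This is a simplified per-state version, not the reference implementation.

```python
import numpy as np

def spibb_policy(q_row, baseline_row, counts_row, n_min=10):
    """All arguments are 1-D arrays over actions for a single state."""
    bootstrapped = counts_row < n_min
    pi = np.where(bootstrapped, baseline_row, 0.0)       # keep baseline there
    free_mass = 1.0 - pi.sum()
    safe_actions = np.where(~bootstrapped)[0]
    if len(safe_actions) > 0:
        best = safe_actions[np.argmax(q_row[safe_actions])]
        pi[best] += free_mass                            # greedy on the rest
    else:
        pi = baseline_row.copy()                         # nothing safe: stay put
    return pi
```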
πΉ Safe Policy Improvement with Soft Baseline Bootstrapping π π₯ π
Instead of binarily classifying the state-action pairs into two sets (the uncertain and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty.
πΉ SPIBB-DQN: Safe batch reinforcement learning with function approximation
πΉ Safe policy improvement with estimated baseline bootstrapping
πΉ Incorporating Explicit Uncertainty Estimates into Deep Offline Reinforcement Learning π₯
deep-SPIBB: Evaluation step regularization + Uncertainty.
πΉ Accelerating Safe Reinforcement Learning with Constraint-mismatched Baseline Policies π₯ π
SPACE: We propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint satisfying set.
πΉ Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning π π₯
We propose Conservative and Adaptive Penalty (CAP), a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives.
πΉ Learning to be Safe: Deep RL with a Safety Critic πΆ
We propose to learn how to be safe in one set of tasks and environments, and then use that learned intuition to constrain future behaviors when learning new, modified tasks.
πΉ CONSERVATIVE SAFETY CRITICS FOR EXPLORATION π₯ π π₯
CSC: we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration.
πΉ Conservative Distributional Reinforcement Learning with Safety Constraints π₯
We propose the CDMPO algorithm to solve safety-constrained RL problems. Our method incorporates a conservative exploration strategy as well as a conservative distribution function. CSC + distributional RL + MPO + WAPID
πΉ CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning π π₯ π π₯
(i) We provide a rigorous theoretical analysis to extend the surrogate functions to the generalized advantage estimator (GAE). GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which makes it an efficient building block for designing CUP; (ii) The proposed bounds are tighter than existing works, i.e., using the proposed bounds as surrogate functions yields better local approximations to the objective and safety constraints. (iii) The proposed CUP provides a non-convex implementation via first-order optimizers, which does not depend on any convex approximation.
πΉ Constrained Variational Policy Optimization for Safe Reinforcement Learning π₯ π§
CVPO: [poster]
πΉ A Review of Safe Reinforcement Learning: Methods, Theory and Applications π¦
πΉ MESA: Offline Meta-RL for Safe Adaptation and Fault Tolerance πΆ
We cast safe exploration as an offline meta-RL problem, where the objective is to leverage examples of safe and unsafe behavior across a range of environments to quickly adapt learned risk measures to a new environment with previously unseen dynamics.
πΉ Safe Driving via Expert Guided Policy Optimization π π₯
We develop a novel EGPO method which integrates the guardian in the loop of reinforcement learning. The guardian is composed of an expert policy to generate demonstration and a switch function to decide when to intervene.
πΉ EFFICIENT LEARNING OF SAFE DRIVING POLICY VIA HUMAN-AI COPILOT OPTIMIZATION π₯
Human-AI Copilot Optimization (HACO): Human can take over the control and demonstrate to the agent how to avoid probably dangerous situations or trivial behaviors.
πΉ SAFER: DATA-EFFICIENT AND SAFE REINFORCEMENT LEARNING THROUGH SKILL ACQUISITION π₯
We propose SAFEty skill pRiors, a behavioral prior learning algorithm that accelerates policy learning on complex control tasks, under safety constraints. Through principled contrastive training on safe and unsafe data, SAFER learns to extract a safety variable from offline data that encodes safety requirements, as well as the safe primitive skills over abstract actions in different scenarios.
πΉ Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees π₯
We propose the Sim-to-Lab-to-Real framework that combines Hamilton-Jacobi reachability analysis and PAC-Bayes generalization guarantees to safely close the sim2real gap. Joint training of a performance and a backup policy in Sim training (1st stage) ensures safe exploration during Lab training (2nd stage).
πΉ Reachability Constrained Reinforcement Learning π
this paper proposes the reachability CRL (RCRL) method by using reachability analysis to establish the novel self-consistency condition and characterize the feasible sets. The feasible sets are represented by the safety value function.
πΉ Robust psi-Divergence MDPs π₯
we develop a novel solution framework for robust MDPs with s-rectangular ambiguity sets that decomposes the problem into a sequence of robust Bellman updates and simplex projections.
Multi-Objective RL:
πΉ Offline Constrained Multi-Objective Reinforcement Learning via Pessimistic Dual Value Iteration π₯
πΉ Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer π₯ π₯
We showed that any transfer learning problem within the SF framework can be mapped into an equivalent problem of learning multiple policies in MORL under linear preferences. We then introduced a novel SF-based extension of the OLS algorithm (SFOLS) to iteratively construct a set of policies whose SFs form a CCS. [poster]
πΉ Q-PENSIEVE: BOOSTING SAMPLE EFFICIENCY OF MULTI-OBJECTIVE RL THROUGH MEMORY SHARING OF Q-SNAPSHOTS π₯ π
we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level.
πΉ A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation π₯ π π₯
Envelope MOQ-Learning: We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences.
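My hedged reading of the envelope Bellman target (vector-valued Q, preference ω), written out for reference:

```latex
\mathbf{y}(s,a,\omega) \;=\; \mathbf{r}(s,a) \;+\; \gamma\,
\mathbf{Q}\!\left(s',\, a^{\ast},\, \omega^{\ast}\right),
\qquad
(a^{\ast}, \omega^{\ast}) \;=\; \arg\max_{a',\,\omega'}\;
\omega^{\top} \mathbf{Q}(s', a', \omega')
```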
πΉ PD-MORL: PREFERENCE-DRIVEN MULTIOBJECTIVE REINFORCEMENT LEARNING ALGORITHM π₯ π
We observe that the preference vectors have similar directional angles to the corresponding vectorized Q-values for a given state. Using this insight, we utilize the cosine similarity between the preference vector and the vectorized Q-values in the Bellman optimality operator to guide the training.
we explore optimal transport discrepancies (which include the Wasserstein distance) to define trust regions, and we propose a novel algorithm β Optimal Transport Trust Region Policy Optimization (OT-TRPO) β for continuous state-action spaces. We circumvent the infinite-dimensional optimization problem for PO by providing a one-dimensional dual reformulation for which strong duality holds.
-
Distributional RL Hao Liang, CUHK slide π¦ π¦ β β
πΉ C51: A Distributional Perspective on Reinforcement Learning π¦ β
πΉ CS598 - Statistical rl - NanJiang π¦ β
πΉ Information-Theoretic Considerations in Batch Reinforcement Learning π π π¦
πΉ Implicit Quantile Networks for Distributional Reinforcement Learning π¦ β
πΉ Continual Learning with Deep Generative Replay π§ πΆ
We propose the Deep Generative Replay, a novel framework with a cooperative dual model architecture consisting of a deep generative model (βgeneratorβ) and a task solving model (βsolverβ).
πΉ online learning; regret π¦ β
πΉ RESET-FREE LIFELONG LEARNING WITH SKILL-SPACE PLANNING π₯
We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model.
πΉ Donβt Start From Scratch: Leveraging Prior Data to Automate Robotic Reinforcement Learning π
Our main contribution is demonstrating that incorporating prior data into a reinforcement learning system simultaneously addresses several key challenges in real-world robotic RL: sample-efficiency, zero-shot generalization, and autonomous non-episodic learning.
πΉ A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning π π₯
Assuming access to a few demonstrations, we propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations. [poster]
πΉ You Only Live Once: Single-Life Reinforcement Learning via Learned Reward Shaping π₯ π
SLRL. (QWALE) that addresses the dearth of supervision by employing a distribution matching strategy that leverages the agentβs prior experience as guidance in novel situations.
πΉ FULLY ONLINE META-LEARNING WITHOUT TASK BOUNDARIES π₯
we propose a Fully Online MetaLearning (FOML) algorithm, which does not require any ground truth knowledge about the task boundaries and stays fully online without resetting back to pre-trained weights.
πΉ Learn the Time to Learn: Replay Scheduling in Continual Learning π
Storing historical data is cheap in many real-world applications, yet replaying all historical data would be prohibited due to processing time constraints. In such settings, we propose learning the time to learn for a continual learning system, in which we learn replay schedules over which tasks to replay at different time steps.
πΉ Self-Paced Contextual Reinforcement Learning π π¦ β β
We introduce a novel relative entropy reinforcement learning algorithm that gives the agent the freedom to control the intermediate task distribution, allowing for its gradual progression towards the target context distribution.
πΉ Self-Paced Deep Reinforcement Learning π π¦ β β
In this paper, we propose an answer by interpreting the curriculum generation as an inference problem, where distributions over tasks are progressively learned to approach the target task. This approach leads to an automatic curriculum generation, whose pace is controlled by the agent, with solid theoretical motivation and easily integrated with deep RL algorithms.
πΉLearning with AMIGO: Adversarially Motivated Intrinsic Goals π Lil'Log-Curriculum π β
(Intrinsic motivation + Curriculum learning)
πΉ Information Directed Reward Learning for Reinforcement Learning π₯ π π₯
IDRL: uses a Bayesian model of the reward and selects queries that maximize the information gain about the difference in return between plausibly optimal policies.
πΉ Actively Learning Costly Reward Functions for Reinforcement Learning πΆ
ACRL, an extension to standard reinforcement learning methods in the context of (computationally) expensive rewards, which models the reward of given applications using machine learning models.
πΉ Human-Timescale Adaptation in an Open-Ended Task Space π₯ π
AdA: Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agentβs capabilities.
πΉ VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models π₯ π₯
VOXPOSER extracts language-conditioned affordances and constraints from LLMs and grounds them to the perceptual space using VLMs, using a code interface and without additional training to either component.
πΉ RoCo: Dialectic Multi-Robot Collaboration with Large Language Models π₯
πΉ REWARD DESIGN WITH LANGUAGE MODELS π₯
πΉ Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
πΉ Language to Rewards for Robotic Skill Synthesis π₯ π π₯
Using reward as the intermediate interface generated by LLMs, we can effectively bridge the gap between high-level language instructions or corrections to low-level robot actions.
πΉ Code as Policies: Language Model Programs for Embodied Control π₯
Given examples (via few-shot prompting), robots can use code-writing large language models (LLMs) to translate natural language commands into robot policy code which processes perception outputs, parameterizes control primitives, recursively generates code for undefined functions, and generalizes to new tasks.
πΉ βNo, to the Rightβ β Online Language Corrections for Robotic Manipulation via Shared Autonomy πΆ
Language-Informed Latent Actions with Corrections (LILAC)
πΉ Lila: Language-informed latent actions π₯
LILA learns to use language to modulate this controller, providing users with a language-informed control space: given an instruction like "place the cereal bowl on the tray," LILA may learn a 2-DoF space where one dimension controls the distance from the robot's end-effector to the bowl, and the other dimension controls the robot's end-effector pose relative to the grasp point on the bowl.
-
Locomotion
ETG-RL: Unlike prior methods that use a fixed trajectory generator, the generator continually optimizes the shape of the output trajectory for the given task, providing diversified motion priors to guide the policy learning.
πΉ REvolveR: Continuous Evolutionary Models for Robot-to-robot Policy Transfer πΆ
πΉ Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion
πΉ Barkour: Benchmarking Animal-level Agility with Quadruped Robots π₯
Omni-directional walking, slope, and jumping policies are trained in simulation using RL. We then run the policies to create datasets which we use to distill a generalist Locomotion-Transformer policy.
πΉ LATTE: LAnguage Trajectory TransformEr π₯
Our method leverages pre-trained language models (BERT and CLIP) to encode the user's intent and target objects directly from a free-form text input and scene images, fuses geometrical features generated by a transformer encoder network, and finally outputs trajectories using a transformer decoder, without the need for priors related to the task or robot information.
-
RR
πΉ BENCHMARKING OFFLINE REINFORCEMENT LEARNING ON REAL-ROBOT HARDWARE π§
-
Place
πΉ MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning π₯ π₯
MaskPlace recasts placement as a problem of learning pixel-level visual representation to comprehensively describe millions of modules on a chip, enabling placement in a high-resolution canvas and a large action space.
πΉ DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement πΆ
casting the analytical placement problem as an equivalent neural-network training problem.
πΉ On Joint Learning for Solving Placement and Routing in Chip Design π₯
DeepPR. One key design in our (reinforcement) learning paradigm involves a multi-view embedding model to encode both global graph level and local node level information of the input macros.
πΉ The Policy-gradient Placement and Generative Routing Neural Networks for Chip Design π₯
PRNet:
-
Contrastive Divergence (CD)
πΉ Training Products of Experts by Minimizing Contrastive Divergence π₯ π Notes π β
C: contrastive = perceivable difference(s)
D: divergence = general trend of such differences (over epochs)
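A hedged CD-1 sketch in PyTorch, using a single Langevin-style step started from the data as the "model" sample (the classical recipe uses k Gibbs steps; this is a crude stand-in). `energy_net` is assumed to be any `torch.nn.Module` mapping inputs to scalar energies; step sizes are illustrative.

```python
import torch

def cd1_loss(energy_net, x_data, step_size=0.01, noise_scale=0.01):
    x = x_data.clone().detach().requires_grad_(True)
    e = energy_net(x).sum()
    grad_x, = torch.autograd.grad(e, x)
    # one (crude) MCMC/Langevin step away from the data
    x_model = (x - step_size * grad_x
               + noise_scale * torch.randn_like(x)).detach()
    # contrastive objective: push data energy down, sample energy up
    return energy_net(x_data).mean() - energy_net(x_model).mean()
```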
πΉ A Contrastive Divergence for Combining Variational Inference and MCMC π§ β
πΉ CONTRASTIVE DIVERGENCE LEARNING IS A TIME REVERSAL ADVERSARIAL GAME π₯ π§
Specifically, we show that CD is an adversarial learning procedure, where a discriminator attempts to classify whether a Markov chain generated from the model has been time-reversed.
-
DISTRIBUTIONALLY ROBUST OPTIMIZATION (DRO)
πΉ MODELING THE SECOND PLAYER IN DISTRIBUTIONALLY ROBUST OPTIMIZATION π π₯ β β
we argue instead for the use of neural generative models to characterize the worst-case distribution, allowing for more flexible and problem-specific selection of the uncertainty set.
πΉ Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning
πΉ Variance-based regularization with convex objectives
πΉ Adaptive Regularization for Adversarial Training π
we develop a new data-adaptive regularization algorithm for adversarial training called Anti-Robust Weighted Regularization (ARoW). (more methods: PGD-Training, TRADES, GAIR-AT, FAT, MMA)
-
Distribution shift; Robust;
πΉ Rethinking Importance Weighting for Deep Learning under Distribution Shift π β
πΉ Variational Inference based on Robust Divergences π π§
Maximum likelihood estimation and its robust variants: the density power divergence (β-divergence) and the γ-divergence.
πΉ A New Kind of Adversarial Example π
we consider the opposite: adversarial examples that can fool a human but not a model.
-
Implicit learning
πΉ Generalization Bounded Implicit Learning of Nearly Discontinuous Functions π₯ π₯
πΉ A Tutorial on Energy-Based Learning π π₯
Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference + Learning
πΉ Implicit Generation and Modeling with Energy-Based Models π
We present an algorithm and techniques for training energy based models that scale to challenging high-dimensional domains.
πΉ Compositional Visual Generation with Energy Based Models π
πΉ Improved Contrastive Divergence Training of Energy-Based Model π₯ π
We show that a gradient term neglected in the popular contrastive divergence formulation is both tractable to estimate and is important in avoiding training instabilities that previously limited applicability and scalability of energy-based models.
πΉ How to Train Your Energy-Based Models π₯ π₯
We start by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceed to elaborate on MCMC-free approaches, including Score Matching (SM) and Noise Contrastive Estimation (NCE).
CEM: the first probabilistic characterization of AT through a unified understanding of robustness and generative ability; it interprets unsupervised contrastive learning as an importance sampling of CEM.
πΉ YOUR CLASSIFIER IS SECRETLY AN ENERGY BASED MODEL AND YOU SHOULD TREAT IT LIKE ONE π₯ π
We propose to reinterpret a standard discriminative classifier of p(y|x) as an energy based model for the joint distribution p(x, y).
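A tiny hedged sketch of the JEM reinterpretation: a K-way classifier with logits f(x) defines an input energy E(x) = -logsumexp_y f(x)[y], so log p(x) is known up to the (ignored) normalizer, while p(y|x) stays the ordinary softmax of the same logits.

```python
import torch

def energy_from_logits(logits):          # logits: (batch, num_classes)
    return -torch.logsumexp(logits, dim=1)

def class_probs(logits):
    return torch.softmax(logits, dim=1)  # unchanged discriminative head
```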
-
Diffusion
πΉ Sliced Score Matching: A Scalable Approach to Density and Score Estimation π₯
We show this difficulty (computing the Hessian of log-density functions) can be mitigated by projecting the scores onto random vectors before comparing them.
πΉ Generative Modeling by Estimating Gradients of the Data Distribution π₯
NCSN: we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold.
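A hedged numpy sketch of annealed Langevin sampling with a learned score function `score(x, sigma) ≈ ∇_x log p_σ(x)`; `sigmas` is a decreasing noise schedule and `eps` a base step size (both hypothetical settings, not the paper's).

```python
import numpy as np

def annealed_langevin(score, x, sigmas, eps=2e-5, steps_per_level=100, rng=None):
    rng = rng or np.random.default_rng(0)
    for sigma in sigmas:                              # large noise -> small noise
        step = eps * (sigma / sigmas[-1]) ** 2        # per-level step size
        for _ in range(steps_per_level):
            noise = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * noise
    return x
```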
πΉ Improved Techniques for Training Score-Based Generative Models π
πΉ Denoising Diffusion Probabilistic Models π₯ π π₯
Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
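A hedged sketch of the simplified DDPM training objective: sample a timestep and Gaussian noise, form the noised input, and regress the network's prediction onto the noise. `model(x_t, t)` is assumed to return an ε-prediction of the same shape as x0, and `alphas_bar` a 1-D tensor of cumulative products of (1 - β_t).

```python
import torch

def ddpm_loss(model, x0, alphas_bar):
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (b,), device=x0.device)
    abar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))   # broadcast shape
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return torch.nn.functional.mse_loss(model(x_t, t), eps)
```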
πΉ Improved Denoising Diffusion Probabilistic Models π
πΉ SCORE-BASED GENERATIVE MODELING THROUGH STOCHASTIC DIFFERENTIAL EQUATIONS π π₯
Using SDEs, it encapsulates previous approaches to score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities.
πΉ A Connection Between Score Matching and Denoising Autoencoders π₯
πΉ Understanding Diffusion Models: A Unified Perspective π π₯ π π₯ π₯
πΉ Conditional Image Generation with Score-Based Diffusion Models π₯
CMDE: we introduce a multi-speed diffusion framework, which leads to a new estimator for the conditional score.
πΉ Score-based Generative Modeling in Latent Space π₯ π
Latent Score-based Generative Model (LSGM)
πΉ D2C: Diffusion-Decoding Models for Few-Shot Conditional Generation π₯
D2C uses a learned diffusion-based prior over the latent representations to improve generation and contrastive self-supervised learning to improve representation quality.
πΉ Resolving Label Uncertainty with Implicit Posterior Models π π₯ π π₯
We propose a method for jointly inferring labels across a collection of data samples, where each sample consists of an observation and a prior belief about the label.
πΉ Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space π π₯
PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable βconditionβ network C that tells the generator what to draw.
πΉ Toward Multimodal Image-to-Image Translation π
we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output.
πΉ LATENT CONSTRAINTS: LEARNING TO GENERATE CONDITIONALLY FROM UNCONDITIONAL GENERATIVE MODELS π₯
By post-hoc learning latent constraints, value functions that identify regions in latent space that generate outputs with desired attributes, we can conditionally sample from these regions with gradient-based optimization or amortized actor functions.
πΉ Conditioning by adaptive sampling for robust design π π§ MBO
we propose a method to solve this problem (data far from the training distribution) that uses model-based adaptive sampling to estimate a distribution over the design space, conditioned on the desired properties. + diffusion?
πΉ Back to the Source: Diffusion-Driven Test-Time Adaptation π π₯
We instead update the target data, by projecting all test inputs toward the source domain with a generative diffusion model. Our diffusion-driven adaptation method, DDA, shares its models for classification and generation across all domains.
πΉ Let us Build Bridges: Understanding and Extending Diffusion Generative Models π
By viewing diffusion models as latent variable models with unobserved diffusion trajectories and applying maximum likelihood estimation (MLE) with latent trajectories imputed from an auxiliary distribution, we show that both the model construction and the imputation of latent trajectories amount to constructing diffusion bridge processes that achieve deterministic values and constraints at the endpoint, for which we provide a systematic study and a suite of tools.
πΉ CLASSIFIER-FREE DIFFUSION GUIDANCE π₯
We jointly train a conditional and an unconditional diffusion model, and we combine the resulting conditional and unconditional score estimates to attain a trade-off between sample quality and diversity similar to that obtained using classifier guidance.
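The guidance combination is a one-liner worth keeping around; a hedged sketch with guidance weight `w` (w = 0 recovers the purely conditional estimate):

```python
def cf_guided_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional toward
    the conditional noise prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond
```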
πΉ LEARNING ENERGY-BASED MODELS BY DIFFUSION RECOVERY LIKELIHOOD π π₯
Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. Optimizing recovery likelihood is more tractable than marginal likelihood, as sampling from the conditional distributions is much easier than sampling from the marginal distributions.
πΉ Planning with Diffusion for Flexible Behavior Synthesis π₯
πΉ Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning π₯
we propose Diffusion-QL that utilizes a conditional diffusion model as a highly expressive policy class for behavior cloning and policy regularization.
πΉ OFFLINE REINFORCEMENT LEARNING VIA HIGHFIDELITY GENERATIVE BEHAVIOR MODELING π₯
we adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model (diffusion model) and an action evaluation model (Q-value of behavior policy).
πΉ KNOW YOUR BOUNDARIES: THE ADVANTAGE OF EXPLICIT BEHAVIORAL CLONING IN OFFLINE RL
ARQ: utilizing a score-based generative model for behavior cloning.
πΉ A Regularized Implicit Policy for Offline Reinforcement Learning πΆ
We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen-Shannon divergence and the integral probability metrics.
πΉ IMAGEN VIDEO: HIGH DEFINITION VIDEO GENERATION WITH DIFFUSION MODELS π₯
Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models.
πΉ ANALYZING DIFFUSION AS SERIAL REPRODUCTION π₯ π§
By identifying a correspondence between diffusion models and a well-known paradigm in cognitive science known as serial reproduction, whereby human agents iteratively observe and reproduce stimuli from memory, we show how the aforementioned properties (weak sensitivity to the choice of noise family and the role of adequate scheduling of noise levels) of diffusion models can be explained as a natural consequence of this correspondence.
πΉ Parallel Diffusion Models of Operator and Image for Blind Inverse Problems π₯
BlindDPS: a framework for solving blind inverse problems by jointly estimating the parameters of the forward measurement operator and the image to be reconstructed.
-
Data Valuation
πΉ Data shapley: Equitable valuation of data for machine learning
πΉ DATA VALUATION USING REINFORCEMENT LEARNING π₯
DVRL: We train the data value estimator using a reinforcement signal of the reward obtained on a small validation set that reflects performance on the target task.
πΉ Data Valuation for Offline Reinforcement Learning π₯
DVORL: allows us to identify relevant and high-quality transitions, improving the performance and transferability of policies learned by offline reinforcement learning algorithms.
-
IMOP, IOP: Inverse (Multiobjective) Optimization Problem
-
Action Learning
πΉ Active inference: demystified and compared π₯ π
an accessible overview of the discrete-state formulation of active inference, highlighting natural behaviors in active inference that are generally engineered in reinforcement learning;
πΉ Active inference, Bayesian optimal design, and expected utility π₯
When removing prior outcomes preferences from expected free energy, active inference reduces to optimal Bayesian design, i.e., information gain maximization. Conversely, active inference reduces to Bayesian decision theory in the absence of ambiguity and relative risk, i.e., expected utility maximization.
πΉ DEEP ACTIVE INFERENCE AS VARIATIONAL POLICY GRADIENTS π₯
πΉ Deep active inference agents using Monte-Carlo methods π₯
we present a neural architecture for building deep active inference agents operating in complex, continuous state spaces using multiple forms of Monte-Carlo (MC) sampling.
πΉ Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation π₯ π₯
ANOLE: The agent can adapt to new tasks by querying human preferences between behavior trajectories instead of using per-step numeric rewards. By extending techniques from information theory, our approach can design query sequences to maximize the information gain from human interactions while tolerating the inherent error of a non-expert human oracle.
πΉ Prior Preference Learning from Experts: Designing a Reward with Active Inference
πΉ Exploration and preference satisfaction trade-off in reward-free learning
πΉ Active Inference in Robotics and Artificial Agents: Survey and Challenges
πΉ Learning Human Objectives by Evaluating Hypothetical Behavior π₯ π₯
reward query synthesis via trajectory optimization (ReQueST): an algorithm that synthesizes hypothetical behaviors in order to safely and efficiently train neural network reward models in environments with high-dimensional, continuous states.
-
others
πΉ Structured Prediction with Partial Labelling through the Infimum Loss π π§ β β
πΉ Bridging the Gap Between f-GANs and Wasserstein GANs π π₯ π ββ
we propose a new training objective in which we additionally optimize over a set of importance weights on the generated samples. By suitably constraining the feasible set of importance weights, we obtain a family of objectives which includes and generalizes the original f-GAN and WGAN objectives.
πΉ f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization π π₯
πΉ Discriminator Contrastive Divergence: Semi-Amortized Generative Modeling by Exploring Energy of the Discriminatorβ π π₯
DCD: Compared to standard GANs, where the generator is directly utilized to obtain new samples, our method proposes a semi-amortized generation procedure where the samples are produced with the generator's output as an initial state.
πΉ DISCRIMINATOR REJECTION SAMPLING π₯
We ask if the information retained in the weights of the discriminator at the end of the training procedure can be used to "improve" the generator.
πΉ On Symmetric Losses for Learning from Corrupted Labels π₯ π§
using a symmetric loss is advantageous in the balanced error rate (BER) minimization and area under the receiver operating characteristic curve (AUC) maximization from corrupted labels.
πΉ A Symmetric Loss Perspective of Reliable Machine Learning π π§ π₯ β
a symmetric loss can yield robust classification from corrupted labels in balanced error rate (BER) minimization and area under the receiver operating characteristic curve (AUC) maximization.
πΉ Connecting Generative Adversarial Networks and Actor-Critic Methods π β
GANs can be seen as a modified actor-critic method with blind actors (stateless) in stateless MDPs.
πΉ An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild π π
we propose a method for disentangling a set of factors which are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. β
πΉ Recomposing the Reinforcement Learning Building Blocks with Hypernetworks π₯
To consider the interaction between the input variables, we suggest using a Hypernetwork architecture where a primary network determines the weights of a conditional dynamic network.
πΉ SUBJECTIVE LEARNING FOR OPEN-ENDED DATA π₯
OSL: we present a novel supervised learning framework of learning from open-ended data, which is modeled as data implicitly sampled from multiple domains with the data in each domain obeying a domain-specific target function.
πΉ The State of Sparse Training in Deep Reinforcement Learning
πΉ Learning Iterative Reasoning through Energy Minimization π₯
We train a neural network to parameterize an energy landscape over all outputs, and implement each step of the iterative reasoning as an energy minimization step to find a minimal energy solution.
πΉ Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting π₯ π
A standard technique to correct this bias is importance sampling, where samples from the model are weighted by the likelihood ratio under model and true distributions. When the likelihood ratio is unknown, it can be estimated by training a probabilistic classifier to distinguish samples from the two distributions.
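A hedged sketch of the classifier-based ratio estimate: a probabilistic classifier trained to separate model samples (label 0) from real data (label 1) gives p_data(x)/p_model(x) ≈ D(x)/(1 - D(x)), which is then used as an importance weight on generated samples. The sklearn-style `predict_proba` interface and class ordering are assumptions.

```python
import numpy as np

def importance_weights(classifier, samples, eps=1e-6):
    d = classifier.predict_proba(samples)[:, 1]      # P(real | x), assumed column 1
    d = np.clip(d, eps, 1 - eps)
    return d / (1 - d)                               # ~ p_data(x) / p_model(x)
```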
πΉ Telescoping Density-Ratio Estimation π
we introduce a new framework, telescoping density-ratio estimation (TRE), that enables the estimation of ratios between highly dissimilar densities in high-dimensional spaces.
πΉ Adaptive Multi-stage Density Ratio Estimation for Learning Latent Space Energy-based Model π₯
we develop the adaptive multi-stage density ratio estimation which breaks the estimation into multiple stages and learn different stages of density ratio sequentially and adaptively. The latent prior model can be gradually learned using ratio estimated in previous stage so that the final latent space EBM prior can be naturally formed by product of ratios in different stages.
πΉ Distribution Augmentation for Generative Modeling π
DisAug: Our approach applies augmentation functions to data and, importantly, conditions the generative model on the specific function used.
πΉ ON HARD EPISODES IN META-LEARNING π₯
Different episodes, however, may vary in hardness and quality leading to a wide gap in the meta-learner's performance across episodes. We investigate various properties of hard episodes and highlight their connection to catastrophic forgetting during meta-training.
πΉ Data Augmentation for Meta-Learning πΆ
We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels.
πΉ UNIFORM PRIORS FOR DATA-EFFICIENT TRANSFER πΆ
features that are most transferable have high uniformity in the embedding space and propose a uniformity regularization scheme that encourages better transfer and feature reuse.
πΉ Instance-based Learning for Knowledge Base Completion π₯ π
a new method for knowledge base completion (KBC): instance-based learning (IBL)
πΉ Dataset Distillation by Matching Training Trajectories π₯
MTT: we propose a new formulation that optimizes our distilled data to guide networks to a similar state as those trained on real data across many training steps.
πΉ Dataset Distillation via Factorization π
HaBa: we further introduce a pair of adversarial contrastive constraints on the resultant hallucination networks and bases, which increase the diversity of generated images and inject more discriminant information into the factorization.
πΉ RETHINKING SKIP CONNECTION MODEL AS A LEARNABLE MARKOV CHAIN π₯ π
we introduce the concept of a learnable Markov chain for residual-like models, and propose a simple routine of penal connection to boost model performance and alleviate model degradation at large depth.
πΉ DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature π
we first demonstrate that text sampled from an LLM tends to occupy negative curvature regions of the model's log probability function.
-
Distillation
πΉ Policy Distillation with Selective Input Gradient Regularization for Efficient Interpretability π₯
We propose an approach of Distillation with selective Input Gradient Regularization (DIGR) which uses policy distillation and input gradient regularization to produce new policies that achieve both high interpretability and computation efficiency in generating saliency maps.
πΉ Gradient-based Bi-level Optimization for Deep Learning: A Survey π π₯ π π₯ π₯
Bi-level optimization embeds one problem within another and the gradient-based category solves the outer level task by computing the hypergradient.
πΉ Perspectives on Incorporating Expert Feedback into Model Updates
we consider how to capture interactions between practitioners and experts systematically. We devise a taxonomy to match expert feedback types with practitioner updates. A practitioner may receive feedback from an expert at the observation or domain level, and convert this feedback into updates to the dataset, loss function, or parameter space.
β π βοΈ π π π π π π π π― π π π§ π π°οΈ π‘ π π π π© πΊ π΅ π β³ β π· π π π π π β
-
Deep Reinforcement Learning amidst Lifelong Non-Stationarity https://arxiv.org/pdf/2006.10701.pdf
-
Learning Robot Skills with Temporal Variational Inference https://arxiv.org/pdf/2006.16232.pdf
-
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design https://arxiv.org/pdf/0912.3995.pdf [icml2020 test of time award] π β
-
On Learning Sets of Symmetric Elements https://arxiv.org/pdf/2002.08599.pdf [icml2020 outstanding paper awards] π β
-
Non-delusional Q-learning and Value Iteration https://papers.nips.cc/paper/8200-non-delusional-q-learning-and-value-iteration.pdf [NeurIPS2018 Best Paper Award]
-
SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows [Max Welling] https://arxiv.org/pdf/2007.02731.pdf
πΉ Normalizing Flows: An Introduction and Review of Current Methods π ; Citing: Normalizing Flows for Probabilistic Modeling and Inference π π₯ π₯ π₯ ; lil-log: Flow-based Deep Generative Models ; Jianlin Su: f-VAES π¦ ; Deep generative models π¦ Another slide;
πΉ Deep Kernel Density Estimation (Maximum Likelihood, Neural Density Estimation (Auto Regressive Models + Normalizing Flows), Score Matching (MRF), Kernel Exponential Family (RKHS), Deep Kernel);
ToM
-
Self-Supervised Learning lil-log π¦ ;
πΉ Self-Supervised Exploration via Disagreement π
πΉ
-
πΉ Comparing Distributions by Measuring Differences that Affect Decision Making π π₯ π π₯
H divergence: (H Entropy) We propose a new class of discrepancies based on the optimal loss for a decision task: two distributions are different if the optimal decision loss is higher on their mixture than on each individual distribution. By suitably choosing the decision task, this generalizes the JS divergence and the MMD family.
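One symmetric member of the family, written out as I read it (H_ℓ is the "H-entropy", i.e. the optimal decision loss under loss ℓ); the exact mixture weights are my assumption:

```latex
H_{\ell}(p) \;=\; \inf_{a \in \mathcal{A}} \ \mathbb{E}_{x \sim p}\,[\ell(x, a)],
\qquad
D_{\ell}(p, q) \;=\; H_{\ell}\!\Big(\tfrac{p+q}{2}\Big)
\;-\; \tfrac{1}{2}\,H_{\ell}(p) \;-\; \tfrac{1}{2}\,H_{\ell}(q)
```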
πΉ RotoGrad: Gradient Homogenization in Multitask Learning π π π₯
We introduce RotoGrad, an algorithm that tackles negative transfer as a whole: it jointly homogenizes gradient magnitudes and directions, while ensuring training convergence. The proposed strategy is to introduce additional parameterized rotation matrices, each of which modifies the shared representation before it is passed to a corresponding task-specific branch. The parameters of these rotation matrices are optimized to maximize gradient similarity between different tasks at the branch point; this optimization step is interlaced with standard updates of other network parameters to minimize total task loss.
πΉ META DISCOVERY: LEARNING TO DISCOVER NOVEL CLASSES GIVEN VERY LIMITED DATA π π₯ π₯
Demystifying Assumptions in Learning to Discover Novel Classes (L2DNC): finds that high-level semantic features should be shared among the seen and unseen classes. CATA (Clustering-rule-aware Task Sampler): data have multiple views, but one or a few views dominate each instance, and these dominant views share high-level semantic meaning; we propose to use the dominant views as the clustering rule.
πΉ Learning Surrogate Losses π₯
We learn smooth relaxation versions of the true losses by approximating them through a surrogate neural network.
πΉ Learning Surrogates via Deep Embedding π₯
Training neural networks by minimizing learned surrogates that approximate the target evaluation metric.
πΉ RELATIONAL SURROGATE LOSS LEARNING π₯ π
Instead of directly approximating the evaluation metrics as previous methods do, this paper proposes a new learning method by revisiting the purpose of loss functions, which is to distinguish the performance of models. Hence, the authors aim to learn surrogate losses that have the same discriminability as the evaluation metrics. The idea is straightforward and easy to implement by using ranking correlation as the optimization objective.
πΉ Iterative Teacher-Aware Learning π π₯ π₯
We propose a gradient optimization based teacher-aware learner who can incorporate the teacher's cooperative intention into the likelihood function and learn provably faster compared with the naive learning algorithms used in previous machine teaching works.
πΉ MAXIMIZING ENSEMBLE DIVERSITY IN DEEP REINFORCEMENT LEARNING π
We describe Maximizing Ensemble Diversity in Reinforcement Learning (MED-RL), a set of regularization methods inspired by economics and consensus optimization that improve diversity in ensemble-based deep reinforcement learning by encouraging inequality between the networks during training.
-
AAA:
πΉ Efficiently Identifying Task Groupings for Multi-Task Learning π₯
Our method determines task groupings in a single run by training all tasks together and quantifying the extent to which one task's gradient affects another task's loss.
πΉ Learning from Failure: Training Debiased Classifier from Biased Classifier π π
Our idea is twofold; (a) we intentionally train the first network to be biased by repeatedly amplifying its "prejudice", and (b) we debias the training of the second network by focusing on samples that go against the prejudice of the biased network in (a).
πΉ Just Train Twice: Improving Group Robustness without Training Group Information π
We propose a simple two-stage approach, JTT, that minimizes the loss over a reweighted dataset (second stage) where we upweight training examples that are misclassified at the end of a few steps of standard training (first stage).
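A hedged sketch of the two-stage recipe: a first model is trained with plain ERM for a few epochs, and the training points it misclassifies form an error set that is upweighted when training the second model from scratch. The upweighting factor is a hypothetical value.

```python
def jtt_weights(first_model_preds, labels, lambda_up=20.0):
    """Per-example weights for the second-stage training set."""
    return [lambda_up if p != y else 1.0
            for p, y in zip(first_model_preds, labels)]

# second stage then minimizes sum_i w_i * loss(f(x_i), y_i) over a fresh model
```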
πΉ Environment Inference for Invariant Learning
πΉ Robustness between the worst and average case π₯ π§
We proposed a definition of intermediate-q robustness that smooths the gap between robustness to random perturbations and adversarial robustness by generalizing these notions of robustness as functional ℓq norms of the loss function over the perturbation distribution.
πΉ Adaptive Risk Minimization: Learning to Adapt to Domain Shift π π₯
We aim to learn models that adapt at test time to domain shift using unlabeled test points. Our primary contribution is to introduce the framework of adaptive risk minimization (ARM), in which models are directly optimized for effective adaptation to shift by learning to adapt on the training domains.
πΉ Unsupervised Learning of Compositional Energy Concepts π₯ π₯
We propose COMET, which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts as well as objects under a unified framework. COMET discovers energy functions through recomposing the input image, which we find captures independent factors without additional supervision.
πΉ LEARNING A LATENT SEARCH SPACE FOR ROUTING PROBLEMS USING VARIATIONAL AUTOENCODERS π π₯
CVAE-Opt: This paper proposes a learning-based approach for solving combinatorial optimization problems such as routing using continuous optimizers. The key idea is to learn a continuous latent space via a conditional VAE to represent solutions and to search in this latent space for new problems at test time.
πΉ Learning to Solve Vehicle Routing Problems: A Survey π
πΉ Time Series Forecasting Models Copy the Past: How to Mitigate π π₯ π
In the presence of noise and uncertainty, neural network models tend to replicate the last observed value of the time series, thus limiting their applicability to real-world data. We also propose a regularization term penalizing the replication of previously seen values.
πΉ Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation π₯ π
GFlowNet: based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph.
πΉ Biological Sequence Design with GFlowNets π₯
We also propose a scheme to incorporate existing labeled datasets of candidates, in addition to a reward function, to speed up learning in GFlowNets.
-
Interpretation
πΉ CONTRASTIVE EXPLANATIONS FOR REINFORCEMENT LEARNING VIA EMBEDDED SELF PREDICTIONS π
We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another.
-
Label Noise
πΉ Eliciting Informative Feedback: The Peer-Prediction Method π₯ π π₯
Each rater merely reports a signal, and the system applies proper scoring rules to the implied posterior beliefs about another rater's report.
πΉ Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates π
We introduce a new family of loss functions that we name as peer loss functions, which enables learning from noisy labels and does not require a priori specification of the noise rates.
πΉ LEARNING WITH INSTANCE-DEPENDENT LABEL NOISE: A SAMPLE SIEVE APPROACH π π
CORES2: We propose to train a classifier using a novel confidence regularization (CR) term and theoretically guarantee that, under mild assumptions, minimizing the confidence regularized cross-entropy (CE) loss on the instance-based noisy distribution is equivalent to minimizing the pure CE loss on the corresponding "unobservable" clean distribution.
πΉ A Second-Order Approach to Learning with Instance-Dependent Label Noise π₯ π§
We propose and study the potentials of a second-order approach that leverages the estimation of several covariance terms defined between the instance-dependent noise rates and the Bayes optimal label. We show that this set of second-order statistics successfully captures the induced imbalances.
πΉ Does label smoothing mitigate label noise? π₯
We related smoothing to one of these correction techniques, and re-interpreted it as a form of regularisation.
πΉ Understanding Generalized Label Smoothing when Learning with Noisy Labels π
We unify label smoothing with either positive or negative smooth rate into a generalized label smoothing (GLS) framework. We proceed to show that there exists a phase transition behavior when finding the optimal label smoothing rate for GLS.
πΉ Understanding Instance-Level Label Noise: Disparate Impacts and Treatments π§ π₯
πΉ WHEN OPTIMIZING f -DIVERGENCE IS ROBUST WITH LABEL NOISE π
We derive a nice decoupling property for a family of f-divergence measures when label noise is present, where the divergence is shown to be a linear combination of the variational difference defined on the clean distribution and a bias term introduced due to the noise.
πΉ Can Less be More? When Increasing-to-Balancing Label Noise Rates Considered Beneficial
We are primarily inspired by three observations: 1) In contrast to reducing label noise rates, increasing the noise rates is easy to implement; 2) Increasing a certain class of instances' label noise to balance the noise rates (increasing-to-balancing) results in an easier learning problem; 3) Increasing-to-balancing improves fairness guarantees against label bias.
-
Semi-supervise; self-training
πΉ TEMPORAL ENSEMBLING FOR SEMI-SUPERVISED LEARNING π₯
We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions.
πΉ Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data π π¦
This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic "expansion" assumption, which states that a low-probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap.
πΉ Self-training with Noisy Student improves ImageNet classification π₯
The teacher produces high-quality pseudo labels by reading in clean images, while the student is required to reproduce those labels with augmented images as input.
πΉ Unsupervised Data Augmentation for Consistency Training π
We present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
πΉ Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning π
Virtual adversarial loss is defined as the robustness of the conditional label distribution around each input data point against local perturbation. Unlike adversarial training, our method defines the adversarial direction without label information and is hence applicable to semi-supervised learning.
πΉ Semi-supervised Learning by Entropy Minimization π₯
πΉ Robustness to Adversarial Perturbations in Learning from Incomplete Data π π§
We unify two major learning frameworks: Semi-Supervised Learning (SSL) and Distributionally Robust Learning (DRL).
πΉ SENTRY: Selective Entropy Optimization via Committee Consistency for Unsupervised Domain Adaptation π
A UDA algorithm that judges the reliability of a target instance based on its predictive consistency under a committee of random image transformations.
πΉ Deep Co-Training with Task Decomposition for Semi-Supervised Domain Adaptation
πΉ Debiased Contrastive Learning π
we develop a debiased contrastive objective that corrects for the sampling of same-label datapoints, even without knowledge of the true labels.
πΉ Positive Unlabeled Contrastive Learning π₯
puNCE: that leverages the available explicit (from labeled samples) and implicit (from unlabeled samples) supervision to learn useful representations from positive unlabeled input data.
πΉ Boosting Few-Shot Learning With Adaptive Margin Loss π₯ π
This paper proposes an adaptive margin principle to improve the generalization ability of metric-based meta-learning approaches for few-shot learning problems.
πΉ INFONCE IS A VARIATIONAL AUTOENCODER π₯
the self-supervised variational autoencoder (SSVAE)
πΉ Learning to Drop Out: An Adversarial Approach to Training Sequence VAEs π₯
training sequence VAEs is challenging: autoregressive decoders can often explain the data without utilizing the latent space, a failure mode known as posterior collapse.
πΉ SIMPER: SIMPLE SELF-SUPERVISED LEARNING OF PERIODIC TARGETS π₯ π₯
We present SimPer, a simple contrastive SSL regime for learning periodic information in data. To exploit the periodic inductive bias, SimPer introduces customized augmentations, feature similarity measures, and a generalized contrastive loss for learning efficient and robust periodic representations.
πΉ THE HIDDEN UNIFORM CLUSTER PRIOR IN SELF-SUPERVISED LEARNING π₯ π₯
By moving away from conventional uniformity priors (in self-supervised learning) and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets.
-
Uncertainty (Calibration); OOD;
πΉ Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles π₯
πΉ Accurate Uncertainties for Deep Learning Using Calibrated Regression π₯
πΉ Calibrated Reliable Regression using Maximum Mean Discrepancy ππ
We propose the calibrated regression method using the maximum mean discrepancy by minimizing the kernel embedding measure.
πΉ Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder π₯ π
The Likelihood Regret of a single input can be interpreted as the log ratio between its likelihood obtained by the posterior distribution optimized individually for that input and the likelihood approximated by the VAE.
πΉ Likelihood Ratios for Out-of-Distribution Detection π₯ π₯
We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics (using a background model).
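The scoring rule itself is simple; a sketch assuming two trained density models exposing a `log_prob` method, with the "background" model trained on perturbed inputs as the paper suggests:

```python
def likelihood_ratio_score(model, background_model, x):
    # Higher score = more in-distribution-like semantic content; the background
    # model cancels low-level population statistics shared by most inputs.
    return model.log_prob(x) - background_model.log_prob(x)
```

In practice one would threshold this score (or rank test inputs by it) to flag OOD samples.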
πΉ INPUT COMPLEXITY AND OUT-OF-DISTRIBUTION DETECTION WITH LIKELIHOOD-BASED GENERATIVE MODELS π
We use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood-ratio, akin to Bayesian model comparison.
πΉ Hierarchical VAEs Know What They Don't Know π π§
HVAE; BIVA: a likelihood-ratio based score for OOD detection, defined to explicitly ensure that data must be in-distribution across all feature levels to be regarded as in-distribution.
-
GAN
πΉ A Unified View of cGANs with and without Classifiers π₯ π
This work analyzes the most popular variants of conditional GANs (ACGAN, ProjGAN, ContraGAN) under a unified, energy-based formulation (ECGAN).
πΉ Why Are Conditional Generative Models Better Than Unconditional Ones? πΆ
we propose self-conditioned diffusion models (SCDM), which are trained conditioned on cluster indices obtained by running k-means on features extracted by a self-supervised pre-trained model.
πΉ Partition-Guided GANs π₯
We break down learning complex high dimensional distributions to simpler sub-tasks, supporting diverse data samples. Our solution relies on designing a partitioner that breaks the space into smaller regions, each having a simpler distribution, and training a different generator for each partition.
πΉ Self-labeled Conditional GANs π₯ π₯
we propose Self-labeled Conditional GANs (slcGANs) that learn to assign labels to the images automatically by incorporating an additional clustering network (infoGAN).
πΉ Instance-Conditioned GAN π₯ π
We partition the data manifold into a mixture of overlapping neighborhoods described by a datapoint and its nearest neighbors, and introduce a model, called instance-conditioned GAN (IC-GAN), which learns the distribution around each datapoint.
πΉ Self-Guided Diffusion Models π₯
By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks.
πΉ A Unified Generative Adversarial Network Training via Self-Labeling and Self-Attention π₯ π
We exploit the assumption that neural network generators can be trained more easily to map nearby latent vectors to data with semantic similarities, than across separate categories. We use generated data samples and their corresponding artificial conditioning labels to train a classifier. The classifier is then used to self-label real data.
-
Pareto
πΉ Pareto Multi-Task Learning π₯ π₯ π₯ π§
we propose a novel Pareto Multi-Task Learning (Pareto MTL) algorithm that generates a set of well-distributed Pareto solutions with different trade-offs among tasks for a given multi-task learning (MTL) problem.
πΉ Efficient Continuous Pareto Exploration in Multi-Task Learning zhihu π₯ π₯ π β β β
πΉ Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control π¦ β
πΉ Pareto Domain Adaptation π
To reach a desirable solution on the target domain, we design a surrogate loss mimicking target classification. To improve target-prediction accuracy to support the mimicking, we propose a target-prediction refining mechanism which exploits domain labels via Bayesβ theorem.
πΉ PARETO POLICY POOL FOR MODEL-BASED OFFLINE REINFORCEMENT LEARNING
πΉ Conflict-Averse Gradient Descent for Multi-task Learning π π₯
CAGrad: minimizes the average loss function, while leveraging the worst local improvement of individual tasks to regularize the algorithm trajectory. CAGrad balances the objectives automatically and still provably converges to a minimum over the average loss.
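A rough two-task sketch of a CAGrad-style update, where a grid search over the simplex weight stands in for the paper's inner solver; `c` is the trade-off radius and all names are illustrative:

```python
import torch

def cagrad_two_task(g1, g2, c=0.5, n_grid=101):
    # g1, g2: flattened per-task gradients; g0 is the average-loss gradient.
    g0 = 0.5 * (g1 + g2)
    phi = (c ** 2) * g0.dot(g0)
    best_obj, best_gw = None, None
    for w in torch.linspace(0.0, 1.0, n_grid):
        gw = w * g1 + (1.0 - w) * g2
        obj = gw.dot(g0) + torch.sqrt(phi) * gw.norm()  # inner objective to minimize
        if best_obj is None or obj < best_obj:
            best_obj, best_gw = obj, gw
    # Final update direction: average gradient plus a bounded correction toward gw.
    return g0 + torch.sqrt(phi) / (best_gw.norm() + 1e-8) * best_gw
```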
πΉ Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization π₯
We show how to detect the sub-graphs in the computational graphs where gradients conflict (impartiality blocks), as well as how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse.
-
BNN; BO;
πΉ Auto-Encoding Variational Bayes π β
πΉ Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift π π₯
we develop an approximate Bayesian inference scheme based on posterior regularisation, wherein unlabelled target data serve as "pseudo-labels" of model confidence that regularise the model's loss on labelled source data.
πΉ Understanding Uncertainty in Bayesian Deep Learning π¦
πΉ Bayesian Optimization Augmented with Actively Elicited Expert Knowledge π₯
we tackle the problem of incorporating expert knowledge into BO (PBNN), with the goal of further accelerating the optimization.
-
HyperNetworks
πΉ HYPERNETWORKS π
πΉ NEURAL ARCHITECTURE SEARCH WITH REINFORCEMENT LEARNING
πΉ META-LEARNING WITH LATENT EMBEDDING OPTIMIZATION π₯ π
LEO: learning a data-dependent latent generative representation of model parameters, and performing gradient-based meta-learning in this low dimensional latent space.
πΉ CONTINUAL LEARNING WITH HYPERNETWORKS π₯
Instead of recalling the input-output relations of all previously seen data, task-conditioned hypernetworks only require rehearsing task-specific weight realizations, which can be maintained in memory using a simple regularizer.
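A minimal sketch of a task-conditioned hypernetwork that emits the weights of a single linear target layer (the regularizer that anchors generated weights for old tasks is omitted); all sizes and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditionedHypernet(nn.Module):
    def __init__(self, n_tasks, emb_dim, in_dim, out_dim):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, emb_dim)                  # learned task embeddings
        self.weight_gen = nn.Linear(emb_dim, in_dim * out_dim + out_dim)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, x, task_id):
        # Generate the target layer's weights and bias from the task embedding.
        params = self.weight_gen(self.task_emb(task_id))
        w = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim:]
        return F.linear(x, w, b)

# Usage: out = TaskConditionedHypernet(5, 16, 10, 2)(torch.randn(4, 10), torch.tensor(2))
```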
πΉ Continual Model-Based Reinforcement Learning with Hypernetworks π
πΉ Hypernetwork-Based Augmentation π₯
We propose an efficient gradient-based search algorithm, called Hypernetwork-Based Augmentation (HBA), which simultaneously learns model parameters and augmentation hyperparameters in a single training.
πΉ Hypernetworks in Meta-Reinforcement Learning π₯
We 1) show that hypernetwork initialization is also a critical factor in meta-RL, and that naive initializations yield poor performance; 2) propose a novel hypernetwork initialization scheme that matches or exceeds the performance of a state-of-the-art approach proposed for supervised settings, as well as being simpler and more general.
πΉ Goal-Conditioned Generators of Deep Policies π₯
Using context commands of the form "generate a policy that achieves a desired expected return," our NN generators combine powerful exploration of parameter space with generalization across commands to iteratively find better and better policies.
πΉ General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States π₯
Here we combine the actor-critic architecture of Parameter-Based Value Functions with the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus helping to improve) any policy represented by a deep neural network (NN).
πΉ Policy Evaluation Networks π₯ π
PVN: We introduced a network that can generalize in policy space, by taking policy fingerprints as inputs. These fingerprints are differentiable policy embeddings obtained by inspecting the policy's behaviour in a set of key states.
πΉ PARAMETER-BASED VALUE FUNCTIONS π₯
We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies.
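A naive sketch of a value function that also consumes policy parameters (real PBVFs embed or compress the parameters rather than concatenating them raw); dimensions and names are assumptions:

```python
import torch
import torch.nn as nn

class ParameterBasedValueFunction(nn.Module):
    def __init__(self, state_dim, n_policy_params, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_policy_params, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, policy_params):
        # Evaluate V(s; theta) for any policy by conditioning on its (flattened) parameters.
        return self.net(torch.cat([state, policy_params], dim=-1))
```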
πΉ LEARNING TO LEARN WITH GENERATIVE MODELS OF NEURAL NETWORK CHECKPOINTS π
Our model is a conditional diffusion transformer that, given an initial input parameter vector and a prompted loss, error, or return, predicts the distribution over parameter updates that achieve the desired metric.
πΉ Hypernetworks for Zero-shot Transfer in Reinforcement Learning πΆ
Our technical approach views each RL algorithm as a mapping from an MDP specification to its near-optimal value function and policy, and seeks to approximate this mapping with a hypernetwork that can generate near-optimal value functions and policies given the parameters of the MDP.
- Gaussian Process, Kernel Method, EM, Conditional Neural Process, Neural Process (DeepMind, ICML 2018) π β
- Weak Duality, Fenchel-Legendre Duality, Convex-Optimization, Convex-Optimization - bilibili,
- Online Convex Optimization, ONLINE LEARNING, Convex Optimization (book),
- Total variation distance,
- Ising model, Gibbs distribution, VAEBM,
- f-GAN, GAN-OP, ODE: GAN,
- Wasserstein Distance, Statistical Aspects of Wasserstein Distances, Optimal Transport and Wasserstein Distance, Metrics for GAN (An empirical study on evaluation metrics of generative adversarial networks), Metrics for GAN zhihu, MMD: Maximum Mean Discrepancy,
- MARKOV-LIPSCHITZ DEEP LEARNING,
- Rainbow π¦ , β
- VC dimension,
- BALD,
OpenAI Spinning Up, OpenAI Blog, OpenAI Baselines, DeepMind, BAIR, Stanford AI Lab,
Lil'Log, Andrej Karpathy blog, The Gradient, RAIL - course - RL, RAIL - cs285, inFERENCe,
UCB: Tuomas Haarnoja, Pieter Abbeel, Sergey Levine, Abhishek Gupta, Coline Devin, YuXuan (Andrew) Liu, Rein Houthooft, Glen Berseth,
UCSD: Xiaolong Wang,
CMU: Benjamin Eysenbach, Ruslan Salakhutdinov,
Stanford: Chelsea Finn, [Tengyu Ma], [Tianhe Yu], [Rui Shu],
NYU: Rob Fergus,
MIT: Bhairav Mehta, Leslie Kaelbling, Joseph J. Lim,
Caltech: Joseph Marino, Yisong Yue Homepage,
DeepMind: David Silver, Yee Whye Teh [Homepage], Alexandre Galashov, Leonard Hasenclever [GS], Siddhant M. Jayakumar, Zhongwen Xu, Markus Wulfmeier [HomePage], Wojciech Zaremba, Aviral Kumar,
Google: Ian Fischer, Danijar Hafner [Homepage], Ofir Nachum, Yinlam Chow, Shixiang Shane Gu, [Mohammad Ghavamzadeh]
Montreal: Anirudh Goyal Homepage,
Toronto: Jimmy Ba; Amir-massoud Farahmand;
Columbia: Yunhao (Robin) Tang,
OpenAI:
THU: Chongjie Zhang [Homepage], Yi Wu, Mingsheng Long [Homepage],
PKU: Zongqing Lu,
NJU: Yang Yu,
TJU: Jianye Hao,
HIT: PR&IS research center,
Salesforce : Alexander Trott,
Flowers Lab (INRIA): CΓ©dric Colas,