RL-Quadcopter

Project: Train a Quadcopter How to Fly

Design an agent that can fly a quadcopter, and then train it using a reinforcement learning algorithm of your choice! Try to apply the techniques you have learnt, but also feel free to come up with innovative ideas and test them.

Quadcopter doing a flip while trying to take off from the ground

Instructions

Note: If you haven't done so already, follow the steps in this repo's README to install ROS, and ensure that the simulator is running and correctly connecting to ROS.

When you are ready to start coding, take a look at the quad_controller_rl/src/ (source) directory to better understand the structure. Here are some of the salient items:

  • src/: Contains all the source code for the project.
    • quad_controller_rl/: This is the root of the Python package you'll be working in.
    • ...
    • tasks/: Define your tasks (environments) in this sub-directory.
      • __init__.py: When you define a new task, you'll have to import it here.
      • base_task.py: Generic base class for all tasks, with documentation.
      • takeoff.py: This is the first task, already defined for you, and set to run by default.
    • ...
    • agents/: Develop your reinforcement learning agents here.
      • __init__.py: When you define a new agent, you'll have to import it here, just like tasks.
      • base_agent.py: Generic base class for all agents, with documentation.
      • policy_search.py: A sample agent has been provided here, and is set to run by default.
    • ...

Tasks

Open up the base class for tasks, BaseTask, defined in tasks/base_task.py:

class BaseTask:
    """Generic base class for reinforcement learning tasks."""

    def __init__(self):
        """Define state and action spaces, initialize other task parameters."""
        pass
    
    def set_agent(self, agent):
        """Set an agent to carry out this task; to be called from update."""
        self.agent = agent
    
    def reset(self):
        """Reset task and return initial condition."""
        raise NotImplementedError
    
    def update(self, timestamp, pose, angular_velocity, linear_acceleration):
        """Process current data, call agent, return action and done flag."""
        raise NotImplementedError            

All tasks must inherit from this class to function properly. You will need to override the reset() and update() methods when defining a task; otherwise a NotImplementedError will be raised. Besides these two, you should define the state (observation) space and the action space for the task in the constructor, __init__(), and initialize any other variables you may need to run the task.
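
For orientation, here is a minimal, hypothetical task skeleton showing the pieces you must provide (the class name, space bounds, and time limit below are illustrative placeholders, not values from the project):

import numpy as np
from gym import spaces
from quad_controller_rl.tasks.base_task import BaseTask

class MyTask(BaseTask):
    """Hypothetical task skeleton; only the structure matters here."""

    def __init__(self):
        # State: position (3) + orientation quaternion (4); bounds are illustrative
        self.observation_space = spaces.Box(
            low=np.array([-25.0, -25.0,  0.0, -1.0, -1.0, -1.0, -1.0]),
            high=np.array([25.0, 25.0, 50.0,  1.0,  1.0,  1.0,  1.0]))
        # Action: force (3) + torque (3); bounds are illustrative
        self.action_space = spaces.Box(low=-25.0 * np.ones(6), high=25.0 * np.ones(6))
        self.max_duration = 5.0  # secs per episode

    def reset(self):
        ...  # must return an initial (Pose, Twist) tuple

    def update(self, timestamp, pose, angular_velocity, linear_acceleration):
        ...  # must return (Wrench or None, done)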

Now compare this with the first concrete task Takeoff, defined in tasks/takeoff.py:

class Takeoff(BaseTask):
    """Simple task where the goal is to lift off the ground and reach a target height."""
    ...

In __init__(), notice how the state and action spaces are defined using OpenAI Gym spaces, like Box. These objects provide a clean and powerful interface for agents to explore. For instance, they can inspect the dimensionality of a space (shape), ask for the limits (high and low), or even sample a bunch of observations using the sample() method, before beginning to interact with the environment. We also set a time limit (max_duration) for each episode here, and the height (target_z) that the quadcopter needs to reach for a successful takeoff.
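
As a quick, hedged illustration of how an agent can interrogate such a space (the 25.0 limit here is illustrative; check Takeoff.__init__() for the actual bounds):

import numpy as np
from gym import spaces

# A Box space analogous to the 6-dimensional action space (force + torque)
action_space = spaces.Box(low=-25.0 * np.ones(6), high=25.0 * np.ones(6))

print(action_space.shape)     # dimensionality: (6,)
print(action_space.low)       # per-dimension lower limits
print(action_space.high)      # per-dimension upper limits
print(action_space.sample())  # a random point inside the box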

The reset() method is meant to give you a chance to reset/initialize any variables you need in order to prepare for the next episode. You do not need to call it yourself; it will be invoked externally, once before each episode, including the very first one. Takeoff doesn't have any episode variables to initialize, but it must return a valid initial condition for the task, which is a tuple consisting of a Pose and a Twist object. These are ROS message types used to convey the pose (position, orientation) and velocity (linear, angular) you want the quadcopter to have at the beginning of an episode. You may choose to supply the same initial values every time, or vary them a little, e.g. Takeoff drops the quadcopter off from a small height with a bit of randomness.

Tip: Slightly randomized initial conditions can help the agent explore the state space faster.
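
A hedged sketch of a reset() along these lines (the drop height and noise scale are illustrative, not necessarily the values used in takeoff.py):

import numpy as np
from geometry_msgs.msg import Pose, Point, Quaternion, Twist, Vector3

def reset(self):
    # Drop the quadcopter off from a small, slightly randomized height,
    # with default orientation and zero initial velocity
    return Pose(
            position=Point(0.0, 0.0, np.random.normal(0.5, 0.1)),
            orientation=Quaternion(0.0, 0.0, 0.0, 1.0),
        ), Twist(
            linear=Vector3(0.0, 0.0, 0.0),
            angular=Vector3(0.0, 0.0, 0.0))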

Finally, the update() method is perhaps the most important. This is where you define the dynamics of the task and engage the agent. It is called periodically by a ROS process (roughly 30 times a second, by default), with current data from the simulation. A number of arguments are available: timestamp (you can use this to check for timeout, or compute velocities), pose (position and orientation of the quadcopter), angular_velocity, and linear_acceleration. You do not have to include all these variables in every task, e.g. Takeoff only uses pose information, which already amounts to a 7-element state vector (3 for position, 4 for the orientation quaternion).

Once you have prepared the state you want to pass on to your agent, you will need to compute the reward, and check whether the episode is complete (e.g. agent crossed the time limit, or reached a certain height). Note that these two things (reward and done) are based on actions that the agent took in the past. When you are writing your own agents, you have to be mindful of this.

Now you can pass in the state, reward and done values to the agent's step() method and expect an action vector back that matches the action space that you have defined, in this case a Box(6,). After checking that the action vector is non-empty, and clamping it to the space limits, you have to convert it into a ROS Wrench message. The first 3 elements of the action vector are interpreted as force in x, y, z directions, and the remaining 3 elements convey the torque to be applied around those axes, respectively.

Return the Wrench object (or None if you don't want to take any action) and the done flag from your update() method (note that when done is True, the Wrench object is ignored, so you can return None instead). This will be passed back to the simulation as a control command, and will affect the quadcopter's pose, orientation, velocity, etc. You will be able to gauge the effect when the update() method is called in the next time step.
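
Putting these steps together, here is a hedged sketch of an update() method in that style (the reward shaping and threshold below are illustrative; see takeoff.py for the actual logic):

import numpy as np
from geometry_msgs.msg import Wrench, Vector3

def update(self, timestamp, pose, angular_velocity, linear_acceleration):
    # 7-element state: position + orientation quaternion (Takeoff only needs pose)
    state = np.array([
        pose.position.x, pose.position.y, pose.position.z,
        pose.orientation.x, pose.orientation.y, pose.orientation.z, pose.orientation.w])

    # Reward and done reflect the effect of the agent's *previous* action
    reward = -min(abs(self.target_z - pose.position.z), 20.0)  # illustrative shaping
    done = pose.position.z >= self.target_z or timestamp > self.max_duration

    # Ask the agent for the next action and convert it into a ROS Wrench command
    action = self.agent.step(state, reward, done)
    if action is None or done:
        return None, done
    action = np.clip(action.flatten(), self.action_space.low, self.action_space.high)
    return Wrench(
            force=Vector3(action[0], action[1], action[2]),
            torque=Vector3(action[3], action[4], action[5])
        ), done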

Agents

Reinforcement learning agents are defined in a similar way. Open up the generic agent class, BaseAgent, defined in agents/base_agent.py, and the sample agent RandomPolicySearch defined in agents/policy_search.py. They are actually even simpler to define - you only need to implement the step() method that is discussed above. It needs to consume state (vector), reward (scalar value) and done (boolean), and produce an action (vector). The state and action vectors must match the respective space indicated by the task. And that's it!

Well, that's just to get things working correctly! The sample agent provided, RandomPolicySearch, uses a very simplistic linear policy: it directly computes the action vector as a dot product of the state vector and a matrix of weights. Then it randomly perturbs the parameters by adding some Gaussian noise, to produce a different policy. Based on the average reward obtained in each episode (the "score"), it keeps track of the best set of parameters found so far and how the score is changing, and accordingly tweaks a scaling factor to widen or tighten the noise.
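
A hedged sketch of that perturb-and-keep-the-best loop (simplified, with illustrative names; policy_search.py organizes it differently):

import numpy as np

class RandomLinearPolicy:
    """Illustrative hill-climbing over a linear policy's weight matrix."""

    def __init__(self, state_size, action_size, action_low, action_high):
        self.w = np.random.normal(size=(state_size, action_size))  # policy parameters
        self.best_w, self.best_score = self.w, -np.inf
        self.noise_scale = 0.1
        self.action_low, self.action_high = action_low, action_high

    def act(self, state):
        # Linear policy: action = state . weights, clipped to the action space
        return np.clip(np.dot(state, self.w), self.action_low, self.action_high)

    def end_of_episode(self, score):
        # Keep the best parameters found so far and adapt the noise scale
        if score > self.best_score:
            self.best_w, self.best_score = self.w, score
            self.noise_scale = max(0.5 * self.noise_scale, 1e-3)  # tighten the search
        else:
            self.noise_scale = min(2.0 * self.noise_scale, 3.2)   # widen the search
        # Perturb the best parameters with Gaussian noise to get the next candidate
        self.w = self.best_w + self.noise_scale * np.random.normal(size=self.best_w.shape)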

%%html
<div style="width: 100%; text-align: center;">
    <h3>Teach a Quadcopter How to Tumble</h3>
    <video poster="images/quadcopter_tumble.png" width="640" controls muted>
        <source src="images/quadcopter_tumble.mp4" type="video/mp4" />
        <p>Video: Quadcopter tumbling, trying to get off the ground</p>
    </video>
</div>


Obviously, this agent performs very poorly on the task. It does manage to move the quadcopter, which is good, but instead of a stable takeoff, it often leads to dizzying cartwheels and somersaults! And that's where you come in - your first task is to design a better agent for this takeoff task. Instead of messing with the sample agent, create a new file in the agents/ directory, say policy_gradients.py, and define your own agent in it. Remember to inherit from the base agent class, e.g.:

class DDPG(BaseAgent):
    ...

You can borrow whatever you need from the sample agent, including ideas on how you might modularize your code (using helper methods like act(), learn(), reset_episode_vars(), etc.).

Note: This setup may look similar to the common OpenAI Gym paradigm, but there is one small yet important difference. Instead of the agent calling a method on the environment (to execute an action and obtain the resulting state, reward and done value), here it is the task that is calling a method on the agent (step()). If you plan to store experience tuples for learning, you will need to cache the last state ($S_{t-1}$) and last action taken ($A_{t-1}$), then in the next time step when you get the new state ($S_t$) and reward ($R_t$), you can store them along with the done flag ($\left\langle S_{t-1}, A_{t-1}, R_t, S_t, \mathrm{done?}\right\rangle$).

When an episode ends, the agent receives one last call to the step() method with done set to True - this is your chance to perform any cleanup/reset/batch-learning (note that no reset method is called on an agent externally). The action returned on this last call is ignored, so you may safely return None. The next call would be the beginning of a new episode.
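
A hedged sketch of that bookkeeping inside an agent's step() method (self.memory, self.learn() and self.act() are placeholders for whatever your algorithm uses):

def step(self, state, reward, done):
    # Save the experience tuple completed by this call: the previous state/action
    # together with the reward and next state we have just been given
    if self.last_state is not None and self.last_action is not None:
        self.memory.add(self.last_state, self.last_action, reward, state, done)

    if done:
        self.learn()                             # e.g. batch update from replay memory
        self.last_state, self.last_action = None, None
        return None                              # action is ignored on this last call

    action = self.act(state)                     # choose an action from the current policy
    self.last_state, self.last_action = state, action
    return action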

One last thing - in order to run your agent, you will have to edit agents/__init__.py and import your agent class in it, e.g.:

from quad_controller_rl.agents.policy_gradients import DDPG

Then, while launching ROS, you will need to specify this class name on the commandline/terminal:

roslaunch quad_controller_rl rl_controller.launch agent:=DDPG

Okay, now the first task is cut out for you - follow the instructions below to implement an agent that learns to take off from the ground. For the remaining tasks, you get to define the tasks as well as the agents! Use the Takeoff task as a guide, and refer to the BaseTask docstrings for the different methods you need to override. Use some debug print statements to understand the flow of control better. And just like creating new agents, new tasks must inherit from BaseTask, they need to be imported into tasks/__init__.py, and specified on the command line when running:

roslaunch quad_controller_rl rl_controller.launch task:=Hover agent:=DDPG

Tip: You typically need to launch ROS and then run the simulator manually. But you can automate that process by either copying/symlinking your simulator to quad_controller_rl/sim/DroneSim (DroneSim must be an executable, or a symlink to one), or by specifying it on the command line, as follows:

roslaunch quad_controller_rl rl_controller.launch task:=Hover agent:=DDPG sim:=<full path>

Task 1: Takeoff

Implement takeoff agent

Train an agent to successfully lift off from the ground and reach a certain threshold height. Develop your agent in a file under agents/ as described above, implementing at least the step() method, and any other supporting methods that might be necessary. You may use any reinforcement learning algorithm of your choice (note that the action space consists of continuous variables, so that may somewhat limit your choices).

The task has already been defined for you (in tasks/takeoff.py), and you should not edit it. The default target height (Z-axis value) to reach is 10 units above the ground, and the reward function is essentially the negative absolute distance from that set point (up to some threshold). An episode ends when the quadcopter reaches the target height (x and y values, orientation, velocity, etc. are ignored), or when the maximum duration (5 seconds) is exceeded. See Takeoff.update() for more details, including the episode bonus/penalty.

As you develop your agent, it's important to keep an eye on how it's performing. Build in a mechanism to log/save the total reward obtained in each episode to a file. Once you are satisfied with your agent's performance, return to this notebook to plot episode rewards, and answer the questions below.
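
One hedged way to do this is a tiny CSV logger that your agent calls at the end of each episode (the file path and column names here simply match the plotting cells below; everything else is illustrative):

import csv
import os

class EpisodeLogger:
    """Appends one (episode, total_reward) row per episode to a CSV file."""

    def __init__(self, path="./outs/takeoff.csv"):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        self.path, self.episode = path, 0
        with open(self.path, "w", newline="") as f:
            csv.writer(f).writerow(["episode", "total_reward"])

    def log(self, total_reward):
        self.episode += 1
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([self.episode, total_reward])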

Plot episode rewards

Plot the total rewards obtained in each episode, either from a single run, or averaged over multiple runs.

# TODO: Read and plot episode rewards
import pandas as pd

csv_file = "./outs/takeoff.csv"
df_stats = pd.read_csv(csv_file)
df_stats[['total_reward']].plot(title="Episode Rewards")
[Plot: total reward per episode for the Takeoff task]

Q: What algorithm did you use? Briefly discuss why you chose it for this task.

A: I simply compared the z-component of the current position with the z-component of the target position and used the error as the reward function.

Q: Using the episode rewards plot, discuss how the agent learned over time.

  • Was it an easy task to learn or hard?
  • Was there a gradual learning curve, or an aha moment?
  • How good was the final performance of the agent? (e.g. mean rewards over the last 10 episodes)

A:

(1) The task was relatively easy to learn; (2) the learning curve shows a steep, sudden rise rather than a gradual climb; (3) the final performance was -1041.42.

Task 2: Hover

Implement hover agent

Now, your agent must take off and hover at the specified set point (say, 10 units above the ground). Same as before, you will need to create an agent and implement the step() method (and any other supporting methods) to apply your reinforcement learning algorithm. You may use the same agent as before, if you think your implementation is robust, and try to train it on the new task. But then remember to store your previous model weights/parameters, in case your results were worth keeping.

States and rewards

Even if you can use the same agent, you will need to create a new task, which will allow you to change the state representation you pass in, how you verify when the episode has ended (the quadcopter needs to hover for at least a few seconds), etc. In this hover task, you may want to pass in the target height as part of the state (otherwise how would the agent know where you want it to go?). You may also need to revisit how rewards are computed. You can do all this in a new task file, e.g. tasks/hover.py (remember to follow the steps outlined above to create a new task):

class Hover(BaseTask):
    ...
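
For instance, one hedged way to expose the target height (and an estimated velocity) to the agent is to append them to the pose-based state inside Hover.update(); the attribute names below (self.last_position, self.target_z, etc.) are illustrative:

# assumes numpy is imported as np, and that this snippet runs inside Hover.update()
position = np.array([pose.position.x, pose.position.y, pose.position.z])
orientation = np.array([pose.orientation.x, pose.orientation.y,
                        pose.orientation.z, pose.orientation.w])

# Estimate linear velocity from successive poses, since it is not passed to update() directly
if self.last_position is None:
    velocity = np.zeros(3)
else:
    velocity = (position - self.last_position) / max(timestamp - self.last_timestamp, 1e-3)
self.last_position, self.last_timestamp = position, timestamp

# State: position (3) + orientation (4) + velocity (3) + target height (1)
state = np.concatenate([position, orientation, velocity, [self.target_z]])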

Q: Did you change the state representation or reward function? If so, please explain below what worked best for you, and why you chose that scheme. Include short code snippet(s) if needed.

A:

I modified the state representation to add velocity components, in order to achieve a more stable hover (generally speaking, the smaller the velocity, the more stable the hover).

I also modified the reward function to use a weighted sum of the position, orientation and velocity errors, so as to obtain a stable hovering state; the code is shown below:

# Weighted sum of position, orientation and velocity errors, computed in Hover.update()
# (state[0:3] = position, state[3:7] = orientation quaternion, state[7:10] = velocity)
error_position = np.linalg.norm(self.target_position - state[0:3])
error_orientation = np.linalg.norm(self.target_orientation - state[3:7])
error_velocity = np.linalg.norm(self.target_velocity - state[7:10])
reward = -(self.position_weight * error_position +
           self.orientation_weight * error_orientation +
           self.velocity_weight * error_velocity)

Implementation notes

Q: Discuss your implementation below briefly, using the following questions as a guide:

  • What algorithm(s) did you try? What worked best for you?
  • What was your final choice of hyperparameters (such as $\alpha$, $\gamma$, $\epsilon$, etc.)?
  • What neural network architecture did you use (if any)? Specify layers, sizes, activation functions, etc.

A:

(1) The reward function is based on the position, orientation and velocity errors; if the position error exceeds max_error_position, a penalty (negative bonus) is given, and if the episode lasts longer than max_duration, a positive bonus is given.

(2) The parameters used, with their default values, are as follows:

# Task-specific parameters
self.max_duration = 5.0  # secs
self.max_error_position = 8.0
self.target_position = np.array([0.0, 0.0, 10.0])  # ideally hovers at 10 units
self.position_weight = 0.6
self.target_orientation = np.array([0.0, 0.0, 0.0, 1.0])
self.orientation_weight = 0.0
self.target_velocity = np.array([0.0, 0.0, 0.0])  # ideally zero velocity
self.velocity_weight = 0.4

(3) No neural network was used.

Plot episode rewards

As before, plot the episode rewards, either from a single run, or averaged over multiple runs. Comment on any changes in learning behavior.

# TODO: Read and plot episode rewards
import pandas as pd

csv_file = "./outs/hover.csv"
df_stats = pd.read_csv(csv_file)
df_stats[['total_reward']].plot(title="Episode Rewards")
[Plot: total reward per episode for the Hover task]

Task 3: Landing

What goes up, must come down! But safely!

Implement landing agent

This time, you will need to edit the starting state of the quadcopter to place it at a position above the ground (at least 10 units). And change the reward function to make the agent learn to settle down gently. Again, create a new task for this (e.g. Landing in tasks/landing.py), and implement the changes. Note that you will have to modify the reset() method to return a position in the air, perhaps with some upward velocity to mimic a recent takeoff.
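
A hedged sketch of such a reset() (the 10-unit start height follows the instruction above; the small upward velocity and noise are illustrative):

import numpy as np
from geometry_msgs.msg import Pose, Point, Quaternion, Twist, Vector3

def reset(self):
    # Start well above the ground, with a small upward velocity, as if just after takeoff
    return Pose(
            position=Point(0.0, 0.0, 10.0 + abs(np.random.normal(0.0, 0.25))),
            orientation=Quaternion(0.0, 0.0, 0.0, 1.0),
        ), Twist(
            linear=Vector3(0.0, 0.0, 1.0),   # slight upward drift (illustrative)
            angular=Vector3(0.0, 0.0, 0.0))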

Once you're satisfied with your task definition, create another agent or repurpose an existing one to learn this task. This might be a good chance to try out a different approach or algorithm.

Initial condition, states and rewards

Q: How did you change the initial condition (starting state), state representation and/or reward function? Please explain below what worked best for you, and why you chose that scheme. Were you able to build in a reward mechanism for landing gently?

A:

The initial state is set to the takeoff target position, with the default orientation and zero initial velocity. The state representation includes velocity components in addition to position and orientation. The reward function is a weighted sum of the position, orientation and velocity errors. Velocity is critical for landing: to guarantee a gentle touchdown it has to be taken into account, and using errors as the reward yields a smooth reward signal that is better suited to policy-gradient-style optimization. Judging from the current simulation results the landing is reasonably gentle, although there is still room for improvement.

Implementation notes

Q: Discuss your implementation below briefly, using the same questions as before to guide you.

A:

(1) The reward function is based on the position, orientation and velocity errors; a positive bonus is given only once the z-component of the position reaches 0, i.e. the quadcopter has touched down, and everything else incurs a penalty (e.g. excessive velocity, exceeding the time limit).

(2) The parameters used, with their default values, are as follows:

# Task-specific parameters
self.max_duration = 5.0  # secs
self.start_z = 10.0  # starting height (z position)
self.target_position = np.array([0.0, 0.0, 0.0])
self.position_weight = 0.7
self.target_orientation = np.array([0.0, 0.0, 0.0, 1.0])
self.orientation_weight = 0.0
self.target_velocity = np.array([0.0, 0.0, 0.0])  # ideally zero velocity
self.velocity_weight = 0.3

(3) No neural network was used.

Plot episode rewards

As before, plot the episode rewards, either from a single run, or averaged over multiple runs. This task is a little different from the previous ones, since you're starting in the air. Was it harder to learn? Why/why not?

# TODO: Read and plot episode rewards
import pandas as pd

csv_file = "./outs/landing.csv"
df_stats = pd.read_csv(csv_file)
df_stats[['total_reward']].plot(title="Episode Rewards")
[Plot: total reward per episode for the Landing task]

Task 4: Combined

In order to design a complete flying system, you will need to incorporate all these basic behaviors into a single agent.

Setup end-to-end task

The end-to-end task we are considering here is simply to takeoff, hover in-place for some duration, and then land. Time to create another task! But think about how you might go about it. Should it be one meta-task that activates appropriate sub-tasks, one at a time? Or would a single combined task with something like waypoints be easier to implement? There is no right or wrong way here - experiment and find out what works best (and then come back to answer the following).

Q: What setup did you ultimately go with for this combined task? Explain briefly.

A:

This task only implements the scheduling logic needed to drive the takeoff, hover and landing behaviors: based on the current state, it decides which sub-task is active and then delegates execution to that sub-task.
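
A hedged sketch of that kind of dispatch logic inside the combined task's update() (the phase names, thresholds and the self.subtasks mapping are illustrative):

def update(self, timestamp, pose, angular_velocity, linear_acceleration):
    # Advance the phase based on the current state, then delegate to the active sub-task
    if self.phase == 'takeoff' and pose.position.z >= self.target_z:
        self.phase, self.hover_start = 'hover', timestamp
    elif self.phase == 'hover' and timestamp - self.hover_start >= self.hover_duration:
        self.phase = 'landing'

    # e.g. self.subtasks = {'takeoff': ..., 'hover': ..., 'landing': ...}, each with its own agent
    return self.subtasks[self.phase].update(timestamp, pose, angular_velocity, linear_acceleration)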

Implement combined agent

Using your end-to-end task, implement the combined agent so that it learns to takeoff (at least 10 units above ground), hover (again, at least 10 units above ground), and gently come back to ground level.

Combination scheme and implementation notes

Just like the task itself, it's up to you whether you want to train three separate (sub-)agents, or a single agent for the complete end-to-end task.

Q: What did you end up doing? What challenges did you face, and how did you resolve them? Discuss any other implementation notes below.

A:

After some experimentation I found that a single agent did not perform very well, so I used three different agents corresponding to the three sub-tasks. I trained the three agents' neural networks separately in advance and saved the network parameters; in the Combined task, different conditions trigger different sub-tasks, each handled by its own agent.

Plot episode rewards

As before, plot the episode rewards, either from a single run, or averaged over multiple runs.

# TODO: Read and plot episode rewards

Reflections

Q: Briefly summarize your experience working on this project. You can use the following prompts for ideas.

  • What was the hardest part of the project? (e.g. getting started, running ROS, plotting, specific task, etc.)
  • How did you approach each task and choose an appropriate algorithm/implementation for it?
  • Did you find anything interesting in how the quadcopter or your agent behaved?

A:

The hardest part of this project was getting ROS running. Since I was working on a Mac, I had to use a VirtualBox VM, which brought a series of environment-setup issues (e.g. the simulator failing to connect, setting up shared folders). I initially spent a lot of time training a model for the takeoff task; handling hover and landing afterwards was comparatively easier. The task implementations did not use a neural network; due to time constraints, I hope to continue improving them when I get the chance.