---
title: Multi-Armed Bandits
date: 2020-09-28 18:33
tags: ":armed-bandits:example:reinforcement-learning:action-value:"
type: note
---

# Multi-Armed Bandits

- The multi-armed bandit is a classic reinforcement-learning problem that is widely used as an introductory example:
  - The problem: at each time step t, an algorithm has to choose one arm, and the arm returns a reward for that selection. The numerical reward is drawn from a stationary probability distribution that depends on the action the algorithm selected.
  - The objective is to maximize the expected total reward over some time period.
- Formulation (the K-armed variation), with a minimal environment sketch below:
  - K-armed: there are K possible arms (actions).
  - The action selected on time step t is denoted $A_{t}$, and the corresponding reward $R_{t}$.
  - $q_{*}(a) = \mathbb{E}[R_{t} \mid A_{t} = a]$ is the true value of action a; always selecting $\arg\max_{a} q_{*}(a)$ would be the optimal solution.
  - The estimated value of action a at time step t is $Q_{t}(a)$. We would like $Q_{t}(a)$ to be close to $q_{*}(a)$.
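
As a concrete illustration of the setup above, here is a minimal environment sketch. The Gaussian choices for $q_{*}(a)$ and the rewards follow the common 10-armed-testbed convention and are an assumption here, as are the names (`BanditEnv`, `pull`).

```python
import numpy as np

# Hypothetical sketch of a stationary K-armed bandit (names are
# illustrative, not from the note). Each arm's true value q*(a) is
# fixed; pulling an arm returns a noisy reward centered on q*(a).
class BanditEnv:
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values q*(a), drawn from N(0, 1) -- an assumed
        # convention borrowed from the 10-armed testbed.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, a):
        # Reward R_t ~ N(q*(a), 1): a stationary distribution that
        # depends only on the selected action a.
        return self.rng.normal(self.q_star[a], 1.0)
```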
- Exploration and exploitation (a sketch of ε-greedy selection follows this list):
  - Actions split into greedy and non-greedy actions.
  - Exploration: selecting a non-greedy action, i.e. one whose estimated value (reward) is not the highest, to improve the estimates of the other arms.
  - Exploitation: selecting the greedy action, i.e. the one with the highest estimated value (reward).
  - The algorithm has to balance exploration and exploitation.
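
The note does not name a specific balancing strategy, so the sketch below assumes ε-greedy selection, one common choice: with probability ε the algorithm explores a uniformly random arm, otherwise it exploits the current greedy arm.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Hypothetical helper: choose an arm given estimates Q_t(a)."""
    if rng.random() < epsilon:
        # Exploration: a uniformly random (possibly non-greedy) arm.
        return int(rng.integers(len(Q)))
    # Exploitation: the arm with the highest estimated value Q_t(a).
    return int(np.argmax(Q))
```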
- First model, the sample average: estimate an action's value as the average of the rewards received when it was selected, $Q_{t}(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}$.
- Incremental variation: the same average computed online as $Q_{n+1} = Q_{n} + \frac{1}{n}\bigl(R_{n} - Q_{n}\bigr)$, so only the current estimate and a count need to be stored (see the sketch below).
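
Putting the pieces together, a minimal sketch of the incremental sample-average agent, reusing the hypothetical `BanditEnv` and `epsilon_greedy` from the sketches above; the constants (k, number of steps, ε) are arbitrary.

```python
import numpy as np

k, steps, epsilon = 10, 1000, 0.1
rng = np.random.default_rng(1)
env = BanditEnv(k=k, seed=1)

Q = np.zeros(k)  # estimates Q_t(a)
N = np.zeros(k)  # how often each arm has been pulled

for t in range(steps):
    a = epsilon_greedy(Q, epsilon, rng)
    r = env.pull(a)
    N[a] += 1
    # Incremental update: Q_{n+1} = Q_n + (1/n)(R_n - Q_n),
    # equivalent to the running sample average of rewards for arm a.
    Q[a] += (r - Q[a]) / N[a]

# Q now approximates env.q_star for frequently pulled arms.
```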