A specialized optimization toolkit for the BlackBox Challenge competition.
The problem, in short, is to teach an agent to play a game with unknown rules. The game consists of a number of steps; at each step the agent chooses one of the available actions (a small integer) while seeing only an undescribed state (a bunch of floats), and gets a score update in return.
Such a formulation covers a wide subset of reinforcement learning problems, allowing the toolkit to be potentially used for various optimization tasks.
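To make the setting concrete, the whole interaction can be pictured roughly like the loop below. Everything here (the env object, its get_state/step methods, the bot's act method) is a made-up illustration of the protocol, not the competition's actual API:

def run_episode(bot, env, n_steps):
    # Illustrative game loop: the agent sees only a raw state vector,
    # picks a small-integer action, and receives a score update in return.
    score = 0.0
    for _ in range(n_steps):
        state = env.get_state()    # a bunch of undescribed floats
        action = bot.act(state)    # one of the available actions, e.g. 0..3
        score += env.step(action)  # opaque score change decided by the hidden rules
    return score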
Just a short excerpt from a few pages long list of notes and todos:
- In its current form, the toolkit offers just a simplistic, evolutionary approach: all of the coded trainers treat the black box opaquely, not even looking at intermediate level scores.
- The optimization process still requires a human to choose the results for further extension and refinement; and humans are not that easy to come by ;-)
- Parts of the toolkit are over-optimized; for instance, trainers that could benefit from statically compiled code are quite imaginable, but none of the implemented ones really needs it.
- The bot interface should be extended to allow more sophisticated trainers.
Although the framework is a simple Monte Carlo simulator at its heart, it has some nice attributes:
- Simple interface between bots and trainers, each trainer can train every bot.
- C-speed bots, trainers, collectors, and processors; writing a new bot only requires describing its parameters and coding a single function (see the illustrative sketch after this list).
- Extensive training infrastructure: seed trainers with previous results, move parts of parameter sets between bots, use different parameter sets for parts of levels, or emphasize components of the state.
- Most of the training configuration is done using probability distributions, rather than fixed constants, giving a lot of flexibility.
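As a hypothetical illustration of how little a new bot needs, it boils down to a parameter description plus one action-choosing function. The names, shapes, and plain-Python form below are assumptions made for this sketch; the real bots live in Cython bot_*.pyx modules:

import numpy as np

N_ACTIONS = 4    # all current bots assume 4 actions
N_FEATURES = 36  # illustrative state size, not a toolkit constant

def make_params(rng):
    # one coefficient vector per action, drawn from a configurable distribution
    return rng.normal(scale=0.1, size=(N_ACTIONS, N_FEATURES))

def act(params, state):
    # score every action linearly and pick the best one
    return int(np.argmax(params @ state))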
Requires Python >= 3.4, NumPy >= 1.11, and Cython >= 0.24. Currently the only way to use the toolkit is to clone the source from GitHub and get the game interface and levels from the competition's site.
Copy the levels folder, interface.pxd (from the fast_bot folder), and the proper interface-X.so for your environment from the package into the folder that this file is in.
Alternatively, code an interface for your own problem according to the specification! Both a Python and a C API have to be provided, but there are no requirements on the levels' content or format, other than that the interface can use them.
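For a taste of what the Python side of such an interface might look like, here is a toy stand-in; every name in it is an assumption made for illustration (the authoritative specification is the one from the competition's site), and the C API counterpart is omitted:

import numpy as np

class ToyInterface:
    # A "game" whose hidden rule rewards action 0 on negative states
    # and action 1 on positive ones; levels are ignored entirely.
    def load_level(self, path, verbose=False):
        self._rng = np.random.default_rng(0)
        self._steps_left = 1000
        self._state = self._rng.normal(size=36)
        self._score = 0.0

    def get_state(self):
        return self._state

    def do_action(self, action):
        self._score += -self._state[0] if action == 0 else self._state[0]
        self._state = self._rng.normal(size=36)
        self._steps_left -= 1
        return self._steps_left > 0   # False once the level is over

    def get_score(self):
        return self._score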
The file and directory structure should look as follows:
blackbox
├── data (records gathered by collectors)
├── levels (should hold train/test_level.data)
├── packs (self-contained bots for sharing)
├── params (stored parameters, NumPy's format)
├── bot_*.pyx (bot implementations)
├── bots.pyx (add new bots here)
├── trainer_*.pyx (trainer implementations)
├── trainers.pyx (add new trainers here)
├── ...
└── interface.[pxd,so] (the game interface should be here)
The toolkit consists of six command-line utilities. To see a full list of options for any of them, invoke it with --help. Some expanded descriptions and more examples can be found within their module docstrings.
train.py takes a bot name, a trainer name, and a configuration for it (plus a ton of optional parameters). For example:
./train.py linear local 100 1
starts a random local search for a set of coefficients for a linear regression bot.
play.py evaluates a bot on selected levels and prints out the scores. For instance:
./play.py linear 0
would play the bot with the coefficients found by the training run above on all available levels.
view.py lets you see a set of coefficients and their history. To review the coefficients generated above, you would type:
./view.py linear 0
collect.py plays a bot on a level and stores some data, such as successive states, actions taken, or intermediate rewards:
./collect.py srs --bot linear 0
process.py analyzes data sets gathered by a collector to display statistics or preprocess them for a trainer. For example:
./process.py stats srs 0
The sixth utility creates a standalone bot package (one that still requires the game interface and levels).
Bots are the agents choosing the actions.
All current bot implementations assume 4 actions; some can only play a particular number of steps on each act() call.
A simple regression bot, optimized.
Bots that try to simulate an entry, or a few, of the hidden state, and evaluate actions using the visible state augmented with those beliefs.
The present bots simply update the beliefs linearly, based on their previous values and the visible state, and use linear regression for choosing actions.
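Schematically, with made-up parameter names and a single belief entry, that amounts to something like:

import numpy as np

def act_with_belief(params, state, belief):
    # linear belief update from the previous belief and the visible state
    belief = params["belief_w"] @ state + params["belief_u"] * belief
    # linear regression over the state augmented with the belief
    features = np.concatenate([state, [belief]])
    action = int(np.argmax(params["action_w"] @ features))
    return action, belief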
Linear regression using the current and the last state.
Approximates the action value as a linear function of the state components and their derivatives (with respect to level time); in practice, backward finite differences stand in for the derivatives.
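In code terms, the scoring is roughly (names are illustrative):

import numpy as np

def action_values(coeffs, state, prev_state):
    # backward difference stands in for the derivative with respect to level time
    features = np.concatenate([state, state - prev_state])
    return coeffs @ features  # one value per action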
Adds a coefficient for each pair of state components.
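Roughly, on top of the linear terms it scores quadratic features, one per pair of components:

import numpy as np

def pairwise_values(pair_coeffs, state):
    # every product of two state components gets its own coefficient
    pairs = np.outer(state, state)[np.triu_indices(state.size)]
    return pair_coeffs @ pairs  # one value per action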
Bots that can use different sets of coefficients at different parts of the level. The "parts" may be phases (first half, second half), congruences (even step, odd step) or some other temporal patterns.
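For instance, the coefficient set could be selected by step index in one of these illustrative ways:

def by_congruence(coeff_sets, step):
    # even step, odd step, and similar temporal patterns
    return coeff_sets[step % len(coeff_sets)]

def by_phase(coeff_sets, step, level_length):
    # first half, second half, and so on
    return coeff_sets[step * len(coeff_sets) // level_length]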
Trainers are the problem solvers, or optimization algorithms.
Randomly changes a number of parameter entries at each step. Can either redraw values anew or make slight adjustments, according to a given variation scale.
Almost a random local search, but sometimes accepts score drops. Useful when you feel you are stuck in a local maximum.
Instead of taking a single variation scale, these variants are configured with a high and low value and gradually diminish parameter variations during the course of a training session.
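A stripped-down sketch of this family of trainers might look like the function below; the evaluate callback, the flat parameter array, and the greedy acceptance rule are all simplifying assumptions:

import numpy as np

def local_search(evaluate, params, steps, n_changes, scale_high, scale_low, rng):
    best_score = evaluate(params)
    for step in range(steps):
        # gradually diminish the variation scale over the session
        scale = scale_high + (scale_low - scale_high) * step / max(steps - 1, 1)
        candidate = params.copy()
        idx = rng.choice(params.size, size=n_changes, replace=False)
        candidate.flat[idx] += rng.normal(scale=scale, size=n_changes)
        score = evaluate(candidate)
        if score >= best_score:  # an annealing-style variant would sometimes accept drops
            params, best_score = candidate, score
    return params, best_score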
Chooses the best from the seeds provided.
Combines several parameter sets into a single multi-parameter set (for an "_m" bot).
Tries all phase assignments printing their evaluations.
Collectors run a bot through a level (possibly moving back in time by using checkpoints), observe various things such as states, rewards, or optimal paths, and save them for later processing.
Checks possible immediate rewards after each state.
Stores states encountered, scores seen, and actions done during a playthrough.
Finds all states that could have come after each state.
Processors are meant to analyze the data gathered by collectors. Such a split of responsibilities lets you quickly set up new investigations.
How do the first few states look? How do they depend on the first few actions?
How does the state look on average? What are typical rewards for actions?
Are the state components independent from each other?
Is the level composed of smaller sublevels?
What was going on during that playthrough?