
feature(ek): add pooltool env and related configs #227

Merged (70 commits) on Jul 4, 2024

Conversation

ekiefl
Contributor

@ekiefl ekiefl commented May 27, 2024

Overview

Hello,

This is a draft PR for integrating pooltool into LightZero.

Pooltool is a billiards simulation environment that offers unique opportunities for research in the RL space, with turn-based gameplay, continuous action spaces, and long-term planning requirements. Billiards is a game with many variants played around the world, so I think it could be a fun and satisfying learning environment for the RL community to adopt.

For background on this idea, please see this LightZero discussion thread: #182. Since that thread was last active, pooltool has matured into a project with proper documentation and a stable API, and it is currently under review at pyOpenSci.

Features

Here is the current status of the project so far.

A new pooltool subpackage has been added to zoo/. Currently one type of billiards game has been included: sum to three. Code that is non-specific to sum to three has been placed outside the sum_to_three subpackage and can be reused for implementing other games in the future.

Two observation modes have been implemented: continuous coordinate based observations, and image-based observations rendered with pygame.

To see how fast observations can be simulated, see zoo/pooltool/sum_to_three/envs/profile/speed.py

To see what the image-based observations look like, run the example code in the module docstring in zoo/pooltool/image_representation.py.
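
As a rough sketch of the continuous observation mode described above: the real layout lives in the env code, but assuming (hypothetically) that the 4 dimensions are the (x, y) table coordinates of the cue ball and the object ball, it could be assembled like this:

```python
# Hypothetical sketch of the continuous (vector) observation mode.
# The actual composition is defined in the env code; the layout below
# (cue-ball x/y followed by object-ball x/y) is an assumption.
def vector_observation(cue_ball_xy, object_ball_xy):
    x1, y1 = cue_ball_xy
    x2, y2 = object_ball_xy
    return [float(x1), float(y1), float(x2), float(y2)]

obs = vector_observation((0.5, 1.0), (0.25, 0.75))
assert len(obs) == 4
```

The image-based mode instead renders the table state to a multi-channel array with pygame, trading compactness for spatial structure.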

Checklist

There are still some things to be done:

  • play_with_bot_mode. Currently there is no bot that plays the game. I could try to create a rule-based bot that plays sum to three reasonably well, or we could create an MCTS bot. Either way, playing against a bot seems like a must-have feature.
  • Multiprocess mode (env_manager=dict(type="subprocess")) fails when running image mode. I think this is because the PygameRenderer object in SumToThreeSimulator is not pickleable. To see this for yourself, run python zoo/pooltool/sum_to_three/config/sum_to_three_image_config.py. Everything runs if you set env_manager=dict(type="base").
  • Prove that sum to three is learnable. I have done some preliminary experiments that I shared in the discussion thread, but it would be nice to determine the settings required for an agent to learn perfect or almost perfect play. I do not have the computing resources for training.
  • Convert pooltool/README.md into a proper documentation resource.
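
One plausible workaround for the pickling failure in the multiprocess item above (a sketch under assumptions, not the actual fix): keep the live `PygameRenderer` out of the pickled state and re-create it lazily on first use in the child process. `_make_renderer` below is a hypothetical stand-in for the real renderer setup.

```python
import pickle

class LazyRendererMixin:
    """Sketch: exclude the unpicklable renderer from the pickled state
    so subprocess env managers can serialize the env."""

    def __init__(self):
        self._renderer = None

    def _make_renderer(self):
        return object()  # stand-in for PygameRenderer(...)

    @property
    def renderer(self):
        if self._renderer is None:  # lazily (re)created after unpickling
            self._renderer = self._make_renderer()
        return self._renderer

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_renderer"] = None  # never ship the live renderer
        return state

env = LazyRendererMixin()
_ = env.renderer                       # renderer exists in the parent
clone = pickle.loads(pickle.dumps(env))
assert clone._renderer is None         # rebuilt on demand in the child
```

Whether this is sufficient depends on what else `SumToThreeSimulator` holds that pygame makes unpicklable.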

What's next

Before investing any more effort, I am looking for your feedback. Is this still interesting to you guys? What minimum requirements should we establish before merging?

@puyuan1996 puyuan1996 added the environment New or improved environment label May 28, 2024
@puyuan1996
Collaborator

puyuan1996 commented May 28, 2024

Before investing any more effort, I am looking for your feedback. Is this still interesting to you guys? What minimum requirements should we establish before merging?

Hello,

Yes, as mentioned in the link, we still have a strong interest in integrating pooltool into LightZero and greatly appreciate you submitting this PR. Before the official merge, we have the following minimum requirements:

  • The provided configuration file (config) must be runnable and error-free, capable of learning near-optimal behavior.
  • Comments must be clear and properly formatted.
  • Core code should maintain readability and extensibility.

As you mentioned, due to potential limitations in your computing resources, the first item can be debugged by our colleagues. If there are any specific questions related to the game, we will consult you. Regarding the bot part, it can be submitted in a separate PR post-merge. For the env_manager issue, we will default to using the base version, and for the subprocess issue, we will confirm and debug it.

So, currently, what we need from you is to address the second and third points mentioned above. I have initially reviewed your PR, and it seems to require only minor modifications. We will review other necessary details in the coming weeks.

Thank you once again for your contributions to LightZero and the RL community!

@puyuan1996 puyuan1996 added the config New or improved configuration label May 28, 2024
@PaParaZz1
Member

Please modify the title of this pull request to follow our established format.

@ekiefl ekiefl changed the title Integrating pooltool feature(ek): add pooltool env May 30, 2024
@puyuan1996 puyuan1996 marked this pull request as ready for review June 23, 2024 08:25

@puyuan1996 puyuan1996 left a comment


Once these suggestions are addressed, they can be merged into the main branch. Thank you for your patience and dedication.

.gitignore (review thread resolved)
zoo/pooltool/README.md (review thread resolved, outdated)

## Results

The results end up in `./data_pooltool_ctree/`.
Collaborator

After we successfully complete the debugging process, we will update the corresponding description here.

Contributor Author

Great. I updated the README btw: e6138a7

zoo/pooltool/sum_to_three/reward.py (review thread resolved, outdated)
@@ -0,0 +1,55 @@
from easydict import EasyDict
Collaborator

  • Hello, we have added a configuration for a model-free SAC algorithm and validated it locally. The learning curves for both collect and eval show that the algorithm can converge quickly, indicating that the environment setup is correct. Currently, the results after adjusting SEZ-related configurations are similar to before, and we will make further improvements in the future. Next, we will focus mainly on the representation of continuous action spaces and the analysis of learning dynamics.
  • collect and eval learning curves (plot images attached)

Collaborator

Hello, thank you for your patience. I noticed that in our previous settings, model_update_ratio was left at the default value of 0.1, which resulted in too few updates for a given number of environment steps. Additionally, since replay_buffer_size is set to 1e6 and reanalyze_ratio is 0, many of the policy targets in the buffer are close to random, leading to poor performance. After adjusting model_update_ratio from 0.1 to 1, the current experimental results are as follows (latest commit: 5bdcf48):

For image-obs (reanalyze_ratio = 0.):

  • Data collection and evaluation curves (plot images attached)

For vector-obs (reanalyze_ratio = 0.25):

  • Data collection and evaluation curves (plot images attached)

It can be observed that image-obs (reanalyze_ratio = 0.) is already able to achieve performance similar to SAC, but the performance of vector-obs is still similar to your previous run results. We speculate that this might be because the 4-dimensional vector in vector-obs provides less information compared to the (5, 100, 50) setting in image-obs. Therefore, my question is whether the 4-dimensional vector in vector-obs already provides all the information necessary for optimal decision-making? Currently, experiments with the image-obs setting (reanalyze_ratio = 0.25 and variants with reduced replay_buffer_size) are still ongoing, and we will update the results once the experiments are completed.
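
For reference, the knobs discussed above could be collected in a config fragment like this (a plain-dict sketch; the real LightZero configs use EasyDict, and the exact key paths may differ):

```python
# Sketch of the tuning discussed above; key names are illustrative,
# not the exact LightZero config paths.
policy_tuning = dict(
    model_update_ratio=1.0,       # raised from the 0.1 default
    replay_buffer_size=int(1e6),
    reanalyze_ratio=0.25,         # 0. left near-random policy targets in the buffer
)
assert policy_tuning["model_update_ratio"] == 1.0
```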

Collaborator

Hello, I have added some experimental results for image-obs (reanalyze_ratio=0.25). We ran three variants: condition-sigma rbs1e6, condition-sigma rbs1e5, and fixed-sigma rbs1e6. These variants handle the policy's variance in different ways: sigma learned as a function of the input (condition-sigma) versus a fixed sigma=0.3 (fixed-sigma). We also varied the replay buffer size, denoted rbs-x. The results show that the performance of these variants is consistent with expectations and significantly better than the results with reanalyze_ratio=0.

  • collect and eval learning curves (plot images attached)
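
The two variance treatments above can be sketched as follows (toy stand-in functions, not the actual policy head; `w` and `b` stand in for learned parameters):

```python
import math

def fixed_sigma(_features):
    """fixed-sigma variant: the Gaussian policy's sigma is pinned at 0.3."""
    return 0.3

def condition_sigma(feature, w=0.5, b=-1.0):
    """condition-sigma variant: sigma is a learned function of the input;
    exp keeps the output positive."""
    return math.exp(w * feature + b)

assert fixed_sigma(None) == 0.3
assert condition_sigma(2.0) == 1.0  # exp(0.5*2 - 1) = exp(0) = 1
```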

@puyuan1996 puyuan1996 changed the title feature(ek): add pooltool env feature(ek): add pooltool env and related configs Jun 24, 2024
@puyuan1996
Collaborator

Hello, thanks to your efforts, the environment, tests, and documentation of pooltool have met the standards of LightZero. The current experimental results are also similar to those of the model-free RL algorithm SAC. Therefore, we have decided to merge this PR. Any further performance adjustments will be communicated here or through a new issue. Thank you for your contribution.

@puyuan1996 puyuan1996 merged commit 39dfa3c into opendilab:main Jul 4, 2024
@ekiefl
Contributor Author

ekiefl commented Jul 13, 2024

Therefore, my question is whether the 4-dimensional vector in vector-obs already provides all the information necessary for optimal decision-making?

I agree. I think the sum-to-three task is simple enough that the vector observation contains all the necessary information for optimal decision making, and that is why the vector observation and image observation perform similarly.

However, my guess is that image observations will outperform vector observations for harder games like 8-ball and 9-ball. These games present significantly higher levels of complexity with more balls that can block shot paths. In such scenarios, vector observations might struggle to encode the required spatial relationships. So I think your work tuning the image observation config is really important for future developments.

The results show that the performance of these variants is consistent with expectations and significantly better than the results with reanalyze_ratio=0

Amazing work! It's awesome to see the eval reward mean reach 8!

Hello, thanks to your efforts, the environment, tests, and documentation of pooltool have met the standards of LightZero. The current experimental results are also similar to those of the model-free RL algorithm SAC. Therefore, we have decided to merge this PR. Any further performance adjustments will be communicated here or through a new issue. Thank you for your contribution.

🥳 🍾 Thanks for your proactive attitude towards collaboration, it was a very positive experience for me.


Eventually, I would like to contribute a real pool game, like 8-ball. Before that is possible, I think we should expand the sum-to-three environment to support a 2-player mode, since most pool games involve two players. One caveat of turn play in pool is that turns do not alternate with each shot: a player keeps shooting until they miss, at which point the turn passes to the opponent. I also think we should create a heuristic-based bot. And finally, it would be awesome if we could play against the bot within the pooltool application. Does that sound like a good roadmap to you? I can make the necessary PRs when I find time.

@puyuan1996
Collaborator

Hello,

I apologize for the late response, I inadvertently overlooked your message earlier. Thank you for your interest and valuable feedback on our project! We are delighted to hear your suggestion about continuing the development of the billiards game in LightZero.

Your input is highly valuable, and we completely agree that adding a two-player mode is a great direction for expansion. Regarding the issue where ball turns do not simply alternate, we can address this by adjusting the to_play parameter in the environment.
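
A minimal sketch of that non-alternating turn rule via a to_play-style variable (player ids 0 and 1, and the function name, are assumptions for illustration):

```python
def next_to_play(to_play, shot_succeeded):
    """The shooter keeps the table after a successful shot;
    the turn passes only on a miss (players are 0 and 1)."""
    return to_play if shot_succeeded else 1 - to_play

assert next_to_play(0, True) == 0   # pot a ball, shoot again
assert next_to_play(0, False) == 1  # miss, opponent's turn
```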

As for the design of heuristic bots, we might need your expertise in billiards. Once we have trained the agent models, implementing the feature to compete against bots in the pooltool application should not be a problem.

We warmly welcome you to submit the relevant PR at your convenience and look forward to your contributions. Your involvement will be crucial for the development of the project. Should you have any questions or need further discussions, please feel free to reach out to us.

Thank you once again for your support and cooperation!

Best wishes!
