Skip to content

Feature/abstract data collector #156

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 40 commits into from
Jun 8, 2025

Conversation

Ben-geo
Copy link
Collaborator

@Ben-geo Ben-geo commented May 28, 2025

This PR introduces the AbstractDataCollector class, an extensible and standardized base for implementing data collection logic in mesa-frames. It provides a structured interface for collecting both model-level and agent-level data during simulation runs, with support for custom triggers, in-memory or external storage, and flexible reporter definitions.

Key Features :

  • Model & Agent Reporters: Accepts user-defined Callable or attribute names to specify what to collect.

  • Trigger-Based Collection: Supports conditional data collection via a trigger function; defaults to always collect.

  • In-Memory Storage with Optional External Flush: Stores collected data in memory with support for flushing to external backends (defined in child classes).

  • Reset Behavior: Automatically resets in-memory data after flush if reset_memory=True.

  • Abstract Interface: Defines the following abstract methods to be implemented by concrete subclasses:

    • _collect() : Actual logic for collecting model and agent data.
    • get_data() : Returns in-memory collected data.
    • _flush(): Flushes data to persistent storage.

Copy link

codecov bot commented May 28, 2025

Codecov Report

Attention: Patch coverage is 0% with 47 lines in your changes missing coverage. Please review.

Project coverage is 89.81%. Comparing base (ddd8031) to head (1329439).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
mesa_frames/abstract/datacollector.py 0.00% 47 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #156      +/-   ##
==========================================
- Coverage   92.27%   89.81%   -2.46%     
==========================================
  Files          11       12       +1     
  Lines        1721     1768      +47     
==========================================
  Hits         1588     1588              
- Misses        133      180      +47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Ben-geo Ben-geo requested a review from adamamer20 May 28, 2025 20:55
Copy link
Member

@adamamer20 adamamer20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben for putting this together.
I'm a bit confused about the interplay between collect(), _collect(), get_data(), and flush(), particularly regarding their intended public/private semantics and expected usage. Below are a few points for discussion:


1. Public vs. private: should collect() really be exposed?

Currently, collect() is a public method that checks should_collect() and then calls the abstract _collect().

If the intention is that users or developers should never call _collect() directly (as it's meant to be implemented by subclasses), then exposing collect() makes sense. However, if users are not meant to call collect() themselves either (e.g., because data is collected automatically after every n steps), then perhaps collect() should be renamed to something like _collect_if_triggered() or otherwise marked as internal.

Alternatively, it might be clearer to split responsibilities:

  • Keep collect() as the public method that always collects (i.e., unconditional).
  • Introduce a new internal method like _collect_if_triggered() that checks the trigger before collecting.

This would make the public API more predictable and reduce the risk of misuse.


2. Clarify the difference between collect() and get_data()

From the current docstrings:

  • collect() triggers data collection if the trigger condition is met.
  • get_data() returns the collected data currently in memory.

But it’s unclear:

  • Does get_data() trigger a collection if no data has been collected yet?
  • Does it always return just the in-memory data (self._frames)?
  • Could it return data from an external backend (if flushed)?

If get_data() is purely an accessor for self._frames, it might make sense to expose it as a @property instead, and rename it accordingly (e.g. data).
If it can also trigger collection or fetch from storage, that should be made explicit in the docstring and usage examples.


3. Docstrings could be more explicit and example-driven

For example, in flush(), it’s not immediately obvious what reset_memory does unless one reads the implementation. Could you clarify in the docstring:

  • What it means to "flush" data (to storage? to file? to database?).
  • What effect reset_memory has (i.e., clears in-memory buffers after flushing).

More broadly, it would be very helpful if each public method (collect, get_data, flush, reset) had:

  • A short, high-level summary of when it should be used,
  • And, ideally, a minimal usage example in the docstring to illustrate expected patterns.

@Ben-geo
Copy link
Collaborator Author

Ben-geo commented May 31, 2025

1. Public vs. private: should collect() really be exposed?

Currently, collect() is a public method that checks should_collect() and then calls the abstract _collect().

If the intention is that users or developers should never call _collect() directly (as it's meant to be implemented by subclasses), then exposing collect() makes sense. However, if users are not meant to call collect() themselves either (e.g., because data is collected automatically after every n steps), then perhaps collect() should be renamed to something like _collect_if_triggered() or otherwise marked as internal.

Alternatively, it might be clearer to split responsibilities:

  • Keep collect() as the public method that always collects (i.e., unconditional).
  • Introduce a new internal method like _collect_if_triggered() that checks the trigger before collecting.

This would make the public API more predictable and reduce the risk of misuse.

So in mesa data collection the model is used like this

class MoneyModel(mesa.Model):
    """A model with some number of agents."""

    def __init__(self, n, width, height, seed=None):
        super().__init__(seed=seed)
        self.num_agents = n
        self.grid = OrthogonalMooreGrid(
            (width, height), torus=True, capacity=10, random=self.random
        )
        # Instantiate DataCollector
        self.datacollector = mesa.DataCollector(
            model_reporters={"Gini": compute_gini}, agent_reporters={"Wealth": "wealth"}
        )

        # Create agents
        agents = MoneyAgent.create_agents(
            self,
            self.num_agents,
            self.random.choices(self.grid.all_cells.cells, k=self.num_agents),
        )

    def step(self):
        # Collect data each step
        self.datacollector.collect(self)
        self.agents.shuffle_do("move")
        self.agents.do("give_money")

i thought we will do the same and the user has to call collect in step so it should be public right and _collect will only be called if the data actually should be collected based on _should_collect

  1. Clarify the difference between collect() and get_data()
  2. Docstrings could be more explicit and example-driven

i think I fixed these with current changes

I was also thinking of a load_data methd, this is for loading specific data but I wanted to know if this is necessary and should we implement it, if we proceed we should probably have parameters like [model,agent,both] and maybe step
but what will default be like load everything? and that can cause memory issues so I don't think we should implement a load_data method
what's your take on this?

@Ben-geo Ben-geo marked this pull request as ready for review June 1, 2025 12:39
@adamamer20
Copy link
Member

adamamer20 commented Jun 1, 2025

Hey Ben, looks good to me! Just make sure to fix any pre-commit errors before merging.

You can run checks locally with pre-commit run --all or uv run pre-commit --all if you're using uv.
If you're on VS Code, there's also a handy extension:
https://marketplace.visualstudio.com/items?itemName=elagil.pre-commit-helper

No worries about CodeCov since it's just an abstract implementation.

@Ben-geo Ben-geo force-pushed the feature/abstract-data-collector branch from 8b5ab29 to 9cac504 Compare June 4, 2025 19:41
@Ben-geo Ben-geo force-pushed the feature/abstract-data-collector branch from 65f8caa to 9a05e68 Compare June 4, 2025 19:47
@Ben-geo
Copy link
Collaborator Author

Ben-geo commented Jun 4, 2025

Hey @adamamer20 I ran pre commit and i got get conflicts which for some reason I get conflicts, I am not able to solve it after pulling from main.
could you give me some context on what's happening here, is it because I didn't set up my environment correctly?
should I revert to the commit before the precommit?

@adamamer20
Copy link
Member

adamamer20 commented Jun 6, 2025

@Ben-geo I recently merged #143 into main, which modified the uv.lock file — this is the file uv uses to lock dependencies. Since you also touched uv.lock in your branch, GitHub flagged a merge conflict.

I resolved it by accepting all changes from the incoming branch (main) using VS Code. If you're curious how to handle this sort of conflict next time, here's a short video on how I did it.

Also, it looks like uv.lock changed on your side because you added pre-commit directly to the core dependencies of mesa-frames. While this works, it's better to list dev tools like pre-commit in the dev dependency group so that end users installing mesa-frames don’t get unnecessary dev tools.

To fix that, I merged #158 — now running uv sync will properly install pre-commit as a dev dependency only.


As for the pre-commit issues:

(mesa-frames) adam@Supermacchina:~/projects/mesa-frames$ pre-commit run --all
ruff (legacy alias)......................................................Failed
- hook id: ruff
- exit code: 1

mesa_frames/abstract/datacollector.py:79:9: D400 First line should end with a period
   |
77 |           storage: Literal["memory:", "csv:", "postgresql:"] = "memory:",
78 |       ):
79 | /         """
80 | |         Initialize a Datacollector
81 | |
82 | |         Parameters
83 | |         ----------
84 | |         model : ModelDF
85 | |             The model object from which data is collected.
86 | |         model_reporters : dict[str, Callable], optional
87 | |             Functions to collect data at the model level.
88 | |         agent_reporters : dict[str, Union[str, Callable]], optional
89 | |             Attributes or functions to collect data at the agent level.
90 | |         trigger : Callable, optional
91 | |             A function(model) -> bool that determines whether to collect data.
92 | |         reset_memory : bool
93 | |             Whether to reset in-memory data after flushing. Default is True.
94 | |         storage : str
95 | |             Storage backend URI (e.g. 'memory:', 'csv:', 'postgresql:').
96 | |         """
   | |___________^ D400
97 |           self._model = model
98 |           self._model_reporters = model_reporters or {}
   |
   = help: Add period

mesa_frames/abstract/datacollector.py:106:9: D400 First line should end with a period
    |
105 |       def collect(self) -> None:
106 | /         """
107 | |         Trigger Data collection
108 | |
109 | |         This method caslls _collect() to perform actual data collection
110 | |
111 | |         Example
112 | |         -------
113 | |         >>> datacollector.collect()
114 | |         """
    | |___________^ D400
115 |           self._collect()
    |
    = help: Add period

mesa_frames/abstract/datacollector.py:118:9: D400 First line should end with a period
    |
117 |       def conditional_collect(self) -> None:
118 | /         """
119 | |         Trigger data collection if condition is met
120 | |
121 | |         This method caslls _collect() to perform actual data collection
122 | |
123 | |         Example
124 | |         -------
125 | |         >>> datacollector.conditional_collect()
126 | |         """
    | |___________^ D400
127 |           if self._should_collect():
128 |               self._collect()
    |
    = help: Add period

mesa_frames/abstract/datacollector.py:131:9: D400 First line should end with a period
    |
130 |       def _should_collect(self) -> bool:
131 | /         """
132 | |         Evaluates whether data should be collected at current step
133 | |
134 | |         Returns
135 | |         -------
136 | |         bool
137 | |             True if the configured trigger condition is met, False otherwise.
138 | |         """
    | |___________^ D400
139 |           return self._trigger(self._model)
    |
    = help: Add period

mesa_frames/abstract/datacollector.py:131:9: D401 First line of docstring should be in imperative mood: "Evaluates whether data should be collected at current step"
    |
130 |       def _should_collect(self) -> bool:
131 | /         """
132 | |         Evaluates whether data should be collected at current step
133 | |
134 | |         Returns
135 | |         -------
136 | |         bool
137 | |             True if the configured trigger condition is met, False otherwise.
138 | |         """
    | |___________^ D401
139 |           return self._trigger(self._model)
    |

mesa_frames/abstract/datacollector.py:143:9: D400 First line should end with a period
    |
141 |       @abstractmethod
142 |       def _collect(self):
143 | /         """
144 | |         Performs the actual data collection logic
145 | |
146 | |         This method must be im
147 | |         """
    | |___________^ D400
148 |           pass
    |
    = help: Add period

mesa_frames/abstract/datacollector.py:143:9: D401 First line of docstring should be in imperative mood: "Performs the actual data collection logic"
    |
141 |       @abstractmethod
142 |       def _collect(self):
143 | /         """
144 | |         Performs the actual data collection logic
145 | |
146 | |         This method must be im
147 | |         """
    | |___________^ D401
148 |           pass
    |

mesa_frames/abstract/datacollector.py:153:9: D400 First line should end with a period
    |
151 |       @abstractmethod
152 |       def data(self) -> Any:
153 | /         """
154 | |         Returns collected data currently in memory as a dataframe
155 | |
156 | |         Example:
157 | |         -------
158 | |         >>> df = datacollector.data
159 | |         >>> print(df)
160 | |         """
    | |___________^ D400
161 |           pass
    |
    = help: Add period

mesa_frames/abstract/datacollector.py:206:9: D400 First line should end with a period
    |
204 |       @property
205 |       def seed(self) -> int:
206 | /         """
207 | |         Function to get the model seed
208 | |
209 | |         Example:
210 | |         --------
211 | |         >>> seed = datacollector.seed
212 | |         """
    | |___________^ D400
213 |           return self._model._seed
    |
    = help: Add period

Found 9 errors.
No fixes available (7 hidden fixes can be enabled with the `--unsafe-fixes` option).

ruff format..............................................................Passed
pyupgrade................................................................Passed
trim trailing whitespace.................................................Passed
check toml...............................................................Passed
check yaml...............................................................Passed
codespell................................................................Failed
- hook id: codespell
- exit code: 65

mesa_frames/abstract/datacollector.py:172: Parametrs ==> Parameters
mesa_frames/abstract/datacollector.py:174: intialization ==> initialization

markdownlint-cli2........................................................Passed
pydoclint................................................................Failed
- hook id: pydoclint
- exit code: 1

Skipping files that match this pattern: \.git|\.tox
docs/api/conf.py
examples/boltzmann_wealth/performance_plot.py
examples/sugarscape_ig/__init__.py
examples/sugarscape_ig/performance_comparison.py
examples/sugarscape_ig/ss_mesa/__init__.py
examples/sugarscape_ig/ss_mesa/agents.py
examples/sugarscape_ig/ss_mesa/model.py
examples/sugarscape_ig/ss_polars/__init__.py
examples/sugarscape_ig/ss_polars/agents.py
examples/sugarscape_ig/ss_polars/model.py
mesa_frames/__init__.py
mesa_frames/abstract/__init__.py
mesa_frames/abstract/agents.py
mesa_frames/abstract/datacollector.py
mesa_frames/abstract/mixin.py
mesa_frames/abstract/space.py
mesa_frames/concrete/__init__.py
mesa_frames/concrete/agents.py
mesa_frames/concrete/agentset.py
mesa_frames/concrete/mixin.py
mesa_frames/concrete/model.py
mesa_frames/concrete/space.py
mesa_frames/types_.py
mesa_frames/utils.py
tests/__init__.py
tests/test_agents.py
tests/test_agentset.py
tests/test_grid.py
tests/test_mixin.py
tests/test_modeldf.py

mesa_frames/abstract/datacollector.py
    70: DOC105: Method `AbstractDataCollector.__init__`: Argument names match, but type hints in these args do not match: model_reporters, agent_reporters, trigger, storage

(mesa-frames) adam@Supermacchina:~/projects/mesa-frames$ 

Here's a summary of the failing hooks:

ruff

  • Many docstrings in mesa_frames/abstract/datacollector.py are missing a period on the first line (D400).
  • Some docstrings aren't in imperative mood (D401), e.g., "Evaluates whether...""Evaluate whether..."

codespell

  • Typos detected:

    • ParametrsParameters
    • intializationinitialization

pydoclint

  • Signature mismatches in the AbstractDataCollector.__init__ docstring compared to actual type hints.

Copy link
Member

@adamamer20 adamamer20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small change for imports. Everything else, LGTM.

Copy link
Member

@adamamer20 adamamer20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small change for imports. Everything else, LGTM.

@adamamer20 adamamer20 self-requested a review June 8, 2025 11:18
@adamamer20 adamamer20 enabled auto-merge (squash) June 8, 2025 11:19
@adamamer20 adamamer20 merged commit a08eb36 into projectmesa:main Jun 8, 2025
9 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants