fixed polar data type issue, where it would not save the data properly #27
Conversation
Summary of Changes

Hello @LuisJG8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request primarily addresses a data type inconsistency that was preventing timestamp data from being stored properly with the Polars library.
Code Review
This pull request successfully addresses a data type mismatch for the got_data_in field by changing it from a string to a datetime object, which resolves an issue with Polars. While the main fix is correct, I've identified a critical issue in worker.py where a global variable is used for timestamps, leading to stale data. Additionally, there are several suggestions to improve code quality, such as using timezone-aware datetimes consistently, removing commented-out code, and implementing placeholder logic in a new Airflow DAG.
```diff
 logger = get_task_logger(__name__)
-todays_date = datetime.now().isoformat()
+todays_date = datetime.now()
```
Defining todays_date at the module level is a critical issue. It gets initialized only once when the worker module is imported, causing all tasks to use the same, stale timestamp. This affects not only got_data_in but also the fallbacks for created_at, updated_at, and pushed_at. To ensure data accuracy, the timestamp should be generated inside the get_github_data task each time it runs. It is also recommended to use timezone-aware datetimes, e.g., datetime.now(timezone.utc) (which would require importing timezone from datetime).
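To illustrate the reviewer's point, here is a minimal sketch (the function name `fresh_timestamp` is hypothetical, not from the PR) contrasting a module-level timestamp, which is frozen at import time, with one generated inside the task body on each run:

```python
from datetime import datetime, timezone
import time

# Evaluated once, when the module is imported: every task run
# that reads this value sees the same stale timestamp.
TODAYS_DATE = datetime.now(timezone.utc)

def fresh_timestamp() -> datetime:
    # Evaluated on every call, so each task run gets an
    # accurate, timezone-aware timestamp.
    return datetime.now(timezone.utc)

first = fresh_timestamp()
time.sleep(0.01)
second = fresh_timestamp()
print(second > first)  # True: per-call timestamps advance
```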
```python
@task
def run_the_queue():
    print("hello")

run_the_queue()
```
The task run_the_queue is currently a placeholder that only prints "hello". This is misleading given the DAG's name (run_queue) and its description, which states it's for running a Celery queue. The task should contain the logic to actually trigger or run the Celery/RabbitMQ process. If this is a work in progress, consider adding a TODO comment to make the placeholder nature explicit.
```diff
 @task
 def run_the_queue():
+    # TODO: Implement logic to run the Celery queue for GitHub data processing.
     print("hello")

 run_the_queue()
```
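If the placeholder is meant to stay for now, one hedged way to sketch the eventual wiring (the module path `worker` and queue name `github` are assumptions, not taken from the PR) is to build the Celery CLI invocation that an Airflow BashOperator or subprocess call could then execute:

```python
import shlex

def build_queue_command(queue: str = "github") -> list[str]:
    # Build (but do not run) the celery worker invocation;
    # "worker" is the assumed Celery app module in this repo.
    return shlex.split(f"celery -A worker worker --queues {queue} --loglevel info")

print(build_queue_command())
```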
```diff
-today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
+today = datetime.now().strftime("%Y-%m-%d")
```
This change removes timezone awareness from the today variable by switching from datetime.now(timezone.utc) to datetime.now(). Using naive datetimes can lead to ambiguity and bugs, especially in distributed systems. It's recommended to consistently use timezone-aware datetimes to ensure consistency.
```diff
-today = datetime.now().strftime("%Y-%m-%d")
+today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
```
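The hazard with naive datetimes can be shown in a couple of lines: Python refuses to even compare naive and aware values, and `strftime` on an aware UTC datetime yields the UTC date rather than whatever the server's local date happens to be:

```python
from datetime import datetime, timezone

naive = datetime.now()              # no tzinfo attached
aware = datetime.now(timezone.utc)  # explicit UTC

try:
    naive < aware
except TypeError as exc:
    # can't compare offset-naive and offset-aware datetimes
    print(exc)

today = aware.strftime("%Y-%m-%d")  # date computed in UTC
print(today)
```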
```python
# duckdb.sql("DESCRIBE SELECT * FROM df").show()

# duckdb.sql("SELECT language, COUNT(language) AS c_p \
#             FROM df \
#             GROUP BY language \
#             ORDER BY c_p DESC").show()

duckdb.sql("DESCRIBE SELECT * FROM df").show()
```
This section contains commented-out code and a duplicated line (duckdb.sql("DESCRIBE SELECT * FROM df").show()). To improve code clarity and maintainability, it's best to remove unused, commented-out code.
```diff
-# duckdb.sql("DESCRIBE SELECT * FROM df").show()
-# duckdb.sql("SELECT language, COUNT(language) AS c_p \
-#             FROM df \
-#             GROUP BY language \
-#             ORDER BY c_p DESC").show()
 duckdb.sql("DESCRIBE SELECT * FROM df").show()
```
```yaml
# airflow-init:
#   image: apache/airflow:2.9.2
#   environment:
#     AIRFLOW__CORE__EXECUTOR: LocalExecutor
#     AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: sqlite:////opt/airflow/airflow.db
#     AIRFLOW__CORE__LOAD_EXAMPLES: "false"
#   volumes:
#     - ./airflow/dags:/opt/airflow/dags
#     - ./:/opt/airflow/project
#     - airflow_data:/opt/airflow
#   command: ["airflow", "db", "init"]

# airflow-scheduler:
#   image: apache/airflow:2.9.2
#   depends_on:
#     - airflow-init
#   environment:
#     AIRFLOW__CORE__EXECUTOR: LocalExecutor
#     AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: sqlite:////opt/airflow/airflow.db
#     AIRFLOW__CORE__LOAD_EXAMPLES: "false"
#   volumes:
#     - ./airflow/dags:/opt/airflow/dags
#     - ./:/opt/airflow/project
#     - airflow_data:/opt/airflow
#   command: ["airflow", "scheduler"]

# airflow-webserver:
#   image: apache/airflow:2.9.2
#   depends_on:
#     - airflow-init
#   environment:
#     AIRFLOW__CORE__EXECUTOR: LocalExecutor
#     AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: sqlite:////opt/airflow/airflow.db
#     AIRFLOW__CORE__LOAD_EXAMPLES: "false"
#   volumes:
#     - ./airflow/dags:/opt/airflow/dags
#     - ./:/opt/airflow/project
#     - airflow_data:/opt/airflow
#   ports:
#     - "8080:8080"
#   command: ["airflow", "webserver"]
```
A large block of commented-out configuration for Airflow has been added. If this is not actively used, it should be removed to keep the docker-compose.yml file clean and focused on the running services. If this is for documentation or an alternative setup, consider moving it to a separate file (e.g., docker-compose.airflow.yml) or into the project's documentation.
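One way to follow the reviewer's suggestion, sketched under the assumption that the Airflow services are wanted as an opt-in setup (the file name `docker-compose.airflow.yml` is hypothetical), is a separate override file that is only loaded when explicitly requested:

```yaml
# docker-compose.airflow.yml (hypothetical companion file)
# Load alongside the main file only when Airflow is needed:
#   docker compose -f docker-compose.yml -f docker-compose.airflow.yml up
services:
  airflow-init:
    image: apache/airflow:2.9.2
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: sqlite:////opt/airflow/airflow.db
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
    command: ["airflow", "db", "init"]
```

This keeps the main docker-compose.yml limited to the services that actually run, while the alternative setup stays versioned and documented.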
```diff
 # ===== MESSAGE METADATA =====
 message_id: str
-got_data_in: str = Field(default_factory=lambda: datetime.now().isoformat())
+got_data_in: datetime = Field(default_factory=lambda: datetime.now())
```
Using datetime.now() creates a naive datetime object, which can lead to issues with timezones. It is a best practice to use timezone-aware datetimes to avoid ambiguity. You can use datetime.now(timezone.utc) for this. Note that you will need to add from datetime import timezone at the top of the file.
```diff
-got_data_in: datetime = Field(default_factory=lambda: datetime.now())
+got_data_in: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
```
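The same `default_factory` pattern can be demonstrated with a stdlib dataclass (used here only because it needs no third-party install; the pydantic `Field` call in the diff behaves analogously), showing that each instance gets its own fresh, timezone-aware timestamp:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Message:
    message_id: str
    # The factory runs at instantiation time, so every Message
    # gets a fresh, timezone-aware timestamp.
    got_data_in: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

msg = Message(message_id="abc-123")
print(msg.got_data_in.tzinfo)  # UTC
```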
Changed the got_data_in field to datetime instead of str. Polars was throwing an error because got_data_in was a string while the other timestamp values were datetimes: defining the todays_date variable as datetime.now().isoformat() made it a str rather than a datetime. Changing it to datetime.now() fixed the issue, since the value is then a datetime instead of a str.
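The root cause the author describes reduces to a two-line type check: `.isoformat()` returns a plain `str`, while `datetime.now()` returns a `datetime`, which is what led Polars to infer mismatched column types across the timestamp fields:

```python
from datetime import datetime

as_string = datetime.now().isoformat()  # a str like "2025-01-01T12:00:00"
as_datetime = datetime.now()            # a datetime.datetime object

# Polars infers a String/Utf8 column from the first value but a
# Datetime column from the second, so mixing the two styles across
# timestamp fields is what caused the save error.
print(type(as_string).__name__)    # str
print(type(as_datetime).__name__)  # datetime
```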