Development #22

Merged
merged 9 commits into from
Apr 4, 2025

@gjbex gjbex commented Apr 4, 2025

Summary by Sourcery

Update project dependencies, documentation, and add new example scripts for web scraping and data preprocessing

New Features:

  • Added a preprocessing script for web scraping OpenMP FAQ
  • Added a new Jupyter notebook on Copy-on-Write in pandas

Enhancements:

  • Updated environment.yml with additional dependencies
  • Updated Altair and Polars notebook examples
  • Removed hands-on directory

Documentation:

  • Updated README files
  • Added documentation for web scraping preprocessing script

Some of these libraries get flagged as security vulnerabilities
by GitHub dependency checks.  Given that they are not required,
removing them is simpler than updating them.

Although a link to Google Colab is a good idea, the data is not
on Google Colab, so the notebook cannot be executed there.
The notebook could be adapted to deal with that, but for now, it
is better to remove the link.

- Update all packages
- Add Altair

sourcery-ai bot commented Apr 4, 2025

Reviewer's Guide by Sourcery

This pull request includes several updates and additions, including adding new dependencies, updating the Altair notebook, improving the formatting of the polars notebook, adding a new notebook illustrating Copy-on-Write, and adding a new script to preprocess the OpenMP FAQ.

Updated class diagram for Altair chart selections

classDiagram
    class alt_selection {
        +type: str
        +nearest: bool
        +on: str
        +fields: list[str]
        +empty: str
    }
    class alt_selection_point {
        +type: str
        +fields: list[str]
        +bind: str
    }

    note for alt_selection "Replaced alt.selection with alt.selection_point"
    note for alt_selection_point "To be used for legend and interactive charts"

    alt_selection <|-- alt_selection_point

Flow diagram for web scraping preprocessing

graph LR
    A[HTML File] --> B(Beautiful Soup Parsing)
    B --> C{Extract Questions and Answers}
    C --> D[JSONL Output]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
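
The flow above could be sketched roughly as follows. This is a minimal sketch, not the preprocess_openmp_faq.py script from this pull request: it swaps Beautiful Soup for the standard library's html.parser so that it runs without extra packages, and it assumes that every question is an h4 element followed by p elements holding the answer. The names FaqParser and html_to_jsonl are hypothetical.

```python
import json
from html.parser import HTMLParser


class FaqParser(HTMLParser):
    """Collect question/answer pairs from <h4> headers and the <p> tags after them."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # one {'input': ..., 'output': ...} dict per question
        self._field = None    # field currently receiving text, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            # Assumption: every <h4> starts a new question.
            self.pairs.append({'input': '', 'output': ''})
            self._field = 'input'
        elif tag == 'p' and self.pairs:
            # Paragraphs after a question are treated as part of its answer.
            self.pairs[-1]['output'] += ' '
            self._field = 'output'

    def handle_endtag(self, tag):
        if tag in ('h4', 'p'):
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.pairs[-1][self._field] += data


def html_to_jsonl(html):
    """Return one JSON string per question/answer pair found in html."""
    parser = FaqParser()
    parser.feed(html)
    parser.close()
    lines = []
    for pair in parser.pairs:
        # Collapse runs of whitespace, including newlines, into single spaces.
        record = {key: ' '.join(value.split()) for key, value in pair.items()}
        lines.append(json.dumps(record))
    return lines
```

Printing each returned string on its own line would give JSONL output. Unlike the loop quoted in the review further down, this sketch does not stop at the first non-paragraph element, so trailing paragraphs after the last question would end up in its answer.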

File-Level Changes

Change Details Files
Added numerous dependencies to the environment.yml file to ensure the project has all the necessary libraries for data science tasks.
  • Added dependencies such as lxml, ca-certificates, certifi, openssl, and many more.
  • Included libraries for data manipulation (e.g., scikit-learn, polars, pyarrow).
  • Added libraries for visualization (e.g., altair, hvplot, panel, bokeh).
  • Included libraries for geospatial data processing (e.g., pyproj, libspatialindex, geopandas-base, libgdal).
  • Added libraries for web scraping and parsing (e.g., beautifulsoup4, lxml).
  • Included libraries for cloud storage access (e.g., aws-sdk-cpp, azure-storage-blobs-cpp, libgoogle-cloud-storage).
environment.yml
Updated the Altair interaction notebook to use Vega-Lite v5.20.1 and Vega v5, and modified the selection logic.
  • Updated Vega-Lite version from 4.8.1 to 5.20.1.
  • Updated Vega version to 5.
  • Modified the JavaScript code to load Vega and Vega-Lite libraries from CDN.
  • Updated the Vega-Lite schema URL.
  • Changed selection type from single and multi to selection_point.
  • Replaced add_selection with add_params.
source-code/altair/altair_interaction.ipynb
Improved the formatting of the polars notebook.
  • Added newlines and indentation to improve readability of the group_by aggregation in the polars notebook.
source-code/polars/patient_data.ipynb
Added a description of a new preprocessing script to the web-scraping README.
  • Added a description of the preprocessing script, which scrapes a web page containing FAQs and prints them in JSONL format.
source-code/web-scraping/README.md
Added a new notebook illustrating Copy-on-Write to the pandas README.
  • Added a description of the copy_on_write.ipynb notebook, which illustrates how data is shared between related notebooks and the role Copy-on-Write plays.
source-code/pandas/README.md
Removed the hands-on directory from the main README.
  • Removed the line referencing the hands-on directory.
README.md
Removed the python_for_data_science_linux64_conda_specs.txt file.
  • Removed the python_for_data_science_linux64_conda_specs.txt file.
python_for_data_science_linux64_conda_specs.txt
Added a new notebook illustrating Copy-on-Write to the pandas directory.
  • Added a new notebook copy_on_write.ipynb to the pandas directory.
  • The notebook illustrates how data is shared between related notebooks and the role Copy-on-Write plays in order to prevent accidental data modifications in more than one dataframe.
source-code/pandas/copy_on_write.ipynb
Added a new script to preprocess the OpenMP FAQ to the web-scraping directory.
  • Added a new script preprocess_openmp_faq.py to the web-scraping directory.
  • The script reads the OpenMP FAQ from an HTML file, and writes the output to standard output in JSONL format.
source-code/web-scraping/preprocessing/preprocess_openmp_faq.py
Added a new HTML file containing the OpenMP FAQ to the web-scraping directory.
  • Added a new HTML file openmp_faq.html to the web-scraping directory.
  • The HTML file contains the OpenMP FAQ.
source-code/web-scraping/preprocessing/openmp_faq.html
Added a new JSONL file containing the OpenMP FAQ to the web-scraping directory.
  • Added a new JSONL file openmp_faq.jsonl to the web-scraping directory.
  • The JSONL file contains the OpenMP FAQ in a structured format.
source-code/web-scraping/preprocessing/openmp_faq.jsonl
Removed the hands-on/patient_data.ipynb file.
  • Removed the hands-on/patient_data.ipynb file.
hands-on/patient_data.ipynb
Removed six JavaScript files from source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/.
  • Removed analytics.js, disclaim-element.js, graph-calc.js, jquery-1.4.2.min.js, jquery.min.js, and jquery.tablesorter.min.js.
source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/analytics.js
source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/disclaim-element.js
source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/graph-calc.js
source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery-1.4.2.min.js
source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.min.js
source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.tablesorter.min.js



@gjbex gjbex merged commit e24c567 into master Apr 4, 2025

@sourcery-ai sourcery-ai bot left a comment


Hey @gjbex - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • The Google Analytics ID is hardcoded in the HTML. (link)
  • The Google Analytics ID is hardcoded in the HTML. (link)
  • The Google Analytics ID is hardcoded in the HTML. (link)
  • The Google Analytics ID is hardcoded in the HTML. (link)
  • reCAPTCHA site key is hardcoded in the HTML. (link)
  • reCAPTCHA site key is hardcoded in the HTML. (link)
  • reCAPTCHA site key is hardcoded in the HTML. (link)

Overall Comments:

  • The environment.yml file has a large number of dependencies; consider using a more minimal set of dependencies and letting conda resolve the rest.
  • The notebook diffs show changes to the vega-lite version; consider updating the notebook to use the latest version of vega-lite.
Here's what I looked at during the review
  • 🟡 General issues: 4 issues found
  • 🔴 Security: 7 blocking issues
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


Comment on lines +28 to +33
def clean_answer(text):
    text = text.replace('\n', ' ').strip()
    pos = text.find('Version 3.0 Last updated:')
    text = text[:pos]
    pos = text.find('(Quote from:')
    return text[:pos].strip()

suggestion: Consider handling cases when the expected marker substring is not found.

The function calls text.find() and then slices the string at the returned position. If a marker (e.g. 'Version 3.0 Last updated:' or '(Quote from:') is absent, find() returns -1, so the slice text[:pos] becomes text[:-1] and silently drops the last character of the answer. Adding an explicit check for -1 would make the code more robust.

Suggested change
-def clean_answer(text):
-    text = text.replace('\n', ' ').strip()
-    pos = text.find('Version 3.0 Last updated:')
-    text = text[:pos]
-    pos = text.find('(Quote from:')
-    return text[:pos].strip()
+def clean_answer(text):
+    text = text.replace('\n', ' ').strip()
+    pos = text.find('Version 3.0 Last updated:')
+    if pos != -1:
+        text = text[:pos]
+    pos = text.find('(Quote from:')
+    if pos != -1:
+        text = text[:pos]
+    return text.strip()
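
To make the effect concrete, here is a small sketch comparing the unchecked slicing with an alternative built on str.partition. The partition variant is not what the suggestion above proposes; it is a different way to get the same result, shown only for illustration. The name clean_answer_unchecked is hypothetical and stands for the code as submitted.

```python
MARKERS = ('Version 3.0 Last updated:', '(Quote from:')


def clean_answer_unchecked(text):
    """Logic as submitted: slice at the find() result without checking for -1."""
    text = text.replace('\n', ' ').strip()
    pos = text.find(MARKERS[0])
    text = text[:pos]
    pos = text.find(MARKERS[1])
    return text[:pos].strip()


def clean_answer(text):
    """Alternative sketch: partition() returns the whole string when the marker is absent."""
    text = text.replace('\n', ' ').strip()
    for marker in MARKERS:
        # Keep only the part before the first occurrence of the marker.
        text, _, _ = text.partition(marker)
    return text.strip()


plain = 'OpenMP is an API.'
print(repr(clean_answer_unchecked(plain)))  # one character is lost per missing marker
print(repr(clean_answer(plain)))            # text is left intact
```

With neither marker present, the unchecked version drops one trailing character for each failed find(), while the partition version returns the input unchanged.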

Comment on lines +7 to +9
"source": [
"## Requirments"
]

nitpick (typo): Typo in markdown header.

The header '## Requirments' contains a typo; it should be spelled '## Requirements'.

Suggested change
-"source": [
- "## Requirments"
-]
+"source": [
+ "## Requirements"
+]

Comment on lines +506 to +508
"source": [
"Indeed, the first column of `data2` was affected, its minimum and maximem values aore no longer approximately -1000 and 1000 as before, but rather -500 and 500 respectively."
]

nitpick (typo): Typographical errors in descriptive text.

The markdown text contains typographical errors: 'maximem' should be 'maximum' and 'aore' should be 'are'. Correcting these will improve the professionalism of the notebook.

Suggested change
-"source": [
- "Indeed, the first column of `data2` was affected, its minimum and maximem values aore no longer approximately -1000 and 1000 as before, but rather -500 and 500 respectively."
-]
+"source": [
+ "Indeed, the first column of `data2` was affected, its minimum and maximum values are no longer approximately -1000 and 1000 as before, but rather -500 and 500 respectively."
+]

Comment on lines +29 to +31
1. `copy_on_write.ipynb`: Jupyter notebook that illustrates how data is shared
between related notebooks and the role Copy-on-Write plays in order to
prevent accidental data modifications in more than one dataframe.

suggestion (typo): "Copy-on-Write" should likely be "copy-on-write".

<meta property="og:image:height" content="100"/>
<meta property="og:image:type" content="image/png"/>
<!-- This site uses the Google Analytics by MonsterInsights plugin v9.2.2 - Using Analytics tracking - https://www.monsterinsights.com/ -->
<script src="//www.googletagmanager.com/gtag/js?id=G-WFCTSPN1R1" data-cfasync="false" data-wpfc-render="false" type="text/javascript" async></script>

🚨 issue (security): The Google Analytics ID is hardcoded in the HTML.

/* ]]> */
</script>
<script type="text/javascript" src="https://www.openmp.org/wp-content/plugins/contact-form-7/includes/js/index.js?ver=6.0" id="contact-form-7-js"></script>
<script type="text/javascript" src="https://www.google.com/recaptcha/api.js?render=6Lepn-wUAAAAAJLIq_PnJ7mEek8wpQyykB8Ktt2v&amp;ver=3.0" id="google-recaptcha-js"></script>

🚨 issue (security): reCAPTCHA site key is hardcoded in the HTML.

<script type="text/javascript" src="https://www.openmp.org/wp-includes/js/dist/vendor/wp-polyfill.min.js?ver=3.15.0" id="wp-polyfill-js"></script>
<script type="text/javascript" id="wpcf7-recaptcha-js-before">
/* <![CDATA[ */
var wpcf7_recaptcha = {

🚨 issue (security): reCAPTCHA site key is hardcoded in the HTML.

<script type="text/javascript" id="wpcf7-recaptcha-js-before">
/* <![CDATA[ */
var wpcf7_recaptcha = {
"sitekey": "6Lepn-wUAAAAAJLIq_PnJ7mEek8wpQyykB8Ktt2v",

🚨 issue (security): reCAPTCHA site key is hardcoded in the HTML.

answer = ''
next_tag = h4_tag.find_next('p')
while next_tag and next_tag.name == 'p':
    answer += next_tag.text.strip() + ' '

suggestion (code-quality): Use f-string instead of string concatenation (use-fstring-for-concatenation)

Suggested change
-    answer += next_tag.text.strip() + ' '
+    answer += f'{next_tag.text.strip()} '

Comment on lines +41 to +43
json_strs = []
for tag in doc.find_all(is_question_header):
    json_strs.append(f'{{"input": "{get_question(tag)}", "output": "{get_answer(tag)}"}}')

suggestion (code-quality): Convert for loop into list comprehension (list-comprehension)

Suggested change
-json_strs = []
-for tag in doc.find_all(is_question_header):
-    json_strs.append(f'{{"input": "{get_question(tag)}", "output": "{get_answer(tag)}"}}')
+json_strs = [
+    f'{{"input": "{get_question(tag)}", "output": "{get_answer(tag)}"}}'
+    for tag in doc.find_all(is_question_header)
+]
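
Separately from the loop-versus-comprehension question, building JSON with an f-string may produce invalid lines whenever a question or answer contains a double quote or a backslash. The sketch below shows one possible alternative using json.dumps. The helper names are hypothetical, and the question and answer are passed in as plain strings rather than obtained from get_question(tag) and get_answer(tag).

```python
import json


def to_jsonl_line_fstring(question, answer):
    """Approach from the snippet above: interpolate the text without escaping."""
    return f'{{"input": "{question}", "output": "{answer}"}}'


def to_jsonl_line(question, answer):
    """Alternative sketch: let json.dumps handle quoting and escaping."""
    return json.dumps({'input': question, 'output': answer})


question = 'What does "shared" mean?'
answer = 'All threads see the same variable.'

try:
    json.loads(to_jsonl_line_fstring(question, answer))
except json.JSONDecodeError as error:
    print(f'f-string line is not valid JSON: {error}')

# The json.dumps line parses back to the original strings.
print(json.loads(to_jsonl_line(question, answer)))
```

In the comprehension, this would amount to replacing the f-string with a call such as json.dumps({'input': get_question(tag), 'output': get_answer(tag)}).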
