Development #22

gjbex · 2025-04-04T12:34:34Z

Summary by Sourcery

Update project dependencies and documentation, and add new example scripts for web scraping and data preprocessing

New Features:

Added a script that preprocesses the scraped OpenMP FAQ
Added a new Jupyter notebook on Copy-on-Write in pandas

Enhancements:

Updated environment.yml with additional dependencies
Updated Altair and Polars notebook examples
Removed hands-on directory

Documentation:

Updated README files
Added documentation for web scraping preprocessing script

Some of these libraries get flagged as security vulnerabilities by GitHub dependency checks. Given that they are not required, removing is simpler than updating.

Although a link to Google Colab is a good idea, the data is not on Google Colab, making executing the notebook there impossible. The notebook could be adapted to deal with that, but for now, it is better to remove it.

- Update all packages
- Add Altair

sourcery-ai · 2025-04-04T12:34:38Z

Reviewer's Guide by Sourcery

This pull request adds new dependencies, updates the Altair notebook, improves the formatting of the Polars notebook, adds a notebook illustrating Copy-on-Write, and adds a script to preprocess the OpenMP FAQ.

Updated class diagram for Altair chart selections

classDiagram
    class alt_selection {
        +type: str
        +nearest: bool
        +on: str
        +fields: list[str]
        +empty: str
    }
    class alt_selection_point {
        +type: str
        +fields: list[str]
        +bind: str
    }

    note for alt_selection "Replaced alt.selection with alt.selection_point"
    note for alt_selection_point "To be used for legend and interactive charts"

    alt_selection <|-- alt_selection_point

Flow diagram for web scraping preprocessing

graph LR
    A[HTML File] --> B(Beautiful Soup Parsing)
    B --> C{Extract Questions and Answers}
    C --> D[JSONL Output]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
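
The flow above can be sketched in a few lines. The script in this pull request uses Beautiful Soup; the sketch below swaps in the standard library's `html.parser` so that it runs without third-party packages. The names `FaqParser` and `html_to_jsonl` are hypothetical, and treating every `<h4>` as a question followed by `<p>` answer paragraphs is an assumption that simplifies the script's own `is_question_header` logic, which is not shown in this conversation.

```python
import json
from html.parser import HTMLParser


class FaqParser(HTMLParser):
    """Collect [question, answer] pairs (hypothetical helper).

    Assumption: each <h4> holds a question and the <p> elements
    that follow it, up to the next <h4>, hold its answer.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []            # list of [question, answer]
        self._current_tag = None   # 'h4' or 'p' while inside one

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.pairs.append(['', ''])
            self._current_tag = 'h4'
        elif tag == 'p' and self.pairs:
            # Paragraphs before the first question are ignored.
            self._current_tag = 'p'

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            if tag == 'p':
                # Separate consecutive answer paragraphs by a space.
                self.pairs[-1][1] += ' '
            self._current_tag = None

    def handle_data(self, data):
        if self._current_tag == 'h4':
            self.pairs[-1][0] += data
        elif self._current_tag == 'p':
            self.pairs[-1][1] += data


def html_to_jsonl(html):
    """Return one JSON string per question/answer pair."""
    parser = FaqParser()
    parser.feed(html)
    parser.close()
    return [
        # ' '.join(s.split()) collapses runs of whitespace and newlines.
        json.dumps({'input': ' '.join(q.split()),
                    'output': ' '.join(a.split())})
        for q, a in parser.pairs
    ]


if __name__ == '__main__':
    sample = '<h4>What is OpenMP?</h4><p>An API for parallel programming.</p>'
    for line in html_to_jsonl(sample):
        print(line)
```

The `input` and `output` keys follow the field names used by the script's JSONL output.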

File-Level Changes

Change	Details	Files
Added numerous dependencies to the environment.yml file to ensure the project has all the necessary libraries for data science tasks.	Added dependencies such as lxml, ca-certificates, certifi, openssl, and many more. Included libraries for data manipulation (e.g., scikit-learn, polars, pyarrow). Added libraries for visualization (e.g., altair, hvplot, panel, bokeh). Included libraries for geospatial data processing (e.g., pyproj, libspatialindex, geopandas-base, libgdal). Added libraries for web scraping and parsing (e.g., beautifulsoup4, lxml). Included libraries for cloud storage access (e.g., aws-sdk-cpp, azure-storage-blobs-cpp, libgoogle-cloud-storage).	`environment.yml`
Updated the Altair interaction notebook to use Vega-Lite v5.20.1 and Vega v5, and modified the selection logic.	Updated Vega-Lite version from 4.8.1 to 5.20.1. Updated Vega version to 5. Modified the JavaScript code to load Vega and Vega-Lite libraries from CDN. Updated the Vega-Lite schema URL. Changed selection type from `single` and `multi` to `selection_point`. Replaced `add_selection` with `add_params`.	`source-code/altair/altair_interaction.ipynb`
Improved the formatting of the polars notebook.	Added newlines and indentation to improve readability of the `group_by` aggregation in the polars notebook.	`source-code/polars/patient_data.ipynb`
Added a description of a new `preprocessing` script to the web-scraping README.	Added a description of the `preprocessing` script, which scrapes a web page containing FAQs and prints them in JSONL format.	`source-code/web-scraping/README.md`
Added a new notebook illustrating Copy-on-Write to the pandas README.	Added a description of the `copy_on_write.ipynb` notebook, which illustrates how data is shared between related notebooks and the role Copy-on-Write plays.	`source-code/pandas/README.md`
Removed the `hands-on` directory from the main README.	Removed the line referencing the `hands-on` directory.	`README.md`
Removed the `python_for_data_science_linux64_conda_specs.txt` file.	Removed the `python_for_data_science_linux64_conda_specs.txt` file.	`python_for_data_science_linux64_conda_specs.txt`
Added a new notebook illustrating Copy-on-Write to the pandas directory.	Added a new notebook `copy_on_write.ipynb` to the pandas directory. The notebook illustrates how data is shared between related notebooks and the role Copy-on-Write plays in order to prevent accidental data modifications in more than one dataframe.	`source-code/pandas/copy_on_write.ipynb`
Added a new script to preprocess the OpenMP FAQ to the web-scraping directory.	Added a new script `preprocess_openmp_faq.py` to the web-scraping directory. The script reads the OpenMP FAQ from an HTML file, and writes the output to standard output in JSONL format.	`source-code/web-scraping/preprocessing/preprocess_openmp_faq.py`
Added a new HTML file containing the OpenMP FAQ to the web-scraping directory.	Added a new HTML file `openmp_faq.html` to the web-scraping directory. The HTML file contains the OpenMP FAQ.	`source-code/web-scraping/preprocessing/openmp_faq.html`
Added a new JSONL file containing the OpenMP FAQ to the web-scraping directory.	Added a new JSONL file `openmp_faq.jsonl` to the web-scraping directory. The JSONL file contains the OpenMP FAQ in a structured format.	`source-code/web-scraping/preprocessing/openmp_faq.jsonl`
Removed the `hands-on/patient_data.ipynb` file.	Removed the `hands-on/patient_data.ipynb` file.	`hands-on/patient_data.ipynb`
Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/analytics.js` file.	Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/analytics.js` file.	`source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/analytics.js`
Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/disclaim-element.js` file.	Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/disclaim-element.js` file.	`source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/disclaim-element.js`
Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/graph-calc.js` file.	Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/graph-calc.js` file.	`source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/graph-calc.js`
Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery-1.4.2.min.js` file.	Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery-1.4.2.min.js` file.	`source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery-1.4.2.min.js`
Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.min.js` file.	Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.min.js` file.	`source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.min.js`
Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.tablesorter.min.js` file.	Removed the `source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.tablesorter.min.js` file.	`source-code/pandas/data/sea_levels/Obtaining Tide Gauge Data_files/jquery.tablesorter.min.js`

review-notebook-app · 2025-04-04T12:34:38Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

sourcery-ai

Hey @gjbex - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

The Google Analytics ID is hardcoded in the HTML. (link)
The Google Analytics ID is hardcoded in the HTML. (link)
The Google Analytics ID is hardcoded in the HTML. (link)
The Google Analytics ID is hardcoded in the HTML. (link)
reCAPTCHA site key is hardcoded in the HTML. (link)
reCAPTCHA site key is hardcoded in the HTML. (link)
reCAPTCHA site key is hardcoded in the HTML. (link)

Overall Comments:

The environment.yml file has a large number of dependencies; consider using a more minimal set of dependencies and letting conda resolve the rest.
The notebook diffs show changes to the vega-lite version; consider updating the notebook to use the latest version of vega-lite.

Here's what I looked at during the review

🟡 General issues: 4 issues found
🔴 Security: 7 blocking issues
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟢 Documentation: all looks good

sourcery-ai · 2025-04-04T12:35:56Z

source-code/web-scraping/preprocessing/preprocess_openmp_faq.py

+def clean_answer(text):
+    text = text.replace('\n', ' ').strip()
+    pos = text.find('Version 3.0 Last updated:')
+    text = text[:pos]
+    pos = text.find('(Quote from:')
+    return text[:pos].strip()


suggestion: Consider handling cases when the expected marker substring is not found.

The function calls text.find() and slices at the returned position. If a marker ('Version 3.0 Last updated:' or '(Quote from:') is absent, find() returns -1, so text[:pos] becomes text[:-1] and silently drops the last character. An explicit check for -1 makes the code more robust.

Suggested change

-def clean_answer(text):
-    text = text.replace('\n', ' ').strip()
-    pos = text.find('Version 3.0 Last updated:')
-    text = text[:pos]
-    pos = text.find('(Quote from:')
-    return text[:pos].strip()
+def clean_answer(text):
+    text = text.replace('\n', ' ').strip()
+    pos = text.find('Version 3.0 Last updated:')
+    if pos != -1:
+        text = text[:pos]
+    pos = text.find('(Quote from:')
+    if pos != -1:
+        text = text[:pos]
+    return text.strip()
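
As an alternative to the explicit `-1` checks, `str.partition` could be used: it returns the whole string as its first element when the separator is absent, so no position test is needed. The following is a minimal sketch, not the code merged in this pull request; the `markers` parameter is a hypothetical addition for illustration.

```python
def clean_answer(text, markers=('Version 3.0 Last updated:', '(Quote from:')):
    """Collapse newlines and cut the text at the first occurrence of each marker.

    The `markers` parameter is a hypothetical generalization of the two
    hardcoded marker strings in the original function.
    """
    text = text.replace('\n', ' ').strip()
    for marker in markers:
        # partition() yields (head, sep, tail); head is the full string
        # when the marker does not occur, so nothing is truncated.
        text = text.partition(marker)[0]
    return text.strip()


# The pitfall in the original: find() returns -1 for a missing marker,
# and slicing with [:-1] drops the last character.
sample = 'OpenMP is an API.'
assert sample[:sample.find('(Quote from:')] == 'OpenMP is an API'

# The partition-based version leaves marker-free text intact.
assert clean_answer(sample) == 'OpenMP is an API.'
```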

sourcery-ai · 2025-04-04T12:35:56Z

source-code/pandas/copy_on_write.ipynb

+   "source": [
+    "## Requirments"
+   ]


nitpick (typo): Typo in markdown header.

The header '## Requirments' contains a typo; it should be spelled '## Requirements'.

Suggested change

-   "source": [
-    "## Requirments"
-   ]
+   "source": [
+    "## Requirements"
+   ]

sourcery-ai · 2025-04-04T12:35:56Z

source-code/pandas/copy_on_write.ipynb

+   "source": [
+    "Indeed, the first column of `data2` was affected, its minimum and maximem values aore no longer approximately -1000 and 1000 as before, but rather -500 and 500 respectively."
+   ]


nitpick (typo): Typographical errors in descriptive text.

The markdown text contains typographical errors: 'maximem' should be 'maximum' and 'aore' should be 'are'. Correcting these will improve the professionalism of the notebook.

Suggested change

-   "source": [
-    "Indeed, the first column of `data2` was affected, its minimum and maximem values aore no longer approximately -1000 and 1000 as before, but rather -500 and 500 respectively."
-   ]
+   "source": [
+    "Indeed, the first column of `data2` was affected, its minimum and maximum values are no longer approximately -1000 and 1000 as before, but rather -500 and 500 respectively."
+   ]

sourcery-ai · 2025-04-04T12:35:57Z

source-code/pandas/README.md

+1. `copy_on_write.ipynb`: Jupyter notebook that illustrates how data is shared
+   between related notebooks and the role Copy-on-Write plays in order to
+   prevent accidental data modifications in more than one dataframe.


suggestion (typo): "Copy-on-Write" should likely be "copy-on-write".

sourcery-ai · 2025-04-04T12:35:57Z

source-code/web-scraping/preprocessing/openmp_faq.html

+<meta property="og:image:height" content="100"/>
+<meta property="og:image:type" content="image/png"/>
+<!-- This site uses the Google Analytics by MonsterInsights plugin v9.2.2 - Using Analytics tracking - https://www.monsterinsights.com/ -->
+<script src="//www.googletagmanager.com/gtag/js?id=G-WFCTSPN1R1"  data-cfasync="false" data-wpfc-render="false" type="text/javascript" async></script>


🚨 issue (security): The Google Analytics ID is hardcoded in the HTML.

sourcery-ai · 2025-04-04T12:35:57Z

source-code/web-scraping/preprocessing/openmp_faq.html

+/* ]]> */
+</script>
+<script type="text/javascript" src="https://www.openmp.org/wp-content/plugins/contact-form-7/includes/js/index.js?ver=6.0" id="contact-form-7-js"></script>
+<script type="text/javascript" src="https://www.google.com/recaptcha/api.js?render=6Lepn-wUAAAAAJLIq_PnJ7mEek8wpQyykB8Ktt2v&amp;ver=3.0" id="google-recaptcha-js"></script>


🚨 issue (security): reCAPTCHA site key is hardcoded in the HTML.

sourcery-ai · 2025-04-04T12:35:57Z

source-code/web-scraping/preprocessing/openmp_faq.html

+<script type="text/javascript" src="https://www.openmp.org/wp-includes/js/dist/vendor/wp-polyfill.min.js?ver=3.15.0" id="wp-polyfill-js"></script>
+<script type="text/javascript" id="wpcf7-recaptcha-js-before">
+/* <![CDATA[ */
+var wpcf7_recaptcha = {


🚨 issue (security): reCAPTCHA site key is hardcoded in the HTML.

sourcery-ai · 2025-04-04T12:35:57Z

source-code/web-scraping/preprocessing/openmp_faq.html

+<script type="text/javascript" id="wpcf7-recaptcha-js-before">
+/* <![CDATA[ */
+var wpcf7_recaptcha = {
+"sitekey": "6Lepn-wUAAAAAJLIq_PnJ7mEek8wpQyykB8Ktt2v",


🚨 issue (security): reCAPTCHA site key is hardcoded in the HTML.

sourcery-ai · 2025-04-04T12:35:57Z

source-code/web-scraping/preprocessing/preprocess_openmp_faq.py

+    answer = ''
+    next_tag = h4_tag.find_next('p')
+    while next_tag and next_tag.name == 'p':
+        answer += next_tag.text.strip() + ' '


suggestion (code-quality): Use f-string instead of string concatenation (use-fstring-for-concatenation)

Suggested change

-        answer += next_tag.text.strip() + ' '
+        answer += f'{next_tag.text.strip()} '

sourcery-ai · 2025-04-04T12:35:57Z

source-code/web-scraping/preprocessing/preprocess_openmp_faq.py

+    json_strs = []
+    for tag in doc.find_all(is_question_header):
+        json_strs.append(f'{{"input": "{get_question(tag)}", "output": "{get_answer(tag)}"}}')


suggestion (code-quality): Convert for loop into list comprehension (list-comprehension)

Suggested change

-    json_strs = []
-    for tag in doc.find_all(is_question_header):
-        json_strs.append(f'{{"input": "{get_question(tag)}", "output": "{get_answer(tag)}"}}')
+    json_strs = [
+        f'{{"input": "{get_question(tag)}", "output": "{get_answer(tag)}"}}'
+        for tag in doc.find_all(is_question_header)
+    ]
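
Both variants above build each JSON line with an f-string, which yields invalid JSON whenever a question or answer contains a double quote or a backslash. A possible alternative, sketched here under the assumption that the `input` and `output` keys are all the output needs, is to let `json.dumps` do the escaping. The helper name `to_jsonl_line` is hypothetical and not part of the script.

```python
import json


def to_jsonl_line(question, answer):
    """Serialize one question/answer pair as a single JSON line (hypothetical helper)."""
    # json.dumps escapes quotes, backslashes and control characters,
    # so the result can always be parsed back with json.loads.
    return json.dumps({'input': question, 'output': answer})


question = 'What does "OpenMP" stand for?'
answer = 'Open Multi-Processing.'

# The f-string approach breaks on the embedded double quotes.
naive = f'{{"input": "{question}", "output": "{answer}"}}'
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False
assert not naive_is_valid

# The json.dumps approach round-trips the same text.
line = to_jsonl_line(question, answer)
assert json.loads(line) == {'input': question, 'output': answer}
```

In the script, the list comprehension could then call such a helper with `get_question(tag)` and `get_answer(tag)` as arguments.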

gjbex added 9 commits November 28, 2024 13:49

Reformat code

78e4921

Add Beatiful Soup example

69348b5

Add Copy-on-Write discussion

4507272

Remove Javascript libraries

e2571f7

Some of these libraries get flagged as security vulnerabilities by GitHub dependency checks. Given that they are not required, removing is simpler than updating.

Add pyarrow as column storage

79608ba

Merge remote-tracking branch 'origin/development' into development

ff05618

Update for Altair 5.x

4e8548e

Remove hands-on material

55d551d

Although a link to Google Colab is a good idea, the data is not on Google Colab, making executing the notebook there impossible. The notebook could be adapted to deal with that, but for now, it is better to remove it.

Update environment

588924e

- Update all packages
- Add Altair

gjbex merged commit e24c567 into master Apr 4, 2025
