Commit e612ba1

update docs, fix whitespace, some lint fixes

danielfrees committed Sep 14, 2023
1 parent 100f9cb commit e612ba1
Showing 40 changed files with 1,579 additions and 1,517 deletions.
26 changes: 2 additions & 24 deletions README.md
@@ -40,7 +40,7 @@ Available on PyPI! Simply `pip install scrapemed`.

ScrapeMed is designed to make large-scale data science projects relying on PubMed Central (PMC) easy. The raw XML that can be downloaded from PMC is inconsistent and messy, and ScrapeMed aims to solve that problem at scale. ScrapeMed downloads, validates, cleans, and parses data from nearly all PMC articles into `Paper` objects, which can then be used to build datasets (`paperSet`s) or investigated in detail for literature reviews.

Beyond the heavy lifting performed behind the scenes by ScrapeMed to standardize data scraped from PMC, a number of features are included to make data science and literature review work easier. A few are listed below:

- `Paper`s can be queried with natural language [`.query()`], or simply chunked and embedded for storage in a vector DB [`.vectorize()`]. `Paper`s can also be converted to pandas Series easily [`.to_relational()`] for data science workflows.
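When a `Paper` is chunked and embedded for a vector DB, each chunk gets a unique id following the scheme visible in `scrapemed.paper` (`pmcid-{pmcid}-chunk-{index}`). A minimal sketch of that scheme (the helper name and sample PMCID below are illustrative only):

```python
def generate_chunk_id(pmcid: int, index: int) -> str:
    """Build the unique id for one text chunk of a PMC article.

    Mirrors the id scheme scrapemed uses when embedding Paper text:
    each chunk is keyed by the article's PMCID plus its position in
    the chunked text.
    """
    return f"pmcid-{pmcid}-chunk-{str(index)}"

# e.g. the first three chunks of a hypothetical article with PMCID 7067710:
ids = [generate_chunk_id(7067710, i) for i in range(3)]
print(ids)  # → ['pmcid-7067710-chunk-0', 'pmcid-7067710-chunk-1', 'pmcid-7067710-chunk-2']
```

Keying chunks by PMCID plus position keeps ids unique across articles, so query hits can be mapped back to their surrounding text.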

@@ -50,7 +50,7 @@ Beyond the heavy-lifting performed behind the scenes by ScrapeMed to standardiz

## Documentation

-[ScrapeMed documentation](https://scrapemed.readthedocs.io/en/latest/) is hosted on Read The Docs!
+The [docs](https://scrapemed.readthedocs.io/en/latest/) are hosted on Read The Docs!

## Sponsorship

@@ -59,26 +59,4 @@ Beyond the heavy-lifting performed behind the scenes by ScrapeMed to standardiz
If you'd like to sponsor a feature or donate to the project, reach out to me at danielfrees@g.ucla.edu.


## Developer Usage

*License: MIT*

Feel free to fork and continue work on the ScrapeMed package; it is licensed under the MIT license to promote collaboration, extension, and inheritance.

Make sure to create a conda environment and install the necessary requirements before developing this package.

e.g.: `$ conda create --name myenv --file requirements.txt`

Add a `.env` file in your base scrapemed directory with a variable defined as follows: `PMC_EMAIL=youremail@example.com`. This is necessary for several of the test scripts and may be useful for your development in general.
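A minimal sketch of consuming that variable (the `get_pmc_email` helper below is hypothetical, not part of scrapemed; it assumes the `.env` entry has been exported into the environment, e.g. via python-dotenv):

```python
import os

def get_pmc_email() -> str:
    """Fetch the PMC auth email from the environment.

    Assumes PMC_EMAIL has been exported, e.g. by loading the .env
    file with a tool such as python-dotenv before running the tests.
    """
    email = os.environ.get("PMC_EMAIL")
    if not email:
        raise RuntimeError("PMC_EMAIL is not set; add it to your .env file.")
    return email

os.environ["PMC_EMAIL"] = "youremail@example.com"  # stand-in for the .env entry
print(get_pmc_email())  # → youremail@example.com
```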

You will need to install clang++ for `chromadb` and `Paper` vectorization to work. You also need to make sure you have Python 3.10.2 or later installed and active in your dev environment.

***Now an overview of the package structure:***

Under `examples` you can find some example work using the scrapemed modules, which may provide some insight into usage possibilities.

Under `examples/data` you will find some example downloaded data (XML from PubMed Central). It is recommended that any data you download while working out of the notebooks go here. Downloads will also go here by default when passing `download=True` to the scrapemed module functions which allow you to do so.

Under `scrapemed/tests` you will find several Python scripts which can be run using pytest. If you also clone and update the `.github/workflows/test-scrapemed.yml` for your forked repo, these tests will be automatically run on `git push`. Under `scrapemed/tests/testdata` are some XML data crafted for the purpose of testing scrapemed. This data is necessary to run some of the testing scripts.
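Running the suite locally might then look like this (the env name `myenv` matches the example above; paths assume you work from the root of your clone):

```shell
# from the root of your scrapemed clone
conda activate myenv               # env created from requirements.txt
pytest scrapemed/tests             # run the whole suite
pytest scrapemed/tests -k paper    # or only the Paper-related tests
```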

Each of the scrapemed Python modules has a docstring at the top describing its general purpose and usage. All functions should also have descriptive docstrings and descriptions of input/output. Please contact me if any documentation is unclear.
Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/scrapemed.doctree
Binary file not shown.
16 changes: 8 additions & 8 deletions docs/build/html/_modules/index.html
@@ -9,25 +9,25 @@
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->

<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="../_static/documentation_options.js?v=e2a723ec"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
    <link rel="search" title="Search" href="../search.html" />
</head>

  <body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >



<a href="../index.html" class="icon icon-home">
scrapemed
</a>
@@ -66,7 +66,7 @@
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">

<h1>All modules for which code is available</h1>
<ul><li><a href="scrapemed/paper.html">scrapemed.paper</a></li>
<li><a href="scrapemed/paperSet.html">scrapemed.paperSet</a></li>
@@ -88,7 +88,7 @@ <h1>All modules for which code is available</h1>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.


</footer>
</div>
@@ -99,7 +99,7 @@ <h1>All modules for which code is available</h1>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>

</body>
</html>
46 changes: 23 additions & 23 deletions docs/build/html/_modules/scrapemed/paper.html
@@ -9,25 +9,25 @@
<!--[if lt IE 9]>
<script src="../../_static/js/html5shiv.min.js"></script>
<![endif]-->

<script src="../../_static/jquery.js?v=5d32c60e"></script>
<script src="../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="../../_static/documentation_options.js?v=e2a723ec"></script>
<script src="../../_static/doctools.js?v=888ff710"></script>
<script src="../../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../../genindex.html" />
    <link rel="search" title="Search" href="../../search.html" />
</head>

<body class="wy-body-for-nav">
  <body class="wy-body-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >



<a href="../../index.html" class="icon icon-home">
scrapemed
</a>
@@ -67,10 +67,10 @@
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">

<h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
<span></span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd">The scrapemed &quot;paper&quot; module is intended as the primary point of contact for</span>
<span class="sd">scrapemed end users.</span>

<span class="sd">Paper objects are defined here, as well end-user functionality for scraping data</span>
@@ -185,14 +185,14 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
<span class="k">def</span> <span class="nf">from_pmc</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">pmcid</span><span class="p">:</span><span class="nb">int</span><span class="p">,</span> <span class="n">email</span><span class="p">:</span><span class="nb">str</span><span class="p">,</span> <span class="n">download</span><span class="p">:</span><span class="nb">bool</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">validate</span><span class="p">:</span><span class="nb">bool</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">verbose</span><span class="p">:</span><span class="nb">bool</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">suppress_warnings</span><span class="p">:</span><span class="nb">bool</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">suppress_errors</span><span class="p">:</span><span class="nb">bool</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Generate a Paper from a pmcid. Specify your email for auth.</span>

<span class="sd"> [pmcid] - Unique PMCID for the article to parse.</span>
<span class="sd"> [email] - Provide your email address for authentication with PMC</span>
<span class="sd"> [download] - Whether or not to download the XML retrieved from PMC</span>
<span class="sd"> [validate] - Whether or not to validate the XML from PMC against NLM articleset 2.0 DTD (HIGHLY RECOMMENDED)</span>
<span class="sd"> [verbose] - Whether or not to have verbose output for testing</span>
<span class="sd"> [suppress_warnings] - Whether to suppress warnings while parsing XML.</span>
<span class="sd"> Note: Warnings are frequent, because of the variable nature of PMC XML data.</span>
<span class="sd"> Recommended to suppress when parsing many XMLs at once.</span>
<span class="sd"> [suppress_errors] - Return None on failed XML parsing, instead of raising an error.</span>
<span class="sd"> &quot;&quot;&quot;</span>
Expand Down Expand Up @@ -220,14 +220,14 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
<span class="sd"> :param int pmcid: PMCID for the XML. This is required intentionally, to ensure trustworthy unique indexing of PMC XMLs.</span>
<span class="sd"> :param ET.Element root: Root element of the PMC XML tree.</span>
<span class="sd"> :param bool verbose: Report verbose output or not. Intended for testing.</span>
<span class="sd"> :param bool suppress_warnings: Suppress warnings while parsing XML or not.</span>
<span class="sd"> Note: Warnings are frequent, because of the variable nature of PMC XML data.</span>
<span class="sd"> Recommended to suppress when parsing many XMLs at once.</span>
<span class="sd"> :param bool suppress_errors: Return None on failed XML parsing, instead of raising an error.</span>
<span class="sd"> Recommended to suppress when parsing many XMLs at once, unless failure is not an option.</span>

<span class="sd"> :returns: A Paper object initialized via the passed XML.</span>
<span class="sd"> :rtype: Paper</span>
<span class="sd"> &quot;&quot;&quot;</span>
<span class="n">paper_dict</span> <span class="o">=</span> <span class="n">parse</span><span class="o">.</span><span class="n">generate_paper_dict</span><span class="p">(</span><span class="n">pmcid</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="n">verbose</span><span class="p">,</span> <span class="n">suppress_warnings</span><span class="o">=</span><span class="n">suppress_warnings</span><span class="p">,</span> <span class="n">suppress_errors</span><span class="o">=</span><span class="n">suppress_errors</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">cls</span><span class="p">(</span><span class="n">paper_dict</span><span class="p">)</span></div>
@@ -324,7 +324,7 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> For two Paper objects to be equal, they must share the same PMCID and have the same date of last update.</span>

<span class="sd"> Two Papers may be exactly equal but be downloaded or parsed on different dates. These will not evaluate to equal.</span>
<span class="sd"> Simply compare Paper1.pmcid and Paper2.pmcid if that is your desired behavior.</span>

<span class="sd"> Note also that articles which are not open access on PMC may not have a PMCID, and a unique comparison will need to be made for these.</span>
@@ -338,7 +338,7 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
<a class="viewcode-back" href="../../scrapemed.html#scrapemed.paper.Paper.to_relational">[docs]</a>
<span class="k">def</span> <span class="nf">to_relational</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">:</span>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Generates a pandas Series representation of the paper. Some data will be lost,</span>
<span class="sd"> but most useful text data and metadata will be retained in the relational shape.</span>
<span class="sd"> &quot;&quot;&quot;</span>

@@ -387,7 +387,7 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>

<span class="k">def</span> <span class="nf">_serialize_df</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">df</span><span class="p">):</span>
<span class="k">return</span> <span class="n">df</span><span class="o">.</span><span class="n">to_html</span><span class="p">()</span>
<span class="c1">#---------------End Helper functions for to_relational---------------------</span>

<div class="viewcode-block" id="Paper.vectorize">
<a class="viewcode-back" href="../../scrapemed.html#scrapemed.paper.Paper.vectorize">[docs]</a>
@@ -451,7 +451,7 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Generate id for a PMC text chunk, using pmcid and index of the chunk.</span>
<span class="sd"> The chunk indices should be unique. Recommended to use indexes from the result</span>
<span class="sd"> of chunk model.</span>
<span class="sd"> &quot;&quot;&quot;</span>
<span class="k">return</span> <span class="sa">f</span><span class="s2">&quot;pmcid-</span><span class="si">{</span><span class="n">pmcid</span><span class="si">}</span><span class="s2">-chunk-</span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">index</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span>

@@ -484,16 +484,16 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
<a class="viewcode-back" href="../../scrapemed.html#scrapemed.paper.Paper.query">[docs]</a>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span><span class="nb">str</span><span class="p">,</span> <span class="n">n_results</span><span class="p">:</span><span class="nb">int</span> <span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_before</span><span class="p">:</span><span class="nb">int</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="n">n_after</span><span class="p">:</span><span class="nb">int</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span><span class="nb">str</span><span class="p">]:</span>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Query the paper with natural language questions.</span>
<span class="sd"> Input:</span>
<span class="sd"> [query] - string question</span>
<span class="sd"> [n_results] - number of most semantically similar paper sections to retrieve</span>
<span class="sd"> [n_before] - int, how many chunks before the match to include in combined output</span>
<span class="sd"> [n_after] - int, how many chunks after the match to include in combined output</span>

<span class="sd"> Output:</span>
<span class="sd"> Dict with key(s) = most semantically similar result chunk(s), and value(s) = Paper text(s)</span>
<span class="sd"> around the most semantically similar result chunk(s). Text length determined by</span>
<span class="sd"> the chunk size used in self.vectorize() and n_before and n_after.</span>
<span class="sd"> &quot;&quot;&quot;</span>

Expand Down Expand Up @@ -595,7 +595,7 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.


</footer>
</div>
@@ -606,7 +606,7 @@ <h1>Source code for scrapemed.paper</h1><div class="highlight"><pre>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>

</body>
</html>
