arXiv data pull #1
I downloaded the full dataset onto a machine in the Physics dept. about a year ago. I'll look into running the incremental update. My intuition is that mining the source would be easier because PDFs are such a pain in the ass and TeX is just text... why do you think it would be harder?
Hey, I sent out a message to the crew that did the URL link rot study for astrophysics arXiv stuff. Surprisingly, INSPIRE uses the PDFs to extract references instead of the source. They say they get better results. I asked for a link to the code. Impactstory and ORCID have APIs that explicitly tag code, I think. That will give us a highly curated author's view: a biased list, but with a strong signal. Similarly, Zenodo and figshare can have metadata and collections specific to code.
Kyle
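For the Zenodo angle, here is a minimal sketch of what a query could look like against Zenodo's public REST search endpoint. The `type=software` filter, the free-text query, and the response field names are assumptions based on their documented API and may need adjusting once we look at real responses.

```python
# Hedged sketch: query Zenodo's public search API for software records.
# Endpoint, `type=software` filter, and response shape are assumptions
# from Zenodo's REST API docs; verify against a real response.
import requests

ZENODO_SEARCH = "https://zenodo.org/api/records"  # public search endpoint

def find_software_records(query: str, size: int = 25) -> list[dict]:
    """Return basic metadata for Zenodo records tagged as software."""
    resp = requests.get(
        ZENODO_SEARCH,
        params={"q": query, "type": "software", "size": size},
        timeout=30,
    )
    resp.raise_for_status()
    hits = resp.json().get("hits", {}).get("hits", [])
    return [
        {
            "doi": h.get("doi"),
            "title": h.get("metadata", {}).get("title"),
            "related": h.get("metadata", {}).get("related_identifiers", []),
        }
        for h in hits
    ]

if __name__ == "__main__":
    # Illustrative free-text query; the real filter for arXiv-linked
    # software would go through the related_identifiers metadata.
    for rec in find_software_records("arxiv", size=5):
        print(rec["doi"], "-", rec["title"])
```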
TeX is just text, but parsing it correctly could be a substantial undertaking. For instance, we can't just mine the .bib files because not all entries get cited. We could crawl just the bits that get compiled down into the .bbl, but then we'd be in the business of compiling gigabytes of TeX, which could easily take weeks. Alternatively, parsing out links by grepping the PDFs could be easier (but lossier), since we're working directly with the rendered output; see the sketch below.
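For concreteness, here is a minimal sketch of the grep-the-PDFs route. It assumes `pdftotext` (poppler-utils) is installed and that the PDFs sit under a local `pdf/` directory; both are assumptions about our setup, and the regex is deliberately crude.

```python
# Hedged sketch of the "grep the rendered PDFs for links" approach: extract
# text with poppler's pdftotext and pull out anything that looks like a URL.
# Lossy (URLs broken across lines are missed), but avoids compiling any TeX.
import re
import subprocess
from pathlib import Path

URL_RE = re.compile(r'https?://[^\s<>"\)\]]+')

def links_in_pdf(pdf_path: Path) -> set[str]:
    """Run pdftotext (poppler-utils) and return the set of URL-like strings."""
    text = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],   # "-" sends extracted text to stdout
        capture_output=True, text=True, check=True,
    ).stdout
    return set(URL_RE.findall(text))

if __name__ == "__main__":
    # Hypothetical local mirror layout: pdf/<yymm>/<id>.pdf
    for pdf in Path("pdf").rglob("*.pdf"):
        for url in sorted(links_in_pdf(pdf)):
            print(f"{pdf}\t{url}")
```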
Link rot study is here: I'm thinking it would be good to separate some common arXiv data-mining parts into their own repository. I've also contacted Thorsten Schwander, a buddy of mine who was part of the arXiv team for several years and is a PDF-parsing expert for INSPIRE, to see if I can get my hands on code or if he has any words of wisdom.
arXiv makes its data available via S3 buckets, see: http://arxiv.org/help/bulk_data_s3
Some highlights from this page:
Questions:
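As a starting point, here is a minimal sketch of pulling from the bulk-data bucket described on that page with boto3. It assumes the requester-pays `arxiv` bucket in us-east-1 with AWS credentials configured locally; the exact key layout (the `src/` and `pdf/` tarballs and manifest names) is an assumption to verify against the page and the manifests.

```python
# Hedged sketch of pulling the arXiv bulk data with boto3. The bucket name
# ("arxiv"), region, and requester-pays setup follow the help page linked
# above; the key layout should be confirmed against the manifests.
# Note: requester pays means the transfer costs land on *our* AWS account.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

def list_tarballs(prefix: str = "src/", limit: int = 10) -> list[str]:
    """List the first few bulk-data objects under a prefix (requester pays)."""
    resp = s3.list_objects_v2(
        Bucket="arxiv", Prefix=prefix, MaxKeys=limit, RequestPayer="requester"
    )
    return [obj["Key"] for obj in resp.get("Contents", [])]

def fetch(key: str, dest: str) -> None:
    """Download one object from the arxiv bucket, paying the transfer cost."""
    s3.download_file(
        "arxiv", key, dest, ExtraArgs={"RequestPayer": "requester"}
    )

if __name__ == "__main__":
    for key in list_tarballs("src/"):
        print(key)
    # e.g. fetch("src/arXiv_src_manifest.xml", "arXiv_src_manifest.xml")
    # (manifest key name is an assumption; check the bulk-data help page)
```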