arXiv data pull #1
I downloaded the full dataset onto a machine in the Physics dept. about a year ago. I'll look into running the incremental update. My intuition is that mining the source would be easier because PDFs are such a pain in the ass and TeX is just text... why do you think it would be harder?
Hey, I sent out a message to the crew that did the URL link rot study for astrophysics arXiv stuff. Surprisingly, INSPIRE uses the PDFs to extract references instead of the source. They say they get better results. I asked for a link to the code. Impactstory and ORCID have APIs that explicitly tag code, I think. That will give us a highly curated author's view: a biased list, but with a strong signal. Similarly, Zenodo and figshare can have metadata and collections specific to code.
Kyle
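For the Zenodo angle, here is a minimal sketch of what a query could look like against Zenodo's public REST search endpoint. The `type=software` filter, the free-text query, and the response field names are assumptions based on their documented API and may need adjusting once we look at real responses.

```python
# Hedged sketch: query Zenodo's public search API for software records.
# Endpoint, `type=software` filter, and response shape are assumptions
# from Zenodo's REST API docs; verify against a real response.
import requests

ZENODO_SEARCH = "https://zenodo.org/api/records"  # public search endpoint

def find_software_records(query: str, size: int = 25) -> list[dict]:
    """Return basic metadata for Zenodo records tagged as software."""
    resp = requests.get(
        ZENODO_SEARCH,
        params={"q": query, "type": "software", "size": size},
        timeout=30,
    )
    resp.raise_for_status()
    hits = resp.json().get("hits", {}).get("hits", [])
    return [
        {
            "doi": h.get("doi"),
            "title": h.get("metadata", {}).get("title"),
            "related": h.get("metadata", {}).get("related_identifiers", []),
        }
        for h in hits
    ]

if __name__ == "__main__":
    # Illustrative free-text query; the real filter for arXiv-linked
    # software would go through the related_identifiers metadata.
    for rec in find_software_records("arxiv", size=5):
        print(rec["doi"], "-", rec["title"])
```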
TeX is just text, but parsing it correctly could be a substantial undertaking. For instance, we can't just mine the .bib files because not all entries get cited. We could crawl just the bits that get compiled down into the .bbl, but then we'd be in the business of compiling gigabytes of TeX, which could easily take weeks. Alternatively, parsing out links by grepping the PDFs could be easier (but lossier), since we're working directly with the rendered output; see the sketch below.
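For concreteness, here is a minimal sketch of the grep-the-PDFs route. It assumes `pdftotext` (poppler-utils) is installed and that the PDFs sit under a local `pdf/` directory; both are assumptions about our setup, and the regex is deliberately crude.

```python
# Hedged sketch of the "grep the rendered PDFs for links" approach: extract
# text with poppler's pdftotext and pull out anything that looks like a URL.
# Lossy (URLs broken across lines are missed), but avoids compiling any TeX.
import re
import subprocess
from pathlib import Path

URL_RE = re.compile(r'https?://[^\s<>"\)\]]+')

def links_in_pdf(pdf_path: Path) -> set[str]:
    """Run pdftotext (poppler-utils) and return the set of URL-like strings."""
    text = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],   # "-" sends extracted text to stdout
        capture_output=True, text=True, check=True,
    ).stdout
    return set(URL_RE.findall(text))

if __name__ == "__main__":
    # Hypothetical local mirror layout: pdf/<yymm>/<id>.pdf
    for pdf in Path("pdf").rglob("*.pdf"):
        for url in sorted(links_in_pdf(pdf)):
            print(f"{pdf}\t{url}")
```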
Link rot study is here: I'm thinking it would be good to separate some common arXiv data-mining parts into their own repository. I've also contacted Thorsten Schwander, a buddy of mine who was part of the arXiv team for several years and is a PDF-parsing expert for INSPIRE, to see if I can get my hands on code or if he has any words of wisdom.
arXiv makes its data available via S3 buckets, see: http://arxiv.org/help/bulk_data_s3
Some highlights from this page:
Questions:
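As a starting point, here is a minimal sketch of pulling from the bulk-data bucket described on that page with boto3. It assumes the requester-pays `arxiv` bucket in us-east-1 with AWS credentials configured locally; the exact key layout (the `src/` and `pdf/` tarballs and manifest names) is an assumption to verify against the page and the manifests.

```python
# Hedged sketch of pulling the arXiv bulk data with boto3. The bucket name
# ("arxiv"), region, and requester-pays setup follow the help page linked
# above; the key layout should be confirmed against the manifests.
# Note: requester pays means the transfer costs land on *our* AWS account.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

def list_tarballs(prefix: str = "src/", limit: int = 10) -> list[str]:
    """List the first few bulk-data objects under a prefix (requester pays)."""
    resp = s3.list_objects_v2(
        Bucket="arxiv", Prefix=prefix, MaxKeys=limit, RequestPayer="requester"
    )
    return [obj["Key"] for obj in resp.get("Contents", [])]

def fetch(key: str, dest: str) -> None:
    """Download one object from the arxiv bucket, paying the transfer cost."""
    s3.download_file(
        "arxiv", key, dest, ExtraArgs={"RequestPayer": "requester"}
    )

if __name__ == "__main__":
    for key in list_tarballs("src/"):
        print(key)
    # e.g. fetch("src/arXiv_src_manifest.xml", "arXiv_src_manifest.xml")
    # (manifest key name is an assumption; check the bulk-data help page)
```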