GLAM data for use with fastai course materials

Below are some datasets that could be used with the course materials. They have been suggested because they meet the following criteria:

  1. they are openly available
  2. most already have labels
  3. most are small enough to work with interactively

Contributing

If you know of other GLAM-related datasets that might work well, please feel free to make a pull request or open an issue. I would suggest restricting this list to things which meet the first criterion above, i.e. the dataset isn't behind a paywall/subscription.

Images 🖼

Classification

Internet Archive 'judge a book by its cover'

Classification of book covers into 'useful' or 'not useful'

iMet Collection 2020 - FGVC7

"Recognize artwork attributes from The Metropolitan Museum of Art"

iMet Collection 2019 - FGVC6

"Recognize artwork attributes from The Metropolitan Museum of Art"

Iconclass AI Test Set

"A test dataset and challenge to apply machine learning to collections described with the Iconclass classification system."

Object detection

Newspaper Navigator

"This dataset consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America."


Text 📖

Classification

UK Selective Web Archive Website Classification Dataset

"We are particularly interested in understanding whether high-level metadata like this can be used to train an appropriate automatic classification system so that we might use this manually generated dataset to partially automate the categorisation of our larger archives. We expect that a appropriate classifier might require more information about each site in order to produce reliable results, and are looking at augmenting this dataset with further information in the future."

Books divided by Genre from the Digitised 19th century books dataset

"A dataset derived from the Digitised 19th Century Books dataset which classifies the books by genre (Drama, Poetry, Prose, Music and unidentified)."


Tabular 🗂

Books divided by Genre from the Digitised 19th century books dataset

"A dataset derived from the Digitised 19th Century Books dataset which classifies the books by genre (Drama, Poetry, Prose, Music and unidentified)."

UK Selective Web Archive Website Classification Dataset

"We are particularly interested in understanding whether high-level metadata like this can be used to train an appropriate automatic classification system so that we might use this manually generated dataset to partially automate the categorisation of our larger archives. We expect that a appropriate classifier might require more information about each site in order to produce reliable results, and are looking at augmenting this dataset with further information in the future."
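
Since most of the datasets listed here come with labels, it can be useful to check how evenly those labels are distributed before training a classifier. The following is a minimal sketch using only the Python standard library. The `label_counts` helper, the `genre` column name, and the toy rows are all assumptions made for illustration; they are not taken from any of the datasets above, whose actual file layouts and column names should be checked first.

```python
from collections import Counter


def label_counts(rows, label_key="genre"):
    """Count how often each label occurs in an iterable of dict-like rows.

    `label_key` is a hypothetical column name; adjust it to the real dataset.
    """
    return Counter(row[label_key] for row in rows)


# Invented toy rows for illustration only; not real records from any dataset.
toy_rows = [
    {"title": "Example A", "genre": "Poetry"},
    {"title": "Example B", "genre": "Prose"},
    {"title": "Example C", "genre": "Prose"},
    {"title": "Example D", "genre": "Drama"},
]

counts = label_counts(toy_rows)

# Print each label with its count and share of the total, most common first.
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / len(toy_rows):.0%})")
```

With a real dataset, the rows could come from `csv.DictReader` or a similar reader. A heavily skewed distribution may suggest choosing a metric other than plain accuracy.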