'Seed' page images and label masks dataset of all ads in Softalk magazine based on MAGAZINEgts ground-truth format. Max_pixels is 1M.

SoftalkAppleProject/datasets_ml_all_ads_1M

datasets_ml_all_ads_1M

A MAGAZINEgts Ground-Truth Dataset for Softalk magazine (1980-84)

This is the 'seed' dataset of source page and label mask images of all ads in Softalk magazine based on the MAGAZINEgts ground-truth storage format. The Max_pixels setting is 1M pixels, and all page and label mask images are in PNG format.
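As a rough illustration of what a 1M Max_pixels setting implies for page dimensions, here is a minimal sketch. It assumes Max_pixels is an upper bound on width times height and that aspect ratio is preserved; the README does not spell out the exact resizing rule, and `fit_to_max_pixels` is a hypothetical helper, not part of the FactMiners Toolkit.

```python
import math

# Assumed meaning of Max_pixels: an upper bound on width * height.
MAX_PIXELS = 1_000_000


def fit_to_max_pixels(width, height, max_pixels=MAX_PIXELS):
    """Return (new_width, new_height) with the same aspect ratio and at
    most max_pixels total pixels. Smaller images are returned unchanged."""
    if width * height <= max_pixels:
        return width, height
    # Scale both sides by the same factor so the area shrinks to the cap.
    scale = math.sqrt(max_pixels / (width * height))
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    return new_width, new_height


if __name__ == "__main__":
    # A hypothetical 2000 x 3000 scan is reduced to fit under the cap.
    print(fit_to_max_pixels(2000, 3000))
```

Any bounding box measured on the original scan would need to be scaled by the same factor before it is drawn onto a resized label mask.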

There are currently 7,109 elements in what will eventually be a dataset of 7,157 ads appearing in Softalk magazine (1980-84). The dataset will be complete once the small number of remaining data inconsistencies is resolved. Even without these resolutions, this #ML model training dataset is ready for researcher use.

Update: Please see the section below on how the #MAGAZINEgts format supports generation of 'Non-case Computational Complement Subsets for Document Structure Model Training'.

Please see our #DATeCH2017 and #DATeCH2019 posters for additional information about the #MAGAZINEgts ground-truth format providing integrated complex document structure and content depiction models based on an ontological "stack" of #cidocCRM, FRBRoo, and PRESSoo standards.

The XML-based MAGAZINEgts file (~13+ MB) for Softalk magazine is here as part of the Softalk magazine collection at the Internet Archive. The growing set of page images and machine-learning label mask images for use with the MAGAZINEgts metamodel and metadata are available here on GitHub.

Small view of DATeCH2019 poster

This dataset is to be used for training machine learning models to recognize magazine advertisements. This dataset includes both actual and predicted bounding-box dimensions for each ad in the magazine. The predicted bounding-box is based on the PRESSoo Issuing Rules of the Softalk Advertising Model contained in the Metamodel partition of the MAGAZINEgts file.

Note that this dataset is the simplest 'seed' dataset which will initially be of limited value for model training. No distinction is being made based on ad size, shape, and position on the page. This 'seed' dataset can, however, be used to generate any number of alternative and more detailed model training datasets based on mapping these ads to various parameter-patterns of the PRESSoo Issuing Rules for the Advertising Model of Softalk magazine found in the Metamodel partition of the MAGAZINEgts file describing the Softalk magazine collection at the Internet Archive.

For example, this 'seed' dataset can be used to generate an #ML model-training dataset that lets a model learn how an advertisement's size and shape relate to the allowable positions for that ad on a page. Using these page images and the 1M max_pixel all_ads dataset elements in the Metadata partition of the Softalk magazine MAGAZINEgts file, a new set of appropriately-colored label masks can be generated using the ad_spec bounding-box locations given by the ground-truth 'actual' measures in this 'seed' dataset.
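The mask-generation step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Toolkit's implementation: the bounding box is assumed to be `(x, y, width, height)` in pixels with a top-left origin, the mask is a list of rows rather than a PNG, and a single integer `label` stands in for whatever color a derived dataset assigns to a given ad size, shape, or position class. The function name `make_label_mask` is hypothetical.

```python
def make_label_mask(page_width, page_height, bbox, label=1, background=0):
    """Build a page-sized mask (list of rows) with bbox filled by label.

    bbox is assumed to be (x, y, width, height) in pixels, top-left origin.
    The box is clipped to the page so an out-of-range spec cannot raise.
    """
    x, y, w, h = bbox
    left, top = max(0, x), max(0, y)
    right, bottom = min(page_width, x + w), min(page_height, y + h)

    mask = [[background] * page_width for _ in range(page_height)]
    for row in range(top, bottom):
        for col in range(left, right):
            mask[row][col] = label
    return mask


if __name__ == "__main__":
    # Tiny 6 x 4 "page" with a 3 x 2 ad whose top-left corner is at (1, 1).
    for row in make_label_mask(6, 4, (1, 1, 3, 2)):
        print(row)
```

Choosing a different `label` value per ad class is one way a single set of ground-truth boxes could yield several alternative training datasets.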

We intend to integrate Stanford DAWN's Snorkel framework into the FactMiners Toolkit to handle the labeling, transformation, and slicing functions that such model-training dataset generation requires. In the meantime, here is a screenshot of an Excel spreadsheet PivotTable of the distribution of advertisements in the 48 issues of Softalk magazine:

Screenshot of Excel PivotTable showing distribution of ads in Softalk

NOTE: This PivotTable shows an early total of 7,164 ads in the Ground-Truth branch of the #MAGAZINEgts DocumentStructure partition. This early total included 'dirty data' entries, including specs for known ads that appear on pages missing from the current state of the scan repository. These "to be resolved" entries are not included in this #MachineLearning model-training dataset. This early PivotTable-based examination, however, still closely reflects the distribution of items in this #ML model training dataset.

Non-case Computational Complement Subsets for Document Structure Model Training

This 'seed' dataset of all advertisements in Softalk magazine includes 2,288 non-case examples, covering all Softalk pages WITHOUT an advertisement on them. These page images and their associated 'blank' label images are found in a sibling subdirectory, noncase, which contains its own images and label subdirectories. Model trainers may pull matched pairs of these non-case entries into their training, evaluation, and test subsets to create balanced distributions of case and non-case training elements.
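One way to mix case and non-case pairs into a single subset is sketched below. The helper `balanced_pairs`, its `noncase_ratio` parameter, and the file names in the usage example are all hypothetical; only the idea of pulling matched (page image, label image) pairs from the noncase subdirectory comes from this README.

```python
import random


def balanced_pairs(case_pairs, noncase_pairs, noncase_ratio=1.0, seed=0):
    """Mix all case pairs with a random sample of non-case pairs.

    Each pair is an (image_path, label_path) tuple. noncase_ratio is the
    desired number of non-case pairs per case pair, capped by how many
    non-case pairs are actually available.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    wanted = min(len(noncase_pairs), round(len(case_pairs) * noncase_ratio))
    mixed = list(case_pairs) + rng.sample(list(noncase_pairs), wanted)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    case = [(f"images/ad_{i}.png", f"labels/ad_{i}.png") for i in range(4)]
    noncase = [(f"noncase/images/pg_{i}.png", f"noncase/labels/pg_{i}.png")
               for i in range(10)]
    for pair in balanced_pairs(case, noncase, noncase_ratio=0.5):
        print(pair)
```

Applying the same function separately to the training, evaluation, and test partitions keeps the case to non-case proportion consistent across them.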

While considering the need to generate the non-case subsets needed for #MachineLearning model training, I was forced to consider the boundary between explicit and computational metadata as part of the #MAGAZINEgts ground-truth storage format. As you will see when exploring the softalk_publication.xml file, there is no explicit branch describing the non-case subset in the all_ads ML dataset in the Metadata partition. This lack of an explicit stored representation of the non-case subset reflects the set-theoretic nature of this and similar subsets. That is, the simplest description of the non-case elements is that they are "everything but" the document structure of interest in the "main" model training dataset. So, in this case of identifying the subset membership for pages that are not cited in the Advertisers Index of Softalk, we simply need to compute the complement of the AdIndex, which encompasses the complete set of Softalk magazine pages with advertising. For each of these complementary non-case instances, the bounding-box dimensions of the non-existent document structure, in this case an advertisement, are a "nothing to see here" (0, 0, 0, 0) rectangle. Both of these membership-defining elements for the non-case subset are computable from the explicit data in the Document Structure and Metamodel partitions of the #MAGAZINEgts format.
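The set-theoretic idea can be expressed as a short sketch. It is not the Toolkit's own code: `noncase_entries` is a hypothetical helper, and the page identifiers are made up. It only assumes that the full list of page identifiers and the list of pages carrying ads have already been extracted from the #MAGAZINEgts file.

```python
EMPTY_BBOX = (0, 0, 0, 0)  # the "nothing to see here" rectangle


def noncase_entries(all_pages, ad_pages):
    """Return {page_id: EMPTY_BBOX} for every page that carries no ad.

    The non-case subset is the complement of the ad pages within the
    full set of pages, so it is computed rather than stored.
    """
    complement = set(all_pages) - set(ad_pages)
    return {page: EMPTY_BBOX for page in sorted(complement)}


if __name__ == "__main__":
    # Hypothetical page identifiers, for illustration only.
    pages = ["p001", "p002", "p003", "p004"]
    with_ads = ["p002", "p004"]
    print(noncase_entries(pages, with_ads))
```

Because the result is derived entirely from data that is already explicit, nothing about the non-case subset needs to be written back into the XML file.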

For example, the full non-case dataset for this 'seed' all_ads model-training dataset can be computationally generated by reference to the Metadata/Leaf2ppg_map and the DocumentStructure/Advertisement/AdIndex branches of the #MAGAZINEgts file for Softalk magazine. The generation of the all_ads non-case dataset for Softalk advertisements was added to the FactMiners Toolkit through the addition of a single XQuery query against the softalk_publication.xml file -- actually done via a direct query against the Toolkit's BaseX XML database -- together with the addition of a 22-line method, generate_structure_noncase_dataset, to the tool's Python implementation.

Note: As per the PRESSoo Issuing Rules for the Advertising Model of Softalk magazine described in the Metamodel partition of the #MAGAZINEgts file, Softalk's Ad Index is found in the DocumentStructure branch of this Reference Model instance of this ground-truth storage format. This is because each issue of Softalk had a printed Advertiser Index in addition to such structures as the Table of Contents, etc. Most researcher- or tool-generated indices for other document structures of the magazine will be of the derived type, and therefore these non-print indices will be found in the Metadata partition of a #MAGAZINEgts file.

More as it evolves...

-- Jim Salmons -- FactMiners and The Softalk Apple Project, Broomfield, Colorado USA
