Update multi-table dataset list #535

@R-Palazzo

Problem Description

Currently, the monthly multi-table benchmark runs on 7 demo datasets. We aim to expand this by including all datasets from the following list:

['WebKP',
 'DCG',
 'UW_std',
 'Same_gen',
 'CORA',
 'got_families',
 'SalesDB',
 'UTube',
 'Student_loan',
 'Hepatitis_std',
 'Elti',
 'Bupa',
 'Toxicology',
 'imdb_ijs',
 'ftp',
 'imdb_small',
 'imdb_MovieLens',
 'Pima',
 'university',
 'legalActs',
 'Dunur',
 'Mesh',
 'world',
 'airbnb-simplified',
 'trains',
 'FNHK',
 'fake_hotels',
 'SAT',
 'genes',
 'Biodegradability',
 'Pyrimidine',
 'mutagenesis',
 'restbase',
 'Triazine',
 'Carcinogenesis',
 'fake_hotels_extended',
 'Mooney_Family',
 'PTE',
 'Facebook',
 'multi_table_ID_demo_dataset',
 'SAP',
 'Chess',
 'Countries',
 'NCAA',
 'Atherosclerosis',
 'nations',
 'TubePricing',
 'financial',
 'Accidents',
 'MuskSmall',
 'NBA',
 'AustralianFootball',
 'PremierLeague',
 'OMOP_CDM_dayz']

Expected behavior

Pass the 'sdv_datasets' parameter, set to the list of datasets above, when running the multi-table benchmark.

for synthesizer_group in MODALITY_TO_SETUP[modality]['synthesizers_split']:
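A minimal sketch of how the dataset list might be threaded through that loop. Only `MODALITY_TO_SETUP`, `'synthesizers_split'`, and `'sdv_datasets'` come from this issue. The `run_benchmark` stand-in, the `run_all` helper, the extra `'sdv_datasets'` key in the setup dict, and the synthesizer names are hypothetical, and the dataset list is shortened for readability.

```python
# Hypothetical sketch: everything except MODALITY_TO_SETUP, 'synthesizers_split'
# and the 'sdv_datasets' parameter name is an illustrative assumption.

# Shortened here; the real change would use the full list from this issue.
MULTI_TABLE_DATASETS = ['WebKP', 'DCG', 'UW_std']

MODALITY_TO_SETUP = {
    'multi_table': {
        # Assumed shape: each entry is one group of synthesizers run together.
        'synthesizers_split': [['SynthesizerA'], ['SynthesizerB', 'SynthesizerC']],
        # Assumed location for the dataset list; it could live elsewhere.
        'sdv_datasets': MULTI_TABLE_DATASETS,
    },
}


def run_benchmark(synthesizers, sdv_datasets):
    """Stand-in for the real benchmark call; it only records its arguments."""
    return {'synthesizers': list(synthesizers), 'sdv_datasets': list(sdv_datasets)}


def run_all(modality):
    """Run the stand-in benchmark once per synthesizer group."""
    setup = MODALITY_TO_SETUP[modality]
    results = []
    for synthesizer_group in setup['synthesizers_split']:
        # The same dataset list is passed for every synthesizer group.
        results.append(
            run_benchmark(
                synthesizers=synthesizer_group,
                sdv_datasets=setup['sdv_datasets'],
            )
        )
    return results
```

Keeping the list next to the per-modality setup would let the single-table benchmark keep its own dataset selection, though whether the project stores it there is an open design choice.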

Additional context

All of these datasets are publicly available through SDV.
