@plato-12

Add spam_dataset from Elements of Statistical Learning (GSoC-25 tasks)

Description

This PR adds a dataset implementation for the spam email classification dataset from the book "The Elements of Statistical Learning". The dataset contains 4601 emails, each described by 57 features, with a binary label indicating whether the email is spam.

Features

  • Implementation of spam_dataset() function with documentation
  • Test suite to verify dataset functionality
  • Example vignette showing how to use the dataset for a classification task

Use Case

This dataset is useful for binary classification exercises and demos, as it is well known in the statistical learning community. It is small compared to image datasets but still provides a realistic classification problem.

Testing

  • All tests for the new dataset pass
  • Manual verification of the dataset loader and example usage has been performed

@dfalbel
Member

dfalbel commented Mar 18, 2025

Hi @plato-12,

Can you direct your PR to torchdatasets? https://github.com/mlverse/torchdatasets

@plato-12 plato-12 closed this by deleting the head repository Apr 10, 2025