Skip to content

Add tabular learning methods #141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Aug 22, 2021

Conversation

manikyabard
Copy link
Contributor

Adds the implementation of DLPipelines.jl LearningMethod interface for tabular regression and tabular classification tasks.

@lorenzoh
Copy link
Member

lorenzoh commented Aug 10, 2021

Some notes on integration with the data block API

Ideally the table-specific data processing functionality would be encapsulated in an Encoding called TabularPreprocessing that works on a TableRow Block. If the target variable is separated from the row, we could then write the learning tasks as:

# tabular classification
BlockMethod(
    (TableRow(catcols, contcols), Label(classes)),
    (
        TabularPreprocessing(),  # only transforms `TableRow`
        OneHot(),  # only transforms `Label`
    )
)

# tabular regression
BlockMethod(
    (TableRow(catcols, contcols), Continuous(n)),
    (
        TabularPreprocessing(),  # only transforms `TableRow`
    )
)

This would require the following changes:

  • instead of using data containers with observations row, separate out target variable such that one observation is (row, target)

  • move preprocessing into an encoding TabularPreprocessing

  • implement block TabularRow

    Block needs to hold information on which columns are categorical and continuous, so something like

    struct TabularRow <: Block
        catcols
        contcols
    end
  • implement block Continuous for regression targets

@manikyabard manikyabard force-pushed the manikyabard/tabularmethods branch from 26bb2c4 to 9b92a28 Compare August 12, 2021 06:00
@@ -246,7 +246,7 @@ function testencoding(encoding, block, data = mockblock(block))
@test checkblock(outblock, outdata)

# Test decoding (if supported) works correctly
inblock = decodedblock(encoding, outblock, true)
inblock = decodedblock(encoding, outblock)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the encodings for which decoding isn't supported yet, like TabularTransforms (I'll rename this to TabularPreprocessing), it will give incorrect results I think.
The encoded block after applying TabularTransforms on a TableRow is EncodedTableRow, and if decoding was supported, then the decoded block should be a TableRow again. But as it isn't supported, using the fill option would make it so that EncodedTableRow is returned again (instead of nothing through the fallback method) which wouldn't be correct.
The tests after this could fail because inblock (which would be the encoded block in this situation), doesn't have to be the same as the original block. Also I'm not sure if the if condition would ever be false as of now.

catcols, contcols
end

function gettransforms(td::Datasets.TableDataset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use this in a constructor for TabularTransform so you can call it like TabularTransform(TableDataset(...))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, so something like this?

TabularTransform(td::TableDataset) = TabularTransform(gettransforms(td))

Copy link
Member

@lorenzoh lorenzoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good 👍

@lorenzoh
Copy link
Member

Anything else that is missing before this can be merged?

Only thing I can think of are simple BlockMethod wrappers like those in src/fasterai/learningmethods.jl? Each should have a constructor that takes just the blocks i.e. ::Tuple{<:TableRow, <:Continuous} and one convenience constructor. They can then be registered with some block types so you can find them using findlearningmethods(s. the same file for how).

Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more small nitpicks

@manikyabard manikyabard force-pushed the manikyabard/tabularmethods branch from a5294a5 to f95149c Compare August 18, 2021 05:45
@manikyabard
Copy link
Contributor Author

I think the tests are failing because the model PR isn't merged yet.

@manikyabard manikyabard marked this pull request as ready for review August 19, 2021 04:37
@darsnack darsnack requested a review from lorenzoh August 19, 2021 13:51
@darsnack
Copy link
Member

Is there a reason that decode(::Encoding, ::Block) defaults to nothing (same for encode)? I get that it allows us to distinguish between when a block is modified but happens to be the same type vs. when it is not affected by Encoding. But is there value in tracking that information? Could we just default to an identity map like

encodedblock(::Encoding, block::Block) = block
decodedblock(::Encoding, block::Block) = block

This seems intuitive to me. Unless we use the nothing information later in the pipeline.

@lorenzoh
Copy link
Member

Is there a reason that decode(::Encoding, ::Block) defaults to nothing (same for encode)? I get that it allows us to distinguish between when a block is modified but happens to be the same type vs. when it is not affected by Encoding. But is there value in tracking that information? Could we just default to an identity map like

encodedblock(::Encoding, block::Block) = block
decodedblock(::Encoding, block::Block) = block

This seems intuitive to me. Unless we use the nothing information later in the pipeline.

That's the only way to know that data was not changed without running the encoding. encode and decode themselves do have an identity as the default, encodedblock and decodedblock just handle the block info.

@lorenzoh
Copy link
Member

This looks good.

Are the tests failing due to the issue with the decoding not being implemented? In that case, we could add a definition for justdecodedblock without implementing decode. Or add a decode = true kwarg to testencoding so that it only does the decoding tests if decode == true.

@manikyabard
Copy link
Contributor Author

This looks good.

Are the tests failing due to the issue with the decoding not being implemented? In that case, we could add a definition for justdecodedblock without implementing decode. Or add a decode = true kwarg to testencoding so that it only does the decoding tests if decode == true.

I think the tests might be failing because methodmodel is being tested as well and the model code isn't present in this branch yet.

@manikyabard manikyabard force-pushed the manikyabard/tabularmethods branch from 9ff64ae to d4abcc1 Compare August 22, 2021 18:05
@lorenzoh
Copy link
Member

maybe some minor doc improvements could still be made, but let's do that in another PR that also integrates the notebook with the documentation. We still have time for that before the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants