Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify API for FeaturizeText ? #2460

Closed
abgoswam opened this issue Feb 7, 2019 · 8 comments
Closed

Modify API for FeaturizeText ? #2460

abgoswam opened this issue Feb 7, 2019 · 8 comments
Assignees
Labels
API Issues pertaining the friendly API
Milestone

Comments

@abgoswam
Copy link
Member

abgoswam commented Feb 7, 2019

In the MLContext for the text featurizer, the input column names are taken as a IEnumerable

public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog,
string outputColumnName,
IEnumerable<string> inputColumnNames,
TextFeaturizingEstimator.Options options)

#2394 (comment) recommends making them params instead.

Should we modify this API ?

@sfilipi

@sfilipi
Copy link
Member

sfilipi commented Feb 7, 2019

The TextFeaturizingEstimator was modeled like the learners. IMO should follow the same pattern of having Options..

@abgoswam
Copy link
Member Author

abgoswam commented Feb 8, 2019

@sfilipi . Just to clarify .. we do follow the pattern of passing Options for the advanced options for the algorithm.

The issue was created to discuss your comment of whether we should use IEnumerable<string> for the inputColumnNames. I believe your suggestion was to not use IEnumerable and use params instead ?

@justinormont
Copy link
Contributor

Last I looked, we also couldn't change the ngram/chargram lengths, which is quite odd as this is the main text transform hyperparameter which shows gains on text problems.

We should be encouraging users to modify this hyperparameter.

You can see a good chart of the sensitivity to ngram/chargran lengths in #2305:

image
Each line/color represents a certain ngram+chargram length with the pareto frontier highlighted; the connected line varies with a sweep across iter=N. The fastest results are to the right, and the best accuracy is at the top, hence points to the top right are best.

@shauheen
Copy link
Contributor

#838 related?

@abgoswam
Copy link
Member Author

abgoswam commented Feb 23, 2019

@shauheen . This is not related to #838 . Both #838 and Justin's comment above are slightly orthogonal to the original issue.


The FeaturizeText transform estimator can take in a set of column names as input, represented as IEnumerable<string> inputColumnNames in the API above. Note this API was part of an earlier commit e3830910 itself , related to the input , output ordering of column names.

In one of the recent PRs, @sfilipi was curious if we should change the API signature to take in the inputColumnNames as params string[] inputColumnNames instead of IEnumerable<string> inputColumnNames. This issue was created to track that comment : #2394 (comment)

I suggest we keep the API as-is .

  • A params parameter must be the last parameter in a formal parameter list. So, if we make it params string[], the API would look like :
public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog, 
     string outputColumnName, 
     TextFeaturizingEstimator.Options options,
     params string[] inputColumnNames) 
  • Note how this API breaks the convention we follow in the other transform estimators : the outputColumnName should be immediately followed by inputColumnNames .

@sfilipi . Should I close this issue ?

@wschin
Copy link
Member

wschin commented Mar 1, 2019

@abgoswam 's suggestion sounds reasonable to me. Making it params means we will only have one variable-length argument, which is input columns' names in the original comment. But why can't output names be a param?

@abgoswam
Copy link
Member Author

abgoswam commented Mar 1, 2019

@wschin .. could you give an example. not sure what you mean by output names being a param ?

@wschin
Copy link
Member

wschin commented Mar 1, 2019

There are one-to-one mapping, many-to-one mapping, and many-to-many maaping. If we use params to denote input names, why shouldn't we do the same for output names? Using params also makes our signature less typed, so I don't like it.

@shauheen shauheen added this to the 0319 milestone Mar 4, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

No branches or pull requests

5 participants