Initial implementation of the WordTokenizer transform. #296
Merged
Changes from all commits (10 commits):
9f66e3b  Initial test of WordTokenizer.
b802844  Merge branch 'master' into word-tokenizer
a0e3640  Remove debug code that accidentally made it through the merge.
414fb4b  Update the WordTokenizer_df example.
cc99e17  Remove unnecessary import from WordTokenizer_df.
68c739e  Add WordTokenizer example.
1169fe0  Add initial unit test for WordTokenizer.
954e8b6  Excluded WordTokenizer from most tests in test_estimator_checks.
6d293bc  Merge branch 'master' into word-tokenizer
10a384d  Whitespace change to restart ci run. Mac run lost communication.
WordTokenizer example (new file)

@@ -0,0 +1,32 @@
###############################################################################
# WordTokenizer

from nimbusml import Pipeline, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.preprocessing.text import WordTokenizer

# data input (as a FileDataStream)
path = get_dataset("wiki_detox_train").as_filepath()

data = FileDataStream.read_csv(path, sep='\t')
print(data.head())
#    Sentiment                                      SentimentText
# 0          1  ==RUDE== Dude, you are rude upload that carl p...
# 1          1  == OK! == IM GOING TO VANDALIZE WILD ONES WIK...
# 2          1  Stop trolling, zapatancas, calling me a liar m...
# 3          1  ==You're cool== You seem like a really cool g...
# 4          1  ::::: Why are you threatening me? I'm not bein...

tokenize = WordTokenizer(char_array_term_separators=[" "]) << {'wt': 'SentimentText'}
pipeline = Pipeline([tokenize])

tokenize.fit(data)
y = tokenize.transform(data)

print(y.drop(labels='SentimentText', axis=1).head())
#    Sentiment    wt.000     wt.001       wt.002   wt.003       wt.004  wt.005 ... wt.366 wt.367 wt.368 wt.369 wt.370 wt.371 wt.372
# 0          1  ==RUDE==      Dude,          you      are         rude  upload ...   None   None   None   None   None   None   None
# 1          1        ==        OK!           ==       IM        GOING      TO ...   None   None   None   None   None   None   None
# 2          1      Stop  trolling,  zapatancas,  calling           me       a ...   None   None   None   None   None   None   None
# 3          1  ==You're     cool==          You     seem         like       a ...   None   None   None   None   None   None   None
# 4          1     :::::        Why          are      you  threatening     me? ...   None   None   None   None   None   None   None
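The transform widens the frame to one column per token position (wt.000 through wt.372 above), padding shorter rows with None. If a single list-valued column is more convenient downstream, the token columns can be folded back up with plain pandas; this is a post-processing sketch, not part of the nimbusml API:

# Fold the wt.* token columns back into one token list per row (pandas only).
token_cols = [c for c in y.columns if c.startswith('wt.')]
tokens_per_row = y[token_cols].apply(
    lambda row: [t for t in row if t is not None], axis=1)
print(tokens_per_row.head())  # each entry is the list of tokens for that row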
33 changes: 33 additions & 0 deletions
src/python/nimbusml/examples/examples_from_dataframe/WordTokenizer_df.py

@@ -0,0 +1,33 @@
###############################################################################
# WordTokenizer

import pandas
from nimbusml import Pipeline
from nimbusml.preprocessing.text import WordTokenizer

# create the data
customer_reviews = pandas.DataFrame(data=dict(review=[
    "I really did not like the taste of it",
    "It was surprisingly quite good!",
    "I will never ever ever go to that place again!!",
    "The best ever!! It was amazingly good and super fast",
    "I wish I had gone earlier, it was that great",
    "somewhat dissapointing. I'd probably wont try again",
    "Never visit again... rascals!"]))

tokenize = WordTokenizer(char_array_term_separators=[" ", "n"]) << 'review'

pipeline = Pipeline([tokenize])

tokenize.fit(customer_reviews)
y = tokenize.transform(customer_reviews)

print(y)
#   review.00  review.01  review.02  review.03  review.04  review.05  review.06  review.07  review.08  review.09  review.10  review.11
# 0         I     really        did         ot       like        the      taste         of         it       None       None       None
# 1        It        was   surprisi        gly      quite      good!       None       None       None       None       None       None
# 2         I       will       ever       ever       ever         go         to       that      place       agai         !!       None
# 3       The       best     ever!!         It        was      amazi        gly       good          a          d      super       fast
# 4         I       wish          I        had         go          e   earlier,         it        was       that      great       None
# 5  somewhat   dissapoi         ti         g.        I'd   probably         wo          t        try       agai       None       None
# 6     Never      visit       agai        ...   rascals!       None       None       None       None       None       None       None
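Note the effect of including "n" in char_array_term_separators: every "n" becomes a token boundary and is removed, which is why "not" comes out as "ot" and "surprisingly" as "surprisi" / "gly" above, and why empty tokens (e.g. the one before "ot") do not appear at all. A minimal pure-Python sketch of the same splitting rule, shown only to illustrate the semantics; the real work happens inside the ML.NET transform:

import re

def split_like_wordtokenizer(text, separators=(" ", "n")):
    # Split on any separator character and drop empty tokens,
    # mirroring the behavior visible in the example output above.
    pattern = "[" + re.escape("".join(separators)) + "]"
    return [tok for tok in re.split(pattern, text) if tok]

print(split_like_wordtokenizer("I really did not like the taste of it"))
# ['I', 'really', 'did', 'ot', 'like', 'the', 'taste', 'of', 'it']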
89 changes: 89 additions & 0 deletions
src/python/nimbusml/internal/core/preprocessing/text/wordtokenizer.py

@@ -0,0 +1,89 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# --------------------------------------------------------------------------------------------
# - Generated by tools/entrypoint_compiler.py: do not edit by hand
"""
WordTokenizer
"""

__all__ = ["WordTokenizer"]


from ....entrypoints.transforms_wordtokenizer import transforms_wordtokenizer
from ....utils.utils import trace
from ...base_pipeline_item import BasePipelineItem, DefaultSignature


class WordTokenizer(BasePipelineItem, DefaultSignature):
    """
    **Description**
        The input to this transform is text, and the output is a vector of
        text containing the words (tokens) in the original text. The
        separator is space, but can be specified as any other character
        (or multiple characters) if needed.

    :param char_array_term_separators: Array of single character term
        separator(s). By default uses space character separator.

    :param params: Additional arguments sent to compute engine.

    """

    @trace
    def __init__(
            self,
            char_array_term_separators=None,
            **params):
        BasePipelineItem.__init__(
            self, type='transform', **params)

        self.char_array_term_separators = char_array_term_separators

    @property
    def _entrypoint(self):
        return transforms_wordtokenizer

    @trace
    def _get_node(self, **all_args):

        input_columns = self.input
        if input_columns is None and 'input' in all_args:
            input_columns = all_args['input']
        if 'input' in all_args:
            all_args.pop('input')

        output_columns = self.output
        if output_columns is None and 'output' in all_args:
            output_columns = all_args['output']
        if 'output' in all_args:
            all_args.pop('output')

        # validate input
        if input_columns is None:
            raise ValueError(
                "'None' input passed when it cannot be none.")

        if not isinstance(input_columns, list):
            raise ValueError(
                "input has to be a list of strings, instead got %s" %
                type(input_columns))

        # validate output
        if output_columns is None:
            output_columns = input_columns

        if not isinstance(output_columns, list):
            raise ValueError(
                "output has to be a list of strings, instead got %s" %
                type(output_columns))

        algo_args = dict(
            column=[
                dict(Source=i, Name=o)
                for i, o in zip(input_columns, output_columns)
            ] if input_columns else None,
            char_array_term_separators=self.char_array_term_separators)

        all_args.update(algo_args)
        return self._entrypoint(**all_args)
76 changes: 76 additions & 0 deletions
src/python/nimbusml/internal/entrypoints/transforms_wordtokenizer.py

@@ -0,0 +1,76 @@
# - Generated by tools/entrypoint_compiler.py: do not edit by hand
"""
Transforms.WordTokenizer
"""


from ..utils.entrypoints import EntryPoint
from ..utils.utils import try_set, unlist


def transforms_wordtokenizer(
        data,
        output_data=None,
        model=None,
        column=None,
        char_array_term_separators=None,
        **params):
    """
    **Description**
        The input to this transform is text, and the output is a vector of
        text containing the words (tokens) in the original text. The
        separator is space, but can be specified as any other
        character (or multiple characters) if needed.

    :param column: New column definition(s) (inputs).
    :param data: Input dataset (inputs).
    :param char_array_term_separators: Array of single character term
        separator(s). By default uses space character separator.
        (inputs).
    :param output_data: Transformed dataset (outputs).
    :param model: Transform model (outputs).
    """

    entrypoint_name = 'Transforms.WordTokenizer'
    inputs = {}
    outputs = {}

    if column is not None:
        inputs['Column'] = try_set(
            obj=column,
            none_acceptable=True,
            is_of_type=list,
            is_column=True)
    if data is not None:
        inputs['Data'] = try_set(
            obj=data,
            none_acceptable=False,
            is_of_type=str)
    if char_array_term_separators is not None:
        inputs['CharArrayTermSeparators'] = try_set(
            obj=char_array_term_separators,
            none_acceptable=True,
            is_of_type=list)
    if output_data is not None:
        outputs['OutputData'] = try_set(
            obj=output_data,
            none_acceptable=False,
            is_of_type=str)
    if model is not None:
        outputs['Model'] = try_set(
            obj=model,
            none_acceptable=False,
            is_of_type=str)

    input_variables = {
        x for x in unlist(inputs.values())
        if isinstance(x, str) and x.startswith("$")}
    output_variables = {
        x for x in unlist(outputs.values())
        if isinstance(x, str) and x.startswith("$")}

    entrypoint = EntryPoint(
        name=entrypoint_name, inputs=inputs, outputs=outputs,
        input_variables=input_variables,
        output_variables=output_variables)
    return entrypoint
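The input_variables/output_variables sets collect graph-variable references, i.e. any string value starting with "$", so the entrypoint graph can wire one node's outputs to another's inputs. A self-contained sketch of that extraction; the dict contents are made up for illustration, and flatten() is a simplified stand-in for nimbusml's unlist():

inputs = {'Data': '$data', 'CharArrayTermSeparators': [' ']}
outputs = {'OutputData': '$output_data', 'Model': '$model'}

def flatten(values):
    # One-level flattening, standing in for unlist().
    for v in values:
        if isinstance(v, list):
            yield from v
        else:
            yield v

input_variables = {x for x in flatten(inputs.values())
                   if isinstance(x, str) and x.startswith("$")}
output_variables = {x for x in flatten(outputs.values())
                    if isinstance(x, str) and x.startswith("$")}
print(input_variables)   # {'$data'}
print(output_variables)  # '$output_data' and '$model' (set order may vary)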
@@ -1,5 +1,7 @@ (the preprocessing.text __init__.py)
 from .chartokenizer import CharTokenizer
+from .wordtokenizer import WordTokenizer

 __all__ = [
-    'CharTokenizer'
+    'CharTokenizer',
+    'WordTokenizer'
 ]
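The trailing comma added after 'CharTokenizer' matters: without it, Python's implicit string-literal concatenation would silently fuse the two names into one bogus export. A quick illustration of the pitfall:

# Adjacent string literals concatenate implicitly in Python:
exports = [
    'CharTokenizer'
    'WordTokenizer'  # no comma after 'CharTokenizer' above
]
print(exports)  # ['CharTokenizerWordTokenizer'] - one fused name, not two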
@@ -0,0 +1,55 @@ (the public WordTokenizer wrapper)
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# --------------------------------------------------------------------------------------------
# - Generated by tools/entrypoint_compiler.py: do not edit by hand
"""
WordTokenizer
"""

__all__ = ["WordTokenizer"]


from sklearn.base import TransformerMixin

from ...base_transform import BaseTransform
from ...internal.core.preprocessing.text.wordtokenizer import \
    WordTokenizer as core
from ...internal.utils.utils import trace


class WordTokenizer(core, BaseTransform, TransformerMixin):
    """
    **Description**
        The input to this transform is text, and the output is a vector of
        text containing the words (tokens) in the original text. The
        separator is space, but can be specified as any other character
        (or multiple characters) if needed.

    :param columns: see `Columns </nimbusml/concepts/columns>`_.

    :param char_array_term_separators: Array of single character term
        separator(s). By default uses space character separator.

    :param params: Additional arguments sent to compute engine.

    """

    @trace
    def __init__(
            self,
            char_array_term_separators=None,
            columns=None,
            **params):

        if columns:
            params['columns'] = columns
        BaseTransform.__init__(self, **params)
        core.__init__(
            self,
            char_array_term_separators=char_array_term_separators,
            **params)
        self._columns = columns

    def get_params(self, deep=False):
        """
        Get the parameters for this operator.
        """
        return core.get_params(self)
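The columns parameter and the << operator are two spellings of the same column mapping. A hedged sketch, assuming the dict-based mapping convention ({output_column: input_column}) used elsewhere in nimbusml:

from nimbusml.preprocessing.text import WordTokenizer

# These two constructions should be equivalent: tokenize the
# 'SentimentText' column and write the tokens to new 'wt.*' columns.
t1 = WordTokenizer(char_array_term_separators=[" "]) << {'wt': 'SentimentText'}
t2 = WordTokenizer(char_array_term_separators=[" "],
                   columns={'wt': 'SentimentText'})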
33 changes: 33 additions & 0 deletions
src/python/nimbusml/tests/preprocessing/text/test_wordtokenizer.py

@@ -0,0 +1,33 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# --------------------------------------------------------------------------------------------
import unittest

import pandas
from nimbusml import Pipeline
from nimbusml.preprocessing.text import WordTokenizer


class TestWordTokenizer(unittest.TestCase):

    def test_wordtokenizer(self):
        customer_reviews = pandas.DataFrame(data=dict(review=[
            "I really did not like the taste of it",
            "It was surprisingly quite good!"]))

        tokenize = WordTokenizer(char_array_term_separators=[" ", "n"]) << 'review'
        pipeline = Pipeline([tokenize])

        tokenize.fit(customer_reviews)
        y = tokenize.transform(customer_reviews)

        self.assertEqual(y.shape, (2, 9))

        self.assertEqual(y.loc[0, 'review.3'], 'ot')
        self.assertEqual(y.loc[1, 'review.3'], 'gly')
        self.assertEqual(y.loc[1, 'review.6'], None)


if __name__ == '__main__':
    unittest.main()
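The expected values follow from splitting on " " and "n" and dropping empty tokens: the first review yields 9 tokens (hence shape (2, 9)) and the second only 6, so 'review.6' is None padding. Note the test builds a Pipeline but then fits the transform directly; the pipeline route should be equivalent. A hedged sketch of the same check through the Pipeline object, assuming fit_transform returns the transformed frame as it does elsewhere in nimbusml:

import pandas
from nimbusml import Pipeline
from nimbusml.preprocessing.text import WordTokenizer

customer_reviews = pandas.DataFrame(data=dict(review=[
    "I really did not like the taste of it",
    "It was surprisingly quite good!"]))

tokenize = WordTokenizer(char_array_term_separators=[" ", "n"]) << 'review'
pipeline = Pipeline([tokenize])

# Fitting and transforming through the Pipeline should yield the same
# frame the test builds by calling fit/transform on the transform itself.
y = pipeline.fit_transform(customer_reviews)
assert y.shape == (2, 9)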