StringNormalizer+Tokenizer misses n-grams #4201

Open
xadupre opened this issue Jun 11, 2020 · 1 comment
Labels
contributions welcome (lower priority issues for the core ORT teams)

Comments

xadupre (Member) commented Jun 11, 2020

Describe the bug
This is related to issue onnx/sklearn-onnx#485. onnxruntime seems to miss n-grams when a stop word lies between the tokens. For example, ngrams([a b c], (1, 2)) --> (a, ab, b, bc, c). If b is a stop word, the expected result is ngrams([a b c], (1, 2), stopwords=['b']) --> (a, ac, c), because removing the stop word makes a and c adjacent, but onnxruntime seems to return only (a, c).
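
For reference, the expected behaviour can be reproduced with scikit-learn alone (a minimal sketch with made-up tokens, not taken from the report; multi-letter tokens are used because the default token_pattern drops single characters):

from sklearn.feature_extraction.text import CountVectorizer

# scikit-learn removes the stop word first and then builds n-grams on the
# remaining tokens, so 'alpha gamma' bridges the removed word.
vect = CountVectorizer(stop_words=['beta'], ngram_range=(1, 2))
vect.fit(["alpha beta gamma"])
print(sorted(vect.vocabulary_))  # ['alpha', 'alpha gamma', 'gamma']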

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
  • ONNX Runtime installed from (source or binary): source
  • ONNX Runtime version: 1.3
  • Python version: 3.7
  • Visual Studio version (if applicable): 2019

To Reproduce

import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

verbose = True  # print the vocabulary and the compared matrices

stopwords = ['the', 'and', 'is']
X_train = numpy.array([
    "This is the first document",
    "This document is the second document.",
    "And this is the third one",
    "Is this the first document?",
]).reshape((4, 1))
y_train = numpy.array([0, 1, 0, 1])

# Pipeline with a single CountVectorizer: stop words, lowercasing, 1-2 grams
model_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        stop_words=stopwords, lowercase=True,
        ngram_range=(1, 2), max_features=30000)),
])

model_pipeline.fit(X_train.ravel(), y_train)
# Convert the fitted pipeline to ONNX; the input is one string per row
initial_type = [('input', StringTensorType([None, 1]))]
model_onnx = convert_sklearn(
    model_pipeline, "cv", initial_types=initial_type,
    options={SVC: {'zipmap': False}})

# Expected counts from scikit-learn
exp = [model_pipeline.transform(X_train.ravel()).toarray()]

# Actual counts from onnxruntime
sess = InferenceSession(model_onnx.SerializeToString())
got = sess.run(None, {'input': X_train})
if verbose:
    voc = model_pipeline.steps[0][-1].vocabulary_
    voc = list(sorted([(v, k) for k, v in voc.items()]))
    for kv in voc:
        print(kv)
for a, b in zip(exp, got):
    if verbose:
        print(a)
        print(b)
    assert_almost_equal(a, b)
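
Since the title points at StringNormalizer and Tokenizer, it can also help to check which operators the converter actually produced; a quick sketch, assuming model_onnx from the script above:

# List the operator types in the converted graph (assumes model_onnx above);
# the CountVectorizer converter typically emits StringNormalizer, Tokenizer
# and TfIdfVectorizer nodes.
print([node.op_type for node in model_onnx.graph.node])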

Output (onnxruntime does not detect n-grams that span a removed stop word, such as 'this first'):

(0, 'document')
(1, 'document second')
(2, 'first')
(3, 'first document')
(4, 'one')
(5, 'second')
(6, 'second document')
(7, 'third')
(8, 'third one')
(9, 'this')
(10, 'this document')
(11, 'this first')
(12, 'this third')
[[1 0 1 1 0 0 0 0 0 1 0 1 0]
 [2 1 0 0 0 1 1 0 0 1 1 0 0]
 [0 0 0 0 1 0 0 1 1 1 0 0 1]
 [1 0 1 1 0 0 0 0 0 1 0 1 0]]
[[1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [2. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0.]
 [1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

Expected behavior
The last two matrices should be equal.
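
A quick way to see which vocabulary entries onnxruntime drops is to compare the two matrices column by column (a small sketch reusing voc, exp and got from the reproduction script above):

import numpy

# Columns where scikit-learn and onnxruntime disagree; printing the matching
# vocabulary entries shows the missing n-grams (assumes voc, exp, got above).
diff_columns = numpy.where((exp[0] != got[0]).any(axis=0))[0]
for idx in diff_columns:
    print(voc[idx])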

@xadupre xadupre changed the title StringNormalizer+ Tokenizer misses n-grams StringNormalizer+Tokenizer misses n-grams Jun 11, 2020
@yuslepukhin yuslepukhin self-assigned this Jun 16, 2020
stale bot commented Aug 15, 2020

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
