StringNormalizer+Tokenizer misses n-grams #4201

Open
xadupre opened this issue Jun 11, 2020 · 1 comment
Labels
contributions welcome (lower priority issues for the core ORT teams)

Comments

xadupre (Member) commented Jun 11, 2020

Describe the bug
This is related to issue onnx/sklearn-onnx#485. onnxruntime seems to miss n-grams when a stop word lies between the tokens. For example, ngrams([a b c], (1, 2)) --> (a, ab, b, bc, c). If b is a stop word, the expected result is ngrams([a b c], (1, 2), stopwords=['b']) --> (a, ac, c), because removing the stop word makes a and c adjacent, but onnxruntime seems to return only (a, c).
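
For reference, the expected behaviour can be reproduced with scikit-learn alone (a minimal sketch with made-up tokens, not taken from the report; multi-letter tokens are used because the default token_pattern drops single characters):

from sklearn.feature_extraction.text import CountVectorizer

# scikit-learn removes the stop word first and then builds n-grams on the
# remaining tokens, so 'alpha gamma' bridges the removed word.
vect = CountVectorizer(stop_words=['beta'], ngram_range=(1, 2))
vect.fit(["alpha beta gamma"])
print(sorted(vect.vocabulary_))  # ['alpha', 'alpha gamma', 'gamma']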

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
  • ONNX Runtime installed from (source or binary): source
  • ONNX Runtime version: 1.3
  • Python version: 3.7
  • Visual Studio version (if applicable): 2019

To Reproduce

import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

verbose = True  # print the vocabulary and the compared matrices

stopwords = ['the', 'and', 'is']
X_train = numpy.array([
    "This is the first document",
    "This document is the second document.",
    "And this is the third one",
    "Is this the first document?",
]).reshape((4, 1))
y_train = numpy.array([0, 1, 0, 1])

# Pipeline with a single CountVectorizer: stop words, lowercasing, 1-2 grams
model_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        stop_words=stopwords, lowercase=True,
        ngram_range=(1, 2), max_features=30000)),
])

model_pipeline.fit(X_train.ravel(), y_train)
# Convert the fitted pipeline to ONNX; the input is one string per row
initial_type = [('input', StringTensorType([None, 1]))]
model_onnx = convert_sklearn(
    model_pipeline, "cv", initial_types=initial_type,
    options={SVC: {'zipmap': False}})

# Expected counts from scikit-learn
exp = [model_pipeline.transform(X_train.ravel()).toarray()]

# Actual counts from onnxruntime
sess = InferenceSession(model_onnx.SerializeToString())
got = sess.run(None, {'input': X_train})
if verbose:
    voc = model_pipeline.steps[0][-1].vocabulary_
    voc = list(sorted([(v, k) for k, v in voc.items()]))
    for kv in voc:
        print(kv)
for a, b in zip(exp, got):
    if verbose:
        print(a)
        print(b)
    assert_almost_equal(a, b)
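
Since the title points at StringNormalizer and Tokenizer, it can also help to check which operators the converter actually produced; a quick sketch, assuming model_onnx from the script above:

# List the operator types in the converted graph (assumes model_onnx above);
# the CountVectorizer converter typically emits StringNormalizer, Tokenizer
# and TfIdfVectorizer nodes.
print([node.op_type for node in model_onnx.graph.node])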

Output (onnxruntime does not detect n-grams that span a removed stop word, such as 'this first'):

(0, 'document')
(1, 'document second')
(2, 'first')
(3, 'first document')
(4, 'one')
(5, 'second')
(6, 'second document')
(7, 'third')
(8, 'third one')
(9, 'this')
(10, 'this document')
(11, 'this first')
(12, 'this third')
[[1 0 1 1 0 0 0 0 0 1 0 1 0]
 [2 1 0 0 0 1 1 0 0 1 1 0 0]
 [0 0 0 0 1 0 0 1 1 1 0 0 1]
 [1 0 1 1 0 0 0 0 0 1 0 1 0]]
[[1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [2. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0.]
 [1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

Expected behavior
The last two matrices should be equal.
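
A quick way to see which vocabulary entries onnxruntime drops is to compare the two matrices column by column (a small sketch reusing voc, exp and got from the reproduction script above):

import numpy

# Columns where scikit-learn and onnxruntime disagree; printing the matching
# vocabulary entries shows the missing n-grams (assumes voc, exp, got above).
diff_columns = numpy.where((exp[0] != got[0]).any(axis=0))[0]
for idx in diff_columns:
    print(voc[idx])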

@xadupre xadupre changed the title StringNormalizer+ Tokenizer misses n-grams StringNormalizer+Tokenizer misses n-grams Jun 11, 2020
@yuslepukhin yuslepukhin self-assigned this Jun 16, 2020
stale bot commented Aug 15, 2020

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
