You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
This is related to issue onnx/sklearn-onnx#485. onnxruntime seems to be missing n-grams if there are stopwords in between. ngrams([a b c] , (1, 2)) --> (a, ab, b, bc, c). If b is a stopwords, we should have ngrams([a b c] , (1, 2), stopwords=['b']) --> (a, ac, c) but onnxruntime seems to return (a, c).
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
ONNX Runtime installed from (source or binary): source
ONNX Runtime version: 1.3
Python version: 3.7
Visual Studio version (if applicable): 2019
To Reproduce
import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession, __version__ as ort_version
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, SVR
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType
import onnx
stopwords = ['the', 'and', 'is']
X_train = numpy.array([
"This is the first document",
"This document is the second document.",
"And this is the third one",
"Is this the first document?",
]).reshape((4, 1))
y_train = numpy.array([0, 1, 0, 1])
model_pipeline = Pipeline([
('vectorizer', CountVectorizer(
stop_words=stopwords, lowercase=True,
ngram_range=(1, 2), max_features=30000)),
])
model_pipeline.fit(X_train.ravel(), y_train)
initial_type = [('input', StringTensorType([None, 1]))]
model_onnx = convert_sklearn(
model_pipeline, "cv", initial_types=initial_type,
options={SVC: {'zipmap': False}})
exp = [model_pipeline.transform(X_train.ravel()).toarray()]
sess = InferenceSession(model_onnx.SerializeToString())
got = sess.run(None, {'input': X_train})
if verbose:
voc = model_pipeline.steps[0][-1].vocabulary_
voc = list(sorted([(v, k) for k, v in voc.items()]))
for kv in voc:
print(kv)
for a, b in zip(exp, got):
if verbose:
print(a)
print(b)
assert_almost_equal(a, b)
Output (onnxruntime does not detect n-grams (this first).
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the bug
This is related to issue onnx/sklearn-onnx#485. onnxruntime seems to be missing n-grams if there are stopwords in between.
ngrams([a b c] , (1, 2)) --> (a, ab, b, bc, c)
. If b is a stopwords, we should havengrams([a b c] , (1, 2), stopwords=['b']) --> (a, ac, c)
but onnxruntime seems to return(a, c)
.System information
To Reproduce
Output (onnxruntime does not detect n-grams (
this first
).Expected behavior
The two last matrices must be equal.
The text was updated successfully, but these errors were encountered: