Release/0.19.7 #813

Merged 17 commits on Jun 20, 2019
2 changes: 0 additions & 2 deletions .travis.yml
@@ -24,5 +24,3 @@ script: tox
after_success:
- tox -e coverage-report
- codecov

cache: pip
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,21 @@
# Changelog
All notable changes to this project will be documented in this file.

## [0.19.7]
### Changed
- Re-score ambiguous `DeterministicIntentParser` results based on slots [#791](https://github.com/snipsco/snips-nlu/pull/791)
- Accept ambiguous results from `DeterministicIntentParser` when the confidence score is above 0.5 [#797](https://github.com/snipsco/snips-nlu/pull/797)
- Avoid generating number variations when not needed [#799](https://github.com/snipsco/snips-nlu/pull/799)
- Move the NLU random state from the config to the shared resources [#801](https://github.com/snipsco/snips-nlu/pull/801)
- Reduce the custom entity parser footprint at training time [#804](https://github.com/snipsco/snips-nlu/pull/804)
- Bump `scikit-learn` to `>=0.21,<0.22` for `python>=3.5` and `>=0.20,<0.21` for `python<3.5` [#801](https://github.com/snipsco/snips-nlu/pull/801)
- Update dependencies [#811](https://github.com/snipsco/snips-nlu/pull/811)

### Fixed
- Fix a couple of bugs in the data augmentation which made the NLU training non-deterministic [#801](https://github.com/snipsco/snips-nlu/pull/801)
- Remove deprecated code in dataset generation [#803](https://github.com/snipsco/snips-nlu/pull/803)
- Fix possible override of entity values when generating variations [#808](https://github.com/snipsco/snips-nlu/pull/808)

## [0.19.6]
### Fixed
- Raise an error when using unknown intents in intents filter [#788](https://github.com/snipsco/snips-nlu/pull/788)
@@ -269,6 +284,7 @@ several commands.
- Fix compiling issue with `bindgen` dependency when installing from source
- Fix issue in `CRFSlotFiller` when handling builtin entities

[0.19.7]: https://github.com/snipsco/snips-nlu/compare/0.19.6...0.19.7
[0.19.6]: https://github.com/snipsco/snips-nlu/compare/0.19.5...0.19.6
[0.19.5]: https://github.com/snipsco/snips-nlu/compare/0.19.4...0.19.5
[0.19.4]: https://github.com/snipsco/snips-nlu/compare/0.19.3...0.19.4
20 changes: 20 additions & 0 deletions docs/source/tutorial.rst
@@ -174,6 +174,26 @@ the dataset we generated earlier:

engine.fit(dataset)

Note that, by default, training of the NLU engine is non-deterministic:
training multiple times on the same data may produce different models, and
hence different outputs.

Reproducible trainings can be achieved by passing a **random seed** to the
engine:

.. code-block:: python

seed = 42
engine = SnipsNLUEngine(config=CONFIG_EN, random_state=seed)
engine.fit(dataset)


.. note::

Due to a ``scikit-learn`` bug fixed in version ``0.21``, we cannot guarantee
deterministic behavior with Python versions ``<3.5``, since
``scikit-learn>=0.21`` is only available for Python ``>=3.5``.

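With a fixed seed, two engines trained on the same data behave identically.
Here is a minimal sketch of that property, assuming the ``dataset`` and
``CONFIG_EN`` objects from the previous steps (the sample query is
hypothetical):

.. code-block:: python

    # Train two engines with the same random seed on the same dataset
    engine_1 = SnipsNLUEngine(config=CONFIG_EN, random_state=42)
    engine_1.fit(dataset)

    engine_2 = SnipsNLUEngine(config=CONFIG_EN, random_state=42)
    engine_2.fit(dataset)

    # Both engines should produce the exact same parsing output
    assert engine_1.parse("turn the lights on") == engine_2.parse("turn the lights on")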

Parsing
-------
25 changes: 13 additions & 12 deletions setup.py
@@ -17,22 +17,23 @@
readme = f.read()

required = [
"deprecation>=2.0,<3.0",
"enum34>=1.1,<2.0; python_version<'3.4'",
"future>=0.16,<0.17",
"numpy>=1.15,<1.16",
"funcsigs>=1.0,<2.0; python_version<'3.4'",
"future>=0.16,<0.18",
"num2words>=0.5.6,<0.6",
"numpy>=1.15,<2.0",
"pathlib>=1.0,<2.0; python_version<'3.4'",
"plac>=0.9.6,<2.0",
"pyaml>=17.0,<20.0",
"requests>=2.0,<3.0",
"scikit-learn>=0.20,<0.21; python_version<'3.5'",
"scikit-learn>=0.21.1,<0.22; python_version>='3.5'",
"scipy>=1.0,<2.0",
"scikit-learn>=0.19,<0.20",
"sklearn-crfsuite>=0.3.6,<0.4",
"semantic_version>=2.6,<3.0",
"snips-nlu-utils>=0.8,<0.9",
"sklearn-crfsuite>=0.3.6,<0.4",
"snips-nlu-parsers>=0.2,<0.3",
"num2words>=0.5.6,<0.6",
"plac>=0.9.6,<1.0",
"requests>=2.0,<3.0",
"pathlib==1.0.1; python_version < '3.4'",
"pyaml>=17,<18",
"deprecation>=2,<3",
"funcsigs>=1.0,<2.0; python_version < '3.4'"
"snips-nlu-utils>=0.8,<0.9",
]

extras_require = {
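The pins above rely on PEP 508 environment markers (the `; python_version<'3.5'` suffixes) so that pip selects the right dependency for the running interpreter. A minimal sketch of the mechanism, with a hypothetical package name and only the `scikit-learn` pins taken from this diff:

# Sketch: setuptools environment markers for version-dependent pins
from setuptools import setup

setup(
    name="example-package",  # hypothetical
    version="0.1.0",
    install_requires=[
        # resolved at install time based on the running interpreter
        "scikit-learn>=0.20,<0.21; python_version<'3.5'",
        "scikit-learn>=0.21.1,<0.22; python_version>='3.5'",
    ],
)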
2 changes: 1 addition & 1 deletion snips_nlu/__about__.py
@@ -13,7 +13,7 @@
__email__ = "clement.doumouro@snips.ai, adrien.ball@snips.ai"
__license__ = "Apache License, Version 2.0"

__version__ = "0.19.6"
__version__ = "0.19.7"
__model_version__ = "0.19.0"

__download_url__ = "https://github.com/snipsco/snips-nlu-language-resources/releases/download"
24 changes: 16 additions & 8 deletions snips_nlu/cli/generate_dataset.py
@@ -8,13 +8,21 @@

@plac.annotations(
language=("Language of the assistant", "positional", None, str),
files=("List of intent and entity files", "positional", None, str, None,
"filename"))
def generate_dataset(language, *files):
"""Create a Snips NLU dataset from text friendly files"""
yaml_files=("List of intent and entity yaml files", "positional", None,
str, None, "filename"))
def generate_dataset(language, *yaml_files):
"""Creates a Snips NLU dataset from YAML definition files

Check :meth:`.Intent.from_yaml` and :meth:`.Entity.from_yaml` for the
format of the YAML files.

Args:
language (str): language of the dataset (iso code)
*yaml_files: list of intent and entity definition files in YAML format.

Returns:
None. The JSON dataset is printed out on stdout.
"""
language = unicode_string(language)
if any(f.endswith(".yml") or f.endswith(".yaml") for f in files):
dataset = Dataset.from_yaml_files(language, list(files))
else:
dataset = Dataset.from_files(language, list(files))
dataset = Dataset.from_yaml_files(language, list(yaml_files))
print(json_string(dataset.json, indent=2, sort_keys=True))
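For reference, here is a minimal Python-level equivalent of this CLI entry point, using the same `Dataset.from_yaml_files` call as above; the YAML file names are hypothetical:

import json

from snips_nlu.dataset import Dataset

# Build the dataset from YAML intent/entity definitions and print it as
# sorted, indented JSON, mirroring what generate_dataset does
dataset = Dataset.from_yaml_files("en", ["intents.yaml", "entities.yaml"])
print(json.dumps(dataset.json, indent=2, sort_keys=True))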
1 change: 1 addition & 0 deletions snips_nlu/constants.py
@@ -46,6 +46,7 @@
BUILTIN_ENTITY_PARSER = "builtin_entity_parser"
CUSTOM_ENTITY_PARSER = "custom_entity_parser"
MATCHING_STRICTNESS = "matching_strictness"
RANDOM_STATE = "random_state"

# resources
RESOURCES = "resources"
6 changes: 3 additions & 3 deletions snips_nlu/data_augmentation.py
@@ -69,10 +69,10 @@ def get_entities_iterators(intent_entities, language,
add_builtin_entities_examples, random_state):
entities_its = dict()
for entity_name, entity in iteritems(intent_entities):
utterance_values = random_state.permutation(list(entity[UTTERANCES]))
utterance_values = random_state.permutation(sorted(entity[UTTERANCES]))
if add_builtin_entities_examples and is_builtin_entity(entity_name):
entity_examples = get_builtin_entity_examples(entity_name,
language)
entity_examples = get_builtin_entity_examples(
entity_name, language)
# Builtin entity examples must be kept first in the iterator to
# ensure that they are used when augmenting data
iterator_values = entity_examples + list(utterance_values)
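The `sorted(...)` call introduced above is the heart of the determinism fix: `entity[UTTERANCES]` is an unordered mapping, and permuting a collection whose iteration order can change between runs defeats a seeded random state. A self-contained sketch of the idea, with toy values:

import numpy as np

# String hashing is randomized between Python runs, so set/dict iteration
# order is unstable; sorting first makes the seeded permutation reproducible
values = {"turn on", "switch on", "lights up"}

random_state = np.random.RandomState(42)
permuted = random_state.permutation(sorted(values))
print(list(permuted))  # same order on every run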
6 changes: 4 additions & 2 deletions snips_nlu/dataset/intent.py
@@ -306,7 +306,8 @@ def capture_slot(state):
next_colon_pos = state.find(':')
next_square_bracket_pos = state.find(']')
if next_square_bracket_pos < 0:
raise IntentFormatError("Missing ending ']' in annotated utterance")
raise IntentFormatError(
"Missing ending ']' in annotated utterance \"%s\"" % state.input)
if next_colon_pos < 0 or next_square_bracket_pos < next_colon_pos:
slot_name = state[:next_square_bracket_pos]
state.move(next_square_bracket_pos)
@@ -327,7 +328,8 @@ def capture_tagged(state):
def capture_tagged(state):
next_pos = state.find(')')
if next_pos < 1:
raise IntentFormatError("Missing ending ')' in annotated utterance")
raise IntentFormatError(
"Missing ending ')' in annotated utterance \"%s\"" % state.input)
else:
tagged_text = state[:next_pos]
state.add_tagged(tagged_text)
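For context, `capture_slot` and `capture_tagged` parse the `[slot_name:entity_name](text)` annotation syntax used in utterance definitions, and the errors now quote the offending utterance. A hypothetical pair of inputs illustrating what each check catches:

# Well-formed annotated utterance in the [slot_name:entity_name](text) syntax
good = "Set the temperature to [room_temperature:temperature](twenty degrees)"

# Missing ']' after the slot name: capture_slot raises IntentFormatError
bad_slot = "Set the temperature to [room_temperature:temperature(twenty degrees)"

# Missing ')' after the tagged text: capture_tagged raises IntentFormatError
bad_tag = "Set the temperature to [room_temperature:temperature](twenty degrees"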
65 changes: 45 additions & 20 deletions snips_nlu/dataset/validation.py
@@ -8,6 +8,8 @@
from future.utils import iteritems, itervalues
from snips_nlu_parsers import get_all_languages

from snips_nlu.common.dataset_utils import (validate_key, validate_keys,
validate_type)
from snips_nlu.constants import (
AUTOMATICALLY_EXTENSIBLE, CAPITALIZE, DATA, ENTITIES, ENTITY, INTENTS,
LANGUAGE, MATCHING_STRICTNESS, SLOT_NAME, SYNONYMS, TEXT, USE_SYNONYMS,
@@ -18,8 +20,9 @@
from snips_nlu.exceptions import DatasetFormatError
from snips_nlu.preprocessing import tokenize_light
from snips_nlu.string_variations import get_string_variations
from snips_nlu.common.dataset_utils import validate_type, validate_key, \
validate_keys

NUMBER_VARIATIONS_THRESHOLD = 1e3
VARIATIONS_GENERATION_THRESHOLD = 1e4


def validate_and_format_dataset(dataset):
@@ -111,7 +114,7 @@ def _extract_entity_values(entity):
return values


def _validate_and_format_custom_entity(entity, queries_entities, language,
def _validate_and_format_custom_entity(entity, utterance_entities, language,
builtin_entity_parser):
validate_type(entity, dict, object_label="entity")

@@ -146,30 +149,48 @@ def _validate_and_format_custom_entity(entity, utterance_entities, language,
if not entry[VALUE]:
continue
validate_type(entry[SYNONYMS], list, object_label="entity synonyms")
entry[SYNONYMS] = [s.strip() for s in entry[SYNONYMS]
if len(s.strip()) > 0]
entry[SYNONYMS] = [s.strip() for s in entry[SYNONYMS] if s.strip()]
valid_entity_data.append(entry)
entity[DATA] = valid_entity_data

# Compute capitalization before normalizing: normalization lowercases the
# values, which would otherwise skew the capitalization computation
formatted_entity[CAPITALIZE] = _has_any_capitalization(queries_entities,
formatted_entity[CAPITALIZE] = _has_any_capitalization(utterance_entities,
language)

validated_utterances = dict()
# Map original values and synonyms
for data in entity[DATA]:
ent_value = data[VALUE]
if not ent_value:
continue
validated_utterances[ent_value] = ent_value
if use_synonyms:
for s in data[SYNONYMS]:
if s and s not in validated_utterances:
if s not in validated_utterances:
validated_utterances[s] = ent_value

# Number variations in entity values are expensive, since each entity
# value is parsed with the builtin entity parser before creating the
# variations. We avoid generating these variations when there are already
# enough entity values

# Add variations if not colliding
all_original_values = _extract_entity_values(entity)
if len(entity[DATA]) < VARIATIONS_GENERATION_THRESHOLD:
variations_args = {
"case": True,
"and_": True,
"punctuation": True
}
else:
variations_args = {
"case": False,
"and_": False,
"punctuation": False
}

variations_args["numbers"] = len(
entity[DATA]) < NUMBER_VARIATIONS_THRESHOLD

variations = dict()
for data in entity[DATA]:
ent_value = data[VALUE]
@@ -178,10 +199,11 @@ def _validate_and_format_custom_entity(entity, queries_entities, language,
values_to_variate.update(set(data[SYNONYMS]))
variations[ent_value] = set(
v for value in values_to_variate
for v in get_string_variations(value, language,
builtin_entity_parser))
for v in get_string_variations(
value, language, builtin_entity_parser, **variations_args)
)
variation_counter = Counter(
[v for vars in itervalues(variations) for v in vars])
[v for variations_ in itervalues(variations) for v in variations_])
non_colliding_variations = {
value: [
v for v in variations if
@@ -195,22 +217,25 @@ def _validate_and_format_custom_entity(entity, queries_entities, language,
validated_utterances = _add_entity_variations(
validated_utterances, non_colliding_variations, entry_value)

# Merge queries entities
queries_entities_variations = {
ent: get_string_variations(ent, language, builtin_entity_parser)
for ent in queries_entities
# Merge utterances entities
utterance_entities_variations = {
ent: get_string_variations(
ent, language, builtin_entity_parser, **variations_args)
for ent in utterance_entities
}
for original_ent, variations in iteritems(queries_entities_variations):

for original_ent, variations in iteritems(utterance_entities_variations):
if not original_ent or original_ent in validated_utterances:
continue
validated_utterances[original_ent] = original_ent
for variation in variations:
if variation and variation not in validated_utterances:
if variation and variation not in validated_utterances \
and variation not in utterance_entities:
validated_utterances[variation] = original_ent
formatted_entity[UTTERANCES] = validated_utterances
return formatted_entity


def _validate_and_format_builtin_entity(entity, queries_entities):
def _validate_and_format_builtin_entity(entity, utterance_entities):
validate_type(entity, dict, object_label="builtin entity")
return {UTTERANCES: set(queries_entities)}
return {UTTERANCES: set(utterance_entities)}
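The two thresholds introduced above gate how aggressively string variations are generated: all variation types below 1e4 entity values, and the costly number variations only below 1e3 (each value must go through the builtin entity parser). A standalone restatement of that gating, with a helper name of our own:

NUMBER_VARIATIONS_THRESHOLD = 1e3
VARIATIONS_GENERATION_THRESHOLD = 1e4

def variation_flags(num_entity_values):
    # Mirrors the variations_args computed in _validate_and_format_custom_entity
    generate = num_entity_values < VARIATIONS_GENERATION_THRESHOLD
    return {
        "case": generate,
        "and_": generate,
        "punctuation": generate,
        "numbers": num_entity_values < NUMBER_VARIATIONS_THRESHOLD,
    }

print(variation_flags(500))    # all variations enabled
print(variation_flags(5000))   # only the expensive number variations disabled
print(variation_flags(50000))  # all variations disabled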
6 changes: 2 additions & 4 deletions snips_nlu/default_configs/config_de.py
@@ -111,8 +111,7 @@
"min_utterances": 200,
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
},
"random_seed": None
}
},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -140,8 +139,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
},
"random_seed": None
}
}
}
]
6 changes: 2 additions & 4 deletions snips_nlu/default_configs/config_en.py
@@ -97,8 +97,7 @@
"min_utterances": 200,
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
},
"random_seed": None
}
},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -126,8 +125,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
},
"random_seed": None
}
}
}
]
5 changes: 2 additions & 3 deletions snips_nlu/default_configs/config_es.py
@@ -90,7 +90,7 @@
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
},
"random_seed": None

},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -118,8 +118,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
},
"random_seed": None
}
}
}
]
6 changes: 2 additions & 4 deletions snips_nlu/default_configs/config_fr.py
@@ -89,8 +89,7 @@
"min_utterances": 200,
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
},
"random_seed": None
}
},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -118,8 +117,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
},
"random_seed": None
}
}
}
]
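All the default configs drop their per-unit "random_seed" entries in the same way: seeding now happens once, at engine level, through the random_state argument documented in the tutorial diff above. A minimal sketch of the migration, assuming the 0.19.7 API:

from snips_nlu import SnipsNLUEngine
from snips_nlu.default_configs import CONFIG_DE

# 0.19.6: a "random_seed": None entry lived inside each processing-unit config
# 0.19.7: pass the seed once when building the engine
engine = SnipsNLUEngine(config=CONFIG_DE, random_state=42)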