Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Sequence Labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a subclass of structured (output) learning, since we are predicting a sequence object rather than a discrete or real value predicted in classification problems.
Shallow Parsing is a kind of Sequence Labeling. The main difference from a Sequence Labeling task such as Part-of-Speech Tagging, where there is one output label (tag) per token, is that Shallow Parsing additionally performs chunking -- segmentation of the input sequence into constituents. Chunking is required to identify the categories (or types) of multi-word expressions.
In other words, we want to capture the information that an expression like "New York", which consists of 2 tokens,
constitutes a single unit.
What this means in practice is that Shallow Parsing performs 2 tasks (jointly or not):
- Segmentation of input into constituents (spans)
- Classification (Categorization, Labeling) of these constituents into a predefined set of labels (types)
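As an illustrative sketch (the sentence and tags are invented, using IOB prefix notation), both sub-tasks are encoded in a single tag sequence:

```python
# Illustrative sketch: one tag per token; the affixes mark the segmentation,
# the labels mark the type of each segment.
tokens = ["I", "live", "in", "New", "York"]
tags   = ["O", "O", "O", "B-LOC", "I-LOC"]

# Jointly, the tags encode a single typed constituent:
#   segmentation: tokens 3..5 form one span ("New York")
#   labeling:     that span has the type LOC
chunk = ("LOC", 3, 5)
```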
A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type (such as various token-level features & labels).
The set of columns used by CoNLL-style files can vary from corpus to corpus.
Since a line in the data can correspond to any token (word or not), it is referred to by the more general term token.
Similarly, since the data can be composed of units larger or smaller than a sentence,
a newline-separated unit is referred to by the more general term block.
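A minimal sketch of reading such data, assuming whitespace-separated columns and blank-line block boundaries (the example corpus and its columns are invented):

```python
# Hypothetical two-column (token, tag) corpus in CoNLL-style format.
raw = """John B-PER
lives O
in O
New B-LOC
York I-LOC

So O
does O
Mary B-PER
"""

blocks = []
for part in raw.strip().split("\n\n"):                   # blank line separates blocks
    rows = [line.split() for line in part.splitlines()]  # one token per line
    blocks.append(rows)

print(len(blocks))   # 2 blocks (sentences)
print(blocks[0][3])  # ['New', 'B-LOC']
```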
A notation scheme is used to label multi-word spans in the token-per-line format, e.g. to encode that "New York" is a LOCATION concept spanning 2 tokens.
As such, a token-level tag consists of an affix that encodes segmentation information
and a label that encodes type information.
Consequently, the corpus tagset consists of all possible affix and label combinations.
A segment encoded with affixes and assigned a label is referred to as a chunk.
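A sketch of splitting a tag into its affix and label; the function and its defaults are illustrative (the parameter names mirror the --kind, --glue, and --otag options shown later), not econll's actual API:

```python
def parse_tag(tag, kind="prefix", glue="-", otag="O"):
    """Split a tag into (affix, label); illustrative helper."""
    # The outside tag carries no label part.
    if tag == otag:
        return otag, None
    if kind == "prefix":                 # e.g. "B-LOC" -> ("B", "LOC")
        affix, label = tag.split(glue, 1)
    else:                                # e.g. "LOC-B" -> ("B", "LOC")
        label, affix = tag.rsplit(glue, 1)
    return affix, label

print(parse_tag("B-LOC"))                  # ('B', 'LOC')
print(parse_tag("LOC-B", kind="suffix"))   # ('B', 'LOC')
print(parse_tag("O"))                      # ('O', None)
```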
Both prefix and suffix notations are common:
- prefix: B-LOC
- suffix: LOC-B
Meaning of Affixes
- I for Inside of span
- O for Outside of span (no prefix or suffix, just O)
- B for Beginning of span
- No affix (useful when there are no multi-word concepts)
Notation Schemes
- IO: deficient without B
- IOB: see above
- IOBE: E for End of span (L in BILOU for Last)
- IOBES: S for Singleton (U in BILOU for Unit)
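As a sketch of how the schemes relate, the following illustrative function converts IOB tags to IOBES by rewriting chunk-final affixes (it assumes prefix notation with "-" as glue):

```python
def iob_to_iobes(tags, glue="-"):
    """Rewrite chunk-final B to S (singleton) and chunk-final I to E (end)."""
    out = list(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        affix, label = tag.split(glue, 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # The chunk ends here unless the next tag continues it (I with same label).
        ends = not nxt.startswith("I" + glue) or nxt.split(glue, 1)[1] != label
        if ends:
            out[i] = ("S" if affix == "B" else "E") + glue + label
    return out

print(iob_to_iobes(["O", "B-LOC", "I-LOC", "B-PER"]))
# ['O', 'B-LOC', 'E-LOC', 'S-PER']
```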
There are several methods to evaluate the performance of shallow parsing models.
They can be evaluated at the token level and at the chunk level.
At the token level, the unit of evaluation is the tag of a token,
and what is evaluated is how accurately a model assigns tags to tokens.
Consequently, token (or tag) accuracy measures the fraction of correctly predicted tags.
Since a tag consists of an affix-label pair,
it is additionally possible to separately compute affix and label performances.
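A minimal sketch of these three accuracies, computed over invented aligned tag sequences in prefix notation:

```python
# Invented reference and hypothesis tag sequences (prefix notation).
true = ["B-LOC", "I-LOC", "O", "B-PER"]
pred = ["B-LOC", "B-LOC", "O", "B-PER"]

def accuracy(xs, ys):
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

def affix(tag):
    return tag if tag == "O" else tag.split("-", 1)[0]

def label(tag):
    return None if tag == "O" else tag.split("-", 1)[1]

print(accuracy(true, pred))                                          # 0.75 (tag accuracy)
print(accuracy([affix(t) for t in true], [affix(t) for t in pred]))  # 0.75 (affix accuracy)
print(accuracy([label(t) for t in true], [label(t) for t in pred]))  # 1.0  (label accuracy)
```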
At the chunk level, the unit of evaluation is a chunk, and the evaluation is "joint",
in the sense that it jointly evaluates segmentation and labeling.
That is, a chunk is counted as correct only if both its label and its span are correct.
Similar to token-level evaluation, it is possible to evaluate segmentation independently of labeling.
This is achieved by ignoring the chunk labels, e.g. by converting all of them to a single label.
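A sketch of chunk-level precision, recall, and F1 over invented tag sequences, including the segmentation-only variant that ignores labels (the chunk-extraction helper is simplified and illustrative):

```python
def chunks(tags):
    """Extract (label, start, end) chunks from IOB prefix tags.

    Simplified sketch: any I- tag extends the most recent chunk.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            out.append((tag[2:], i, i + 1))
        elif tag.startswith("I-") and out:
            lbl, b, _ = out[-1]
            out[-1] = (lbl, b, i + 1)
    return out

true = ["B-LOC", "I-LOC", "O", "B-PER"]
pred = ["B-LOC", "O", "O", "B-PER"]

# A predicted chunk is correct only if both span and label match a reference chunk.
t, p = set(chunks(true)), set(chunks(pred))
correct = len(t & p)
prec, rec = correct / len(p), correct / len(t)
print(2 * prec * rec / (prec + rec))   # chunk-level F1

# Segmentation-only evaluation: drop the label before comparing spans.
t_seg = {(b, e) for _, b, e in t}
p_seg = {(b, e) for _, b, e in p}
print(len(t_seg & p_seg))              # number of matching spans
```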
Token-level evaluation is readily available from a number of packages,
and can be easily computed using scikit-learn's classification_report, for instance.
Chunk-level evaluation was originally provided by
the conlleval perl script within the CoNLL Shared Tasks.
However, one limitation of conlleval is that it does not support the IOBES or BILOU schemes.
The conlleval script has been ported to Python numerous times, and these ports vary in functionality.
One notable port is seqeval,
which is also included in Hugging Face's evaluate package.
To install econll run:
pip install econll
It is possible to run econll from the command line, as well as to import its methods.
usage: PROG [-h] -d DATA [-r REFS]
[--separator SEPARATOR] [--boundary BOUNDARY] [--docstart DOCSTART]
[--kind {prefix,suffix}] [--glue GLUE] [--otag OTAG]
[-f {conll,parse,mdown}] [-o OUTS]
[{eval,conv}]
eCoNLL: Extended CoNLL Utilities
positional arguments:
{eval,conv} task to perform
options:
-h, --help show this help message and exit
I/O Arguments:
-d DATA, --data DATA path to data/hypothesis file
-r REFS, --refs REFS path to references file
Data Format Arguments:
--separator SEPARATOR
field separator string
--boundary BOUNDARY block separator string
--docstart DOCSTART doc start string
Tag Format Arguments:
--kind {prefix,suffix}
tag order
--glue GLUE tag separator
--otag OTAG outside tag
Data Conversion Arguments:
-f {conll,parse,mdown}, --form {conll,parse,mdown}
output format (kind)
-o OUTS, --outs OUTS path to output file
python -m econll -d DATA
python -m econll eval -d DATA
python -m econll eval -d DATA -r REFS
python -m econll conv -d DATA -f FORMAT -o PATH
This project adheres to Semantic Versioning.