Proposal: composite labels #5

Open
nschneid opened this issue Feb 19, 2014 · 1 comment

Comments

@nschneid
Collaborator

Currently, responses are always atomic. However, for some applications it would be nice to represent labels (categorical responses) with limited structure such that features are extracted over parts of the structure, as well as the full label. We are still talking about making a single prediction, not structured prediction; the proposal is simply to enable a richer space of features over labels.

For example, in building a part-of-speech classifier, one could have features that score the full, fine-grained POS tag, as well as features that group together related tags into coarser categories to share statistical strength.

Define a composite label as a categorical response that is made up of multiple categorical parts, or components. The components could be characters in a string (such as a bit string), or in an explicit structure (such as a JSON data structure).

In the model, there will be a feature for every pairing of an input characteristic (percept) with a full (simple or composite) label. In addition, when a percept is scored with a composite label, one feature per component will fire, conjoining the percept with that component. So if every label is a POS tag consisting of two components, a coarse component and a fine component, three features will fire for the label for every percept: one with the coarse component, one with the fine component, and one with the full label.
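To make the feature scheme concrete, here is a minimal sketch in Python. The names `fired_features` and `score`, the tuple layout of a feature key, and the `"suffix=ing"` percept are all hypothetical and chosen only for illustration; nothing here reflects an existing implementation.

```python
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

# Hypothetical feature key: (percept, "full" or "part", label or component).
Feature = Tuple[str, str, Hashable]


def fired_features(percept: str, full_label: str,
                   components: Sequence[Hashable] = ()) -> List[Feature]:
    """Features that fire when one percept is scored with one label."""
    # The full-label feature fires for simple and composite labels alike.
    feats: List[Feature] = [(percept, "full", full_label)]
    # Composite labels add one feature per component.
    feats.extend((percept, "part", c) for c in components)
    return feats


def score(percepts: Iterable[str], full_label: str,
          components: Sequence[Hashable],
          weights: Dict[Feature, float]) -> float:
    """Linear score of a candidate label: sum of weights of fired features."""
    return sum(weights.get(f, 0.0)
               for p in percepts
               for f in fired_features(p, full_label, components))


# A two-component POS tag fires three features per percept.
pos_features = fired_features("suffix=ing", "PN",
                              [("coarse", "P"), ("fine", "N")])
# A simple label fires only the full-label feature.
simple_features = fired_features("suffix=ing", "PN")
```

Because the component features are keyed without the full label, a weight learned for `("coarse", "P")` would be shared by every tag whose coarse component is `P`.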

We assume the output space of the classifier will not be affected by the use of composite labels—only full labels (simple or composite) seen during training will be candidates for prediction.

Interface

Information about label structure could be (a) inferred automatically from the name of the label, (b) specified in the response file, in place of a single string name for the label, or (c) specified in some other file as a mapping of label names to richer structures. The interface proposed here will allow (a) or (b).

Let the option --composite-labels [json|string] [positional|bag] enable this feature:

  • If json (the default format) is specified, then all responses will be read as JSON values. There are three allowed types of responses: JSON strings, lists of strings, and maps from strings to strings. JSON strings are interpreted as simple labels; in a list of strings, each string is a component; and in a map, the key-value pairs are components.
  • If string is specified, then all responses will be read as unquoted strings and treated as composite; the components are individual characters.
  • If positional (the default ordering) is specified, then any sequential composite labels (the label name in string mode, lists in json mode) are treated as ordered slot-fillers; i.e., each component is conjoined with its offset in the sequence.
  • If bag is specified, then any sequential composite labels are interpreted as bags of components; within a label, any repetition of a component will trigger an error. JSON maps are always treated as bags of key-value pairs.
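The four rules above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `components_of`, its parameter names, and the choice to represent a positional component as an `(offset, string)` pair are all hypothetical.

```python
import json
from typing import List


def components_of(response: str, fmt: str = "json",
                  ordering: str = "positional") -> List:
    """Split one raw response into components (empty list for a simple label)."""
    if fmt == "string":
        # Unquoted string: every character is a component.
        parts = list(response)
    else:
        value = json.loads(response)
        if isinstance(value, str):
            return []  # JSON string: simple label, no components
        if isinstance(value, dict):
            if not all(isinstance(v, str) for v in value.values()):
                raise ValueError("map values must be strings")
            # Maps are always bags of key-value pairs, whatever the ordering.
            return sorted(value.items())
        if not (isinstance(value, list)
                and all(isinstance(v, str) for v in value)):
            raise ValueError("response must be a string, list, or map of strings")
        parts = value
    if ordering == "positional":
        # Ordered slot-fillers: conjoin each component with its offset.
        return list(enumerate(parts))
    # Bag: a repeated component within one label is an error.
    if len(set(parts)) != len(parts):
        raise ValueError("repeated component in bag label")
    return sorted(parts)
```

Under these assumptions, `components_of("PN", "string")` and `components_of('["P", "N"]')` both yield `[(0, "P"), (1, "N")]`, while `components_of("NN", "string", "bag")` raises because `N` is repeated.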

Examples

If all labels are length-2 POS tags like NN = noun singular, NS = noun plural, PN = pronoun singular, PS = pronoun plural, etc., the following are equivalent ways to specify the response:

  • PN with --composite-labels string positional (note that bag would conflate the two possible uses of N!)
  • ["P", "N"] with --composite-labels json positional
  • {"coarse": "P", "fine": "N"} with --composite-labels json

If all labels are fixed-length bitstrings, the following are equivalent:

  • 01011 with --composite-labels string positional
  • ["0", "1", "0", "1", "1"] with --composite-labels json positional
  • {"0": "0", "1": "1", "2": "0", "3": "1", "4": "1"} with --composite-labels json

If the labels are clusters of morphosyntactic attributes, then with --composite-labels json bag, the two labels ["noun", "singular", "accusative"] and ["verb", "past", "singular", "causative"] would share exactly one component: features associated with the "singular" component would fire for both.
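The overlap in the morphosyntactic example can be checked with plain set operations. The helper `bag_components` below is a hypothetical name, written only to illustrate bag semantics, including the error on repeated components.

```python
from typing import FrozenSet, Sequence


def bag_components(label: Sequence[str]) -> FrozenSet[str]:
    """Interpret a list-valued label as an unordered bag of components."""
    bag = frozenset(label)
    # In bag mode, repeating a component within one label is an error.
    if len(bag) != len(label):
        raise ValueError("repeated component in bag label")
    return bag


noun_label = bag_components(["noun", "singular", "accusative"])
verb_label = bag_components(["verb", "past", "singular", "causative"])

# Components present in both labels; their features fire for both.
shared = noun_label & verb_label
```

Here `shared` contains only `"singular"`, so a weight attached to that component would contribute to the score of both labels.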

@redpony
Owner

redpony commented Feb 19, 2014

Great write up. I'll try to find someone who is interested in doing this. :)
