This repository contains the implementation of the paper *Investigating the Role of Centering Theory in Neural Coreference Resolution*.
Centering theory (CT; Grosz et al., 1995) provides a linguistic analysis of the structure of discourse. According to the theory, local coherence of discourse arises from the manner and extent to which successive utterances make reference to the same entities. In this paper, we investigate the connection between centering theory and modern coreference resolution systems. We provide an operationalization of centering and systematically investigate whether neural coreference resolvers adhere to the rules of centering theory by defining various discourse metrics and developing a search-based methodology. Our information-theoretic analysis reveals a positive dependence between coreference and centering, but also shows that high-quality neural coreference resolvers may not benefit much from explicitly modeling centering ideas. Our analysis further shows that contextualized embeddings contain much of the coherence information, which helps explain why CT provides only small gains to modern neural coreference resolvers that make use of pretrained representations. Finally, we discuss factors that contribute to coreference but are not modeled by CT, such as commonsense and recency bias.
```bash
conda create --name ct python=3.6 numpy pandas
conda activate ct
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
pip install pytorch-transformers
conda install -c conda-forge python=3.6 allennlp
conda install -c conda-forge allennlp-models
# or: pip install allennlp-models
```
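As a quick sanity check of the environment, you can verify that the spaCy model loads and parses. This is a minimal sketch: the example sentence is arbitrary, and the idea that grammatical roles such as subject and object are read off dependency labels is our assumption, not something stated by the repository.

```python
import spacy

# Load the small English model downloaded above.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Susan gave Betsy a pet hamster.")
for token in doc:
    # Dependency labels such as nsubj / dobj are the kind of signal that
    # subject/object span extraction can be built on (assumption).
    print(token.text, token.pos_, token.dep_)
```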
The operationalization of centering theory is implemented in `ct/centering.py`:
`ConvertedSent`: a class representing the annotations available for a single CoNLL-formatted sentence, or a sentence with coref predictions (an illustrative example follows the attribute list).

- `document_id`: `int`.
- `line_id`: `int`. The true sentence id within the document.
- `words`: `List[str]`. A list of tokens corresponding to this sentence, in the OntoNotes tokenization (which may need to be mapped).
- `clusters`: `Dict[int, List[Tuple[int, int]]]`. Coreference clusters keyed by entity id; each value is a list of mention spans.
- `pos_tags`: `List[str]`. The POS annotation of each word.
- `srl_frames`: `List[Tuple[str, List[str]]]`. For each verb in the sentence, the verb and its PropBank frame labels in BIO format.
- `named_entities`: `List[str]`. The BIO tags for named entities in the sentence.
- `gram_roles`: `Dict[str, List[Tuple[int, int]]]`. The keys are `'subj'` and `'obj'`; the values are lists of spans.
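For illustration, the annotations for a toy sentence might look as follows. This is a hypothetical sketch: the entity ids and spans are invented, spans are `(start, end)` token offsets, and it assumes `ConvertedSent` is importable from `ct.centering` with the keyword arguments shown in the usage section below.

```python
from ct.centering import ConvertedSent

# Hypothetical annotations for "Susan gave Betsy a pet hamster ."
words = ["Susan", "gave", "Betsy", "a", "pet", "hamster", "."]
clusters = {0: [(0, 0)],   # entity 0: "Susan"
            1: [(2, 2)],   # entity 1: "Betsy"
            2: [(3, 5)]}   # entity 2: "a pet hamster"
pos_tags = ["NNP", "VBD", "NNP", "DT", "NN", "NN", "."]
gram_roles = {"subj": [(0, 0)], "obj": [(2, 2), (3, 5)]}

converted_sentence = ConvertedSent(document_id=0,
                                   line_id=0,
                                   words=words,
                                   clusters=clusters,
                                   pos_tags=pos_tags,
                                   gram_roles=gram_roles)
```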
A class representing the annotations for a CoNLL-formatted document.

- `document_id`: `int`.
- `sentences`: `List[ConvertedSent]`.
- `entity_ids`: `List[int]`. A list of the entity ids that appear in this document according to the `clusters` of all the `ConvertedSent`s (see the sketch after this list).
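As a rough sketch of how `entity_ids` relates to the per-sentence `clusters` (an illustration of the relationship described above, not necessarily the repository's exact implementation):

```python
from typing import List

def collect_entity_ids(sentences) -> List[int]:
    """Collect every entity id that occurs in the clusters of any ConvertedSent."""
    seen = set()
    for sent in sentences:
        seen.update(sent.clusters.keys())
    return sorted(seen)
```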
`CenteringUtterance`: a class representing the centering properties of a `ConvertedSent`.

OntoNotes annotations:

- `document_id`: `int`.
- `line_id`: `int`. The true sentence id within the document.
- `words`: `List[str]`. A list of tokens corresponding to this sentence, in the OntoNotes tokenization (which may need to be mapped).
- `clusters`: `Dict[int, List[Tuple[int, int]]]`. Coreference clusters keyed by entity id; each value is a list of mention spans.
- `pos_tags`: `List[str]`. The POS annotation of each word.
- `named_entities`: `List[str]`. The BIO tags for named entities in the sentence.
- `gram_roles`: `Dict[str, List[Tuple[int, int]]]`. The keys are `'subj'` and `'obj'`; the values are lists of spans.
- `semantic_roles`: `Dict[str, List[Tuple[int, int]]]`. The spans of the different semantic roles in this utterance; the keys are `'ARG0'` and `'ARG1'`, and the values are lists of spans.
Utterance-level properties:

- `ranking`: `str`. Either `grl` or `srl`.
- `CF_list`: `List[int]`. The forward-looking centers (Cf) of the utterance, as entity ids, ordered by the chosen ranking.
- `CF_weights`: `Dict[int, float]`. The keys are entity ids and the values are their corresponding weights.
- `CP`: `int`. The highest-ranked element in the `CF_list` (the preferred center, Cp).
Discourse-level properties:

- `CB_list`: `List[int]`. A list of `entity_id`s which are the CB candidates in this utterance.
- `CB_weights`: `Dict[int, float]`. The keys are `entity_id`s and the values are their weights.
- `CB`: `int`. The highest-ranked entity in the `CB_list` (the backward-looking center, Cb).
- `first_CP`: `int`. The first-mentioned entity in the utterance.
- `transition`: `Transition`. The centering transition type of this utterance.
- `cheapness`: `bool`. Whether Cb(Un) = Cp(Un-1).
- `coherence`: `bool`. Whether Cb(Un) = Cb(Un-1).
- `salience`: `bool`. Whether Cb(Un) = Cp(Un).
- `nocb`: `bool`. Whether the `CB_list` is empty.
Note that the `__init__` function automatically sets up all the utterance-level properties (e.g., it creates the `CF_list` with the correct ranking). However, the discourse-level properties need to be set manually.
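For a pair of consecutive utterances, the discourse-level properties follow from the definitions above (together with the standard CT definition of the backward-looking center as the highest-ranked element of Cf(Un-1) realized in Un). The sketch below is illustrative only and is not the repository's exact code:

```python
def set_discourse_properties(prev, curr):
    """Derive the discourse-level CT properties of `curr` (Un) from `prev` (Un-1)."""
    # CB candidates: entities of the current utterance that also appear
    # in the previous utterance's forward-looking centers.
    curr.CB_list = [e for e in curr.CF_list if e in prev.CF_list]
    curr.nocb = len(curr.CB_list) == 0

    # The CB is the candidate ranked highest in the previous utterance.
    curr.CB = (max(curr.CB_list, key=lambda e: prev.CF_weights[e])
               if curr.CB_list else None)

    # Boolean predicates as defined above.
    curr.cheapness = curr.CB is not None and curr.CB == prev.CP  # Cb(Un) = Cp(Un-1)
    curr.coherence = curr.CB is not None and curr.CB == prev.CB  # Cb(Un) = Cb(Un-1)
    curr.salience = curr.CB is not None and curr.CB == curr.CP   # Cb(Un) = Cp(Un)
```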
`CenteringDiscourse`: a class representing a discourse with centering properties.

- `document_id`: `int`.
- `utterances`: `List[CenteringUtterance]`.
- `ranking`: `str`. Either `grl` or `srl`.
- `first_CP`: `int`. The first-mentioned entity in the entire discourse.
- `len`: `int`. The number of utterances in this discourse.
- `salience`: the ratio of salient transitions to all transitions (`len` - 1).
- `coherence`: the ratio of coherent transitions to all transitions (`len` - 1).
- `cheapness`: the ratio of cheap transitions to all transitions (`len` - 1).
- `nocb`: the ratio of transitions with no CB to all transitions (`len` - 1).
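Given the per-utterance predicates, these discourse-level ratios are averages over the `len` - 1 transitions; a minimal sketch (illustrative, not the exact implementation):

```python
def discourse_ratios(utterances):
    """Average the CT predicates over the len-1 transitions U2 ... U_len."""
    transitions = utterances[1:]  # every utterance after the first starts a transition
    n = len(transitions)
    return {
        "nocb": sum(u.nocb for u in transitions) / n,
        "coherence": sum(u.coherence for u in transitions) / n,
        "cheapness": sum(u.cheapness for u in transitions) / n,
        "salience": sum(u.salience for u in transitions) / n,
    }
```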
Create a list of `ConvertedSent` objects:

```python
converted_sentence = ConvertedSent(document_id=document_id,  # int
                                   line_id=line_id,          # int
                                   words=words,              # List[str]
                                   clusters=clusters,
                                   pos_tags=pos_tags,
                                   gram_roles=gram_roles)
```

Add CT properties to the `converted_document` (the list of `ConvertedSent`s) by constructing a `CenteringDiscourse` object:

```python
centeringDiscourse = CenteringDiscourse(converted_document, ranking="grl")
```

Calculate the CT scores:

```python
final_CT_scores, unnormalized_CT_scores = centering.calculate_permutation_scores(centeringDiscourse)
```

- `unnormalized_CT_scores`: `Dict[str, float]`. A dict of unnormalized CT scores, where each score is the ratio of the number of utterances for which a certain CT predicate is true to the total number of utterances.
- `final_CT_scores`: `Dict[str, float]`. A dict of final CT scores, for example `{"nocb": 0, "salience": 0, "coherence": 0, "cheapness": 0, "transition": 0, "kp": 0}`.
```bash
python get_coref_F1.py
```

```bash
python -m ct.ct_ontonotes \
    --experiment-ids gold, coref-spanbert-base-2021.1.5 \
    --epoch best \
    --save-path path/to/coreference/models
```
To ask questions or report problems, please contact yucjiang@ethz.ch.

