This project was developed to extract Stanford coreference chains from an already tokenized corpus, avoiding the need to align the model's output tokenization with the corpus tokenization (the two usually differ). It uses the corpus's tagged data (tokens, sentence boundaries, ...) to create pre-tokenized input in CoreNLP format, which is then fed to the Stanford pipeline while skipping its tokenization step.
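For illustration, here is a minimal CoreNLP sketch of the skip-tokenization idea (not the exact code in this repo): corpus tokens are joined with spaces and sentences separated by newlines, and CoreNLP's `tokenize.whitespace` and `ssplit.eolonly` properties tell the pipeline to respect those boundaries instead of re-tokenizing.

```java
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PreTokenizedCorefExample {
    public static void main(String[] args) {
        // Corpus tokens joined with spaces, one sentence per line,
        // so the corpus tokenization is preserved exactly.
        String text = "Mr. Blackmore said that he would resign .\n"
                + "The announcement surprised nobody .";

        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,coref");
        // Split tokens on whitespace only and sentences on newlines only,
        // effectively skipping CoreNLP's own tokenization and sentence splitting.
        props.setProperty("tokenize.whitespace", "true");
        props.setProperty("ssplit.eolonly", "true");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Coref chains over the pre-tokenized input.
        Map<Integer, CorefChain> chains =
                document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        chains.values().forEach(System.out::println);
    }
}
```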
For AllenNLP/SpaCy coreference resolution in Python, see this repo.
Includes an implementation/example for extracting coreference information from the ECB+ corpus.
Prerequisites:

- Java 1.8
- Gradle
- The ECB+ corpus, needed for running within-document (WD) coref extraction on ECB+ (Download ECB+)
To run on the ECB+ corpus:

- Clone the repo.
- From the command line, navigate to the project root directory and run `./gradlew clean buildCorefJar`. You should get a message saying `BUILD SUCCESSFUL in 25s`.
- Then run:
  `java --add-modules java.se.ee -Xms4096m -Xmx8192m -jar build/libs/stanford-coref-1.0-SNAPSHOT.jar -corpus=ECB+ -output=output/ecb_wd_coref.json -threads=4`
Arguments:

- `-corpus`: the path of the corpus folder (e.g. ECB+)
- `-output`: the file to save the within-document coref JSON to
- `-threads`: the number of threads to run with (default: 2)
To run on a different corpus:

- Clone the repo.
- Inherit from `IDataLoader` and create a new data loader for parsing your corpus (see `EcbDataLoader` for an example, and the sketch after this list).
- Replace the `IDataLoader` instance in the `main()` method of `ExtractStanfordCoref`.
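A rough, hypothetical sketch of the loader step follows, assuming (not verified) that `IDataLoader` exposes a single load method; the actual method names and types are defined in this repo, so check `EcbDataLoader` for the real contract. `CorpusDocument` is a placeholder type introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: the real IDataLoader interface may declare
// different method names and return types (see EcbDataLoader in this repo).
// CorpusDocument is an assumed placeholder for the repo's document type.
public class MyCorpusDataLoader implements IDataLoader {
    @Override
    public List<CorpusDocument> loadData(String corpusFolder) { // assumed signature
        List<CorpusDocument> documents = new ArrayList<>();
        // For each file under corpusFolder: read the gold tokens and sentence
        // boundaries, preserving the original sentence/token IDs so that the
        // output mentions (sent_id, tokens_number) map back to the corpus.
        return documents;
    }
}
```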
The output is in JSON format, containing a list of within-document coreference mentions:
```json
[
  {
    "coref_chain": "0",
    "doc_id": "36_5ecb.xml",
    "sent_id": 4,
    "tokens_number": [1, 2],
    "tokens_str": "Mr. Blackmore"
  },
  {
    "coref_chain": "0",
    "doc_id": "36_5ecb.xml",
    "sent_id": 4,
    "tokens_number": [7],
    "tokens_str": "he"
  },
  ...
]
```
JSON field | Type | Comment |
---|---|---|
coref_chain | Text | the mention's coref cluster ID |
doc_id | Text | the document this mention belongs to |
sent_id | int | the mention's sentence ID in the original document |
tokens_number | List[int] | the original token IDs of the mention span (the phrase given in tokens_str) |
tokens_str | String | the mention/span phrase |
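To consume the output file downstream, here is a minimal sketch assuming the Gson library is on the classpath (this is an illustration, not necessarily how this repo writes or reads the file); the field names mirror the JSON keys described above.

```java
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.io.FileReader;
import java.io.Reader;
import java.lang.reflect.Type;
import java.util.List;

public class ReadCorefOutput {
    // Field names mirror the JSON keys in the output file.
    static class Mention {
        String coref_chain;
        String doc_id;
        int sent_id;
        List<Integer> tokens_number;
        String tokens_str;
    }

    public static void main(String[] args) throws Exception {
        try (Reader reader = new FileReader("output/ecb_wd_coref.json")) {
            // Deserialize the whole file as a list of mention records.
            Type listType = new TypeToken<List<Mention>>() {}.getType();
            List<Mention> mentions = new Gson().fromJson(reader, listType);
            for (Mention m : mentions) {
                System.out.println(m.doc_id + " chain " + m.coref_chain + ": " + m.tokens_str);
            }
        }
    }
}
```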