-
Notifications
You must be signed in to change notification settings - Fork 35
Update Conll-U to fully support and cover EWT dataset #194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ts, paragraphs, and sentences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Some minor changes requested inline.
Would you mind also running the modified files through black
before checking in?
self._sentence_id = None | ||
self._paragraph_id = None | ||
self._doc_id = None | ||
self.conll_09_format = predicate_args |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class field name should start with underscore.
def has_conll_u_metadata(self): | ||
return (self._sentence_id is not None) or (self._paragraph_id is not None) or (self._doc_id is not None) | ||
|
||
def set_conll_u_metadata(self, doc_id: str = None, paragraph_id: str = None, sent_id: str = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind augmenting these fields with a configurable set of key-value pairs? The CoNLL-U spec seems to think the comments before each sentence are supposed to hold arbitrary named attributes; see https://universaldependencies.org/format.html#sentence-boundaries-and-comments
|
||
def add_line_ewt(self, line_num: int, line_elems: List[str]): | ||
""" | ||
:param line_num: Location in file, for error reporting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a note here to tell how this is different from add_line
?
if len(line_elems) < 2 + len(self._column_names): | ||
if len(line_elems) > 2 + self._num_standard_cols: | ||
line_elems.extend(['_' for i in range(2 + len(self._column_names) - len(line_elems))]) | ||
print(f"Unexpected number of elements {len(line_elems)} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this print
statement is supposed to be here.
-> List[List[_SentenceData]]: | ||
""" | ||
|
||
Parses EWT file format to python objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function just for EWT, or is it for anything that meets the CoNLL-U standard? If the former, the function ought to be called _parse_ewt_file
; if the latter, this docstring needs to be updated.
@@ -40,6 +40,7 @@ | |||
# Special token that CoNLL-2003 format uses to delineate the documents in | |||
# the collection. | |||
_CONLL_DOC_SEPARATOR = "-DOCSTART-" | |||
_EWT_DOC_SEPERATOR = "# newdoc id" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this separator is actually officially part of the CoNLL-U standard and isn't just for EWT; see https://universaldependencies.org/format.html#paragraph-and-document-boundaries
elif line_elems[0] == "# sent_id": | ||
sentence_id = line_elems[1] | ||
current_sentence.set_conll_u_metadata(sent_id=sentence_id) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be an additional branch to this if
statement that gathers up any other metadata key-value pairs that the dataset sees fit to attach to the sentence (see https://universaldependencies.org/format.html#sentence-boundaries-and-comments).
It would be nice if there was also a way to have this function map a user-configurable set of additional metadata values to additional columns of the returned DataFrame.
:param merge_subtokens: dictates how to handle tokens that are smaller than one word. By default, we keep | ||
the subtokens as two seperate entities, but if this is set to true, the subtokens will be merged into a | ||
single entity, of the same length as the token, and their attributes will be concatenated | ||
:param merge_subtoken_seperator: If merge subtokens is selected, concatenate the attributes with this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seperator ==> separator
Addresses issue #191 and adds support for importing CoNLL-U data-format files, especially those in EWT, Global Dependencies, and conll_2009 formats, as well as Ontonotes.
Created a separate method from the conll_to_df as a new entry-point for these dataformats, which supports similar options to other available packages supporting .conllu files (such as Spacy). Refactored common code within the io/conll module to separate methods.