Sentence index when splitting long sentences into non-overlapping chunks #98

@nikarjunagi

Description

Hi @mandarjoshi90, thanks so much for this awesome library.

Quick question - I am attempting coreference resolution on a corpus where the word count of many (tokenized) sentences is greater than max_segment_len (say, for spanbert_base with max_segment_len = 384). I am tackling this by splitting such sentences into multiple non-overlapping segments.

My questions:

  1. Is this a valid approach? (in line with your response to another question here: Suggestion for doing coref for longer sequences? #33)
  2. Let’s say the sentence index of a sample long sentence is X. When the tokens of this sentence are split across two segments (S1 and S2), will the sentence index for tokens in both S1 and S2 be X? Or does this need to be handled differently?
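For concreteness, here is a minimal sketch of what I mean by question 2 - splitting one long tokenized sentence into non-overlapping segments while assigning the same sentence index X to the tokens in every resulting segment. The helper `chunk_sentence` is my own illustration, not a function from this library:

```python
def chunk_sentence(tokens, sentence_index, max_segment_len):
    """Split `tokens` into non-overlapping chunks of at most max_segment_len.

    Returns a list of (segment_tokens, sentence_map) pairs. Each sentence_map
    repeats `sentence_index` for every token in its chunk, so tokens in all
    chunks still map back to the same original sentence.
    """
    segments = []
    for start in range(0, len(tokens), max_segment_len):
        chunk = tokens[start:start + max_segment_len]
        sentence_map = [sentence_index] * len(chunk)
        segments.append((chunk, sentence_map))
    return segments

# Example: a 10-token sentence with sentence index X = 7 and a (toy)
# max_segment_len of 4 yields chunks of length 4, 4, and 2, all carrying
# sentence index 7.
tokens = [f"tok{i}" for i in range(10)]
segments = chunk_sentence(tokens, sentence_index=7, max_segment_len=4)
```

Is carrying the index over like this the right thing to do, or should the two halves get distinct indices?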

Thank you.
