@MihaiSurdeanu
Contributor

What do you think, @kwalcock, @myedibleenso?
See the unit test for the expected behavior.

Member

@kwalcock left a comment

I think the code is fine. I'm a little worried about ramifications, but don't let that hold it back.

WHITESPACE=22
SEQ_OF_UNICODES=23
ErrorCharacter=24
'[SB]'=8
Member

This is odd but, I assume, correct.

Member

I.e., line 25.

Contributor Author

It is what ANTLR generates, so I am assuming it is correct.


// found the control string that enforces sentence breaks
// note that this token is NOT added to the sentences produced
else if(crt.word == SENTENCE_BREAK_CONTROL_STRING) {
Member

I would be reassured if this were also dependent on this.useControlStrings or SentenceSplitter.useControlStrings, which would default to false. Those who want to use the feature can turn it on if necessary.
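
To illustrate the opt-in behavior suggested here, the following is a minimal sketch, not the actual SentenceSplitter. The object name, the `split` function, and its `useControlStrings` parameter are hypothetical stand-ins; only the `[SB]` literal comes from this pull request. The toy splitter breaks on the control string alone and ignores every other sentence-boundary rule.

```scala
object ControlStringSketch {
  // Same literal as SENTENCE_BREAK_CONTROL_STRING in this pull request
  val SentenceBreak = "[SB]"

  // Hypothetical toy splitter over already-tokenized words.
  // It only illustrates gating the control string behind a flag that defaults to false.
  def split(tokens: Seq[String], useControlStrings: Boolean = false): Seq[Seq[String]] = {
    val sentences = Seq.newBuilder[Seq[String]]
    var current = Vector.empty[String]
    for (token <- tokens) {
      if (useControlStrings && token == SentenceBreak) {
        // The control string closes the current sentence and is itself dropped
        if (current.nonEmpty) sentences += current
        current = Vector.empty[String]
      } else {
        // With the flag off, "[SB]" is kept as an ordinary token
        current = current :+ token
      }
    }
    if (current.nonEmpty) sentences += current
    sentences.result()
  }
}
```

With the flag left at its default, existing callers would see no change; only callers that pass `useControlStrings = true` get the forced break.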

Contributor Author

Good idea. Can you please add it?

Member

Can this be set without needing to create a custom Processor?

Contributor Author

No, because we would need to adjust the corresponding ANTLR grammar. Unless we come up with a generic format for the control string, e.g., anything between square brackets? Or anything between double square brackets, e.g., [[SB]]? Then we could let people set the string to whatever value they want.
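
As a rough sketch of what such a generic format could look like on the Scala side, the snippet below recognizes any token of the shape `[[NAME]]`. The object name, the `controlName` helper, and the exact pattern (a non-empty name with no square brackets inside) are assumptions made for illustration. The lexer grammar would presumably still need a matching rule, which is not shown here.

```scala
object GenericControlStringSketch {
  // Assumed convention: a non-empty, bracket-free name wrapped in double square brackets
  private val ControlString = """\[\[([^\[\]]+)\]\]""".r

  // Returns the control name when the whole token has the assumed [[NAME]] shape
  def controlName(token: String): Option[String] = token match {
    case ControlString(name) => Some(name)
    case _ => None
  }
}
```

For example, `GenericControlStringSketch.controlName("[[SB]]")` yields `Some("SB")`, while the current single-bracket `[SB]` yields `None` under this assumed convention.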


// Control string that enforces a sentence break
// If you change this value, change also the SENTENCEBREAK in OpenDomainLexer.g to the same value (and recompile the Antlr grammar)
val SENTENCE_BREAK_CONTROL_STRING = "[SB]"
Member

So the procedure would be to figure out before tokenization (that is, before Processor.mkDocument) where the control strings should go, such as wherever there is a <br>, and change those to [SB]. These two strings happen to be the same length, so one could take the resulting Document and substitute the old text for the new text in order to preserve the original. If the strings are different lengths, all the offsets will be off and the substitution won't work. We would lose (easy) access to the original document text. Will that be a problem?
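
The length argument can be checked with a small sketch that uses plain string operations only. The object and function names are hypothetical and nothing here calls the Processor API. It shows that a same-length marker such as `<br>` leaves character offsets intact, while a one-character newline does not.

```scala
object OffsetPreservingSketch {
  // SENTENCE_BREAK_CONTROL_STRING from this pull request
  val ControlString = "[SB]"

  // Rewrite every occurrence of `marker` into the control string before tokenization
  def toControlStrings(text: String, marker: String): String =
    text.replace(marker, ControlString)

  // Offsets computed on the rewritten text stay valid for the original
  // only when the rewrite did not change the text length
  def offsetsPreserved(original: String, rewritten: String): Boolean =
    original.length == rewritten.length
}

object OffsetPreservingDemo {
  import OffsetPreservingSketch._

  val html = "First line<br>Second line"
  val htmlRewritten = toControlStrings(html, "<br>")   // same length: offsets survive

  val plain = "First line\nSecond line"
  val plainRewritten = toControlStrings(plain, "\n")   // 1 char becomes 4: offsets shift
}
```

In the first case the original text can be put back over the Document's text without touching any offsets; in the second case every offset after the first replacement is shifted by three characters.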

Contributor Author

I think it is likely that this will change offsets (e.g., when replacing newlines with '[SB]'). Users need to be aware of this.

Member

I'm not sure we need to worry about supporting reinsertion of the original token in this case of sentence boundaries (at least not at this stage).

That said, I think it is something we should think about supporting for cases where a user wants to preserve unrecognized tokens (e.g., through re-insertion).

Member

@kwalcock, would you feel more comfortable using a control string with higher entropy (e.g., <[*^[SB]^*]>)?

Member

I'm happy with it just being off by default. If someone turns it on (a simple SentenceSplitter.useControlString = true) and it's important, they can be responsible for making sure that the control string is not already in their text and, if necessary, escaping it before and unescaping it after, etc.
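
One possible shape for that escape/unescape step is sketched below. The object name and the placeholder string are hypothetical, and the sketch assumes the caller can pick a placeholder that never occurs in the input. Note that the placeholder is longer than `[SB]`, so escaping changes offsets just like any other length-changing rewrite.

```scala
object EscapeSketch {
  // SENTENCE_BREAK_CONTROL_STRING from this pull request
  val ControlString = "[SB]"

  // Hypothetical placeholder for literal occurrences; it must not occur in the input text
  val Placeholder = "[SB-LITERAL]"

  // Run before inserting real control strings, so literal "[SB]" text is not treated as a break
  def escape(text: String): String = {
    require(!text.contains(Placeholder), "placeholder already occurs in the text")
    text.replace(ControlString, Placeholder)
  }

  // Run on the processed text afterwards to restore the literal occurrences
  def unescape(text: String): String = text.replace(Placeholder, ControlString)
}
```

The round trip `EscapeSketch.unescape(EscapeSketch.escape(text))` returns the original text whenever the `require` check passes.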

We do in general have cases in which provenance is important and the original text needs to be preserved. This new feature is still useful and can be used when the original is not so important, though.

@myedibleenso
Member

@MihaiSurdeanu @kwalcock, I needed this again today and it got me thinking: it would be helpful to have a test for what we expect the value of a Document's text to be when the control string is used:

c/o B.A.Z. Bub[SB]
Morning Star Industries, Ltd.[SB]
666 Ring of Fire Circle[SB]
Lake of Fire, AZ 85666[SB]
signs you might be living in a simulation (recognize these warning signs)...[SB]
  - the earliest sound you remember from your childhood is the Windows startup theme[SB]
  - ....[SB]

@kwalcock
Member

@myedibleenso, can you check TestMkCombinedDocuments?
