Modernize with immutable Document and Sentence #842

kwalcock · 2025-05-27T01:46:42Z

No description provided.

kwalcock · 2025-05-27T02:10:05Z

Please ignore the individual commits and instead look at the file differences. This updates the project to be compatible with the current LTS (long-term support) version of Scala, which is 3.3.6, and its conventions. It sacrifices some compatibility with older versions in that it insists that both Document and Sentence be immutable. In this version they have no vars or exposed Arrays. The Seqs are usually implemented by ArraySeq, which wraps the underlying array for read-only access. (A version for Scala 3 and later only could use an IArray here for efficiency.) The most invasive changes are in BalaurProcessor (including some remaining TODOs), followed by the numerous updates from Array to Seq in method signatures. Very few algorithmic changes were needed, but they should be regression tested. Many more changes are possible, especially if the Stanford stack no longer needs to be supported. I resisted splitting Sentence into TokenizedSentence and AnnotatedSentence versions, but that would be a logical next step.
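To illustrate the "no vars or exposed Arrays" idea, here is a minimal sketch. SimpleSentence and fromArrays are hypothetical names made up for this example and are not the library's actual Sentence API; the sketch assumes Scala 2.13 or 3, where ArraySeq.unsafeWrapArray is available.

```scala
import scala.collection.immutable.ArraySeq

// Hypothetical, simplified stand-in for the library's Sentence; not its real API.
final case class SimpleSentence(
  words: Seq[String],
  startOffsets: Seq[Int],
  endOffsets: Seq[Int],
  tags: Option[Seq[String]] = None
)

object SimpleSentenceDemo {
  // Wrap freshly built arrays without copying. The caller must not keep
  // and mutate the arrays afterwards, hence "unsafe" in the method name.
  def fromArrays(words: Array[String], starts: Array[Int], ends: Array[Int]): SimpleSentence =
    SimpleSentence(
      ArraySeq.unsafeWrapArray(words),
      ArraySeq.unsafeWrapArray(starts),
      ArraySeq.unsafeWrapArray(ends)
    )

  def main(args: Array[String]): Unit = {
    val sentence = fromArrays(Array("Hello", "world"), Array(0, 6), Array(5, 11))
    // There are no vars to assign to: "changing" a field means building a new value.
    val tagged = sentence.copy(tags = Some(Seq("UH", "NN")))
    println(sentence.tags.isDefined) // the original sentence is untouched
    println(tagged.words.mkString(" "))
  }
}
```

Readers of such a sentence only see the Seq interface, which has no element assignment, so the wrapped arrays stay private to whoever built them.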

kwalcock · 2025-05-27T02:31:50Z

One disadvantage is that accessing Scala Seqs from Java is more complicated than accessing an Array. This shows especially in ProcessorsJavaExample.java (https://github.com/clulab/processors/blob/kwalcock/processors2b/apps/src/main/java/org/clulab/processors/apps/ProcessorsJavaExample.java).
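One possible way to soften this, sketched here as an assumption rather than something this PR does, is a small Scala-side helper that hands Java callers a java.util.List view. JavaFriendlyViews and wordsAsJava are hypothetical names; asJava comes from scala.jdk.CollectionConverters in Scala 2.13 and 3.

```scala
import java.util.{List => JList}
import scala.jdk.CollectionConverters._
import scala.util.Try

object JavaFriendlyViews {
  // Hypothetical helper: present an immutable Seq as a java.util.List.
  // asJava wraps the Seq rather than copying its elements.
  def wordsAsJava(words: Seq[String]): JList[String] = words.asJava

  def main(args: Array[String]): Unit = {
    val javaWords = wordsAsJava(Seq("Hello", "world"))
    println(javaWords.get(0))
    println(javaWords.size())
    // The view is read-only, so an attempt to add an element fails.
    println(Try(javaWords.add("!")).isFailure)
  }
}
```

Java code could then iterate over the result with an ordinary for-each loop instead of working with the Scala collection API directly.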

MihaiSurdeanu

This looks great. Thank you @kwalcock !
Is the switch from Array to Seq to enforce immutability?

kwalcock · 2025-05-27T17:22:40Z

@MihaiSurdeanu, it is partly to enforce immutability, but almost as much to document it. Scala 2.13 and 3 compel one to do this: they complain significantly and make copies of the arrays in cases like

val seq: Seq[Int] = Array(1, 2)

which is what all the implicit conversions in, e.g., https://github.com/clulab/processors/tree/kwalcock/processors2b/library/src/main/scala-3/org/clulab/scala are meant to avoid. This update uses the newer ArraySeq from Scala 2.13 and Scala 3 to handle the issue and backports it to earlier Scala 2 versions using scala-collection-compat. A more thorough update would make the other implicit conversions unnecessary, but that's too big a step just now.
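The difference between copying and wrapping can be seen in a small sketch, assuming Scala 2.13 or 3. ArrayToSeqDemo, copyOf, and wrap are hypothetical names for this example; toIndexedSeq and ArraySeq.unsafeWrapArray are standard library calls. The explicit toIndexedSeq stands in for the copy that, to my understanding, the implicit Array-to-Seq conversion performs.

```scala
import scala.collection.immutable.ArraySeq

object ArrayToSeqDemo {
  // Explicit copy: the resulting Seq has its own storage.
  def copyOf(array: Array[Int]): Seq[Int] = array.toIndexedSeq

  // No copy: the resulting Seq shares storage with the array.
  def wrap(array: Array[Int]): Seq[Int] = ArraySeq.unsafeWrapArray(array)

  def main(args: Array[String]): Unit = {
    val array = Array(1, 2)
    val copied = copyOf(array)
    val wrapped = wrap(array)

    // Mutating the array afterwards shows which Seq shares its storage.
    array(0) = 42
    println(copied.head)  // still the old value, because the data was copied
    println(wrapped.head) // the new value, because the data is shared
  }
}
```

Wrapping avoids the cost of the copy, but it is only safe when nothing else holds on to the array and mutates it later.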

kwalcock · 2025-05-27T17:27:50Z

library/src/main/scala/org/clulab/processors/clu/BalaurProcessor.scala

-        tree = None,
-        deps = EMPTY_GRAPH,
-        relations = None
+        words, // TODO: Why isn't this raw?


Here is a worrisome TODO.

MihaiSurdeanu · 2025-05-28

Yes, I don't know where that's from... Is any code using this variable? It may be a weird leftover... If Reach or Habitus do not require it, I'm OK with removing it.

kwalcock · 2025-05-28

Below is the complete code. A Sentence is temporarily created in order to call lexiconNer.find(sentence). That generally only uses the lemmas or words through LexiconNER.getTokens, which never accesses the raw field anyway. However, the normal constructor arguments for a Sentence are raw, starts, ends, and words, in that order.

processors/library/src/main/scala/org/clulab/processors/clu/BalaurProcessor.scala
Lines 207 to 223 in e986420

    private def mkNerLabelsOpt(
      words: Seq[String], startOffsets: Seq[Int], endOffsets: Seq[Int],
      tags: Seq[String], lemmas: Seq[String]
    ): Option[Seq[String]] = {
      lexiconNerOpt.map { lexiconNer =>
        val sentence = Sentence(
          words, // TODO: Why isn't this raw?
          startOffsets,
          endOffsets,
          words,
          Some(tags),
          Some(lemmas)
        )

        lexiconNer.find(sentence)
      }
    }

MihaiSurdeanu · 2025-05-28

I vaguely remember this... I think some lexicon NER dictionaries operate over lemmas rather than words, which required such weird workarounds...

kwalcock · 2025-05-27T17:29:16Z

library/src/main/scala/org/clulab/processors/clu/BalaurProcessor.scala


+        partlyAnnotatedSentence
+      }
+      // TODO: Improve error handling.


I'd like to do something better here.

MihaiSurdeanu · 2025-05-28T08:53:57Z

Thank you!

MihaiSurdeanu

Thank you @kwalcock !

kwalcock · 2025-05-30T15:39:48Z

The ColumnsToDocument code included at least two places where it is implied that a function manipulates a sentence or document:

setLabels(s, labels.toArray)

val d = new Document(sentences.toArray)
annotate(d)

That doesn't work anymore, so the function signatures were changed. ColumnsToDocument is in the apps subproject even though it isn't an application. It is not used anywhere in the project that I can find. Is it then library code that some client is expected to make use of? Perhaps it should be moved to the library subproject.
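The kind of signature change involved might look like the following sketch. Sent, Doc, withLabels, and the two-argument annotate are hypothetical, simplified names for illustration; the actual ColumnsToDocument signatures may differ.

```scala
// Hypothetical, simplified types; the real Sentence and Document have more fields.
final case class Sent(words: Seq[String], labels: Option[Seq[String]] = None)
final case class Doc(sentences: Seq[Sent])

object ImmutableUpdateDemo {
  // Before: setLabels(s, labels) changed s in place and returned Unit.
  // After: the function returns a new sentence and leaves its argument alone.
  def withLabels(sentence: Sent, labels: Seq[String]): Sent =
    sentence.copy(labels = Some(labels))

  // Likewise, an annotation step returns a new document instead of mutating d.
  // Here it simply assigns the same label to every word.
  def annotate(doc: Doc, label: String): Doc =
    Doc(doc.sentences.map(s => withLabels(s, s.words.map(_ => label))))

  def main(args: Array[String]): Unit = {
    val doc = Doc(Seq(Sent(Seq("Hello", "world"))))
    val annotated = annotate(doc, "O")
    println(doc.sentences.head.labels)       // unchanged original
    println(annotated.sentences.head.labels) // labels on the new copy
  }
}
```

Callers then have to use the returned value, for example val d2 = annotate(d, "O"), rather than relying on d having been changed.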

MihaiSurdeanu · 2025-05-30T19:06:14Z

I used it, I think, when I created the training data for the parser. That can be easily adjusted.

kwalcock added 30 commits May 9, 2025 23:26

Update scala and sbt

5225cc6

Clean up BalaurProcessor

1d18c8a

Stop assigning to a val in Document

b540f25

Pass the tests

7543b9c

Compile for Scala 3

9ccca36

Pass Scala3 tests

98c8115

NumericUtils

211cd2a

GraphMapType

8ea1301

Scala2

1996cf3

Check in Balaur as well

39a9dca

Start with very basic compatibility

cec4087

Down to last 13

57d1fa5

Finish compiling library

ed80611

Compile for other Scala versions

741307c

Compile other projects for other Scalas

38369e3

Compile tests

deb244b

Pass tests

e9876cf

Clean, get webapp to work

bec8f18

Remove dead code

737e538

Maintenance

4cfd518

Compile with no warnings, some renamed processor variables, view changes

Document, Sentence

dbfe52b

Balaur

2c19b03

Remove Scala-specific GraphMap

55eb202

More GraphMap

3c3f3db

SeqView again

61d871d

Remove spaces

e9979ea

Update sbt again

9c80f42

Fix a toSeq

70b031b

Account for immutable doc in some tests

de0041f

Move evaluation resources to app

db9b5e5

Fix test

7239f2b

MihaiSurdeanu approved these changes May 27, 2025

kwalcock added 2 commits May 27, 2025 09:50

Make DocumentAttachments immutable

8b1c2f3

Fix test compilation warning

e986420

kwalcock added 5 commits May 29, 2025 08:39

Use Option.when

7d4fec1

Extract the DocumentPrinter

4f06301

Clean up DocumentMaker

0b33f20

Clean it up more Get rid of debug files

Fix ColumnsToDocument

276e894

Remove unused and duplicate code in NumericUtils

0b93379

MihaiSurdeanu approved these changes May 30, 2025

Fix typos

e826275

kwalcock added 3 commits June 2, 2025 10:21

Combine named entity without exposing array

b355af7

Update sbt again

000f0ed

Fix test

1432985
