Skip to content

Move Text class to libs/xcontent #128780

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jordan-powers
Copy link
Contributor

This PR is a precursor to #126492.

It does three things:

  1. Move org.elasticsearch.common.text.Text from :server to org.elasticsearch.xcontent.Text in :libs:x-content.
  2. Refactor the Text class to use a new EncodedBytes record instead of the elasticsearch BytesReference.
  3. Add the XContentString interface, with the Text class implement that interface.

These changes were originally implemented in #127666 and #128316, however they were reverted in #128484 due to problems caused by the mutable nature of java ByteBuffers. This is resolved by instead using a new immutable EncodedBytes record.

@jordan-powers jordan-powers requested review from prdoyle and ldematte June 2, 2025 18:31
@jordan-powers jordan-powers requested a review from a team as a code owner June 2, 2025 18:31
@jordan-powers jordan-powers added :Core/Infra/Core Core issues without another label auto-backport Automatically create backport pull requests when merged v8.19.0 v9.1.0 labels Jun 2, 2025
@jordan-powers jordan-powers self-assigned this Jun 2, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Jun 2, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)


import java.nio.charset.StandardCharsets;

public class TextTests extends ESTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty good test suite!

Optional: Perhaps consider some tests that use randomization to do a sequence of operations (like string(), stringLength(), bytes()) and assert that the result is right?

This is probably not necessary, since you've already added regression tests for the sequences that bit us before.

readBytes(bytes, 0, length);
}
var encoded = new XContentString.EncodedBytes(bytes);
return new Text(encoded);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like repeated code that could benefit from being pulled into a helper method, like we did with readBytesReference.

@prdoyle
Copy link
Contributor

prdoyle commented Jun 3, 2025

For the record: why add a new EncodedBytes record instead of moving BytesReference into xcontent?

@jordan-powers
Copy link
Contributor Author

jordan-powers commented Jun 3, 2025

We can't move BytesReference into libs/xcontent because it has several references to lucene's BytesRef, and lucene is not included in the xcontent project.

Copy link
Contributor

@prdoyle prdoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my comments are either optional, or pertain to javadocs, or don't affect production. I'd like to see the javadocs fixed up, but otherwise this is mergeable.

private int stringLength = -1;

/**
* Construct a Text from a UTF-8 encoded ByteBuffer. Since no string length is specified, {@link #stringLength()}
Copy link
Contributor

@prdoyle prdoyle Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment correct? It's not using a ByteBuffer now.

Also I wonder if the fact that it's UTF-8 should be reflected somewhere in the naming? Perhaps EncodedBytes could be Utf8Bytes? I'm not sure whether that's appropriate but it's a thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF8Bytes makes sense to me! I'll rename it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the implementation of EncodedBytes, that class does not seem to be aware of UTF-8 in any way, so renaming it is probably not appropriate after all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't really have an opinion anymore either way TBH.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still leaning towards calling it UTF8Bytes because its main purpose is as a return type from XContentString#bytes(), which specifies in the javadoc that it's always UTF-8 encoded.

public Text(BytesReference bytes) {
/**
* Construct a Text from a UTF-8 encoded ByteBuffer and an explicit string length. Used to avoid string conversion
* in {@link #stringLength()}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the javadocs here should answer this question: Is there a requirement that the stringLength matches the value that would have been produced by the other constructor? Or is the caller free to specify a different length?

return text == null ? bytes.utf8ToString() : text;
if (text == null) {
var byteBuff = ByteBuffer.wrap(bytes.bytes(), bytes.offset(), bytes.length());
text = StandardCharsets.UTF_8.decode(byteBuff).toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re my comment above: I think we could follow this with:

assert (stringLength < 0) || (stringLength == text.length());

This would catch cases where the wrong string length is specified at construction time.

@@ -36,31 +31,45 @@ public static Text[] convertFromStringArray(String[] strings) {
return texts;
}

private BytesReference bytes;
private EncodedBytes bytes;
private String text;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I wonder why this is called text when we refer to it as "string" everywhere else?)


@Override
public int compareTo(EncodedBytes o) {
return ByteBuffer.wrap(bytes, offset, length).compareTo(ByteBuffer.wrap(o.bytes, o.offset, o.length));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random thought: Is there any value in peeling off a fast-path case for when all the fields are identical by ==? Not sure whether this ever actually happens, but it would avoid two object allocations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me, especially since equal() delegates to compareTo(), so this could happen fairly frequently.

Copy link
Contributor

@prdoyle prdoyle Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, essentially you're making sure you're no slower than the built-in Record.equals() for the case that it returns true. Could be a little slower for the false case.

String string();

/**
* Returns a UTF8-encoded {@link ByteBuffer} view of the data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a ByteBuffer.


import java.nio.charset.StandardCharsets;

public class TextTests extends ESTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty good test suite!

Optional: Perhaps consider some tests that use randomization to do a sequence of operations (like string(), stringLength(), bytes()) and assert that the result is right?

This is probably not necessary, since you've already added regression tests for the sequences that bit us before.

assertEquals(value, text.string());
assertEquals(encoded, text.bytes());

assertSame(text.bytes(), text.bytes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tricky!

assertEquals(value, text.string());
assertEquals(encoded, text.bytes());

assertSame(encoded, text.bytes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we could just change all prior assertEquals on encoded to be assertSame instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to keep the assertEquals to verify the equals() method. I could see a situation where equals() is wrong or messes up the internal state somehow.

@prdoyle
Copy link
Contributor

prdoyle commented Jun 4, 2025

Oh one last question:

Did you ever try your new unit test on the old version of Text from before your changes? (The goal being to ensure the behaviour us unchanged, as opposed to merely reasonable.)

@jordan-powers
Copy link
Contributor Author

@prdoyle Thanks for the review! I haven't yet, but I agree it makes sense to run the new test against the old implementation, so I'll be sure to do that.

We actually can't use the native java ByteBuffer class here because it's
a mutable class (since it keeps track of how many bytes have been
consumed), which creates concurrency issues.
@jordan-powers
Copy link
Contributor Author

Ok, I rebased the PR so that the TextTests were added before the updates to the Text class. I then checked out f72a133 and ran the tests against the old implementation and everything passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged :Core/Infra/Core Core issues without another label >non-issue Team:Core/Infra Meta label for core/infra team v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants