4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Full support for the `HIGHCHARUNICODE` compiler directive.

## [1.17.2] - 2025-07-03

### Fixed
@@ -20,6 +20,8 @@

import au.com.integradev.delphi.antlr.ast.visitors.DelphiParserVisitor;
import au.com.integradev.delphi.preprocessor.TextBlockLineEndingMode;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.stream.Collectors;
@@ -28,6 +30,7 @@
import org.apache.commons.lang3.Strings;
import org.sonar.plugins.communitydelphi.api.ast.DelphiNode;
import org.sonar.plugins.communitydelphi.api.ast.TextLiteralNode;
import org.sonar.plugins.communitydelphi.api.directive.SwitchDirective.SwitchKind;
import org.sonar.plugins.communitydelphi.api.token.DelphiTokenType;
import org.sonar.plugins.communitydelphi.api.type.IntrinsicType;
import org.sonar.plugins.communitydelphi.api.type.Type;
@@ -167,26 +170,38 @@ private String createSingleLineValue() {
return imageBuilder.toString();
}

private static char characterEscapeToChar(String image) {
private boolean isHighCharUnicode() {
return getAst()
.getDelphiFile()
.getCompilerSwitchRegistry()
.isActiveSwitch(SwitchKind.HIGHCHARUNICODE, getTokenIndex());
}

public Charset getAnsiCharset() {
Collaborator

I think we should have a CharsetUtils class (or something similar) in frontend for this.
I'd also call these nativeCharset instead of ansiCharset.
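
To make the suggestion concrete, here is a minimal sketch of what such a helper might look like. The class name `CharsetUtils`, the method name `nativeCharset`, and the fallback behavior are all assumptions for illustration and not existing plugin API. The fallback exists because, as far as I know, the `native.encoding` system property is only set on newer JDKs (17 and later), and `Charset.forName(null)` throws.

```java
import java.nio.charset.Charset;

/** Hypothetical utility class; name and location are assumptions, not existing plugin API. */
final class CharsetUtils {
  private CharsetUtils() {
    // static helpers only
  }

  /**
   * Returns the charset named by the "native.encoding" system property, or the JVM default
   * charset when the property is missing, blank, or names an unsupported charset.
   */
  static Charset nativeCharset() {
    String name = System.getProperty("native.encoding");
    if (name == null || name.isBlank()) {
      return Charset.defaultCharset();
    }
    try {
      return Charset.forName(name);
    } catch (IllegalArgumentException e) {
      // Covers both illegal and unsupported charset names.
      return Charset.defaultCharset();
    }
  }
}
```

With something like this in place, `getAnsiCharset()` could delegate to `CharsetUtils.nativeCharset()`, which would also keep the test's spy-based override working.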

return Charset.forName(System.getProperty("native.encoding"));
Collaborator

Assuming that the compiler is actually using the configured codepage to interpret these escapes, we should:

  • read DCC_Codepage from dproj files
  • emit a warning if conflicting DCC_Codepage values are found
  • expose an analyzer property that overrides DCC_Codepage, so that the dproj values are ignored altogether
  • fall back to the system's native encoding if none of these are provided
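
The precedence described in the list above could be sketched roughly as follows. Everything here is hypothetical: `CodepageResolver`, its parameters, and the choice to use the first dproj value when values conflict are assumptions made for illustration. Mapping a numeric DCC_Codepage value to a Java `Charset` is deliberately left out, so the sketch takes charsets that have already been resolved.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper sketching the proposed lookup order; not part of the plugin. */
final class CodepageResolver {
  private CodepageResolver() {
    // static helpers only
  }

  /**
   * @param override charset from a (hypothetical) analyzer property, if one was set
   * @param dprojCharsets charsets derived from DCC_Codepage values found in dproj files
   * @param nativeFallback the system's native encoding
   * @param warnings sink that receives a message when dproj values conflict
   */
  static Charset resolve(
      Optional<Charset> override,
      List<Charset> dprojCharsets,
      Charset nativeFallback,
      List<String> warnings) {
    // 1. An explicit override wins and the dproj values are not consulted at all.
    if (override.isPresent()) {
      return override.get();
    }

    // 2. Otherwise use the dproj value, warning when more than one distinct value exists.
    Set<Charset> distinct = new LinkedHashSet<>(dprojCharsets);
    if (distinct.size() > 1) {
      warnings.add("Conflicting DCC_Codepage values found: " + distinct);
    }
    if (!distinct.isEmpty()) {
      // Assumption: the first value encountered is used when values conflict.
      return distinct.iterator().next();
    }

    // 3. Nothing configured anywhere: fall back to the native encoding.
    return nativeFallback;
  }
}
```

What to do on conflict (first value, native fallback, or failing the analysis) is an open design choice that the list above does not settle.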

}

private char characterEscapeToChar(String image) {
image = image.substring(1);
int radix = 10;

switch (image.charAt(0)) {
case '$':
radix = 16;
image = image.substring(1);
break;
case '%':
radix = 2;
image = image.substring(1);
break;
default:
// do nothing
if (image.charAt(0) == '$') {
radix = 16;
image = image.substring(1);
}
Comment on lines +184 to 191
Collaborator

Let's keep the handling for binary character escapes even though they're currently not syntactically valid.
Our grammar allows it, and it's arguably a bug that the compiler doesn't currently recognize them. I think it's likely a future Delphi version will fix that.
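
For reference, a standalone sketch of the prefix handling with the binary case kept. It mirrors the `switch` that this diff removes; the class and method names (`EscapeRadix`, `parseCode`) are hypothetical, and plain `String.replace` stands in for `StringUtils.remove` so the snippet has no dependencies.

```java
/** Hypothetical standalone version of the radix detection, with binary escapes retained. */
final class EscapeRadix {
  private EscapeRadix() {
    // static helpers only
  }

  /** Parses an escape image such as "#111", "#$61" or "#%01100001" into its numeric code. */
  static int parseCode(String image) {
    String digits = image.substring(1); // drop the leading '#'
    int radix = 10;

    switch (digits.charAt(0)) {
      case '$':
        radix = 16;
        digits = digits.substring(1);
        break;
      case '%':
        // Accepted by the grammar, even though the compiler currently rejects it.
        radix = 2;
        digits = digits.substring(1);
        break;
      default:
        // decimal: nothing to strip
        break;
    }

    // Digit separators are allowed in the image but not by Integer.parseInt.
    digits = digits.replace("_", "");
    return Integer.parseInt(digits, radix);
  }
}
```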


image = StringUtils.remove(image, '_');
char character = (char) Integer.parseInt(image, radix);

return (char) Integer.parseInt(image, radix);
if (isHighCharUnicode() || character > 255) {
// With HIGHCHARUNICODE ON, all escapes are interpreted as UTF-16.
// Escapes above 255 are always interpreted as UTF-16.
return character;
} else {
// With HIGHCHARUNICODE OFF, escapes between 0-255 are interpreted in the system code page.
Comment on lines +196 to +201
Collaborator

According to the HIGHCHARUNICODE documentation, only the #128-#255 range is affected.

This wouldn't seem to matter if you're only thinking about single-byte ANSI codepages that are supersets of ASCII.
However, there are multi-byte ANSI codepages that aren't supersets of ASCII (I think?), and this aspect of the behavior would seem to matter when interpreting those.

Collaborator

With that being said, I've looked at the Shift_JIS character table and found that the first 127 characters are still ASCII.
Maybe it really doesn't matter and I'm just getting muddled up with the fact that there are codepages that aren't binary supersets of ASCII.

Even so, we should probably follow what the documentation says.
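
A sketch of what following the documented range might look like: only codes 128 to 255 go through the ANSI charset, and everything else is taken as a UTF-16 code unit. `HighCharDecoder` and its parameters are hypothetical names; in the PR the flag and charset would come from `isHighCharUnicode()` and `getAnsiCharset()`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

/** Hypothetical helper; restricts code page decoding to the documented #128-#255 range. */
final class HighCharDecoder {
  private HighCharDecoder() {
    // static helpers only
  }

  static char decode(int code, boolean highCharUnicode, Charset ansiCharset) {
    if (highCharUnicode || code < 128 || code > 255) {
      // HIGHCHARUNICODE ON, plain ASCII, or a wide escape: use the value as UTF-16 directly.
      return (char) code;
    }

    // HIGHCHARUNICODE OFF and 128 <= code <= 255: interpret the single byte in the code page.
    ByteBuffer buffer = ByteBuffer.wrap(new byte[] {(byte) code});
    return ansiCharset.decode(buffer).get();
  }
}
```

Note that `Charset.decode` substitutes a replacement character for bytes that are unmapped or incomplete in the given charset (for example, a lone lead byte in a multi-byte code page), so the result for such inputs is a placeholder rather than an error.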

var buffer = ByteBuffer.allocate(1).put((byte) character).flip();
return getAnsiCharset().decode(buffer).get();
}
}

@Override
@@ -20,17 +20,24 @@

import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.ArgumentMatchers.anyInt;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.when;

import au.com.integradev.delphi.antlr.DelphiLexer;
import au.com.integradev.delphi.antlr.ast.DelphiAstImpl;
import au.com.integradev.delphi.file.DelphiFile;
import au.com.integradev.delphi.preprocessor.CompilerSwitchRegistry;
import au.com.integradev.delphi.preprocessor.TextBlockLineEndingMode;
import au.com.integradev.delphi.preprocessor.TextBlockLineEndingModeRegistry;
import java.nio.charset.Charset;
import org.antlr.runtime.CommonToken;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;
import org.sonar.plugins.communitydelphi.api.ast.DelphiNode;
import org.sonar.plugins.communitydelphi.api.directive.SwitchDirective.SwitchKind;

class TextLiteralNodeImplTest {
@Test
@@ -59,22 +66,45 @@ void testMultilineImage() {
assertThat(node.isMultiline()).isTrue();
}

@Test
void testGetImageWithCharacterEscapes() {
TextLiteralNodeImpl node = new TextLiteralNodeImpl(DelphiLexer.TkTextLiteral);
@ParameterizedTest(name = "HIGHCHARUNICODE = {0}")
@ValueSource(booleans = {true, false})
void testGetImageWithCharacterEscapes(boolean highCharUnicode) {
Collaborator

The switch also affects the type of the expression - not just the way the string is interpreted.

var registry = mock(CompilerSwitchRegistry.class);
when(registry.isActiveSwitch(eq(SwitchKind.HIGHCHARUNICODE), anyInt()))
.thenReturn(highCharUnicode);
var file = mock(DelphiFile.class);
when(file.getCompilerSwitchRegistry()).thenReturn(registry);
var ast = mock(DelphiAstImpl.class);
when(ast.getDelphiFile()).thenReturn(file);

TextLiteralNodeImpl node = spy(new TextLiteralNodeImpl(DelphiLexer.TkTextLiteral));
when(node.getAnsiCharset()).thenReturn(Charset.forName("windows-1252"));
node.setParent(ast);

node.addChild(createNode(DelphiLexer.TkQuotedString, "'F'"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#111"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#111"));
node.addChild(createNode(DelphiLexer.TkQuotedString, "'B'"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#$61"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#$72"));
node.addChild(createNode(DelphiLexer.TkQuotedString, "'B'"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#%01100001"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#%01111010"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#$80"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#$98"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#$A3"));
node.addChild(createNode(DelphiLexer.TkCharacterEscapeCode, "#$20AC"));
node.addChild(createNode(DelphiLexer.TkQuotedString, "'az'"));

assertThat(node.getImage()).isEqualTo("'F'#111#111'B'#$61#$72'B'#%01100001#%01111010");
assertThat(node.getValue()).isEqualTo(node.getImageWithoutQuotes()).isEqualTo("FooBarBaz");
assertThat(node.isMultiline()).isFalse();
assertThat(node.getImage()).isEqualTo("'F'#111#111'B'#$61#$72'B'#$80#$98#$A3#$20AC'az'");
if (highCharUnicode) {
assertThat(node.getValue())
.isEqualTo(node.getImageWithoutQuotes())
.isEqualTo("FooBarB\u0080\u0098£€az");
} else {
assertThat(node.getValue())
.isEqualTo(node.getImageWithoutQuotes())
.isEqualTo("FooBarB€˜£€az");
}
}

@Test