Search results for "UTF-8 with BOM" files shifted on first line by a character #66189

ghost · 2019-01-08T00:53:02Z

This is my first pull request in this project. Please let me know if my solution is not good enough; I am willing to improve it.

msftclas · 2019-01-08T00:53:13Z

All CLA requirements met.

roblourens · 2019-01-08T17:49:57Z

src/vs/workbench/parts/search/common/searchModel.ts

@@ -40,7 +40,7 @@ export class Match {
 	private _fullPreviewRange: ISearchRange;

 	constructor(private _parent: FileMatch, private _fullPreviewLines: string[], _fullPreviewRange: ISearchRange, _documentRange: ISearchRange) {
-		this._oneLinePreviewText = _fullPreviewLines[_fullPreviewRange.startLineNumber];
+		this._oneLinePreviewText = stripUTF8BOM(_fullPreviewLines[_fullPreviewRange.startLineNumber]);


If it's a UTF-8 with BOM file and _fullPreviewRange.startLineNumber is 0, _fullPreviewLines[0] will start with the BOM, so _fullPreviewLines will not match _fullPreviewRange.

But the BOM should already be stripped by ripgrepTextSearchEngine, right?

I can move it to ripgrepTextSearchEngine.

But it's already there, right? That file should be stripping the BOM correctly in all cases with your change.

roblourens · 2019-01-08T17:52:00Z

src/vs/workbench/services/search/node/ripgrepTextSearchEngine.ts

 				matchText = stripUTF8BOM(matchText);
-				startCol -= 3;
-				endCol -= 3;
+				startCol -= 1;


Why change this? The BOM is 3 bytes long.

https://github.com/Microsoft/vscode/blob/7c8361ef698d9ed491612cd786952aba2ab47c87/src/vs/workbench/services/search/node/ripgrepTextSearchEngine.ts#L258

Because toString() is called here, and Buffer.from([0xEF, 0xBB, 0xBF]).toString().length is 1. The columns are measured on the decoded string, not on the raw bytes.

OK, that's correct, thanks.
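
For anyone reading this thread later, a minimal sketch of the byte-versus-character difference discussed above. It assumes a Node.js runtime (for Buffer); the variable names are only illustrative and are not taken from the VS Code source.

```typescript
// The UTF-8 BOM occupies three bytes on disk.
const bomBytes = Buffer.from([0xEF, 0xBB, 0xBF]);

// Buffer#toString() decodes as UTF-8 by default, which yields the single
// character U+FEFF (one UTF-16 code unit in a JavaScript string).
const bomString = bomBytes.toString();

const bomByteLength = bomBytes.length;        // length in bytes
const bomCharLength = bomString.length;       // length in UTF-16 code units
const bomCodeUnit = bomString.charCodeAt(0);  // the decoded code unit

console.log(bomByteLength, bomCharLength, bomCodeUnit.toString(16));
```

So an offset computed on the decoded string shifts by 1 when the BOM is removed, while an offset computed on the raw bytes would shift by 3.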

ghost · 2019-01-10T00:47:35Z

I found another solution: Remove && options.encoding !== 'utf8'

https://github.com/Microsoft/vscode/blob/f0f3b922bcb081c6488b7299dbef4076a1cfde82/src/vs/workbench/services/search/node/ripgrepTextSearchEngine.ts#L333

Then ripgrep will not output the BOM character in its JSON.

Or I can manually strip the BOM character in ripgrepTextSearchEngine.

Which do you prefer?

roblourens · 2019-01-10T01:38:40Z

I want to keep that because I found that search was faster in some cases without it. Also, you could probably search with a different encoding but still end up finding results in UTF8 files with BOMs. Also I don't trust ripgrep to keep that behavior forever. So, let's keep the check to strip the BOM on our end.

ghost · 2019-01-10T06:40:27Z

I stripped fullText at
https://github.com/Microsoft/vscode/blob/c22caf616a9d6baa90ab904292bddc14a7b09ffd/src/vs/workbench/services/search/node/ripgrepTextSearchEngine.ts#L285
So there is no need to edit searchModel.ts.
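
A rough sketch of the approach settled on in this thread: strip the BOM from the line text and shift the match columns by one character, never letting them go below zero (compare the "Index can't be negative" commit). The names stripBomAndShift, ColumnRange and UTF8_BOM_CHARACTER are hypothetical and chosen for illustration; this is not the code merged into ripgrepTextSearchEngine.ts, and it assumes columns are zero-based character offsets into the decoded string.

```typescript
// Hypothetical constant: the BOM as it appears after UTF-8 decoding.
const UTF8_BOM_CHARACTER = "\uFEFF";

// Hypothetical shape: zero-based character offsets, end exclusive.
interface ColumnRange {
  startCol: number;
  endCol: number;
}

// Removes a leading BOM from lineText and shifts the range to match.
// Lines without a BOM are returned unchanged.
function stripBomAndShift(
  lineText: string,
  range: ColumnRange
): { text: string; range: ColumnRange } {
  if (lineText.charCodeAt(0) !== 0xFEFF) {
    return { text: lineText, range: range };
  }
  const shift = UTF8_BOM_CHARACTER.length; // one UTF-16 code unit
  return {
    text: lineText.slice(shift),
    range: {
      // Clamp at zero so a match at column 0 cannot become negative.
      startCol: Math.max(0, range.startCol - shift),
      endCol: Math.max(0, range.endCol - shift),
    },
  };
}

// Usage: a match on "world" in a first line that still carries the BOM.
const shifted = stripBomAndShift("\uFEFFhello world", { startCol: 7, endCol: 12 });
console.log(shifted.text.slice(shifted.range.startCol, shifted.range.endCol));
```

Doing the strip and the shift in one place keeps the preview text and its range consistent, which is why the extra strip in searchModel.ts is no longer needed.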

roblourens · 2019-01-10T17:04:41Z

Looks great, thanks!

RMacfarlane assigned roblourens Jan 8, 2019

roblourens reviewed Jan 8, 2019

Fix #66188

c22caf6

Victorique Ko added 2 commits January 10, 2019 16:19

Strip utf-8 bom from matchText

4f151da

Index can't be negative

0be5df1

roblourens mentioned this pull request Jan 10, 2019

Links overflow the description comment box microsoft/vscode-pull-request-github#806

Closed

roblourens merged commit 8c2cef9 into microsoft:master Jan 10, 2019

github-actions bot locked and limited conversation to collaborators Jul 25, 2020
