Support assets with null characters in strings. #567

winder · 2021-07-12T18:22:28Z

Summary

It's possible for users to put null characters in the asset name, asset unit, and asset URL. The indexer should handle these valid transactions in some sensible way.

Test Plan

New unit tests.

codecov-commenter · 2021-07-12T18:28:31Z

Codecov Report

Merging #567 (69aba11) into develop (8980d9e) will increase coverage by 2.74%.
The diff coverage is 75.00%.

@@             Coverage Diff             @@
##           develop     #567      +/-   ##
===========================================
+ Coverage    46.39%   49.14%   +2.74%     
===========================================
  Files           24       24              
  Lines         3899     3913      +14     
===========================================
+ Hits          1809     1923     +114     
+ Misses        1818     1702     -116     
- Partials       272      288      +16

Impacted Files	Coverage Δ
idb/postgres/internal/encoding/encode.go	`85.36% <66.66%> (-7.74%)`	⬇️
idb/postgres/postgres.go	`45.26% <100.00%> (+5.60%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8980d9e...69aba11.

brianolson

I think we should probably always do what the '...ForQuery' variant does: escape backslashes first, then zeros. Then the desanitize() function will actually do the right thing.
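A minimal sketch of that ordering, assuming two hypothetical helpers named `escape` and `unescape` (they are illustrations, not the PR's actual functions). Backslashes are doubled first, then each NUL byte becomes the six-character text `\u0000`, so the inverse can tell an escaped NUL apart from user text that merely looks like one.

```go
package main

import (
	"fmt"
	"strings"
)

// escape is a hypothetical helper: it doubles backslashes first and only then
// rewrites NUL bytes, so the backslash inserted for a NUL is never doubled.
func escape(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`)
	return strings.ReplaceAll(s, "\x00", `\u0000`)
}

// unescape is the hypothetical inverse. strings.NewReplacer scans the input
// once, left to right, without overlapping matches, so `\\u0000` decodes to a
// literal backslash followed by "u0000" rather than to a NUL byte.
func unescape(s string) string {
	return strings.NewReplacer(`\\`, `\`, `\u0000`, "\x00").Replace(s)
}

func main() {
	inputs := []string{
		"has >\x00< nu\\ll",  // a real NUL byte and a single backslash
		`has >\u0000< nu\ll`, // the literal text \u0000, no NUL byte
	}
	for _, in := range inputs {
		stored := escape(in)
		fmt.Printf("%q -> %q, round trip ok: %v\n", in, stored, unescape(stored) == in)
	}
}
```

Swapping the two `ReplaceAll` calls in `escape` would double the backslash that was just inserted for the NUL, which is the ordering problem this comment is about.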

brianolson · 2021-07-12T18:25:04Z

idb/postgres/internal/encoding/encode.go

I think you need to replace existing backslashes before adding new ones (this needs to happen before line 67).

This is specifically to search for \u0000, if you don't escape the slash it doesn't find anything.

I'm simplifying this to be just `return strings.ReplaceAll(str, "\x00", "\\\\u0000")`.

I think that all other Unicode characters would be converted by the JSON encoder with a single backslash; Postgres must then convert them into the Unicode character.
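For reference, this small sketch prints what Go's standard `encoding/json` emits for a NUL byte, another control character, and a backslash. The helper name `marshalString` is hypothetical, and whether the indexer's encoder behaves the same way is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalString returns the JSON text that encoding/json produces for s.
func marshalString(s string) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err) // not expected for a plain string
	}
	return string(b)
}

func main() {
	fmt.Println(marshalString("a\x00b")) // NUL byte
	fmt.Println(marshalString("a\x01b")) // another control character
	fmt.Println(marshalString(`a\b`))    // literal backslash
}
```

The encoder writes control characters as `\uXXXX` with a single backslash and doubles a literal backslash, so the stored text for a real NUL and for the user-typed characters `\u0000` differ only in that backslash.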

idb/postgres/internal/encoding/encoding_test.go

idb/postgres/postgres_integration_test.go

idb/postgres/internal/encoding/encode.go

tsachiherman · 2021-07-14T23:23:19Z

idb/postgres/internal/encoding/encode.go

+		case '\\':
+			newlen += 2
+		default:
+			newlen+= csize


add space before +=

tsachiherman · 2021-07-14T23:32:58Z

idb/postgres/internal/encoding/encode.go

+}
+
+// UnescapeNulls is the inverse function of EscapeNulls.
+// UnescapeNulls converts \\ and \uXXXX back into their unescaped form but may not be fully general for input not generated by EscapeNulls().


I believe the method is correct, but relying on the input having been correctly generated by EscapeNulls() is an issue. It needs to be refactored to return an error when the input cannot be unescaped.

If this is all only used internally to store a few fields and read them back, then there is no problem?

brianolson · 2021-07-12T21:01:38Z

idb/postgres/internal/encoding/encoding_test.go

+			input:    "has >\000< nu\\ll",
+			expected: `has >\u0000< nu\ll`,
+			query:    `has >\\u0000< nu\ll`,
+		},


Suggested change

-		},
+		},
+		{
+			name:     "already escaped null",
+			input:    "has >\\u0000< nu\\ll",
+			expected: `has >\u0000< nu\ll`,
+			query:    `has >\u0000< nu\ll`,
+		},

but then that's the case where desanitize(expected) != input
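The collision can be reproduced in a few lines. This sketch assumes a hypothetical `sanitizeOneWay` that rewrites NUL bytes but leaves existing backslashes alone, which is the behaviour the test expectations above imply.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeOneWay is a hypothetical stand-in for an encoding that rewrites NUL
// bytes but leaves existing backslashes untouched.
func sanitizeOneWay(s string) string {
	return strings.ReplaceAll(s, "\x00", `\u0000`)
}

func main() {
	// realNull contains an actual NUL byte.
	realNull := "has >\x00< nu\\ll"
	// lookalike contains the six characters \u0000 instead.
	lookalike := "has >\\u0000< nu\\ll"

	a, b := sanitizeOneWay(realNull), sanitizeOneWay(lookalike)
	fmt.Printf("%q\n%q\nsame stored value: %v\n", a, b, a == b)
}
```

Both inputs map to the same stored string, so no inverse function can recover both of them.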

brianolson · 2021-07-13T14:10:19Z

README.md


 As of April 2020, storing all the raw blocks is about 100 GB and the PostgreSQL database of transactions and accounts is about 1 GB. Much of that size difference is the Indexer ignoring cryptographic signature data; relying on `algod` to validate blocks. Dropping that, the Indexer can focus on the 'what happened' details of transactions and accounts.

+Postgres should be configured to use UTF-8 encoding.


add (this is the default).

brianolson · 2021-07-16T14:29:24Z

idb/postgres/internal/encoding/encode.go

+	xb := []byte(x)
+
+	escapenull := []byte("\\u0000")
+	var out strings.Builder


This is worse. The prior implementation will produce less memory garbage by doing exactly one allocation of exactly the right size. Why replace the better implementation with a worse implementation? (I know, readability, but the better implementation already existed so moving backwards is frustrating.)
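A sketch of one way to keep the single exact-size allocation while still using `strings.Builder`: count the output length first, then call `Grow` once. The function name `escapeSized` is hypothetical, and this is not the code from either revision of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeSized is a hypothetical escaper that precomputes the output length so
// the builder allocates once, with exactly the capacity it needs.
func escapeSized(s string) string {
	newlen := 0
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case 0:
			newlen += 6 // len(`\u0000`)
		case '\\':
			newlen += 2 // len(`\\`)
		default:
			newlen++
		}
	}
	if newlen == len(s) {
		return s // nothing to escape, no allocation at all
	}

	var out strings.Builder
	out.Grow(newlen)
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case 0:
			out.WriteString(`\u0000`)
		case '\\':
			out.WriteString(`\\`)
		default:
			out.WriteByte(s[i])
		}
	}
	return out.String()
}

func main() {
	fmt.Printf("%q\n", escapeSized("has >\x00< nu\\ll"))
}
```

The counting pass costs a second walk over the input, but `Builder.String` returns the buffer without copying, so the common case of a string with nothing to escape allocates nothing.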

brianolson · 2021-07-16T14:29:53Z

idb/postgres/internal/encoding/encode.go

+}
+
+// UnescapeNulls is the inverse function of EscapeNulls.
+// UnescapeNulls converts \\ and \uXXXX back into their unescaped form but may not be fully general for input not generated by EscapeNulls().


If this is all only used internally to store a few fields and read them back, then there is no problem?

winder · 2021-07-20T14:24:06Z

Replaced by #577

winder added Unplanned Team Carbon-11 labels Jul 12, 2021

winder requested review from algobolson and tolikzinovyev July 12, 2021 18:22

winder self-assigned this Jul 12, 2021

winder changed the title ~~Support assets with null characts in strings.~~ Support assets with null characters in strings. Jul 12, 2021

brianolson reviewed Jul 12, 2021


Support assets with null characts in strings.

ac7f352

winder force-pushed the will/null-char-support branch from a779179 to ac7f352 Compare July 12, 2021 18:32

Some cleanup.

d7e6b8f

winder requested a review from brianolson July 12, 2021 18:55

brianolson reviewed Jul 12, 2021


idb/postgres/internal/encoding/encoding_test.go (resolved)

Add desanitize test.

69aba11

tolikzinovyev reviewed Jul 12, 2021


idb/postgres/postgres_integration_test.go (outdated, resolved)

idb/postgres/internal/encoding/encode.go (outdated, resolved)

idb/postgres/internal/encoding/encode.go (outdated, resolved)

PR Feedback + uni-directional encoding policy

953291f

brianolson mentioned this pull request Jul 13, 2021

backslash escape bad utf8 and nulls #568

Closed

winder added 5 commits July 14, 2021 09:20

Convert to base64 for non-printable or invalid utf8 text.

92e23ca

Remove unused var

be53186

make fmt

6795c2c

Include EscapeNulls function.

28bbcdc

Fix escape function

18cdd57

tsachiherman reviewed Jul 14, 2021


tsachiherman reviewed Jul 14, 2021


Use a string builder instead of precomputing the length.

8094899

brianolson reviewed Jul 16, 2021


winder closed this Jul 20, 2021

winder deleted the will/null-char-support branch June 22, 2022 20:42



6 participants
