There is no underscore in the character class in the regular expression capture for charset detection in URL previews #10307

srividyut · 2021-07-03T13:08:01Z


synapse/synapse/rest/media/v1/preview_url_resource.py

Line 61 in 4b965c8

    
           _charset_match = re.compile(br'<\s*meta[^>]*charset\s*=\s*"?([a-z0-9-]+)"?', flags=re.I)

line63

synapse/synapse/rest/media/v1/preview_url_resource.py

Line 63 in 4b965c8

           br'\s*<\s*\?\s*xml[^>]*encoding="([a-z0-9-]+)"', flags=re.I

Outside Europe and the United States, URL previews come out garbled for web pages that use certain character encodings. The names of some of these encodings contain underscores, which the character class does not accept.
The fix touches 2 lines, but it is only a 2-character correction.

~~([a-z0-9-]+) -> ([a-z0-9-_]+)~~
(21/07/16 02:00) ([a-z0-9-]+) -> ([a-z0-9_-]+)

(21/07/16 02:00) Inside a character class, a hyphen needs to be escaped unless it is at the beginning or end. The "Source Editor Screenshot" was also incorrect, so I deleted it.
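To illustrate the point about hyphen placement, here is a minimal sketch in Python. The character classes are made up for the demonstration and are not taken from Synapse; they only show how a hyphen between two other characters can be read as a range.

```python
import re

# "+-_" is parsed as a range from "+" (0x2B) to "_" (0x5F),
# which also covers digits and upper-case letters.
range_class = re.compile(r"[+-_]")

# With the hyphen last, or escaped, it is a literal hyphen.
literal_last = re.compile(r"[+_-]")
literal_escaped = re.compile(r"[+\-_]")

for ch in ("A", "-", "_", "+"):
    print(
        ch,
        bool(range_class.fullmatch(ch)),
        bool(literal_last.fullmatch(ch)),
        bool(literal_escaped.fullmatch(ch)),
    )
```

Only the first class accepts "A", because "A" falls inside the accidental range.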

After I added an underscore and sent a message containing a URL from the client, pages whose character set name contains an underscore, such as Shift_JIS, were previewed without garbled characters.
I have opened a pull request with the same content. Sorry, I am a beginner in Python.
Please ignore this if it seems to be excluded by the tests,
or if there is some deeper reason, which I may have missed, why the underscore is absent.
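The effect of the two-character change can be sketched as follows. The first pattern is the one quoted from line 61 above and the second adds the underscore before the trailing hyphen; the sample HTML and the variable names are assumptions made for the demonstration.

```python
import re

# Pattern as quoted from line 61 (before the fix).
old_charset_match = re.compile(
    br'<\s*meta[^>]*charset\s*=\s*"?([a-z0-9-]+)"?', flags=re.I
)
# Same pattern with the underscore added before the trailing hyphen.
new_charset_match = re.compile(
    br'<\s*meta[^>]*charset\s*=\s*"?([a-z0-9_-]+)"?', flags=re.I
)

# Hypothetical page head that declares Shift_JIS.
html = b'<html><head><meta charset="Shift_JIS"></head></html>'

old_name = old_charset_match.search(html).group(1)
new_name = new_charset_match.search(html).group(1)

print(old_name)  # b'Shift'
print(new_name)  # b'Shift_JIS'
```

The old pattern still matches, but the capture stops at the underscore, so the extracted name is truncated rather than missing.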


richvdh · 2021-07-05T10:18:21Z

please could you explain what the user-visible symptoms of this issue are?

srividyut · 2021-07-11T23:07:17Z

babolivier, thank you for fixing the wrong grammar.

clokep · 2021-07-12T14:46:48Z

I'm assuming this is to match additional charsets, e.g. both Shift-JIS and Shift_JIS? This should probably be fine. Do you get mojibake without this change?

srividyut · 2021-07-15T09:25:30Z

Sorry for the late reply.
Mr. clokep, in the case of Shift_JIS, mojibake occurred without the changes described above.
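A rough sketch of why a truncated name leads to mojibake, using only the Python standard library. The sample text is made up, and the windows-1252 (cp1252) fallback is an assumption chosen for illustration; it is not confirmed in this thread to be what Synapse does.

```python
import codecs

text = "日本語のページ"
body = text.encode("shift_jis")  # bytes as a Shift_JIS page would send them


def is_known_codec(name: str) -> bool:
    """Return True if Python can resolve `name` to a codec."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


print(is_known_codec("Shift"))      # truncated name: False
print(is_known_codec("Shift_JIS"))  # full name: True

# Assumed fallback for the demonstration: decode as cp1252.
garbled = body.decode("cp1252", errors="replace")
correct = body.decode("shift_jis")

print(garbled)
print(correct)
```

With the full name the bytes round-trip to the original text; with the assumed fallback they turn into unrelated Latin characters.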

Steps to reproduce

I used the following article to find sample URLs for verification.

https://w3techs.com/technologies/overview/character_encoding
(The usage rate of Shift_JIS is 0.1%, but that includes ITMedia, a media conglomerate. Kakaku.com, another frequently used site, is also included.)
Below is a screenshot of the web client version of Element connected to Synapse, after creating a room and sending some URLs. All the pasted URLs are web pages that use Shift_JIS.
(To avoid hitting previously cached previews, I chose different paths on the same sites.)

When I changed the regex character class, restarted the server and did the same, it looked like the screenshot below.

I checked the response headers using Firefox's web developer tools

For the upper two pasted URLs in the validation image, the Content-Type line of the HTTP response header did not contain a character set definition.

For the third URL, www.jalan.net, the preview is not rendered properly even though the HTTP response header contains "charset = Windows-31J".
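Because some of these pages declare the character set only in the HTML, the body has to be inspected as well as the header. A minimal sketch of one plausible lookup order follows; `guess_encoding`, the header pattern, the 1024-byte window, and the order of the checks are all assumptions for illustration, not Synapse's actual implementation. The two body patterns are the ones quoted above with the underscore added.

```python
import re
from typing import Optional

_meta_charset = re.compile(
    br'<\s*meta[^>]*charset\s*=\s*"?([a-z0-9_-]+)"?', flags=re.I
)
_xml_encoding = re.compile(
    br'\s*<\s*\?\s*xml[^>]*encoding="([a-z0-9_-]+)"', flags=re.I
)
# Hypothetical pattern for the Content-Type header value.
_header_charset = re.compile(r'charset\s*=\s*"?([a-z0-9_-]+)"?', flags=re.I)


def guess_encoding(body: bytes, content_type: Optional[str]) -> str:
    """Hypothetical helper: pick a charset name for an HTML response."""
    head = body[:1024]  # assumed: only look at the start of the document

    match = _meta_charset.search(head)
    if match:
        return match.group(1).decode("ascii")

    match = _xml_encoding.match(head)
    if match:
        return match.group(1).decode("ascii")

    if content_type:
        header_match = _header_charset.search(content_type)
        if header_match:
            return header_match.group(1)

    return "utf-8"  # assumed default when nothing is declared


print(guess_encoding(b'<meta charset="Shift_JIS">', "text/html"))
print(guess_encoding(b"<html></html>", "text/html; charset = Windows-31J"))
```

The first call covers pages like the upper two URLs, where only the body names the encoding; the second covers a header-only declaration.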

I haven't traced how the retrieved character set name is processed once it is stored in the variable, but the valid names are clearly defined in the dictionary in webencodings.labels.
( ... /matrix-synapse/lib/python3.8/site-packages/webencodings/labels.py, or its equivalent)
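As a rough sketch of how a label table of this kind behaves, here is a hand-written stand-in. `LABELS` and `lookup_label` are hypothetical; the entries are an illustrative subset modeled on the WHATWG Encoding Standard labels and are not copied from webencodings/labels.py.

```python
from typing import Optional

# Illustrative subset: label -> canonical encoding name (assumption).
LABELS = {
    "shift_jis": "shift_jis",
    "shift-jis": "shift_jis",
    "sjis": "shift_jis",
    "windows-31j": "shift_jis",
    "x-sjis": "shift_jis",
    "utf-8": "utf-8",
    "utf8": "utf-8",
}


def lookup_label(label: str) -> Optional[str]:
    """Hypothetical lookup: trim whitespace, lower-case, then consult the table."""
    return LABELS.get(label.strip().lower())


print(lookup_label("Shift_JIS"))    # shift_jis
print(lookup_label("Windows-31J"))  # shift_jis
print(lookup_label("Shift"))        # None
```

A truncated label such as "Shift" has no entry, so a lookup of this kind cannot recover the intended encoding.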

My understanding is that, historically in Japan, web servers tended not to declare a specific character set in their response headers, in order to avoid problems. At least in the era when running a personal blog on one's own server became popular, many server-setup articles written by individuals recommended this. My personal server uses the same settings.

I did my best with machine translation and struggled to write this. I'm sorry if anything comes across as rude.

clokep · 2021-07-15T11:39:03Z

@srividyut Thanks for including the screenshots! That makes it clear what's happening. I think the original PR you had put up (#10306) was correct. You'll just need to:

Add a newsfile
Sign-off on your commit
Run the linting scripts

https://github.com/matrix-org/synapse/blob/master/CONTRIBUTING.md#9-submit-your-patch has a bit more info about this.

srividyut · 2021-07-15T17:42:51Z

I made a mistake not only in the grammar but also in the fixed code.
The underscore should have been inserted before the hyphen at the end.
Of course I will fix it.
Thank you for your kind guidance on the pull request!
I'll do my best.

Signed-off-by: sri-vidyut <srividyut@hotmail.com>

babolivier changed the title ~~There is no underscore in the character class in the regular expression capture for charset detection~~ There is no underscore in the character class in the regular expression capture for charset detection in URL previews Jul 5, 2021

richvdh added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 7, 2021

clokep added good first issue Good for newcomers S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed X-Needs-Info This issue is blocked awaiting information from the reporter labels Jul 15, 2021

srividyut added a commit to srividyut/synapse that referenced this issue Jul 16, 2021

issue:matrix-org#10307, #prev-pull-request:matrix-org#10306

20030bc

Signed-off-by: sri-vidyut <srividyut@hotmail.com>

srividyut mentioned this issue Jul 16, 2021

Support underscores (in addition to hyphens) for charset detection. #10410

Merged

4 tasks

clokep linked a pull request Jul 16, 2021 that will close this issue

Support underscores (in addition to hyphens) for charset detection. #10410

Merged

4 tasks

clokep closed this as completed in #10410 Jul 27, 2021
