Feature/multipart match #231

Mizuho32 · 2024-10-19T12:01:04Z

matchPart() for multipart mails.
The main diff is 21ed331, and I changed the decode functions in 9768cb3.

It works, but I found some existing code that already does something with multipart. Should I use that instead?
Also, much of this PR overlaps with #230. Sorry...


mjl-

Hi Mizuho32, thanks for the PR, awesome to see!

I've added several comments. The main points:

Decoding the subject (and more headers) is a good idea. Perhaps we can get that committed separately from the body-matching, as it seems independent. Mox already has code that does rfc2047-decoding (for many charsets); it would be good to reuse that.
Regular expression matching could hopefully be done relatively easily with Regexp.MatchReader, applied to the decoded leaf parts (also of multiparts), leveraging the existing parsing code for message.Part. It is probably best to have a separate config option for matching bodies. I can make the needed UI changes (probably a refactor).

I realize the suggestions need quite some more changes. Hope it fits in your schedule. Let me know if I can help.

mjl- · 2024-11-02T12:01:51Z

store/account.go

 				if !t[0].MatchString(k) {
 					continue
 				}
 				for _, v := range vl {
+					if isSubjectMatch {


Decoding RFC2047-encoded words is a good idea.
We should probably attempt decoding it for all headers.
https://www.xmox.nl/xr/dev/rfc/2047.html#L343 specifies quite elaborate rules for where in a header the encoded words are allowed. I think it's too much to follow those requirements explicitly, at least for the purpose of matching text against a header. Hopefully, it works well enough to do a quick scan for the magic "=?" and "?=" in the header value, and try to parse it if both occur.

Decoding should probably be done with mime.WordDecoder, as is done at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L480. The code at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L448 also handles the various character encodings (though perhaps more need to be added explicitly: I think "ianaindex" misses a few character sets, not sure about the Japanese ones).

I think rfc2047-decoding headers could be a separate PR, it isn't tied to matching words in the body.
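
For illustration, the quick-scan-then-decode idea could look roughly like the sketch below. It uses only the standard library; decodeHeaderValue is a hypothetical helper name, not an existing mox function, and falling back to the raw value on error is an assumption. The zero-value mime.WordDecoder only handles UTF-8, ISO-8859-1 and US-ASCII, so other charsets (such as the Japanese ones) would need a CharsetReader like the one mox sets up in message/part.go.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// decodeHeaderValue is a hypothetical helper (not part of mox).
// It attempts RFC 2047 decoding only when both "=?" and "?=" occur in the
// value, and returns the value unchanged if decoding fails.
func decodeHeaderValue(v string) string {
	if !strings.Contains(v, "=?") || !strings.Contains(v, "?=") {
		return v
	}
	// Zero value: no CharsetReader, so only UTF-8, ISO-8859-1 and US-ASCII
	// are decoded. Any other charset makes DecodeHeader return an error.
	var dec mime.WordDecoder
	s, err := dec.DecodeHeader(v)
	if err != nil {
		return v
	}
	return s
}

func main() {
	fmt.Println(decodeHeaderValue("=?UTF-8?Q?caf=C3=A9?= menu"))
	fmt.Println(decodeHeaderValue("plain subject"))
}
```

A ruleset regular expression would then be matched against the returned string rather than against the raw header value.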

Thank you. I'll check them.

mjl- · 2024-11-02T12:10:49Z

store/account.go

 			for k, vl := range header {
 				k = strings.ToLower(k)
+				if t[0].MatchString("body") { // message body match
+					ws := PrepareWordSearch([]string{t[1].String()}, []string{})


I'm not so sure anymore that PrepareWordSearch is the best way to do the matching. It is used by IMAP search and webmail search, and it can require presence/absence of certain words, but that's not needed for these matches, and we want to match on regular expressions (at least for now, in the future, perhaps we could add more elaborate matching mechanisms, including "not"-matches).

I think we can use https://pkg.go.dev/regexp#Regexp.MatchReader. The RuneReader interface is implemented by bufio.Reader: https://pkg.go.dev/bufio#Reader.ReadRune. So I think we can wrap the io.Reader returned by https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader in a bufio.Reader, and call MatchReader (or a similar method) on it. We would also do that for each Part.Parts (multipart messages) recursively (see https://pkg.go.dev/github.com/mjl-/mox/message#Part), until we have a match.
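
A minimal sketch of that recursion, under stated assumptions: the part type below is a hypothetical stand-in for message.Part (its reader method plays the role of Part.Reader, which in mox returns the decoded body), and matchBody/matchPart are illustrative names only. The regexp and bufio calls are the real standard-library ones.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// part is a hypothetical stand-in for message.Part: a leaf has a body,
// a multipart has sub-parts.
type part struct {
	body  string
	parts []part
}

// reader stands in for Part.Reader().
func (p part) reader() io.Reader { return strings.NewReader(p.body) }

// matchPart reports whether re matches any leaf part, stopping at the
// first match. bufio.Reader provides the io.RuneReader that MatchReader needs.
func matchPart(re *regexp.Regexp, p part) bool {
	if len(p.parts) == 0 {
		return re.MatchReader(bufio.NewReader(p.reader()))
	}
	for _, pp := range p.parts {
		if matchPart(re, pp) {
			return true
		}
	}
	return false
}

// matchBody compiles a pattern (as it would come from a config file)
// and matches it against the part tree.
func matchBody(pattern string, p part) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return matchPart(re, p), nil
}

func main() {
	msg := part{parts: []part{
		{body: "plain text alternative"},
		{parts: []part{{body: "<p>nested html with an invoice</p>"}}},
	}}
	ok, err := matchBody("(?i)invoice", msg)
	fmt.Println(ok, err)
}
```

The real code would probably also skip non-text leaf parts and handle read errors, which MatchReader itself does not report.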

Yes, I based this on the code used in webmail search. I will check MatchReader.

mjl- · 2024-11-02T12:17:58Z

store/account.go

 			for k, vl := range header {
 				k = strings.ToLower(k)
+				if t[0].MatchString("body") { // message body match


You mentioned elsewhere that it may be good to separate the body-matching from header matching. And indeed that seems better, at the minimum to avoid confusion between potential headers called "body" and the actual body. I was thinking we could maybe use an empty header key to indicate matching the body, but HeadersRegexp is a map, and it will probably look weird in the config file, if it even works at all.
A new config option in Ruleset would indeed require changing the web interface, and with the current approach (one big table) it would become so big that we need to refactor it. I can tackle that UI change. I think we would need a new BodyRegexps []string field in the config.Ruleset?

Btw, for this code, shouldn't the "if" statement be before its for-loop ("range header")? It's not executed for each header key/value in the message.
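
To make the separation concrete, here is a rough sketch with the body check outside the header loop. The ruleset type is a simplified, hypothetical stand-in for config.Ruleset, BodyRegexps is only the field name floated above, and requiring every regexp to match is an assumption for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ruleset is a simplified, hypothetical stand-in for config.Ruleset.
type ruleset struct {
	HeadersRegexp map[string]string // header-name regexp -> header-value regexp
	BodyRegexps   []string          // assumed new field, as floated in this thread
}

// headerMatches reports whether some header key/value pair matches both regexps.
func headerMatches(kre, vre *regexp.Regexp, header map[string][]string) bool {
	for k, vl := range header {
		if !kre.MatchString(strings.ToLower(k)) {
			continue
		}
		for _, v := range vl {
			if vre.MatchString(v) {
				return true
			}
		}
	}
	return false
}

// match requires every header pair and every body regexp to match.
func (rs ruleset) match(header map[string][]string, body string) (bool, error) {
	for ks, vs := range rs.HeadersRegexp {
		kre, err := regexp.Compile(ks)
		if err != nil {
			return false, err
		}
		vre, err := regexp.Compile(vs)
		if err != nil {
			return false, err
		}
		if !headerMatches(kre, vre, header) {
			return false, nil
		}
	}
	// The body check runs once per message, outside the header loop, so a
	// header that happens to be named "body" cannot be confused with it.
	for _, s := range rs.BodyRegexps {
		re, err := regexp.Compile(s)
		if err != nil {
			return false, err
		}
		if !re.MatchString(body) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	rs := ruleset{
		HeadersRegexp: map[string]string{"^subject$": "(?i)invoice"},
		BodyRegexps:   []string{"(?i)total due"},
	}
	header := map[string][]string{"Subject": {"Invoice 42"}}
	ok, err := rs.match(header, "Total due: 10 EUR")
	fmt.Println(ok, err)
}
```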

OK. I also think separating header matching from body matching is the right approach.
Changing the web interface and the config data structure seems complicated. Could you take care of those?

shouldn't the "if" statement be before its for-loop ("range header")?

Yes, in hindsight I see that I wrote naive code...

mjl- · 2024-11-02T12:19:15Z

go.mod

@@ -13,11 +13,12 @@ require (
 	github.com/mjl-/sherpats v0.0.6
 	github.com/prometheus/client_golang v1.18.0
 	github.com/russross/blackfriday/v2 v2.1.0
+	github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d


I don't see where this is used. Was it for testing?

It may be a mistake. I will check.

mjl- · 2024-11-02T12:21:57Z

store/search.go

 		if p.MediaType != "TEXT" {
-			// todo: for other types we could try to find a library for parsing and search in there too.
-			return false, nil
+			if p.MediaType == "MULTIPART" {


This looks suspicious: The "if" above, for "len(p.Parts) == 0" should cause this if-branch to only be taken if this is not a multipart (i.e. it is a leaf part). The multipart-matching should be handled by "for _, pp := range p.Parts {" below (called recursively).
If p.Parts is empty for multiparts, perhaps the Part wasn't fully initialized/parsed ("walked") yet.

I thought the same when I saw this code. With my multipart mail sample, len(p.Parts) == 0 is true, but I may be misunderstanding something. I'll check.

mjl- · 2024-11-02T12:25:15Z

store/search.go

+			if p.MediaType == "MULTIPART" {
+				// Decode and make io.Reader
+				// todo: avoid to load all content
+				content, err := io.ReadAll(p.RawReader())


This would have to use p.Reader() (https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader), which should already decode the character set. If decoding doesn't yet work for the Japanese encoding, it may require changing the "wordDecoder" as mentioned earlier.
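
As a standard-library-only illustration of why matching on the raw part is unreliable: a pattern can fail on the transfer-encoded bytes and succeed once they are decoded. The sketch below uses mime/quotedprintable directly as a stand-in for the decoding that a reader like p.Reader() is expected to do; matchRaw and matchDecoded are hypothetical helper names.

```go
package main

import (
	"bufio"
	"fmt"
	"mime/quotedprintable"
	"regexp"
	"strings"
)

// matchRaw matches the pattern against the still-encoded bytes.
func matchRaw(pattern, raw string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchReader(strings.NewReader(raw)), nil
}

// matchDecoded decodes quoted-printable while reading and matches on the result.
func matchDecoded(pattern, raw string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	r := quotedprintable.NewReader(strings.NewReader(raw))
	return re.MatchReader(bufio.NewReader(r)), nil
}

func main() {
	raw := "Today: caf=C3=A9 menu" // quoted-printable encoding of "Today: café menu"
	rawOK, _ := matchRaw("café", raw)
	decOK, _ := matchDecoded("café", raw)
	fmt.Println(rawOK, decOK)
}
```

The same applies to base64 bodies and to non-UTF-8 charsets, which is why streaming through a decoding reader is preferable to io.ReadAll on the raw content.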

OK. Thanks.

Mizuho32 added 10 commits October 15, 2024 00:33

message content match ruleset

ee46c52

webaccount/account.js for message content match

076e1c4

Added github.com/saintfish/chardet, golang.org/x/text

e1eb8d4

Subject mime decode functions and test

e795893

decodeRFC2047(): return encoded str if error, ignore case regexp matc…

6862559

…h, QUoted-Printable

store/search_test.go lower 'b' for base64

a98f08f

Subject decoded matching

f864423

vendor/modules.txt github.com/saintfish/chardet, golang.org/x/text v0…

45cca10

….19.0

decode TransferEncoding, Charset, Multipart functions

9768cb3

Match Multipart mail

21ed331

mjl- requested changes Nov 2, 2024
