-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/multipart match #231
base: main
Are you sure you want to change the base?
Conversation
…h, QUoted-Printable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Mizuho32, thanks for the PR, awesome to see!
I've added several comments. The main points:
- Decoding the subject (and more headers) are a good idea. Perhaps we can get that committed separately from the body-matching, as it seems independent. Mox already has code patterns that do rfc2047-decoding (for many charsets), would be good to reuse that.
- Doing regular expression matches could hopefully be done relatively easily with Regexp.MatchReader, applied to the decoded leaf parts (also of multiparts), leveraging existing parsing code for message.Part. It is probably best to have a separate config option for matching bodies, I can make the needed UI changes (probably a refactor).
I realize the suggestions need quite some more changes. Hope it fits in your schedule. Let me know if I can help.
if !t[0].MatchString(k) { | ||
continue | ||
} | ||
for _, v := range vl { | ||
if isSubjectMatch { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decoding RFC2047-encoded words is a good idea.
We should probably attempt decoding it for all headers.
https://www.xmox.nl/xr/dev/rfc/2047.html#L343 specifies quite elaborate rules for where in a header the encoded words are allowed. I think it's too much to follow those requirements explicitly, at least for the purpose of matching text against a header. Hopefully, it works well enough to do a quick scan if the magic "=?" and "?=" occur in the header value, and try to parse it if that's the case.
Decoding should probably be done with mime.WordDecoder, as is done at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L480. The code at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L448 also handles the various character encodings (though perhaps more need to explicitly added: I think "ianaindex" misses a few characters sets, not sure about the japanese ones).
I think rfc2047-decoding headers could be a separate PR, it isn't tied to matching words in the body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I'll check them.
for k, vl := range header { | ||
k = strings.ToLower(k) | ||
if t[0].MatchString("body") { // message body match | ||
ws := PrepareWordSearch([]string{t[1].String()}, []string{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure anymore that PrepareWordSearch is the best way to do the matching. It is used by IMAP search and webmail search, and it can require presence/absence of certain words, but that's not needed for these matches, and we want to match on regular expressions (at least for now, in the future, perhaps we could add more elaborate matching mechanisms, including "not"-matches).
I think we can use https://pkg.go.dev/regexp#Regexp.MatchReader. The RuneReader interface is implemented by bufio.Reader: https://pkg.go.dev/bufio#Reader.ReadRune. So I think we can wrap the io.Reader returned by https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader in a bufio.Reader, and call MatchReader (or a similar method) on it. We would also do that for each Part.Parts (multipart messages) recursively (see https://pkg.go.dev/github.com/mjl-/mox/message#Part), until we have a match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I referred the codes used in webmail search. I will check MatchReader.
for k, vl := range header { | ||
k = strings.ToLower(k) | ||
if t[0].MatchString("body") { // message body match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned elsewhere that it may be good to separate the body-matching from header matching. And indeed that seems better, at the minimum to avoid confusion between potential headers called "body" and the actual body. I was thinking we could maybe use an empty header key to indicate matching the body, but HeadersRegexp is a map, and it will probably look weird in the config file, if it even works at all.
A new config option in Ruleset indeed would require changing the web interface, and with the current approach (one big table) make it so big we need to refactor it. I can tackel that UI change. I think we would need a new BodyRegexps []string field in the config.Ruleset?
Btw, for this code, shouldn't the "if" statement be before its for-loop ("range header")? It's not executed for each header key/value in the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Separating header match and body match is the correct way I think too.
Changing web interface and config data structure seems complicated. Can you try them?
shouldn't the "if" statement be before its for-loop ("range header")?
Yes, after all, I noticed I wrote naive code...
@@ -13,11 +13,12 @@ require ( | |||
github.com/mjl-/sherpats v0.0.6 | |||
github.com/prometheus/client_golang v1.18.0 | |||
github.com/russross/blackfriday/v2 v2.1.0 | |||
github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this is used? Was it for testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be mistake. I will check.
if p.MediaType != "TEXT" { | ||
// todo: for other types we could try to find a library for parsing and search in there too. | ||
return false, nil | ||
if p.MediaType == "MULTIPART" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks suspicious: The "if" above, for "len(p.Parts) == 0" should cause this if-branch to only be taken if this is not a multipart (i.e. it is a leaf part). The multipart-matching should be handled by "for _, pp := range p.Parts {" below (called recursively).
If p.Parts is empty for multiparts, perhaps the Part wasn't fully initialized/parsed ("walked") yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thought same when I see these codes. When I use my multipart mail sample, len(p.Parts) == 0 becomes true but can be something misunderstand. I'll check.
if p.MediaType == "MULTIPART" { | ||
// Decode and make io.Reader | ||
// todo: avoid to load all content | ||
content, err := io.ReadAll(p.RawReader()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would have to use p.Reader() (https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader), which should already decode the character set. If decoding doesn't yet work for the japanese encoding, it may require changing the "wordDecoder" as mentioned earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Thanks.
matchPart()
for multipart mails.Main diff is this 21ed331 and I changed decode functions 9768cb3.
It worked but I found some existing codes doing something with multipart. Should I use them?
And many of this PR overlaps with #230 . sorry...