This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

mbox: allow custom, stable document id #393

Merged (3 commits) on Sep 25, 2023

Contributor

rc9000 commented Jul 16, 2023

  • via a function passed in id_fn, e.g. MboxReader(id_fn=lambda msg: md5(msg.encode()).hexdigest())
  • overrides the UUID-based default from Document
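
To illustrate why a content-derived id is "stable" where a UUID-based one is not, here is a minimal sketch using only the standard library. The helper names `uuid_id` and `md5_id` and the sample message are hypothetical, and `uuid_id` is only a stand-in for a UUID-based default, not the actual Document code.

```python
import uuid
from hashlib import md5


def uuid_id(msg: str) -> str:
    # Stand-in for a UUID-based default: a fresh random value on every call,
    # regardless of the message content.
    return str(uuid.uuid4())


def md5_id(msg: str) -> str:
    # Same shape as the id_fn in the PR description: hash the message text.
    return md5(msg.encode()).hexdigest()


# Invented sample message, for illustration only.
sample = "Date: Sun, 16 Jul 2023 09:00:00 +0000\nFrom: alice@example.com\n\nHello"

# Loading the same message twice gives the same md5-based id,
# but two different UUID-based ids.
first_load, second_load = md5_id(sample), md5_id(sample)
```

A stable id matters when the same mbox file is loaded more than once, since a downstream store can then recognize a message it has already seen.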

Collaborator

jerryjliu left a comment

Out of curiosity, is msg a useful argument to pass into id_fn? It seems to me that for a lot of use cases you'd need to do a bunch of hacky string parsing to extract a usable id.

Another option is to modify parse_file to extract a more structured dict (not just the formatted message string, but also a dict containing _date, _from, _to_ keys, etc.).
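
The structured-dict idea could look roughly like the sketch below. It is not the reader's parse_file; it uses the standard library `email` parser on an invented raw message, and the function names `extract_fields` and `id_from_fields` as well as the dict keys are assumptions made for illustration.

```python
from email import message_from_string
from hashlib import md5

# Invented raw message, for illustration only.
RAW = """From: alice@example.com
To: bob@example.com
Date: Mon, 17 Jul 2023 10:00:00 +0000
Subject: Hello
Message-ID: <abc123@example.com>

Body text here.
"""


def extract_fields(raw: str) -> dict:
    # Parse the headers once and expose them as a plain dict,
    # so an id function does not need to parse strings itself.
    msg = message_from_string(raw)
    return {
        "date": msg["Date"],
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "message_id": msg["Message-ID"],
    }


def id_from_fields(fields: dict) -> str:
    # Prefer the Message-ID header when present; otherwise hash date/from/to.
    if fields.get("message_id"):
        return fields["message_id"].strip("<>")
    key = "|".join(str(fields.get(k) or "") for k in ("date", "from", "to"))
    return md5(key.encode()).hexdigest()


fields = extract_fields(RAW)
doc_id = id_from_fields(fields)
```

With a dict like this, an id function can pick the fields it cares about directly instead of slicing or regex-matching the formatted message string.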

- docs.append(Document(text=msg, extra_info=extra_info or {}))
+ d = Document(text=msg, extra_info=extra_info or {})
+ if self.id_fn:
+     d.id_ = self.id_fn(msg)
Collaborator

The canonical way to set doc_id is d.doc_id = ...

Contributor Author

Thanks, fixed! I think I had this from here

Contributor Author

rc9000 commented Jul 17, 2023

Out of curiosity, is msg a useful argument to pass into id_fn?

I got good and fast results with id_fn=lambda msg: md5(msg[:200].encode()).hexdigest(), since all the identifying information is in the Date:/From:/To: bits which are at the beginning in the default message_format. I put this example in the README now.

But yes, I was also considering passing a dict to id_fn; if you like that a lot more, I could do that instead.
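
The prefix-hashing approach described above can be sketched as follows. The message layout here is invented and only assumes that the identifying headers come first, as described for the default message_format; `prefix_id` is a hypothetical name for the lambda written out as a function.

```python
from hashlib import md5


def prefix_id(msg: str, prefix_len: int = 200) -> str:
    # Hash only the first prefix_len characters, where the
    # Date:/From:/To: lines are assumed to sit.
    return md5(msg[:prefix_len].encode()).hexdigest()


# Invented header block, for illustration only.
HEADER = (
    "Date: Mon, 17 Jul 2023 10:00:00 +0000\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
)
filler = "x" * 300  # pushes the differing text past the first 200 characters

msg_a = HEADER + filler + " first ending"
msg_b = HEADER + filler + " second ending"
msg_c = HEADER.replace("10:00:00", "11:30:00") + filler

same_prefix = prefix_id(msg_a) == prefix_id(msg_b)
different_date = prefix_id(msg_a) != prefix_id(msg_c)
```

Hashing a short prefix keeps the cost independent of message size, but any two messages that share their first 200 characters will receive the same id, so the prefix length has to cover whatever distinguishes messages in a given mailbox.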

jerryjliu merged commit 56d29d9 into run-llama:main Sep 25, 2023