normalize parser struct for all file types #321

Merged
timothycarambat merged 1 commit into master from normalize-parser
Nov 1, 2023


timothycarambat
Member

No description provided.

timothycarambat merged commit 5441717 into master Nov 1, 2023
timothycarambat deleted the normalize-parser branch November 1, 2023 23:44

Here are some suggestions to improve the code:

  1. In as_docx.py and as_text.py, the docAuthor and description fields are hardcoded as 'Unknown'. It would be better to extract these details from the document metadata where possible and fall back to 'Unknown' only when the metadata is unavailable. For example, with helper functions get_author and get_description:
     'docAuthor': get_author(fullpath) or 'Unknown',
     'description': get_description(fullpath) or 'Unknown',
  2. In as_mbox.py, the sender, recipient, subject, and date_sent fields have been removed. If these fields are not necessary for your use case, that's fine. However, if you may need this information later, consider keeping them:
     "sender": message["From"],
     "recipient": message["To"],
     "subject": subject,
     "date_sent": date_sent,
  3. In as_text.py, you have added a docSource field. This is a good addition, but the source could be more descriptive, for example by including the file path:
     'docSource': f"a text file uploaded by the user from {fullpath}",
  4. In all files, the word count is calculated with len(content). This returns the character count, not the word count. To get the word count, split the content on whitespace first:
     'wordCount': len(content.split()),
  5. In all files, the published date comes from file_creation_time(fullpath). If the file creation time is not the same as the published date, consider using a different method to get the correct date.
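
The fallback for docAuthor and description, the word-count fix, and the published-date concern can be sketched together as follows. This is a minimal illustration, not the parser code from this PR: `word_count`, `build_metadata`, the `published` key, and the `author`, `description`, and `published` parameters are hypothetical names, and falling back to the file's modification time is only an assumption about what a reasonable default might be.

```python
import os
import time
from typing import Optional


def word_count(content: str) -> int:
    """Count whitespace-separated tokens rather than characters."""
    return len(content.split())


def build_metadata(
    fullpath: str,
    content: str,
    author: Optional[str] = None,       # hypothetical: extracted by the caller
    description: Optional[str] = None,  # hypothetical: extracted by the caller
    published: Optional[str] = None,    # hypothetical: a known publication date
) -> dict:
    """Build a metadata dict, defaulting only when a value is missing."""
    if published is None:
        # Assumption: with no real publication date available, use the
        # file's last-modified timestamp as a stand-in.
        modified = os.path.getmtime(fullpath)
        published = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(modified))

    return {
        # `or` yields 'Unknown' for None or an empty string.
        "docAuthor": author or "Unknown",
        "description": description or "Unknown",
        "docSource": f"a text file uploaded by the user from {fullpath}",
        "published": published,
        "wordCount": word_count(content),
    }
```

For the string "hello brave new world", len(...) returns 21 (characters) while word_count(...) returns 4. Passing an explicit published value skips the file-timestamp fallback entirely, which keeps the date accurate when it is known from the document itself.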

franzbischoff pushed a commit to franzbischoff/anything-llm that referenced this pull request Nov 4, 2023