feat: transforms for Knowledge Graphs #1345
# This regex pattern matches URLs, including those starting with "http://", "https://", or "www."
links_extractor_pattern = r"(?i)\b(?:https?://|www\.)\S+\b"
Here we are initializing these extractors. Should we change that?
Yes, please. Such inits can't be scaled cleanly.
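For reference, a minimal sketch of how the pattern quoted above behaves when applied with Python's standard `re` module. The helper name `extract_links` and the sample string are assumptions made for illustration; they are not taken from the PR.

```python
import re
import typing as t

# Pattern copied verbatim from the diff above.
links_extractor_pattern = r"(?i)\b(?:https?://|www\.)\S+\b"


def extract_links(text: str) -> t.List[str]:
    """Hypothetical helper: return every substring the pattern matches."""
    return re.findall(links_extractor_pattern, text)


# The (?i) flag makes "WWW." match as well, and the trailing \b drops
# punctuation such as "," or "." that directly follows a URL.
sample = "Docs at https://example.com/guide, mirror at WWW.example.org."
links = extract_links(sample)
print(links)
```

Because the pattern ends with `\b`, a URL that legitimately ends in a non-word character (for example a trailing `/`) would lose that character.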
@dataclass
class HeadlineSplitter(Splitter):
As far as I can see, this handles single-level headline splitting. Ideally we would need an iterative one, but let's leave that for now. We would also need normal hierarchical splitters like the ones in llama-index or langchain, because headline extraction from documents such as PDFs is error-prone, so we probably can't use headline splitters with large PDF files. [This can be added later]
I've noted these down; we will have to address them in separate PRs.
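To make "single-level headline splitting" concrete, here is a rough sketch under stated assumptions. It is not the `HeadlineSplitter` from this PR: the function name `split_on_headlines`, its signature, and the sample document are all hypothetical, and it assumes the headlines are already known and appear in document order.

```python
import typing as t


def split_on_headlines(text: str, headlines: t.List[str]) -> t.List[str]:
    """Hypothetical single-level splitter: one chunk per headline found."""
    positions: t.List[int] = []
    cursor = 0
    for headline in headlines:
        # Search forward from the previous match to preserve document order.
        idx = text.find(headline, cursor)
        if idx == -1:
            continue  # headline missing, e.g. because extraction was noisy
        positions.append(idx)
        cursor = idx + len(headline)

    if not positions:
        return [text]  # nothing to split on

    chunks: t.List[str] = []
    preamble = text[: positions[0]]
    if preamble.strip():
        chunks.append(preamble)  # keep any text before the first headline
    for start, end in zip(positions, positions[1:] + [len(text)]):
        chunks.append(text[start:end])
    return chunks


doc = "Preface\nIntro\nalpha\nMethods\nbeta\n"
chunks = split_on_headlines(doc, ["Intro", "Methods"])
print(chunks)
```

A headline that was extracted incorrectly is simply skipped here, so its text is merged into the previous chunk. That is one way the error-proneness mentioned above shows up on large documents.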
headlines: t.Dict[str, t.List[str]]
class HeadlinesExtractorPrompt(PydanticPrompt[StringIO, Headlines]):
Can you replace the example here with a shorter, more generalizable one? This one targets arXiv documents specifically.
Transforms
There are basically 3 types of transforms you can apply:
Extractors
Splitters
RelationshipBuilder
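A toy sketch of how the three kinds of transform differ in what they take and return. Everything here is an assumption for illustration: the dict-based `Node` shape and the function names are hypothetical and do not reflect the classes added in this PR.

```python
import re
import typing as t

# Hypothetical node shape: {"text": str, "properties": dict}
Node = t.Dict[str, t.Any]


def extract_links(node: Node) -> Node:
    """Extractor: derive a property from a node and attach it."""
    pattern = r"(?i)\b(?:https?://|www\.)\S+\b"
    node["properties"]["links"] = re.findall(pattern, node["text"])
    return node


def split_paragraphs(node: Node) -> t.List[Node]:
    """Splitter: break one node into several smaller nodes."""
    parts = [p for p in node["text"].split("\n\n") if p.strip()]
    return [{"text": p, "properties": {}} for p in parts]


def build_shared_link_edges(nodes: t.List[Node]) -> t.List[t.Tuple[int, int]]:
    """Relationship builder: connect nodes that share at least one link."""
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            left = set(nodes[i]["properties"].get("links", []))
            right = set(nodes[j]["properties"].get("links", []))
            if left & right:
                edges.append((i, j))
    return edges


doc = {
    "text": "See https://a.example/x here.\n\nNothing.\n\nAgain https://a.example/x.",
    "properties": {},
}
nodes = [extract_links(n) for n in split_paragraphs(doc)]
edges = build_shared_link_edges(nodes)
print(edges)
```

In this sketch the splitter runs first, the extractor annotates each resulting node, and the relationship builder only reads the extracted properties.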