feat: transforms for Knowledge Graphs #1345

Merged
merged 4 commits into from
Sep 23, 2024

Member

@jjmachan jjmachan commented Sep 23, 2024

Transforms

There are basically 3 types of transforms you can do:

  1. extract new properties for nodes and relationships - Extractors
  2. split existing nodes into new nodes and relationships - Splitters
  3. find and build new relationships in the knowledge graph - RelationshipBuilder
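
To make the three roles above concrete, here is a minimal sketch of how they could be expressed as abstract interfaces. This is not the code from the PR: `Node`, `Relationship`, the method names `extract`, `split`, and `build`, and the toy `WordCountExtractor` are all hypothetical and exist only to illustrate the division of responsibilities.

```python
import typing as t
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical minimal node: a bag of named properties.
    properties: t.Dict[str, t.Any] = field(default_factory=dict)


@dataclass
class Relationship:
    # Hypothetical typed edge between two nodes.
    source: Node
    target: Node
    type: str


class Extractor(ABC):
    """Derives one new property from an existing node."""

    @abstractmethod
    def extract(self, node: Node) -> t.Tuple[str, t.Any]:
        ...


class Splitter(ABC):
    """Breaks one node into child nodes plus the relationships linking them."""

    @abstractmethod
    def split(self, node: Node) -> t.Tuple[t.List[Node], t.List[Relationship]]:
        ...


class RelationshipBuilder(ABC):
    """Finds new relationships among nodes already in the graph."""

    @abstractmethod
    def build(self, nodes: t.List[Node]) -> t.List[Relationship]:
        ...


class WordCountExtractor(Extractor):
    # Toy extractor (assumption) used only to show the shape of the interface.
    def extract(self, node: Node) -> t.Tuple[str, t.Any]:
        return "word_count", len(node.properties.get("text", "").split())


node = Node({"text": "knowledge graphs need transforms"})
key, value = WordCountExtractor().extract(node)
node.properties[key] = value  # the extractor's result becomes a node property
```

Keeping the three roles as separate interfaces means each transform only declares what it adds to the graph (a property, new nodes, or new edges), which is one plausible way to let them be chained.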

@dosubot dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Sep 23, 2024


# This regex pattern matches URLs, including those starting with "http://", "https://", or "www."
links_extractor_pattern = r"(?i)\b(?:https?://|www\.)\S+\b"
Member Author

Here we are initializing these extractors. Should we change that?

Member

Yes, please. Such inits can't be scaled cleanly.
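
For reference, the link pattern from the snippet in this thread can be exercised on its own with the standard `re` module. The sketch below only shows what the regex matches; the sample text is made up, and nothing here reflects how the PR wires the pattern into an extractor.

```python
import re

# Pattern copied verbatim from the diff under review.
links_extractor_pattern = r"(?i)\b(?:https?://|www\.)\S+\b"

# Made-up sample text (assumption) with trailing punctuation after each link.
text = "Docs at https://example.com/docs, mirror at WWW.example.org/ today."

# The group is non-capturing, so findall returns the full matches.
links = re.findall(links_extractor_pattern, text)
```

Because the pattern ends in `\b`, a match always stops at the last word character, so the trailing comma is excluded but so is the trailing slash of `WWW.example.org/`.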



@dataclass
class HeadlineSplitter(Splitter):
Member

@shahules786 shahules786 Sep 23, 2024

As far as I can see, this handles single-level headline splitting. Ideally we might need an iterative one, but let's leave that for now. We would also need normal hierarchical splitters like the ones in llama-index or langchain, because headline extraction from documents such as PDFs is error-prone, so we probably can't use headline splitters with large PDF files. [This can be added later]

Member Author

I've noted these down; we will have to address them in separate PRs.
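
To illustrate what "single-level headline splitting" means in this thread, here is a rough sketch that cuts a text at the first occurrence of each known headline. It is not the `HeadlineSplitter` from the PR: the function name `split_by_headlines` and its behavior are assumptions, and it deliberately ignores nesting, which is the limitation raised in the review.

```python
import typing as t


def split_by_headlines(text: str, headlines: t.List[str]) -> t.List[str]:
    """Split text into chunks, each starting at one of the given headlines.

    Hypothetical helper: single-level only, no hierarchy between headlines.
    """
    # First occurrence of each headline; headlines not found are skipped.
    positions = sorted({pos for pos in (text.find(h) for h in headlines) if pos != -1})
    if not positions:
        return [text]

    chunks: t.List[str] = []
    # Keep any text before the first headline as its own chunk.
    preamble = text[: positions[0]].strip()
    if preamble:
        chunks.append(preamble)
    # Each chunk runs from one headline to the next (or to the end of the text).
    for start, end in zip(positions, positions[1:] + [len(text)]):
        chunks.append(text[start:end].strip())
    return chunks


document = "Intro text.\nBackground\nSome history.\nMethod\nThe approach."
chunks = split_by_headlines(document, ["Background", "Method"])
```

A plain substring search like this breaks as soon as an extracted headline does not appear verbatim in the text, which is consistent with the concern above about error-prone headline extraction from PDFs.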

headlines: t.Dict[str, t.List[str]]


class HeadlinesExtractorPrompt(PydanticPrompt[StringIO, Headlines]):
Member

Can you replace the example here with a more generalizable and shorter one? This one targets arXiv documents specifically.

@jjmachan jjmachan merged commit c17029e into explodinggradients:main Sep 23, 2024
16 checks passed
@jjmachan jjmachan deleted the feat/transforms branch September 23, 2024 14:15
2 participants