feat: transforms for Knowledge Graphs #1345

jjmachan · 2024-09-23T11:43:24Z

Transforms

There are three basic types of transforms you can apply:

Extractors - extract new properties for nodes and relationships
Splitters - split existing nodes into new nodes and relationships
RelationshipBuilder - find and build new relationships in the knowledge graph
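
To make the three kinds of transforms above concrete, here is a minimal sketch of how they could be expressed as interfaces. The class names follow the description, but `Node`, `Relationship`, every method signature, and `WordCountExtractor` are assumptions made for illustration; they are not the classes added in this PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    # Hypothetical node type: just a bag of properties.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    # Hypothetical typed edge between two nodes.
    source: Node
    target: Node
    type: str


class Extractor(ABC):
    """Derives one new property from an existing node."""

    @abstractmethod
    def extract(self, node: Node) -> Tuple[str, Any]: ...


class Splitter(ABC):
    """Breaks one node into several nodes plus the edges linking them."""

    @abstractmethod
    def split(self, node: Node) -> Tuple[List[Node], List[Relationship]]: ...


class RelationshipBuilder(ABC):
    """Looks across nodes and proposes new edges."""

    @abstractmethod
    def build(self, nodes: List[Node]) -> List[Relationship]: ...


class WordCountExtractor(Extractor):
    # Toy extractor used only to exercise the interface.
    def extract(self, node: Node) -> Tuple[str, Any]:
        text = node.properties.get("text", "")
        return "word_count", len(text.split())


node = Node({"text": "knowledge graphs need transforms"})
key, value = WordCountExtractor().extract(node)
node.properties[key] = value  # the extractor's result becomes a node property
```

Returning a `(name, value)` pair keeps the extractor free of side effects, so the caller decides how the property is written to the graph.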

jjmachan · 2024-09-23T11:44:27Z

src/ragas/experimental/testset/transforms/extractors/regex_based.py

+
+
+# This regex pattern matches URLs, including those starting with "http://", "https://", or "www."
+links_extractor_pattern = r"(?i)\b(?:https?://|www\.)\S+\b"


Here we are initializing these extractors. Should we change that?

Yes, please. Such inits can't be scaled cleanly.
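
One possible direction, assuming the concern is that extractor instances are created right where the patterns are defined: keep only the pattern at module level and build extractors on demand. The pattern string is the one from the diff; `LINKS_PATTERN` and `make_regex_extractor` are hypothetical names, not part of this PR.

```python
import re
from typing import Callable, List

# Pattern taken from the diff: matches URLs starting with
# "http://", "https://", or "www." (case-insensitive).
LINKS_PATTERN = r"(?i)\b(?:https?://|www\.)\S+\b"


def make_regex_extractor(pattern: str) -> Callable[[str], List[str]]:
    """Hypothetical factory: compile the pattern only when an extractor is requested."""
    compiled = re.compile(pattern)

    def extract(text: str) -> List[str]:
        # The pattern has no capturing groups, so findall returns whole matches.
        return compiled.findall(text)

    return extract


extract_links = make_regex_extractor(LINKS_PATTERN)
links = extract_links("See https://example.com/docs and www.example.org.")
print(links)  # ['https://example.com/docs', 'www.example.org']
```

The trailing `\b` in the pattern is what drops the sentence-ending period from the second match.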

shahules786 · 2024-09-23T12:29:02Z

src/ragas/experimental/testset/transforms/splitters/headline.py

+
+
+@dataclass
+class HeadlineSplitter(Splitter):


As far as I can see, this handles single-level headline splitting. Ideally we would need an iterative one, but let's leave that for now. We would also need normal hierarchical splitters like the ones in llama-index or langchain, because headline extraction from documents such as PDFs is error-prone, so we probably can't use headline splitters with large PDF files. [This can be added later]

I've noted these down; we will have to address them in separate PRs.
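
For reference, single-level headline splitting as discussed above could be sketched as follows. This assumes each headline appears verbatim in the text and only looks at its first occurrence; `split_by_headlines` is a hypothetical helper, not the `HeadlineSplitter` implementation from this PR.

```python
from typing import List


def split_by_headlines(text: str, headlines: List[str]) -> List[str]:
    """Cut `text` into chunks that each start at a headline (single level only)."""
    # First occurrence of every headline that is actually present in the text.
    positions = sorted({idx for idx in map(text.find, headlines) if idx != -1})
    if not positions:
        return [text]

    chunks = []
    if positions[0] > 0:
        # Keep any text that precedes the first headline.
        chunks.append(text[: positions[0]])
    bounds = positions + [len(text)]
    for start, end in zip(bounds, bounds[1:]):
        chunks.append(text[start:end])
    return [chunk.strip() for chunk in chunks if chunk.strip()]


doc = "Preamble.\nIntro\nAlpha text.\nMethods\nBeta text."
chunks = split_by_headlines(doc, ["Intro", "Methods", "Missing"])
```

A headline that was extracted incorrectly (the error-prone case mentioned for PDFs) is simply not found and gets skipped, which silently merges its section into the previous chunk.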

shahules786 · 2024-09-23T12:31:30Z

src/ragas/experimental/testset/transforms/extractors/llm_base.py

+    headlines: t.Dict[str, t.List[str]]
+
+
+class HeadlinesExtractorPrompt(PydanticPrompt[StringIO, Headlines]):


Can you replace the example here with a shorter, more generalizable one? This one targets arXiv documents specifically.

jjmachan added 4 commits September 23, 2024 16:48

base objects

5991bcb

Merge branch 'main' into feat/transforms

a396e91

added extractors

b6c225b

added splitters and relationship builder

c014e3f

dosubot bot added the size:XL label (this PR changes 500-999 lines, ignoring generated files) Sep 23, 2024

jjmachan requested a review from shahules786 September 23, 2024 11:44

shahules786 reviewed Sep 23, 2024

shahules786 approved these changes Sep 23, 2024

jjmachan merged commit c17029e into explodinggradients:main Sep 23, 2024
16 checks passed

jjmachan deleted the feat/transforms branch September 23, 2024 14:15
