@stellasia stellasia commented Jul 19, 2024

Description

Add support for Component/Pipeline. See the shaping document for details.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Documentation update
  • Project configuration change

Complexity

Note

Please provide an estimated complexity of this PR of either Low, Medium or High

Complexity: High

How Has This Been Tested?

  • Unit tests (a few, need to be refined once the architecture is fixed)
  • E2E tests
  • Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

  • Documentation has been updated
  • Unit tests have been updated
  • E2E tests have been updated
  • Examples have been updated
  • New files have copyright header
  • CLA (https://neo4j.com/developer/cla/) has been signed
  • CHANGELOG.md updated if appropriate

@stellasia stellasia requested a review from alexthomas93 July 19, 2024 08:47
@stellasia stellasia marked this pull request as ready for review July 22, 2024 12:23
return type.__new__(meta, name, bases, attrs)


class Component(abc.ABC, metaclass=ComponentMeta):
Contributor

Do you see a benefit in adding a synchronous version of these Components in the future, or is that redundant?

Contributor Author

We could. For now I haven't found a way to do it without duplicating a lot of code, but it should be doable; it just needs a bit more research.
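One way to avoid duplicating code might be a thin blocking adapter around the async `run`. The sketch below is only an illustration of that idea: `AsyncComponent` and `SyncAdapter` are hypothetical names, not classes from this PR.

```python
import asyncio
from typing import Any, Dict


class AsyncComponent:
    """Hypothetical stand-in for an async component (not the class from this PR)."""

    async def run(self, text: str) -> Dict[str, Any]:
        await asyncio.sleep(0)  # placeholder for real asynchronous work
        return {"length": len(text)}


class SyncAdapter:
    """Exposes a blocking run() on top of an async component."""

    def __init__(self, component: AsyncComponent) -> None:
        self._component = component

    def run(self, **kwargs: Any) -> Dict[str, Any]:
        # asyncio.run starts a fresh event loop, so this must not be called
        # from code that is already running inside an event loop.
        return asyncio.run(self._component.run(**kwargs))


result = SyncAdapter(AsyncComponent()).run(text="hello")
```

The limitation noted in the comment (no nesting inside a running loop) is probably the part that needs the extra research mentioned above.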

Contributor

@willtai willtai left a comment

Great work 🐉

logger.warning(
    f"Missing dependency {d.start} for {task.name} (status: {d_status})"
)
raise PipelineMissingDependencyError()
Contributor

Apologies if I'm misunderstanding something here, but is it possible to run a chain of components in parallel and then 'gather' the results in some final component, or can parallel processing only be done within a component? E.g., after chunking, would it be possible to run a pipeline of components on each chunk and only gather the results at the end for entity resolution? Or must the 'gathering' be done at the end of every component?

Contributor Author

For now, parallelism is possible only within a component. The only "parallelism" the pipeline allows is between branches: for instance, the schema extractor and document chunker are run in parallel because they do not depend on each other.
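To illustrate what parallelism "between branches" means, here is a minimal sketch using plain `asyncio`, not the Orchestrator from this PR. The three coroutine names are hypothetical stand-ins for the schema extractor, the document chunker, and a downstream component that depends on both.

```python
import asyncio
from typing import Any, Dict, List


async def extract_schema(text: str) -> Dict[str, Any]:
    await asyncio.sleep(0.01)  # placeholder for real work
    return {"entities": ["Person"]}


async def chunk_document(text: str) -> List[str]:
    await asyncio.sleep(0.01)  # placeholder for real work
    return text.split(". ")


async def extract_entities(schema: Dict[str, Any], chunks: List[str]) -> Dict[str, Any]:
    # Depends on both branches, so it can only start once both have finished.
    return {"schema": schema, "n_chunks": len(chunks)}


async def run_pipeline(text: str) -> Dict[str, Any]:
    # The two branches do not depend on each other, so they are awaited together.
    schema, chunks = await asyncio.gather(extract_schema(text), chunk_document(text))
    return await extract_entities(schema, chunks)


output = asyncio.run(run_pipeline("Alice knows Bob. Bob knows Carol."))
```

The gather point sits at the component that needs both inputs, which matches the behaviour described above: results are collected at every component boundary.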

Contributor

Makes sense. Do you think it'll be possible to add that in the future? For large datasets, I imagine a lot of efficiency would be gained by not having to gather and re-parallelise everything at every step.

Contributor Author

Yes, I agree it makes sense to execute full pipelines in parallel for different chunks, but I'm not sure this package is the best place for it. I'd rather build the pipeline for a single chunk, and then rely on (potentially) distributed task managers (such as luigi?) to manage the workload.
If we decide to implement it ourselves, we'll have to make some changes to the way the pipelines currently work, but that should be doable.
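For a single process, the "one pipeline per chunk" idea could look roughly like the sketch below. It uses only `asyncio`; `run_single_chunk_pipeline` is a hypothetical placeholder for a pipeline built for one chunk, and `max_concurrency` is an assumed knob, not an option of this package. A distributed task manager would replace the `gather` call.

```python
import asyncio
from typing import Any, Dict, List


async def run_single_chunk_pipeline(chunk: str) -> Dict[str, Any]:
    """Hypothetical placeholder for a full pipeline run on one chunk."""
    await asyncio.sleep(0)
    return {"chunk": chunk, "n_tokens": len(chunk.split())}


async def run_all(chunks: List[str], max_concurrency: int = 4) -> List[Dict[str, Any]]:
    # The semaphore caps how many per-chunk pipelines are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: str) -> Dict[str, Any]:
        async with semaphore:
            return await run_single_chunk_pipeline(chunk)

    # gather returns results in the same order as the input chunks.
    return await asyncio.gather(*(guarded(c) for c in chunks))


results = asyncio.run(run_all(["a b c", "d e"]))
```

Results are gathered only once, at the end, which is where a step such as entity resolution could then run.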

Comment on lines +465 to +470
async def run(self, data: dict[str, Any]) -> dict[str, Any]:
    self.validate_inputs_config(data)
    self.reinitialize()
    orchestrator = Orchestrator(self)
    await orchestrator.run(data)
    return self._final_results.all()
Member

@oskarhane oskarhane Jul 30, 2024

I see a use case where someone might want to have a human in the loop to verify each step before proceeding to the next one.
To have that we'd need to support dumping config + state in a serialized format, and then hydrating a new Pipeline using the same.
And in addition to a .run() that runs all, have a .step() that takes the next step in the pipeline.

I don't see anything in this design preventing us from adding that, but I'm also looking at this code for the first time today.
So have a think if there's anything we might want to change in this PR to prevent future breaking changes when we want to add the step and dump/load functionality.

Contributor Author

That's a good point. For now we are tracking the state of each component to find the next one, but it is not persisted. This PR introduces show_as_dict and from_template methods on the pipeline, which could be extended to dump/load the pipeline (and its state). It has not been done yet because I'm unsure about how to deal with the serialization of component instances. But I don't think this would introduce breaking changes if we want to add this feature in the future.
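As a rough illustration of the step and dump/load idea, the toy below keeps the execution position and intermediate state in a JSON-serializable form. Everything in it is hypothetical (`SteppablePipeline`, `STEPS`, `step`, `dump`, `load`); it sidesteps the component-instance problem by referring to steps by name in a registry, which is an assumption and not how this PR works.

```python
import json
from typing import Any, Callable, Dict, List, Optional

# Hypothetical registry: steps are looked up by name, so only names are serialized.
STEPS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "split": lambda state: {"chunks": state["text"].split()},
    "count": lambda state: {"n_chunks": len(state["chunks"])},
}


class SteppablePipeline:
    def __init__(
        self,
        order: List[str],
        state: Optional[Dict[str, Any]] = None,
        position: int = 0,
    ) -> None:
        self.order = order
        self.state = dict(state or {})
        self.position = position

    def step(self) -> bool:
        """Run the next step; return False once nothing is left to run."""
        if self.position >= len(self.order):
            return False
        self.state.update(STEPS[self.order[self.position]](self.state))
        self.position += 1
        return True

    def dump(self) -> str:
        return json.dumps(
            {"order": self.order, "state": self.state, "position": self.position}
        )

    @classmethod
    def load(cls, payload: str) -> "SteppablePipeline":
        return cls(**json.loads(payload))


pipe = SteppablePipeline(["split", "count"], state={"text": "a b c"})
pipe.step()
snapshot = pipe.dump()  # a human could review this before continuing
resumed = SteppablePipeline.load(snapshot)
resumed.step()
```

A real implementation would still need an answer for components that hold non-serializable resources such as drivers or LLM clients, which is the open question raised above.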

@stellasia stellasia merged commit 59a37a0 into neo4j:feature/kg_builder Jul 31, 2024
willtai added a commit to willtai/neo4j-genai-python that referenced this pull request Jul 31, 2024
* First draft of pipeline/component architecture. Example using the RAG pipeline.

* More complex implementation of pipeline to deal with branching and aggregations - no async yet

* Introduce Store to add flexibility as where to store pipeline results - Only return the leaf components results by default

* Test RAG with new Pipeline implementation

* File refactoring

* Pipeline orchestration with async support

* Import sorting

* Pipeline rerun + exception on cyclic graph (for now)

* Mypy

* Python version compat

* Rename process->run for Components for consistency with Pipeline

* Move components test in the example folder - add some tests

* Race condition fix - documentation - ruff

* Fix import sorting

* mypy on tests

* Mark test as async

* Tests were not testing...

* Ability to create Pipeline templates

* Ruff

* Future + header

* Renaming + update import structure to make it more compatible with rest of the repo

* Check input parameters before starting the pipeline

* Introduce output model for component - Validate pipeline before running - More unit tests

* Import..

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks... and struggling with pydantic..

* Mypy on examples

* Add missing header

* Update doc

* Fix import in doc

* Update changelog

* Update docs/source/user_guide_pipeline.rst

Co-authored-by: willtai <wtaisen@gmail.com>

* Refactor tests folder to match src structure

* Move exceptions to separate file and rename them to make it clearer they are related to pipeline

* Mypy

* Rename def => config

* Introduce generic type to remove most of the "type: ignore" comments

* Remove unnecessary comment

* Ruff

* Document and test is_cyclic method

* Remove find_all method from store (simplify data retrieval)

* value is not a list anymore (or, if it is, it's on purpose)

* Remove comments, fix example in doc

* Remove core directory - move files to /pipeline

* Expose stores from pipeline subpackage

* Ability to pass the full output of one component to the next one - useful when a component accepts a pydantic model as input

* Component subclasses can return DataModel

* Add note on async + schema to illustrate parameter propagation

---------

Co-authored-by: willtai <wtaisen@gmail.com>
willtai added a commit that referenced this pull request Jul 31, 2024
stellasia added a commit that referenced this pull request Aug 8, 2024
* Pipeline  (#81)

* Entity / Relation extraction component

* Adds a Text Splitter (#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Add tests

* Keep it simple: remove deps to jinja for now

* Update example with existing components

* log config in example

* Fix tests

* Rm unused import

* Add copyright headers

* Rm debug code

* Try and fix tests

* Unused import

* get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions

* Return model is also conditioned to the existence of the run method
=> should raise an error if run is not implemented?

* Log when we do not raise exception to keep track of the failure

* Update prompt to match new KGwriter expected type

* Fix test

* Fix type for `examples`

* Use SchemaConfig as input for the ER Extractor component

* The "base" EntityRelationExtractor is an ABC that must be subclassed

* Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp

* Option to build lexical graph in the ERExtractor component

* Fix one test

* Fix some more tests

* Fix some more tests

* Remove "type: ignore" comments

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>
@stellasia stellasia deleted the kgb/pipeline branch August 13, 2024 08:58
alexthomas93 added a commit that referenced this pull request Aug 13, 2024
* Pipeline  (#81)

* Adds a Text Splitter (#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Added a TextChunkEmbedder (#87)

* Added a TextChunkEmbedder

* Added the copyright header to test_embedder.py

* Updated test_text_chunk_embedder_run

* Adds a knowledge graph writer (#83)

* Added copyright header to new files

* Added copyright header to kg_writer.py

* Added __future__ import to kg_writer.py for backwards compatibility of type hints

* Added E2E test for Neo4jWriter

* Added a copyright header to test_kg_builder_e2e.py

* Added upsert_vector test for relationship embeddings

* Moved KG writer and its tests

* Moved Neo4jGraph and associated objects to a new file

* Renamed KG builder fixture

* Added unit tests for KG writer

* Split upsert_vector into 2 functions

* Fixed broken cypher query strings

* Removed embedding creation from Neo4jWriter

* Fixed setup_neo4j_for_kg_construction fixture

* Added KGWriterModel class

* Fixed minor mistake in test_weaviate_e2e.py

* Renamed kg_construction folder to components

* Updated unit tests with new folder structure

* Fixed broken import

* Fixed copyright headers

* Added missing docstrings

* Fixed typo

* Add documentation for pipeline exceptions (#90)

* Fixes and refactors the KG writer component (#92)

* Fixes and refactors the KG writer component

* Fixed mypy error

* Made start_node_id and end_node_id parameters in UPSERT_RELATIONSHIP_QUERY

* Add schema for kg builder (#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (#96)

* Add entity / Relation extraction component (#85)

* Pipeline  (#81)

* Entity / Relation extraction component

* Adds a Text Splitter (#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Add tests

* Keep it simple: remove deps to jinja for now

* Update example with existing components

* log config in example

* Fix tests

* Rm unused import

* Add copyright headers

* Rm debug code

* Try and fix tests

* Unused import

* get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions

* Return model is also conditioned to the existence of the run method
=> should raise an error if run is not implemented?

* Log when we do not raise exception to keep track of the failure

* Update prompt to match new KGwriter expected type

* Fix test

* Fix type for `examples`

* Use SchemaConfig as input for the ER Extractor component

* The "base" EntityRelationExtractor is an ABC that must be subclassed

* Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp

* Option to build lexical graph in the ERExtractor component

* Fix one test

* Fix some more tests

* Fix some more tests

* Remove "type: ignore" comments

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>

* Update lock file after merge

* Remove pipeline/components folder (again)

* Updated component docs (#99)

* Updated component docs

* Removed weaviate test update

* Updated pipeline user guide with link to components in the API section

* Feature/kg builder e2e tests (#98)

* End to end tests for KG builder pipeline

* Adding chunk embedder to the pipeline and e2e tests

* Fix how the chunk embedding is saved

* Fix e2e tests

* Fix mypy

* mypy stuff :'(

* WIP: update e2e tests

* Check counts also here

* Enable e2e tests on this PR only

* Fix e2e tests (was not mocking the correct method for Embedder)

* Revert CI to normal

* Updated CHANGLOG and set max-parallel: 1 for E2E tests in pr-e2e-tests.yaml

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>
Co-authored-by: willtai <william.tai@neo4j.com>
willtai added a commit to willtai/neo4j-genai-python that referenced this pull request Aug 13, 2024
willtai added a commit to willtai/neo4j-genai-python that referenced this pull request Aug 13, 2024
willtai added a commit that referenced this pull request Aug 13, 2024
* Pipeline  (#81)

* Add documentation for pipeline exceptions (#90)

* Add schema for kg builder (#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (#96)

* Add PDFLoader Component

* Added tests

* Remove pypdf check

* Refactored examples

* Moved to experimental folder

* Exposed fs to run()

---------

Co-authored-by: Estelle Scifo <stellasia@users.noreply.github.com>
stellasia added a commit that referenced this pull request Aug 13, 2024
* Pipeline  (#81)

* Adds a Text Splitter (#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Added a TextChunkEmbedder (#87)

* Added a TextChunkEmbedder

* Added the copyright header to test_embedder.py

* Updated test_text_chunk_embedder_run

* Adds a knowledge graph writer (#83)

* Added copyright header to new files

* Added copyright header to kg_writer.py

* Added __future__ import to kg_writer.py for backwards compatibility of type hints

* Added E2E test for Neo4jWriter

* Added a copyright header to test_kg_builder_e2e.py

* Added upsert_vector test for relationship embeddings

* Moved KG writer and its tests

* Moved Neo4jGraph and associated objects to a new file

* Renamed KG builder fixture

* Added unit tests for KG writer

* Split upsert_vector into 2 functions

* Fixed broken cypher query strings

* Removed embedding creation from Neo4jWriter

* Fixed setup_neo4j_for_kg_construction fixture

* Added KGWriterModel class

* Fixed minor mistake in test_weaviate_e2e.py

* Renamed kg_construction folder to components

* Updated unit tests with new folder structure

* Fixed broken import

* Fixed copyright headers

* Added missing docstrings

* Fixed typo

* Add documentation for pipeline exceptions (#90)

* Start documentation for KG construction pipeline

* Fixes and refactors the KG writer component (#92)

* Fixes and refactors the KG writer component

* Fixed mypy error

* Made start_node_id and end_node_id parameters in UPSERT_RELATIONSHIP_QUERY

* Add schema for kg builder (#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (#96)

* Add entity / Relation extraction component (#85)

* Entity / Relation extraction component

* Add tests

* Keep it simple: remove deps to jinja for now

* Update example with existing components

* log config in example

* Fix tests

* Rm unused import

* Add copyright headers

* Rm debug code

* Try and fix tests

* Unused import

* get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions

* Return model is also conditioned to the existence of the run method
=> should raise an error if run is not implemented?

* Log when we do not raise exception to keep track of the failure

* Update prompt to match new KGwriter expected type

* Fix test

* Fix type for `examples`

* Use SchemaConfig as input for the ER Extractor component

* The "base" EntityRelationExtractor is an ABC that must be subclassed

* Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp

* Option to build lexical graph in the ERExtractor component

* Fix one test

* Fix some more tests

* Fix some more tests

* Remove "type: ignore" comments

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>

* Update lock file after merge

* Remove pipeline/components folder (again)

* Updated component docs (#99)

* Updated component docs

* Removed weaviate test update

* Updated pipeline user guide with link to components in the API section

* Feature/kg builder e2e tests (#98)

* End to end tests for KG builder pipeline

* Adding chunk embedder to the pipeline and e2e tests

* Fix how the chunk embedding is saved

* Fix e2e tests

* Fix mypy

* mypy stuff :'(

* WIP: update e2e tests

* Check counts also here

* Enable e2e tests on this PR only

* Fix e2e tests (was not mocking the correct method for Embedder)

* Revert CI to normal

* User guide for KG builder pipeline

* Update line length

* Review comments 1

* Address review comments - add missing file (image)

* Nicer lists

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>
Co-authored-by: willtai <william.tai@neo4j.com>
a-s-g93 pushed a commit to a-s-g93/neo4j-genai-python-dev that referenced this pull request Sep 13, 2024
* Pipeline  (neo4j#81)

* Adds a Text Splitter (neo4j#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Added a TextChunkEmbedder (neo4j#87)

* Added a TextChunkEmbedder

* Added the copyright header to test_embedder.py

* Updated test_text_chunk_embedder_run

* Adds a knowledge graph writer (neo4j#83)

* Added copyright header to new files

* Added copyright header to kg_writer.py

* Added __future__ import to kg_writer.py for backwards compatibility of type hints

* Added E2E test for Neo4jWriter

* Added a copyright header to test_kg_builder_e2e.py

* Added upsert_vector test for relationship embeddings

* Moved KG writer and its tests

* Moved Neo4jGraph and associated objects to a new file

* Renamed KG builder fixture

* Added unit tests for KG writer

* Split upsert_vector into 2 functions

* Fixed broken cypher query strings

* Removed embedding creation from Neo4jWriter

* Fixed setup_neo4j_for_kg_construction fixture

* Added KGWriterModel class

* Fixed minor mistake in test_weaviate_e2e.py

* Renamed kg_construction folder to components

* Updated unit tests with new folder structure

* Fixed broken import

* Fixed copyright headers

* Added missing docstrings

* Fixed typo

* Add documentation for pipeline exceptions (neo4j#90)

* Fixes and refactors the KG writer component (neo4j#92)

* Fixes and refactors the KG writer component

* Fixed mypy error

* Made start_node_id and end_node_id parameters in UPSERT_RELATIONSHIP_QUERY

* Add schema for kg builder (neo4j#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (neo4j#96)

* Add entity / Relation extraction component (neo4j#85)

* Entity / Relation extraction component

* Add tests

* Keep it simple: remove deps to jinja for now

* Update example with existing components

* log config in example

* Fix tests

* Rm unused import

* Add copyright headers

* Rm debug code

* Try and fix tests

* Unused import

* get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions

* Return model is also conditioned to the existence of the run method
=> should raise an error if run is not implemented?

* Log when we do not raise exception to keep track of the failure

* Update prompt to match new KGwriter expected type

* Fix test

* Fix type for `examples`

* Use SchemaConfig as input for the ER Extractor component

* The "base" EntityRelationExtractor is an ABC that must be subclassed

* Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp

* Option to build lexical graph in the ERExtractor component

* Fix one test

* Fix some more tests

* Fix some more tests

* Remove "type: ignore" comments

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>

* Update lock file after merge

* Remove pipeline/components folder (again)

* Updated component docs (neo4j#99)

* Updated component docs

* Removed weaviate test update

* Updated pipeline user guide with link to components in the API section

* Feature/kg builder e2e tests (neo4j#98)

* End to end tests for KG builder pipeline

* Adding chunk embedder to the pipeline and e2e tests

* Fix how the chunk embedding is saved

* Fix e2e tests

* Fix mypy

* mypy stuff :'(

* WIP: update e2e tests

* Check counts also here

* Enable e2e tests on this PR only

* Fix e2e tests (was not mocking the correct method for Embedder)

* Revert CI to normal

* Updated CHANGELOG and set max-parallel: 1 for E2E tests in pr-e2e-tests.yaml

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>
Co-authored-by: willtai <william.tai@neo4j.com>
a-s-g93 pushed a commit to a-s-g93/neo4j-genai-python-dev that referenced this pull request Sep 13, 2024
* Pipeline  (neo4j#81)

* Add documentation for pipeline exceptions (neo4j#90)

* Add schema for kg builder (neo4j#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (neo4j#96)

* Add PDFLoader Component

* Added tests

* Remove pypdf check

* Refactored examples

* Moved to experimental folder

* Exposed fs to run()

---------

Co-authored-by: Estelle Scifo <stellasia@users.noreply.github.com>
a-s-g93 pushed a commit to a-s-g93/neo4j-genai-python-dev that referenced this pull request Sep 13, 2024
* Adds a Text Splitter (neo4j#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Added a TextChunkEmbedder (neo4j#87)

* Added a TextChunkEmbedder

* Added the copyright header to test_embedder.py

* Updated test_text_chunk_embedder_run

* Adds a knowledge graph writer (neo4j#83)

* Added copyright header to new files

* Added copyright header to kg_writer.py

* Added __future__ import to kg_writer.py for backwards compatibility of type hints

* Added E2E test for Neo4jWriter

* Added a copyright header to test_kg_builder_e2e.py

* Added upsert_vector test for relationship embeddings

* Moved KG writer and its tests

* Moved Neo4jGraph and associated objects to a new file

* Renamed KG builder fixture

* Added unit tests for KG writer

* Split upsert_vector into 2 functions

* Fixed broken cypher query strings

* Removed embedding creation from Neo4jWriter

* Fixed setup_neo4j_for_kg_construction fixture

* Added KGWriterModel class

* Fixed minor mistake in test_weaviate_e2e.py

* Renamed kg_construction folder to components

* Updated unit tests with new folder structure

* Fixed broken import

* Fixed copyright headers

* Added missing docstrings

* Fixed typo

* Add documentation for pipeline exceptions (neo4j#90)

* Start documentation for KG construction pipeline

* Fixes and refactors the KG writer component (neo4j#92)

* Fixes and refactors the KG writer component

* Fixed mypy error

* Made start_node_id and end_node_id parameters in UPSERT_RELATIONSHIP_QUERY

* Add schema for kg builder (neo4j#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (neo4j#96)

* Add entity / Relation extraction component (neo4j#85)

* Entity / Relation extraction component

* Add tests

* Keep it simple: remove deps to jinja for now

* Update example with existing components

* log config in example

* Fix tests

* Rm unused import

* Add copyright headers

* Rm debug code

* Try and fix tests

* Unused import

* get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions

* Return model is also conditioned to the existence of the run method
=> should raise an error if run is not implemented?

* Log when we do not raise exception to keep track of the failure

* Update prompt to match new KGwriter expected type

* Fix test

* Fix type for `examples`

* Use SchemaConfig as input for the ER Extractor component

* The "base" EntityRelationExtractor is an ABC that must be subclassed

* Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp

* Option to build lexical graph in the ERExtractor component

* Fix one test

* Fix some more tests

* Fix some more tests

* Remove "type: ignore" comments

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>

* Update lock file after merge

* Remove pipeline/components folder (again)

* Updated component docs (neo4j#99)

* Updated component docs

* Removed weaviate test update

* Updated pipeline user guide with link to components in the API section

* Feature/kg builder e2e tests (neo4j#98)

* End to end tests for KG builder pipeline

* Adding chunk embedder to the pipeline and e2e tests

* Fix how the chunk embedding is saved

* Fix e2e tests

* Fix mypy

* mypy stuff :'(

* WIP: update e2e tests

* Check counts also here

* Enable e2e tests on this PR only

* Fix e2e tests (was not mocking the correct method for Embedder)

* Revert CI to normal

* User guide for KG builder pipeline

* Update line length

* Review comments 1

* Address review comments - add missing file (image)

* Nicer lists

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>
Co-authored-by: willtai <william.tai@neo4j.com>
a-s-g93 pushed a commit to a-s-g93/neo4j-genai-python-dev that referenced this pull request Sep 13, 2024
* Pipeline  (neo4j#81)

* First draft of pipeline/component architecture. Example using the RAG pipeline.

* More complex implementation of pipeline to deal with branching and aggregations - no async yet

* Introduce Store to add flexibility as to where to store pipeline results - Only return the leaf components' results by default

* Test RAG with new Pipeline implementation

* File refactoring

* Pipeline orchestration with async support

* Import sorting

* Pipeline rerun + exception on cyclic graph (for now)

* Mypy

* Python version compat

* Rename process->run for Components for consistency with Pipeline

* Move components test in the example folder - add some tests

* Race condition fix - documentation - ruff

* Fix import sorting

* mypy on tests

* Mark test as async

* Tests were not testing...

* Ability to create Pipeline templates

* Ruff

* Future + header

* Renaming + update import structure to make it more compatible with rest of the repo

* Check input parameters before starting the pipeline

* Introduce output model for component - Validate pipeline before running - More unit tests

* Import..

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks... and struggling with pydantic..

* Mypy on examples

* Add missing header

* Update doc

* Fix import in doc

* Update changelog

* Update docs/source/user_guide_pipeline.rst

Co-authored-by: willtai <wtaisen@gmail.com>

* Refactor tests folder to match src structure

* Move exceptions to separate file and rename them to make it clearer they are related to pipeline

* Mypy

* Rename def => config

* Introduce generic type to remove most of the "type: ignore" comments

* Remove unnecessary comment

* Ruff

* Document and test is_cyclic method

* Remove find_all method from store (simplify data retrieval)

* value is not a list anymore (or, if it is, it's on purpose)

* Remove comments, fix example in doc

* Remove core directory - move files to /pipeline

* Expose stores from pipeline subpackage

* Ability to pass the full output of one component to the next one - useful when a component accepts a pydantic model as input

* Component subclasses can return DataModel

* Add note on async + schema to illustrate parameter propagation

---------

Co-authored-by: willtai <wtaisen@gmail.com>

* Adds a Text Splitter (neo4j#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Added a TextChunkEmbedder (neo4j#87)

* Added a TextChunkEmbedder

* Added the copyright header to test_embedder.py

* Updated test_text_chunk_embedder_run

* Adds a knowledge graph writer (neo4j#83)

* Added copyright header to new files

* Added copyright header to kg_writer.py

* Added __future__ import to kg_writer.py for backwards compatibility of type hints

* Added E2E test for Neo4jWriter

* Added a copyright header to test_kg_builder_e2e.py

* Added upsert_vector test for relationship embeddings

* Moved KG writer and its tests

* Moved Neo4jGraph and associated objects to a new file

* Renamed KG builder fixture

* Added unit tests for KG writer

* Split upsert_vector into 2 functions

* Fixed broken cypher query strings

* Removed embedding creation from Neo4jWriter

* Fixed setup_neo4j_for_kg_construction fixture

* Added KGWriterModel class

* Fixed minor mistake in test_weaviate_e2e.py

* Renamed kg_construction folder to components

* Updated unit tests with new folder structure

* Fixed broken import

* Fixed copyright headers

* Added missing docstrings

* Fixed typo

* Add documentation for pipeline exceptions (neo4j#90)

* Fixes and refactors the KG writer component (neo4j#92)

* Fixes and refactors the KG writer component

* Fixed mypy error

* Made start_node_id and end_node_id parameters in UPSERT_RELATIONSHIP_QUERY

* Add schema for kg builder (neo4j#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (neo4j#96)

* Add entity / Relation extraction component (neo4j#85)

* Entity / Relation extraction component

* Add tests

* Keep it simple: remove deps to jinja for now

* Update example with existing components

* log config in example

* Fix tests

* Rm unused import

* Add copyright headers

* Rm debug code

* Try and fix tests

* Unused import

* get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions

* Return model is also conditioned to the existence of the run method
=> should raise an error if run is not implemented?

* Log when we do not raise exception to keep track of the failure

* Update prompt to match new KGwriter expected type

* Fix test

* Fix type for `examples`

* Use SchemaConfig as input for the ER Extractor component

* The "base" EntityRelationExtractor is an ABC that must be subclassed

* Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp

* Option to build lexical graph in the ERExtractor component

* Fix one test

* Fix some more tests

* Fix some more tests

* Remove "type: ignore" comments

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>

* Update lock file after merge

* Remove pipeline/components folder (again)

* Updated component docs (neo4j#99)

* Updated component docs

* Removed weaviate test update

* Updated pipeline user guide with link to components in the API section

* Feature/kg builder e2e tests (neo4j#98)

* End to end tests for KG builder pipeline

* Adding chunk embedder to the pipeline and e2e tests

* Fix how the chunk embedding is saved

* Fix e2e tests

* Fix mypy

* mypy stuff :'(

* WIP: update e2e tests

* Check counts also here

* Enable e2e tests on this PR only

* Fix e2e tests (was not mocking the correct method for Embedder)

* Revert CI to normal

* Updated CHANGELOG and set max-parallel: 1 for E2E tests in pr-e2e-tests.yaml

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>
Co-authored-by: willtai <william.tai@neo4j.com>
a-s-g93 pushed a commit to a-s-g93/neo4j-genai-python-dev that referenced this pull request Sep 13, 2024
* Pipeline  (neo4j#81)

* First draft of pipeline/component architecture. Example using the RAG pipeline.

* More complex implementation of pipeline to deal with branching and aggregations - no async yet

* Introduce Store to add flexibility as to where to store pipeline results - Only return the leaf components' results by default

* Test RAG with new Pipeline implementation

* File refactoring

* Pipeline orchestration with async support

* Import sorting

* Pipeline rerun + exception on cyclic graph (for now)

* Mypy

* Python version compat

* Rename process->run for Components for consistency with Pipeline

* Move components test in the example folder - add some tests

* Race condition fix - documentation - ruff

* Fix import sorting

* mypy on tests

* Mark test as async

* Tests were not testing...

* Ability to create Pipeline templates

* Ruff

* Future + header

* Renaming + update import structure to make it more compatible with rest of the repo

* Check input parameters before starting the pipeline

* Introduce output model for component - Validate pipeline before running - More unit tests

* Import..

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks... and struggling with pydantic..

* Mypy on examples

* Add missing header

* Update doc

* Fix import in doc

* Update changelog

* Update docs/source/user_guide_pipeline.rst

Co-authored-by: willtai <wtaisen@gmail.com>

* Refactor tests folder to match src structure

* Move exceptions to separate file and rename them to make it clearer they are related to pipeline

* Mypy

* Rename def => config

* Introduce generic type to remove most of the "type: ignore" comments

* Remove unnecessary comment

* Ruff

* Document and test is_cyclic method

* Remove find_all method from store (simplify data retrieval)

* value is not a list anymore (or, if it is, it's on purpose)

* Remove comments, fix example in doc

* Remove core directory - move files to /pipeline

* Expose stores from pipeline subpackage

* Ability to pass the full output of one component to the next one - useful when a component accepts a pydantic model as input

* Component subclasses can return DataModel

* Add note on async + schema to illustrate parameter propagation

---------

Co-authored-by: willtai <wtaisen@gmail.com>

* Add documentation for pipeline exceptions (neo4j#90)

* Add schema for kg builder (neo4j#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (neo4j#96)

* Add PDFLoader Component

* Added tests

* Remove pypdf check

* Refactored examples

* Moved to experimental folder

* Exposed fs to run()

---------

Co-authored-by: Estelle Scifo <stellasia@users.noreply.github.com>
a-s-g93 pushed a commit to a-s-g93/neo4j-genai-python-dev that referenced this pull request Sep 13, 2024
* Pipeline  (neo4j#81)

* First draft of pipeline/component architecture. Example using the RAG pipeline.

* More complex implementation of pipeline to deal with branching and aggregations - no async yet

* Introduce Store to add flexibility as to where to store pipeline results - Only return the leaf components' results by default

* Test RAG with new Pipeline implementation

* File refactoring

* Pipeline orchestration with async support

* Import sorting

* Pipeline rerun + exception on cyclic graph (for now)

* Mypy

* Python version compat

* Rename process->run for Components for consistency with Pipeline

* Move components test in the example folder - add some tests

* Race condition fix - documentation - ruff

* Fix import sorting

* mypy on tests

* Mark test as async

* Tests were not testing...

* Ability to create Pipeline templates

* Ruff

* Future + header

* Renaming + update import structure to make it more compatible with rest of the repo

* Check input parameters before starting the pipeline

* Introduce output model for component - Validate pipeline before running - More unit tests

* Import..

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks... and struggling with pydantic..

* Mypy on examples

* Add missing header

* Update doc

* Fix import in doc

* Update changelog

* Update docs/source/user_guide_pipeline.rst

Co-authored-by: willtai <wtaisen@gmail.com>

* Refactor tests folder to match src structure

* Move exceptions to separate file and rename them to make it clearer they are related to pipeline

* Mypy

* Rename def => config

* Introduce generic type to remove most of the "type: ignore" comments

* Remove unnecessary comment

* Ruff

* Document and test is_cyclic method

* Remove find_all method from store (simplify data retrieval)

* value is not a list anymore (or, if it is, it's on purpose)

* Remove comments, fix example in doc

* Remove core directory - move files to /pipeline

* Expose stores from pipeline subpackage

* Ability to pass the full output of one component to the next one - useful when a component accepts a pydantic model as input

* Component subclasses can return DataModel

* Add note on async + schema to illustrate parameter propagation

---------

Co-authored-by: willtai <wtaisen@gmail.com>

* Adds a Text Splitter (neo4j#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Added a TextChunkEmbedder (neo4j#87)

* Added a TextChunkEmbedder

* Added the copyright header to test_embedder.py

* Updated test_text_chunk_embedder_run

* Adds a knowledge graph writer (neo4j#83)

* Added copyright header to new files

* Added copyright header to kg_writer.py

* Added __future__ import to kg_writer.py for backwards compatibility of type hints

* Added E2E test for Neo4jWriter

* Added a copyright header to test_kg_builder_e2e.py

* Added upsert_vector test for relationship embeddings

* Moved KG writer and its tests

* Moved Neo4jGraph and associated objects to a new file

* Renamed KG builder fixture

* Added unit tests for KG writer

* Split upsert_vector into 2 functions

* Fixed broken cypher query strings

* Removed embedding creation from Neo4jWriter

* Fixed setup_neo4j_for_kg_construction fixture

* Added KGWriterModel class

* Fixed minor mistake in test_weaviate_e2e.py

* Renamed kg_construction folder to components

* Updated unit tests with new folder structure

* Fixed broken import

* Fixed copyright headers

* Added missing docstrings

* Fixed typo

* Add documentation for pipeline exceptions (neo4j#90)

* Start documentation for KG construction pipeline

* Fixes and refactors the KG writer component (neo4j#92)

* Fixes and refactors the KG writer component

* Fixed mypy error

* Made start_node_id and end_node_id parameters in UPSERT_RELATIONSHIP_QUERY

* Add schema for kg builder (neo4j#88)

* Add schema for kg builder and tests

* Fixed mypy checks

* Reverted kg builder example with schema

* Revert to List and Dict due to Python3.8 issue with using get_type_hints

* Added properties to Entity and Relation

* Add test for missing properties

* Fix type annotations in test

* Add property types

* Refactored entity, relation, and property types

* Unused import

* Moved schema to components/ (neo4j#96)

* Add entity / Relation extraction component (neo4j#85)

* Pipeline (neo4j#81)

* First draft of pipeline/component architecture. Example using the RAG pipeline.

* More complex implementation of pipeline to deal with branching and aggregations - no async yet

* Introduce Store to add flexibility as to where to store pipeline results - Only return the leaf components results by default

* Test RAG with new Pipeline implementation

* File refactoring

* Pipeline orchestration with async support

* Import sorting

* Pipeline rerun + exception on cyclic graph (for now)

* Mypy

* Python version compat

* Rename process->run for Components for consistency with Pipeline

* Move components test in the example folder - add some tests

* Race condition fix - documentation - ruff

* Fix import sorting

* mypy on tests

* Mark test as async

* Tests were not testing...

* Ability to create Pipeline templates

* Ruff

* Future + header

* Renaming + update import structure to make it more compatible with rest of the repo

* Check input parameters before starting the pipeline

* Introduce output model for component - Validate pipeline before running - More unit tests

* Import..

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks...

* Finally installed pre-commit hooks... and struggling with pydantic..

* Mypy on examples

* Add missing header

* Update doc

* Fix import in doc

* Update changelog

* Update docs/source/user_guide_pipeline.rst

Co-authored-by: willtai <wtaisen@gmail.com>

* Refactor tests folder to match src structure

* Move exceptions to separate file and rename them to make it clearer they are related to pipeline

* Mypy

* Rename def => config

* Introduce generic type to remove most of the "type: ignore" comments

* Remove unnecessary comment

* Ruff

* Document and test is_cyclic method

* Remove find_all method from store (simplify data retrieval)

* value is not a list anymore (or, if it is, it's on purpose)

* Remove comments, fix example in doc

* Remove core directory - move files to /pipeline

* Expose stores from pipeline subpackage

* Ability to pass the full output of one component to the next one - useful when a component accepts a pydantic model as input

* Component subclasses can return DataModel

* Add note on async + schema to illustrate parameter propagation

---------

Co-authored-by: willtai <wtaisen@gmail.com>

* Entity / Relation extraction component

* Adds a Text Splitter (neo4j#82)

* Added text splitter adapter class

* Added copyright header to new files

* Added __future__ import to text_splitters.py for backwards compatibility of type hints

* Moved text splitter file and tests

* Split text splitter adapter into 2 adapters

* Added optional metadata to text chunks

* Fixed typos

* Moved text splitters inside of the components folder

* Fixed Component import

* Add tests

* Keep it simple: remove deps to jinja for now

* Update example with existing components

* log config in example

* Fix tests

* Rm unused import

* Add copyright headers

* Rm debug code

* Try and fix tests

* Unused import

* get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions

* Return model is also conditioned to the existence of the run method => should raise an error if run is not implemented?

* Log when we do not raise exception to keep track of the failure

* Update prompt to match new KGwriter expected type

* Fix test

* Fix type for `examples`

* Use SchemaConfig as input for the ER Extractor component

* The "base" EntityRelationExtractor is an ABC that must be subclassed

* Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp

* Option to build lexical graph in the ERExtractor component

* Fix one test

* Fix some more tests

* Fix some more tests

* Remove "type: ignore" comments

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>

* Update lock file after merge

* Remove pipeline/components folder (again)

* Updated component docs (neo4j#99)

* Updated component docs

* Removed weaviate test update

* Updated pipeline user guide with link to components in the API section

* Feature/kg builder e2e tests (neo4j#98)

* End to end tests for KG builder pipeline

* Adding chunk embedder to the pipeline and e2e tests

* Fix how the chunk embedding is saved

* Fix e2e tests

* Fix mypy

* mypy stuff :'(

* WIP: update e2e tests

* Check counts also here

* Enable e2e tests on this PR only

* Fix e2e tests (was not mocking the correct method for Embedder)

* Revert CI to normal

* User guide for KG builder pipeline

* Update line length

* Review comments 1

* Address review comments - add missing file (image)

* Nicer lists

---------

Co-authored-by: willtai <wtaisen@gmail.com>
Co-authored-by: Alex Thomas <alexthomas93@users.noreply.github.com>
Co-authored-by: willtai <william.tai@neo4j.com>
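
Several commits above note that `get_type_hints` failed on Python 3.8/3.9 even with `__future__` annotations, and that the fix was to go back to `typing.Dict`/`typing.List`. The sketch below illustrates the likely cause; it is not code from this PR, and `count_words` / `count_words_builtin` are hypothetical names. Explicit string annotations stand in for what `from __future__ import annotations` produces, so the behaviour is the same: the string is only evaluated when `get_type_hints` is called.

```python
from typing import Dict, get_type_hints


# String annotations mimic postponed evaluation: nothing is evaluated
# at definition time, only when get_type_hints() is called.
def count_words(text: "str") -> "Dict[str, int]":
    """Count occurrences of each whitespace-separated word."""
    counts: Dict[str, int] = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_words_builtin(text: "str") -> "dict[str, int]":
    """Same function, annotated with the built-in generic syntax."""
    return count_words(text)


# typing.Dict is subscriptable on every supported Python version,
# so resolving the hints is safe everywhere.
hints = get_type_hints(count_words)

# Evaluating "dict[str, int]" requires Python 3.9+; on 3.8 the built-in
# dict is not subscriptable and get_type_hints() raises TypeError.
try:
    builtin_hints = get_type_hints(count_words_builtin)
except TypeError:
    builtin_hints = None
```

Postponed annotations only delay the evaluation; any tool that resolves them at runtime (here `get_type_hints`) still needs syntax the running interpreter can evaluate, which is presumably why the `typing` aliases were kept.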

5 participants