[DataClass + Agent] Improve DataClass to support nested data classes + consolidate shared code for schema #53

Merged
merged 7 commits into from
Jun 29, 2024
dataclass pass all tests of nested data classes
liyin2015 committed Jun 29, 2024
commit 538bc103d8a8381124a89045a647aa5e9d0c457d
41 changes: 36 additions & 5 deletions docs/source/developer_notes/base_data_class.rst
@@ -7,10 +7,25 @@ DataClass

`Li Yin <https://github.com/liyin2015>`_

In PyTorch, ``Tensor`` is the data type used in ``Module`` and ``Optimizer`` across the library.
The data in particular is a multi-dimensional matrix such as weights, biases, and even inputs and predictions.
In LLM applications, you can think of the data as a freeform data class with various fields and types of data.
For instance:
In `PyTorch`, ``Tensor`` is the data type used in ``Module`` and ``Optimizer`` across the library.
Tensor wraps a multi-dimensional matrix to better support its operations and computations.
In LLM applications, data is constantly converted to strings to be placed in prompts, and the LLM's text predictions are parsed back into structured data.
:class:`core.base_data_class.DataClass` is designed to ease this data interaction with LLMs via the prompt (input) and the text prediction (output).

.. figure:: /_static/images/dataclass.png
   :align: center
   :alt: DataClass
   :width: 680px

   DataClass eases the data interaction with LLMs via the prompt (input) and the text prediction (output).


Design
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In Python, data is typically represented as a class with attributes.
Python's native ``dataclasses`` module is lightweight and flexible to further help users define a data class.

The ``dataclass`` decorator in Python automatically generates special methods such as ``__init__``, ``__repr__``, and ``__eq__`` for a class.
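As a minimal sketch of what the standard-library decorator provides, the snippet below defines a small data class and exercises the generated methods. The ``Question`` class and its fields are hypothetical names chosen for illustration; they are not part of LightRAG.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Question:  # hypothetical example class, not part of LightRAG
    text: str
    label: int = 0  # fields may carry simple defaults
    tags: list = field(default_factory=list)  # mutable defaults need a factory


# The generated __init__ accepts the declared fields as arguments.
a = Question(text="What is a tensor?", label=1)
b = Question(text="What is a tensor?", label=1)

print(a)       # the generated __repr__ lists every field and its value
print(a == b)  # the generated __eq__ compares instances field by field
print([f.name for f in fields(a)])  # declared fields can be introspected
```

Because the field definitions are available through ``dataclasses.fields``, a library can inspect them to build a schema or signature without any third-party dependency.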

.. code-block:: python

@@ -24,7 +39,7 @@ For instance:
It is exactly a single input data item in a typical PyTorch ``Dataset`` or a `HuggingFace` ``Dataset``.
What is unique in LLM applications is that all data and tools interact with the LLM through the prompt and the text prediction, each of which is a single ``str``.
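The round trip between structured data and a single ``str`` can be sketched with the standard library alone. This is an illustrative assumption of how nested data classes might be serialized for a prompt and rebuilt from a text prediction; it does not use or reproduce the actual ``DataClass`` API, and ``Address``, ``Person``, ``to_prompt_str``, and ``from_prediction`` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Address:  # hypothetical nested data class
    city: str
    country: str


@dataclass
class Person:  # hypothetical outer data class holding a nested one
    name: str
    address: Address


def to_prompt_str(person: Person) -> str:
    """Serialize to a single string; asdict recurses into nested dataclasses."""
    return json.dumps(asdict(person))


def from_prediction(text: str) -> Person:
    """Parse a JSON string back into the nested structure."""
    raw = json.loads(text)
    # The nested dict must be rebuilt explicitly; json.loads only yields dicts.
    return Person(name=raw["name"], address=Address(**raw["address"]))


person = Person(name="Ada", address=Address(city="London", country="UK"))
text = to_prompt_str(person)      # what would be placed in a prompt
restored = from_prediction(text)  # what would be parsed from an LLM's output
print(text)
print(restored == person)
```

The explicit reconstruction step in ``from_prediction`` is the part that grows tedious with deeper nesting, which is the kind of work a shared base class can take over.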

Most existing libraries use `Pydantic` to handle the serialization(convert to string) and deserialization(convert from string) of the data.
Most existing libraries use `Pydantic` to handle the serialization(convert to string) and deserialization(convert back from string) of the data.
In LightRAG, however, we deliberately designed :class:`core.base_data_class.DataClass` on top of the native ``dataclasses`` module.
The reasons are:

@@ -265,6 +280,22 @@ Here is a real-world example:



.. admonition:: References
   :class: highlight

   1. Full-text search on PostgreSQL: https://www.postgresql.org/docs/current/textsearch.html



.. admonition:: API References
   :class: highlight

   - :class:`core.base_data_class.DataClass`
   - :class:`core.base_data_class.DataClassFormatType`
   - :func:`core.functional.custom_asdict`
   - :ref:`core.base_data_class<core-base_data_class>`


.. Document
.. ------------
.. We defined `Document` to function as a `string` container, and it can be used for any kind of text data along its `metadata` and relations