README.md: 7 additions & 0 deletions
@@ -19,6 +19,10 @@ cd DataProcessingFramework
 pip install .
 ```

+Extra requirements: `filters`, `dev`, `llava`, `video_llava`
+
+To install extra requirements, run: `pip install .[filters]`
+
 ## Overview

 The framework supports the following features:
@@ -31,6 +35,9 @@ Framework supports following features:

 DPF allows you to easily filter datasets and add new metadata.
 For example, the code below generates synthetic captions for images in shards on remote S3 storage and updates the dataset metadata without downloading the shards:
+
+Before running the example below, install the extra requirements: `pip install DPF[filters,llava]`
docs/filters.md: 132 additions & 1 deletion
@@ -86,5 +86,136 @@ You can find usage examples [there](../examples).

 ### Creating new filter

-TODO
+To add your own filter, create a new filter class.
+If your filter uses only data from columns (e.g. the _text_ modality), inherit your class from the [ColumnFilter class](../DPF/filters/column_filter.py).
+If your filter uses data from files, inherit your class from the [DataFilter class](../DPF/filters/data_filter.py).

+#### Creating DataFilter
+
+To create a new `DataFilter`, add a new file to the folder for the modality your filter uses.
+For example, if your filter uses the _images_ modality, create the file in the [DPF/filters/images/](../DPF/filters/images) folder.
+If your filter uses the _texts_ and _images_ modalities, create the file in [DPF/filters/text2image/](../DPF/filters/text2image), and so on.
+
+Inherit your filter from the corresponding `DataFilter` class in the modality folder:
+- [DPF/filters/images/img_filter.py](../DPF/filters/images/img_filter.py) for _images_
+- [DPF/filters/text2image/t2i_filter.py](../DPF/filters/text2image/t2i_filter.py) for _texts_ and _images_
+- [DPF/filters/videos/video_filter.py](../DPF/filters/videos/video_filter.py) for _videos_
+
+Then implement the `result_columns` and `dataloader_kwargs` properties and the `preprocess_data` and `process_batch` methods.
+- `result_columns` - list of result columns that the filter adds to a DataFrame
+- `dataloader_kwargs` - parameters for a PyTorch DataLoader
+- `preprocess_data` - method where data preprocessing is implemented. This method is passed to the dataloader, so preprocessing runs in multiple processes. Do not use CUDA operations in this method.
+- `process_batch` - method where a batch is processed with the model
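
The four members listed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the DPF API: `SketchDataFilter` is a hypothetical stand-in for a modality base class, `FileSizeFilter` is a toy filter, and the method signatures and the `"path"` metadata key are assumptions made for the example. The real base classes in `DPF/filters/` may expect different arguments and return types, so check them before copying this shape.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class SketchDataFilter(ABC):
    """Hypothetical stand-in for a modality base class; NOT the real DPF API."""

    @property
    @abstractmethod
    def result_columns(self) -> List[str]:
        """Names of the columns the filter adds to the DataFrame."""

    @property
    @abstractmethod
    def dataloader_kwargs(self) -> Dict[str, Any]:
        """Keyword arguments meant for the PyTorch DataLoader."""

    @abstractmethod
    def preprocess_data(self, raw: bytes, metadata: Dict[str, Any]) -> Any:
        """Per-sample preprocessing; runs in worker processes, so no CUDA here."""

    @abstractmethod
    def process_batch(self, batch: List[Any]) -> Dict[str, List[Any]]:
        """Turn a batch of preprocessed samples into result-column values."""


class FileSizeFilter(SketchDataFilter):
    """Toy filter that records each file's size in bytes."""

    def __init__(self, workers: int = 4, batch_size: int = 8) -> None:
        self.workers = workers
        self.batch_size = batch_size

    @property
    def result_columns(self) -> List[str]:
        return ["file_size_bytes"]

    @property
    def dataloader_kwargs(self) -> Dict[str, Any]:
        # num_workers, batch_size and drop_last are standard DataLoader arguments.
        return {
            "num_workers": self.workers,
            "batch_size": self.batch_size,
            "drop_last": False,
        }

    def preprocess_data(self, raw: bytes, metadata: Dict[str, Any]) -> Tuple[str, int]:
        # CPU-only work: keep the (assumed) "path" key and measure the payload.
        return metadata["path"], len(raw)

    def process_batch(self, batch: List[Tuple[str, int]]) -> Dict[str, List[Any]]:
        # A real filter would run its model here; this toy one only regroups values.
        paths = [path for path, _ in batch]
        sizes = [size for _, size in batch]
        return {"path": paths, "file_size_bytes": sizes}


# Simulate what a dataloader would do: preprocess each sample, then batch them.
flt = FileSizeFilter()
samples = [(b"abc", {"path": "a.jpg"}), (b"", {"path": "b.jpg"})]
batch = [flt.preprocess_data(raw, meta) for raw, meta in samples]
result = flt.process_batch(batch)
print(result)
```

Keeping `preprocess_data` free of model and GPU state is what allows it to run in dataloader worker processes, while `process_batch` is the single place where a model would be called.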