feat: Add upload parquet and manifest files #25

Open
wants to merge 34 commits into
base: main
34 commits
924ca50
add method export to parquet
SantanaTiago Apr 21, 2025
00f744c
typo
SantanaTiago Apr 21, 2025
474df33
add missing dependencies for parquet export
SantanaTiago Apr 21, 2025
e85400c
add upload of manifest file. add download file to temporary file to u…
SantanaTiago Apr 22, 2025
05ca2e4
typo
SantanaTiago Apr 22, 2025
97e101e
changed usage of temporaryfile to temporary directory at top
SantanaTiago Apr 22, 2025
07bf861
improved manifest file
SantanaTiago Apr 22, 2025
52c8de2
removed pages info from manifest files
SantanaTiago Apr 22, 2025
9466f17
add check file size limit before download. changed save s3 path from …
SantanaTiago Apr 23, 2025
f3631d0
fix s3 path name
SantanaTiago Apr 23, 2025
4da5ef4
typo
SantanaTiago Apr 23, 2025
0958312
improve key_prefix path in get_source_files
SantanaTiago Apr 23, 2025
80d7b7c
removed unused key from parquet
SantanaTiago Apr 23, 2025
f681710
merge with main
SantanaTiago May 8, 2025
c97b7d1
changed parquet to a single file in conversion.
SantanaTiago May 12, 2025
22e7f76
merge with main
SantanaTiago May 12, 2025
a9bc859
redone uv lock
SantanaTiago May 12, 2025
e9d4aa0
merge with main
SantanaTiago May 12, 2025
cef5357
set parquet engine
SantanaTiago May 12, 2025
57d40a9
add check file is not empty before appending data
SantanaTiago May 12, 2025
8471edb
changed engine in the create of parquet file
SantanaTiago May 13, 2025
c378cb0
disable parquet export for testing
SantanaTiago May 13, 2025
236f55c
set export parquet default to True. add exception handler in writing …
SantanaTiago May 13, 2025
59704c5
commented pdf file write in parquet file
SantanaTiago May 13, 2025
da0555e
changed parquet write file from a single dataframe
SantanaTiago May 14, 2025
2da3ae6
typo
SantanaTiago May 14, 2025
b080ce4
add debug logs
SantanaTiago May 14, 2025
3f64da4
add return to document_to_dataframe
SantanaTiago May 14, 2025
f2c785d
add write parquet file with max size file. add filename to parquet info
SantanaTiago May 14, 2025
71327d2
improved naming of parquet and manifest files. clear dev logs
SantanaTiago May 15, 2025
c6e37da
removed usage of Optional and Union
SantanaTiago May 15, 2025
39a0edc
merge with main
SantanaTiago May 22, 2025
659422b
run uv lock
SantanaTiago May 22, 2025
b7afa33
add missing types. set max parquet size to 500MB
SantanaTiago May 23, 2025