Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: separate celery workers #2948

Merged
merged 87 commits into from
Aug 6, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
87 commits
Select commit Hold shift + click to select a range
6f205a4
add: auth dropbox
chloedia Jul 15, 2024
483db0a
fix: dropbox sync creds
chloedia Jul 15, 2024
c587456
add: list dropbox items
chloedia Jul 15, 2024
1d9070f
add: weblink
chloedia Jul 16, 2024
d55f411
feat: poetry lock
StanGirard Jul 16, 2024
9d98792
feat: Add Dropbox sync functionality
StanGirard Jul 16, 2024
78d161f
fix: DropBox integration link & front dropbox
chloedia Jul 16, 2024
ac0d211
add: notification to DropBox
chloedia Jul 16, 2024
36a4ecd
fix: delete Session middleware
chloedia Jul 17, 2024
98c216a
fix: refacto syncs
chloedia Jul 17, 2024
6631975
fix: remove useless debug & fix dropbox refresh
chloedia Jul 17, 2024
3b00e66
fix: merge problems
chloedia Jul 17, 2024
7625182
feat: Update sync datetime formats for Google Drive and Azure Drive &…
chloedia Jul 18, 2024
806af03
add: retry fetch_files Azure
chloedia Jul 18, 2024
f002a09
add : notion auth
chloedia Jul 18, 2024
e1c2742
feat: Add Notion sync functionality and update dependencies
chloedia Jul 18, 2024
1a7a961
add: download file for notion
chloedia Jul 19, 2024
e02d273
add: notion front
chloedia Jul 19, 2024
c90f08b
fix: add notion icon url
chloedia Jul 19, 2024
6e76108
add: is_subfolder, fix: notif failures,
chloedia Jul 22, 2024
36ae406
add: icon
chloedia Jul 22, 2024
8f07c79
add: front show emoji
chloedia Jul 22, 2024
8d18ba9
add: add async to notion integration
chloedia Jul 23, 2024
df73259
feat: Add Notion sync functionality and update dependencies
chloedia Jul 23, 2024
86a4721
add: redis
chloedia Jul 23, 2024
693d9c3
add: store in db + get_files from db
chloedia Jul 26, 2024
4e8b8f0
add: migration notion sync
chloedia Jul 30, 2024
4fa4f6e
tests sync notion
AmineDiro Jul 30, 2024
3443586
fix test notion session sync
AmineDiro Jul 30, 2024
9971d2b
fix test sync
AmineDiro Jul 30, 2024
b6bb55d
feat(frontend): handle no brain selection (#2932)
Zewed Jul 31, 2024
24672c0
moved parsed to worker and update packages
AmineDiro Jul 31, 2024
8819379
add: get_file, upload, add notion page to db
chloedia Jul 31, 2024
8ca9e53
add: get_file, upload, add notion page to db
chloedia Jul 31, 2024
5b50f80
add: get_file, upload, add notion page to db
chloedia Jul 31, 2024
794e3ad
fix: merge with main
chloedia Jul 31, 2024
95bf927
fix: merge with main
chloedia Jul 31, 2024
c0a7fec
fix: merge with main
chloedia Jul 31, 2024
03fa348
feat: add sync logic
chloedia Jul 31, 2024
2d64962
fix: processor quivr version (#2934)
AmineDiro Jul 31, 2024
13709b6
init quivr_worker
AmineDiro Aug 1, 2024
0d3b698
Merge remote-tracking branch 'origin/main' into feat/separate-celery-…
AmineDiro Aug 1, 2024
d9c1f3a
fix: quivr core fix tests (#2935)
AmineDiro Aug 1, 2024
837f33d
moved parsers
AmineDiro Aug 1, 2024
bd8763f
merged main
AmineDiro Aug 1, 2024
a4150f1
refacto file
AmineDiro Aug 1, 2024
de8ae64
add: sync in sync cron job
chloedia Aug 1, 2024
0584a79
moved pdf parsing to megaparse
Aug 2, 2024
5a9d27a
moved md5 in core to sha1
Aug 2, 2024
cd01abc
moved check_premium + changed qfile to sha1
Aug 2, 2024
b3debee
chore(main): release core 0.0.13 (#2930)
StanGirard Aug 2, 2024
e775644
finish: regular notion sync
chloedia Aug 2, 2024
9a93dbb
finish: regular notion sync
chloedia Aug 2, 2024
1397d49
quivr_core qfile changes
AmineDiro Aug 2, 2024
67bf3a0
refacto processing + tasks
AmineDiro Aug 2, 2024
b1d5f45
parse simple tests
Aug 2, 2024
207e4f5
added llama-parse dep
Aug 2, 2024
071670b
added processing audio to parsers + tests
Aug 3, 2024
e8b7423
crawler celery workers
Aug 3, 2024
5a38a98
dockerfile.dev workers
Aug 3, 2024
7a463f5
dockerfile.dev workers
Aug 4, 2024
e918c73
Merge branch 'main' into feat/notion-oauth-sync
chloedia Aug 4, 2024
3c2cae2
fix: double mime type in file name
chloedia Aug 4, 2024
94f7948
working docker-compose
AmineDiro Aug 4, 2024
dbd2191
patch json dump tests
AmineDiro Aug 4, 2024
99491e7
logger + placeholder test
AmineDiro Aug 4, 2024
499b907
Delete backend/supabase/migrations/20240107152745_ollama.sql
AmineDiro Aug 4, 2024
6c29231
add: normalize async for BaseSync & SyncUtils
chloedia Aug 4, 2024
1a27e23
remove prints
chloedia Aug 4, 2024
e8ac203
changed dockerfile
AmineDiro Aug 5, 2024
64207c4
default venv
AmineDiro Aug 5, 2024
52b5037
fix original file_name
AmineDiro Aug 5, 2024
5144275
skip all known processors checks
AmineDiro Aug 5, 2024
12c2d92
rolled back crawler
AmineDiro Aug 5, 2024
61d9b47
production docker images
AmineDiro Aug 5, 2024
c129cd0
ignore tox env
AmineDiro Aug 5, 2024
9934a7a
feat: Add GitHub sync functionality to sync router (#2871)
chloedia Aug 5, 2024
1c63608
refactor: Remove syncGitHub function from useSync.ts (#2942)
StanGirard Aug 5, 2024
71733e4
rolled back crawler
AmineDiro Aug 5, 2024
803f425
fix: sync update only modified files
chloedia Aug 5, 2024
704b47d
merge with main
chloedia Aug 5, 2024
3f4f323
sync utils rewrite
AmineDiro Aug 6, 2024
0463c8a
push new lock
chloedia Aug 6, 2024
4016da9
push new lock
chloedia Aug 6, 2024
adc71b7
push new lock
chloedia Aug 6, 2024
46df016
Merged feat/notion-oauth
AmineDiro Aug 6, 2024
dafda78
lock files ok
AmineDiro Aug 6, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
add: download file for notion
  • Loading branch information
chloedia committed Jul 19, 2024
commit 1a7a9616775367b76171a60b9a292c16092abdec
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,7 @@ def oauth2callback_notion(request: Request):
# Get account information
user_info = notion.users.me()

owner_info = user_info["bot"]["owner"]["user"]
owner_info = user_info["bot"]["owner"]["user"] # type: ignore
user_email = owner_info["person"]["email"]
account_id = owner_info["id"]

Expand Down
257 changes: 256 additions & 1 deletion backend/api/quivr_api/modules/sync/utils/sync.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,15 +3,17 @@
import time
from abc import ABC, abstractmethod
from io import BytesIO
from typing import Any, Dict, List
from typing import Any, Dict, List, Optional

import dropbox
import markdownify
import msal
import requests
from fastapi import HTTPException
from google.auth.transport.requests import Request as GoogleRequest
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from notion_client import Client
from quivr_api.logger import get_logger
from quivr_api.modules.sync.entity.sync import SyncFile
from quivr_api.modules.sync.utils.normalize import remove_special_characters
Expand Down Expand Up @@ -620,3 +622,256 @@ def download_file(self, credentials: Dict, file: SyncFile) -> BytesIO:

metadata, file_data = self.dbx.files_download(file_id) # type: ignore
return BytesIO(file_data.content)


class NotionSync(BaseSync):
name = "Notion"
lower_name = "notion"
notion: Optional[Client] = None
datetime_format: str = "%Y-%m-%dT%H:%M:%S.%fZ"

def link_notion(self, credentials) -> Client:
return Client(auth=credentials["access_token"])

def check_and_refresh_access_token(self, credentials: Dict) -> Dict:
if not self.notion:
self.notion = self.link_notion(credentials)
# no need to refresh token for notion
return credentials

def fetch_root_pages(self) -> List[SyncFile]:
pages = []
if not self.notion:
raise Exception("Notion client is not initialized")

search_result = self.notion.search(
query="", filter={"property": "object", "value": "page"}
)["results"]
print(len(search_result))
for page in search_result:
# print(page)
if "parent" in page and page["parent"]["type"] == "workspace":
page_info = SyncFile(
name=page["properties"]["title"]["title"][0]["text"]["content"],
id=page["id"],
is_folder=True,
last_modified=page["last_edited_time"],
mime_type="md",
web_view_link=page["url"],
)
pages.append(page_info)
return pages

def fetch_pages(self, page_id, recursive):
pages = []
blocks = self.notion.blocks.children.list(page_id)["results"] # type: ignore

for block in blocks:
block_type = block["type"]
# if block_type is child database mark it as unclickable
page_info = SyncFile(
name=block[block_type]["title"],
id=block["id"],
is_folder=True,
last_modified=block["last_edited_time"],
mime_type="md" if block_type == "page" else "db",
web_view_link=f"https://www.notion.so/{block['id'].replace('-', '')}",
)
pages.append(page_info)

if recursive:
sub_pages = self.fetch_pages(block["id"], recursive)
pages.extend(sub_pages)

return pages

def get_files(
self, credentials: Dict, page_id: str = "", recursive: bool = False
) -> List[SyncFile]:
"""
Retrieve files (pages) from Notion.

Args:
credentials (dict): The credentials for accessing Notion.
page_id (str, optional): The root page ID to start fetching files. Defaults to "".
recursive (bool, optional): If True, fetch files from all subpages. Defaults to False.

Returns:
list: A list of SyncFile objects containing the page metadata.
"""

logger.info("Retrieving Notion files (pages) from page_id: %s", page_id)

try:
if not self.notion:
self.notion = self.link_notion(credentials)

if not page_id:
files = self.fetch_root_pages()
else:
files = self.fetch_pages(page_id, recursive)

logger.info("Notion files (pages) retrieved successfully: %d", len(files))
return files

except Exception as e:
logger.error("Unexpected error: %s", e)
raise Exception("Failed to retrieve files")

def get_files_by_id(self, credentials: Dict, file_ids: List[str]) -> List[SyncFile]:
"""
Retrieve files from Notion by their IDs.

Args:
credentials (dict): The credentials for accessing Notion.
file_ids (list): The list of file IDs to retrieve.

Returns:
list: A list of SyncFile objects containing the metadata of each file.
"""
logger.info("Retrieving Notion files by file_ids: %s", file_ids)

if not self.notion:
self.notion = self.link_notion(credentials)

files = []

for file_id in file_ids:
try:
page = self.notion.pages.retrieve(file_id)
page_info = SyncFile(
name=page["properties"]["title"]["title"][0]["text"]["content"],
id=page["id"],
is_folder=True, # Notion pages are generally considered as folders
last_modified=page["last_edited_time"],
mime_type="md",
web_view_link=page["url"],
)
files.append(page_info)

except Exception as e:
logger.error("Error retrieving Notion file with ID %s: %s", file_id, e)
continue # Skip this file and proceed with the next one

logger.info("Notion files retrieved successfully by IDs: %d", len(files))
return files

def get_block_content(self, block):
block_type = block["type"]
result = ""

if block_type == "paragraph":
result = markdownify.markdownify(
block["paragraph"]["rich_text"][0]["plain_text"]
)

elif block_type == "heading_1":
result = "# " + markdownify.markdownify(
block["heading_1"]["text"][0]["plain_text"]
)

elif block_type == "heading_2":
result = "## " + markdownify.markdownify(
block["heading_2"]["text"][0]["plain_text"]
)
elif block_type == "heading_3":
result = "### " + markdownify.markdownify(
block["heading_3"]["text"][0]["plain_text"]
)
elif block_type == "bulleted_list_item":
if len(block["bulleted_list_item"]["rich_text"]) != 0:
result = "* " + markdownify.markdownify(
block["bulleted_list_item"]["rich_text"][0]["plain_text"]
)
else:
result = "* "

elif block_type == "numbered_list_item":
result = "1. " + markdownify.markdownify(
block["numbered_list_item"]["rich_text"][0]["plain_text"]
)
elif block_type == "to_do":
checked = "x" if block["to_do"]["checked"] else " "
result = f"- [{checked}] " + markdownify.markdownify(
block["to_do"]["rich_text"][0]["plain_text"]
)
elif block_type == "toggle":
result = "> " + markdownify.markdownify(
block["toggle"]["rich_text"][0]["plain_text"]
)

elif block_type == "quote":
result = "> " + markdownify.markdownify(
block["quote"]["rich_text"][0]["plain_text"]
)
elif block_type == "code":
result = (
"```"
+ block["code"]["language"]
+ "\n"
+ markdownify.markdownify(block["code"]["rich_text"][0]["plain_text"])
+ "\n```"
)

elif block_type == "callout":
result = "> " + markdownify.markdownify(
block["callout"]["rich_text"][0]["plain_text"]
)
else:
result = markdownify.markdownify(
block[block_type]["rich_text"][0]["plain_text"]
)

return result

def download_file(self, credentials: Dict, file: SyncFile) -> BytesIO:
"""
Download a Notion page as a Markdown file.

Args:
credentials (Dict): The credentials for accessing Notion.
file (SyncFile): The file (page) to be downloaded.

Returns:
BytesIO: The downloaded file content in Markdown format.
"""

if not self.notion:
self.notion = self.link_notion(credentials)

try:

def retrieve_page_content(page_id) -> List[str]:
# Retrieve the page content
blocks = self.notion.blocks.children.list(page_id)["results"]
if not blocks:
raise Exception("Page does not exists")

# Convert blocks to Markdown
markdown_content = []
for block in blocks:
markdown_content.append(self.get_block_content(block))
if block["has_children"]:
sub_elements = [
f"\t{content}"
for content in retrieve_page_content(block["id"])
]
markdown_content.extend(sub_elements)
return markdown_content

markdown_content = retrieve_page_content(file.id)
# Join all markdown content
markdown_text = "\n\n".join(markdown_content)

print(markdown_text)

# Convert to BytesIO
markdown_bytes = BytesIO(markdown_text.encode("utf-8"))

return markdown_bytes

except Exception as e:
logger.error(
"Error downloading Notion file (page) with ID %s: %s", file.id, e
)
raise Exception("Failed to download file")