[FE] Converts audio to markdown #147
base: main
… in openAI does not support temperature value between 0-1
- Formatting and var type fixes
- Handle image inputs
- Split on "```json" for robustness
- Add example to example_notebook.ipynb
create separate function to parse images using gemini
Pull request overview
This PR adds support for converting audio files to markdown using the Gemini API. The implementation routes audio files through the LLM parser and creates a new audio-specific parsing function that uploads audio files to Gemini and transcribes them into well-structured markdown.
Key Changes
- Added audio file type support to the file type checking and routing logic
- Implemented parse_audio_with_gemini() function with audio-specific prompt template
- Added example usage in the Colab notebook demonstrating audio transcription
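The routing change described above could look roughly like the sketch below. This is only an illustration: `is_audio_file`, `choose_parser`, and the returned labels are hypothetical names, not the identifiers used in lexoid/core/utils.py, and the MIME-type check is an assumed way of detecting audio files.

```python
import mimetypes


def is_audio_file(path: str) -> bool:
    """Return True when the MIME type guessed from the file name is an audio type."""
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type is not None and mime_type.startswith("audio/")


def choose_parser(path: str, api: str) -> str:
    """Hypothetical router: audio goes to the LLM parser, and only Gemini is accepted."""
    if is_audio_file(path):
        if api != "gemini":
            raise ValueError("Audio parsing is only supported with the Gemini API")
        return "LLM_PARSE"
    # Everything else falls through to a placeholder default in this sketch.
    return "STATIC_PARSE"


print(choose_parser("inputs/harvard.wav", api="gemini"))
```

Raising early for non-Gemini APIs keeps the failure close to the call site instead of surfacing later as an upload error.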
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| lexoid/core/utils.py | Added audio file type to supported formats and routing logic to direct audio files to LLM parser |
| lexoid/core/prompt_templates.py | Added AUDIO_TO_MARKDOWN_PROMPT template with instructions for transcription and markdown formatting |
| lexoid/core/parse_type/llm_parser.py | Implemented audio parsing with Gemini, including validation to restrict audio to Gemini API and new parse_audio_with_gemini() function |
| examples/example_notebook_colab.ipynb | Added example demonstrating audio file parsing with output |
lexoid/core/parse_type/llm_parser.py
Outdated
| audio_file = client.files.upload(file=path) | ||
| system_prompt = kwargs.get("system_prompt", None) | ||
| if system_prompt == "" or system_prompt is None: | ||
| system_prompt = AUDIO_TO_MARKDOWN_PROMPT + "Audo file name is: {path}\n" |
Copilot
AI
Nov 25, 2025
The string formatting placeholder {path} is not being replaced with the actual path value. This line should use f-string formatting: system_prompt = AUDIO_TO_MARKDOWN_PROMPT + f"Audio file name is: {path}\n" (note the f prefix and corrected spelling of "Audio").
| system_prompt = AUDIO_TO_MARKDOWN_PROMPT + "Audo file name is: {path}\n" | |
| system_prompt = AUDIO_TO_MARKDOWN_PROMPT + f"Audio file name is: {path}\n" |
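To make the difference concrete, here is a small standalone sketch comparing the plain string with the f-string. The path value is just the example file from the notebook; nothing here depends on the rest of the parser.

```python
path = "inputs/harvard.wav"

# Without the f prefix, the braces are kept literally.
plain = "Audio file name is: {path}\n"

# With the f prefix, {path} is replaced by the variable's value.
formatted = f"Audio file name is: {path}\n"

print(repr(plain))
print(repr(formatted))
```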
| import requests | ||
| import torch | ||
| from anthropic import Anthropic | ||
| from google import genai |
Copilot
AI
Nov 25, 2025
The code imports from google import genai and uses genai.Client(), which appears to be from the google-genai package. However, the pyproject.toml file specifies google-generativeai (a different package) as the dependency. According to the PR description TODO, google-genai needs to be added via poetry. Either add the correct package to pyproject.toml, or update the code to use the existing google-generativeai package API.
| from google import genai | |
| import google.generativeai as genai |
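One way to see which of the two distributions is present in an environment is to query the installed package metadata. This is a minimal sketch; `installed_version` is a hypothetical helper and not part of the PR.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report which of the two Gemini SDK distributions are available.
for name in ("google-genai", "google-generativeai"):
    print(name, "->", installed_version(name) or "not installed")
```

Note that the two packages expose different APIs, so changing only the import line would most likely not be enough if the code keeps calling genai.Client().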
| "source": [ | ||
| "from lexoid.api import parse\n", | ||
| "\n", | ||
| "document_path =\"inputs\\harvard.wav\"\n", |
Copilot
AI
Nov 25, 2025
The path string contains a single backslash followed by h. Python treats \h as an invalid escape sequence and emits a SyntaxWarning (visible in the output at lines 1988-1991). Use either a raw string (r"inputs\harvard.wav") or forward slashes ("inputs/harvard.wav") to avoid this warning.
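The options can be compared in a short standalone sketch. The pathlib variant is an extra suggestion that is not in the original comment; forward slashes are probably the safer choice if the notebook runs on a Linux host, where a backslash is not a path separator.

```python
from pathlib import PurePosixPath, PureWindowsPath

# A raw string keeps the backslash literally and avoids the escape warning.
raw_path = r"inputs\harvard.wav"

# A doubled backslash produces the same string.
escaped_path = "inputs\\harvard.wav"

# Forward slashes need no escaping at all.
forward_path = "inputs/harvard.wav"

# Building the path from parts sidesteps separators entirely.
joined = PurePosixPath("inputs") / "harvard.wav"

print(raw_path == escaped_path)
print(PureWindowsPath(raw_path).as_posix())
print(joined)
```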
| "from lexoid.api import parse\n", | ||
| "\n", | ||
| "document_path =\"inputs\\harvard.wav\"\n", | ||
| "parsed_md = parse(document_path, \"AUTO\",api=\"gemini\")[\"raw\"]\n", |
Copilot
AI
Nov 25, 2025
Missing space after comma in the function call. Should be: parse(document_path, "AUTO", api="gemini")
| def parse_audio_with_gemini(path: str, **kwargs) -> Dict: | ||
| client = genai.Client() | ||
| audio_file = client.files.upload(file=path) | ||
| system_prompt = kwargs.get("system_prompt", None) | ||
| if system_prompt == "" or system_prompt is None: | ||
| system_prompt = AUDIO_TO_MARKDOWN_PROMPT + "Audo file name is: {path}\n" | ||
| | ||
| response = client.models.generate_content( | ||
| model=kwargs["model"], contents=[system_prompt, audio_file] | ||
| ) | ||
| | ||
| return { | ||
| "raw": response.text, | ||
| "segments": [ | ||
| { | ||
| "metadata": {"page": 0}, | ||
| "content": response.text, | ||
| } | ||
| ], | ||
| "title": kwargs.get("title", ""), | ||
| "url": kwargs.get("url", ""), | ||
| "parent_title": kwargs.get("parent_title", ""), | ||
| "recursive_docs": [], | ||
| "token_usage": { | ||
| "input": response.usage_metadata.prompt_token_count, | ||
| "output": response.usage_metadata.candidates_token_count, | ||
| "total": ( | ||
| response.usage_metadata.prompt_token_count | ||
| + response.usage_metadata.candidates_token_count | ||
| ), | ||
| }, | ||
| } |
Copilot
AI
Nov 25, 2025
The new audio parsing functionality lacks test coverage. Consider adding a test case similar to the existing test_llm_parse and test_jpg_parse functions to verify audio file parsing works correctly with the Gemini API. This would help ensure the feature works as expected and prevent regressions.
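A test along these lines could run without network access by replacing the Gemini client with a mock. The sketch below is an assumption-heavy illustration: `parse_audio_with_client` is a simplified stand-in that takes the client as an argument (the function in the PR creates it internally), the prompt constant is a placeholder, and the model name is made up.

```python
from types import SimpleNamespace
from typing import Dict
from unittest.mock import MagicMock

# Placeholder prompt; the real template lives in lexoid/core/prompt_templates.py.
AUDIO_TO_MARKDOWN_PROMPT = "Transcribe the audio into markdown.\n"


def parse_audio_with_client(client, path: str, **kwargs) -> Dict:
    """Simplified stand-in for parse_audio_with_gemini with the client injected."""
    audio_file = client.files.upload(file=path)
    system_prompt = kwargs.get("system_prompt") or (
        AUDIO_TO_MARKDOWN_PROMPT + f"Audio file name is: {path}\n"
    )
    response = client.models.generate_content(
        model=kwargs["model"], contents=[system_prompt, audio_file]
    )
    usage = response.usage_metadata
    return {
        "raw": response.text,
        "segments": [{"metadata": {"page": 0}, "content": response.text}],
        "token_usage": {
            "input": usage.prompt_token_count,
            "output": usage.candidates_token_count,
            "total": usage.prompt_token_count + usage.candidates_token_count,
        },
    }


def make_fake_client(text: str, prompt_tokens: int, output_tokens: int) -> MagicMock:
    """Build a mock client whose generate_content returns a canned response."""
    client = MagicMock()
    client.models.generate_content.return_value = SimpleNamespace(
        text=text,
        usage_metadata=SimpleNamespace(
            prompt_token_count=prompt_tokens,
            candidates_token_count=output_tokens,
        ),
    )
    return client


def test_audio_parse() -> None:
    client = make_fake_client("# Transcript\n\nHello.", 12, 5)
    result = parse_audio_with_client(
        client, "inputs/harvard.wav", model="hypothetical-model-name"
    )
    assert result["raw"] == "# Transcript\n\nHello."
    assert result["segments"][0]["content"] == result["raw"]
    assert result["token_usage"]["total"] == 17
    client.files.upload.assert_called_once_with(file="inputs/harvard.wav")
    # The default prompt should mention the file name.
    prompt = client.models.generate_content.call_args.kwargs["contents"][0]
    assert "inputs/harvard.wav" in prompt


test_audio_parse()
```

Injecting the client is one design option for testability; patching genai.Client in the module under test would be an alternative that leaves the function signature unchanged.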
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
closes #148
Example Usage:
TODOs:
- Add google-genai via poetry