GitHub - johnshearing/scrape_yt_mk_transcripts: Scrape YouTube. Make transcripts. Collect metadata. Prepare LLM Training Data

Scrape a YouTube channel for audio.
Create a transcript with punctuation, diarization, timestamps, and metadata.
The transcripts are ingested by the LightRAG server, which is found at the following repository:
https://github.com/johnshearing/deep_avatar
The repository linked above is used to create question-and-answer pairs, which are used to train LLMs to emulate a human model.
See _Notes.txt for usage.
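
As a rough illustration of what a transcript with diarization, timestamps, and metadata might look like as data, here is a minimal sketch. It is not taken from this repository: the `Segment` and `Transcript` classes, the field names, and the output layout are all hypothetical, and the scripts here may store things differently. See _Notes.txt for the actual workflow.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One diarized utterance (hypothetical structure, not the repo's format)."""
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # diarization label, e.g. "SPEAKER_00"
    text: str      # punctuated transcript text


@dataclass
class Transcript:
    """Video-level metadata plus the ordered segments (hypothetical)."""
    video_id: str
    title: str
    channel: str
    segments: list = field(default_factory=list)


def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as HH:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(transcript: Transcript) -> str:
    """Produce a plain-text transcript with a metadata header."""
    lines = [
        f"title: {transcript.title}",
        f"channel: {transcript.channel}",
        f"video_id: {transcript.video_id}",
        "",
    ]
    for seg in transcript.segments:
        lines.append(f"[{format_timestamp(seg.start)}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)
```

Keeping the speaker label and start time on every line means each utterance stays attributable even when the text is later split into chunks for ingestion.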


Folders and files (15 commits):
archive
.gitignore
LICENSE
README.md
_Notes.txt
_blank.jpg
_merged08.py
_meta_only.py
_process_channel_videos02.py
_requirements.txt
_wav_to_mp4_03.py
