Transcript for Episode 1 #15
Comments
Wow, this is super cool!
Current progress: I wrote a script to parse the JSON and convert it into readable Markdown. See the result in this Gist. The numbers above each paragraph are timecodes (measured in seconds from the start of the recording). That was the easy part. The hard part is making the data editable (for manual fixes) while keeping the timecodes and other metadata intact.
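A minimal sketch of that Markdown step, written in Python rather than the Swift of the actual script (which isn't shown in this thread). The `Paragraph` type, the bold speaker label, and the layout of the output are assumptions made for illustration, not the format of the Gist:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Paragraph:
    """One transcript paragraph (hypothetical type, not from the original script)."""
    start: float   # seconds from the start of the recording
    speaker: str   # e.g. "Speaker 1"
    text: str


def to_markdown(paragraphs: list[Paragraph]) -> str:
    """Render each paragraph with its start timecode (in seconds) on the line above."""
    blocks = []
    for p in paragraphs:
        # Timecode line first, then a blank line, then the spoken text.
        blocks.append(f"{p.start:.2f}\n\n**{p.speaker}:** {p.text}")
    return "\n\n".join(blocks) + "\n"
```

Calling `to_markdown([Paragraph(0.0, "Speaker 1", "Hi.")])` yields a timecode line followed by the paragraph. Once someone edits that Markdown by hand, nothing ties the text back to the per-word timecodes, which is exactly the hard part described above.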
Maybe you could add a case to the script that would remove entries like this:
This is amazing @ole! Do you think it's good enough to open a PR adding it to the repo?
Also, is there a way to automate this?
@calebkleveter Good call. That wouldn't be difficult to include in my code. I wanted to preserve the structure as the automatic transcription created it for this first attempt.
@garricn We could of course publish the unedited transcript as-is. It's arguably better than nothing, even with all the transcription errors. And we can always push manual edits later (and possibly in a piecemeal fashion; it takes a lot of time to edit an entire episode). I considered editing the portion of the transcript that I found most interesting to preserve for posterity (mostly @lattner's comments about the origins of Swift) and possibly post it to my own blog in addition to the entire transcript (which should definitely be hosted on the podcast's website). Would you be okay with that?
@garricn The steps up to the current state are automatable, yes. Like all AWS services, Amazon Transcribe has an API, and so do other potential services, I believe. (Google has one, and there are probably others. Apple ships a speech recognition API in the iOS and macOS SDKs, but if I remember correctly, it had some limitations on the length of the content you could transcribe in one go. I might be wrong about that.) In any case, if you have an AWS account, setting up a transcription job on Amazon Transcribe takes just a few clicks. Automating it would be worthwhile once we have a complete process in place, but it's not the most important step right now IMO.
I wrote Swift code to parse the JSON file produced by Amazon Transcribe and a simple function to output it as Markdown. I haven't published the code yet, but I can certainly do that. This is where we currently stand.
The next step is where it gets complicated: ideally, I'd like to be able to edit the transcript manually while preserving as much of the timecode information as possible. This means we can't just edit the Markdown source, because it would be pretty much impossible to reintegrate the edits with the timecodes (at least on a per-word basis; arguably, per-sentence or per-paragraph timecodes would be good enough). I plan to research whether there is existing transcription software (a desktop app or a web-based solution) that we could use for this. I don't know of any off the top of my head, but I have a hunch others must have solved the same problem. If anyone has suggestions, I'm all ears.
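For the automation question, here is a rough sketch of what starting a job through the API could look like, using the third-party `boto3` SDK. This is not how the episode 1 job was created (that was done in the console). The job name and the S3 URI in the usage note are hypothetical, and the audio is assumed to be an MP3 already uploaded to S3:

```python
from __future__ import annotations


def build_job_request(job_name: str, media_uri: str, speakers: int,
                      media_format: str = "mp3", language: str = "en-US") -> dict:
    """Assemble keyword arguments for Transcribe's StartTranscriptionJob call."""
    if speakers < 2:
        # Speaker identification needs at least two speakers to tell apart.
        raise ValueError("speakers must be >= 2")
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": media_format,
        "LanguageCode": language,
        "Settings": {"ShowSpeakerLabels": True, "MaxSpeakerLabels": speakers},
    }


def start_job(request: dict) -> dict:
    """Submit the job. Needs the boto3 package and configured AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)
```

Usage would be something like `start_job(build_job_request("episode-1", "s3://example-bucket/episode-1.mp3", speakers=3))`. The job runs asynchronously, so a real script would also have to poll `get_transcription_job` until it completes and then download the JSON from the transcript URI in that response.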
Yes, absolutely. It occurs to me that we don't have a license for the repo, but IMO it makes sense to use the Creative Commons Attribution license, which allows very permissive use and re-use of the contents. In terms of the transcript, I think it would be really great to post something relatively raw and then ask for contributors to help with the editing. GitHub is pretty good for collaboration :-) Thanks for driving this @ole!
Thank you for the shoutout in episode 2. I haven't forgotten this. I'll send a PR with the (unedited) transcript for episode 1 soon.
Hello 👋 @ole! This is really cool! I'm a Junior, but how can I help?
Hi there, same here! I would love to get involved in this (side-)project! 😊
@jonesandcode @JulianKahnert Great! I pushed my code for parsing the Amazon Transcribe JSON format to this repository: ole/transcribe. Feel free to have a look. The only functionality so far is parsing Amazon Transcribe files and outputting them in a (hardcoded) Markdown format. I still want to research a good existing file format for transcripts that would allow us to edit the transcript text while keeping speaker and timecode information (at least on a per-paragraph basis; the per-word timecodes that Amazon Transcribe produces are probably overkill).
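To make the per-paragraph idea concrete, here is a small Python sketch (not the code in ole/transcribe) that collapses per-word entries into one entry per speaker turn and keeps only the timecode of the first word. The `(start, speaker, word)` tuple shape and the rule "a new paragraph starts when the speaker changes" are assumptions for illustration:

```python
from __future__ import annotations


def group_by_speaker_turn(words: list[tuple[float, str, str]]) -> list[dict]:
    """Merge consecutive words by the same speaker into one paragraph.

    Each input tuple is (start_seconds, speaker, word). Each output dict keeps
    the start time of the paragraph's first word and drops the other timecodes.
    """
    paragraphs: list[dict] = []
    for start, speaker, word in words:
        if paragraphs and paragraphs[-1]["speaker"] == speaker:
            # Same speaker is still talking: extend the current paragraph.
            paragraphs[-1]["text"] += " " + word
        else:
            # Speaker changed (or first word): open a new paragraph.
            paragraphs.append({"start": start, "speaker": speaker, "text": word})
    return paragraphs
```

With only one timecode per paragraph, a manual edit to the text no longer has to be reconciled with word-level timing. The cost is that a player can only jump to the beginning of a speaker turn.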
@ole @jonesandcode What do you think about WebVTT? I have no experience with it, but it seems to be supported by Auphonic and the beta version of the Podlove Web Player. We can even see this in action (sorry for the German reference):
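For anyone who hasn't seen the format, here is a rough Python sketch of writing WebVTT cues. It only illustrates the file layout (a `WEBVTT` header, then cues with `HH:MM:SS.mmm --> HH:MM:SS.mmm` timing lines). Using a `<v Speaker>` voice tag for the speaker name is one option the format offers, not necessarily what any of the tools mentioned here emit, and the `(start, end, speaker, text)` tuple shape is an assumption:

```python
from __future__ import annotations


def vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def escape_cue_text(text: str) -> str:
    """Escape the characters that have special meaning inside cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def to_webvtt(cues: list[tuple[float, float, str, str]]) -> str:
    """Render (start, end, speaker, text) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        # The voice tag carries the speaker name; the name itself is not escaped here.
        lines.append(f"<v {speaker}>{escape_cue_text(text)}")
        lines.append("")  # a blank line ends the cue
    return "\n".join(lines)
```

Because each cue is plain text under its own timing line, the cue text can be edited by hand without touching the timecodes.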
@JulianKahnert I love it! I opened a separate issue in the other repo: ole/transcribe#2 Maybe it's better to discuss concrete next steps over there.
Does anyone have experience with the Auphonic Transcript Editor? It seems like a perfect fit for creating the transcript (e.g. via AWS) and editing it afterwards with an inline HTML editor (example). Since I have never used Auphonic, I don't know if we can use it collaboratively. Another option might be the "open source transcript editor":
https://auphonic.com/help/algorithms/speech_recognition.html#transcript-editor
It would be awesome if we could use the transcript editor, but I cannot find any documentation except for the two Transcript Editor examples. Am I missing something? 🤔
* Add unedited WebVTT transcript for episode 1
  The transcript has been generated with the Amazon Transcribe service (as discussed in #15) and converted to the WebVTT format with a Swift tool written by @ole (https://github.com/ole) and @JulianKahnert (https://github.com/JulianKahnert). This autogenerated transcript contains many transcription errors, but it gives us a good baseline for manual editing.
* Add some metadata and a "request for editing" to episode 1 transcript
* Edit episode 1 transcript from 00:00:00.000 to 00:05:38.497
* Edit episode 1 transcript from 00:05:38.887 to 00:09:46.529
* Edit episode 1 transcript from 00:59:15:213 to 01:07:26.240
* Edit episode 1 transcript from 00:50:18.774 to 00:59:15.093
* Small transcript edits
* Edit episode 1 transcript from 00:15:14.437 to 00:23:31.535
* Episode 1 transcript from 00:39:55.214 -> 00:50:18.494
* Edit episode 1 transcript from 00:09:51.039 to 00:15:13.397
* More transcript edits
* Edit episode 1 transcript from 00:23:32.785 to 00:30:32.178
* Edit episode 1 transcript from 00:30:32.658 to 00:39:54.794
* Edit episode 1 transcript header comment
Closing as this is done! 🎉 Sent with GitHawk
This is an exciting project. Great job on the first episode!
I ran the first episode through Amazon's Transcribe service. The result is a massive JSON file that includes not only the transcribed text but also timecodes and speaker identification (i.e. you tell the AWS Transcribe API how many speakers there were and it will try to distinguish them as "Speaker 1", "Speaker 2" and so on). Here's a screenshot of the AWS Transcribe console:
The transcription is obviously not perfect, but I think it's a good start, and editing the file by hand is probably much faster than typing everything out from scratch. I'm a big fan of transcripts because they make it possible to find things again later, but I also think a podcast transcript need not (and perhaps should not) mirror the spoken word precisely. Transcribed text is generally not very readable if it includes every "uh" and similar filler.
You can download the complete JSON file (5.6 MB). I ran it through a formatter and removed my AWS account ID; other than that, it's unchanged.
The bulk of the file is a huge array of recognized words with a per-word timecode and sometimes with word alternatives if the system wasn't certain. For example, this is what the first two words ("Welcome to") look like:
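As an illustration only, here is a Python sketch of reading that word array. It assumes the layout Amazon Transcribe documents for its output (`results.items`, with string-valued `start_time`, `end_time`, and per-alternative `confidence`). The timecodes, confidences, and the alternative "two" in `SAMPLE` are invented for the example and are not taken from the episode's file:

```python
from __future__ import annotations

import json

# Invented sample in the shape of Amazon Transcribe output (values are made up).
SAMPLE = """
{
  "results": {
    "items": [
      {"start_time": "0.04", "end_time": "0.45", "type": "pronunciation",
       "alternatives": [{"confidence": "0.98", "content": "Welcome"}]},
      {"start_time": "0.45", "end_time": "0.58", "type": "pronunciation",
       "alternatives": [{"confidence": "0.61", "content": "to"},
                        {"confidence": "0.39", "content": "two"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
"""


def read_words(transcribe_json: str) -> list[tuple[float, float, str]]:
    """Return (start, end, word) for every spoken word, using the best alternative."""
    data = json.loads(transcribe_json)
    words = []
    for item in data["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue  # punctuation items carry no timecodes
        # Pick the alternative with the highest confidence; treat missing as 0.
        best = max(item["alternatives"], key=lambda a: float(a.get("confidence") or 0))
        words.append((float(item["start_time"]), float(item["end_time"]), best["content"]))
    return words
```

`read_words(SAMPLE)` gives one `(start, end, word)` tuple per spoken word. Punctuation is skipped here for brevity; a real converter would attach it to the preceding word.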
I'm not sure how much time I can spend on editing the transcript and/or writing a script to process the whole thing into something that can be published on the web. If anyone would like to help, feel free to chime in.
Lastly, I'd like to mention the Podlove web player, a great (I think) open-source HTML5 audio player that can, among many other features, display transcripts and sync them to the audio, i.e. you can search the transcript for something, click on a search result, and the player will jump to that timecode in the audio. I think this or something like this would be a great addition to the web site — at least if transcripts become a regular thing that we create for each episode (making a good transcript is a lot of work, and somebody has to do it).