# Consuming media data

Other than just forwarding, we would like to be able to use the media right in the Elixir app, e.g.
as input to a machine learning model or to create a recording of a meeting.

In this tutorial, we are going to learn how to use received media as input for ML inference.

## From raw media to RTP

When the browser sends audio or video, it does the following things:

1. Captures the media from your peripheral devices, like a webcam or microphone.
2. Encodes the media, so it takes up less space and uses less network bandwidth.
3. Packs it into one or multiple RTP packets, depending on the media chunk (e.g., video frame) size.
4. Sends it to the other peer using WebRTC.

We have to reverse these steps in order to use the media:

1. We receive the media from WebRTC.
2. We unpack the encoded media from the RTP packets.
3. We decode the media to a raw format.
4. We use the media however we like.

We already know how to do step 1 from previous tutorials, and step 4 is completely up to the user, so let's go through steps 2 and 3 in the next sections.
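
As a reminder, step 1 boils down to handling messages from the `PeerConnection` process. A minimal sketch, using the same message shape as the snippets later in this tutorial:

```elixir
# incoming RTP packets arrive as messages in the process that created the PeerConnection
def handle_info({:ex_webrtc, _from, {:rtp, _track_id, _rid, packet}}, state) do
  # `packet` still carries encoded media, possibly just a fragment of a video frame;
  # steps 2 and 3 (depayloading and decoding) take it from here
  {:noreply, state}
end
```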

> #### Codecs {: .info}
> A media codec is a program/technique used to encode/decode digital video and audio streams. Codecs also compress the media data;
> otherwise, it would be too big to send over the network (the bitrate of raw 24-bit color depth, FullHD, 60 fps video is about 3 Gbit/s: 1920 * 1080 * 24 * 60 ≈ 2.99 Gbit/s!).
>
> In WebRTC, you will most likely encounter the VP8, H264, or AV1 video codecs and the Opus audio codec. Codecs used during the session are negotiated in
> the SDP offer/answer exchange. You can tell what codec is carried in an RTP packet by inspecting its payload type (the `payload_type` field in the case of Elixir WebRTC).
> This value should correspond to one of the codecs included in the SDP offer/answer.
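>
> For example, to match a packet's payload type against a negotiated codec, you can search the transceivers' `codecs` lists. A sketch, assuming `pc` is your `PeerConnection` and `packet` a received RTP packet:
> ```elixir
> # find the negotiated codec parameters whose payload type matches the packet's
> codec =
>   pc
>   |> ExWebRTC.PeerConnection.get_transceivers()
>   |> Enum.flat_map(& &1.codecs)
>   |> Enum.find(&(&1.payload_type == packet.payload_type))
>
> # codec.mime_type now tells us what the packet carries, e.g. "video/VP8"
> ```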

## Depayloading RTP

We refer to the process of getting the media payload out of RTP packets as _depayloading_. Usually, a single video frame is split into
multiple RTP packets, and in the case of audio, each packet carries roughly 20 milliseconds of sound. Fortunately, you don't have to worry about this;
just use one of the depayloaders provided by Elixir WebRTC (see the `ExWebRTC.RTP.<codec>` submodules). For instance, when receiving VP8 RTP packets, we could depayload
the video like this:

```elixir
def init(_) do
  # ...
  state = %{depayloader: ExWebRTC.RTP.VP8.Depayloader.new()}
  {:ok, state}
end

def handle_info({:ex_webrtc, _from, {:rtp, _track_id, nil, packet}}, state) do
  depayloader =
    case ExWebRTC.RTP.VP8.Depayloader.write(state.depayloader, packet) do
      {:ok, depayloader} ->
        # no complete frame yet, keep feeding packets
        depayloader

      {:ok, _frame, depayloader} ->
        # we collected a whole frame (it is just a binary)!
        # we will learn what to do with it in a moment
        depayloader
    end

  {:noreply, %{state | depayloader: depayloader}}
end
```

Every time we collect a whole video frame from a series of RTP packets, `VP8.Depayloader.write` returns it for further processing.

> #### Codec configuration {: .warning}
> By default, `ExWebRTC.PeerConnection` will use a set of default codecs when negotiating the connection. In that case, you have to either:
>
> * support depayloading/decoding for all of the negotiated codecs, or
> * force some specific set of codecs (or even a single codec) in the `PeerConnection` configuration.
>
> Of course, the second option is much simpler, but it increases the risk of failing the negotiation, as the other peer might not support your codec of choice.
> If you still want to do it the simple way, set the codecs in `PeerConnection.start_link`:
> ```elixir
> codec = %ExWebRTC.RTPCodecParameters{
>   payload_type: 96,
>   mime_type: "video/VP8",
>   clock_rate: 90_000
> }
> {:ok, pc} = ExWebRTC.PeerConnection.start_link(video_codecs: [codec])
> ```
> This way, you will either always send/receive VP8 video, or you won't be able to negotiate a video stream at all. At least you won't encounter
> unpleasant bugs in video decoding!

## Decoding the media to raw format

Before we use the video as input to the machine learning model, we need to decode it into a raw format. Video decoding and encoding are very
complex and resource-heavy processes, so we don't provide anything for that in Elixir WebRTC, but you can use the `xav` library, a simple wrapper over `ffmpeg`,
to decode the VP8 video. Let's modify the snippet from the previous section to do so.

```elixir
def init(_) do
  # ...
  # set up your machine learning model (e.g. using Bumblebee);
  # `create_serving/0` is a placeholder for your own setup code
  serving = create_serving()

  state = %{
    depayloader: ExWebRTC.RTP.VP8.Depayloader.new(),
    decoder: Xav.Decoder.new(:vp8),
    serving: serving
  }

  {:ok, state}
end

def handle_info({:ex_webrtc, _from, {:rtp, _track_id, nil, packet}}, state) do
  depayloader =
    with {:ok, frame, depayloader} <- ExWebRTC.RTP.VP8.Depayloader.write(state.depayloader, packet),
         {:ok, raw_frame} <- Xav.Decoder.decode(state.decoder, frame) do
      # a raw frame is just a 3D matrix with the shape of resolution x colors
      # (e.g. 1920 x 1080 x 3 for a FullHD RGB frame)
      # we can cast it to an Elixir Nx tensor and use it as the machine learning model input;
      # the machine learning part is out of scope of this tutorial, but check out Elixir Nx and friends
      tensor = Xav.Frame.to_nx(raw_frame)
      prediction = Nx.Serving.run(state.serving, tensor)
      # do something with the prediction

      depayloader
    else
      # no full frame collected yet
      {:ok, depayloader} -> depayloader
      # depayloading or decoding failed; handle the error and keep the current depayloader
      {:error, _err} -> state.depayloader
    end

  {:noreply, %{state | depayloader: depayloader}}
end
```

We decoded the video, used it as input to the machine learning model, and got back some kind of prediction. What you do with it is up to you!

> #### Jitter buffer {: .warning}
> Do you recall that WebRTC uses UDP under the hood, and UDP does not ensure packet ordering? We could ignore this fact when forwarding the packets (as
> it was not our job to decode/play/save the media), but now out-of-order packets can seriously mess up the decoding process.
> To remedy this issue, something called a _jitter buffer_ can be used. Its basic function
> is to delay/buffer incoming packets by some amount of time, let's say 100 milliseconds, waiting for packets that might be late. Only if a packet does not arrive within that
> additional 100 milliseconds do we count it as lost. To learn more about jitter buffers, read [this article](https://bloggeek.me/webrtcglossary/jitter-buffer/).
>
> As of now, Elixir WebRTC does not provide a jitter buffer, so you either have to build something yourself or hope that such issues won't occur, but if anything
> is wrong with the decoded video, this might be the problem.
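>
> If you wanted to experiment with building one yourself, a very naive reordering buffer could look like the sketch below. This is only an illustration (the module name is made up, a real jitter buffer is time-based, and this one ignores sequence number rollover); it assumes the packet struct exposes a `sequence_number` field, as RTP packets do.
> ```elixir
> defmodule NaiveReorderBuffer do
>   # collect packets until the window is full, then release them all,
>   # sorted by their RTP sequence numbers
>   @window 16
>
>   def new(), do: []
>
>   # returns {packets_ready_to_process, new_buffer_state}
>   def insert(buffer, packet) when length(buffer) + 1 < @window do
>     {[], [packet | buffer]}
>   end
>
>   def insert(buffer, packet) do
>     {Enum.sort_by([packet | buffer], & &1.sequence_number), []}
>   end
> end
> ```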

This tutorial shows, more or less, what the [Recognizer](https://github.com/elixir-webrtc/apps/tree/master/recognizer) app does. Check it out, along with the other
example apps in the [apps](https://github.com/elixir-webrtc/apps) repository; they are a great reference on how to implement fully-fledged apps based on Elixir WebRTC.