-
Hi,
In the case of a "WebRTC server" (although it's a very broad term, so let's assume an SFU), we usually only forward media from one peer to another, so there's no need to decode and re-encode video, which is a computationally intensive task. Correct me if I misunderstood you; I'll try to answer any further questions :)
-
It is "it needs the preceding packets to rebuild a frame (== image)". Sorry for the late response and the awful vocabulary. The RTP header has a marker that represents an event in the stream, such as the end of a frame. I want to play with the frames (Evision or Vix), so I need to decode them (probably only 1 out of every n). Now, you de-VP8 it (or at least that's the function's name), but what if it is not VP8?

```elixir
Regex.scan(~r/a=rtpmap:([a-zA-Z0-9\s]*)/, PeerConnection.get_remote_description(pc).sdp)
[
  ["a=rtpmap:96 H264", "96 H264"],
  ["a=rtpmap:97 rtx", "97 rtx"],
  ["a=rtpmap:98 H264", "98 H264"],
  ["a=rtpmap:99 rtx", "99 rtx"],
  ["a=rtpmap:100 H264", "100 H264"],
  ["a=rtpmap:101 rtx", "101 rtx"],
  ["a=rtpmap:102 H264", "102 H264"],
  ["a=rtpmap:103 rtx", "103 rtx"],
  ["a=rtpmap:104 VP8", "104 VP8"],
  ["a=rtpmap:105 rtx", "105 rtx"],
  ["a=rtpmap:106 VP9", "106 VP9"]
  ....
]
```

So H.264 is the first, thus preferred (?) codec, not VP8?

```elixir
def handle_info(
      {:ex_webrtc, pc, {:rtp, client_track_id, packet}},
      %{client_video_track: %{id: client_track_id, kind: :video}} = state
    ) do
  PeerConnection.send_rtp(pc, state.serv_video_track.id, packet)
  state = handle_paquet(packet, state)
  {:noreply, state}
end
```
```elixir
defp handle_paquet(packet, state) do
  case VP8Depayloader.write(state.video_depayloader, packet) do
    {:ok, d} ->
      %{state | video_depayloader: d}

    {:ok, _frame, d} ->
      # do something with the frame, or with every n-th frame...
      %{state | video_depayloader: d}

    _ ->
      state
  end
end
```

For posterity, the RTP packet has the following format:

```elixir
<<version::2, padding::1, extension::1, cc::4, marker::1, payload_type::7,
  sequence_number::16, timestamp::32, ssrc::32, payload::binary>> = packet
```

So when `marker == 1`, you have received the last fragment of the current frame, and you get a frame by joining the payloads of the current buffer.

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

You have this marker bit, described in RFC 3550:

> marker (M): 1 bit
>     The interpretation of the marker is defined by a profile. It is
>     intended to allow significant events such as frame boundaries to
>     be marked in the packet stream. A profile MAY define additional
>     marker bits or specify that there is no marker bit by changing the
>     number of bits in the payload type field (see [Section 5.3](https://datatracker.ietf.org/doc/html/rfc3550#section-5.3)).
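A minimal sketch of that marker-based reassembly (hypothetical module and names, not the ExWebRTC API; a real depayloader such as `VP8Depayloader` also strips the codec-specific payload descriptor from each packet and handles reordering and loss, all of which this ignores):

```elixir
defmodule NaiveFrameBuffer do
  # Accumulate RTP payloads until a packet with the marker bit set
  # arrives, then emit the frame built by joining the buffered payloads.
  # Assumes cc == 0 and extension == 0, so the payload starts right
  # after the fixed 12-byte header.
  def push(buffer, packet) do
    <<_v::2, _p::1, _x::1, _cc::4, marker::1, _pt::7, _seq::16,
      _ts::32, _ssrc::32, payload::binary>> = packet

    buffer = [buffer, payload]

    if marker == 1 do
      {:frame, IO.iodata_to_binary(buffer), []}
    else
      {:buffering, buffer}
    end
  end
end
```

Starting from `buffer = []`, you would call `push/2` on every packet of a track; a `{:frame, frame, new_buffer}` return gives you one complete frame's worth of payload bytes.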
-
https://github.com/elixir-webrtc/apps/blob/b1e182a7cc5ab30dba6c19993ed8085339753fc8/reco/lib/reco/room.ex#L113
I assume every (video) RTP packet that ExWebRTC receives is a video frame encoded/encapsulated in VP8, which means it needs some preceding frames to rebuild an image from it.
In the Reco example, you do `{:ok, frame, d} = VP8Depayloader.write(...)`, then you `Xav.decode(frame)` to get an Nx tensor. Since one big interest of using a WebRTC server is being able to transform the images back, I have a question probably as old as WebRTC: how do people deal with this (in the Elixir/Erlang world)?
Note: I tried to feed FFmpeg a "depayloaded" frame, which I believe reassembles the payload chunks, but this fails as FFmpeg expects a stream, not a single frame.
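One common workaround for the FFmpeg point above is to wrap raw VP8 frames in a minimal IVF container, which stream-oriented tools can read. A sketch, assuming the standard IVF field layout; the module name and the width/height/fps parameters are placeholders you would fill from the real stream:

```elixir
defmodule IVFWriter do
  # Writes a 32-byte IVF file header followed by, per frame, a 12-byte
  # frame header (payload size + timestamp) and the raw VP8 frame bytes.
  # The frame index is used as the timestamp, with a timebase of 1/fps.
  def write!(path, frames, width, height, fps) do
    header =
      <<"DKIF",                      # signature
        0::16-little,                # version
        32::16-little,               # header size
        "VP80",                      # codec FourCC
        width::16-little,
        height::16-little,
        fps::32-little,              # timebase denominator
        1::32-little,                # timebase numerator
        length(frames)::32-little,   # frame count
        0::32>>                      # unused

    body =
      frames
      |> Enum.with_index()
      |> Enum.map(fn {frame, i} ->
        [<<byte_size(frame)::32-little, i::64-little>>, frame]
      end)

    File.write!(path, [header | body])
  end
end
```

Something like `ffmpeg -i frames.ivf out.mp4` should then accept the file; alternatively, a single depayloaded frame can be decoded directly with Xav, as in the Reco example.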