MediaStreamTrack Transform

Unofficial Proposal Draft

This version:
https://jan-ivar.github.io/mediacapture-transform/
Feedback:
public-webrtc@w3.org with subject line “[mediacapture-transform] … message topic …” (archives)
Issue Tracking:
GitHub
Editor:
Jan-Ivar Bruaroey (Mozilla)

Abstract

This document is based on proposal #59 by Jan-Ivar Bruaroey, with help from Youenn Fablet, written as a prospective PR against a fork of mediacapture-transform, and thus builds on text from other authors as well. It defines an API surface for manipulating the bits on MediaStreamTracks carrying raw data.

Status of this document

1. Introduction

The [WEBRTC-NV-USE-CASES] document describes several functions that can only be achieved by access to media (requirements N20-N22), including, but not limited to:

These use cases further require that processing can be done in worker threads (requirements N23-N24).

This specification gives an interface based on [WEBCODECS] and [STREAMS] to provide access to such functionality.

This specification provides access to raw video, which is the output of a media source such as a camera, screen capture, or the decoder part of a codec, and the input to the encoder part of a codec. The processed media can be consumed by any destination that can take a MediaStreamTrack, including HTML <video> and <audio> tags, RTCPeerConnection, canvas, or MediaRecorder.

This specification explicitly aims to support the following use cases:

2. Specification

This specification shows the IDL extensions for [MEDIACAPTURE-STREAMS].

The API consists of two elements. One is a ReadableStream, exposed on a track, that provides access to the unencoded video frames from the track. The other is the inverse: a track source that takes video frames as input.

2.1. MediaStreamTrack readable attribute

This specification adds a readable ReadableStream attribute to MediaStreamTrack in the dedicated worker environment. If the MediaStreamTrack is a video track, consumers may read VideoFrames from it. The MediaStreamTrack is the underlying source of its readable member.

Once the ReadableStream becomes locked, the MediaStreamTrack must, in its role as the underlying source, maintain a circular queue to buffer video frames coming from the track’s source. This buffering allows the MediaStreamTrack to temporarily hold frames waiting to be read from its associated ReadableStream. The depth of this underlying queue is decided by the UA and can change dynamically.

When a new frame arrives, if the queue is full, the oldest frame will be removed from the queue, and the new frame will be added to the queue. This means that for the particular case of a queue with a maximum depth of 1, if there is a queued frame, it will always be the most recent one.

The UA is also free to remove any frames from the queue at any time, for example to save resources or to improve performance in specific situations. In all cases, frames that are not dropped must be made available to the ReadableStream in the order in which they arrive at the MediaStreamTrack.

The readable attribute is only exposed in the DedicatedWorker context.
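
For illustration, here is a minimal, non-normative sketch of consuming readable in a dedicated worker, assuming a video MediaStreamTrack has been transferred in via postMessage:

// worker.js

self.onmessage = async ({data: {track}}) => {
  const reader = track.readable.getReader();
  while (true) {
    const {value: frame, done} = await reader.read();
    if (done) break;  // the track ended and its readable was closed
    // ... inspect or process the VideoFrame here ...
    frame.close();    // release the frame's media resources promptly
  }
};

Closing each frame promptly matters, since frames may be backed by scarce resources such as GPU memory (see § 5).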

2.1.1. Interface definition

partial interface MediaStreamTrack {
  [Exposed=DedicatedWorker] readonly attribute ReadableStream readable;
};

2.1.2. Internal slots

[[queue]]
A queue used to buffer video frames not yet read by the application.
[[numPendingReads]]
An integer whose value represents the number of read requests issued by the application that have not yet been handled.
[[isClosed]]
A boolean whose value indicates whether the MediaStreamTrack is closed in its role as underlying source of its readable member.

2.1.3. Track creation

At construction of MediaStreamTrack, run the following steps:
  1. Set this.[[queue]] to an empty Queue.

  2. Set this.[[numPendingReads]] to 0.

  3. Set this.[[isClosed]] to false.

2.1.4. Attributes

readable
Allows reading the frames delivered to the MediaStreamTrack. This attribute is initialized the first time it is accessed, according to the following steps:
  1. Initialize this.readable to be a new ReadableStream.

  2. Set up this.readable with its pullAlgorithm set to trackPull with this as parameter, cancelAlgorithm set to trackCancel with this as parameter, and highWaterMark set to 0.

The trackPull algorithm is given a track as input. It is defined by the following steps:

  1. Increment the value of the track.[[numPendingReads]] by 1.

  2. Queue a task to run the maybeReadFrame algorithm with track as parameter.

  3. Return a promise resolved with undefined.

The maybeReadFrame algorithm is given a track as input. It is defined by the following steps:

  1. If track.[[queue]] is empty, abort these steps.

  2. If track.[[numPendingReads]] equals zero, abort these steps.

  3. Dequeue a frame from track.[[queue]] and enqueue it in track.readable.

  4. Decrement track.[[numPendingReads]] by 1.

  5. Go to step 1.

The trackCancel algorithm is given a track as input. It is defined by running the following steps:

  1. Run the trackClose algorithm with track as parameter.

  2. Return a promise resolved with undefined.

The trackClose algorithm is given a track as input. It is defined by running the following steps:

  1. If track.[[isClosed]] is true, abort these steps.

  2. Close track.readable.[[controller]].

  3. Empty track.[[queue]].

  4. Set track.[[isClosed]] to true.
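
The following non-normative JavaScript sketch shows how these algorithms fit together, modeled with the ReadableStream constructor. The onFrameFromSource and onended hooks and the maxDepth parameter are hypothetical stand-ins for the UA-internal frame delivery and the UA-chosen queue depth:

function makeTrackReadable(track, maxDepth) {
  const queue = [];  // models [[queue]]
  let pending = 0;   // models [[numPendingReads]]
  let controller;
  const maybeReadFrame = () => {  // models maybeReadFrame
    while (queue.length > 0 && pending > 0) {
      controller.enqueue(queue.shift());
      pending--;
    }
  };
  track.onFrameFromSource = frame => {  // models handleNewFrame
    if (queue.length >= maxDepth) queue.shift();  // drop the oldest frame
    queue.push(frame);
    queueMicrotask(maybeReadFrame);
  };
  track.onended = () => { queue.length = 0; controller.close(); };  // models trackClose
  return new ReadableStream({
    start(c) { controller = c; },
    pull() { pending++; queueMicrotask(maybeReadFrame); },  // models trackPull
    cancel() { queue.length = 0; }  // models trackCancel
  }, {highWaterMark: 0});
}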

2.1.5. Handling interaction with the track

When a MediaStreamTrack track receives a video frame, the UA MUST execute the handleNewFrame algorithm with track as parameter.

The handleNewFrame algorithm is given a track as input. It is defined by running the following steps:

  1. If track.[[queue]] is full, as determined by the UA, dequeue an item from track.[[queue]].

  2. Enqueue the new frame in track.[[queue]].

  3. Queue a task to run the maybeReadFrame algorithm with track as parameter.

At any time, the UA MAY remove any frame from track.[[queue]]. The UA may decide to remove frames from track.[[queue]], for example, to prevent resource exhaustion or to improve performance in certain situations.

The application may detect that frames have been dropped by noticing that there is a gap in the timestamps of the frames.
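
A non-normative sketch of such detection, assuming the source delivers frames at a nominally constant rate (here ~30 fps, i.e. 33,333 µs between timestamps):

const expectedDeltaUs = 33333;  // assumed nominal frame spacing at ~30 fps
let lastTimestamp = null;
for await (const frame of track.readable) {
  if (lastTimestamp !== null &&
      frame.timestamp - lastTimestamp > 1.5 * expectedDeltaUs) {
    // one or more frames were dropped between the two timestamps
  }
  lastTimestamp = frame.timestamp;
  frame.close();
}

(A getReader() loop can be substituted where async iteration of a ReadableStream is unavailable.)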

When a MediaStreamTrack track ends, the trackClose algorithm must be executed with track as parameter.

2.2. VideoTrackSource

A VideoTrackSource allows the creation of a video source for a MediaStreamTrack in the MediaStream model in the dedicated worker environment. It has two readonly attributes: a writable WritableStream and a track MediaStreamTrack.

The VideoTrackSource is the underlying sink of its writable attribute. The track attribute is the output. Further tracks connected to the same VideoTrackSource can be created using the clone method on the track attribute.

The WritableStream accepts VideoFrame objects. When a VideoFrame is written to writable, the frame’s close() method is automatically invoked, so that its internal resources are no longer accessible from JavaScript.

A VideoTrackSource object only exists in the DedicatedWorker context. However, the MediaStreamTrack(s) it sources may be transferred to other contexts.
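
For illustration, here is a non-normative sketch of a worker that synthesizes frames into a VideoTrackSource using an OffscreenCanvas; the setInterval-based timing is deliberately simplified:

// worker.js

const source = new VideoTrackSource();
self.postMessage({track: source.track}, [source.track]);

const canvas = new OffscreenCanvas(640, 480);
const ctx = canvas.getContext('2d');
const writer = source.writable.getWriter();
let timestamp = 0;  // microseconds
setInterval(async () => {
  ctx.fillStyle = `hsl(${(timestamp / 1e4) % 360}, 80%, 50%)`;
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  // No frame.close() needed: writing to writable closes the frame.
  await writer.write(new VideoFrame(canvas, {timestamp}));
  timestamp += 33333;  // ~30 fps
}, 33);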

2.2.1. Interface definition

[Exposed=DedicatedWorker]
interface VideoTrackSource {
  constructor();
  readonly attribute WritableStream writable;
  attribute boolean muted;
  readonly attribute MediaStreamTrack track;
};

2.2.2. Internal slots

[[track]]
The MediaStreamTrack output of this source
[[isMuted]]
A boolean whose value indicates whether this source, and all the MediaStreamTracks it sources, are currently muted.

2.2.3. Constructor

VideoTrackSource()
  1. Let source be a new VideoTrackSource object.

  2. Let track be a new MediaStreamTrack object, whose kind is "video", and whose id is a new unique id generated by the user agent.

  3. Initialize source.[[track]] to track.

  4. Return source.

2.2.4. Attributes

writable, of type WritableStream, readonly
Allows writing video frames to the VideoTrackSource. When this attribute is accessed for the first time, it MUST be initialized with the following steps:
  1. Initialize this.writable to be a new WritableStream.

  2. Set up this.writable, with its writeAlgorithm set to writeFrame with this as parameter, with closeAlgorithm set to closeWritable with this as parameter and abortAlgorithm set to closeWritable with this as parameter.

The writeFrame algorithm is given a source and a frame as input. It is defined by running the following steps:

  1. If frame is not a VideoFrame object, return a promise rejected with a TypeError.

  2. If source.[[isMuted]] is false, send the media data backing frame to all live tracks sourced from source.

  3. Invoke the close method of frame.

  4. Return a promise resolved with undefined.

When the media data is sent to a track, the UA may apply processing (e.g., cropping and downscaling) to ensure that the media data sent to the track satisfies the track’s constraints. Each track may receive a different version of the media data depending on its constraints.

The closeWritable algorithm is given a source as input. It is defined by running the following steps.

  1. For each track t sourced from source, end t.

  2. Return a promise resolved with undefined.

muted, of type boolean
Mutes the VideoTrackSource. The getter steps are to return this.[[isMuted]]. The setter steps, given a value newValue, are as follows:
  1. If newValue is equal to this.[[isMuted]], abort these steps.

  2. Set this.[[isMuted]] to newValue.

  3. Unless one has been queued already this run of the event loop, queue a task to run the following steps:

    1. Let settledValue be this.[[isMuted]].

    2. For each live track sourced by this, queue a task to set the track’s muted state to settledValue.

track, of type MediaStreamTrack, readonly
The MediaStreamTrack output. The getter steps are to return this.[[track]].
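
A short non-normative sketch of these attributes in use:

const source = new VideoTrackSource();
self.postMessage({track: source.track}, [source.track]);

source.muted = true;   // frames written while muted are closed but not delivered,
                       // and tracks sourced from source become muted
// ... later ...
source.muted = false;  // delivery resumes and the tracks unmute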

2.2.5. Specialization of MediaStreamTrack behavior

A VideoTrackSource acts as the source for one or more MediaStreamTracks. This section adds clarifications on how a MediaStreamTrack sourced from a VideoTrackSource behaves.
2.2.5.1. stop
The stop method stops the track. When the last track sourced from a VideoTrackSource ends, that VideoTrackSource's writable is closed.
2.2.5.2. Constrainable properties

The following constrainable properties are defined for MediaStreamTracks sourced from a VideoTrackSource:

width, of type ConstrainULong
As a setting, this is the width, in pixels, of the latest frame received by the track. As a capability, max MUST reflect the largest width a VideoFrame may have, and min MUST reflect the smallest width a VideoFrame may have.

height, of type ConstrainULong
As a setting, this is the height, in pixels, of the latest frame received by the track. As a capability, max MUST reflect the largest height a VideoFrame may have, and min MUST reflect the smallest height a VideoFrame may have.

frameRate, of type ConstrainDouble
As a setting, this is an estimate of the frame rate based on frames recently received by the track. As a capability, min MUST be zero and max MUST be the maximum frame rate supported by the system.

aspectRatio, of type ConstrainDouble
As a setting, this is the aspect ratio of the latest frame delivered by the track; this is the width in pixels divided by the height in pixels, as a double rounded to the tenth decimal place. As a capability, min MUST be the smallest aspect ratio supported by a VideoFrame, and max MUST be the largest aspect ratio supported by a VideoFrame.

resizeMode, of type ConstrainDOMString
As a setting, this string should be one of the members of VideoResizeModeEnum. The value "none" means that the frames output by the MediaStreamTrack are unmodified versions of the frames written to the writable backing the track, regardless of any constraints. The value "crop-and-scale" means that the frames output by the MediaStreamTrack may be cropped and/or downscaled versions of the source frames, based on the values of the width, height and aspectRatio constraints of the track. As a capability, the values "none" and "crop-and-scale" both MUST be present.

The applyConstraints method applied to a video MediaStreamTrack sourced from a VideoTrackSource supports the properties defined above. It can be used, for example, to resize frames or adjust the frame rate of the track. Note that these constraints have no effect on the VideoFrame objects written to the writable of a VideoTrackSource, only on the output of the track on which the constraints have been applied.

Note also that, since a VideoTrackSource can in principle produce media data with any setting for the supported constrainable properties, an applyConstraints call on a track backed by a VideoTrackSource will generally not fail with OverconstrainedError unless the given constraints are outside the system-supported range, as reported by getCapabilities.
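
For example, a clone of the output track can be constrained to produce a small, low-rate preview without affecting the full-size frames on the original track (a non-normative sketch):

const source = new VideoTrackSource();
const preview = source.track.clone();
await preview.applyConstraints({
  width: 160,
  height: 90,    // downscaled via "crop-and-scale"
  frameRate: 5,  // decimated for a thumbnail preview
});
// source.track still carries the unmodified frames written to source.writable.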

2.2.5.3. Events and attributes
Events and attributes work the same as for any MediaStreamTrack. It is relevant to note that if the writable stream of a VideoTrackSource is closed, all the live tracks connected to it are ended and the ended event is fired on them.

3. Examples

3.1. Video Processing

Consider a face recognition function detectFace(videoFrame) that returns a face position (in some format), and a manipulation function blurBackground(videoFrame, facePosition) that returns a new VideoFrame similar to the given videoFrame, but with the non-face parts blurred. The example also displays the video before and after the effect, using two video elements.
// main.js

const stream = await navigator.mediaDevices.getUserMedia({video:true});
const videoBefore = document.getElementById('video-before');
const videoAfter = document.getElementById('video-after');
videoBefore.srcObject = stream.clone();

const [track] = stream.getVideoTracks();
const worker = new Worker('worker.js');
worker.postMessage({track}, [track]);

const {data} = await new Promise(r => worker.onmessage = r);
videoAfter.srcObject = new MediaStream([data.track]);

// worker.js

self.onmessage = async ({data: {track}}) => {
  const source = new VideoTrackSource();
  self.postMessage({track: source.track}, [source.track]);

  const transformer = new TransformStream({
    async transform(frame, controller) {
      const facePosition = await detectFace(frame);
      const newFrame = blurBackground(frame, facePosition);
      frame.close();
      controller.enqueue(newFrame);
    }
  });
  await track.readable.pipeThrough(transformer).pipeTo(source.writable);
};

3.2. Multi-consumer post-processing with constraints

A common use case is to remove the background from live camera video fed into a video conference, with a live self-view showing the result. It’s desirable for the self-view to be smooth even if the frame rate used for actual sending may dip lower due to bandwidth constraints. This can be solved using clone() and applyConstraints(), without having to process twice.
// main.js

const stream = await navigator.mediaDevices.getUserMedia({video:true});
const [track] = stream.getVideoTracks();
const worker = new Worker('worker.js');
worker.postMessage({track}, [track]);

const {data} = await new Promise(r => worker.onmessage = r);
const selfView = document.getElementById('video-self');
selfView.srcObject = new MediaStream([data.track.clone()]); // 60 fps

await data.track.applyConstraints({width: 320, height: 200, frameRate: 30});
const pc = new RTCPeerConnection(config);
pc.addTrack(data.track); // 30 fps

// worker.js

self.onmessage = async ({data: {track}}) => {
  const source = new VideoTrackSource();
  self.postMessage({track: source.track}, [source.track]);

  const transformer = new TransformStream({transform: myRemoveBackgroundFromVideo});
  await track.readable.pipeThrough(transformer).pipeTo(source.writable);
};

3.3. Multi-consumer post-processing with constraints in a worker

Being able to show a higher frame-rate self-view is also relevant when sending video frames over WebTransport. The same technique above may be used here, except clone() and applyConstraints() happen in the worker.
// main.js

const stream = await navigator.mediaDevices.getUserMedia({video:true});
const [track] = stream.getVideoTracks();
const worker = new Worker('worker.js');
worker.postMessage({track}, [track]);

const {data} = await new Promise(r => worker.onmessage = r);
const selfView = document.getElementById('video-self');
selfView.srcObject = new MediaStream([data.track]); // 60 fps

// worker.js

self.onmessage = async ({data: {track}}) => {
  const source = new VideoTrackSource();
  const sendTrack = source.track.clone();
  self.postMessage({track: source.track}, [source.track]);

  await sendTrack.applyConstraints({width: 320, height: 200, frameRate: 30});

  const wt = new WebTransport("https://webtransport.org:8080/up");

  const transformer = new TransformStream({transform: myRemoveBackgroundFromVideo});
  await Promise.all([
    track.readable.pipeThrough(transformer).pipeTo(source.writable),
    sendTrack.readable.pipeTo(await wt.createUnidirectionalStream()) // 30 fps
  ]);
};

The above example avoids using the tee() function to serve multiple consumers, due to its issues with real-time streams.

For brevity, the example also over-simplifies sending video frames over WebTransport (incurring head-of-line blocking).

4. Implementation advice

This section is informative.

4.1. Use with multiple consumers

There are use cases where the programmer may want a single stream of frames to be consumed by multiple consumers.

Examples include the case where the result of a background blurring function should be both displayed in a self-view and encoded using a VideoEncoder.

For cases where both consumers are consuming unprocessed frames, and synchronization is not desired, instantiating multiple MediaStreamTrack clones is a robust solution.

For cases where both consumers intend to convert the result of a processing step into a MediaStreamTrack using a VideoTrackSource, for example when feeding a processed stream to both a <video> tag and an RTCPeerConnection, attaching the resulting MediaStreamTrack to multiple sinks may be the most appropriate mechanism.

For cases where the downstream processing takes frames, not streams, the frames can be cloned as needed and sent off to the downstream processing; "clone" is a cheap operation.
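
A non-normative sketch of this pattern, feeding the same frames to a WebCodecs VideoEncoder and to a hypothetical paintToCanvas() helper:

// encoder is a configured VideoEncoder; paintToCanvas() is a hypothetical
// helper that closes the frame it is given once it has painted it.
for await (const frame of track.readable) {
  const copy = frame.clone();  // cheap: adds a reference to the media data
  encoder.encode(frame);
  frame.close();
  paintToCanvas(copy);
}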

When the stream is the output of some processing, and both branches need a Stream object to do further processing, one needs a function that produces two streams from one stream.

However, the standard tee() operation is problematic in this context:

Therefore, the use of tee() with streams containing media should only be done with a full understanding of the implications. Instead, custom mechanisms for splitting streams, more appropriate to the use case, should be used, as sketched below.
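
As an illustration, the following non-normative sketch (splitFrames is a hypothetical helper, not part of this specification) splits one stream of frames into branches by cloning each frame, dropping clones on a congested branch so that a slow consumer cannot stall the others:

function splitFrames(readable, numBranches = 2) {
  const controllers = [];
  const branches = Array.from({length: numBranches}, () =>
    new ReadableStream({start(c) { controllers.push(c); }}, {highWaterMark: 1}));
  (async () => {
    for await (const frame of readable) {
      for (const c of controllers) {
        if (c.desiredSize > 0) c.enqueue(frame.clone());  // drop when congested
      }
      frame.close();  // the original is no longer needed once cloned
    }
    for (const c of controllers) c.close();
  })();
  return branches;
}

const [selfViewBranch, encodeBranch] = splitFrames(track.readable);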

5. Security and Privacy considerations

This API exposes data access to and from MediaStreamTrack.

The security and privacy of VideoTrackSource relies on the same-origin policy. That is, the data VideoTrackSource can make available in the form of a MediaStreamTrack must already be visible to the document before a VideoFrame can be constructed and written to the VideoTrackSource. Any attempt to create VideoFrame objects using cross-origin data will fail. Therefore, VideoTrackSource does not introduce any new fingerprinting surface.

The MediaStreamTrack's readable attribute introduced by this API exposes the same data that is exposed by other MediaStreamTrack sinks such as RTCPeerConnection and media elements. The security and privacy of MediaStreamTrack's readable therefore relies on the security and privacy of MediaStreamTrack. For example, camera, microphone and screen-capture tracks rely on explicit use authorization via permission dialogs (see [MEDIACAPTURE-STREAMS] and [MEDIACAPTURE-SCREEN-SHARE]), while element capture and VideoTrackSource rely on the same-origin policy.

A potential issue with MediaStreamTrack's readable is resource exhaustion. For example, a site might hold on to too many open VideoFrame objects and deplete a system-wide pool of GPU-memory-backed frames. UAs can mitigate this risk by limiting the number of pool-backed frames a site can hold, for example by reducing the maximum number of buffered frames and by refusing to deliver more frames to the readable once the budget limit is reached. Accidental exhaustion is also mitigated by the automatic closing of VideoFrame objects once they are written to a VideoTrackSource.

Conformance

Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Conformant Algorithms

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps can be implemented in any manner, so long as the end result is equivalent. In particular, the algorithms defined in this specification are intended to be easy to understand and are not intended to be performant. Implementers are encouraged to optimize.

Index

Terms defined by this specification

Terms defined by reference

References

Normative References

[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INFRA]
Infra. URL: https://infra.spec.whatwg.org/
[MEDIACAPTURE-STREAMS]
Media Capture and Streams. URL: https://www.w3.org/TR/mediacapture-streams/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[STREAMS]
Streams. URL: https://streams.spec.whatwg.org
[WEBCODECS]
WebCodecs. URL: https://wicg.github.io/web-codecs/
[WebIDL]
Boris Zbarsky. Web IDL. 15 December 2016. ED. URL: https://heycam.github.io/webidl/
[WEBRTC-1]
WebRTC 1.0: Real-time Communication Between Browsers. URL: https://www.w3.org/TR/webrtc/

Informative References

[MEDIACAPTURE-SCREEN-SHARE]
Screen Capture. URL: https://w3c.github.io/mediacapture-screen-share/
[WEBRTC-NV-USE-CASES]
Bernard Aboba. WebRTC Next Version Use Cases. 16 March 2021. WD. URL: https://www.w3.org/TR/webrtc-nv-use-cases/
[WEBTRANSPORT]
WebTransport. URL: https://www.w3.org/TR/webtransport/

IDL Index

partial interface MediaStreamTrack {
  [Exposed=DedicatedWorker] readonly attribute ReadableStream readable;
};

[Exposed=DedicatedWorker]
interface VideoTrackSource {
  constructor();
  readonly attribute WritableStream writable;
  attribute boolean muted;
  readonly attribute MediaStreamTrack track;
};