Streaming Media and RTOS: August 2014

x265 creates a pool of worker threads and shares this thread pool with all encoders within the same process (it is process global, aka a singleton). The number of threads within the thread pool is determined by the encoder which first allocates the pool, which by definition is the first encoder created within each process.

--threads specifies the number of threads the encoder will try to allocate for its thread pool. If the thread pool was already allocated, this parameter is ignored. By default x265 allocates one thread per (hyperthreaded) CPU core in your system.

Work distribution is job based. Idle worker threads ask their parent pool object for jobs to perform. When no jobs are available, idle worker threads block and consume no CPU cycles.

Objects which desire to distribute work to worker threads are known as job providers (and they derive from the JobProvider class). When job providers have work they enqueue themselves into the pool’s provider list (and dequeue themselves when they no longer have work). The thread pool has a method to poke awake a blocked idle thread, and job providers are recommended to call this method when they make new jobs available.
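The job-provider pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not x265's C++ implementation: the class and method names (ThreadPool, JobProvider, take_job, enqueue, stop) are hypothetical, and for brevity the pool removes an exhausted provider from its list, whereas in x265 providers dequeue themselves.

```python
import threading
from collections import deque


class JobProvider:
    """Hypothetical stand-in for an object that has work to hand out."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)
        self._lock = threading.Lock()

    def take_job(self):
        # Return one callable job, or None when this provider has run dry.
        with self._lock:
            return self._jobs.popleft() if self._jobs else None


class ThreadPool:
    """Workers ask the pool for jobs and block when none are available."""

    def __init__(self, num_threads):
        self._providers = []
        self._cond = threading.Condition()
        self._shutdown = False
        self._workers = [threading.Thread(target=self._worker)
                         for _ in range(num_threads)]
        for worker in self._workers:
            worker.start()

    def enqueue(self, provider):
        # Register a provider and poke any blocked idle workers awake.
        with self._cond:
            self._providers.append(provider)
            self._cond.notify_all()

    def _find_job(self):
        # Caller must hold self._cond. Exhausted providers are dropped.
        for provider in list(self._providers):
            job = provider.take_job()
            if job is not None:
                return job
            self._providers.remove(provider)
        return None

    def _worker(self):
        while True:
            with self._cond:
                job = self._find_job()
                while job is None and not self._shutdown:
                    self._cond.wait()  # blocked: consumes no CPU cycles
                    job = self._find_job()
                if job is None:
                    return  # shutting down and nothing left to do
            job()  # run the job outside the pool lock

    def stop(self):
        # Let workers drain the remaining jobs, then join them.
        with self._cond:
            self._shutdown = True
            self._cond.notify_all()
        for worker in self._workers:
            worker.join()


squares = []
squares_lock = threading.Lock()


def make_job(n):
    def job():
        with squares_lock:
            squares.append(n * n)
    return job


pool = ThreadPool(num_threads=4)
pool.enqueue(JobProvider([make_job(n) for n in range(8)]))
pool.stop()
print(sorted(squares))
```

A condition variable is used here so that idle workers sleep until a provider is enqueued, which mirrors the "poke awake" behaviour described in the text.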

x265_cleanup() frees the process-global thread pool, allowing it to be reallocated if necessary, but only if no encoders are allocated at the time it is called.

Frame Threading

Frame threading is the act of encoding multiple frames at the same time. It is a challenge because each frame will generally use one or more of the previously encoded frames as motion references and those frames may still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting the motion search region within these reference frames to just one macroblock row below the coincident row being encoded. Thus a frame could be encoded at the same time as its reference frames so long as it stayed one row behind the encode progress of its references (glossing over a few details).

x265 has the same frame threading mechanism, but we generally have much less frame parallelism to exploit than x264 because of the size of our CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock rows available each frame while x265 only has 17 64x64 CTU rows.
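The row counts quoted above follow from dividing the frame height by the block height and rounding up, since a partial row at the bottom of the picture still has to be encoded. A small sketch (the function name is made up for illustration):

```python
def rows_per_frame(frame_height, block_size):
    # Ceiling division: a partial row at the bottom still counts as a row.
    return (frame_height + block_size - 1) // block_size


print(rows_per_frame(1080, 16))  # 16x16 macroblock rows in 1080p: 68
print(rows_per_frame(1080, 64))  # 64x64 CTU rows in 1080p: 17
```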

The second extenuating circumstance is the loop filters. The pixels used for motion reference must first be processed by the loop filters, and the loop filters cannot run until a full row has been encoded. They must also run a full row behind the encode process so that the pixels below the row being filtered are available. When you add up all the row lags, each frame ends up being 3 CTU rows behind its reference frames (the equivalent of 12 macroblock rows for x264).
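As a rough model of this lag, the sketch below converts the 3-row CTU lag into macroblock rows and checks whether a given row may start, given how far the reference frame has progressed. The function names and the exact rule (a row may start once the reference has finished "row index + lag" rows, capped at the frame's row count) are simplifying assumptions for illustration, not x265's actual bookkeeping.

```python
CTU_SIZE = 64
MACROBLOCK_SIZE = 16


def lag_in_macroblock_rows(ctu_row_lag):
    # Each 64-pixel CTU row covers four 16-pixel macroblock rows.
    return ctu_row_lag * CTU_SIZE // MACROBLOCK_SIZE


def can_start_row(row, ref_rows_done, total_rows=17, lag=3):
    # Assumed rule: the reference must stay `lag` rows ahead of us,
    # except near the bottom of the frame where it has simply finished.
    return ref_rows_done >= min(row + lag, total_rows)


print(lag_in_macroblock_rows(3))   # 12
print(can_start_row(0, 2))         # False: reference is only 2 rows in
print(can_start_row(0, 3))         # True
print(can_start_row(16, 17))       # True: reference frame is complete
```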

The third extenuating circumstance is that when a frame being encoded becomes blocked waiting for a reference frame row to become available, that frame’s wave-front becomes completely stalled. When the row does become available, it can take quite some time for the wave to be restarted, if it ever is. This makes WPP many times less effective when frame parallelism is in use.

--merange can have a negative impact on frame parallelism. If the range is too large, more rows of CTU lag must be added to ensure those pixels are available in the reference frames. Similarly --sao-lcu-opt 0 will cause SAO to be performed over the entire picture at once (rather than being CTU based), which prevents any motion reference pixels from being available until the entire frame has been encoded, which prevents any real frame parallelism at all.

Note: Even though the merange is used to determine the amount of reference pixels that must be available in the reference frames, the actual motion search is not necessarily centered around the coincident block. The motion search is actually centered around the motion predictor, but the available pixel area (mvmin, mvmax) is determined by merange and the interpolation filter half-heights.
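To make the distinction in the note concrete, here is a one-dimensional sketch of a search window that is centered on the motion predictor but clipped to the available pixel area. It takes mvmin and mvmax as given inputs and does not attempt to reproduce how x265 derives them; the function name and the choice to clamp the predictor itself into the available range are assumptions made for illustration.

```python
def search_window(pred_mv, merange, mvmin, mvmax):
    # Center the window on the predictor (clamped into the legal area),
    # then clip both ends so no candidate falls outside [mvmin, mvmax].
    center = min(max(pred_mv, mvmin), mvmax)
    low = max(center - merange, mvmin)
    high = min(center + merange, mvmax)
    return low, high


print(search_window(0, 57, -60, 60))    # (-57, 57): window fits entirely
print(search_window(40, 57, -60, 60))   # (-17, 60): clipped at mvmax
print(search_window(100, 57, -60, 60))  # (3, 60): predictor itself clamped
```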

When frame threading is disabled, all reference frames are always fully available (by definition), so the available pixel area is not restricted at all, and this can sometimes improve compression efficiency. Because of this, the output of encodes with frame parallelism disabled will not match the output of encodes with frame parallelism enabled. When it is enabled, however, the number of frame threads should have no effect on the output bitstream except when using ABR or VBV rate control or noise reduction.

When --nr is enabled, the output for any given number of frame threads is deterministic, but outputs produced with different frame thread counts will not match each other because each frame encoder maintains a cumulative noise reduction state.

VBV introduces non-determinism in the encoder, at this point in time, regardless of the amount of frame parallelism.

By default frame parallelism and WPP are enabled together. The number of frame threads used is auto-detected from the (hyperthreaded) CPU core count, but may be manually specified via --frame-threads.

Each frame encoder runs in its own thread (allocated separately from the worker pool). This frame thread has some pre-processing responsibilities and some post-processing responsibilities for each frame, but it spends the bulk of its time managing the wave-front processing by making CTU rows available to the worker threads when their dependencies are resolved. The frame encoder threads spend nearly all of their time blocked in one of 4 possible locations:
1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed and loop-filtered reference pixels to become available
3. blocked, waiting for wave-front completion
4. blocked, waiting for the main thread to consume an encoded frame

Lookahead

The lookahead module of x265 (the lowres pre-encode which determines scene cuts and slice types) uses the thread pool to distribute the lowres cost analysis to worker threads. It follows the same wave-front pattern as the main encoder except it works in reverse-scan order.

The function slicetypeDecide() itself may also be performed by a worker thread if your system has enough CPU cores to make this a beneficial trade-off; otherwise it runs within the context of the thread which calls x265_encoder_encode().

Reference from: http://x265.readthedocs.org/en/latest/threading.html

One year after the Comcast Reference Design Kit (RDK) made its public debut at The Cable Show, the platform has more than lived up to its “service velocity” goal. In short, the RDK went from being a PowerPoint presentation at last year’s Cable Show to a viable platform that is cutting down the development cycle not only for set-top boxes and gateways, but also for applications.

The Comcast RDK was developed internally using open-source components and by working with various vendors. The RDK is a community-based ecosystem that allows developers, vendors and cable operators to use a defined stack of software on one layer in order to provision set-top boxes and gateways.

The RDK allows all of the interested parties to develop once and then scale across multiple environments – in the CableCard/QAM/MPEG-2 environment of today, as well as in the IP environment of tomorrow.

The RDK includes CableLabs’ “Reference Implementation” for OCAP and tru2way, as well as the Java Virtual Machine (JVM). Open-source components of the RDK include GStreamer, Qt and WebKit, which are execution environments that can be tailored to each MSO. There are also optional plug-ins, such as Adobe Flash and Smooth HD.

The RDK is all about service velocity, which was on display at Imagine Park on the last day of The Cable Show in June. Vendor demonstrations there showed that the RDK enabled them to develop products and applications in a matter of weeks instead of months.

The Reference Design Kit (RDK) is a pre-integrated software bundle that provides a common framework for powering customer-premises equipment (CPE) from TV service providers, including set-top boxes, gateways, and converged devices. The RDK was created to accelerate the deployment of next-gen video products and services. It enables TV service providers to standardize certain elements of these devices, but also to easily customize the applications and user experiences that ride on top.

The RDK is supported by more than 140 licensees, including CE manufacturers, SoC vendors, software developers, system integrators, and TV service providers. It is administered by RDK Management LLC, a joint venture between Comcast Cable, Time Warner Cable, and Liberty Global. The RDK software is available at no cost to RDK licensees in a shared-source manner, and RDK community member companies can contribute software changes and enhancements to the RDK stack.

There are many benefits to using the RDK technology, including:
1. Speed to Market
2. Expanded Services
3. Standardization
4. Collaboration

Friday, August 29, 2014

Comparison of encoder compression efficiency libdec265

Thursday, August 14, 2014

Threading/Thread Pool X265

Tuesday, August 12, 2014

Why RDK?
