We’ve seen the meteoric rise of hardware-accelerated video transcoding: a growing number of graphics cards and processors can decode and then re-encode compressed video, whether with GPU shaders or dedicated transcoding logic. These days, every new GeForce, Radeon, A-series APU, and Core processor ships with some form of hardware transcoding, and vendors of video conversion software around the world are working hard to support them all.
Because both CPUs and GPUs now come with their own dedicated transcoding logic, some systems offer multiple paths to hardware acceleration. Traditional GPUs rely on fast shared local memory that the programmer must manage explicitly, and they schedule many tiny threads in hardware; these and other architectural differences mean that applications usually benefit from tuning to the hardware they are intended to run on. The folks behind the popular x264 software encoder have been quietly plugging away at an OpenCL-accelerated version of their lookahead pipeline. Lookahead accounts for only 10-25% of total encoding time, according to x264 lead developer Jason Garrett-Glaser, but it allows nearly unlimited parallelism and is relatively easy to implement in OpenCL; rewriting the entire x264 encoder in OpenCL, by contrast, would be "very hard." Garrett-Glaser says the accelerated lookahead can increase performance by up to 40% on AMD's new Trinity APUs and by a factor of two on the latest Radeon graphics cards.
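x264's actual OpenCL lookahead code isn't shown here, but the "explicitly programmed local memory" pattern described above looks roughly like the following CUDA sketch. Everything in it (the kernel name, the 8x8 block size, SAD as the cost metric, the 16x16 search window) is an illustrative assumption rather than x264 code: each thread block stages its current pixels in fast on-chip shared memory once, then all 256 threads reuse that tile while scoring motion-vector candidates.

```cuda
// Hedged sketch (assumed names and sizes; not x264's OpenCL code) of
// explicitly managed local memory: stage the current 8x8 block once in
// shared memory, then let every thread reuse it instead of re-reading
// the same pixels from slow global memory.
__global__ void block_search_8x8(const unsigned char *cur,
                                 const unsigned char *ref,
                                 int stride, unsigned int *best)
{
    __shared__ unsigned char tile[8][8];   // explicitly managed local memory

    int bx = blockIdx.x * 8, by = blockIdx.y * 8;  // this 8x8 block's origin
    int t  = threadIdx.x;                          // 256 threads per block

    // The first 64 threads cooperatively stage the current block once...
    if (t < 64)
        tile[t / 8][t % 8] = cur[(by + t / 8) * stride + bx + t % 8];
    __syncthreads();

    // ...then every thread reuses the staged tile while scoring one
    // candidate displacement in a 16x16 search window (a padded frame is
    // assumed, so reads never leave the reference image).
    int dx = t % 16 - 8, dy = t / 16 - 8;
    unsigned int sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += abs((int)tile[y][x]
                       - (int)ref[(by + dy + y) * stride + bx + dx + x]);

    // Keep the lowest SAD per block; a real search would also record the
    // winning vector. best[] must be initialized to UINT_MAX beforehand.
    atomicMin(&best[blockIdx.y * gridDim.x + blockIdx.x], sad);
}
```

Launched as block_search_8x8<<<dim3(width/8, height/8), 256>>>(...), every 8x8 block is searched concurrently, and each staged tile is read 256 times from fast shared memory but only once from global memory: that reuse is the point of programming local memory explicitly.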
Targeted applications for HW acceleration using GPU cards:
1. UltraHD (3840x2160) & Cinema 4K (4096x2160)
2. Internet and mobile video streaming (VOD)
3. HD Video Communication
4. Broadcast Distribution (OTT)
The main differences between H.264 and H.265 (HEVC) are:
1. H.265 targets roughly the same visual quality at about half the bitrate of H.264, at the cost of considerably higher encoding complexity.
2. H.264's 16x16 macroblocks are replaced by coding tree units (CTUs) of up to 64x64 pixels, with more flexible partitioning for prediction and transforms.
3. H.265 supports many more intra prediction modes (35 vs. H.264's 9 for 4x4 blocks).
4. H.265 adds the sample adaptive offset (SAO) filter after de-blocking and a DST transform for 4x4 intra luma blocks, both of which appear in the GPU module list below.
TECHNOLOGY OVERVIEW: To get maximum utilization from the available CPU and GPU resources, the work should be partitioned so that each processor executes the modules it handles best:
CPU modules:
- Formatting
- Pre-analysis and decision making
- Configuration
- Rate control
- CABAC
- Overall operation

GPU modules:
- Motion estimation
- Intra and inter prediction
- DCT and DST transform and inverse transform
- Quantization and dequantization
- SAO and de-blocking filters

The current frame's pixels are predicted from the reference frame's pixels:
- The reference picture can be from the past or the future.
- Prediction happens on a block-by-block basis, and each block can choose among multiple reference frames.

Motion Compensation:
- The most compute-intensive part of motion compensation is sub-pixel interpolation: an 8- or 7-tap filter for luma and a 4-tap filter for chroma.
- Sub-pixel interpolation is data parallel, i.e., every block (indeed, every pixel) within a frame can be interpolated independently, which makes it well suited to GPU computing; a minimal kernel sketch follows.
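As a concrete illustration of that data parallelism, here is a hedged CUDA sketch of a horizontal 4-tap, chroma-style interpolation with one thread per output pixel. The kernel name, tap handling, and single horizontal pass are illustrative assumptions; a real HEVC encoder uses the spec's exact filter tables, a vertical pass, and the 7/8-tap luma filters.

```cuda
#include <cuda_runtime.h>

// Filter taps in constant memory; upload from the host with
// cudaMemcpyToSymbol. In a real encoder these would be one row of the
// HEVC chroma filter table, selected by the fractional MV position.
__constant__ short c_coef[4];

// One thread per output pixel: every pixel's interpolation is
// independent, which is exactly the data parallelism described above.
__global__ void interp_chroma_h(const unsigned char *src, short *dst,
                                int width, int height, int stride)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    int sum = 0;
    for (int k = 0; k < 4; ++k) {
        int sx = min(max(x + k - 1, 0), width - 1);  // clamp at frame edges
        sum += c_coef[k] * src[y * stride + sx];
    }
    dst[y * stride + x] = (short)sum;  // intermediate value; the vertical
                                       // pass and rounding come later
}
```

Launched over the whole frame with, say, 16x16 thread blocks, tens of thousands of these filter evaluations run concurrently, which is why this stage maps so well to GPU hardware.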
Challenges in CPU+GPU Implementation:
1. Efficient partitioning of work between CPU and GPU
- The effective FPS of the decoder will be the minimum of the FPS the CPU and the GPU each achieve on their share of the work (FPS_eff = min(FPS_CPU, FPS_GPU)), so the partitioning must be balanced enough that both sides run at nearly the same speed.
2. Efficient pipelining of data between CPU and GPU
- The algorithms running on the CPU depend on the output of algorithms on the GPU, and/or vice versa.
- A good design ensures that neither the CPU nor the GPU spends time waiting for the other's output; a double-buffering sketch follows this list.
3. Cache coherency
- Coherency between the CPU's and the GPU's copies of the data must be ensured.
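To make the pipelining point concrete, here is a hedged CUDA sketch of the double buffering described in challenge 2. The helper names (gpu_frame_work, cpu_frame_work) are placeholders standing in for the GPU and CPU module lists above: the GPU processes frame n in one stream while the CPU consumes frame n-1's finished results, so neither side idles.

```cuda
#include <cuda_runtime.h>

// Placeholders for the split in the table above: the GPU side enqueues
// kernels (motion estimation, transforms, ...) plus an async copy-back;
// the CPU side runs rate control, CABAC, etc. on the returned data.
void gpu_frame_work(int n, cudaStream_t s) { /* kernels + cudaMemcpyAsync */ }
void cpu_frame_work(int n)                 { /* entropy coding, decisions */ }

void encode(int num_frames)
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int n = 0; n < num_frames; ++n) {
        gpu_frame_work(n, stream[n & 1]);   // GPU starts frame n (async)
        if (n > 0) {
            // While the GPU runs, finish frame n-1 on the CPU.
            cudaStreamSynchronize(stream[(n - 1) & 1]);
            cpu_frame_work(n - 1);
        }
    }
    if (num_frames > 0) {                   // drain the final frame
        cudaStreamSynchronize(stream[(num_frames - 1) & 1]);
        cpu_frame_work(num_frames - 1);
    }
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```

If the two sides are well balanced (challenge 1), each iteration then costs roughly max(t_GPU, t_CPU) rather than t_GPU + t_CPU, and throughput approaches the min-of-the-two FPS bound above.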
Why GPUs will stay ahead
1. Technical reasons:
- SIMD cores (instead of MIMD cores) mean a larger proportion of the chip is devoted to floating-point computation
- tightly coupled, fast graphics memory means much higher bandwidth
2. Economic reasons:
- CPUs are driven by price-sensitive office/home computing, and it is not clear those markets need vastly more speed
- CPU development may head toward low-cost, low-power chips for mobile and embedded applications
- GPUs are driven by high-end applications whose users are prepared to pay a premium for high performance
This gives at least a 10x improvement in energy efficiency and price/performance compared to a pair of quad-core Intel Xeons: effectively a personal cluster in a PC under your desk.