Video acceleration engine technology development based on Xtensa configurable processor technology

The growth of handheld multimedia devices has dramatically changed the positioning requirements of terminal multimedia chip vendors for products. The IC design goals of these chip providers are no longer limited to just one or two multimedia codecs. Consumers want their mobile devices to be able to play media using different devices, be able to encode with different standards, and be able to download or receive media data from different devices. Video decoders and encoder engines must meet a variety of needs with area and power advantages.

1. Traditional RTL method for designing video acceleration engine The design of the previous generation video ASIC mainly encodes and decodes MPEG-2, because this is the DVD standard. Some video ASICs also support MPEG-1 for VCD (Video CD) playback. In most cases, both the MPEG-2 encoder and the decoder use the RTL design method. A typical MPEG-2 video ASIC architecture is shown in Figure 1, which includes a video subsystem consisting of individual RTL modules, a main controller, and on-chip memory.

Figure 1. The MPEG-2 video ASIC architecture supports multiple video standards using a hard-wired RTL architecture. However, this also means that each video standard requires a dedicated RTL module. Implementing a multi-standard video acceleration engine with a hard-wired RTL module has certain limitations. Whether it is to implement a new video standard, update an existing standard, or eliminate a fault in it, you need to re-process the chip.

2, the advantage of using the processor as a video acceleration engine Programmable processor can meet the flexibility requirements of a variety of video standards. Compared with the RTL module design method, the programmable processor has several advantages: one is to easily interface the codec with the processor; the other is to meet the requirements of the new video standard, update the existing codec or adopt the software method. The chip can also modify the fault after the film is cast; the third is that the performance of the video codec can be easily improved by the software update method.
However, traditional 32-bit processors have performance bottlenecks because they are designed for general purpose code rather than for video acceleration engines. Embedded DSPs are not specifically tailored for video, but include hardware features, instructions, and interfaces that are specifically used in general-purpose DSP applications. Therefore, in order to implement video codecs on traditional RISC and DSP processors, these processors must be run at very high speed (Mhz), which requires a large amount of memory space, and therefore requires a large amount of power consumption, which is not suitable. Portable application.
By studying the amount of computation required for a video kernel program, you can see at a glance. For example, an absolute difference accumulation operation SAD, which is a commonly used method in motion estimation in most video coding algorithms. The SAD algorithm will find the motion of the macroblock in two consecutive consecutive video frames. To this end, it is necessary to calculate the sum of the absolute differences between the pixel values ​​corresponding to each of the two macroblocks.
The following C code gives a simple implementation of the SAD core algorithm:
For (row = 0; row < numrows; row++) {
For (col = 0; col < numcols; col++) {
Accum += abs(macroblk1[row][col] - macroblk2[row][col]);
} /* column loop */
} /* row loop */
The basic calculation method of the SAD core algorithm is shown in Figure 2. As shown in the figure, the SAD core algorithm first performs the subtraction operation, then takes the absolute value, and finally accumulates the previous results.


Figure 2 Difference absolute value accumulation (SAD) main calculation method Calculating a SAD operation composed of two 16x16 macroblocks on a RISC processor requires 256 subtraction operations, 256 absolute value operations, and 256 accumulation operations. 768 arithmetic operations, which does not include the fetch and store operations required for data transfer. Since this requires operation of all macroblocks of each frame, the increase in resolution causes video frames to increase, making the computational cost extremely expensive.
In fact, for a general-purpose general-purpose RISC processor (including some DSP instructions, such as multiply instructions and multiply-accumulate instructions), performing an H.264 reference decoding algorithm requires 250 MHz performance (CIF resolution) while performing An H.264 reference encoding algorithm requires performance in excess of 1 GHz (CIF resolution). To accomplish this, only the processor core needs 500mW of power, not to mention the power consumed by memory access and other components of the video SOC.

3. Configurable Processor Method A more efficient way to implement the SAD core algorithm on a processor is to create a "subtractive-absolute-addition" dedicated instruction. This will greatly reduce the overhead of arithmetic operations. For a 16x16 macroblock, the number of operations will be reduced from 768 to 256. Moreover, since a simple operation of a plurality of simple arithmetic operations can be realized by using one functional component, the above operation can be completed in one instruction cycle, which is equivalent to the original 256 cycles. Users cannot add instructions to a standard 32-bit RISC processor, but it is entirely possible to add dedicated instructions to a configurable processor. Configurable processors allow designers to extend processor functionality by selecting relevant configuration commands from the configurable options menu, including adding dedicated instructions, register files, and interfaces.
Below are the configuration and expansion options offered by modern configurable processors such as Tensilica's Xtensa processor, which is not possible with traditional fixed-mode processors.
(i) Configuration options: The options menu includes the following items:
a. Instructions that the designer needs or does not need. For example, 16x16 multiply or multiply accumulate, shift, float instructions, and so on.
b. Zero overhead loop, five- or seven-stage pipeline, local data load, or number of storage components.
c. Whether memory protection, memory address translation, or memory management unit (MMU) is required
d. Contains or does not include the system bus interface e. System bus width and local memory interface width f. Local (tightly coupled) memory size and number.
g. Number of interrupts and interrupt types and interrupt priorities.
(ii) Extended options: Add features that the designer has defined themselves, including:
a. Register and register file.
b. Multi-cycle, arbitration complex instruction features.
c. Single instruction stream multiple data stream SIMD function.
d. Turn the single-issue processor into a multi-transmit processor.
e. User-customized interface for direct reading and writing of data paths, such as processor core ports or pins like GPIO (General Purpose Input/Output) pins, for extending the queue interface of the FIFO queue (can Interface with other logic or processor cores).
The benefit of configuration options is that designers can build a moderately sized processor and meet their specific application by simply selecting the options associated with their application. The benefit of the extended option is that the designer customizes the processor based on the application, including the creation of dedicated instructions, register files, features, and associated interfaces to speed up the execution of the system application algorithms.

4. The key to automating software development tool suite support for configurability and scalability is not only the ability to automatically generate pre-verified RTL code for designers to customize the processor (including all system extensions), but also to automatically generate complete software. Tools include a processor-optimized development tool suite, a clock cycle-based instruction set simulator, and a system model.
This automation means that the compiler knows the new instructions, associated registers, and register files that the designer has added. Therefore, the compiler can schedule user-defined instructions and perform register allocation operations. Similarly, software developers can understand the designer-defined registers and register files in addition to the processor's native registers while debugging; at the same time, software developers can use the instruction set simulator to simulate new instructions defined by the designer. The real-time operating system RTOS ports and system models associated with the processor are also automatically generated. Tensilica's software tools automatically generate these software tools in less than an hour, a core commitment to users of configurable processors that can perform implementations such as SAD without having to use RTL.

5. Using a configurable processor to build a video acceleration engine to create multi-operational features Adding a fused operation such as SAD to a configurable processor can be a hassle. A new instruction called "sub.abs.ac" can perform the "subtraction-absolute-accumulate" operation. This new instruction can turn the operation in Figure 2 into the complex operation in Figure 3.

Figure 3: Using the new instruction to calculate the "subtract-absolute-accumulate" operation to add the instruction to the processor, the C compiler recognizes the new "sub.abs.ac" instruction and dispatches the relevant instructions; the scheduler The internal signals used by the "sub.abs.ac" feature are displayed; the assembler can process this new instruction; the instruction set simulator ISS can simulate in clock cycle mode.
A simplified diagram of the data path after the new dedicated video feature is plugged into the processor is shown in Figure 4. It is noted that in addition to generating functional logic, the hardware generation tool can automatically insert feedforward paths, control logic, and bypass logic to interconnect new functional components with other logic in the data path.

Figure 4 shows a simplified data path diagram after inserting the sub.abs.ac video-specific feature. The SAD algorithm containing the C code description of the new instruction is as follows:
For (row = 0; row < numrows; row++) { for (col = 0; col < numcols; col++) {
Sub.abs.ac( accum, macroblk1[row][col], macroblk2[row][col]);
} /* column loop */
} /* row loop */
As mentioned earlier, for a 16x16 macroblock, the number of operands in the main loop of the program is reduced to 256 after adding a new instruction (ie numrows = numcols = 16).

6. Establishing a Single Instruction Stream Multiple Data Stream The SAD program in front of the SIMD function can be further optimized. The inner loop in the program does the same operation for 16 columns in the macroblock. This is ideal for SIMD (Single Instruction Multiple Data) features, with the corresponding instruction "sub.abs.ac16" performing sub.abs.ac operations simultaneously for 16 pixels, as shown in Figure 5.

Figure 5 The single instruction stream multi-stream calculation operation for the 16-pixel simultaneous sub.abs.ac instruction The corresponding C-language procedure is named sub.abs.ac16. The SAD kernel C program code rewritten by this procedure name is as follows:
For (row = 0; row < numrows; row++) {
Sub.abs.ac16( accum, macroblk1[row], macroblk2[row]);
} /* row loop */
The rewritten SAD kernel program was reduced from 768 arithmetic operations to only 16 arithmetic operations.
However, only the above C program code is not enough. Because the instruction sub.abs.ac16 needs to read 128 bits of data from two macroblocks, this requires two aspects of support: a 128-bit register file and a wide data bit fetch/store interface, configurable The processor supports these features.

7, create a user-customized register file in the Xtensa configurable processor, indicating that a custom register file of any width is as simple as writing a line of programs. For example, a procedure statement called "myRegFile128" creates a 128-bit register file with a length of 4 and creates a corresponding new C data type. "myRegFile128" can be used for C/C++ program code description variables. Software tools also create "MOVE" operations for converting various C data types into new custom data types. Therefore, the SAD kernel C program code after the sub.abs.ac16 process and the new register file is as follows:
For (row = 0; row < numrows; row++) {
myRegFile128 mblk1, mblk2;
Mblk1 = macroblk1[row];
Mblk2 = macroblk2[row];
Sub.abs.ac16( accum, mblk1, mblk2);
} /* row loop */
The C/C++ compiler will now generate a MOVE instruction that moves the data from the normal C data type to the custom C data type "myRegFile128" and allocates registers for the new register file.

8. Create a high data bandwidth load/store interface In order to access data for high bandwidth custom register files (and corresponding single instruction stream multiple data stream SIMD functions), the processor should have high bandwidth data load/store operation capabilities. For configurable processors, designers can specify custom load and store operation instructions to directly perform high-bandwidth load/store data operations on custom register files. The compiler then automatically generates load/store instructions corresponding to the high bandwidth load/store interface.
The updated processor data path is shown in Figure 6. The hardware generation tool produces high bandwidth custom register files, load/store interfaces associated with the data memory, and corresponding feedforward logic, control logic, and bypass logic. The hardware tools also generate corresponding hardware logic for moving data from the reference register file to a user-defined register file.

Figure 6. Inserting the register file and the high-bandwidth load/store interface data path. 9. Loading or storing operations while updating the address. The Xtensa configurable processor allows the user to create another very useful extension of the function, which is to create an instruction that can be completed simultaneously. Address update operations and data load/store operations. The new load/store operation instructions created can perform the following functions concurrently: Load A1 ← Memory(Addr1); Addr1 = Addr1 + IndexUpdate
This instruction enables a "back-to-back" load/store operation without the need for special instructions to update the address.

10, the establishment of a first-in-first-out (FIFO) interface and general-purpose input / output port video and audio are streaming media, requiring fast data access to the processor. Conventional processors are limited by the system bus interface and the loading and storage access of data before the execution of data operations.
To support streaming data/output operations, the Xtensa configurable processor allows designers to define a first-in, first-out (FIFO) interface and a general-purpose input/output (GPIO) port for direct read and write access to the data path. The FIFO and GPIO ports can be any data width (up to 1024 bits) in an unlimited number (each can contain 1024 FIFOs and GPIO ports). These high-bandwidth interfaces can be directly connected to the data path, providing high data throughput, reading, processing, and writing data through the processor core, which is very important for multimedia and network applications.
The data path with FIFO interface and GPIO port is shown in Figure 7. The processor can do the following: first take the data from the two FIFOs (while ensuring that both FIFO queues are not empty), then calculate a complex operation (such as a multiply-accumulate rounding operation), and finally calculate The result is pushed into the output FIFO (while ensuring that the FIFO queue is not full). The hardware generation tool then generates the corresponding interface signals, control logic, and bypass logic, etc.; generates the complete RTL code for the configured processor. The software generation tool produces a complete set of compiler tools, as well as a clock cycle-accurate instruction set simulator ISS for simulating new instructions. Note that this ability by the designer to define the FIFO interface and GPIO port is unique to the Xtensa configurable processor.

Figure 7. High-speed communication using a custom first-in, first-out (FIFO) interface and a general-purpose input-output (IO) port. 11. Accelerate the execution of complex control-intensive code. The number and complexity of control code in multimedia applications is significantly increased, making the data in the program Intensive operations are approximately equivalent to computational time. For example, a key part of the H.264 main program decoder is the CABAC (Context-Dependent Binary Arithmetic Coding) algorithm. The algorithm is almost entirely a control flow decision tree with data calculation and data comparison.
Due to the high computational complexity, most traditional processors use a dedicated RTL accelerator to implement the CABAC algorithm. However, the CABAC algorithm can be implemented more efficiently on a configurable processor by adding a specific set of instructions. The benefit of this implementation is that data is constantly being exchanged between the processor and the RTL accelerator. Another benefit of using a configurable processor is the use of instruction-extension technology, which allows for better hardware and software interface partitioning because dedicated hardware is internal to the processor.

12. Summary Modern configurable and scalable processors are ideal for building custom video and audio engines. Tensilica offers related video and audio IP as SOC modules, including the HiFi 2 audio engine, the Diamond Series standard 38xVDO (video) multi-standard and multi-resolution video methods. The matching software codec is very important. The HiFi 2 audio engine works with related software to complete most popular audio codecs such as MP3, AAC, WMA and more. Similarly, the Diamond 38xVDO Video Acceleration Engine and the corresponding encoder and decoder software enable H.264 (including Baseline, Main and profiles), MPEG-4 (SP and ASP), MPEG-2, VC-1/WM9 And other standards. These video technologies cover a wide range of resolutions from QCIF to CIF and SD, with low power consumption and small area.

HDMI Adapters

Usb Hdmi Hub,Usb a Adapter,Usb Aux Adapter,Usb Port Extender

Pogo Technology International Ltd , https://www.pogomedical.com