• An H.264 Decoder Based on the ADSP-BF533 Processor

    Abstract: Compared with other video coding standards, H.264 offers better compression performance, but its high computational complexity limits its application. The Blackfin processor is a low-power, high-performance fixed-point DSP from Analog Devices (ADI) with a very high performance-to-price ratio, which makes it an ideal platform for a DSP implementation of H.264. This article discusses how a real-time H.264 decoder can be realized on the Blackfin processor through several optimization techniques, and gives experimental results.

    Keywords: H.264, Blackfin, ADSP, real-time decoder, BF533

     

    Introduction

        H.264 is the new video coding standard formulated by the Joint Video Team (JVT), which was set up jointly by ITU-T VCEG and ISO/IEC MPEG. It is positioned to cover the entire range of video applications. The standard adopts new techniques such as variable-block-size macroblock motion compensation, multiple reference frames, an integer transform, motion estimation with 1/4-pixel precision, and a deblocking filter. These give better compression performance but also greatly increase the amount of computation.

        The Blackfin processor uses the Micro Signal Architecture that ADI and Intel developed together, with dedicated video processing instructions added to the architecture. Its clock frequency reaches 756 MHz, and it can complete 1200 M multiply-accumulate operations per second. Compared with DSPs that use a superscalar or very long instruction word (VLIW) architecture, for example the TI C6000 series, the Blackfin processor has a large advantage in power consumption and cost, and is well suited to embedded video applications.

     

    1 H.264 video coding standard

    The basic structure of an H.264 codec is similar to that of earlier coding standards (H.263, MPEG-4, and so on): it is composed of functional units such as motion compensation, transform, quantization, entropy coding, and an in-loop deblocking filter. The improvements in H.264 lie within the individual functional modules, mainly in the following aspects:

    ①High-precision motion prediction with 1/4-pixel accuracy.

    ②Multiple macroblock partition modes. The luminance component of each macroblock (16×16 pixels) can be partitioned in 7 ways: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4.

    ③Multiple reference frames. When a frame is coded, it may choose among 5 different reference frames.

    ④Integer transform. An integer transform on 4×4 pixel blocks replaces the DCT.

        ⑤H.264/AVC supports two entropy coding methods: CAVLC (context-based adaptive variable-length coding) and CABAC (context-based adaptive binary arithmetic coding). CAVLC has better error resilience but lower coding efficiency than CABAC; CABAC has high coding efficiency but needs more computation and storage.

        ⑥Intra prediction coding. H.264 uses a variety of well-designed intra prediction modes, which greatly reduce the bit rate of I frames.

        ⑦The Network Abstraction Layer (NAL) gives the video coding layer a unified interface that is independent of the network, so that coded video data can adapt to different network environments.

        H.264 defines 7 profiles: Baseline, Main, Extended, High, High 10, High 4:2:2, and High 4:4:4, each representing a different set of technical constraints and algorithms. Of these, use of the Baseline profile does not incur patent royalties.

     

    2 Hardware and software platform based on the ADSP-BF533

    The hardware platform is ADI's ADSP-BF533 EZ-KIT Lite evaluation board. The board includes 1 ADSP-BF533 processor, 32 MB of SDRAM, 2 MB of Flash, an AD1836 audio codec with 4 input / 6 output audio interfaces, an ADV7183 video decoder and an ADV7171 video encoder with 3 input / 3 output video interfaces, 1 UART interface, 1 USB debug interface, and 1 JTAG debug interface. The system structure of the board is shown in Figure 1.

     

     

        The ADSP-BF533 processor on the evaluation board runs at up to 756 MHz. It has the following features: two 16-bit multiply-accumulators (MACs); two 40-bit arithmetic logic units (ALUs); four 8-bit video ALUs; one 40-bit shifter; dedicated video signal processing instructions; 148 KB of internal memory (of which 16 KB can be configured as instruction cache and 32 KB as data cache); dynamic power management; and so on. The Blackfin processor also includes a rich set of peripherals and interfaces: the EBIU (4 banks of 128 MB SDRAM and 4 banks of 1 MB asynchronous memory), 3 timer/counters, 1 UART, 1 SPI interface, 2 synchronous serial ports, 1 parallel peripheral interface (PPI, supporting the ITU-R 656 data format), and so on. The architecture of the Blackfin processor fully reflects its support for the algorithms of media applications, video applications in particular.

        The software was verified as follows. First, the H.264 bitstream file is copied into the evaluation board's memory through the DSP emulator. Then the software reads the bitstream data from memory and decodes it. Finally, the decoded data is sent through the PPI interface to the ADV7171 chip, which encodes the incoming video data into PAL format for display on a monitor.

        The software development platform for the Blackfin processor is VisualDSP++ 4.0.

     

    3 H.264 real-time decoder software design

    3.1 Software system design

    To meet the real-time decoding requirement, the program design has to be optimized. The optimization flow is as follows:

    ①Verify and evaluate the algorithm on a PC, and optimize the program flow and the data structure design.

    ②Port the program code to the Blackfin processor: compile it in the VisualDSP++ integrated development environment, delete the PC-specific code, and add the DSP-specific code.

    ③Optimize for the DSP platform: set the compiler options to optimize for speed, optimize at the C level, rewrite the most time-consuming functions in assembly, and reduce function execution time by using the dedicated vector instructions and parallel instructions.

    3.2 Implementing and optimizing the decoder on a PC

    The decoder is based on the JM9.6 reference software, with optimizations in the following areas:

    ①Because only the Baseline profile is supported, redundant code for unsupported features such as B slices, SI slices, SP slices, and data partitioning is deleted;

    ②JM9.6 allocates memory, reads the information, and then frees the memory each time it processes a slice; this is revised so that memory allocation and release are arranged to happen as rarely as possible;

    ③I frames and P frames are decoded by separate routines, and macroblock decoding is likewise split into different decoding modules by prediction mode and prediction direction, which removes redundant intermediate tests and raises decoding speed;

    ④The CAVLC code table lookup method is optimized.

    3.3 Porting the program

        VisualDSP++ is an integrated development and debugging environment that supports the Blackfin processor. It includes the VisualDSP++ Kernel (VDK), a C/C++ compiler, advanced graphical plotting tools, debugging tools, a device simulator, and many other functions, and it provides good support for development in C/C++ on the Blackfin processor.

     
        The first step in porting is to remove all functions that the compilation environment does not support (for example, certain time-related functions), change the file operations into operations that read from a file data buffer, and delete code that the DSP implementation does not need, such as SNR statistics collection and information printout. The second step is to add the hardware-related code, which includes the system initialization code, the output module code, the interrupt service routines, and the decoding speed control routine.

        After porting, an H.264 decoder based on the ADSP-BF533 processor is running, but its speed does not yet meet the real-time decoding requirement, so further optimization is needed.

    3.4 Optimization for the DSP platform

        Optimization for the DSP platform is divided into system-level optimization, C-level optimization, and assembly-level optimization.

    (1) System-level optimization

        Turn on the compiler's optimization switch and set it to optimize for speed; turn on automatic inlining; turn on the "Interprocedural optimization" switch; use the PGO (Profile-Guided Optimization) feature of the VisualDSP++ compiler.

    (2) C-level optimization

        C-level optimization mainly targets the specific characteristics of the Blackfin processor:

        ①Write the linker description file so that frequently used data, for example the CAVLC entropy decoding tables, is stored in internal memory; enable the instruction cache and the data cache, and configure the instruction and data address ranges to which caching applies.

        ②Convert division operations into multiplications, or compute them by table lookup.

        ③Reduce accesses to off-chip memory. For off-chip memory regions that are accessed frequently, enable caching, and optionally use cache locking to keep the cached data from being replaced, which lowers the probability of a cache miss.

        ④Change data to the short type wherever the short type can represent it. For example, the input data of the 4×4 inverse integer transform was originally defined as int but can in fact be defined as short.
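
        As an illustration of item ②, the following is a minimal C sketch, not the article's actual code. It assumes a typical place where an H.264 decoder divides: dequantization needs QP/6 and QP%6 for a quantization parameter QP in the range 0 to 51. The names init_qp_tables, qp_div6, qp_mod6, and div6_mul are hypothetical.

```c
#include <stdint.h>

/* Hypothetical names for illustration; H.264 QP values range from 0 to 51. */
enum { QP_MAX = 51 };

static uint8_t qp_div6[QP_MAX + 1];
static uint8_t qp_mod6[QP_MAX + 1];

/* Fill both tables once at start-up so the per-block code never divides. */
void init_qp_tables(void)
{
    for (int qp = 0; qp <= QP_MAX; ++qp) {
        qp_div6[qp] = (uint8_t)(qp / 6);
        qp_mod6[qp] = (uint8_t)(qp % 6);
    }
}

/* Division by a constant replaced by a multiply and a shift:
 * (x * 43) >> 8 equals x / 6 for every x in 0..51 (not for all integers). */
static inline int div6_mul(int x)
{
    return (x * 43) >> 8;
}
```

        The tables cost 104 bytes, so they fit easily in internal memory, and uint8_t entries follow the same idea as item ④ of keeping data in the narrowest type that can hold it.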

    (3) Assembly-level optimization

    Assembly-level optimization usually follows these principles:

    ①Use registers in place of local variables. If a local variable holds an intermediate result of a computation, keeping it in a register instead saves much memory access time.

    ②Use hardware loops in place of software loops. The Blackfin processor has dedicated hardware that supports two levels of nested zero-overhead hardware loops. Replacing a software loop with a hardware loop avoids pipeline stalls and raises speed.

    ③Use parallel instructions and vector instructions. These exploit the Blackfin processor's SIMD architecture and the parallelism of its internal hardware resources, reducing the number of instructions executed and raising execution efficiency. One parallel instruction executes 2 or 3 non-parallel instructions simultaneously. A vector instruction applies the same operation to several data streams simultaneously.

    ④Use video processing instructions. Video applications can use the Blackfin processor's dedicated video processing instructions to improve execution efficiency.
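
    To make the idea of a vector instruction in item ③ concrete, here is a small C model of a dual 16-bit add and subtract, in which the high and low halves of a 32-bit register are processed independently. It is only a sketch of the semantics under the assumption of wrap-around (non-saturating) arithmetic; the function names add16x2 and sub16x2 are hypothetical, and on the DSP each call corresponds to a single instruction rather than several C operations.

```c
#include <stdint.h>

/* Add the two 16-bit halves of a and b independently.
 * A carry out of the low half does not spill into the high half. */
uint32_t add16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Subtract the two 16-bit halves of b from those of a independently.
 * A borrow in the low half does not affect the high half. */
uint32_t sub16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a - b) & 0xFFFFu;
    uint32_t hi = ((a >> 16) - (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

    For example, add16x2(0x00010002, 0x00030004) yields 0x00040006: two additions for the cost of one operation.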

        The most time-consuming functions are rewritten in assembly language to take full advantage of the Blackfin processor's SIMD architecture and hardware parallelism, performing several operations in one instruction cycle and reducing the instruction cycles each function needs. The most time-consuming functions include the macroblock decoding function decode_one_macroblock, the inverse integer transform function itrans, the deblocking filter function EdgeLoop, and the filter threshold computation function Get_Strength.

        The 4×4 inverse integer transform function itrans and the 1/4-pixel interpolation filter get_block() are used below to show the performance gain from assembly optimization. The itrans function uses a two-stage butterfly operation: it first applies a one-dimensional inverse transform to each row of the 4×4 matrix, and then applies a one-dimensional inverse transform to each column. The one-dimensional transform uses the butterfly algorithm shown in Figure 2.

    The SIMD architecture of the Blackfin processor supports vector operations and can complete up to four 16-bit additions in 1 cycle. Its parallel instructions can perform an arithmetic operation and two data load/store operations simultaneously. For example, the butterfly operation above can be implemented with the following instructions (assume that register I0 holds the address of the input data y[4][4], I2 holds the address of the coefficient array cof[2]={0x7fff,0x4000}, I1 holds the address of the temporary variable tmp[4][4], and R2 and R1 hold the results):

    R7=[I0++];
    A1=R6.l*R7.l, A0=R6.l*R7.l (IS) || R5=[I0++] || [I1++]=R2;
    R4.h=(A1-=R5.l*R6.l), R4.l=(A0+=R5.l*R6.l) (IS) || W[I1++]=R1.h;
    R7.l=R6.l*R5.h (IS) || W[I1++]=R1.l;
    R5=R7>>>1 (V);
    A1=R6.l*R5.h, A0=R6.l*R5.l (IS);
    R3.h=(A1-=R6.l*R7.l), R3.l=(A0+=R6.l*R7.h) (IS);
    R2=R4+|+R3, R1=R4-|-R3;

        One one-dimensional inverse transform needs only 8 instructions. Counting the function call overhead and other housekeeping instructions, a complete 4×4 inverse integer transform needs 82 instruction cycles in total. Table 1 compares the results before and after optimization.
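
        For reference, the row-then-column butterfly can be written in portable C as in the sketch below. It follows the standard H.264 4×4 inverse transform and is not the article's itrans source; the names butterfly4 and inverse_transform_4x4 are hypothetical, and the final (x + 32) >> 6 rounding is the normalization the standard applies to the residual. The code assumes that >> on a negative int is an arithmetic shift, which holds on common compilers.

```c
#include <stdint.h>

/* One 1-D inverse butterfly on four coefficients. The stride lets the
 * same routine walk a row (stride 1) or a column (stride 4). */
static void butterfly4(int16_t *d, int stride)
{
    int y0 = d[0], y1 = d[stride], y2 = d[2 * stride], y3 = d[3 * stride];

    int e0 = y0 + y2;            /* even part */
    int e1 = y0 - y2;
    int e2 = (y1 >> 1) - y3;     /* odd part  */
    int e3 = y1 + (y3 >> 1);

    d[0]          = (int16_t)(e0 + e3);
    d[stride]     = (int16_t)(e1 + e2);
    d[2 * stride] = (int16_t)(e1 - e2);
    d[3 * stride] = (int16_t)(e0 - e3);
}

/* In-place 4x4 inverse integer transform: rows, then columns, then rounding. */
void inverse_transform_4x4(int16_t blk[16])
{
    for (int i = 0; i < 4; ++i)
        butterfly4(blk + 4 * i, 1);   /* each row */
    for (int i = 0; i < 4; ++i)
        butterfly4(blk + i, 4);       /* each column */
    for (int i = 0; i < 16; ++i)
        blk[i] = (int16_t)((blk[i] + 32) >> 6);
}
```

        A block whose only nonzero coefficient is a DC value of 64 comes out as sixteen samples equal to 1, which is a quick sanity check. The 16-bit element type matches the earlier remark that the transform input can be stored as short.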

        The get_block function performs 1/4-pixel interpolation on the pixel matrix. It first uses a six-tap filter for the 1/2-pixel interpolation, and then uses linear interpolation for the 1/4-pixel positions.

        The 1/2-pixel sample b is computed as b = round((E - 5F + 20G + 20H - 5I + J) / 32), as illustrated in Figure 3. E, F, G, H, I, and J are integer-position pixels, and b is the 1/2-pixel sample between G and H.
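
        The formula translates directly into scalar C, as in the sketch below. This is a plain reference version under stated assumptions, not the optimized get_block code: the names half_pel and quarter_pel are hypothetical, the result is clipped to the 8-bit range as the standard requires, and rounding is done by adding 16 before shifting right by 5.

```c
#include <stdint.h>

/* Half-pixel sample between G and H on one row.
 * p points at G, so E, F are p[-2], p[-1] and H, I, J are p[1], p[2], p[3]. */
uint8_t half_pel(const uint8_t *p)
{
    int sum = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
    int v = sum + 16;              /* rounding offset for the divide by 32 */

    if (v < 0)
        return 0;                  /* clip low before shifting */
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Quarter-pixel sample: rounded average of two neighbouring samples,
 * for example an integer pixel and the adjacent half-pixel sample. */
uint8_t quarter_pel(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}
```

        Each output sample costs six loads, four multiplications, and several additions in this form, which is why the function is a natural candidate for the packed loads and vector multiply-accumulate instructions of the DSP.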

     

    Pixel luminance values are of type unsigned char. A parallel instruction can first read 8 pixel luminance values into registers in 1 instruction cycle; a dedicated video instruction then unpacks 4 bytes into a register pair (R1:0 or R3:2), and vector instructions carry out 2 multiply-accumulate operations per cycle. Using the dedicated video instructions, vector instructions, and parallel instructions together reduces the number of instruction cycles the function needs.

     

    4 Experimental results

    The decoder was tested on the EZKit533 development board. For the foreman test sequence in CIF format (352×288), it reaches a decoding speed of 45 to 50 frames/s; for the mobile test sequence in CIF format, it reaches 40 to 44 frames/s. With the decoding speed control module added, CIF test sequences play back stably at 30 frames/s. The experimental results show that realizing a real-time H.264 decoder on the Blackfin processor is feasible. ADI even states that real-time decoding of D1 (720×576) video can be realized on a 600 MHz BF533 processor.

        The Blackfin processor combines low power consumption, low cost, and high performance. An H.264 video decoder realized on the Blackfin processor is well suited to embedded video applications such as IP set-top boxes, videophones, and PMPs (portable media players).


    Friday, September 19th, 2008 at 10:03