• Optimization Strategies for DSP-Based Video Algorithm Systems

    In recent years, demand for digital video products has grown rapidly. Mainstream applications include video communication, video surveillance, and industrial automation; the most popular are entertainment applications such as DVD, HDTV, satellite TV, standard-definition (SD) and high-definition (HD) set-top boxes, digital cameras and HD camcorders, high-end displays (LCD, plasma, DLP), and personal cameras. These applications place higher demands on high-quality video codec algorithms and standards. The mainstream compression standards today are MPEG2, MPEG4, and H.264/AVC, and each of these codec standards can be implemented in many different ways. This article discusses several factors to consider when optimizing a video codec algorithm system based on TI's C64 series DSPs.

    TI's C64 series DSPs are widely used in video processing because of their powerful processing capability. However, because users differ in how well they understand the C64 architecture and instruction set, implementations of the same algorithm vary considerably in quality. The difference shows up in the CPU resources an implementation consumes: for example, implementations of H.264 MP@D1 decoding differ in CPU load. It also shows up in the subset of algorithm tools an implementation supports: for example, an H.264 MP@D1 decoder may implement CAVLC but not CABAC.

    The main causes of these differences are: optimization of the key algorithm modules; memory management during system integration; and EDMA resource allocation during system integration. This article discusses what to consider in each of these three areas when optimizing and integrating an algorithm.

    Optimization of key algorithm modules

    Generally speaking, each mainstream video compression standard has modules that consume a large share of the DSP's CPU resources, such as motion vector search in H.264/AVC, MPEG4, and AVS encoders. These modules are also called frequently while the overall system runs, so the first step is to identify them. TI CCS provides a project profiling tool (Profile) that quickly finds the modules consuming the most DSP CPU resources in the whole project, so that they can then be optimized.

    Optimization of these key algorithm modules can proceed in three steps, as shown in Figure 1. First, analyze the code carefully and restructure it: for example, minimize conditional branches, especially inside for loops, because branches break software pipelining. Table lookups, or intrinsics such as _cmpgtu4 and _cmpeq4, can replace compare-and-branch code. At the same time, use the #pragma directives provided by TI CCS to give the compiler as much information as possible, such as loop trip counts and data alignment. If the module still cannot meet the system requirements after this step, implement it in linear assembly.
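
    The branch-removal idea can be sketched in portable C. This is a minimal sketch under stated assumptions, not code from the article: cmpgtu4_emul is a hypothetical host-side stand-in that mimics the 4-bit mask returned by the _cmpgtu4 intrinsic, and count_above and kBits4 are illustrative names. On the TI compiler the stand-in would be replaced by the real intrinsic, and the MUST_ITERATE pragma tells the compiler the loop runs at least once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for the C64x _cmpgtu4 intrinsic:
   compares the four unsigned bytes packed in a and b and sets bit k
   of the result when byte k of a is greater than byte k of b. */
static uint32_t cmpgtu4_emul(uint32_t a, uint32_t b)
{
    uint32_t mask = 0;
    int k;
    for (k = 0; k < 4; k++) {
        uint32_t ab = (a >> (8 * k)) & 0xFFu;
        uint32_t bb = (b >> (8 * k)) & 0xFFu;
        mask |= (uint32_t)(ab > bb) << k;
    }
    return mask;
}

/* Number of set bits for each 4-bit mask value: a table lookup
   replaces four "if (pixel > thr) count++" branches. */
static const uint8_t kBits4[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Counts the pixels greater than thr without a data-dependent branch
   in the loop body. n must be a positive multiple of 4. */
unsigned count_above(const uint8_t *pix, int n, uint8_t thr)
{
    uint32_t thr4 = 0x01010101u * thr;  /* threshold replicated in 4 bytes */
    unsigned count = 0;
    int i;

    assert(n > 0 && n % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    #pragma MUST_ITERATE(1)             /* trip count is at least 1 */
#endif
    for (i = 0; i < n; i += 4) {
        uint32_t w;
        memcpy(&w, pix + i, 4);         /* load 4 pixels as one word */
        count += kBits4[cmpgtu4_emul(w, thr4)];
    }
    return count;
}
```

    Because every byte is compared against the same threshold and only the number of set bits is used, the result does not depend on the byte order of the host.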

    Linear assembly is a form of implementation that sits between C and assembly: the programmer controls which instructions are used but does not need to handle the allocation and use of registers and functional units (S, D, M, L). Linear assembly generally executes more efficiently than C. If linear assembly still cannot meet the requirements, use hand-written assembly. Writing highly parallel, deeply software-pipelined assembly requires steps such as building a dependency graph and a scheduling table; for reasons of space, these are not discussed here.

    For example, computing the SAD of a 16×16 macroblock during motion search takes a different number of DSP CPU cycles depending on the implementation: 83 cycles with C intrinsics, 74 cycles with linear assembly, and only 57 cycles with assembly. Hand-written assembly therefore consumes the fewest CPU cycles, provided that the programmer fully understands the DSP CPU architecture, the instruction set, and the structure of the algorithm module, and writes highly parallel, deeply software-pipelined code; otherwise hand-written assembly may be less efficient than linear assembly or C. An effective approach is to make full use of the functions in the algorithm libraries that TI provides: these functions are already optimized algorithm modules, most come with C, linear assembly, and assembly source code, and their APIs are documented.
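
    For reference, the 16×16 SAD computation being timed can be written in plain C as follows. This is a minimal, unoptimized sketch, not the code behind the cycle counts above; the function name and the stride parameters are assumptions. An optimized C64x version would typically process four pixels per instruction with packed-data intrinsics such as _subabs4 and _dotpu4 instead of the scalar inner loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 block of the current
   frame and a 16x16 candidate block of the reference frame.
   Strides are in bytes and may exceed 16 when the blocks are part
   of larger frames. */
unsigned sad16x16(const uint8_t *cur, int cur_stride,
                  const uint8_t *ref, int ref_stride)
{
    unsigned sad = 0;
    int x, y;

    assert(cur_stride >= 16 && ref_stride >= 16);

    for (y = 0; y < 16; y++) {
        for (x = 0; x < 16; x++)
            sad += (unsigned)abs((int)cur[x] - (int)ref[x]);
        cur += cur_stride;   /* advance to the next row */
        ref += ref_stride;
    }
    return sad;
}
```

    A motion search calls this function once per candidate position and keeps the motion vector with the smallest SAD, which is why the per-call cycle count matters so much.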

    Figure 1: Steps for optimizing the key modules of a TI DSP-based video algorithm (left); Figure 2: Optimization and integration flow for a TI DSP-based video algorithm (right).

    Memory management during algorithm system integration

    In DSP-based embedded system development, memory resources are limited, especially high-speed on-chip memory. Memory management during system integration is therefore important to optimizing the overall system: on the one hand it affects the speed of data reads and transfers, and on the other it affects the cache hit rate. The following analysis covers the program area and the data area in turn.

    Program area: The main principle is to place frequently called algorithm modules on chip. For this purpose TI CCS provides #pragma CODE_SECTION, which separates a function that needs individual placement from the .text section into its own section, so that the section can be given its own physical address mapping in the .cmd file. Code can also be managed dynamically, by loading a code section into on-chip memory before it runs. For example, CAVLC and CABAC in H.264/AVC are mutually exclusive, so both modules can be stored off chip and mapped to the same on-chip run address; before one of them runs, it is first copied on chip. This makes full use of the limited high-speed on-chip memory.
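
    Placing a hot function in its own section might look like the following. This is a minimal sketch under stated assumptions: the function avg_round_u8, the section name ".hot_code", and the memory range name ISRAM in the comment are all hypothetical and chosen only for illustration. The pragma is guarded so that the same file also compiles with a host compiler.

```c
#include <assert.h>
#include <stdint.h>

/* On the TI compiler, put the function into its own section instead of
   .text. The pragma must appear before the function is declared or
   defined. The section name ".hot_code" is hypothetical. */
#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(avg_round_u8, ".hot_code")
#endif

/* Rounded average of two pixels, a typical small function that is
   called very often during interpolation. */
uint8_t avg_round_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* The linker command file (.cmd) can then map the section to on-chip
   memory, for example:

       SECTIONS
       {
           .hot_code > ISRAM
       }

   where ISRAM is an assumed name for an internal memory range defined
   in the MEMORY directive of the same file. */
```

    Functions left in .text follow the default mapping, so only the modules identified by profiling need to be listed this way.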

    For the program area, also consider the hit rate of the level-1 program cache (L1P): functions that execute one after another should preferably be placed in program space in the same address order, and large processing functions should be split into smaller functions.

    Data area: In video codecs the data blocks are very large. A D1 4:2:0 image, for instance, occupies 622 KB, and encoding or decoding needs three to five or even more frame buffers, so the data generally cannot be kept on chip. The system's memory management therefore needs to enable the level-2 cache of the C64 series DSP (on the TMS320DM642, video codecs quite commonly configure 64 KB of L2 cache). In addition, video buffers that are placed off chip and mapped by the cache should be aligned to 128 bytes, because each L2 cache line of the C64 series DSP is 128 bytes; 128-byte alignment simplifies cache writeback and invalidation and the maintenance of coherence.
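
    The frame-size figure and the alignment rule can be checked with a few lines of C. This is a minimal sketch with illustrative names (yuv420_frame_bytes, align_up, is_line_aligned are assumptions, not TI APIs). Note that 622 KB corresponds to a 720×576 D1 frame: 720 × 576 × 1.5 = 622,080 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L2_LINE_BYTES 128u   /* C64x L2 cache line size, per the text */

/* Bytes in one 4:2:0 frame: a full-resolution luma plane plus two
   chroma planes at quarter resolution, i.e. 1.5 bytes per pixel. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    return width * height * 3u / 2u;
}

/* Rounds n up to the next multiple of align (a power of two). */
size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Returns 1 when the buffer starts on an L2 cache line boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % L2_LINE_BYTES) == 0;
}

/* On the TI compiler, a statically allocated buffer can be aligned with
   "#pragma DATA_ALIGN(symbol, 128)" placed before its definition. */
```

    Rounding each buffer size up with align_up(size, L2_LINE_BYTES) in addition to aligning its start address keeps two buffers from sharing a cache line, so a writeback or invalidate of one buffer does not touch its neighbor.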

    Table: Cases in which the system uses EDMA and the time each occupies the EDMA physical bus.

    EDMA resource allocation during algorithm system integration

    Video processing moves data frequently, and the C64 series DSP provides an EDMA with 64 logical channels, so configuring and using the EDMA well is very important to optimizing the system. The following steps help make full use of the system's EDMA resources.

    1. List every case in which the system needs to use EDMA and the approximate time each occupies the EDMA physical bus, as shown in the table. The data in the table apply under the following conditions: video enters through the video port (720×480, 4:2:0, 30 frames/second); audio (44 kHz sampling rate) enters the DSP through McBSP; the compressed bit rate is about 2 Mbps; the output data leaves through PCI as one 128-byte packet every 488 µs (PCI operating frequency 33 MHz); and the external SDRAM clock is 133 MHz.

    2. With this information, assign a priority to each EDMA channel in use, according to the real-time requirements of each stream and the size of its transfer blocks. Audio transfer blocks are generally small and occupy the EDMA bus only briefly, while video transfer blocks are larger and occupy it longer. Therefore, set the priority of the EDMA channel for audio input to Q0 (urgent), for video input to Q2 (medium), and for the output bitstream to Q1 (high); the QDMA transfers scheduled by the audio and video algorithm processing get Q3 (low). In a real system application these settings may of course need further adjustment.
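
    The two steps above can be sketched as a small planning table plus the rate arithmetic that follows from the stated conditions. This is a minimal sketch under stated assumptions: the enum, struct, and function names are hypothetical and are not TI Chip Support Library identifiers; only the queue assignments and the input figures come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* EDMA priority queues as named in the text (identifiers are hypothetical). */
typedef enum {
    Q0_URGENT = 0,
    Q1_HIGH   = 1,
    Q2_MEDIUM = 2,
    Q3_LOW    = 3
} edma_queue_t;

typedef struct {
    const char  *name;    /* which transfer this entry describes */
    edma_queue_t queue;   /* priority queue assigned to it       */
} edma_stream_t;

/* Priority plan from step 2: small, time-critical audio blocks first,
   large video blocks below the output bitstream, algorithm QDMA last. */
static const edma_stream_t kStreams[] = {
    { "audio in (McBSP)",      Q0_URGENT },
    { "bitstream out (PCI)",   Q1_HIGH   },
    { "video in (video port)", Q2_MEDIUM },
    { "algorithm QDMA",        Q3_LOW    },
};

/* Bytes per second entering through the video port for 4:2:0 video. */
uint32_t video_in_bytes_per_sec(uint32_t w, uint32_t h, uint32_t fps)
{
    return w * h * 3u / 2u * fps;
}

/* Bits per second leaving as fixed-size packets sent every period_us. */
uint32_t packet_bits_per_sec(uint32_t packet_bytes, uint32_t period_us)
{
    return (uint32_t)((uint64_t)packet_bytes * 8u * 1000000u / period_us);
}
```

    With the conditions given in step 1, the video input amounts to 15,552,000 bytes per second, and one 128-byte packet every 488 µs gives roughly 2.1 Mbps, consistent with the stated compressed rate of about 2 Mbps.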

    In practice, the optimization and integration of a TI DSP-based video algorithm follows the steps shown in Figure 2. First, configure the memory initially and select the appropriate compiler optimization options. If the compiled result already meets the real-time requirement, the remaining optimization steps can be skipped. Otherwise, optimize the memory and EDMA configuration to improve utilization of the cache and the internal bus. If the requirement is still not met, profile the whole project to identify the code sections or functions that consume the most CPU resources, and optimize these key modules, using linear assembly or even assembly, until the overall system meets the requirement.


    Thursday, July 2nd, 2009 at 23:08
Copyright © 51 Research and Design, Electronic Engineers website - Embedded Systems, MCU, DSP, EDA, Test and Measurement, Components, Communications, Power, Microelectronics, Semiconductors