1 Introduction
As DSP technology continues to develop and mature, digital signal processing is being applied ever more widely: it can be found in industrial control, computers, communications and consumer electronics. In recent years, with the rapid growth of multimedia communication, DSPs have increasingly been used there as well, mainly for speech compression and image processing. These tasks are computationally demanding. Low-speed DSPs have difficulty meeting real-time communication requirements, while high-speed DSPs raise the cost considerably, so code optimization is a commonly used approach in current DSP development. Because of the DSP's specialized architecture, compiled code is relatively inefficient and does not fully exploit the DSP's computing power. The code must therefore be hand-optimized in assembly according to the DSP's architecture and instruction set.
This article draws on the author's experience optimizing the G.729 algorithm on TI's TMS320VC5402 DSP and proposes some optimization methods and suggestions. These methods also apply to other DSPs in the C54x series.
2 Chip overview
The TMS320C54x is a generation of fixed-point digital signal processors that TI introduced in 1996. Its low power consumption and high degree of parallelism let it meet real-time processing requirements in telecommunications and many other fields. The C54x series includes several chip models that share the same architecture and differ only in their interfaces and memory size. The TMS320VC5402 is one of the most widely used chips in the series. Taking it as an example, the performance characteristics of the C54x DSPs are:
* operating speed of up to 100 MIPS
* an advanced multi-bus architecture with three 16-bit data memory buses and one program memory bus
* a 40-bit arithmetic logic unit (ALU), including a 40-bit barrel shifter and two 40-bit accumulators
* a 17-bit × 17-bit multiplier with a dedicated 40-bit adder, supporting 16-bit signed and unsigned multiplication
* eight auxiliary registers and a software stack
* an improved Harvard architecture internally: program space and data space are separate, so an instruction and an operand can be fetched at the same time, and data can be transferred between program space and data space
* up to 64K × 16-bit external data space and up to 1M × 16-bit external program space, with 4K × 16-bit on-chip ROM and 16K × 16-bit on-chip RAM
* an on-chip programmable wait-state generator, a phase-locked loop (PLL) clock generator, two multichannel buffered serial ports, an 8-bit parallel host-port interface (HPI) for communicating with an external processor, two 16-bit timers and a six-channel DMA controller
* support for single-instruction repeat and block repeat, and a six-stage pipeline: executing an instruction takes several steps (instruction fetch, decode, operand fetch, execute and so on), and the pipeline carries out these steps for different instructions at the same time, which minimizes the effective instruction cycle and suits algorithm optimization
3 Code optimization
Hand assembly optimization of C code can be approached in the following ways: 1. Write the assembly code by hand, working from the C code. This yields very efficient code, but development is very difficult, and when the code base is large and its structure complex, errors are easily made during optimization. 2. Have the compiler generate assembly code first, then rewrite that assembly code. This yields lower efficiency, because the framework of the code is already fixed, but development is easier and errors are less likely.
Because the audio and image processing algorithms in common use today are structurally very complex programs, the second method is recommended.
3.1 Generating the assembly code
TI provides DSP developers with an integrated development platform called CCS (Code Composer Studio). Its compiler translates a C program into a DSP assembly-language program, which is then linked into a COFF-format .out file that can run on the DSP.

CCS also provides an optimizer that can optimize the C code and generate the assembly-language program. The process is shown in Figure 1.
CCS provides four levels of optimization, O0, O1, O2 and O3, which are described below.
(1) O0: register level
* simplifies control flow
* allocates variables to registers
* performs loop rotation
* eliminates unused code
* simplifies expressions and statements
* expands calls to functions declared inline
(2) O1: local level
Performs all O0 optimizations, plus:
* performs local constant propagation
* removes unused assignments
* eliminates local common expressions
(3) O2: function level
Performs all O1 optimizations, plus:
* performs loop optimizations
* eliminates global common subexpressions
* eliminates global unused assignments
* performs loop unrolling
(4) O3: file level
Performs all O2 optimizations, plus:
* removes functions that are never called
* simplifies functions whose return values are never used
* inlines calls to small functions
* reorders function declarations so that the attributes of a called function are known when the calling function is optimized
* identifies file-level variable characteristics
When O3 optimization is used, further options allow finer control:
* OLn: tells the optimizer how the file uses standard library functions
* ONn: creates an optimization information file
* PM: performs program-level optimization, compiling multiple source files together
For our optimization we chose the O2 level, because the assembly file generated at O2 contains a good deal of annotation, which makes the program easier to follow. This level is recommended for anyone who is not thoroughly familiar with the program or not very proficient in assembly language.
3.2 Manual assembly optimization
Assembly language is hard to read and the amount of code is large, so manual optimization is laborious and error-prone. To make sure the optimization introduces no errors, we first prepare a test sequence as program input and run the original program on it to produce a correct result sequence. To check a manual optimization, we run the optimized program on the same test sequence and compare its output with the correct result sequence. If the two are identical, the optimization is correct. The test sequence must be fairly long, however, because some errors do not show up at the start: they accumulate slowly and appear only after the program has run for a while.
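The check described above can be sketched in C as follows. This is only an illustration under stated assumptions: `process_reference` and `process_optimized` are hypothetical stand-ins (a saturating running sum, and a version in which saturation was deliberately dropped), not the G.729 code, and `first_mismatch` is a helper name introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the original program:
   a running sum saturated to the 16-bit range. */
void process_reference(const int16_t *in, int16_t *out, size_t n)
{
    int32_t acc = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        acc += in[i];
        if (acc > INT16_MAX) acc = INT16_MAX;
        if (acc < INT16_MIN) acc = INT16_MIN;
        out[i] = (int16_t)acc;
    }
}

/* Hypothetical stand-in for a hand-optimized version with a deliberate
   bug: saturation was dropped, so the sum wraps around at 16 bits. */
void process_optimized(const int16_t *in, int16_t *out, size_t n)
{
    int32_t acc = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        acc += in[i];
        /* wrap the sum into the range -32768..32767 */
        acc = (int32_t)((uint32_t)(acc + 32768) & 0xFFFFu) - 32768;
        out[i] = (int16_t)acc;
    }
}

/* Returns the index of the first differing sample, or -1 if the
   first n samples of both result sequences are identical. */
long first_mismatch(const int16_t *ref, const int16_t *opt, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ref[i] != opt[i])
            return (long)i;
    }
    return -1;
}
```

With a constant input of 100, the two versions agree until the sum first exceeds 32767, so a 100-sample comparison passes while a 1000-sample comparison exposes the error. This is the reason for using a long test sequence.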
Manual optimization can then begin. Below are some lessons from the author's experience of manual optimization.
(1) Make as few function calls as possible. A function call pushes the PC onto the stack, along with some registers, and pops them again on return. These operations are avoidable overhead, so small functions should not be called but written directly into the calling function. This removes the push and pop operations and increases speed.
(2) When optimizing loops, move operations out of the loop wherever possible to reduce the number of times they execute. For example, some assignments and initializations can be moved outside the loop to increase speed.
(3) Remove redundant assignments. Compiler-generated code contains many assignments. It often assigns a value to a register and then assigns it again to a variable, which is redundant.
(4) Use RPT and RPTB for loops wherever possible. In compiler-generated code many loops are implemented with conditional tests, which adds a lot of unnecessary test code, whereas the C54x provides dedicated repeat instructions: RPT and RPTB. RPT repeats the next instruction. The number of repetitions is determined by the RC register and equals the value of RC plus 1, so the loop count minus 1 must be loaded into RC before the loop is executed. RPTB is the block-repeat instruction, which repeats a section of instructions. Its count is determined by the BRC register and equals the value of BRC plus 1, so the loop count minus 1 must be loaded into BRC before use.
(5) Use the faster addressing modes. Digital signal processing performs large numbers of operations on large amounts of data, and using the faster addressing modes greatly reduces the number of instruction cycles. Because the data are mostly stored contiguously, we address them through a register and, after each operation, increment the register by 1 so that it points to the next data item. Addressing in this way saves many instruction cycles.
(6) Use circular buffers. Common operations such as FFT and FIR need to shift data. When the amount of data is large, the program spends a great deal of time on data shifting. With a circular buffer these shifts are unnecessary, which increases speed.
(7) Use special instructions. The C54x instruction set has special instructions for particular operations, such as squaring and FIR filtering. Implementing these with other instructions takes many instruction cycles, whereas the special instruction needs only one.
(8) Use parallel instructions. Because of the DSP's pipeline structure, some instructions can execute at the same time, which gives rise to parallel instructions. Using them greatly reduces the number of instruction cycles.
(9) Place frequently used code and data in on-chip RAM. DSP chips generally have on-chip RAM, and accessing it is one to two times faster than accessing off-chip RAM, so placing frequently used code and data on chip greatly increases execution speed.
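Point (6) can be illustrated with a minimal C sketch that compares an FIR delay line updated by shifting with one kept in a circular buffer. The filter length `NTAPS`, the type names and the function names are assumptions made for this illustration. The explicit modulo only models the wrap-around, which on the C54x is handled by the circular addressing hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTAPS 4   /* illustrative filter length (assumption) */

/* Delay line updated by shifting: each new sample moves NTAPS-1 old ones. */
typedef struct { int16_t x[NTAPS]; } fir_shift_t;

int32_t fir_shift_step(fir_shift_t *s, const int16_t *h, int16_t in)
{
    int32_t acc = 0;
    assert(s != NULL && h != NULL);
    for (size_t k = NTAPS - 1; k > 0; k--)
        s->x[k] = s->x[k - 1];          /* the data shifting to be avoided */
    s->x[0] = in;
    for (size_t k = 0; k < NTAPS; k++)
        acc += (int32_t)h[k] * s->x[k];
    return acc;
}

/* Circular buffer: one store per sample, only the index moves. */
typedef struct { int16_t x[NTAPS]; size_t head; } fir_circ_t;

int32_t fir_circ_step(fir_circ_t *s, const int16_t *h, int16_t in)
{
    int32_t acc = 0;
    size_t idx;
    assert(s != NULL && h != NULL);
    s->head = (s->head + NTAPS - 1) % NTAPS;  /* step back one slot */
    s->x[s->head] = in;                       /* overwrite the oldest sample */
    idx = s->head;
    for (size_t k = 0; k < NTAPS; k++) {
        acc += (int32_t)h[k] * s->x[idx];     /* newest sample pairs with h[0] */
        idx = (idx + 1) % NTAPS;              /* wrap around the buffer end */
    }
    return acc;
}
```

Both functions produce the same output for the same input, but the circular version writes one sample per call instead of moving the whole delay line.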
3.3 Common problems during optimization
Many problems arise during manual optimization. The following are fairly common.
(1) Setting certain registers. In manual optimization some registers must be set explicitly, for example ST0, ST1 and PMST, and different settings can produce different results. The bits used most often are SXM, OVM and FRCT. SXM is the sign-extension mode bit: if SXM = 0, sign extension is not performed; if SXM = 1, it is performed (see Figure 2a). OVM is the overflow mode bit: when an overflow occurs, if OVM = 0 the overflowed result is loaded into the destination register as it is, and if OVM = 1 the destination register is loaded with the largest positive value (007FFFFFFFh) or the most negative value (FF80000000h). FRCT is the fractional mode bit: when FRCT = 1, the multiplication result is shifted left by one bit (see Figure 2b). These three flag bits are set and cleared with the SSBX and RSBX instructions.
(2) Watch for pipeline conflicts. The 5402 has a six-stage instruction pipeline. The six stages are independent of one another, so in any machine cycle between one and six different instructions may be active. The stages are prefetch, fetch, decode, access, read and execute. The multistage pipeline lets several instructions access CPU resources at the same time, and if several pipeline stages access the same resource simultaneously, a pipeline conflict can occur. Some conflicts are resolved automatically by the CPU, which delays the access. Others cannot be resolved automatically and must be solved in the program by rearranging instructions or inserting no-operation instructions. When a C program is compiled with the CCS compiler, the compiler inserts NOP instructions automatically to resolve pipeline conflicts, but in manual optimization this problem needs special attention. Most pipeline conflicts arise from simultaneous access to certain registers and can be solved by inserting the corresponding NOP instructions according to the latency table.
(3) Saving certain parameters. During manual optimization we use certain registers to pass data. If another function is called in the meantime, the values of these registers may change, so before such a call the values must be pushed onto the stack, and popped to restore them after the call. The same applies to certain flag bits: the called function may change these status flags, so they must be restored after the call.
(4) Circular buffer allocation. A circular buffer must be aligned: a buffer of length R must start on an N-bit address boundary (that is, the N least significant bits of the buffer's base address must be 0), where N is the smallest integer satisfying 2^N > R. For example, a circular buffer of length R = 31 must start at an address of the binary form XXX0 0000 (N = 5, since 2^5 > 31, so the lowest 5 bits of the address are 0).
(5) Memory overrun. Because the DSP uses a Harvard architecture, data space and program space are separate, and data operations generally cannot affect the program. On-chip RAM, however, is shared between data and program, so if an operation on data in this region overruns its bounds, it can overwrite the program and cause it to run wild. If the program runs wild, therefore, consider whether memory has been overrun.
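The alignment rule in point (4) can be checked with a short C sketch. The function names are hypothetical and introduced only for this illustration. In a real project the alignment would normally be requested from the linker rather than computed at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest N such that 2^N > R, for a buffer of length R. */
unsigned circ_align_bits(uint16_t r)
{
    unsigned n = 0;
    assert(r > 0);
    while ((1UL << n) <= r)
        n++;
    return n;
}

/* Returns 1 if the N least significant bits of base are all 0. */
int circ_base_ok(uint16_t base, uint16_t r)
{
    uint32_t mask = (uint32_t)((1UL << circ_align_bits(r)) - 1);
    return (base & mask) == 0;
}

/* Rounds addr up to the next valid base address for a buffer of length R.
   The result is 32 bits wide because rounding up can pass 0xFFFF. */
uint32_t circ_align_up(uint16_t addr, uint16_t r)
{
    uint32_t mask = (uint32_t)((1UL << circ_align_bits(r)) - 1);
    return ((uint32_t)addr + mask) & ~mask;
}
```

For R = 31 this gives N = 5, so 0x0060 is a valid base address and 0x0070 is not. For R = 32 the strict inequality gives N = 6.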
4 Conclusion
The experience and techniques above were gathered by the author in real DSP projects, and practice has shown that they help in actual development. Taking the author's optimization of the G.729 algorithm as an example, the computational load was 1000 MIPS before optimization and 30 MIPS afterwards, a reduction by a factor of about 30, so the effect of the optimization is clear. These lessons are aimed mainly at TI's C54x series, but they can also serve as a reference for other DSP models.