Voice activity detection (VAD) is a technique that applies a specific decision criterion to identify the pauses and silent gaps in speech, and thereby to detect the active speech segments. With this technique, different classes of speech segments can be encoded with different numbers of bits without sacrificing speech quality, which lowers the speech coding rate. In a duplex mobile communication system each party is active only about 35% of the time [1], so reducing the coding rate during silence benefits transmission bandwidth, power consumption and system capacity. VAD therefore has considerable practical value in voice communication. With the emergence of multiple access techniques suited to variable bit rate speech coding, such as CDMA and PRMA, the importance of VAD in cellular applications has grown accordingly [2].
Because of the nature of voice communication, the detection process must meet real-time requirements. The parallelism of current mainstream DSP chips is limited, however, so under real-time constraints it is difficult both to preserve speech quality and to reduce the speech coding rate. A field programmable gate array (FPGA), whose hardware is flexibly programmable, can achieve a high degree of parallelism, and can therefore preserve speech quality and reduce the coding rate while still meeting the real-time requirement.
1 Algorithm and detection flow
1.1 Algorithm overview
A VAD algorithm can operate in the time domain or in the frequency domain. The algorithm used in this paper is a time-domain method. Its processing of the input signal consists of two parts, short-time energy detection and short-time zero-crossing rate detection, with the energy test as the primary criterion and the zero-crossing rate test as a supplement. According to its statistical properties, a speech signal can be divided into three classes: unvoiced sound, voiced sound and silence (including background noise). In this algorithm, short-time energy detection separates voiced sound from silence well. Unvoiced sound has low energy, so it falls below the energy threshold and the energy test misjudges it as silence; the short-time zero-crossing rate is used to distinguish unvoiced sound from silence. Combining the two tests detects speech segments (unvoiced and voiced) and silent segments.
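The two frame-level features can be sketched in Python as follows. This is a minimal illustration rather than the hardware implementation: the function names are hypothetical, the energy is the plain mean of squared samples, and a zero-valued sample is treated as positive when counting sign changes, which is an assumption.

```python
def short_time_energy(frame):
    """Average of the squared samples in one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossings(frame):
    """Number of sign changes between adjacent samples in one frame.

    Assumption: a sample equal to zero counts as positive.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
```

Voiced frames tend to give a large energy value, while unvoiced frames give a small energy value but many zero crossings, which is why the two features complement each other.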
1.2 Detection flow
The detection flow is as follows. The input signal is first high-pass filtered to attenuate the signal components that consist mainly of noise. A window with a length of 80 samples is then applied, the average energy of the frame is computed, and an initial VAD decision is made from the short-time energy: if the average energy is greater than the threshold, the frame is judged to be a speech frame; if it is smaller than the threshold, the frame is judged to be a silent frame. A frame initially judged to be silent is then passed through VAD smoothing, which refers to the previous three frames: if they contain at least one speech frame that was not itself obtained by smoothing, the current frame is smoothed to a speech frame and is recorded as a speech frame obtained by smoothing; otherwise it remains a silent frame. If the result after smoothing is still a silent frame and the zero-crossing count of the current frame lies between 30 and 70, the decision is changed to a speech frame; otherwise the frame is judged to be silent [3]. The detection flow of the VAD algorithm is shown in Figure 1.
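The decision sequence described above can be sketched as a small Python state machine. This is an illustrative model under stated assumptions, not the VHDL design: the class and parameter names are hypothetical, a frame whose energy equals the threshold is treated as silent, and a frame promoted to speech by the zero-crossing test is counted as a non-smoothed speech frame.

```python
from collections import deque


class VadDecision:
    """Per-frame decision: energy test, then smoothing, then zero-crossing test."""

    def __init__(self, threshold, zcr_low=30, zcr_high=70):
        self.threshold = threshold
        self.zcr_low = zcr_low
        self.zcr_high = zcr_high
        # One entry per previous frame: True if that frame was speech
        # and was NOT obtained by smoothing.
        self.history = deque([False] * 3, maxlen=3)

    def decide(self, energy, zcr):
        if energy > self.threshold:
            speech, smoothed = True, False      # initial energy decision
        elif any(self.history):
            speech, smoothed = True, True       # smoothed to speech
        elif self.zcr_low <= zcr <= self.zcr_high:
            speech, smoothed = True, False      # low-energy unvoiced frame
        else:
            speech, smoothed = False, False     # silence
        self.history.append(speech and not smoothed)
        return speech
```

With this model a single high-energy frame keeps the output at speech for the three following low-energy frames, after which the output falls back to silence unless the zero-crossing test fires.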

In addition, because human hearing exhibits a masking effect, the short-time energy threshold needs to be updated [3]. The update rule used in this algorithm is as follows. If three consecutive speech frames are detected, the threshold is raised by 3 dB so that silence is detected more reliably; however, if the raised threshold would exceed the average energy of the current frame minus 12 dB, it is not raised. If three consecutive silent frames are detected, the threshold is lowered by 3 dB so that speech is detected more reliably; however, if the lowered threshold would be smaller than the average energy of the current frame plus 12 dB, it is not lowered. To prevent the threshold from becoming too high or too low, it is also confined to the range from GATE_MIN to GATE_MAX.
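The update rule can be sketched in Python as below. The sketch works in dB for readability, whereas the hardware described later compares linear energy values. The numeric values of GATE_MIN_DB and GATE_MAX_DB are placeholders, since the paper does not state them, and the function name is hypothetical.

```python
GATE_MIN_DB = 20.0   # placeholder value, not given in the paper
GATE_MAX_DB = 60.0   # placeholder value, not given in the paper
STEP_DB = 3.0        # size of one threshold adjustment
MARGIN_DB = 12.0     # required distance from the current frame energy


def update_threshold(threshold_db, energy_db, last_three_speech):
    """Return the new threshold given the decisions for the last three frames."""
    new_threshold = threshold_db
    if all(last_three_speech):
        # Three speech frames in a row: try to raise the threshold.
        candidate = threshold_db + STEP_DB
        if candidate <= energy_db - MARGIN_DB:
            new_threshold = candidate
    elif not any(last_three_speech):
        # Three silent frames in a row: try to lower the threshold.
        candidate = threshold_db - STEP_DB
        if candidate >= energy_db + MARGIN_DB:
            new_threshold = candidate
    # Keep the threshold inside the allowed range.
    return min(max(new_threshold, GATE_MIN_DB), GATE_MAX_DB)
```

For example, with a threshold of 30 dB and a current speech frame at 50 dB, the threshold rises to 33 dB; with a speech frame at only 40 dB it stays at 30 dB, because 33 dB would exceed 40 dB minus 12 dB.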
2 System implementation and optimization
The design was developed with Quartus II and ModelSim (ModelSim is simulation software from Mentor Graphics). Quartus II is Altera's EDA software for FPGA/CPLD development. It covers the FPGA/CPLD development flow from design entry, functional simulation, synthesis and optimization, simulation, pin assignment and place-and-route through to device configuration, and it provides interfaces to other EDA tools such as ModelSim, Synplify/Synplify Pro and FPGA Compiler.
The input of the design is a 16-bit PCM-coded digital speech signal. The output is one detection result for every 80 samples, where a high level indicates speech and a low level indicates silence. Based on the characteristics of the algorithm, the design is divided into five modules: the FIFO module, the high-pass filter module, the average energy module, the decision module and the control module. The system structure is shown in Figure 2.

2.1 FIFO module
The input speech signal is sampled at 8 kHz. Running the whole system from an 8 kHz clock would largely waste the speed advantage of the FPGA, so the system needs two clocks: an 8 kHz sampling clock and a system master clock.
In FPGA design, multiple clock domains can introduce instability. To improve the stability of the system, this design uses a dual-port FIFO to isolate the clock domains. The FIFO module has a 16-bit data input port, a 16-bit data output port, an 8 kHz clock input and a system master clock input. Because the FIFO is read faster than it is written, it must also output an empty signal when it holds no data.
The other four modules (high-pass filter, average energy computation, decision and control) can use a single-clock design, all driven by the system master clock.
2.2 Filter module
The filter module high-pass filters the input signal as a preprocessing step. Its transfer function is the high-pass transfer function used by the CS-ACELP algorithm [4].
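As an illustration, a second-order high-pass section of this kind can be sketched in Python as a direct-form difference equation. The coefficients below are those commonly quoted for the pre-processing filter of Recommendation G.729 and should be checked against [4]. The function name and the floating-point arithmetic are assumptions made for illustration; the hardware in this paper uses quantized coefficients.

```python
# H(z) = (B0 + B1*z^-1 + B2*z^-2) / (1 + A1*z^-1 + A2*z^-2)
# Coefficients as commonly quoted for the G.729 pre-processing filter;
# verify against the Recommendation before relying on them.
B0, B1, B2 = 0.46363718, -0.92724705, 0.46363718
A1, A2 = -1.9059465, 0.9114024


def high_pass(samples):
    """Filter a sequence of samples with the second-order IIR section."""
    x1 = x2 = y1 = y2 = 0.0          # delayed inputs and outputs
    out = []
    for x0 in samples:
        # Five multiplications per output sample.
        y0 = B0 * x0 + B1 * x1 + B2 * x2 - A1 * y1 - A2 * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

Each output sample needs five multiplications, which a parallel structure can perform with five multipliers at once and a serial structure can perform with one multiplier over five steps.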

In FPGA designs, an IIR filter is usually implemented with the pipelined structure shown in Figure 3 (a second-order IIR filter is taken as the example). A filter with this structure completes one filter computation per clock cycle [5~6] and is highly parallel, but in hardware it needs 5 multipliers, 4 accumulators and a number of registers, which consumes considerable resources. The structure of a non-pipelined filter (again a second-order IIR filter) is shown in Figure 4, where fifo_out is the output data of the FIFO module, empty is the flag indicating whether the FIFO is empty, and ready_out is the flag indicating that a filter computation has completed. A filter with this structure completes one filter computation every 5 clock cycles and has low parallelism, but in hardware it needs only 1 multiplier, 1 accumulator, 1 counter and some registers.


Because the signal in this algorithm is sampled at only 8 kHz, the processing speed of the non-pipelined filter is sufficient. To use resources economically, this design therefore adopts the non-pipelined structure. For synchronization and stability, the filter is designed to complete one filter operation and latch the result every 8 clock cycles. Table 1 compares the two filter structures (coefficients quantized to 18 bits, that is, 2 integer bits plus 16 fractional bits) implemented on the same device. The device is an Altera Cyclone II EP2C5T144C7, the synthesis tool is Quartus II 5.0, and the optimization option is Balanced. Table 1 shows that although the non-pipelined filter is slower than the pipelined one, it uses far fewer resources and completes one filter computation in 101.61 ns, which meets the real-time requirement. The processing delay of this module is 8 clock cycles.
2.3 Windowing and average energy computation module
(1) Formulas
The high-pass filtered signal is multiplied by a Hamming window with a length of 80 samples, according to:

y(i) = x(i) × [0.54 - 0.46 × cos(2πi/79)], i = 0, 1, …, 79

where x(i) is the high-pass filtered signal and y(i) is the windowed signal.
The average energy of the windowed signal is computed as:

E_average = (1/80) × [y(0)² + y(1)² + … + y(79)²]

where y(0), y(1), …, y(79) are the windowed samples and E_average is the average energy of the frame.

Windowing involves a cosine operation. To save resources and increase the processing speed, the values of 0.54 - 0.46 × cos(2πi/79) are obtained from a lookup table.
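The lookup-table approach can be sketched in Python as follows. The table is computed once, so the per-frame work reduces to one multiplication by a table entry, one squaring and one accumulation per sample. The names are hypothetical, and floating-point values are used here for clarity, whereas a hardware table would hold quantized values.

```python
import math

FRAME_LENGTH = 80

# Hamming window coefficients, computed once and then only looked up.
HAMMING = [
    0.54 - 0.46 * math.cos(2 * math.pi * i / (FRAME_LENGTH - 1))
    for i in range(FRAME_LENGTH)
]


def average_energy(frame):
    """Window one 80-sample frame and return its average energy."""
    if len(frame) != FRAME_LENGTH:
        raise ValueError("frame must contain exactly 80 samples")
    acc = 0.0
    for x, w in zip(frame, HAMMING):
        y = x * w          # windowed sample
        acc += y * y       # accumulate the square
    return acc / FRAME_LENGTH
```

The table is symmetric about its centre, so a hardware table could in principle store only half of the entries; whether the design in this paper does so is not stated.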
(2) Implementation of the squarer
Because squaring has special properties compared with ordinary multiplication, the hardware squarer uses the following algorithm to reduce hardware resources and increase the operating speed:
Let the binary representation of X be I(n-1) … I(1) I(0), and let I(ij) denote the product of bit i and bit j. In a squaring operation I(ij) = I(ji), so I(ij) + I(ji) = 2I(ij). Taking the square of a 4-bit number as an example (see Figure 5), identical terms can be merged and the merged term shifted left by one bit (equivalent to multiplying by 2), which reduces the number of partial products [7]. The resulting partial products are compressed into two rows by a Wallace tree, and a carry-lookahead adder then produces the final output.
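The merging of symmetric partial products can be checked with a short Python model. It covers only the folding step: each diagonal term (bit i with itself) has weight 2^(2i), and each merged off-diagonal pair (bits i and j, i < j) has weight 2^(i+j+1). The Wallace tree and the carry-lookahead adder are not modelled, and the function names are hypothetical.

```python
def folded_partial_products(x, n_bits):
    """Return the non-zero partial products of x*x after merging I(ij) and I(ji)."""
    bits = [(x >> i) & 1 for i in range(n_bits)]
    terms = []
    for i in range(n_bits):
        if bits[i]:
            # Diagonal term: a bit multiplied by itself is the bit.
            terms.append(1 << (2 * i))
        for j in range(i + 1, n_bits):
            if bits[i] & bits[j]:
                # Merged pair 2 * I(ij): one extra left shift.
                terms.append(1 << (i + j + 1))
    return terms


def square(x, n_bits):
    """Square x by summing the folded partial products."""
    return sum(folded_partial_products(x, n_bits))
```

For a 4-bit operand the folded form has at most 10 product terms instead of the 16 of a general 4-by-4 multiplier, and summing them reproduces x squared for every operand.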

Because the average energy computed by this module is used only for comparison with the threshold in the speech decision module, neither the average energy nor the threshold is converted to dB, which saves hardware resources. Matlab simulation confirmed that this does not affect the final decision. The processing delay of the windowing and average energy computation module is 5 clock cycles. Figure 6 shows the structure of this module, in which ready_out is the data-ready signal that the high-pass filter module outputs after completing a filter computation, and acc_clken is the clock enable signal of the accumulator.

2.4 Speech decision module
This module decides whether the current frame is a speech frame from the computed average energy of the current frame and the status of the preceding frames. Four flag bits, frame_attribute[2:0] and smooth, determine whether smoothing is required: frame_attribute[2:0] records the attributes of the previous three frames, and smooth records whether the previous three frames contain a non-smoothed speech frame. The processing delay of this module is 1 clock cycle.
2.5 Control module
The control module controls the operation of the high-pass filter, windowing, average energy computation and speech decision modules, and updates the threshold according to the actual conditions.
2.6 Synthesis results
Table 2 gives the synthesis results of this design on two FPGA devices.

The synthesis results show that the design occupies few hardware resources and can be implemented on a low-cost FPGA (with cost in mind, the Cyclone II EP2C5T144C7 was selected). The design can therefore also be combined with other digital speech processing modules to form a complete speech processing chip.
2.7 Simulation results and analysis
Figure 7 shows the ModelSim simulation result. The last signal in the figure is the detection result: a high level indicates speech and a low level indicates silence. The simulation shows that the FPGA design meets the accuracy and real-time requirements.

From the analysis of the individual modules above, the design outputs its decision 14 clock cycles after a sample has been acquired.
This paper has presented an FPGA implementation of a VAD algorithm based on short-time energy and short-time zero-crossing rate. The whole system is described in VHDL and has been simulated, which confirmed the correctness of the design. The system clock can reach 46.22 MHz, so the detection result is output within 302.90 ns of acquiring a sample, which meets the real-time requirement. Because the design is described in VHDL, it is portable; and because it uses few hardware resources, it can also be used as a module in other systems.
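The latency figure follows directly from the per-module delays and the clock rate reported above, as the short Python calculation below shows. The constant names are hypothetical; the numbers are those given in this paper.

```python
FILTER_CYCLES = 8          # high-pass filter module delay
WINDOW_ENERGY_CYCLES = 5   # windowing and average energy module delay
DECISION_CYCLES = 1        # speech decision module delay
CLOCK_HZ = 46.22e6         # maximum system clock rate
SAMPLE_RATE_HZ = 8000      # speech sampling rate

total_cycles = FILTER_CYCLES + WINDOW_ENERGY_CYCLES + DECISION_CYCLES
latency_ns = total_cycles / CLOCK_HZ * 1e9
sample_period_ns = 1e9 / SAMPLE_RATE_HZ

print(total_cycles)             # 14
print(round(latency_ns, 2))     # 302.9
print(sample_period_ns)         # 125000.0
```

The latency of about 0.3 microseconds is several hundred times shorter than the 125 microsecond sampling period, which is the sense in which the design meets the real-time requirement.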
References
1 BRADY P T. A technique for investigating on-off patterns of speech[J]. Bell Syst Tech J, 1965; (44): 1~22
2 GERSHO A, PAKSOY E. An overview of variable rate speech coding for cellular networks[A]. IEEE Conf on Selected Topics in Wireless Commun[C]. Vancouver, 1992; 172~175
3 Wu Zhiyong. Research and implementation of speech compression codecs in VoIP. Master's thesis. Nankai University, 2003
4 ITU-T Rec. G.729, Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP)[S]. 1996
5 Parhi K K. VLSI digital signal processing systems: design and implementation. Beijing: China Machine Press, 2003
6 Kuo S M, Lee B H; translated by Lu Boying. Real-time digital signal processing. Beijing: China Railway Publishing House, 2004
7 Han Yan, Yao Qingdong. Hardware implementation of squaring in digital ASICs. Journal of Electronics, 1996; 18(6)