This article uses the design of a 16-QAM RF transmit data pump as an example to introduce techniques for designing digital filters in FPGAs and methods for choosing components, and it explains where an FPGA outperforms a DSP once distributed arithmetic is applied.
- Basic structure of all digital logic
- 16-QAM modulator
- Encoding and symbol mapping
- Square-root raised-cosine filter
- Design strategy
- 5 MHz carrier
- Distributed arithmetic (DA)
- Filter implementation
Software radios and modems designed with field-programmable gate arrays (FPGAs) can rival those built around DSP chips. Although an FPGA can easily implement complex logic functions such as a convolutional encoder, it has had a serious weakness when it comes to large volumes of complex arithmetic. Even a matrix multiplier implemented on the fastest FPGA cannot match the cost and performance of a DSP chip costing only 5 US dollars. As long as designs were produced with CAD tools, the DSP remained the chip of first choice, but with the application of distributed arithmetic (DA), the FPGA has regained designers' favor.
One characteristic of the FPGA is its flexible architecture. In practice, the functional blocks of a wireless or modem datapath map readily onto independent, parallel hardware nodes. With a digital signal processor that can only run in time-shared fashion, scheduling many time-critical tasks requires very complex programming; an FPGA avoids this problem.
We introduce the characteristics of the FPGA while designing a 16-QAM RF transmit data pump, and describe in detail how the datapath functional blocks convert conveniently into logic circuits of the Xilinx 4000 series FPGA, so that the amount of logic required can be estimated accurately. A 16-QAM data pump design that meets the same system requirements and uses the same type of FPGA has been published in the open literature, but the reported logic count appears to be much larger than actually needed. In the rush to market, a product may well not be designed with CAD tools at all. Nor does total reliance on CAD tools always yield the best solution; that takes a great deal of hard work, experience, and creativity.
Basic structure of all digital logic
Any digital logic can be built from a sufficient number of general-purpose gates such as NAND and NOR gates, and an FPGA has plenty of them. The logic of the Xilinx 4000 series takes the form of a truth table or, more generally, a 16-word x 1-bit lookup table (LUT), which can implement any Boolean function of four input variables (the LUT address lines). Because generating such a function is usually equivalent to a combination of several NAND gates, the LUT is regarded as the basic logic unit. A Xilinx 4000 series configurable logic block (CLB) contains two 16-word LUTs, which can be combined to form any Boolean function of five input variables. The LUTs can also be configured as two 16 x 1 RAMs or one 32 x 1 RAM.
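The idea of a LUT as a programmable truth table can be sketched in a few lines of Python. This is only an illustrative model, not Xilinx tooling: the names `make_lut4` and `lut4_read` are hypothetical, and the bit ordering of the address lines is an assumption.

```python
def make_lut4(func):
    """Fill a 16-word x 1-bit table from any Boolean function of four inputs."""
    table = []
    for address in range(16):
        a = (address >> 0) & 1
        b = (address >> 1) & 1
        c = (address >> 2) & 1
        d = (address >> 3) & 1
        table.append(int(bool(func(a, b, c, d))))
    return table


def lut4_read(table, a, b, c, d):
    """Evaluate the function by using the four inputs as address lines."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


# Two example functions stored in identical hardware: only the contents differ.
nand4 = make_lut4(lambda a, b, c, d: not (a and b and c and d))
parity4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Whatever the function, the cost is the same 16 stored bits, which is why the LUT rather than the individual gate is the natural unit for counting logic.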
The CLBs are arranged in a two-dimensional square array, and both the CLBs and the interconnect between them can be configured individually. The smallest device, the XC4002, contains an 8 x 8 CLB matrix; the largest, the XC4085XL, contains a 48 x 48 CLB matrix. Each LUT is connected to a flip-flop that can run at up to 100 MHz.
16-QAM modulator
The 16-QAM modulator comprises the essential functional blocks of the RF transmit data pump (see Figure 1). The 20-Mbps serial data is divided into 4-bit groups (symbols), which are delivered in parallel to a differential encoder and symbol mapper at 5 million symbols per second. The mapper outputs a pair of 3-bit quadrature components. These components are pulse-shaped by a pair of square-root raised-cosine filters, which interpolate them to 20 million samples per second, and are then modulated onto the 5 MHz carrier; the two outputs are summed and passed to digital-to-analog conversion. The key to the design is the pair of interpolating pulse-shaping filters.
To implement this design efficiently, the encoding and mapping blocks and the 5 MHz modulator must also be taken into account when determining the total gate count.
Encoding and symbol mapping
In determining the logic count for the encoder and signal mapper, we can borrow from standard modem designs. The V.32 encoder, for example, includes a differential encoder that provides protection against 180-degree phase ambiguity and a convolutional encoder that adds redundancy to reduce the receiver's bit error rate (BER). The encoder and mapper are implemented as a finite state machine: all states are held in five registers (2.5 CLBs), and the connecting logic consists of eight two-input exclusive-OR gates (4 CLBs) and three two-input AND gates (1.5 CLBs). In this 16-QAM transmitter, a serial-to-parallel conversion register (2 CLBs) captures four 20-Mbps serial bits to form a 4-bit symbol, so the encoder processes a data stream reduced to 5 million symbols per second, a rate the CLBs handle easily. Datapath control requires clocking the registers along the datapath and needs fewer than 15 CLBs. The 5-bit encoded output symbol drives the address lines of the mapper, which is then very simple: it is a pair of LUTs with 3-bit outputs.
These outputs (I and Q) serve as quadrature components that place each symbol on a two-dimensional plane (the constellation). Of the 64 intersection points, only 16 represent valid symbol positions. The mapper size is 32 words x 3 bits x 2, i.e. 6 CLBs. The CLB total for these functional blocks is 31.
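The CLB figures quoted above can be tallied as a quick check. The sketch below only restates the numbers from the text; the dictionary keys are descriptive labels chosen for this example, and the datapath control entry uses the stated upper bound of 15 CLBs.

```python
# CLB estimates for the encoder and mapper, as quoted in the text.
encoder_clbs = {
    "state registers (5)": 2.5,
    "two-input XOR gates (8)": 4.0,
    "two-input AND gates (3)": 1.5,
    "serial-to-parallel register": 2.0,
    "datapath control (upper bound)": 15.0,
    "I/Q mapper LUTs": 6.0,
}
total_encoder_clbs = sum(encoder_clbs.values())

# Each CLB provides 32 words x 1 bit of LUT storage (two 16-word LUTs),
# so a 32-word x 3-bit x 2-channel mapper needs 6 CLBs.
mapper_clbs = (32 * 3 * 2) / 32

# Grouping the 20-Mbps bit stream into 4-bit symbols gives the symbol rate.
symbol_rate = 20_000_000 // 4

print(total_encoder_clbs, mapper_clbs, symbol_rate)
```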
Square-root raised-cosine filter
The square-root raised-cosine filter is a practical way to suppress intersymbol interference within the limited bandwidth of the transmission path. The spectral shaping is split between the transmitter and the receiver, each of which implements a square-root raised-cosine filter. The filter shape and coefficients were developed with the aid of the QEDesign 1000 software. Figure 2 shows the response of a 32-tap finite impulse response (FIR) filter computed in 12-bit fixed point. We will use a 12-bit filter model and determine its gate count. (With 12-bit quantization the QEDesign program needs only 28 symmetric coefficients, but this design uses the full 32-tap symmetric FIR filter.)
Design strategy
The square-root raised-cosine filters shape the spectra of the I and Q channels. I and Q samples arrive at 5 million samples per second, and the filter delivers 20 million samples per second to the modulator, so the filter acts as a 1:4 interpolator. The corresponding computational load (exploiting coefficient symmetry) is 2 channels x 16 symmetric taps x 20 million samples per second = 640 million multiply-accumulate operations per second. This rate greatly exceeds the speed of most fixed-point DSP chips. The FPGA therefore becomes a very attractive choice, but a filter form must be chosen that maps efficiently onto a CLB-based design.
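The load figure follows directly from the rates given in the text, as the short calculation below shows. The variable names are chosen for this example only.

```python
channels = 2                 # I and Q
symmetric_taps = 16          # 32-tap filter with symmetric coefficients
input_rate = 5_000_000       # samples per second into each filter
output_rate = 20_000_000     # samples per second out of each filter

interpolation_factor = output_rate // input_rate
macs_per_second = channels * symmetric_taps * output_rate

# Time available per input symbol and per output sample, in nanoseconds.
symbol_period_ns = 1_000_000_000 // input_rate
sample_period_ns = 1_000_000_000 // output_rate

print(interpolation_factor, macs_per_second, symbol_period_ns, sample_period_ns)
```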
Many logic configurations, or forms, can implement an FIR filter. The main ones are the direct form (a commonly used software model), the transposed form (implemented in dedicated filter chips), and the polyphase filter (suited to multirate applications). These forms, however, cannot use coefficient symmetry to reduce the multiplication load. The trick in designing a multirate filter is to trace the signal paths in the sample-coefficient plane.
The vertical axis represents the samples and the horizontal axis the coefficients; the plot shows the data track and, rotated by 90 degrees, the filter response. Because the coefficients are symmetric, only half of the filter coefficients need be listed. The interpolation factor is K, i.e. K-1 zeros are inserted between input samples, which yields the V-shaped path of the 32-tap FIR filter. Although the input samples are spaced 200 ns apart, a new path point must be produced every 50 ns.
Two computation models can be derived from this diagram. The first is a variant of the transposed form: the products of each nonzero input sample with all 32 coefficients are added into partial-sum registers. After the 32 products have been summed and a complete filter response has been output, the multiply-accumulate circuit can be used to compute a new path. Here, 32 MAC operations are carried out every 200 ns. The second model is delay-and-add, i.e. the direct form of the FIR filter. As the filter path shows, eight stored sample values are needed to compute one filter response. By computing five consecutive filter responses, we can observe the pattern given in Table 1.
Four consecutive 20 MHz responses can be computed from the same group of eight input samples, and only two sets of filter coefficients are used. For the third and fourth responses (y_d and y_e), the order of the filter coefficients relative to the sample data is reversed. Can these response equations be mapped onto an efficient FPGA circuit? They certainly can! The key is to apply distributed arithmetic, an algorithm that none of the current design tools offers. Before the set of response equations is implemented, one simplification can be made first.
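The coefficient pattern described above can be checked numerically. The sketch below compares a zero-stuffing reference interpolator with a version that computes four outputs from the same stored samples using four 8-tap coefficient subsets. The coefficient values are placeholders, not the article's filter, and the function names are hypothetical.

```python
def interpolate_direct(h, x, k=4):
    """Reference model: insert k-1 zeros after each sample, then FIR filter."""
    stuffed = []
    for sample in x:
        stuffed.append(sample)
        stuffed.extend([0] * (k - 1))
    out = []
    for n in range(len(stuffed)):
        acc = 0
        for j, coeff in enumerate(h):
            if n - j >= 0:
                acc += coeff * stuffed[n - j]
        out.append(acc)
    return out


def interpolate_by_subsets(h, x, k=4):
    """Compute k outputs per input sample from the same stored samples."""
    subsets = [h[p::k] for p in range(k)]   # k subsets of len(h)//k taps
    out = []
    for m in range(len(x)):
        for p in range(k):
            acc = 0
            for j, coeff in enumerate(subsets[p]):
                if m - j >= 0:
                    acc += coeff * x[m - j]
            out.append(acc)
    return out


half = list(range(1, 17))        # placeholder values only
h = half + half[::-1]            # symmetric 32-tap coefficient set
x = [3, -1, 2, 0, -3, 1, 1, -2, 3, 2]
subsets = [h[p::4] for p in range(4)]

# Symmetry means two subsets are just the other two in reverse order.
print(subsets[3] == subsets[0][::-1], subsets[2] == subsets[1][::-1])
print(interpolate_direct(h, x) == interpolate_by_subsets(h, x))
```

With symmetric coefficients, only two distinct 8-coefficient sets need to be stored; the other two responses reuse them with the sample order reversed.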
5 MHz carrier
The basic equation for carrier modulation is Y(k) = y_I(k) cos(ω_c t) + y_Q(k) sin(ω_c t), where ω_c = 2π (5 MHz) is the carrier frequency and I and Q denote the in-phase and quadrature symbol components.
This equation is evaluated every 50 ns. Within one symbol period (200 ns) the carrier takes only four values, which can conveniently be defined as cos(ω_c t) = 1, 0, -1, 0 and sin(ω_c t) = 0, 1, 0, -1.
The modulated output therefore needs no multiplication or addition, and there is no need to compute both the I and Q filter responses every 50 ns. An I response is computed in one 50 ns interval and a Q response in the next, followed by a negated I response and a negated Q response, after which the cycle starts again.
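The simplification can be illustrated with a small Python sketch. It assumes the two terms of the modulation equation are added, and it takes the four carrier values listed above in order; the function names are hypothetical.

```python
COS = (1, 0, -1, 0)   # carrier cosine at the four sample instants
SIN = (0, 1, 0, -1)   # carrier sine at the four sample instants


def modulate(y_i, y_q):
    """Evaluate Y(k) = yI(k)*cos + yQ(k)*sin with explicit multiplies."""
    return [yi * COS[k % 4] + yq * SIN[k % 4]
            for k, (yi, yq) in enumerate(zip(y_i, y_q))]


def modulate_by_selection(y_i, y_q):
    """Same result with no multiplier or adder: select I, Q, -I, -Q in turn."""
    out = []
    for k, (yi, yq) in enumerate(zip(y_i, y_q)):
        out.append((yi, yq, -yi, -yq)[k % 4])
    return out


y_i = [1, 2, 3, 4, 5]
y_q = [10, 20, 30, 40, 50]
print(modulate(y_i, y_q))
print(modulate_by_selection(y_i, y_q))
```

Since only one of the two filter outputs is used at each instant, a single filter datapath can alternate between the I and Q channels.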
Distributed arithmetic (DA)
DA is a computation technique aimed at sum-of-products equations, specifically those in which one factor of each product is a constant. A DA design can achieve gate-level efficiency with a bit-serial algorithm and high performance with bit-parallel operation; it is the classic serial/parallel trade-off. DA can be applied to important linear, time-invariant digital signal processing algorithms such as filters (FIR and IIR), transforms (the fast Fourier transform [FFT]), and matrix-vector products such as the 8 x 8 discrete cosine transform (DCT).
DA has existed for more than 20 years, and it has long been established that it does not suit the fixed instruction set architecture of programmable DSPs. DA is, however, well suited to FPGA implementation, particularly to LUT-based logic blocks such as the Xilinx CLB. FIR filters designed with DA on Xilinx XC3000 series FPGAs were proposed as early as 1992.
A DA circuit contains no separate multiplier; the multiplication is performed by LUTs. DA stores in advance all the partial-product sums of an equation and looks them up (in what is here called the DALUT) according to the bits of all the input variables. A serial DA circuit has a single DALUT and starts the lookups with the least significant bit. The partial-product outputs are stored in an accumulator. The method is reminiscent of the shift-and-add subroutines of early computers: each successive DALUT output is added to the accumulated partial-product sum, which is shifted down one binary place. This yields a true double-precision result.
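A minimal software model of bit-serial DA is sketched below, assuming two's-complement input samples and integer coefficients. The function names are hypothetical, and for simplicity the model shifts each DALUT output left by its bit weight instead of shifting the accumulator right as the hardware does; the arithmetic result is the same.

```python
def build_dalut(coeffs):
    """Precompute the sum of coefficients selected by every address pattern."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_sum_of_products(coeffs, samples, nbits):
    """Compute sum(coeffs[i] * samples[i]) using only lookups, shifts, adds."""
    dalut = build_dalut(coeffs)
    mask = (1 << nbits) - 1
    words = [s & mask for s in samples]      # two's-complement bit patterns
    acc = 0
    for bit in range(nbits):                 # least significant bit first
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> bit) & 1) << i    # one bit from each input variable
        term = dalut[addr] << bit
        if bit == nbits - 1:
            acc -= term                      # the sign bit carries negative weight
        else:
            acc += term
    return acc


coeffs = [5, -3, 7, 2]       # placeholder constants
samples = [3, -4, 1, -1]     # 3-bit two's-complement values
print(da_sum_of_products(coeffs, samples, nbits=3))
```

The number of lookups equals the sample word length, not the number of taps, which is why a 3-bit input needs only three clock cycles per response regardless of filter length.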
Filter implementation
The datapath of the square-root raised-cosine filter can be defined in terms of standard functional blocks that convert directly into CLBs. Every 200 ns, the 3-bit I and Q signals output by the mapper are passed to a parallel-to-serial shift register (PSR). RAM in the shift register (SR) chain holds the seven previous symbols. The first three filter responses, y_b, y_c, and y_d, all operate on the same data in the shift registers. The PSR therefore needs a feedback path, while the RAM SRs recirculate in read-only mode under modulo addressing. The modulus here is six: the first three shifts are used for y_b, the next three for y_c, and the last three for y_d. When y_e is computed, the data moves down the SR chain. With this modulo addressing mode, the data is written redundantly again and again until it is passed to the next stage. All 12 shifts, together with the corresponding PSR loads, RAM SR addressing, and write control, are derived from the 60 MHz system clock.
Because the same coefficient array must be used in two sampling periods, once for the I channel computation and once for the Q channel computation, a single set of DALUTs is used, and 2:1 multiplexers steer the serial data streams to the corresponding address ports. These ports express the DALUT structure. A logic high on the h3 port selects the partial products at all memory addresses that contain h3. Similarly, a logic high on the h7 port selects all addresses that contain h7, and logic highs on both the h3 and h7 ports select all addresses that contain both h3 and h7. The remaining six coefficients follow the same pattern. Taken literally, eight coefficients would require a memory of 2^8, or 256, words. For 12-bit coefficients, that would require (256/32 words per CLB) x 12 = 96 CLBs. A further trick is to use two DALUTs, each serving four coefficients, and add their outputs. The CLB count then drops to (2 x 2^4)/32 x 12 + 13/2 (parallel adder) = 18.5 CLBs.
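The saving from splitting the table can be checked with a short calculation, and the split itself can be verified to give the same lookups. The coefficient values below are placeholders, and the helper names are hypothetical; the 32 words per CLB and the 13/2 CLBs for the adder are the figures quoted in the text.

```python
WORDS_PER_CLB = 32     # one CLB stores 32 words x 1 bit
COEFF_BITS = 12


def build_dalut(coeffs):
    """Sum of the coefficients selected by each address pattern."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def dalut_clbs(num_coeffs, width=COEFF_BITS):
    """CLBs needed for one DALUT addressed by num_coeffs input bits."""
    return (2 ** num_coeffs) / WORDS_PER_CLB * width


single_table_clbs = dalut_clbs(8)
adder_clbs = 13 / 2
split_table_clbs = 2 * dalut_clbs(4) + adder_clbs

# Functional check: two 16-word tables plus an adder match one 256-word table.
coeffs = [3, -7, 12, 5, -2, 9, -11, 4]      # placeholder values only
full = build_dalut(coeffs)
low = build_dalut(coeffs[:4])
high = build_dalut(coeffs[4:])
split_matches = all(full[a] == low[a & 0xF] + high[a >> 4] for a in range(256))

print(single_table_clbs, split_table_clbs, split_matches)
```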
A similar simplification applies to the second set of filter coefficients, which starts with h1. With 2:1 multiplexers, the parallel adder can be shared. The adder output, widened to 13 bits, feeds the scaling accumulator that performs the shift-and-add operation described earlier. When the sign bits of the input variables are presented to the DALUT, a subtraction is performed. This can be done by the standard method of adding EXOR gates at the DALUT output and injecting a carry into the first stage of the accumulator. For the negative responses y_d and y_e, the data samples can be left as they are regardless of sign bit, and the DALUT outputs are all complemented instead.
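The EXOR-and-carry method is ordinary two's-complement subtraction: inverting every bit of the operand and adding one negates it. The sketch below models this for a 14-bit accumulator; the width and the function names are assumptions made for illustration.

```python
WIDTH = 14
MASK = (1 << WIDTH) - 1


def add_or_subtract(acc, lut_out, subtract):
    """Add lut_out to acc, or subtract it by inverting bits and adding a carry."""
    operand = (lut_out ^ (MASK if subtract else 0)) & MASK   # row of EXOR gates
    carry_in = 1 if subtract else 0                          # first-stage carry
    return (acc + operand + carry_in) & MASK


def to_signed(value):
    """Interpret a WIDTH-bit pattern as a two's-complement number."""
    return value - (1 << WIDTH) if value & (1 << (WIDTH - 1)) else value


print(to_signed(add_or_subtract(100, 37, subtract=True)))
print(to_signed(add_or_subtract(10, 37, subtract=True)))
print(to_signed(add_or_subtract(10, 37, subtract=False)))
```

The same adder hardware thus serves for both addition and subtraction, with one control line driving the EXOR gates and the carry input.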
For I and Q data in fractional two's-complement form, the filter coefficients must be scaled to guard against overflow at the final output. The ten most significant bits can be loaded into the register that drives the D/A converter.
The CLB total for the filter datapath is 71.5. The FPGA output ports have flip-flops, which can serve as the register that drives the D/A converter. Counting the encoder (31 CLBs) and the timing and control functions (estimated at fewer than 50 CLBs), the total is about 159 CLBs, which happens to fit in one of the smaller chips of the Xilinx XC4000 series (only slightly larger than the smallest), namely the XC4005 (196 CLBs). With a higher-end FPGA such as the Xilinx Virtex, the CLB count can be reduced and the performance improved.
The entire design can guarantee its performance with a 60 MHz system clock. The data flows in a uniform format and in one direction. Pipeline registers can be inserted (without adding CLBs) to shorten the combinational paths. The 14-stage carry chain through the scaling accumulator is the longest combinational path; the built-in fast carry logic nevertheless ensures an ample speed margin.