Texas Instruments Security Camera TMS320C64X User Manual

TMS320C64x+ DSP

Little-Endian DSP Library

Programmer’s Reference

Literature Number: SPRUEB8

February 2006

Download from Www.Somanuals.com. All Manuals Search And Download.

Preface

Read This First

About This Manual

This document describes the C64x+ digital signal processor little-endian

(DSP) Library, or DSPLIB for short.

Notational Conventions

This document uses the following conventions:

- Hexadecimal numbers are shown with the suffix h. For example, the

following number is 40 hexadecimal (decimal 64): 40h.

- Registers in this document are shown in figures and described in tables.

- Macro names are written in uppercase text; function names are written in

lowercase.

Each register figure shows a rectangle divded into fields that repre-

sent the fields of the register. Each field is labeled with its bit name, its

beginning and ending bit numbers above, and its read/write properties

below. A legend explains the notation used for the properties.

Reserved bits in a register figure designate a bit that is used for future

device expansion.

Related Documentation From Texas Instruments

The following books describe the C6000™ devices and related support tools.

Copies of these documents are available on the Internet at www.ti.com. Tip:

Enter the literature number in the search box provided at www.ti.com.

SPRU732 — TMS320C64x/C64x+ DSP CPU and Instruction Set

Reference Guide. Describes the CPU architecture, pipeline, instruction

set, and interrupts for the TMS320C64x and TMS320C64x+ digital

signal processors (DSPs) of the TMS320C6000 DSP family. The

C64x/C64x+ DSP generation comprises fixed-point devices in the

C6000 DSP platform. The C64x+ DSP is an enhancement of the C64x

DSP with added functionality and an expanded instruction set.

Read This First

Download from Www.Somanuals.com. All Manuals Search And Download.

Trademarks

SPRAA84 — TMS320C64x to TMS320C64+ CPU Migration Guide.

Describes migrating from the Texas Instruments TMS320C64x digital

signal processor (DSP) to the TMS320C64x+ DSP. The objective of this

document is to indicate differences between the two cores. Functionality

in the devices that is identical is not included.

Trademarks

C6000, TMS320C64x+, TMS320C64x, C64x are trademarks of Texas

Instruments.

Download from Www.Somanuals.com. All Manuals Search And Download.

Contents

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1

Provides a brief introduction to the TI C64x+ DSPLIBs, shows the organization of the routines

contained in the libraries, and lists the features and benefits of the DSPLIBs.

1.1

1.2

Introduction to the TI C64x+ DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2

Features and Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4

Installing and Using DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1

Provides information on how to install and rebuild the TI C64x+ DSPLIB.

2.1

2.2

How to Install DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2

Using DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3

2.2.1 DSPLIB Arguments and Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3

2.2.2 Calling a DSPLIB Function From C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4

2.2.3 Calling a DSP Function From Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4

2.2.4 DSPLIB Testing − Allowable Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4

2.2.5 DSPLIB Overflow and Scaling Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4

2.2.6 Interrupt Behavior of DSPLIB Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5

How to Rebuild DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5

2.3

DSPLIB Function Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1

Provides tables containing all DSPLIB functions, a brief description of each, and a page refer-

ence for more detailed information.

3.1

3.2

3.3

3.4

Arguments and Conventions Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2

DSPLIB Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3

DSPLIB Function Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4

Differences Between the C64x and C64x+ DSPLIBs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8

DSPLIB Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1

Provides a list of the functions within the DSPLIB organized into functional categories.

4.1

4.2

4.3

4.4

4.5

4.6

4.7

4.8

Adaptive Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2

Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4

FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8

Filtering and Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-38

Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-58

Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-73

Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-76

Obsolete Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-90

4.8.1 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-90

iii

Download from Www.Somanuals.com. All Manuals Search And Download.

Contents

Performance/Fractional Q Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1

Describes performance considerations related to the C64x+ DSPLIB and provides information

about the Q format used by DSPLIB functions.

A.1 Performance Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2

A.2 Fractional Q Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3

A.2.1 Q3.12 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3

A.2.2 Q.15 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3

A.2.3 Q.31 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4

Software Updates and Customer Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1

Provides information about warranty issues, software updates, and customer support.

B.1 DSPLIB Software Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2

B.2 DSPLIB Customer Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1

Defines terms and abbreviations used in this book.

Download from Www.Somanuals.com. All Manuals Search And Download.

Tables

2−1

3−1

3−2

3−3

3−4

3−5

3−6

3−7

3−8

3−9

DSPLIB Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3

Argument Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2

Adaptive Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4

Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4

FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4

Filtering and Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5

Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6

Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6

Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7

Obsolete Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7

3−10 Functions Optimized in the C64x+ DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8

A−1

A−2

A−3

A−4

Q3.12 Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3

Q.15 Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3

Q.31 Low Memory Location Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4

Q.31 High Memory Location Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4

Contents

Download from Www.Somanuals.com. All Manuals Search And Download.

Chapter 1

Introduction

This chapter provides a brief introduction to the TI C64x+ DSP Libraries

(DSPLIB), shows the organization of the routines contained in the library, and

lists the features and benefits of the DSPLIB.

Topic

Page

1.1 Introduction to the TI C64x+ DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2

1.2 Features and Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4

1-1

Download from Www.Somanuals.com. All Manuals Search And Download.

Introduction to the TI C64x+ DSPLIB

1.1 Introduction to the TI C64x+ DSPLIB

The TI C64x+ DSPLIB is an optimized DSP Function Library for C

programmers using devices that include the C64x+ megamodule. It includes

many C-callable, assembly-optimized, general-purpose signal-processing

routines. These routines are typically used in computationally intensive

real-time applications where optimal execution speed is critical. By using

these routines, you can achieve execution speeds considerably faster than

equivalent code written in standard ANSI C language. In addition, by providing

ready-to-use DSP functions, TI DSPLIB can significantly shorten your DSP

application development time.

The TI DSPLIB includes commonly used DSP routines. Source code is

provided that allows you to modify functions to match your specific needs.

The routines contained in the library are organized into the following seven

different functional categories:

- Adaptive filtering

DSP_firlms2

- Correlation

DSP_autocor

DSP_autocor_rA8

- FFT

DSP_fft16x16

DSP_fft16x16_imre

DSP_fft16x16r

DSP_fft16x32

DSP_fft32x32

DSP_fft32x32s

DSP_ifft16x16

DSP_ifft16x16_imre

DSP_ifft16x32

DSP_ifft32x32

DSP_fft16x16t (obolete, use DSP_fft16x16)

DSP_bitrev_cplx (obsolete, use DSP_fft16x16)

DSP_radix 2 (obsolete, use DSP_fft16x16)

DSP_r4fft (obsolete, use DSP_fft16x16)

DSP_fft (obsolete, use DSP_fft16x16)

1-2

Download from Www.Somanuals.com. All Manuals Search And Download.

Introduction to the TI C64x+ DSPLIB

- Filtering and convolution

DSP_fir_cplx

DSP_fir_cplx_hM4X4

DSP_fir_gen

DSP_fir_gen_hM17_rA8X8

DSP_fir_r4

DSP_fir_r8

DSP_fir_r8_hM16_rM8A8X8

DSP_fir_sym

DSP_iir

- Math

DSP_dotp_sqr

DSP_dotprod

DSP_maxval

DSP_maxidx

DSP_minval

DSP_mul32

DSP_neg32

DSP_recip16

DSP_vecsumsq

DSP_w_vec

- Matrix

DSP_mat_mul

DSP_mat_trans

- Miscellaneous

DSP_bexp

DSP_blk_eswap16

DSP_blk_eswap32

DSP_blk_eswap64

DSP_blk_move

DSP_fltoq15

DSP_minerror

DSP_q15tofl

Introduction

1-3

Download from Www.Somanuals.com. All Manuals Search And Download.

Features and Benefits

1.2 Features and Benefits

- Hand-coded assembly-optimized routines

- C and linear assembly source code

- C-callable routines, fully compatible with the TI C6x compiler

- Fractional Q.15-format operands supported on some benchmarks

- Benchmarks (time and code)

- Tested against C model

1-4

Download from Www.Somanuals.com. All Manuals Search And Download.

Chapter 2

Installing and Using DSPLIB

This chapter provides information on how to install and rebuild the TI C64x+

DSPLIB.

Topic

Page

2.1 How to Install DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2

2.2 Using DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3

2.3 How to Rebuild DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5

2-1

Download from Www.Somanuals.com. All Manuals Search And Download.

How to Install DSPLIB

2.1 How to Install DSPLIB

Note:

You should read the README.txt file for specific details of the release.

The DSPLIB is provided in the file dsp64plus.zip. The file must be unzipped to

provide the following directory structure:

dsp

+−−README.txt

Top−level README file

+−−docs

library documentation

+−−examples

CCS project examples

|−−include

Required include files

library and source archives

fft twiddle generation functions

|−−lib

|−−support

Please install the contents of the lib directory in the default directory indicated

by your C_DIR environment. If you choose not to install the contents in the

default directory, update the C_DIR environment variable, for example, by

adding the following line in autoexec.bat file:

SET C_DIR=<install_dir>/lib;<install_dir>/include;%C_DIR%

or under Unix/csh:

setenv C_DIR ”<install_dir>/lib;<install_dir>/include;

$C_DIR”

or under Unix/Bourne Shell:

C_DIR=”<install_dir>/lib;<install_dir>/include;$C_DIR”;

export C_DIR

2-2

Download from Www.Somanuals.com. All Manuals Search And Download.

Using DSPLIB

2.2 Using DSPLIB

2.2.1 DSPLIB Arguments and Data Types

2.2.1.1 DSPLIB Types

T a ble 2−1 shows the data types handled by the DSPLIB.

Table 2−1. DSPLIB Data Types

Size

(bits)

Name

Type

Minimum

Maximum

short

integer

−32768

32767

int

integer

−2147483648

2147483647

long

integer

−549755813888

0000:0000h

549755813887

FFFF:FFFFh

pointer

Q.15

address

fraction

−0.9999694824...

−0.99999999953...

1.17549435e−38

2.2250738585072014e−308

0.9999694824...

0.99999999953...

3.40282347e+38

Q.31

fraction

IEEE float

IEEE double

floating point

1.7976931348623157e+308

Unless specifically noted, DSPLIB operates on Q.15-fractional data type

elements. Appendix A presents an overview of Fractional Q formats.

2.2.1.2 DSPLIB Arguments

TI DSPLIB functions typically operate over vector operands for greater

efficiency. Even though these routines can be used to process short arrays, or

even scalars (unless a minimum size requirement is noted), they will be slower

for those cases.

- Vector stride is always equal to 1: Vector operands are composed of vector

elements held in consecutive memory locations (vector stride equal to 1).

- Complex elements are assumed to be stored in consecutive memory

locations with Real data followed by Imaginary data.

- In-place computation is not allowed, unless specifically noted: Source

operand cannot be equal to destination operand.

Installing and Using DSPLIB

2-3

Download from Www.Somanuals.com. All Manuals Search And Download.

Using DSPLIB

2.2.2 Calling a DSPLIB Function From C

In addition to correctly installing the DSPLIB software, follow these steps to

include a DSPLIB function in the code:

- Include the function header file corresponding to the DSPLIB function

- Link the code with dsp64plus.lib

- Use a correct linker command file for the platform used.

The examples in the DSP\Examples folder show how to use the DSPLIB in a

Code Composer Studio C envirionment.

2.2.3 Calling a DSP Function From Assembly

The C64x+ DSPLIB functions were written to be used from C. Calling the

functions from assembly language source code is possible as long as the

calling function conforms to the Texas Instruments C64x+ C compiler calling

conventions. For more information, see Section 8 (Runtime Environment) of

TMS320C6000 Optimizing C Compiler User’s Guide (SPRU187).

2.2.4 DSPLIB Testing − Allowable Error

DSPLIB is tested under the Code Composer Studio environment against a

reference C implementation. You can expect identical results between

Reference C implementation and its Assembly implementation when using

test routines that focus on fixed-point type results. The test routines that deal

with floating points typically allow an error margin of 0.000001 when

comparing the results of reference C code and DSPLIB assembly code.

2.2.5 DSPLIB Overflow and Scaling Issues

The DSPLIB functions implement the same functionality of the reference C

code. You must conform to the range requirements specified in the API

function, and in addition, restrict the input range so that the outputs do not

overflow.

In FFT functions, twiddle factors are generated with a fixed scale factor; i.e.,

15−1

30−1

32767(=2

) for all 16-bit FFT functions, 1073741823(=2

) for

31−1

DSP_fft32x32s, 2147483647(=2

) for all other 32-bit FFT functions.

Twiddle factors cannot be scaled further to not scale input data. Because

DSP_fft16x16r and DSP_fft32x32s perform scaling by 2 at each radix-4 stage,

(log2(nx)−cei[log4(nx)−1])

the input data must be scaled by 2

to completely prevent

(log2(nx))

overflow. In all other FFT functions, the input data must be scaled by 2

because no scaling is done by the functions.

2-4

Download from Www.Somanuals.com. All Manuals Search And Download.

How to Rebuild DSPLIB

2.2.6 Interrupt Behavior of DSPLIB Functions

All of the functions in this library are designed to be used in systems with

interrupts. Thus, it is not necessary to disable interrupts when calling any of

these functions. The functions in the library will disable interrupts as needed to

protect the execution of code in tight loops and so on. Library functions have

three categories:

- Fully-interruptible: These functions do not disable interrupts. Interrupts

are blocked by at most 5 to 10 cycles at a time (not counting stalls) by

branch delay slots.

- Partially-interruptible: These functions disable interrupts for long

periods of time, with small windows of interruptibility. Examples include a

function with a nested loop, where the inner loop is non-interruptible and

the outer loop permits interrupts between executions of the inner loop.

- Non-interruptible: These functions disable interrupts for nearly their

entire duration. Interrupts may happen for a short time during the setup

and exit sequence.

Note that all three function categories tolerate interrupts. That is, an interrupt

can occur at any time without affecting the function correctness. The

interruptibility of the function only determines how long the kernel might delay

the processing of the interrupt.

2.3 How to Rebuild DSPLIB

If you would like to rebuild DSPLIB (for example, because you modified the

source file contained in the archive), you will have to use the mk6x utility as

follows:

mk6x dsp64plus.src −mv64plus −l dsp64plus.lib

Installing and Using DSPLIB

2-5

Download from Www.Somanuals.com. All Manuals Search And Download.

2-6

Download from Www.Somanuals.com. All Manuals Search And Download.

Chapter 3

DSPLIB Function Tables

This chapter provides tables containing all DSPLIB functions, a brief

description of each, and a page reference for more detailed information.

Topic

Page

3.1 Arguments and Conventions Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2

3.2 DSPLIB Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3

3.3 DSPLIB Function Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4

3.4 Differences Between the C64x and C64x+ DSPLIBs

3-8

3-1

Download from Www.Somanuals.com. All Manuals Search And Download.

Arguments and Conventions Used

3.1 Arguments and Conventions Used

The following convention has been used when describing the arguments for

each individual function:

Table 3−1. Argument Conventions

Argument

Description

x,y

Argument reflecting input data vector

Argument reflecting output data vector

nx,ny,nr

Arguments reflecting the size of vectors x,y, and r, respectively. For

functions in the case nx = ny = nr, only nx has been used across.

Argument reflecting filter coefficient vector (filter routines only)

Argument reflecting the size of vector h

Argument reflecting FFT coefficient vector (FFT routines only)

Some C64x+ functions have additional restrictions due to optimization using

new features such as higher multiply throughput. While these new functions

perform better, they can also lead to problems if not carefully used. For

example, DSP_autocor_rA8 is faster than DSP_autocor, but the output buffer

must be aligned to an 8−byte boundary. Therefore, the new functions are

named with any additional restrictions. Three types of restrictions are specified

to a pointer: minimum buffer size (M), buffer alignment (A), and the number of

elements in the buffer to be a multiple of an integer (X).The following

convention has been used when describing the arguments for each individual

function:

A kernel function foo with two parameters, m and n, with the following

restrictions:

m −> Minimum buffer size = 8, buffer alignment = double word, buffer

needs to be a multiple of 8 elements

n −> Minimum buffer size = 32, buffer alignment = word , buffer needs to be

a multiple of 16 elements

This function would be named: foo_mM8A8X8_nM32A4X16.

3-2

Download from Www.Somanuals.com. All Manuals Search And Download.

DSPLIB Functions

3.2 DSPLIB Functions

The routines included in the DSP library are organized into eight functional

categories and listed below in alphabetical order.

- Adaptive filtering

- Correlation

- FFT

- Filtering and convolution

- Math

- Matrix functions

- Miscellaneous

- Obsolete functions

DSPLIB Function Tables

3-3

Download from Www.Somanuals.com. All Manuals Search And Download.

DSPLIB Function Tables

3.3 DSPLIB Function Tables

Table 3−2. Adaptive Filtering

Functions

Description

Page

long DSP_firlms2(short *h, short *x, short b, int nh)

LMS FIR

4-2

Table 3−3. Correlation

Functions

Description

Page

void DSP_autocor(short *r,short *x, int nx, int nr)

Autocorrelation

4-4

void DSP_autocor_rA8(short *r,short *x, int nx, int nr)

Autocorrelation ( r[] must be

double word aligned)

4-4

Table 3−4. FFT

Functions

Description

Page

void DSP_fft16x16(short *w, int nx, short *x, short *y)

Complex out of place, Forward

FFT mixed radix with digit

reversal. Input/Output data in

Re/Im order.

4-8

void DSP_fft16x16_imre(short *w, int nx, short *x, short

*y)

Complex out of place, Forward

FFT mixed radix with digit

reversal. Input/Output data in

Im/Re order.

4-11

4-14

void DSP_fft16x16r(int nx, short *x, short *w, unsigned

char *brev, short *y, int radix, int offset, int n_max)

Cache-optimized mixed radix FFT

with scaling and rounding, digit

reversal, out of place. Input and

output: 16 bits, Twiddle factor: 16

bits.

void DSP_fft16x32(short *w, int nx, int *x, int *y)

void DSP_fft32x32(int *w, int nx, int *x, int *y)

void DSP_fft32x32s(int *w, int nx, int *x, int *y)

Extended precision, mixed radix

FFT, rounding, digit reversal, out

of place. Input and output: 32 bits,

Twiddle factor: 16 bits.

4-24

4-26

4-28

Extended precision, mixed radix

FFT, rounding, digit reversal, out

of place. Input and output: 32 bits,

Twiddle factor: 32 bits.

Extended precision, mixed radix

FFT, digit reversal, out of place.,

with scaling and rounding. Input

and output: 32 bits, Twiddle

factor: 32 bits.

3-4

Download from Www.Somanuals.com. All Manuals Search And Download.

DSPLIB Function Tables

Table 3−4. FFT (Continued)

Functions

Description

Page

void DSP_ifft16x16(short *w, int nx, short *x, short *y)

Complex out of place, Inverse

FFT mixed radix with digit

reversal. Input/Output data in

Re/Im order.

4-28

4-34

4-36

void DSP_ifft16x16_imre(short *w, int nx, short *x, short

*y)

Complex out of place, Inverse

FFT mixed radix with digit

reversal. Input/Output data in

Re/Im order.

void DSP_ifft16x32(short *w, int nx, int *x, int *y)

void DSP_ifft32x32(int *w, int nx, int *x, int *y)

Extended precision, mixed radix

IFFT, rounding, digit reversal, out

of place. Input and output: 32 bits,

Twiddle factor: 16 bits.

Extended precision, mixed radix

IFFT, digit reversal, out of place,

with scaling and rounding. Input

and output: 32 bits, Twiddle

factor: 32 bits.

Table 3−5. Filtering and Convolution

Functions

Description

Page

void DSP_fir_cplx (short *x, short *h, short *r, int nh, int

nx)

Complex FIR Filter (nh is a

multiple of 2)

4-38

void DSP_fir_cplx_hM4X4 (short *x, short *h, short *r, int

nh, int nx)

Complex FIR Filter (nh is a

multiple of 4)

4-38

void DSP_fir_gen (short *x, short *h, short *r, int nh, int nr) FIR Filter (any nh)

4-42

void DSP_fir_gen_hM17_rA8X8 (short *x, short *h, short

*r, int nh, int nr)

FIR Filter (r[] must be double

word aligned, nr must be multiple

of 8)

void DSP_fir_r4 (short *x, short *h, short *r, int nh, int nr)

void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)

FIR Filter (nh is a multiple of 4)

FIR Filter (nh is a multiple of 8)

4-46

4-50

void DSP_fir_r8_hM16_rM8A8X8 (short *x, short *h, short FIR Filter (r[] must be double

*r, int nh, int nr) word aligned, nr is a multiple of 8)

void DSP_fir_sym (short *x, short *h, short *r, int nh, int nr, Symmetric FIR Filter (nh is a

int s) multiple of 8)

4-52

DSPLIB Function Tables

3-5

Download from Www.Somanuals.com. All Manuals Search And Download.

DSPLIB Function Tables

Table 3−5. Filtering and Convolution (Continued)

Functions

Description

Page

void DSP_iir(short *r1, short *x, short *r2, short *h2, short IIR with 5 Coefficients

*h1, int nr)

4-54

void DSP_iirlat(short *x, int nx, short *k, int nk, int *b,

short *r)

All−pole IIR Lattice Filter

4-56

Table 3−6. Math

Functions

Description

Page

int DSP_dotp_sqr(int G, short *x, short *y, int *r, int nx)

Vector Dot Product and Square

4-58

int DSP_dotprod(short *x, short *y, int nx)

short DSP_maxval (short *x, int nx)

int DSP_maxidx (short *x, int nx)

Vector Dot Product

4-60

4-62

4-63

Maximum Value of a Vector

Index of the Maximum Element of

a Vector

short DSP_minval (short *x, int nx)

Minimum Value of a Vector

32-bit Vector Multiply

32-bit Vector Negate

void DSP_mul32(int *x, int *y, int *r, short nx)

void DSP_neg32(int *x, int *r, short nx)

void DSP_recip16 (short *x, short *rfrac, short *rexp, short 16-bit Reciprocal

nx)

int DSP_vecsumsq (short *x, int nx)

Sum of Squares

4-71

4-72

void DSP_w_vec(short *x, short *y, short m, short *r, short Weighted Vector Sum

nr)

Table 3−7. Matrix

Functions

Description

Page

void DSP_mat_mul(short *x, int r1, int c1, short *y, int c2, Matrix Multiplication

short *r, int qs)

4-73

void DSP_mat_trans(short *x, short rows, short columns, Matrix Transpose

short *r)

4-75

3-6

Download from Www.Somanuals.com. All Manuals Search And Download.

DSPLIB Function Tables

Table 3−8. Miscellaneous

Functions

Description

Page

short DSP_bexp(int *x, short nx)

Max Exponent of a Vector (for

scaling)

void DSP_blk_eswap16(void *x, void *r, int nx)

void DSP_blk_eswap32(void *x, void *r, int nx)

void DSP_blk_eswap64(void *x, void *r, int nx)

Endian-swap a block of 16-bit

values

Endian-swap a block of 32-bit

values

Endian-swap a block of 64-bit

values

void DSP_blk_move(short *x, short *r, int nx)

void DSP_fltoq15 (float *x,short *r, short nx)

Move a Block of Memory

Float to Q15 Conversion

Minimum Energy Error Search

4-84

4-85

4-87

int DSP_minerror (short *GSP0_TABLE,short *errCoefs,

int *savePtr_ret)

void DSP_q15tofl (short *x, float *r, short nx)

Q15 to Float Conversion

4-89

Table 3−9. Obsolete Functions

Functions

Description

Page

void DSP_bitrev_cplx (int *x, short *index, int nx)

Use DSP_fft16x16() instead

4-88

void DSP_radix2 (int nx, short *x, short *w)

void DSP_r4fft (int nx, short *x, short *w)

Use DSP_fft16x16() instead

4-91

4-93

void DSP_fft(short *w, int nx, short *x, short *y)

void DSP_fft16x16t(short *w, int nx, short *x, short *y)

4-96

4-107

DSPLIB Function Tables

3-7

Download from Www.Somanuals.com. All Manuals Search And Download.

Differences Between the C64x and C64x+ DSPLIBs

3.4 Differences Between the C64x and C64x+ DSPLIBs

The C64x+ DSPLIB was developed by optimizing some of the functions of the

C64x DSPLIB to take advantage of the C64x+ architecture.

T a ble 3−10 shows the optimized functions for the C64x+ DSPLIB.

There are two optimization types:

- SPLOOP conversion: Optimized code uses SPLOOP to provide

interruptibility and decrease power consumption. The new C64x+

instructions do not increase algorithm performance, and thus, are not

used.

- Kernel redesign, SPLOOP: Kernel of algorithm rewritten to take

advantage of the new C64x+ instructions and of the SPLOOP feature.

Table 3−10. Functions Optimized in the C64x+ DSPLIB

Function

C64x+ Optimized

Optimization Type

DSP_firlms2

DSP_autocor

DSP_autocor_rA8

Yes

Kernel re−design, SPLOOP

Optimization resulted in new

requirements. New name is used.

DSP_fft16x16

DSP_fft16x16_imre

DSP_fft16x16r

DSP_fft16x32

DSP_fft32x32

DSP_fft32x32s

DSP_ifft16x16

DSP_ifft16x16_imre

DSP_ifft16x32

DSP_ifft32x32

DSP_fir_cplx

Yes

New Function Optimized C64x+

Kernel re−design, SPLOOP

New Function Optimized C64x+

Kernel re−design, SPLOOP

3-8

Download from Www.Somanuals.com. All Manuals Search And Download.

Differences Between the C64x and C64x+ DSPLIBs

Table 3−10. Functions Optimized in the C64x+ DSPLIB (Continued)

Function

C64x+ Optimized

Optimization Type

DSP_fir_cplx_hM4X4

Yes

Kernel re−design, SPLOOP

Optimization resulted in new

requirements. New name is used.

DSP_fir_gen

DSP_fir_gen_hM17_rA8X8

Yes

Kernel re−design, SPLOOP

Optimization resulted in new

requirements. New name is used.

DSP_fir_r4

DSP_fir_r8

DSP_fir_r8_hM16_rM8A8X8

Yes

Kernel re−design, SPLOOP

Optimization resulted in new

requirements. New name is used.

DSP_fir_sym

DSP_iir

Yes

DSP_iirlat

DSP_dotp_sqr

DSP_dotprod

DSP_maxval

DSP_maxidx

DSP_minval

DSP_mul32

DSP_neg32

DSP_recip16

DSP_vecsumsq

DSP_w_vec

DSP_mat_mu

DSP_mat_trans

DSP_bexp

SPLOOP conversion

DSPLIB Function Tables

3-9

Download from Www.Somanuals.com. All Manuals Search And Download.

Differences Between the C64x and C64x+ DSPLIBs

Table 3−10. Functions Optimized in the C64x+ DSPLIB (Continued)

Function

C64x+ Optimized

Optimization Type

DSP_blk_eswap16

DSP_blk_eswap32

DSP_blk_move

DSP_fltoq15

Yes

SPLOOP conversion

DSP_minerror

DSP_q15tofl

DSP_bitrev_cplx

DSP_radix2

DSP_r4fft

Obsolete

DSP_fft

DSP_fft16x16t

Any functions which were not optimized for the C64x+ have the same

performance as on the C64x.

3-10

Download from Www.Somanuals.com. All Manuals Search And Download.

Chapter 4

DSPLIB Reference

This chapter provides a list of the functions within the DSP library (DSPLIB)

organized into functional categories. The functions within each category are

listed in alphabetical order and include arguments, descriptions, algorithms,

benchmarks, and special requirements.

Topic

Page

4.1 Adaptive Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2

4.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4

4.3 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8

4.4 Filtering and Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-38

4.5 Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-58

4.6 Matirx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-73

4.7 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-76

4.8 Obsolete Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-90

4-1

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_firlms2

4.1 Adaptive Filtering

LMS FIR

DSP_firlms2

Function

long DSP_firlms2(short * restrict h, const short * restrict x, short b, int nh)

Arguments

h[nh]

x[nh+1]

Coefficient Array

Input Array

Error from previous FIR

Number of coefficients. Must be multiple of 4.

Return value

return long

Description

Algorithm

The Least Mean Square Adaptive Filter computes an update of all nh

coefficients by adding the weighted error times the inputs to the original

coefficients. The input array includes the last nh inputs followed by a new

single sample input. The coefficient array includes nh coefficients.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

long DSP_firlms2(short h[ ],short x[ ], short b,

int nh)

{

int

long

r = 0;

for (i = 0; i < nh; i++) {

h[i] += (x[i] * b) >> 15;

r += x[i + 1] * h[i];

}

return r;

}

Special Requirements

- This routine assumes 16-bit input and output.

- The number of coefficients nh must be a multiple of 4.

4-2

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_firlms2

Implementation Notes

Benchmarks

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The loop is unrolled 4 times.

Cycles

Codesize

3 * nh/4 + 17

148 bytes

C64x+ DSPLIB Reference

4-3

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_autocor

4.2 Correlation

AutoCorrelation

DSP_autocor

Function

void DSP_autocor(short * restrict r, const short * restrict x, int nx, int nr)

Arguments

r[nr]

x[nx+nr]

Output array

Input array. Must be double-word aligned.

Length of autocorrelation. Must be a multiple of 8.

Number of lags. Must be a multiple of 4.

Description

Algorithm

This routine accepts an input array of length nx + nr and performs nr

autocorrelations each of length nx producing nr output results. This is typically

used in VSELP code.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_autocor(short r[ ],short x[ ], int nx, int nr)

{

int i,k,sum;

for (i = 0; i < nr; i++){

sum = 0;

for (k = nr; k < nx+nr; k++)

sum += x[k] * x[k−i];

r[i] = (sum >> 15);

}

Special Requirements

- nx must be a multiple of 8.

- nr must be a multiple of 4.

- x[ ] must be double-word aligned.

4-4

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_autocor

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The inner loop is unrolled 8 times.

- The outer loop is unrolled 4 times.

- The outer loop is conditionally executed in parallel with the inner loop. This

allows for a zero overhead outer loop.

Benchmarks

Cycles

nx<40:

6*nr*nr/4 + 20

nx>=40: nx*nr/8 + 2*nr + 20

304 bytes

Codesize

C64x+ DSPLIB Reference

4-5

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_autocor_rA8

AutoCorrelation

DSP_autocor_rA8

Function

void DSP_autocor_rA8(short * restrict r, const short * restrict x, int nx, int nr)

Arguments

r[nr]

x[nx+nr]

Output array, Must be double word aligned.

Input array. Must be double-word aligned.

Length of autocorrelation. Must be a multiple of 8.

Number of lags. Must be a multiple of 4.

Description

Algorithm

This routine accepts an input array of length nx + nr and performs nr

autocorrelations each of length nx producing nr output results. This is typically

used in VSELP code.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_autocor(short r[ ],short x[ ], int nx, int nr)

{

int i,k,sum;

for (i = 0; i < nr; i++){

sum = 0;

for (k = nr; k < nx+nr; k++)

sum += x[k] * x[k−i];

r[i] = (sum >> 15);

}

Special Requirements

Implementation Notes

- nx must be a multiple of 8.

- nr must be a multiple of 4.

- x[ ] must be double-word aligned.

- r[ ] must be double-word aligned.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The inner loop is unrolled 8 times.

- The outer loop is unrolled 4 times.

4-6

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_autocor_rA8

Benchmarks

Cycles

nx<40:

6*nr+ 20

nx>=40: nx*nr/8 + 2*nr + 20

304 bytes

Codesize

C64x+ DSPLIB Reference

4-7

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16

4.3 FFT

Complex Forward Mixed Radix 16 x 16-bit FFT

DSP_fft16x16

Function

void DSP_fft16x16(const short * restrict w, int nx, short * restrict x, short *

restrict y)

Arguments

w[2*nx]

Pointer to complex Q.15 FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4

, and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 16-bit data input.

Pointer to complex 16-bit data output.

Description

This routine computes a complex forward mixed radix FFT with rounding and

digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.

The output is returned in the separate array y[ ] in normal order. Each complex

value is stored with interleaved real and imaginary parts. The code uses a

special ordering of FFT coefficients (also called twiddle factors) and memory

accesses to improve performance in the presence of cache.

Algorithm

All stages are radix-4 except the last one, which can be radix-2 or radix-4,

depending on the size of the FFT. All stages except the last one scale by two

the stage output data.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be power of 2 or 4, and 16 ≤ nx ≤ 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the real/imaginary

components stored in adjacent locations in the array. The real

components are stored at even array indices, and the imaginary

components are stored at odd array indices. All data are in short precision

or Q.15 format.

4-8

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

The routine uses log (nx) − 1 stages of radix-4 transform and performs either

a radix-2 or radix-4 transform on the last stage depending on nx. If nx is a

power of 4, then this last stage is also a radix-4 transform, otherwise it is a

radix-2 transform. The conventional Cooley Tukey FFT is written using three

loops. The outermost loop “k” cycles through the stages. There are log N to

the base 4 stages in all. The loop “j” cycles through the groups of butterflies

with different twiddle factors, and loop “i” reuses the twiddle factors for the

different butterflies within a stage. Note the following:

Butterflies With Common

Twiddle Factors

Stage

Groups

Groups*Butterflies

N/4

N/16

N/4

logN

N/4

The following statements can be made based on above observations:

1) Inner loop “i0” iterates a variable number of times. In particular, the number

of iterations quadruples every time from 1..N/4. Hence, software pipelining

a loop that iterates a variable number of times is not profitable.

2) Outer loop “j” iterates a variable number of times as well. However, the

number of iterations is quartered every time from N/4 ..1. Hence, the

behavior in (a) and (b) are exactly opposite to each other.

3) If the two loops “i” and “j” are coalesced together then they will iterate for

a fixed number of times, namely N/4. This allows us to combine the “i” and

“j” loops into one loop. Optimized implementations will make use of this

fact.

In addition,, the Cooley Tukey FFT accesses three twiddle factors per iteration

of the inner loop, as the butterflies that reuse twiddle factors are lumped

together. This leads to accessing the twiddle factor array at three points, each

separated by “ie”. Note that “ie” is initially 1, and is quadrupled with every

iteration. Therefore, these three twiddle factors are not even contiguous in the

array.

C64x+ DSPLIB Reference

4-9

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16

To vectorize the FFT, it is desirable to access the twiddle factor array using

double word wide loads and fetch the twiddle factors needed. To do this, a

modified twiddle factor array is created, in which the factors WN/4, WN/2,

W3N/4 are arranged to be contiguous. This eliminates the separation between

twiddle factors within a butterfly. However, this implies that we maintain a

redundant version of the twiddle factor array as the loop is traversed from one

stage to another. Hence, the size of the twiddle factor array increases as

compared to the normal Cooley Tukey FFT. The modified twiddle factor array

is of size “2 * N” where the conventional Cooley Tukey FFT is of size “3N/4”

where N is the number of complex points to be transformed. The routine that

generates the modified twiddle factor array was presented earlier. With the

above transformation of the FFT, both the input data and the twiddle factor

array can be accessed using double-word wide loads to enable packed data

processing.

The final stage is optimized to remove the multiplication as w0 = 1. This stage

also performs digit reversal on the data, so the final output is in natural order.

In addition, if the number of points to be transformed is a power of 2, the final

stage applies a radix-2 pass instead of a radix-4. In any case, the outputs are

returned in normal order.

The code performs the bulk of the computation in place. However, because

digit-reversal cannot be performed in-place, the final result is written to a

separate array, y[].

Benchmarks

Cycles

(6 * nx/8 + 19) * ceil[log (nx) − 1] + 8*nx/8 + 30

Codesize 864 bytes

4-10

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16_imre

Complex Forward Mixed Radix 16 x 16-bit FFT, With Im/Re Order

DSP_fft16x16_imre

Function

void DSP_fft16x16_imre(const short * restrict w, int nx, short * restrict x, short

* restrict y)

Arguments

w[2*nx]

Pointer to complex Q.15 FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4

, and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 16-bit data input.

Pointer to complex 16-bit data output.

Description

This routine computes a complex forward mixed radix FFT with truncation and

digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.

The output is returned in the separate array y[ ] in normal order. Each complex

value is stored with interleaved imaginary and real parts. The code uses a

special ordering of FFT coefficients (also called twiddle factors) and memory

accesses to improve performance in the presence of cache.

Algorithm

All stages are radix-4 except the last one, which can be radix-2 or radix-4,

depending on the size of the FFT. All stages except the last one scale by two

the stage output data.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be power of 2 or 4, and 16 ≤ nx ≤ 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the imaginary/real

components stored in adjacent locations in the array. The imaginary

components are stored at even array indices, and the real components are

stored at odd array indices. All data are in short precision or Q.15 format.

Implementation Notes

- Bank Conflicts: no conflicts occur.

- Interruptibility: The code is interruptible.

C64x+ DSPLIB Reference

4-11

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16_imre

The routine uses log (nx) − 1 stages of radix-4 transform and performs either

a radix-2 or radix-4 transform on the last stage depending on nx. If nx is a

power of 4, then this last stage is also a radix-4 transform, otherwise it is a

radix-2 transform. The conventional Cooley Tukey FFT is written using three

loops. The outermost loop “k” cycles through the stages. There are log N to

the base 4 stages in all. The loop “j” cycles through the groups of butterflies

with different twiddle factors, and loop “i” reuses the twiddle factors for the

different butterflies within a stage. Note the following:

Butterflies With Common

Twiddle Factors

Stage

Groups

Groups*Butterflies

N/4

N/16

N/4

logN

N/4

The following statements can be made based on above observations:

1) Inner loop “i0” iterates a variable number of times. In particular, the number

of iterations quadruples every time from 1..N/4. Hence, software pipelining

a loop that iterates a variable number of times is not profitable.

2) Outer loop “j” iterates a variable number of times as well. However, the

number of iterations is quartered every time from N/4 ..1. Hence, the

behavior in (a) and (b) are exactly opposite to each other.

3) If the two loops “i” and “j” are coalesced together then they will iterate for

a fixed number of times, namely N/4. This allows us to combine the “i” and

“j” loops into one loop. Optimized implementations will make use of this

fact.

In addition, the Cooley Tukey FFT accesses three twiddle factors per iteration

of the inner loop, as the butterflies that reuse twiddle factors are lumped

together. This leads to accessing the twiddle factor array at three points, each

separated by “ie”. Note that “ie” is initially 1, and is quadrupled with every

iteration. Therefore these three twiddle factors are not even contiguous in the

array.

4-12

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16_imre

To vectorize the FFT, it is desirable to access twiddle factor array using double

word wide loads and fetch the twiddle factors needed. To do this, a modified

twiddle factor array is created, in which the factors WN/4, WN/2, W3N/4 are

arranged to be contiguous. This eliminates the separation between twiddle

factors within a butterfly. However, this implies that we maintain a redundant

version of the twiddle factor array as the loop is traversed from one stage to

another. Hence, the size of the twiddle factor array increases as compared to

the normal Cooley Tukey FFT. The modified twiddle factor array is of size

“2 * N”, where the conventional Cooley Tukey FFT is of size “3N/4”, where N

is the number of complex points to be transformed. The routine that generates

the modified twiddle factor array was presented earlier. With the above

transformation of the FFT, both the input data and the twiddle factor array can

be accessed using double-word wide loads to enable packed data processing.

The final stage is optimized to remove the multiplication as w0 = 1. This stage

also performs digit reversal on the data, so the final output is in natural order.

In addition, if the number of points to be transformed is a power of 2, the final

stage applies a DSP_radix2 pass instead of a radix 4. In any case, the outputs

are returned in normal order.

The code performs the bulk of the computation in place. However, because

digit-reversal cannot be performed in-place, the final result is written to a

separate array, y[].

Benchmarks

Cycles

(6 * nx/8 + 19) * ceil[log (nx) − 1] + 8*nx/8 + 30

Codesize 864 bytes

C64x+ DSPLIB Reference

4-13

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

Complex Forward Mixed Radix 16 x 16-bit FFT With Rounding

DSP_fft16x16r

Function

void DSP_fft16x16r(int nx, short * restrict x, const short * restrict w, const un-

signed char * restrict brev, short * restrict y, int radix, int offset, int nmax)

Arguments

Length of FFT in complex samples. Must be power of 2 or 4, and

≤16384

x[2*nx]

w[2*nx]

brev[64]

Pointer to complex 16-bit data input

Pointer to complex FFT coefficients

Pointer to bit reverse table containing 64 entries. Only required for

C code. Use NULL for assembly code since BITR instruction

is used instead.

y[2*nx]

radix

Pointer to complex 16-bit data output

Smallest FFT butterfly used in computation used for

decomposing FFT into sub-FFTs. See notes.

offset

nmax

Index in complex samples of sub-FFT from start of main FFT.

Size of main FFT in complex samples.

Description

This routine implements a cache-optimized complex forward mixed radix FFT

with scaling, rounding and digit reversal. Input data x[ ], output data y[ ], and

coefficients w[ ] are 16-bit. The output is returned in the separate array y[ ] in

normal order. Each complex value is stored as interleaved 16-bit real and

imaginary parts. The code uses a special ordering of FFT coefficients (also

called twiddle factors).

This redundant set of twiddle factors is size 2*N short samples. As pointed out

in subsequent sections, dividing these twiddle factors by 2 will give an effective

divide by 4 at each stage to guarantee no overflow. The function is accurate

to about 68dB of signal to noise ratio to the DFT function as follows.

4-14

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

void dft(int n, short x[], short y[])

{

int k,i, index;

const double PI = 3.14159654;

short * p_x;

double arg, fx_0, fx_1, fy_0, fy_1, co, si;

for(k = 0; k<n; k++)

{

p_x = x;

fy_0 = 0;

fy_1 = 0;

for(i=0; i<n; i++)

{

fx_0 = (double)p_x[0];

fx_1 = (double)p_x[1];

p_x += 2;

index = (i*k) % n;

arg = 2*PI*index/n;

co = cos(arg);

si = −sin(arg);

fy_0 += ((fx_0 * co) − (fx_1 * si));

fy_1 += ((fx_1 * co) + (fx_0 * si));

}

y[2*k] = (short)2*fy_0/sqrt(n);

y[2*k+1] = (short)2*fy_1/sqrt(n);

}

Scaling by 2 (i.e., >>1) takes place at each radix-4 stage except the last one.

A radix-4 stage could give a maximum bit-growth of 2 bits, which would require

scaling by 4. To completely prevent overflow, the input data must be scaled by

(BT−BS)

, where BT (total number of bit growth) = log (nx) and BS (number of

scales by the functions) = ceil[log (nx)−1]. All shifts are rounded to reduce

truncation noise power by 3dB.

C64x+ DSPLIB Reference

4-15

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

The function takes the twiddle factors and input data, and calculates the FFT

producing the frequency domain data in the y[ ] array. As the FFT allows every

input point to affect every output point, which causes cache thrashing in a

cache based system. This is mitigated by allowing the main FFT of size N to

be divided into several steps, allowing as much data reuse as possible. For

example, see the following function:

DSP_fft16x16r(1024,&x[0],

&w[0],

y,brev,4,

0,1024);

is equivalent to:

DSP_fft16x16r(1024,&x[2*0], &w[0]

,y,brev,256, 0,1024);

0,1024);

DSP_fft16x16r(256, &x[2*0], &w[2*768],y,brev,4,

DSP_fft16x16r(256, &x[2*256],&w[2*768],y,brev,4, 256,1024);

DSP_fft16x16r(256, &x[2*512],&w[2*768],y,brev,4, 512,1024);

DSP_fft16x16r(256, &x[2*768],&w[2*768],y,brev,4, 768,1024);

Notice how the first FFT function is called on the entire 1K data set. It covers

the first pass of the FFT until the butterfly size is 256.

The following 4 FFTs do 256-point FFTs 25% of the size. These continue down

to the end when the butterfly is of size 4. They use an index to the main twiddle

factor array of 0.75*2*N. This is because the twiddle factor array is composed

of successively decimated versions of the main array.

N not equal to a power of 4 can be used; i.e. 512. In this case, the following

would be needed to decompose the FFT:

DSP_fft16x16r(512, &x[0],

is equivalent to:

&w[0],

y,brev,2,

0,512);

DSP_fft16x16r(512, &x[0],

y,brev,128, 0,512);

0,512);

DSP_fft16x16r(128, &x[2*0], &w[2*384],y,brev,2,

DSP_fft16x16r(128, &x[2*128],&w[2*384],y,brev,2, 128,512);

DSP_fft16x16r(128, &x[2*256],&w[2*384],y,brev,2, 256,512);

DSP_fft16x16r(128, &x[2*384],&w[2*384],y,brev,2, 384,512);

The twiddle factor array is composed of log (N) sets of twiddle factors, (3/4)*N,

(3/16)*N, (3/64)*N, etc. The index into this array for each stage of the FFT is

calculated by summing these indices up appropriately. For multiple FFTs, they

can share the same table by calling the small FFTs from further down in the

twiddle factor array, in the same way as the decomposition works for more data

reuse.

Thus, the above decomposition can be summarized for a general N, radix “rad”

as follows.

4-16

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

DSP_fft16x16r(N, &x[0],

DSP_fft16x16r(N/4,&x[0],

&w[0],

brev,y,N/4,0,

&w[2*3*N/4],brev,y,rad,0,

DSP_fft16x16r(N/4,&x[2*N/4], &w[2*3*N/4],brev,y,rad,N/4, N)

DSP_fft16x16r(N/4,&x[2*N/2], &w[2*3*N/4],brev,y,rad,N/2, N)

DSP_fft16x16r(N/4,&x[2*3*N/4],&w[2*3*N/4],brev,y,rad,3*N/4,N)

As discussed previously, N can be either a power of 4 or 2. If N is a power of

4, then rad = 4, and if N is a power of 2 and not a power of 4, then rad = 2. “rad”

controls how many stages of decomposition are performed. It also determines

whether a radix4 or DSP_radix2 decomposition should be performed at the

last stage. Hence, when “rad” is set to “N/4”, the first stage of the transform

alone is performed and the code exits. To complete the FFT, four other calls

are required to perform N/4 size FFTs. In fact, the ordering of these 4 FFTs

amongst themselves does not matter and, thus, from a cache perspective, it

helps to go through the remaining 4 FFTs in exactly the opposite order to the

first. This is illustrated as follows:

DSP_fft16x16r(N, &x[0],

&w[0],

brev,y,N/4,0,

DSP_fft16x16r(N/4,&x[2*3*N/4],&w[2*3*N/4],brev,y,rad,3*N/4, N)

DSP_fft16x16r(N/4,&x[2*N/2], &w[2*3*N/4],brev,y,rad,N/2, N)

DSP_fft16x16r(N/4,&x[2*N/4], &w[2*3*N/4],brev,y,rad,N/4, N)

DSP_fft16x16r(N/4,&x[0],

&w[2*3*N/4],brev,y,rad,0,

In addition, this function can be used to minimize call overhead by completing

the FFT with one function call invocation as shown below:

DSP_fft16x16r(N, &x[0], &w[0], y, brev, rad, 0, N)

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void fft16x16r

(

int

short

*ptr_x,

*ptr_w,

unsigned char *brev,

short

*y,

int

radix,

offset,

nmax

int

)

C64x+ DSPLIB Reference

4-17

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

{

int i, l0, l1, l2, h2, predj;

int l1p1,l2p1,h2p1, tw_offset, stride, fft_jmp;

short xt0, yt0, xt1, yt1, xt2, yt2;

short si1,si2,si3,co1,co2,co3;

short xh0,xh1,xh20,xh21,xl0,xl1,xl20,xl21;

short x_0, x_1, x_l1, x_l1p1, x_h2 , x_h2p1, x_l2, x_l2p1;

short *x,*w;

short *ptr_x0, *ptr_x2, *y0;

unsigned int j, k, j0, j1, k0, k1;

short x0, x1, x2, x3, x4, x5, x6, x7;

short xh0_0, xh1_0, xh0_1, xh1_1;

short xl0_0, xl1_0, xl0_1, xl1_1;

short yt3, yt4, yt5, yt6, yt7;

unsigned a, num;

stride = n;

/* n is the number of complex samples */

tw_offset = 0;

while (stride > radix)

{

j = 0;

fft_jmp = stride + (stride>>1);

h2 = stride>>1;

l1 = stride;

l2 = stride + (stride>>1);

x = ptr_x;

w = ptr_w + tw_offset;

for (i = 0; i < n>>1; i += 2)

{

co1 = w[j+0];

si1 = w[j+1];

co2 = w[j+2];

si2 = w[j+3];

co3 = w[j+4];

si3 = w[j+5];

j += 6;

x_0

= x[0];

4-18

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

x_1

= x[1];

x_h2 = x[h2];

x_h2p1 = x[h2+1];

x_l1 = x[l1];

x_l1p1 = x[l1+1];

x_l2 = x[l2];

x_l2p1 = x[l2+1];

xh0 = x_0

xh1 = x_1

xl0 = x_0

xl1 = x_1

+ x_l1;

+ x_l1p1;

− x_l1;

− x_l1p1;

xh20 = x_h2 + x_l2;

xh21 = x_h2p1 + x_l2p1;

xl20 = x_h2 − x_l2;

xl21 = x_h2p1 − x_l2p1;

ptr_x0 = x;

ptr_x0[0] = ((short)(xh0 + xh20))>>1;

ptr_x0[1] = ((short)(xh1 + xh21))>>1;

ptr_x2 = ptr_x0;

x += 2;

predj = (j − fft_jmp);

if (!predj) x += fft_jmp;

if (!predj) j = 0;

xt0 = xh0 − xh20;

yt0 = xh1 − xh21;

xt1 = xl0 + xl21;

yt2 = xl1 + xl20;

xt2 = xl0 − xl21;

yt1 = xl1 − xl20;

l1p1 = l1+1;

h2p1 = h2+1;

l2p1 = l2+1;

ptr_x2[l1 ] = (xt1 * co1 + yt1 * si1 + 0x00008000) >> 16;

ptr_x2[l1p1] = (yt1 * co1 − xt1 * si1 + 0x00008000) >> 16;

ptr_x2[h2 ] = (xt0 * co2 + yt0 * si2 + 0x00008000) >> 16;

C64x+ DSPLIB Reference

4-19

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

ptr_x2[h2p1] = (yt0 * co2 − xt0 * si2 + 0x00008000) >> 16;

ptr_x2[l2 ] = (xt2 * co3 + yt2 * si3 + 0x00008000) >> 16;

ptr_x2[l2p1] = (yt2 * co3 − xt2 * si3 + 0x00008000) >> 16;

}

tw_offset += fft_jmp;

stride = stride>>2;

} /* end while */

j = offset>>2;

ptr_x0 = ptr_x;

y0 = y;

/* determine _norm(nmax) − 17 */

l0 = 31;

if (((nmax>>31)&1)==1)

num = ~nmax;

else

num = nmax;

if (!num)

l0 = 32;

else

{

a=num&0xFFFF0000; if (a) { l0−=16; num=a; }

a=num&0xFF00FF00; if (a) { l0−= 8; num=a; }

a=num&0xF0F0F0F0; if (a) { l0−= 4; num=a; }

a=num&0xCCCCCCCC; if (a) { l0−= 2; num=a; }

a=num&0xAAAAAAAA; if (a) { l0−= 1; }

}

l0 −= 1;

l0 −= 17;

if(radix == 2 || radix == 4)

for (i = 0; i < n; i += 4)

{

/* reversal computation */

j0 = (j

) & 0x3F;

j1 = (j >> 6) & 0x3F;

k0 = brev[j0];

k1 = brev[j1];

4-20

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

k = (k0 << 6) | k1;

if (l0 < 0)

k = k << −l0;

else

k = k >> l0;

j++;

/* multiple of 4 index */

x0 = ptr_x0[0]; x1 = ptr_x0[1];

x2 = ptr_x0[2]; x3 = ptr_x0[3];

x4 = ptr_x0[4]; x5 = ptr_x0[5];

x6 = ptr_x0[6]; x7 = ptr_x0[7];

ptr_x0 += 8;

xh0_0 = x0 + x4;

xh1_0 = x1 + x5;

xh0_1 = x2 + x6;

xh1_1 = x3 + x7;

if (radix == 2)

{

xh0_0 = x0;

xh1_0 = x1;

xh0_1 = x2;

xh1_1 = x3;

}

yt0 = xh0_0 + xh0_1;

yt1 = xh1_0 + xh1_1;

yt4 = xh0_0 − xh0_1;

yt5 = xh1_0 − xh1_1;

xl0_0 = x0 − x4;

xl1_0 = x1 − x5;

xl0_1 = x2 − x6;

xl1_1 = x3 − x7;

if (radix == 2)

{

xl0_0 = x4;

xl1_0 = x5;

C64x+ DSPLIB Reference

4-21

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

xl1_1 = x6;

xl0_1 = x7;

}

yt2 = xl0_0 + xl1_1;

yt3 = xl1_0 − xl0_1;

yt6 = xl0_0 − xl1_1;

yt7 = xl1_0 + xl0_1;

if (radix == 2)

{

yt7 = xl1_0 − xl0_1;

yt3 = xl1_0 + xl0_1;

}

y0[k] = yt0; y0[k+1] = yt1;

k += n>>1;

y0[k] = yt2; y0[k+1] = yt3;

k += n>>1;

y0[k] = yt4; y0[k+1] = yt5;

k += n>>1;

y0[k] = yt6; y0[k+1] = yt7;

}

Special Requirements

- In-place computation is not allowed.

- nx must be a power of 2 or 4.

- Complex input data x[ ], twiddle factors w[ ], and output array y[ ] must be

double-word aligned.

- Real values are stored in even word, imaginary in odd.

- All data are in short precision or Q.15 format. Allowed input dynamic range

is 16 − (log (nx)−ceil[log (nx)−1]).

- Output results are returned in normal order.

- The FFT coefficients (twiddle factors) are generated using the program

tw_fft16x16 provided in the directory ‘support\fft’. The scale factor must be

(log2(nx)−ceil[log4(nx)−1])

32767.5. The input data must be scaled by 2

completely prevent overflow.

4-22

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16r

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- A special sequence of coefficients used as generated above produces the

FFT. This collapses the inner 2 loops in the traditional Burrus and Parks

implementation.

- The revised FFT uses a redundant sequence of twiddle factors to allow a

linear access through the data. This linear access enables data and

instruction level parallelism.

- The butterfly is bit reversed; i.e. the inner 2 points of the butterfly are

crossed over. This makes the data come out in bit reversed rather than in

radix 4 digit reversed order. This simplifies the last pass of the loop. The

BITR instruction does the bit reversal out of place.

Benchmarks

Cycles

ceil[log (nx) − 1] * (8 * nx/8 + 24) + 5.25 * nx/4 + 31

Codesize 640 bytes

C64x+ DSPLIB Reference

4-23

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x32

Complex Forward Mixed Radix 16 x 32-bit FFT With Rounding

Function

void DSP_fft16x32(const short * restrict w, int nx, int * restrict x, int * restrict y)

Arguments

w[2*nx]

Pointer to complex Q.15 FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4,

and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 32-bit data input.

Pointer to complex 32-bit data output.

Description

This routine computes an extended precision complex forward mixed radix

FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are

32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array

y[ ] in normal order. Each complex value is stored with interleaved real and

imaginary parts. The code uses a special ordering of FFT coefficients (also

called twiddle factors) and memory accesses to improve performance in the

presence of cache. The C code to generate the twiddle factors is the same as

the one used for the DSP_fft16x16r routine.

Algorithm

The C equivalent of the assembly code without restrictions is similar to the one

shown for the DSP_fft16x16t routine. For further details, see the source code

of the C version of this function, which is provided with this library. Note that

the assembly code is hand optimized and restrictions may apply.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal

to 16 and less than 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the real/imaginary

components stored in adjacent locations in the array. The real

components are stored at even array indices, and the imaginary

components are stored at odd array indices.

- The FFT coefficients (twiddle factors) are generated using the program

tw_fft16x32 provided in the directory ‘support\fft’. The scale factor must be

32767.5. No scaling is done with the function; thus, the input data must be

log2(nx)

scaled by 2

to completely prevent overflow.

4-24

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x32

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- See the fft16x16t implementation notes, as similar ideas are used.

Benchmarks

Cycles

(10.25 * nx/8 + 10) * ceil[log (nx) − 1] + 6 * nx/4 + 81

Codesize 1056 bytes

C64x+ DSPLIB Reference

4-25

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft32x32

Complex Forward Mixed Radix 32 x 32-bit FFT With Rounding

Function

void DSP_fft32x32(const int * restrict w, int nx, int * restrict x, int * restrict y)

Arguments

w[2*nx]

Pointer to complex 32-bit FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4,

and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 32-bit data input.

Pointer to complex 32-bit data output.

Description

This routine computes an extended precision complex forward mixed radix

FFT with rounding and digit reversal. Input data x[ ], output data y[ ], and

coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ] in

normal order. Each complex value is stored with interleaved real and

imaginary parts. The code uses a special ordering of FFT coefficients (also

called twiddle factors) and memory accesses to improve performance in the

presence of cache. The C code to generate the twiddle factors is similar to the

one used for the DSP_fft16x16r routine, except that the factors are maintained

at 32-bit precision.

Algorithm

The C equivalent of the assembly code without restrictions is similar to the one

shown for the DSP_fft16x16t routine. For further details, see the source code

of the C version of this function, which is provided with this library. Note that

the assembly code is hand optimized and restrictions may apply.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal

to 16 and less than 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the real/imaginary

components stored in adjacent locations in the array. The real

components are stored at even array indices, and the imaginary

components are stored at odd array indices.

- The FFT coefficients (twiddle factors) are generated using the program

tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be

2147483647.5. No scaling is done with the function; thus, the input data

log2(nx)

must be scaled by 2

to completely prevent overflow.

4-26

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft32x32

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- See the fft16x16t implementation notes, as similar ideas are used.

Benchmarks

Cycles

(12 * nx/8 + 12) * ceil[log (nx) − 1] + 6 * nx/4 + 79

Codesize 1056 bytes

C64x+ DSPLIB Reference

4-27

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft32x32s

Complex Forward Mixed Radix 32 x 32-bit FFT With Scaling

Function

void DSP_fft32x32s(const int * restrict w, int nx, int * restrict x, int * restrict y)

Arguments

w[2*nx]

Pointer to complex 32-bit FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4,

and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 32-bit data input.

Pointer to complex 32-bit data output.

Description

This routine computes an extended precision complex forward mixed radix

FFT with scaling, rounding and digit reversal. Input data x[ ], output data y[ ],

and coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ]

in normal order. Each complex value is stored with interleaved real and

imaginary parts. The code uses a special ordering of FFT coefficients (also

called twiddle factors) and memory accesses to improve performance in the

presence of cache. The C code to generate the twiddle factors is the same one

used for the DSP_fft32x32 routine.

Scaling by 2 (i.e., >>1) takes place at each radix-4 stage except for the last

one. A radix-4 stage can add a maximum of 2 bits, which would require scaling

by 4 to completely prevent overflow. Thus, the input data must be scaled by

log2(nx)−ceil[log4(nx)−1])

Algorithm

The C equivalent of the assembly code without restrictions is similar to the one

shown for the fft16x16t routine. For further details, see the source code of the

C version of this function, which is provided with this library. Note that the

assembly code is hand optimized and restrictions may apply.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal

to 16 and less than 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the real/imaginary

components stored in adjacent locations in the array. The real

components are stored at even array indices, and the imaginary

components are stored at odd array indices.

4-28

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft32x32s

- The FFT coefficients (twiddle factors) are generated using the program

tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be

(log2(nx) − ceil[ log4(nx)−1

1073741823.5. The input data must be scaled by 2

to completely prevent overflow.

])

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- Scaling is performed at each stage by shifting the results right by 1,

preventing overflow.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- See the fft16x16t implementation notes, as similar ideas are used.

Benchmarks

Cycles

(13 * nx/8 + 36) * ceil[log (nx) − 1] + 6 * nx/4 + 36

Codesize 928 bytes

C64x+ DSPLIB Reference

4-29

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft16x16

Complex Inverse Mixed Radix 16 x 16-bit FFT With Rounding

DSP_ifft16x16

Function

void DSP_ifft16x16(const short * restrict w, int nx, short * restrict x, short *

restrict y)

Arguments

w[2*nx]

Pointer to complex Q.15 FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4,

and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 16-bit data input.

Pointer to complex 16-bit data output.

Description

This routine computes a complex inverse mixed radix IFFT with rounding and

digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.

The output is returned in the separate array y[ ] in normal order. Each complex

value is stored with interleaved real and imaginary parts. The code uses a

special ordering of IFFT coefficients (also called twiddle factors) and memory

accesses to improve performance in the presence of cache.

The fft16x16 can be used to perform IFFT, by first conjugating the input,

performing the FFT, and conjugating again. This allows fft16x16 to perform the

IFFT as well. However, if the double conjugation needs to be avoided, then this

routine uses the same twiddle factors as the FFT and performs an IFFT. The

change in the sign of the twiddle factors is adjusted for in the routine. Hence,

this routine uses the same twiddle factors as the fft16x16 routine.

Algorithm

The C equivalent of the assembly code without restrictions is similar to the one

of the fft16x16 routine. For further details, see the source code of the C version

of this function which is provided with this library.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal

to 16 and less than 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the real/imaginary

components stored in adjacent locations in the array. The real

components are stored at even array indices, and the imaginary

components are stored at odd array indices.

- Scaling by two is performed after each radix-4 stage except the last one.

4-30

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft16x16

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- See the fft16x16 implementation notes, as similar ideas are used.

Benchmarks

Cycles

(6 * nx/8 + 19) * ceil[log (nx) − 1] + 8 * nx/8 + 30

Codesize 864 bytes

C64x+ DSPLIB Reference

4-31

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft16x16_imre

Complex Inverse Mixed Radix 16 x 16-bit FFT With Im/Re Order

DSP_ifft16x16_imre

Function

void DSP_ifft16x16_imre(const short * restrict w, int nx, short * restrict x, short

* restrict y)

Arguments

w[2*nx]

Pointer to complex Q.15 FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4,

and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex data input.

Pointer to complex data output.

Description

This routine computes a complex inverse mixed radix IFFT with rounding and

digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.

The output is returned in the separate array y[ ] in normal order. Each complex

value is stored with interleaved imaginary and real parts. The code uses a

special ordering of IFFT coefficients (also called twiddle factors) and memory

accesses to improve performance in the presence of cache.

The fft16x16_imre can be used to perform IFFT, by first conjugating the input,

performing the FFT, and conjugating again. This allows fft16x16_imre to

perform the IFFT as well. However, if the double conjugation needs to be

avoided, then this routine uses the same twiddle factors as the FFT and

performs an IFFT. The change in the sign of the twiddle factors is adjusted for

in the routine. Hence, this routine uses the same twiddle factors as the

fft16x16_imre routine.

Algorithm

The C equivalent of the assembly code without restrictions is similar to the one

of the ifft16x16 routine. For further details, see the source code of the C version

of this function which is provided with this library.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal

to 16 and less than 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the imaginary/real

components stored in adjacent locations in the array. The imaginary

components are stored at even array indices, and the real components are

stored at odd array indices.

- Scaling by two is performed after each radix-4 stage except the last one.

4-32

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft16x16_imre

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- See the fft16x16 implementation notes, as similar ideas are used.

Benchmarks

Cycles

(6 * nx/8 + 19) * ceil[log (nx) − 1] + 8 * nx/8 + 30

Codesize 864 bytes

C64x+ DSPLIB Reference

4-33

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft16x32

Complex Inverse Mixed Radix 16 x 32-bit FFT With Rounding

DSP_ifft16x32

Function

void DSP_ifft16x32(const short * restrict w, int nx, int * restrict x, int * restrict

Arguments

w[2*nx]

Pointer to complex Q.15 FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4,

and 16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 32-bit data input.

Pointer to complex 32-bit data output.

Description

This routine computes an extended precision complex inverse mixed radix

FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are

32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array

y[ ] in normal order. Each complex value is stored with interleaved real and

imaginary parts. The code uses a special ordering of FFT coefficients (also

called twiddle factors) and memory accesses to improve performance in the

presence of cache.

fft16x32 can be reused to perform IFFT, by first conjugating the input,

performing the FFT, and conjugating again. This allows fft16x32 to perform the

IFFT as well. However, if the double conjugation needs to be avoided, then this

routine uses the same twiddle factors as the FFT and performs an IFFT. The

change in the sign of the twiddle factors is adjusted for in the routine. Hence,

this routine uses the same twiddle factors as the fft16x32 routine.

Algorithm

The C equivalent of the assembly code without restrictions is similar to the one

shown for the fft16x16t routine. For further details, see the source code of the

C version of this function which is provided with this library. Note that the

assembly code is hand optimized and restrictions may apply.

Special Requirements

- In-place computation is not allowed.

- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal

to 16 and less than 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the real/imaginary

components stored in adjacent locations in the array. The real

components are stored at even array indices, and the imaginary

components are stored at odd array indices.

4-34

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft16x32

- The FFT coefficients (twiddle factors) are generated using the program

tw_fft16x32 provided in the directory ‘support\fft’. The scale factor must be

32767.5. No scaling is done with the function; thus the input data must be

log2(nx)

scaled by 2

to completely prevent overflow.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- See the fft16x16t implementation notes, as similar ideas are used.

Benchmarks

Cycles

(12.5 * nx/8 + 30) * ceil[log (nx) − 1] + 6 * nx/4 + 32

Codesize 864 bytes

C64x+ DSPLIB Reference

4-35

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft32x32

Complex Inverse Mixed Radix 32 x 32-bit FFT With Rounding

Function

void DSP_ifft32x32(const int * restrict w, int nx, int * restrict x, int * restrict y)

Arguments

w[2*nx]

Pointer to complex 32-bit FFT coefficients.

Length of FFT in complex samples. Must be power of 2 or 4, and

16 ≤ nx ≤ 32768.

x[2*nx]

y[2*nx]

Pointer to complex 32-bit data input.

Pointer to complex 32-bit data output.

Description

This routine computes an extended precision complex inverse mixed radix

FFT with rounding and digit reversal. Input data x[ ], output data y[ ], and

coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ] in

normal order. Each complex value is stored with interleaved real and

imaginary parts. The code uses a special ordering of FFT coefficients (also

called twiddle factors) and memory accesses to improve performance in the

presence of cache.

fft32x32 can be reused to perform IFFT, by first conjugating the input,

performing the FFT, and conjugating again. This allows fft32x32 to perform the

IFFT as well. However, if the double conjugation needs to be avoided, then this

routine uses the same twiddle factors as the FFT and performs an IFFT. The

change in the sign of the twiddle factors is adjusted for in the routine. Hence,

this routine uses the same twiddle factors as the fft32x32 routine.

Algorithm

The C equivalent of the assembly code without restrictions is similar to the one

shown for the fft16x16t routine. For further details, see the source code of the

C version of this function which is provided with this library. Note that the

assembly code is hand optimized and restrictions may apply.

Special Requirements

- In-place computation is not allowed.

- The size of the IFFT, nx, must be a power of 4 or 2 and greater than or equal

to 16 and less than 32768.

- The arrays for the complex input data x[ ], complex output data y[ ], and

twiddle factors w[ ] must be double-word aligned.

- The input and output data are complex, with the real/imaginary

components stored in adjacent locations in the array. The real

components are stored at even array indices, and the imaginary

components are stored at odd array indices.

4-36

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_ifft32x32

- The FFT coefficients (twiddle factors) are generated using the program

tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be

2147483647.5. No scaling is done with the function; thus the input data

log2(nx)

must be scaled by 2

to completely prevent overflow.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The routine uses log (nx) − 1 stages of radix-4 transform and performs

either a radix-2 or radix-4 transform on the last stage depending on nx. If

nx is a power of 4, then this last stage is also a radix-4 transform, otherwise

it is a radix-2 transform.

- See the fft16x16t implementation notes, as similar ideas are used.

Benchmarks

Cycles

(13*nx/8 + 28) * ceil(log (nx) − 1) + 6 * nx/4 + 39

Codesize 960 bytes

C64x+ DSPLIB Reference

4-37

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_cplx

4.4 Filtering and Convolution

Complex FIR Filter

DSP_fir_cplx

Function

void DSP_fir_cplx (const short * restrict x, const short * restrict h, short * restrict

r, int nh, int nr)

Arguments

x[2*(nr+nh−1)] Complex input data. x must point to x[2*(nh−1)].

h[2*nh]

r[2*nr]

Complex coefficients (in normal order).

Complex output data.

Number of complex coefficients. Must be a multiple of 2.

Number of complex output samples. Must be a multiple of 4.

Description

Algorithm

This function implements the FIR filter for complex input data. The filter has

nr output samples and nh coefficients. Each array consists of an even and odd

term with even terms representing the real part and the odd terms the

imaginary part of the element. The pointer to input array x must point to the

(nh)th complex sample; i.e., element 2*(nh−1), upon entry to the function. The

coefficients are expected in normal order.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_cplx(short *x, short *h, short *r,short nh, short

nr)

{

short i,j;

int imag, real;

for (i = 0; i < 2*nr; i += 2){

imag = 0;

real = 0;

for (j = 0; j < 2*nh; j += 2){

real += h[j] * x[i−j] − h[j+1] * x[i+1−j];

imag += h[j] * x[i+1−j] + h[j+1] * x[i−j];

}

r[i] = (real >> 15);

r[i+1] = (imag >> 15);

}

4-38

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_cplx

Special Requirements

Implementation Notes

- The number of coefficients nh must be a multiple of 2.

- The number of output samples nr must be a multiple of 4.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The outer loop is unrolled 4 times while the inner loop is not unrolled.

- Both inner and outer loops are collapsed in one loop.

- ADDAH and SUBAH are used along with PACKH2 to perform

accumulation, shift, and data packing.

- Collapsed one stage of epilog and prolog each.

Benchmarks

Cycles

nr * nh/2 + 7

Codesize 448 bytes

C64x+ DSPLIB Reference

4-39

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_cplx_hM4X4

Complex FIR Filter

DSP_fir_cplx_hM4X4

Function

void DSP_fir_cplx _hM4X4(const short * restrict x, const short * restrict h, short

* restrict r, int nh, int nr)

Arguments

x[2*(nr+nh−1)] Complex input data. x must point to x[2*(nh−1)].

h[2*nh]

Complex coefficients (in normal order).

r[2*nr]

Complex output data.

Number of complex coefficients. Must be a multiple of 4.

Number of complex output samples. Must be a multiple of 4.

Description

Algorithm

This function implements the FIR filter for complex input data. The filter has

nr output samples and nh coefficients. Each array consists of an even and odd

term with even terms representing the real part and the odd terms the

imaginary part of the element. The pointer to input array x must point to the

(nh)th complex sample; i.e., element 2*(nh−1), upon entry to the function. The

coefficients are expected in normal order.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_cplx(short *x, short *h, short *r,short nh, short

nr)

{

short i,j;

int imag, real;

for (i = 0; i < 2*nr; i += 2){

imag = 0;

real = 0;

for (j = 0; j < 2*nh; j += 2){

real += h[j] * x[i−j] − h[j+1] * x[i+1−j];

imag += h[j] * x[i+1−j] + h[j+1] * x[i−j];

}

r[i] = (real >> 15);

r[i+1] = (imag >> 15);

}

4-40

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_cplx_hM4X4

Special Requirements

Implementation Notes

- The number of coefficients nh must be larger or equal to 4 and a multiple

of 4.

- The number of output samples nr must be a multiple of 4.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is fully interruptible.

- The outer loop is unrolled 4 times while the inner loop is not unrolled.

- Both inner and outer loops are collapsed in one loop.

- ADDAH and SUBAH are used along with PACKH2 to perform

accumulation, shift and data packing.

- Collapsed one stage of epilog and prolog each.

Benchmarks

Cycles

nr * nh*9/16 + 40

Codesize 384 bytes

C64x+ DSPLIB Reference

4-41

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_gen

FIR Filter

DSP_fir_gen

Function

void DSP_fir_gen (const short * restrict x, const short * restrict h, short * restrict

r, int nh, int nr)

Arguments

x[nr+nh−1]

h[nh]

Pointer to input array of size nr + nh − 1.

Pointer to coefficient array of size nh (coefficients must be in

reverse order).

r[nr]

Pointer to output array of size nr. Must be word aligned.

Number of coefficients. Must be ≥5.

Number of samples to calculate. Must be a multiple of 4.

Description

Algorithm

Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].

The real data input is stored in vector x[ ]. The filter output result is stored in

vector r[ ]. It operates on 16-bit data with a 32-bit accumulate. The filter

calculates nr output samples using nh coefficients.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)

{

int i, j, sum;

for (j = 0; j < nr; j++) {

sum = 0;

for (i = 0; i < nh; i++)

sum += x[i + j] * h[i];

r[j] = sum >> 15;

}

4-42

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_gen

Special Requirements

- The number of coefficients, nh, must be greater than or equal to 5.

Coefficients must be in reverse order.

- The number of outputs computed, nr, must be a multiple of 4 and greater

than or equal to 4.

- Array r[ ] must be word aligned.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Load double-word instruction is used to simultaneously load four values

in a single clock cycle.

- The inner loop is unrolled four times and will always compute a multiple

of 4 of nh and nr. If nh is not a multiple of 4, the code will fill in zeros to make

nh a multiple of 4.

- This code yields best performance when the ratio of outer loop to inner

loop is less than or equal to 4.

Benchmarks

Cycles: Not available

Codesize: Not available

C64x+ DSPLIB Reference

4-43

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_gen_hM17_rA8X8

FIR Filter

Function

void DSP_fir_gen_hM17_rA8X8 (const short * restrict x, const short * restrict

h, short * restrict r, int nh, int nr)

Arguments

x[nr+nh−1]

h[nh]

Pointer to input array of size nr + nh − 1.

Pointer to coefficient array of size nh (coefficients must be in

reverse order).

r[nr]

Pointer to output array of size nr. Must be double word

aligned.

Number of coefficients. Must be ≥17.

Number of samples to calculate. Must be a multiple of 8.

Description

Algorithm

Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].

The real data input is stored in vector x[ ]. The filter output result is stored in

vector r[ ]. It operates on 16-bit data with a 32-bit accumulate. The filter

calculates nr output samples using nh coefficients.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)

{

int i, j, sum;

for (j = 0; j < nr; j++) {

sum = 0;

for (i = 0; i < nh; i++)

sum += x[i + j] * h[i];

r[j] = sum >> 15;

}

4-44

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_gen_hM17_rA8X8

Special Requirements

- The number of coefficients, nh, must be greater than or equal to 17.

Coefficients must be in reverse order.

- The number of outputs computed, nr, must be a multiple of 8 and greater

than or equal to 8.

- Array r[ ] must be word aligned.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is fully interruptible.

- Load double-word instruction is used to simultaneously load four values

in a single clock cycle.

- The inner loop is unrolled four times and will always compute a multiple

of 4 of nh and nr. If nh is not a multiple of 4, the code will fill in zeros to make

nh a multiple of 4.

- This code yields best performance when the ratio of outer loop to inner

loop is less than or equal to 4.

Benchmarks

Cycles: 3*ceil(nh/4)*nr/4+39

Codesize: 416 bytes

C64x+ DSPLIB Reference

4-45

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_r4

FIR Filter (when the number of coefficients is a multiple of 4)

DSP_fir_r4

Function

void DSP_fir_r4 (const short * restrict x, const short * restrict h, short * restrict

r, int nh, int nr)

Arguments

x[nr+nh−1]

h[nh]

Pointer to input array of size nr + nh – 1.

Pointer to coefficient array of size nh (coefficients must be in

reverse order).

r[nr]

Pointer to output array of size nr.

Number of coefficients. Must be multiple of 4 and ≥8.

Number of samples to calculate. Must be multiple of 4.

Description

Algorithm

Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].

The real data input is stored in vector x[ ]. The filter output result is stored in

vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter

calculates nr output samples using nh coefficients.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_r4(short *x, short *h, short *r, int nh, int nr)

{

int i, j, sum;

for (j = 0; j < nr; j++) {

sum = 0;

for (i = 0; i < nh; i++)

sum += x[i + j] * h[i];

r[j] = sum >> 15;

}

4-46

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_r4

Special Requirements

Implementation Notes

- The number of coefficients, nh, must be a multiple of 4 and greater than

or equal to 8. Coefficients must be in reverse order.

- The number of outputs computed, nr, must be a multiple of 4 and greater

than or equal to 4.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The load double-word instruction is used to simultaneously load four

values in a single clock cycle.

- The inner loop is unrolled four times and will always compute a multiple

of 4 output samples.

Benchmarks

Cycles

(8 + nh) * nr/4 + 9

Codesize 308 bytes

C64x+ DSPLIB Reference

4-47

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_r8

FIR Filter (when the number of coefficients is a multiple of 8)

Function

void DSP_fir_r8_hM16_rM8A8X8 (short *x, short *h, short *r, int nh, int nr)

Arguments

x[nr+nh−1]

h[nh]

Pointer to input array of size nr + nh – 1.

Pointer to coefficient array of size nh (coefficients must be in

reverse order).

r[nr]

Pointer to output array of size nr. Must be word aligned.

Number of coefficients. Must be multiple of 8, ≥ 8.

Number of samples to calculate. Must be multiple of 4.

Description

Algorithm

Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].

The real data input is stored in vector x[ ]. The filter output result is stored in

vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter

calculates nr output samples using nh coefficients.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)

{

int i, j, sum;

for (j = 0; j < nr; j++) {

sum = 0;

for (i = 0; i < nh; i++)

sum += x[i + j] * h[i];

r[j] = sum >> 15;

}

Special Requirements

- The number of coefficients, nh, must be a multiple of 8 and greater than

or equal to 8. Coefficients must be in reverse order.

- The number of outputs computed, nr, must be a multiple of 4 and greater

than or equal to 4.

- Array r[ ] must be word aligned.

4-48

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_r8

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The load double-word instruction is used to simultaneously load four

values in a single clock cycle.

- The inner loop is unrolled 4 times and will always compute a multiple of

4 output samples.

- The outer loop is conditionally executed in parallel with the inner loop. This

allows for a zero overhead outer loop.

Benchmarks

Cycles

nh*nr/4 + 17

Codesize 336 bytes

C64x+ DSPLIB Reference

4-49

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_r8_hM16_rM8A8X8

FIR Filter (the number of coefficients is a multiple of 8)

Function

void DSP_fir_r8_hM16_rM8A8X8 (short *x, short *h, short *r, int nh, int nr)

Arguments

x[nr+nh−1]

h[nh]

Pointer to input array of size nr + nh – 1.

Pointer to coefficient array of size nh (coefficients must be in

reverse order).

r[nr]

Pointer to output array of size nr. Must be double word

aligned.

Number of coefficients. Must be multiple of 8, ≥ 16.

Number of samples to calculate. Must be multiple of 8, ≥.8.

Description

Algorithm

Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].

The real data input is stored in vector x[ ]. The filter output result is stored in

vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter

calculates nr output samples using nh coefficients.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)

{

int i, j, sum;

for (j = 0; j < nr; j++) {

sum = 0;

for (i = 0; i < nh; i++)

sum += x[i + j] * h[i];

r[j] = sum >> 15;

}

4-50

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_r8_hM16_rM8A8X8

Special Requirements

- The number of coefficients, nh, must be a multiple of 8 and greater than

or equal to 16. Coefficients must be in reverse order.

- The number of outputs computed, nr, must be a multiple of 8 and greater

than or equal to 8.

- Array r[ ] must be double word aligned.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The load double-word instruction is used to simultaneously load four

values in a single clock cycle.

- The inner loop is unrolled 4 times and will always compute a multiple of

4 output samples.

- The outer loop is conditionally executed in parallel with the inner loop. This

allows for a zero overhead outer loop.

Benchmarks

Cycles

When nh>32, nh*nr/8+22

Otherwise, 32*nr/8+22

Codesize 640 bytes

C64x+ DSPLIB Reference

4-51

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_sym

Symmetric FIR Filter

DSP_fir_sym

Function

void DSP_fir_sym (const short * restrict x, const short * restrict h, short * re-

strict r, int nh, int nr, int s)

Arguments

x[nr+2*nh]

Pointer to input array of size nr + 2*nh. Must be double-word

aligned.

h[nh+1]

Pointer to coefficient array of size nh + 1. Coefficients are in

normal order and only half (nh+1 out of 2*nh+1) are required.

Must be double-word aligned.

r[nr]

Pointer to output array of size nr. Must be word aligned.

Number of coefficients. Must be multiple of 8. The number of

original symmetric coefficients is 2*nh+1.

Number of samples to calculate. Must be multiple of 4.

Number of insignificant digits to truncate; e.g., 15 for Q.15

input data and coefficients.

Description

This function applies a symmetric filter to the input samples. The filter tap array

h[] provides ‘nh+1’ total filter taps. The filter tap at h[nh] forms the center point

of the filter. The taps at h[nh − 1] through h[0] form a symmetric filter about this

central tap. The effective filter length is thus 2*nh+1 taps.

The filter is performed on 16-bit data with 16-bit coefficients, accumulating

intermediate results to 40-bit precision. The accumulator is rounded and

truncated according to the value provided in ‘s’. This allows a variety of

Q-points to be used.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_fir_sym(short *x, short *h, short *r, int nh, int nr,

int s)

{

int

i, j;

long

y0;

round = (long) 1 << (s − 1);

for (j = 0; j < nr; j++) {

y0 = round;

for (i = 0; i < nh; i++)

4-52

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fir_sym

y0 += (short) (x[j + i] + x[j + 2 * nh − i]) * h[i];

y0 += x[j + nh] * h[nh];

r[j] = (int) (y0 >> s);

}

Special Requirements

Implementation Notes

Benchmarks

- nh must be a multiple of 8. The number of original symmetric coefficients

is 2*nh+1. Only half (nh+1) are required.

- nr must be a multiple of 4.

- x[ ] and h[ ] must be double-word aligned.

- r[ ] must be word aligned.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The load double-word instruction is used to simultaneously load four

values in a single clock cycle.

- The inner loop is unrolled eight times.

Cycles

(10 * nh/8 + 15) * nr/4 + 26

Codesize 664 bytes

C64x+ DSPLIB Reference

4-53

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_iir

IIR With 5 Coefficients

DSP_iir

Function

void DSP_iir (short * restrict r1, const short * restrict x, short * restrict r2, const

short * restrict h2, const short * restrict h1, int nr)

Arguments

r1[nr+4]

must

Output array (used in actual computation. First four elements

have the previous outputs.)

x[nr+4]

r2[nr]

h2[5]

h1[5]

Input array

Output array (stored)

Moving-average filter coefficients

Auto-regressive filter coefficients. h1[0] is not used.

Number of output samples. Must be ≥ 8.

Description

Algorithm

The IIR performs an auto-regressive moving-average (ARMA) filter with 4

auto-regressive filter coefficients and 5 moving-average filter coefficients for

nr output samples.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_iir(short *r1, short *x, short *r2, short *h2,

short *h1, int nr)

{

int j,i;

int sum;

for (i=0; i<nr; i++){

sum = h2[0] * x[4+i];

for (j = 1; j <= 4; j++)

sum += h2[j]*x[4+i−j]−h1[j]*r1[4+i−j];

r1[4+i] = (sum >> 15);

r2[i] = r1[4+i];

}

4-54

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_iir

Special Requirements

Implementation Notes

- nr is greater than or equal to 8.

- Input data array x[ ] contains nr + 4 input samples to produce nr output

samples.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Output array r1[ ] contains nr + 4 locations, r2[ ] contains nr locations for

storing nr output samples. The output samples are stored with an offset

of 4 into the r1[ ] array.

- The inner loop that iterated through the filter coefficients is completely

unrolled.

Benchmarks

Cycles

4 * nr + 21

Codesize 276 bytes

C64x+ DSPLIB Reference

4-55

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_iirlat

All-Pole IIR Lattice Filter

DSP_iirlat

Function

void DSP_iirlat(const short * restrict x, int nx, const short * restrict k, int nk, int

* restrict b, short * restrict r)

Arguments

x[nx]

Input vector (16-bit).

Length of input vector.

k[nk]

Reflection coefficients in Q.15 format.

Number of reflection coefficients/lattice stages. Must be >=4.

Make multiple of 2 to avoid bank conflicts.

b[nk+1]

r[nx]

Delay line elements from previous call. Should be initialized to

all zeros prior to the first call.

Output vector (16-bit).

Description

This routine implements a real all-pole IIR filter in lattice structure (AR lattice).

The filter consists of nk lattice stages. Each stage requires one reflection

coefficient k and one delay element b. The routine takes an input vector x[] and

returns the filter output in r[]. Prior to the first call of the routine, the delay

elements in b[] should be set to zero. The input data may have to be pre-scaled

to avoid overflow or achieve better SNR. The reflections coefficients lie in the

range −1.0 < k < 1.0. The order of the coefficients is such that k[nk−1]

corresponds to the first lattice stage after the input and k[0] corresponds to the

last stage.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void iirlat(short *x, int nx, short *k, int nk, int *b,

short *r)

{

int rt;

/* output

int i, j;

for (j=0; j<nx; j++)

{

rt = x[j] << 15;

for (i = nk − 1; i >= 0; i−−)

{

4-56

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_iirlat

= rt − (short)(b[i] >> 15) * k[i];

b[i + 1] = b[i] + (short)(rt >> 15) * k[i];

}

b[0] = rt;

r[j] = rt >> 15;

}

Special Requirements

Implementation Notes

- nk must be >= 4.

- No special alignment requirements

- See Bank Conflicts for avoiding bank conflicts

- Bank Conflicts: nk should be a multiple of 2, otherwise bank conflicts

occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Prolog and epilog of the inner loop are partially collapsed and overlapped

to reduce outer loop overhead.

Benchmarks

Cycles

(2 * nk + 7) * nx + 9

(without bank conflicts)

Codesize 352 bytes

C64x+ DSPLIB Reference

4-57

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_dotp_sqr

4.5 Math

Vector Dot Product and Square

DSP_dotp_sqr

Function

int DSP_dotp_sqr(int G, const short * restrict x, const short * restrict y, int *

restrict r, int nx)

Arguments

Calculated value of G (used in the VSELP coder).

First vector array

x[nx]

y[nx]

Second vector array

Result of vector dot product of x and y.

Number of elements. Must be multiple of 4, and ≥12.

return int New value of G.

Description

Algorithm

This routine performs an nx element dot product of x[ ] and y[ ] and stores it

in r. It also squares each element of y[ ] and accumulates it in G. G is passed

back to the calling function in register A4. This computation of G is used in the

VSELP coder.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

int DSP_dotp_sqr (int G,short *x,short *y,int *r,

int nx)

{

short *y2;

short *endPtr2;

y2 = x;

for (endPtr2 = y2 + nx; y2 < endPtr2; y2++){

*r += *y * *y2;

G += *y * *y;

y++;

}

return(G);

}

4-58

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_dotp_sqr

Special Requirements nx must be a multiple of 4 and greater than or equal to 12.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

Benchmarks

Cycles

nx/2 + 21

Codesize 128

C64x+ DSPLIB Reference

4-59

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_dotprod

Vector Dot Product

Function

int DSP_dotprod(const short * restrict x, const short * restrict y, int nx)

Arguments

x[nx]

y[nx]

First vector array. Must be double-word aligned.

Second vector array. Must be double word-aligned.

Number of elements of vector. Must be multiple of 4.

return int Dot product of x and y.

Description

Algorithm

This routine takes two vectors and calculates their dot product. The inputs are

16-bit short data and the output is a 32-bit number.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

int DSP_dotprod(short x[ ],short y[ ], int nx)

{

int sum;

int i;

sum = 0;

for(i=0; i<nx; i++){

sum += (x[i] * y[i]);

}

return (sum);

}

Special Requirements

- The input length must be a multiple of 4.

- The input data x[ ] and y[ ] are stored on double-word aligned boundaries.

- To avoid bank conflicts, the input arrays x[ ] and y[ ] must be offset by 4

half-words (8 bytes).

4-60

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_dotprod

Implementation Notes

- Bank Conflicts: No bank conflicts occur if the input arrays x[ ] and y[ ] are

offset by 4 half-words (8 bytes).

- Interruptibility: The code is fully interruptible.

- The code is unrolled 4 times to enable full memory and multiplier

bandwidth to be utilized.

- Interrupts are masked by branch delay slots only.

- Prolog collapsing has been performed to reduce codesize.

Benchmarks

Cycles

nx / 4 + 14

Codesize 64 bytes

C64x+ DSPLIB Reference

4-61

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_maxval

Maximum Value of Vector

Function

short DSP_maxval (const short *x, int nx)

Arguments

x[nx]

Pointer to input vector of size nx.

Length of input data vector. Must be multiple of 8 and ≥32.

Maximum value of a vector.

return short

Description

Algorithm

This routine finds the element with maximum value in the input vector and

returns that value.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

short DSP_maxval(short x[ ], int nx)

{

int i, max;

max = −32768;

for (i = 0; i < nx; i++)

if (x[i] > max)

max = x[i];

return max;

}

Special Requirements nx is a multiple of 8 and greater than or equal to 32.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

Benchmarks

Cycles nx / 4 + 10

Codesize 116 bytes

4-62

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_maxidx

Index of Maximum Element of Vector

DSP_maxidx

Function

int DSP_maxidx (const short *x, int nx)

Arguments

x[nx]

Pointer to input vector of size nx. Must be double-word aligned.

Length of input data vector. Must be multiple of 16 and ≥ 48.

return int Index for vector element with maximum value.

Description

This routine finds the max value of a vector and returns the index of that value.

The input array is treated as 16 separate columns that are interleaved

throughout the array. If values in different columns are equal to the maximum

value, then the element in the leftmost column is returned. If two values within

a column are equal to the maximum, then the one with the lower index is

returned. Column takes precedence over index.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

int DSP_maxidx(short x[ ], int nx)

{

int max, index, i;

max = −32768;

for (i = 0; i < nx; i++)

if (x[i] > max) {

max = x[i];

index = i;

}

return index;

}

Special Requirements

- nx must be a multiple of 16 and greater than or equal to 48.

- The input vector x[ ] must be double-word aligned.

C64x+ DSPLIB Reference

4-63

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_maxidx

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The code is unrolled 16 times to enable the full bandwidth of LDDW and

MAX2 instructions to be utilized. This splits the search into 16 sub-ranges.

The global maximum is then found from the list of maximums of the

sub-ranges. Then, using this offset from the sub-ranges, the global

maximum and the index of it are found using a simple match. For common

maximums in multiple ranges, the index will be different to the above C

code.

- This code requires 40 bytes of stack space for a temporary buffer.

Benchmarks

Cycles

5 * nx / 16 + 42

Codesize 388 bytes

4-64

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_minval

Minimum Value of Vector

DSP_minval

Function

short DSP_minval (const short *x, int nx)

Arguments

x [nx]

Pointer to input vector of size nx.

Length of input data vector. Must be multiple of 4 and ≥20.

return short

Maximum value of a vector.

Description

Algorithm

This routine finds the minimum value of a vector and returns the value.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

short DSP_minval(short x[ ], int nx)

{

int i, min;

min = 32767;

for (i = 0; i < nx; i++)

if (x[i] < min)

min = x[i];

return min;

}

Special Requirements nx is a multiple of 4 and greater than or equal to 20.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The input data is loaded using double word wide loads, and the MIN2

instruction is used to get to the minimum.

Benchmarks

Cycles

nx / 4 +10

Codesize 116 bytes

C64x+ DSPLIB Reference

4-65

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_mul32

32-Bit Vector Multiply

DSP_mul32

Function

void DSP_mul32(const int * restrict x, const int * restrict y, int * restrict r, short

nx)

Arguments

x[nx]

y[nx]

r[nx]

Pointer to input data vector 1 of size nx. Must be double-word

aligned.

Pointer to input data vector 2 of size nx. Must be double-word

aligned.

Pointer to output data vector of size nx. Must be double-word

aligned.

Number of elements in input and output vectors. Must be multiple

of 8 and ≥16.

Description

Algorithm

The function performs a Q.31 x Q.31 multiply and returns the upper 32 bits of

the result. The result of the intermediate multiplies are accumulated into a

40-bit long register pair, as there could be potential overflow. The contribution

of the multiplication of the two lower 16-bit halves are not considered. The

output is in Q.30 format. Results are accurate to least significant bit.

In the comments below, X and Y are the two input values. Xhigh and Xlow

represent the upper and lower 16 bits of X. This is the C equivalent of the

assembly code without restrictions. Note that the assembly code is hand

optimized and restrictions may apply.

void DSP_mul32(const int *x, const int *y, int *r,

short nx)

{

short

int

a,b,c,d,e;

for(i=nx;i>0;i−−)

{

a=*(x++);

b=*(y++);

c=_mpyluhs(a,b); /* Xlow*Yhigh */

d=_mpyhslu(a,b); /* Xhigh*Ylow */

e=_mpyh(a,b); /* Xhigh*Yhigh */

d+=c;

/* Xhigh*Ylow+Xlow*Yhigh */

/* (Xhigh*Ylow+Xlow*Yhigh)>>16 */

d=d>>16;

4-66

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_mul32

e+=d;

/* Xhigh*Yhigh + */

/* (Xhigh*Ylow+Xlow*Yhigh)>>16 */

*(r++)=e;

}

Special Requirements

Implementation Notes

- nx must be a multiple of 8 and greater than or equal to 16.

- Input and output vectors must be double-word aligned.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The MPYHI instruction is used to perform 16 x 32 multiplies to form 48-bit

intermediate results.

Benchmarks

Cycles

9 * nx/8 + 18

Codesize 512 bytes

C64x+ DSPLIB Reference

4-67

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_neg32

32-Bit Vector Negate

Function

void DSP_neg32(int *x, int *r, short nx)

Arguments

x[nx]

r[nx]

Pointer to input data vector 1 of size nx with 32-bit elements.

Must be double-word aligned.

Pointer to output data vector of size nx with 32-bit elements.

Must be double-word aligned.

Number of elements of input and output vectors. Must be a

multiple of 4 and ≥8.

Description

Algorithm

This function negates the elements of a vector (32-bit elements). The input and

output arrays must not be overlapped except for where the input and output

pointers are exactly equal.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_neg32(int *x, int *r, short nx)

{

short i;

for(i=nx; i>0; i−−)

*(r++)=−*(x++);

}

Special Requirements

Implementation Notes

- nx must be a multiple of 4 and greater than or equal to 8.

- The arrays x[ ] and r[ ] must be double-word aligned.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The loop is unrolled twice and pipelined.

Benchmarks

Cycles

nx/2 + 19

Codesize 124 bytes

4-68

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_recip16

16-Bit Reciprocal

DSP_recip16

Function

void DSP_recip16 (short *x, short *rfrac, short *rexp, short nx)

Arguments

x[nx]

Pointer to Q.15 input data vector of size nx.

rfrac[nx]

rexp[nx]

Pointer to Q.15 output data vector for fractional values.

Pointer to output data vector for exponent values.

Number of elements of input and output vectors.

Description

This routine returns the fractional and exponential portion of the reciprocal of

an array x[ ] of Q.15 numbers. The fractional portion rfrac is returned in Q.15

format. Since the reciprocal is always greater than 1, it returns an exponent

such that:

(rfrac[i] * 2rexp[i]) = true reciprocal

The output is accurate up to the least significant bit of rfrac, but note that this

bit could carry over and change rexp. For a reciprocal of 0, the procedure will

return a fractional part of 7FFFh and an exponent of 16.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_recip16(short *x, short *rfrac, short *rexp, short

nx)

{

int i,j,a,b;

short neg, normal;

for(i=nx; i>0; i−−)

{

a=*(x++);

if(a<0)

{

/* take absolute value */

a=−a;

neg=1;

}

else neg=0;

normal=_norm(a);

a=a<<normal;

/* normalize number */

C64x+ DSPLIB Reference

4-69

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_recip16

*(rexp++)=normal−15;

/* store exponent */

/* dividend = 1 */

b=0x80000000;

for(j=15;j>0;j−−)

b=_subc(b,a);

/* divide */

b=b&0x7FFF;

/* clear remainder

/* (clear upper half) */

if(neg) b=−b;

*(rfrac++)=b;

/* if originally

/* negative, negate */

/* store fraction */

}

Special Requirements None

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interruptible.

- The conditional subtract instruction, SUBC, is used for division. SUBC is

used once for every bit of quotient needed (15).

Benchmarks

Cycles

8 * nx + 14

Codesize 196 bytes

4-70

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_vecsumsq

Sum of Squares

DSP_vecsumsq

Function

int DSP_vecsumsq (const short *x, int nx)

Arguments

x[nx]

Input vector

Number of elements in x. Must be multiple of 4 and ≥8.

return int Sum of the squares

Description

Algorithm

This routine returns the sum of squares of the elements contained in the vector

x[ ].

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

int DSP_vecsumsq(short x[ ], int nx)

{

int i, sum=0;

for(i=0; i<nx; i++)

{

sum += x[i]*x[i];

}

return(sum);

}

Special Requirements nx must be a multiple of 4 and greater than or equal to 32.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The code is unrolled 4 times to enable full memory and multiplier

bandwidth to be utilized.

Benchmarks

Cycles

nx/4 + 11

Codesize 188 bytes

C64x+ DSPLIB Reference

4-71

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_w_vec

Weighted Vector Sum

DSP_w_vec

Function

void DSP_w_vec(const short * restrict x, const short * restrict y, short m, short

* restrict r, short nr)

Arguments

x[nr]

y[nr]

Vector being weighted. Must be double-word aligned.

Summation vector. Must be double-word aligned.

Weighting factor

r[nr]

Output vector

Dimensions of the vectors. Must be multiple of 8 and ≥8.

Description

Algorithm

This routine is used to obtain the weighted vector sum. Both the inputs and

output are 16-bit numbers.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_w_vec(short x[ ],short y[ ],short m,

short r[ ],short nr)

{

short i;

for (i=0; i<nr; i++) {

r[i] = ((m * x[i]) >> 15) + y[i];

}

Special Requirements

Implementation Notes

- nr must be a multiple of 8 and ≥ 8.

- Vectors x[ ] and y[ ] must be double-word aligned.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Input is loaded in double-words.

- Use of packed data processing to sustain throughput.

Benchmarks

Cycles

3 * nr/8 + 18

Codesize 144 bytes

4-72

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_mat_mul

4.6 Matrix

Matrix Multiplication

DSP_mat_mul

Function

void DSP_mat_mul(const short * restrict x, int r1, int c1, const short * restrict

y, int c2, short * restrict r, int qs)

Arguments

x [r1*c1]

Pointer to input matrix of size r1*c1.

Number of rows in matrix x.

Number of columns in matrix x. Also number of rows in y.

Pointer to input matrix of size c1*c2.

Number of columns in matrix y.

y [c1*c2]

r [r1*c2]

Pointer to output matrix of size r1*c2.

Final right−shift to apply to the result.

Description

This function computes the expression “r = x * y” for the matrices x and y. The

columnar dimension of x must match the row dimension of y. The resulting

matrix has the same number of rows as x and the same number of columns as

The values stored in the matrices are assumed to be fixed-point or integer

values. All intermediate sums are retained to 32-bit precision, and no overflow

checking is performed. The results are right-shifted by a user-specified

amount, and then truncated to 16 bits.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that the

assembly code is hand optimized and restrictions may apply.

void DSP_mat_mul(short *x, int r1, int c1, short *y, int c2,

short *r, int qs)

{

int i, j, k;

int sum;

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Multiply each row in x by each column in y. The */

/* product of row m in x and column n in y is placed */

/* in position (m,n) in the result.

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

C64x+ DSPLIB Reference

4-73

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_mat_mul

for (i = 0; i < r1; i++)

for (j = 0; j < c2; j++)

{

sum = 0;

for (k = 0; k < c1; k++)

sum += x[k + i*c1] * y[j + k*c2];

r[j + i*c2] = sum >> qs;

}

Special Requirements

Implementation Notes

- The arrays x[], y[], and r[] are stored in distinct arrays. That is, in-place

processing is not allowed.

- The input matrices have minimum dimensions of at least 1 row and 1

column, and maximum dimensions of 32767 rows and 32767 columns.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: This code blocks interrupts during its innermost loop.

Interrupts are not blocked otherwise. As a result, interrupts can be blocked

for up to 0.25*c1’ + 16 cycles at a time.

- The ‘i’ loop and ‘k’ loops are unrolled 2x. The ’j’ loop is unrolled 4x. For

dimensions that are not multiples of the various loops’ unroll factors, this

code calculates extra results beyond the edges of the matrix. These extra

results are ultimately discarded. This allows the loops to be unrolled for

efficient operation on large matrices while not losing flexibility.

Benchmarks

Cycles

0.25 * ( r1’ * c2’ * c1’ ) + 2.25 * ( r1’ * c2’ ) + 11, where:

r1’ = 2 * ceil(r1/2.0) (r1 rounded up to next even)

c1’ = 2 * ceil(c1/2.0) (c1 rounded up to next even)

c2’ = 4 * ceil(c2/4.0) (c2 rounded up to next mult of 4)

For r1= 1, c1= 1, c2= 1: 33 cycles

For r1= 8, c1=20, c2= 8: 475 cycles

Codesize 416 bytes

4-74

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_mat_trans

Matrix Transpose

DSP_mat_trans

Function

void DSP_mat_trans (const short *x, short rows, short columns, short *r)

Arguments

x[rows*columns]

rows

Pointer to input matrix.

Number of rows in the input matrix. Must be a multiple

of 4.

columns

Number of columns in the input matrix. Must be a multiple

of 4.

r[columns*rows]

Pointer to output data vector of size rows*columns.

Description

Algorithm

This function transposes the input matrix x[ ] and writes the result to matrix r[ ].

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_mat_trans(short *x, short rows, short columns, short

*r)

{

short i,j;

for(i=0; i<columns; i++)

for(j=0; j<rows; j++)

*(r+i*rows+j)=*(x+i+columns*j);

}

Special Requirements

Implementation Notes

- Rows and columns must be a multiple of 4.

- Matrices are assumed to have 16-bit elements.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Data from four adjacent rows, spaced “columns” apart are read, and a

local 4x4 transpose is performed in the register file. This leads to four

double words, that are “rows” apart. These loads and stores can cause

bank conflicts; hence, non-aligned loads and stores are used.

Benchmarks

Cycles

(2 * rows + 9) * columns/4 + 3

Codesize 224 bytes

C64x+ DSPLIB Reference

4-75

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_bexp

4.7 Miscellaneous

Block Exponent Implementation

DSP_bexp

Function

short DSP_bexp(const int *x, short nx)

Arguments

x[nx]

Pointer to input vector of size nx. Must be double-word

aligned.

Number of elements in input vector. Must be multiple of 8.

return short

Return value is the maximum exponent that may be used in

scaling.

Description

Algorithm

Computes the exponents (number of extra sign bits) of all values in the input

vector x[ ] and returns the minimum exponent. This will be useful in

determining the maximum shift value that may be used in scaling a block of

data.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

short DSP_bexp(const int *x, short nx)

{

int

min_val =_norm(x[0]);

short

int

for(i=1;i<nx;i++)

{

n =_norm(x[i]); /* _norm(x) = number of */

/* redundant sign bits */

if(n<min_val) min_val=n;

}

return min_val;

}

Special Requirements

- nx must be a multiple of 8.

- The input vector x[ ] must be double-word aligned.

4-76

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_bexp

Implementation Notes

Benchmarks

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

Cycles

nx/2 + 21

Codesize 216 bytes

C64x+ DSPLIB Reference

4-77

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_blk_eswap16

Endian-Swap a Block of 16-Bit Values

Function

void blk_eswap16(void * restrict x, void * restrict r, int nx)

Arguments

x [nx]

r [nx]

Source data. Must be double-word aligned.

Destination array. Must be double-word aligned.

Number of 16-bit values to swap. Must be multiple of 8.

Description

Algorithm

The data in the x[] array is endian swapped, meaning that the byte-order of the

bytes within each half-word of the r[] array is reversed. This facilitates moving

big-endian data to a little-endian system or vice-versa.

When the r pointer is non-NULL, the endian-swap occurs out-of-place, similar

to a block move. When the r pointer is NULL, the endian-swap occurs in-place,

allowing the swap to occur without using any additional storage.

This is the C equivalent of the assembly code without restrictions. Note that the

assembly code is hand optimized and restrictions may apply.

void DSP_blk_eswap16(void *x, void *r, int nx)

{

int i;

char *_x, *_r;

if (r)

{

_x = (char *)x;

_r = (char *)r;

} else

{

_x = (char *)x;

_r = (char *)r;

}

for (i = 0; i < nx; i++)

{

char t0, t1;

t0 = _x[i*2 + 1];

t1 = _x[i*2 + 0];

_r[i*2 + 0] = t0;

_r[i*2 + 1] = t1;

}

4-78

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_blk_eswap16

Special Requirements

- Input and output arrays do not overlap, except when “r == NULL” so that

the operation occurs in-place.

- The input array and output array are expected to be double-word aligned,

and a multiple of 8 half-words must be processed.

Implementation Notes

Benchmarks

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

Cycles

nx/8 + 18

Codesize 104 bytes

C64x+ DSPLIB Reference

4-79

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_blk_eswap32

Endian-Swap a Block of 32-Bit Values

Function

void blk_eswap32(void * restrict x, void * restrict r, int nx)

Arguments

x [nx]

r [nx]

Source data. Must be double-word aligned.

Destination array. Must be double-word aligned.

Number of 32-bit values to swap. Must be multiple of 4.

Description

The data in the x[] array is endian swapped, meaning that the byte-order of the

bytes within each word of the r[] array is reversed. This facilitates moving

big-endian data to a little-endian system or vice-versa.

When the r pointer is non-NULL, the endian-swap occurs out-of-place, similar

to a block move. When the r pointer is NULL, the endian-swap occurs in-place,

allowing the swap to occur without using any additional storage.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that the

assembly code is hand optimized and restrictions may apply.

void DSP_blk_eswap32(void *x, void *r, int nx)

{

int i;

char *_x, *_r;

if (r)

{

_x = (char *)x;

_r = (char *)r;

} else

{

_x = (char *)x;

_r = (char *)r;

}

for (i = 0; i < nx; i++)

{

char t0, t1, t2, t3;

t0 = _x[i*4 + 3];

t1 = _x[i*4 + 2];

4-80

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_blk_eswap32

t2 = _x[i*4 + 1];

t3 = _x[i*4 + 0];

_r[i*4 + 0] = t0;

_r[i*4 + 1] = t1;

_r[i*4 + 2] = t2;

_r[i*4 + 3] = t3;

}

Special Requirements

- Input and output arrays do not overlap, except where “r == NULL” so that

the operation occurs in-place.

- The input array and output array are expected to be double-word aligned,

and a multiple of 4 words must be processed.

Implementation Notes

Benchmarks

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

Cycles

nx/4 + 20

Codesize 116 bytes

C64x+ DSPLIB Reference

4-81

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_blk_eswap64

Endian-Swap a Block of 64-Bit Values

Function

void blk_eswap64(void * restrict x, void * restrict r, int nx)

Arguments

x[nx]

r[nx]

Source data. Must be double-word aligned.

Destination array. Must be double-word aligned.

Number of 64-bit values to swap. Must be multiple of 2.

Description

The data in the x[] array is endian swapped, meaning that the byte-order of the

bytes within each double-word of the r[] array is reversed. This facilitates

moving big-endian data to a little-endian system or vice-versa.

When the r pointer is non-NULL, the endian-swap occurs out-of-place, similar

to a block move. When the r pointer is NULL, the endian-swap occurs in-place,

allowing the swap to occur without using any additional storage.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that the

assembly code is hand optimized and restrictions may apply.

void DSP_blk_eswap64(void *x, void *r, int nx)

{

int i;

char *_x, *_r;

if (r)

{

_x = (char *)x;

_r = (char *)r;

} else

{

_x = (char *)x;

_r = (char *)r;

}

for (i = 0; i < nx; i++)

{

char t0, t1, t2, t3, t4, t5, t6, t7;

t0 = _x[i*8 + 7];

t1 = _x[i*8 + 6];

4-82

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_blk_eswap64

t2 = _x[i*8 + 5];

t3 = _x[i*8 + 4];

t4 = _x[i*8 + 3];

t5 = _x[i*8 + 2];

t6 = _x[i*8 + 1];

t7 = _x[i*8 + 0];

_r[i*8 + 0] = t0;

_r[i*8 + 1] = t1;

_r[i*8 + 2] = t2;

_r[i*8 + 3] = t3;

_r[i*8 + 4] = t4;

_r[i*8 + 5] = t5;

_r[i*8 + 6] = t6;

_r[i*8 + 7] = t7;

}

Special Requirements

- Input and output arrays do not overlap, except when “r == NULL” so that

the operation occurs in-place.

- The input array and output array are expected to be double-word aligned,

and a multiple of 2 double-words must be processed.

Implementation Notes

Benchmarks

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

Cycles

nx/2 + 20

Codesize 116 bytes

C64x+ DSPLIB Reference

4-83

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_blk_move

Block Move (Overlapping)

Function

void DSP_blk_move(short * x, short * r, int nx)

Arguments

x [nx]

r [nx]

Block of data to be moved.

Destination of block of data.

Number of elements in block. Must be multiple of 8 and ≥32.

Description

Algorithm

This routine moves nx 16-bit elements from one memory location pointed to

by x to another pointed to by r. The source and destination blocks can be

overlapped.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_blk_move(short *x, short *r, int nx)

{

int i;

if( r < x )

{

for (I = 0; I < nx; i++)

r[i] = x[i];

} else

{

for (I = nx−1; I >= 0; i−−)

r[i] = x[i];

}

Special Requirements nx must be a multiple of 8 and ≥ 32.

Implementation Notes

- Twin input and output pointers are used.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is fully interruptible.

Benchmarks

Cycles

nx/4+18

Codesize 112 bytes

4-84

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fltoq15

Float to Q15 Conversion

DSP_fltoq15

Function

void DSP_fltoq15 (float *x, short *r, short nx)

Arguments

x[nx]

r[nx]

Pointer to floating-point input vector of size nx. x should contain

the numbers normalized between [−1,1).

Pointer to output data vector of size nx containing the Q.15

equivalent of vector x.

Length of input and output data vectors. Must be multiple of 2.

Description

Algorithm

Convert the IEEE floating point numbers stored in vector x[ ] into Q.15 format

numbers stored in vector r[ ]. Results are truncated toward zero. Values that

exceed the size limit will be saturated to 0x7fff if value is positive and 0x8000

if value is negative. All values too small to be correctly represented will be

truncated to 0.

This is the C equivalent of the assembly code without restrictions. Note that the

assembly code is hand optimized and restrictions may apply.

void fltoq15(float x[], short r[], short nx)

{

int i, a;

for(i = 0; i < nx; i++)

{

a = 32768 * x[i];

// saturate to 16−bit //

if (a>32767) a = 32767;

if (a<−32768) a = −32768;

r[i] = (short) a;

}

Special Requirements nx must be a multiple of 2.

C64x+ DSPLIB Reference

4-85

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fltoq15

Implementation Notes

- Loop is unrolled twice.

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

Benchmarks

Cycles

3 * nx/2 + 14

Codesize 224 bytes

4-86

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_minerror

Minimum Energy Error Search

DSP_minerror

Function

int minerror (const short * restrict GSP0_TABLE, const short * restrict

errCoefs, int * restrict max_index)

Arguments

GSP0_TABLE[9*256] GSP0 terms array. Must be double-word aligned.

errCoefs[9]

max_index

return int

Array of error coefficients.

Pointer to GSP0_TABLE[max_index] found.

Maximum dot product result.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that the

assembly code is hand optimized and restrictions may apply.

int minerr

(

const short *restrict GSP0_TABLE,

const short *restrict errCoefs,

int

*restrict max_index

)

{

int val, maxVal = −50;

int i, j;

for (i = 0; i < GSP0_NUM; i++)

{

for (val = 0, j = 0; j < GSP0_TERMS; j++)

val += GSP0_TABLE[i*GSP0_TERMS+j] * errCoefs[j];

if (val > maxVal)

{

maxVal = val;

*max_index = i*GSP0_TERMS;

}

return (maxVal);

}

C64x+ DSPLIB Reference

4-87

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_minerror

Special Requirements Array GSP0_TABLE[] must be double-word aligned.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- The load double-word instruction is used to simultaneously load four

values in a single clock cycle.

- The inner loop is completely unrolled.

- The outer loop is 4 times unrolled.

Benchmarks

Cycles

256/4 * 9 + 17 = 593

Codesize 352 bytes

4-88

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_q15tofl

Q15 to Float Conversion

DSP_q15tofl

Function

void DSP_q15tofl (short *x, float *r, int nx)

Arguments

x[nx]

r[nx]

Pointer to Q.15 input vector of size nx.

Pointer to floating-point output data vector of size nx containing

the floating-point equivalent of vector x.

Length of input and output data vectors. Must be multiple of 2.

Description

Algorithm

Converts the values stored in vector x[ ] in Q.15 format to IEEE floating point

numbers in output vector r[ ].

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_q15tofl(short *x, float *r, int nx)

{

int i;

for (i=0;i<nx;i++)

r[i] = (float) x[i] / 0x8000;

}

Special Requirements nx must be a multiple of 2.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Loop is unrolled twice

Benchmarks

Cycles

2 * nx + 14

Codesize 184 bytes

C64x+ DSPLIB Reference

4-89

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_bitrev_cplx

4.8 Obsolete Functions

4.8.1 FFT

Complex Bit-Reverse

DSP_bitrev_cplx

NOTE: This function is provided for backward compatibility with the C62x

DSPLIB. It has not been optimized for the C64x architecture. You are advised

to use one of the newly added FFT functions which have been optimized for the

C64x.

Function

void DSP_bitrev_cplx (int *x, short *index, int nx)

Arguments

x[nx]

Pointer to complex input vector x of size nx

index[ ]

Array of size ∼sqrt(nx) created by the routine digitrev_index

(provided in the directory ‘support\fft’).

Number of elements in vector x. nx must be a power of 2.

Description

Algorithm

This function bit-reverses the position of elements in complex vector x. This

function is used in conjunction with FFT routines to provide the correct format

for the FFT input or output data. The bit-reversal of a bit-reversed order array

yields a linear-order array.

TI retains all rights, title and interest in this code and only authorizes the use

of this code on TI TMS320 DSPs manufactured by TI. This is the C equivalent

of the assembly code without restrictions. Note that the assembly code is hand

optimized and restrictions may apply.

void DSP_bitrev_cplx (int *x, short *index, int nx)

{

int

short

int

i0, i1, i2, i3;

j0, j1, j2, j3;

xi0, xi1, xi2, xi3;

xj0, xj1, xj2, xj3;

int

short

int

a, b, ia, ib, ibs;

mask;

int

4-90

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_bitrev_cplx

int

nbits, nbot, ntop, ndiff, n2, halfn;

*xs= (short *) x;

short

nbits = 0;

i = nx;

while (i > 1){

i = i >> 1;

nbits++;}

nbot = nbits >> 1;

ndiff = nbits & 1;

ntop = nbot + ndiff;

= 1 << ntop;

mask = n2 − 1;

halfn = nx >> 1;

for(i0 = 0; i0 < halfn; i0 += 2) {

b = i0 & mask;

a = i0 >> nbot;

if (!b) ia

= index[a];

ib = index[b];

ibs= ib << nbot;

j0 = ibs + ia;

t = i0 < j0;

xi0= x[i0];

xj0= x[j0];

if (t){x[i0] = xj0;

x[j0] = xi0;}

i1 = i0 + 1;

j1 = j0 + halfn;

xi1= x[i1];

xj1= x[j1];

x[i1] = xj1;

x[j1] = xi1;

i3 = i1 + halfn;

j3 = j1 + 1;

xi3= x[i3];

xj3= x[j3];

C64x+ DSPLIB Reference

4-91

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_bitrev_cplx

if (t){x[i3] = xj3;

x[j3] = xi3;}

}

Special Requirements

- nx must be a power of 2.

- The array index[] is generated by the routine bitrev_index provided in the

directory ‘support\fft’.

- If nx ≤ 4K, you can use the char (8-bit) data type for the “index” variable.

This requires changing the LDH when loading index values in the

assembly routine to LDB. This further reduces the size of the Index Table

by half.

Implementation Notes

Benchmarks

- Interruptibility: The code is interrupt-tolerant but not interruptible.

The performance of this function has not yet been characterized on the C64x+

4-92

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_radix2

Complex Forward FFT (radix 2)

DSP_radix2

NOTE: This function is provided for backward compatibility with the C62x

DSPLIB. It has not been optimized for the C64x architecture. You are advised

to use one of the newly added FFT functions which have been optimized for the

C64x.

Function

void DSP_radix2 (int nx, short * restrict x, const short * restrict w)

Arguments

Number of complex elements in vector x. Must be a power of 2

such that 4 ≤ nx ≤ 65536.

x[2*nx]

w[nx]

Pointer to input and output sequences. Size 2*nx elements.

Pointer to vector of FFT coefficients of size nx elements.

Description

Algorithm

This routine is used to compute FFT of a complex sequence of size nx, a power

of 2, with “decimation-in-frequency decomposition” method. The output is in

bit-reversed order. Each complex value is with interleaved 16-bit real and

imaginary parts.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_radix2 (short x[ ],short nx,short w[ ])

{

short n1,n2,ie,ia,i,j,k,l;

short xt,yt,c,s;

n2 = nx;

ie = 1;

for (k=nx; k > 1; k = (k >> 1) ) {

n1 = n2;

n2 = n2>>1;

ia = 0;

for (j=0; j < n2; j++) {

c = w[2*ia];

s = w[2*ia+1];

ia = ia + ie;

for (i=j; i < nx; i += n1) {

l = i + n2;

C64x+ DSPLIB Reference

4-93

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_radix2

= x[2*l] − x[2*i];

= x[2*i] + x[2*l];

= x[2*l+1] − x[2*i+1];

x[2*i]

x[2*i+1] = x[2*i+1] + x[2*l+1];

x[2*l] = (c*xt + s*yt)>>15;

x[2*l+1] = (c*yt − s*xt)>>15;

}

ie = ie<<1;

}

Special Requirements

- 2 ≤ nx ≤ 32768 (nx is a power of 2)

- Input x and coefficients w should be in different data sections or memory

spaces to eliminate memory bank hits. If this is not possible, they should

be aligned on different word boundaries to minimize memory bank hits.

- x data is stored in the order real[0], image[0], real[1], ...

- The FFT coefficients (twiddle factors) are generated using the program

tw_radix2 provided in the directory ‘support\fft’.

Implementation Notes

- Bank Conflicts: See Benchmarks.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Loads input x and coefficient w as words.

- Both loops j and i0 shown in the C code are placed in the INNERLOOP of

the assembly code.

Benchmarks

The performance of this function has not yet been characterized on the C64x+.

4-94

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_r4fft

Complex Forward FFT (radix 4)

DSP_r4fft

NOTE: This function is provided for backward compatibility with the C62x

DSPLIB. It has not been optimized for the C64x architecture. You are advised

to use one of the newly added FFT functions which have been optimized for the

C64x.

Function

void DSP_r4fft (int nx, short * restrict x, const short * restrict w)

Arguments

Number of complex elements in vector x. Must be a power of 4

such that 4 ≤ nx ≤ 65536.

x[2*nx]

w[nx]

Pointer to input and output sequences. Size 2*nx elements.

Pointer to vector of FFT coefficients of size nx elements.

Description

Algorithm

This routine is used to compute FFT of a complex sequence size nx, a power

of 4, with “decimation-in-frequency decomposition” method. The output is in

digit-reversed order. Each complex value is with interleaved 16-bit real and

imaginary parts.

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

void DSP_r4fft (int nx, short x[ ], short w[ ])

{

int

n1, n2, ie, ia1, ia2, ia3, i0, i1, i2, i3,

j, k;

short t, r1, r2, s1, s2, co1, co2, co3, si1,

si2, si3;

n2 = nx;

ie = 1;

for (k = nx; k > 1; k >>= 2) {

n1 = n2;

n2 >>= 2;

ia1 = 0;

for (j = 0; j < n2; j++) {

ia2 = ia1 + ia1;

ia3 = ia2 + ia1;

co1 = w[ia1 * 2 + 1];

C64x+ DSPLIB Reference

4-95

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_r4fft

si1 = w[ia1 * 2];

co2 = w[ia2 * 2 + 1];

si2 = w[ia2 * 2];

co3 = w[ia3 * 2 + 1];

si3 = w[ia3 * 2];

ia1 = ia1 + ie;

for (i0 = j; i0 < nx; i0 += n1) {

i1 = i0 + n2;

i2 = i1 + n2;

i3 = i2 + n2;

r1 = x[2 * i0] + x[2 * i2];

r2 = x[2 * i0] − x[2 * i2];

t = x[2 * i1] + x[2 * i3];

x[2 * i0] = r1 + t;

r1 = r1 − t;

s1 = x[2 * i0 + 1] + x[2 * i2 + 1];

s2 = x[2 * i0 + 1] − x[2 * i2 + 1];

t = x[2 * i1 + 1] + x[2 * i3 + 1];

x[2 * i0 + 1] = s1 + t;

s1 = s1 − t;

x[2 * i2] = (r1 * co2 + s1 * si2) >>

15;

x[2 * i2 + 1] = (s1 * co2−r1 *

si2)>>15;

t = x[2 * i1 + 1] − x[2 * i3 + 1];

r1 = r2 + t;

r2 = r2 − t;

t = x[2 * i1] − x[2 * i3];

s1 = s2 − t;

s2 = s2 + t;

x[2 * i1] = (r1 * co1 + s1 * si1)

>>15;

x[2 * i1 + 1] = (s1 * co1−r1 *

si1)>>15;

x[2 * i3] = (r2 * co3 + s2 * si3)

4-96

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_r4fft

>>15;

x[2 * i3 + 1] = (s2 * co3−r2 *

si3)>>15;

}

ie <<= 2;

}

Special Requirements

- 4 ≤ nx ≤ 65536 (nx a power of 4)

- x is aligned on a 4*nx byte boundary for circular buffering

- Input x and coefficients w should be in different data sections or memory

spaces to eliminate memory bank hits. If this is not possible, w should be

aligned on an odd word boundary to minimize memory bank hits

- x data is stored in the order real[0], image[0], real[1], ...

- The FFT coefficients (twiddle factors) are generated using the program

tw_r4fft provided in the directory ‘support\fft’.

Implementation Notes

- Bank Conflicts: See Benchmarks.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Loads input x and coefficient w as words.

- Both loops j and i0 shown in the C code are placed in the INNERLOOP of

the assembly code.

Benchmarks

The performance of this function has not yet been characterized on the C64x+.

C64x+ DSPLIB Reference

4-97

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

Complex Forward FFT With Digital Reversal

DSP_fft

Function

void DSP_fft (const short * restrict w, int nx, short * restrict x, short * restrict y)

Arguments

w[2*nx]

Pointer to vector of Q.15 FFT coefficients of size 2 * nx

elements. Must be double-word aligned.

Number of complex elements in vector x. Must be a power of

4 and 4 ≤ nx ≤ 65536.

x[2*nx]

y[2*nx]

Pointer to input sequence of size 2 * nx elements. Must be

double-word aligned.

Pointer to output sequence of size 2 * nx elements. Must be

double-word aligned.

Description

This routine is used to compute an FFT of a complex sequence of size nx, a

power of 4, with “decimation-in-frequency decomposition” method. The output

is returned in a separate array y in normal order. This routine also performs

digit reversal as a special last step. Each complex value is stored as

interleaved 16-bit real and imaginary parts. The code uses a special ordering

of FFT factors and memory accesses to improve performance in the presence

of cache.

Algorithm

This is the C equivalent of the assembly code without restrictions. Note that

the assembly code is hand optimized and restrictions may apply.

/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/

/* The following macro is used to obtain a digit reversed index, of a given */

/* number i, into j where the number of bits in ”i” is ”m”. For the natural */

/* form of C code, this is done by first interchanging every set of ”2 bit” */

/* pairs, followed by exchanging nibbles, followed by exchanging bytes, and */

/* finally halfwords. To give an example, consider the following number:

/* N = FEDCBA9876543210, where each digit represents a bit, the following */

/* steps illustrate the changes as the exchanges are performed: */

/* M = DCFE98BA54761032 is the number after every ”2 bits” are exchanged. */

/* O = 98BADCFE10325476 is the number after every nibble is exchanged.

/* P = 1032547698BADCFE is the number after every byte is exchanged.

/* Since only 16 digits were considered this represents the digit reversed */

/* index. Since the numbers are represented as 32 bits, there is one more */

/* step typically of exchanging the half words as well.

4-98

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

#include <stdio.h>

#include <stdlib.h>

#if 0

# define DIG_REV(i, m, j) ((j) = (_shfl(_rotl(_bitr(_deal(i)), 16)) >> (m)))

#else

# define DIG_REV(i, m, j)

do {

unsigned _ = (i);

_ = ((_ & 0x33333333) << 2) | ((_ & ~0x33333333) >> 2);

_ = ((_ & 0x0F0F0F0F) << 4) | ((_ & ~0x0F0F0F0F) >> 4);

_ = ((_ & 0x00FF00FF) << 8) | ((_ & ~0x00FF00FF) >> 8);

_ = ((_ & 0x0000FFFF) << 16) | ((_ & ~0x0000FFFF) >> 16);

(j) = _ >> (m);

} while (0)

#endif

void fft_cn

(

const short *restrict w,

int n,

short

*restrict x,

*restrict y

)

{

int stride, i, j, k, t, s, m;

short xh0, xh1, xh20, xh21;

short xl0, xl1, xl20, xl21;

short xt0, yt0, xt1, yt1;

short xt2, yt2, xt3, yt3;

/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/

/* Inform the compiler that the input array ”x”, twiddle factor array */

/* ”w” and output array ”y” are double word aligned. In addition, the */

/* number of points to be transformed is assumed to be greater than or */

/* equal to 16, and less than 32768.

#ifndef NOASSUME

C64x+ DSPLIB Reference

4-99

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

_nassert((int)x % 8 == 0);

_nassert((int)y % 8 == 0);

_nassert((int)w % 8 == 0);

_nassert(n >= 16);

_nassert(n < 32768);

#endif

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Perform initial stages of FFT in place w/out digit reversal.

#ifndef NOASSUME

#pragma MUST_ITERATE(1,,1);

#endif

for (stride = n, t = 0; stride > 4; stride >>= 2)

{

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Perform each of the butterflies for this particular stride.

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

s = stride >> 2;

/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/

/* stride represents the seperation between the inputs of the radix */

/* 4 butterfly. The C code breaks the FFT, into two cases, one when */

/* the stride between the elements is greater than 4, other when

/* the stride is less than 4. Since stride is greater than 16, it */

/* can be guaranteed that ”s” is greater than or equal to 4.

/* In addition, it can also be shown that the loop that shares this */

/* stride will iterate at least once. The number of times this

/* loop iterates depends on how many butterflies in this stage

/* share a twiddle factor.

#ifndef NOASSUME

_nassert(stride >= 16);

_nassert(s

>= 4);

#pragma MUST_ITERATE(1,,1);

#endif

for (i = 0; i < n; i += stride)

4-100

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

{

#ifndef NOASSUME

_nassert(i % 4 == 0);

_nassert(s

>= 4);

#pragma MUST_ITERATE(2,,2);

#endif

for (j = 0; j < s; j += 2)

{

for (k = 0; k < 2; k++)

{

short

w1c, w1s, w2c, w2s, w3c, w3s;

short x0r, x0i, x1r, x1i, x2r, x2i, x3r, x3i;

short y0r, y0i, y1r, y1i, y2r, y2i, y3r, y3i;

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Read the four samples that are the input to this

/* particular butterfly.

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

x0r = x[2*(i+j+k ) + 0]; x0i = x[2*(i+j+k ) + 1];

x1r = x[2*(i+j+k + s) + 0]; x1i = x[2*(i+j+k + s) + 1];

x2r = x[2*(i+j+k + 2*s) + 0]; x2i = x[2*(i+j+k + 2*s) + 1];

x3r = x[2*(i+j+k + 3*s) + 0]; x3i = x[2*(i+j+k + 3*s) + 1];

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Read the six twiddle factors that are needed for 3 */

/* of the four outputs. (The first output has no mpys.) */

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/

w1s = w[t + 2*k + 6*j + 0];

w2s = w[t + 2*k + 6*j + 4];

w3s = w[t + 2*k + 6*j + 8];

w1c = w[t + 2*k + 6*j + 1];

w2c = w[t + 2*k + 6*j + 5];

w3c = w[t + 2*k + 6*j + 9];

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Calculate the four outputs, remembering that radix4 */

/* FFT accepts 4 inputs and produces 4 outputs. If we */

/* imagine the inputs as being complex, and look at the */

/* first stage as an example:

/* Four inputs are x(n) x(n+N/4) x(n+N/2) x(n+3N/4)

/* In general the four inputs can be generalized using */

C64x+ DSPLIB Reference

4-101

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

/* the stride between the elements as follows:

/* x(n), x(n + s), x(n + 2*s), x(n + 3*s).

/* These four inputs are used to calculate four outputs */

/* as shown below:

/* X(4k) = x(n) + x(n + N/4) + x(n + N/2) + x(n + 3N/4) */

/* X(4k+1)= x(n) −jx(n + N/4) − x(n + N/2) +jx(n + 3N/4) */

/* X(4k+2)= x(n) − x(n +N/4) + x(N + N/2) − x(n + 3N/4) */

/* X(4k+3)= x(n) +jx(n + N/4) − x(n + N/2) −jx(n + 3N/4) */

/* These four partial results can be re−written to show */

/* the underlying DIF structure similar to DSP_radix2 as */

/* follows:

/* X(4k) = (x(n)+x(n + N/2)) + (x(n+N/4)+ x(n + 3N/4)) */

/* X(4k+1)= (x(n)−x(n + N/2)) −j(x(n+N/4) − x(n + 3N/4)) */

/* x(4k+2)= (x(n)+x(n + N/2)) − (x(n+N/4)+ x(n + 3N/4)) */

/* X(4k+3)= (x(n)−x(n + N/2)) +j(x(n+N/4) − x(n + 3N/4)) */

/* which leads to the real and imaginary values as foll: */

/* y0r = x0r + x2r + x1r + x3r

/* y0i = x0i + x2i + x1i + x3i

= xh0 + xh20

= xh1 + xh21

/* y1r = x0r − x2r + (x1i − x3i) = xl0 + xl21

/* y1i = x0i − x2i − (x1r − x3r) = xl1 − xl20

/* y2r = x0r + x2r − (x1r + x3r) = xh0 − xh20

/* y2i = x0i + x2i − (x1i + x3i

= xh1 − xh21

/* y3r = x0r − x2r − (x1i − x3i) = xl0 − xl21

/* y3i = x0i − x2i + (x1r − x3r) = xl1 + xl20

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

xh0 = x0r + x2r;

xh1 = x0i + x2i;

xh20 = x1r + x3r;

xh21 = x1i + x3i;

xl0 = x0r − x2r;

4-102

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

xl1 = x0i − x2i;

xl20 = x1r − x3r;

xl21 = x1i − x3i;

xt0 = xh0 + xh20;

yt0 = xh1 + xh21;

xt1 = xl0 + xl21;

yt1 = xl1 − xl20;

xt2 = xh0 − xh20;

yt2 = xh1 − xh21;

xt3 = xl0 − xl21;

yt3 = xl1 + xl20;

/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/

/* Perform twiddle factor multiplies of three terms,top */

/* term does not have any multiplies. Note the twiddle */

/* factors for a normal FFT are C + j (−S). Since the

/* factors that are stored are C + j S, this is

/* corrected for in the multiplies.

/* Y1 = (xt1 + jyt1) (c + js) = (xc + ys) + (yc −xs)

/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/

y0r = xt0;

y0i = yt0;

y1r = (xt1 * w1c + yt1 * w1s) >> 15;

y1i = (yt1 * w1c − xt1 * w1s) >> 15;

y2r = (xt2 * w2c + yt2 * w2s) >> 15;

y2i = (yt2 * w2c − xt2 * w2s) >> 15;

y3r = (xt3 * w3c + yt3 * w3s) >> 15;

y3i = (yt3 * w3c − xt3 * w3s) >> 15;

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Store the final results back to the input array.

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

x[2*(i+j+k

) + 0] = y0r; x[2*(i+j+k

) + 1] = y0i;

x[2*(i+j+k + s) + 0] = y1r; x[2*(i+j+k + s) + 1] = y1i;

x[2*(i+j+k + 2*s) + 0] = y2r; x[2*(i+j+k + 2*s) + 1] = y2i;

x[2*(i+j+k + 3*s) + 0] = y3r; x[2*(i+j+k + 3*s) + 1] = y3i;

}

C64x+ DSPLIB Reference

4-103

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Offset to next subtable of twiddle factors. With each iteration */

/* of the above block, six twiddle factors get read, s times,

/* hence the offset into the twiddle factor array is advanced by */

/* this amount. */

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

t += 6 * s;

}

/* Get the magnitude of ”n”, so we know how many digits to reverse. */

for (i = 31, m = 1; (n & (1 << i)) == 0; i−−, m++) ;

/* Perform final stage with digit reversal.

s = n >> 2;

/* One of the nice features, of this last stage are that, no multiplies */

/* are required. In addition, the data always strides by a fixed amount */

/* namely 8 elements. Since the data is stored as interleaved pairs, of */

/* real and imaginary data, the first eight elements contain the data */

/* for the first four complex inputs. These are the inputs to the first */

/* radix4 butterfly.

#ifndef NOASSUME

#pragma MUST_ITERATE(4,,4);

#endif

for (i = 0; i < n; i += 4)

{

short x0r, x0i, x1r, x1i, x2r, x2i, x3r, x3i;

short y0r, y0i, y1r, y1i, y2r, y2i, y3r, y3i;

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Read the four samples that are the input to this butterfly.

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

4-104

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

x0r = x[2*(i + 0) + 0];

x1r = x[2*(i + 1) + 0];

x2r = x[2*(i + 2) + 0];

x3r = x[2*(i + 3) + 0];

x0i = x[2*(i + 0) + 1];

x1i = x[2*(i + 1) + 1];

x2i = x[2*(i + 2) + 1];

x3i = x[2*(i + 3) + 1];

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Calculate the final FFT result from this butterfly. */

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

y0r = (x0r + x2r) + (x1r + x3r);

y0i = (x0i + x2i) + (x1i + x3i);

y1r = (x0r − x2r) + (x1i − x3i);

y1i = (x0i − x2i) − (x1r − x3r);

y2r = (x0r + x2r) − (x1r + x3r);

y2i = (x0i + x2i) − (x1i + x3i);

y3r = (x0r − x2r) − (x1i − x3i);

y3i = (x0i − x2i) + (x1r − x3r);

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Digit reverse our address to convert the digit−reversed input */

/* into a linearized output order. This actually results in a

/* digit−reversed store pattern since we’re loading linearly, but */

/* the end result is that the FFT bins are in linear order. */

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

DIG_REV(i, m, j); /* Note: Result is assigned to ’j’ by the macro. */

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

/* Store out the final FFT results.

/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */

y[2*(j + 0) + 0] = y0r; y[2*(j + 0) + 1] = y0i;

y[2*(j + s) + 0] = y1r; y[2*(j + s) + 1] = y1i;

y[2*(j + 2*s) + 0] = y2r; y[2*(j + 2*s) + 1] = y2i;

y[2*(j + 3*s) + 0] = y3r; y[2*(j + 3*s) + 1] = y3i;

}

C64x+ DSPLIB Reference

4-105

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft

Special Requirements

- In-place computation is not allowed.

- nx must be a power of 4 and 4 ≤ nx ≤ 65536.

- Input x[ ] and output y[ ] are stored on double-word aligned boundaries.

- Input data x[ ] is stored in the order real0, img0, real1, img1, ...

- The FFT coefficients (twiddle factors) must be double-word aligned and

are generated using the program tw_fft16x16 provided in the directory

‘support\fft’.

Implementation Notes

- Bank Conflicts: No bank conflicts occur.

- Interruptibility: The code is interrupt-tolerant but not interruptible.

- Loads input x[ ] and coefficient w[ ] as double words.

- Both loops j and i0 shown in the C code are placed in the inner loop of the

assembly code.

Benchmarks

Cycles

1.25 * nx * log (nx) – 0.5 * nx + 23 * log (nx) – 1

Codesize

984 bytes

4-106

Download from Www.Somanuals.com. All Manuals Search And Download.

DSP_fft16x16t

Complex Forward Mixed Radix 16- x 16-Bit FFT With Truncation

DSP_fft16x16t

Function

void DSP_fft16x16t(const short * restrict w, int nx, short * restrict x, short * re-

strict y)