FPGA-Audio – FPGA based MP3/WAV Player
The aim of this project was to build an MP3/WAV player using just an FPGA, some RAM & a stereo DAC.
The project consists of a custom 32-bit soft core processor running at just under 60MHz which decodes MP3 in software, with no hardware acceleration apart from a single cycle Xilinx multiplier unit.
The FPGA-Audio project hardware simply consists of a Xilinx Spartan 6 FPGA with supporting components (2xVREGs & SPI-PROM), asynchronous 512KB SRAM, 24-bit Stereo DAC & Micro SD socket.
(Voltage regulators & decoupling caps on bottom of board)
Additional connectivity is available via a ‘UEXT’ connector, a 10-pin expansion connector that can be used to connect to the various modules sold by Olimex Ltd (see here…).
The FPGA can be used to instantiate UARTs, SPI & GPIO (or other) interfaces to this connector in any chosen pinout.
This interface will in the future be connected to another custom board with a display, buttons & IR receiver to enable control of the audio player.
The aim was to create a minimal design where as much functionality as possible was realized internally in the FPGA.
The FPGA consists of the following blocks:
‘MPX’ is a custom 32-bit soft core processor written originally in VHDL but now also in Verilog. It is a pipelined RISC processor which implements the majority of the MIPS-I ISA. It excludes the formerly patented unaligned load/store instructions and the native HW multiplier & divider (mult, multu, div, divu instructions).
By not including native multiplication & division instructions, the pipeline is simplified and the core is smaller.
Multiplication & division are provided by replacement functions in the C library (mulsi3, divsi3, etc).
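To give an idea of what such a replacement function does, here is a minimal sketch of a shift-and-add multiply. The name soft_mulsi3 is made up for this example and this is not the actual C library source; it only shows the general technique a mulsi3-style routine can use when the CPU has no mult instruction.

```c
#include <stdint.h>

/* Hypothetical stand-in for a mulsi3-style helper: returns the low 32 bits
 * of a * b using only shifts and adds. Because only the low 32 bits are
 * kept, the same routine gives the right answer for signed operands in
 * two's complement. */
uint32_t soft_mulsi3(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    while (b != 0) {
        if (b & 1u) {
            result += a;   /* add the shifted multiplicand for each set bit */
        }
        a <<= 1;           /* next bit of b is worth twice as much */
        b >>= 1;
    }
    return result;
}
```

A loop like this can take up to 32 iterations per multiply, which is fine for occasional address arithmetic but far too slow for an inner decode loop.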
Note: GCC was modified to provide the option to disable the MIPS mult & div instructions as well as to enable turning off the patented unaligned memory access instructions.
A separate single cycle Xilinx/Spartan6 specific multiplier unit is instantiated as a memory mapped peripheral to provide the fast multiplication that the MP3 algorithm demands.
The FPGA-Audio board has an 8.192MHz oscillator which is used directly to clock the DAC (MCLK) and also internally multiplied using a Xilinx DCM to 57.34MHz to drive the rest of the SOC & CPU core.
Unfortunately, the 8.192MHz clock means that playback of a 44.1kHz song is around 3% too slow, as the DAC MCLK frequency should be 8.467MHz (192 x the 44.1kHz LRC clock).
It doesn’t seem too noticeable to me however!
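The clock arithmetic above can be checked with a few lines of C. This is only a sketch: the function names are invented for the example, and the x7 DCM ratio is an assumption on my part (8.192MHz x 7 = 57.344MHz, which agrees with the quoted 57.34MHz, but the actual DCM settings are not stated here).

```c
#include <stdio.h>

#define MCLK_PER_LRC  192.0  /* DAC runs MCLK at 192 x the sample (LRC) rate */
#define DCM_RATIO     7.0    /* ASSUMPTION: overall DCM multiply ratio */

/* MCLK the DAC would need for a given sample rate. */
double ideal_mclk_hz(double sample_rate_hz)
{
    return MCLK_PER_LRC * sample_rate_hz;
}

/* Sample rate actually produced by a given MCLK. */
double effective_sample_rate_hz(double mclk_hz)
{
    return mclk_hz / MCLK_PER_LRC;
}

/* How much slower than intended playback is, in percent. */
double slowdown_percent(double mclk_hz, double sample_rate_hz)
{
    return (1.0 - mclk_hz / ideal_mclk_hz(sample_rate_hz)) * 100.0;
}

/* SOC / CPU clock derived from the oscillator (under the assumed ratio). */
double core_clock_hz(double osc_hz)
{
    return osc_hz * DCM_RATIO;
}

void print_clock_summary(void)
{
    const double osc = 8192000.0;
    printf("ideal MCLK @ 44.1kHz : %.0f Hz\n", ideal_mclk_hz(44100.0));
    printf("effective sample rate: %.2f Hz\n", effective_sample_rate_hz(osc));
    printf("playback slowdown    : %.2f %%\n", slowdown_percent(osc, 44100.0));
    printf("core clock (assumed) : %.0f Hz\n", core_clock_hz(osc));
}
```

With the 8.192MHz oscillator the effective sample rate works out to about 42.67kHz, i.e. roughly 3.25% slow for 44.1kHz material.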
The soft core 32-bit CPU is called MPX. The name is derived from the fact that it implements the majority of the MIPS-I ISA, meaning that it can make use of an existing GCC port for the MIPS architecture (MIPS is a trademark of MIPS Technologies which I am in no way affiliated with).
MPX can execute 1 instruction every cycle except for memory access instructions which take 2 cycles (see below). It ‘features’ both a load delay slot & a branch delay slot.
When executing from internal single cycle memory, MPX is able to achieve 65.57 DMIPS (Dhrystone 1.1) @ 57MHz.
MPX is implemented with a 4 stage pipeline.
As the architecture has a branch delay slot, by the time a branch is resolved in stage 2 an instruction fetch for PC+4 has already been scheduled in stage 1, so no part of the pipeline has to be flushed on a branch operation.
MPX is a pipelined Von-Neumann architecture (shared data & instruction bus) which lends itself to connecting to single ported RAM / external memory interfaces.
An improvement would be to switch to a Harvard architecture & arbitrate concurrent accesses to the same single ported memory (e.g. external async SRAM), and allow for fully concurrent accesses to distinct memories/peripherals (e.g. executing from external memory whilst accessing internal memory).
As MPX is currently a Von-Neumann architecture, memory access instructions cause a ‘bubble’ instruction to be inserted into the pipeline. Interrupts are also a source of pipeline bubbles.
All other data hazards in the pipeline are resolved by forwarding logic.
An instruction/data memory pause (or cache miss) results in the pipeline being stalled.
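As a rough back-of-envelope model of the above: if every instruction takes 1 cycle and each load/store adds one bubble, the cycle count is simply instructions plus memory accesses. The sketch below encodes just that; the function names are made up for illustration, and it deliberately ignores interrupt bubbles and memory stalls, so treat it as a best case for code running from single cycle memory.

```c
#include <stdint.h>

/* Best-case cycle estimate: 1 cycle per instruction, plus 1 bubble cycle
 * for each load/store (which are already counted in 'instructions').
 * Ignores interrupts and memory pause stalls. */
uint64_t mpx_cycles_estimate(uint64_t instructions, uint64_t mem_ops)
{
    return instructions + mem_ops;
}

/* Average cycles per instruction under the same simplified model. */
double mpx_cpi_estimate(uint64_t instructions, uint64_t mem_ops)
{
    if (instructions == 0) {
        return 0.0;
    }
    return (double)mpx_cycles_estimate(instructions, mem_ops)
           / (double)instructions;
}
```

For example, a block of 1000 instructions of which 250 are loads or stores would take 1250 cycles under this model, a CPI of 1.25.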
The current implementation of MPX in the FPGA-Audio project does not make use of caches; internal memory is tight on the FPGA and can be better used for audio FIFOs & SD DMA buffers than for a cache.
The MP3 algorithm was profiled and high frequency / critical functions & data are relocated to internal single cycle block RAM, so a cache would not improve the performance for these anyway.
The FPGA-Audio player is able to play MP3 & WAV files, and is capable of playing 320kbps MP3s & uncompressed WAV files smoothly.
MP3s with a bitrate of 320kbps (stereo) use around 96% of CPU time, but the remaining 4% of free CPU cycles is enough for the decode task to produce data faster than it is consumed.
This extra data is stored in a queue of audio buffers to be loaded into the 2K audio FIFO by the I2S driver when space is available.
For comparison, a mono 64kbps MP3 uses around 55% of CPU time.
WAV files skip the MP3 decoder; instead, data is loaded from the file system then fed straight into the audio FIFO, which expects 16-bit stereo PCM data.
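The producer/consumer relationship between the decode task and the audio FIFO can be sketched as a small ring buffer of 16-bit stereo samples. This is a software illustration only: the real audio FIFO is a hardware block in the FPGA, and the type names, function names and FIFO_DEPTH below are assumptions made for the example rather than anything taken from the project source.

```c
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 512u   /* ASSUMPTION: entries, must be a power of two */

typedef struct {
    int16_t left;
    int16_t right;
} stereo_sample_t;

typedef struct {
    stereo_sample_t data[FIFO_DEPTH];
    uint32_t head;   /* free-running write index */
    uint32_t tail;   /* free-running read index  */
} audio_fifo_t;

/* Number of samples currently queued. */
uint32_t fifo_count(const audio_fifo_t *f)
{
    return f->head - f->tail;
}

/* Number of samples that can still be written. */
uint32_t fifo_space(const audio_fifo_t *f)
{
    return FIFO_DEPTH - fifo_count(f);
}

/* Producer side: copy in as many samples as fit, return how many were taken. */
size_t fifo_push(audio_fifo_t *f, const stereo_sample_t *src, size_t n)
{
    size_t written = 0;

    while (written < n && fifo_space(f) > 0) {
        f->data[f->head % FIFO_DEPTH] = src[written++];
        f->head++;
    }
    return written;
}

/* Consumer side: pop one sample, return 0 if the FIFO is empty (underrun). */
int fifo_pop(audio_fifo_t *f, stereo_sample_t *out)
{
    if (fifo_count(f) == 0) {
        return 0;
    }
    *out = f->data[f->tail % FIFO_DEPTH];
    f->tail++;
    return 1;
}
```

As long as the producer keeps fifo_count() above zero, the consumer never underruns, which is why even a small surplus of CPU time is enough for smooth playback.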
MP3 decoding is provided by the open-source Helix MP3 decoder library.
This 3rd party C library provides MPEG compliant decoding of MP3s and the version used is a fixed point implementation that was optimized for ARM CPUs.
The port of this software to MPX requires no assembly optimization; the system is fast enough to execute the algorithm without rewriting any sections in ASM.
FPGA-Audio uses a 32×32 hardware pipelined multiplier block created via Xilinx ISE’s Coregen which provides a 64-bit result in 1 cycle using the Spartan6’s DSP48 slices.
The hardware multiplier unit is key to being able to play high bit-rate (or even low bit-rate) MP3s in real-time due to the vast number of multiplications done as part of the MP3 decode algorithm.
When profiling the software on the software simulator, a 1 second MP3 clip @ 128kbps performed 290,304 multiply add operations (MADD64) and 46,080 multiply shift operations (MULSHIFT32).
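For reference, the two operations counted above boil down to the following in plain C. This is a sketch of their usual fixed point meaning, not the Helix source itself: the lower-case function names are my own, and on MPX the multiply inside them would be performed by the memory mapped multiplier rather than by a C `*` on a 64-bit type.

```c
#include <stdint.h>

/* MULSHIFT32-style operation: multiply two signed 32-bit values and keep
 * the top 32 bits of the 64-bit product.
 * Note: right-shifting a negative value is implementation-defined in C;
 * mainstream compilers use an arithmetic shift, which is assumed here. */
static inline int32_t mulshift32(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* MADD64-style operation: 64-bit accumulate of a signed 32x32 product. */
static inline int64_t madd64(int64_t sum, int32_t x, int32_t y)
{
    return sum + (int64_t)x * (int64_t)y;
}
```

Taken together, the profile figures above come to 336,384 multiplies for a single second of 128kbps audio, all of which need the full 64-bit product.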
The audio file is read from a FAT32/16 formatted micro SD card using the previously developed ‘FAT File I/O Library’ (see here…).
At the lowest level, the SD card is accessed in SPI mode, where file system data is loaded into a dual-port RAM block using DMA from the SPI-Master peripheral.
Upon transfer complete, an interrupt is generated so that the software can continue doing more useful things than polling for SPI transfer complete.
The dual-port RAM block is memory mapped so the processor can quickly access & manipulate file data in-situ.
The RTOS was written mainly in portable C with a small amount of assembly code for CPU specific context save & restore. It has also been ported to ARM ARM7TDMI, ARM Cortex M3, TI MSP430 & Atmel AVR processors.
In this system, the RTOS is pre-emptive with a 1ms tick time, and features support for interrupts, mailboxes, semaphores & mutexes.
The FPGA-Audio project uses the RTOS to allow separate threads for reading / decoding MP3 files & audio playback.
Audio playback is a high priority task that is interrupt driven by the audio FIFO block.
SD card access is also interrupt driven meaning that no polling of peripherals is required in the project.
Verification & hunting for bugs in the MPX core probably took most of the project time.
Verification was aided by an instruction set simulator of the MPX core to allow proving of the instruction decode / execute logic, which in conjunction with a GDB stub, allowed a ‘pleasant’ environment for debugging the software.
In addition to the instruction set simulator, the RTL code was used with Verilator which is an open source tool that allows Verilog to be compiled to a C++ model.
The ‘Verilated’ model could then be used to execute code in a cycle accurate way (with peripherals) and also be run in tandem with the instruction set simulator to allow for co-simulation.
Co-simulation is a useful way of catching inconsistencies between the ISS & the RTL.
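The core of a co-simulation check can be sketched as a lockstep comparison of what each model retires. The structure and function below are hypothetical and only illustrate the idea; the real framework steps the ISS and the Verilated model directly rather than comparing pre-recorded arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one retired instruction, as either model might
 * report it. */
typedef struct {
    uint32_t pc;         /* address of the retired instruction        */
    uint32_t reg_index;  /* destination register written, 0 if none   */
    uint32_t reg_value;  /* value written to that register            */
} retire_event_t;

/* Compare two traces event by event.
 * Returns the index of the first mismatch, or -1 if all n events agree. */
long first_divergence(const retire_event_t *iss,
                      const retire_event_t *rtl,
                      size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (iss[i].pc != rtl[i].pc ||
            iss[i].reg_index != rtl[i].reg_index ||
            iss[i].reg_value != rtl[i].reg_value) {
            return (long)i;
        }
    }
    return -1;
}
```

Stopping at the first divergence matters: once the two models disagree on one register write, everything after it tends to differ as well, so only the first mismatch points at the actual bug.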
– RTL including simulation framework & FPGA project
– Software source
– GCC modifications
– FPGA-Audio Schematic (CC BY-SA 2.0)
– FPGA-Audio Layout (CC BY-SA 2.0)
– PCB Gerber Files (CC BY-SA 2.0)