An ASIC-Based Artificial Neural Network Applied Real-time Speech Recognition SOPC

Lam D. Pham, Hieu M. Nguyen, Du N. N. T. Nguyen and Trang Hoang

Abstract—The Artificial Neural Network (ANN) has become one of the major schemes applied in pattern recognition. Many software-based platforms have demonstrated the strong performance of ANNs. However, pattern recognition systems that integrate a hardware-based ANN architecture have been limited not only by silicon requirements such as frequency, area, power, and resources but also by strict demands for high accuracy and real-time operation. Although a considerable number of hardware-based ANN architectures have been proposed, they offer limited functionality because they support only small configurations and have little ability to be reconfigured. Consequently, achieving an effective hardware-based ANN architecture that meets strict accuracy, large configurations, and silicon area constraints as well as the real-time criterion of pattern recognition systems remains challenging. To tackle these issues, this work proposes a dynamic three-layer ANN architecture that can be reconfigured to adapt to various real-time applications. In addition, a complete SOPC system integrating the proposed ANN hardware is implemented for automatic Vietnamese speech recognition and achieves a recognition rate of around 95.2 % on 20 discrete Vietnamese words. Moreover, experimental results on this ASIC-based architecture show a maximum frequency of 250 MHz on 130 nm technology as well as a strong ability of reconfiguration.

Index Terms—Artificial Neural Network (ANN), System on a Programmable Chip (SOPC), floating point, Booth algorithm, activation function.

1. Introduction

Although Moore's law still holds and transistor density continues to increase, applications that use ANNs consisting of an enormous number of single neurons exceed the available resources. As a matter of fact, the authors of [1] and [2] presented critical issues related to the shortage of hardware resources, which need to be resolved in future work. In order to configure a large ANN successfully while still meeting silicon requirements, popular methods apply approximate models to reduce resource utilization. In particular, most techniques combine fixed-point data formats with approximated activation functions in the structure of a single neuron, as shown in [3] - [6]. To be more specific, these papers approximate nonlinear functions such as Sigmoid and Tan-Sigmoid by linear functions so that hardware-based architectures can be implemented feasibly. Nevertheless, the trade-off between accuracy and resource usage becomes a major concern whenever complex systems integrating ANN hardware strictly require high accuracy. An ANN comprising a considerable number of single neurons requires an enormous number of operations, so too many approximated formulas lead to a critical loss of accuracy. Moreover, training a pattern recognition machine based on the ANN algorithm produces a wide variety of weight coefficients in real-number format. As a result, if the fixed-point data format used to approximate the nonlinear activation functions is also used throughout the whole system, the measurement uncertainty will inevitably degrade performance. TABLE 1 compares the approximated methods with the original Sigmoid. These errors increase gradually as data passes through further single neurons in the network.

\[
E_{eva} = \frac{\sum_{i=1}^{N} | \hat{f}(x_i) - f(x_i) |}{N} \quad (1)
\]

\[
E_{max} = \max_{i} | \hat{f}(x_i) - f(x_i) | \quad (2)
\]

where \( \hat{f}(x_i) \) is the result of the hardware implementation, \( f(x_i) \) is the theoretically expected value, and \( N \) is the number of sample points.
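The two error measures can be sketched in a few lines of Python. This is only an illustration of the definitions; the function names `mean_abs_error` and `max_abs_error` are hypothetical and are not part of the hardware design.

```python
def mean_abs_error(approx, exact):
    """E_eva: mean of |f_hat(x_i) - f(x_i)| over all sample points."""
    if len(approx) != len(exact) or not approx:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - e) for a, e in zip(approx, exact)) / len(approx)


def max_abs_error(approx, exact):
    """E_max: largest |f_hat(x_i) - f(x_i)| over all sample points."""
    if len(approx) != len(exact) or not approx:
        raise ValueError("inputs must be non-empty and of equal length")
    return max(abs(a - e) for a, e in zip(approx, exact))
```

Here `approx` would hold the hardware (or approximated) outputs and `exact` the outputs of the reference function at the same sample points.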

In terms of ANN applications, a considerable number of published results based on SOPC systems built on an FPGA chip have confirmed their efficiency. It is vital that dynamic SOPC systems can be feasibly reconfigured to generate a larger ANN and obtain superior performance. In particular, the system in [7] showed a complete speech recognition SOPC configuring a large three-layer ANN, while another SOPC illustrated in [8] presented a smaller ANN configuration. Nonetheless, these solutions based on dynamically reconfigured SOPC systems have not completely solved the real-time issue. In particular, the research described in [9] also pointed out the trade-off between software and hardware implementations with respect to execution time.

TABLE 1 Performances of approximated methods

<table>
<thead>
<tr>
<th>Activation Function</th>
<th>Maximum Error ($E_{max}$)†</th>
<th>Mean Error ($E_{eva}$)†</th>
</tr>
</thead>
<tbody>
<tr>
<td>M.Jame [6]</td>
<td>0.0822</td>
<td>0.0596</td>
</tr>
<tr>
<td>Zhang [5]</td>
<td>0.0215</td>
<td>0.0076</td>
</tr>
<tr>
<td>Allipi [3]</td>
<td>0.0189</td>
<td>0.0087</td>
</tr>
<tr>
<td>Plan [4]</td>
<td>0.0189</td>
<td>0.0058</td>
</tr>
</tbody>
</table>

† $E_{eva}$ and $E_{max}$ are mean absolute error (MAE) and maximum absolute error, respectively.

TABLE 2 Artificial Neural Network configurations

<table>
<thead>
<tr>
<th>Previous Works</th>
<th>ANN Configuration</th>
<th>Technology</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dhirajkumar S. Jinde [10]</td>
<td>Three layers: 2-6-2</td>
<td>FPGA</td>
</tr>
<tr>
<td>Alexander Gomperts [12]</td>
<td>Four layers: 10-6-3-2</td>
<td>FPGA</td>
</tr>
<tr>
<td>Panca M. Rahardjo [14]</td>
<td>Three layers: 36-10-10</td>
<td>FPGA</td>
</tr>
<tr>
<td>Hassne Faiidd [16]</td>
<td>Three layers: 2-12-1</td>
<td>350 nm</td>
</tr>
</tbody>
</table>

As regards ANN configuration, TABLE 2 summarizes published ANN hardware architectures implemented on different platforms. Although they meet the real-time requirement, most of them reveal limited resources, with the largest configuration being three layers (36:10:10) in [14]. According to TABLE 2, most works adopt FPGA-based designs so that various ANN configurations can be set up flexibly, whereas ASIC-based designs have reported few results, mainly because they lack the ability to be configured dynamically.

Given these circumstances, an effective dynamic ANN architecture is proposed in this work to balance these issues. First and foremost, the floating-point data format defined by the IEEE 754 standard is applied to all formulas in the complete design to ensure high accuracy. In addition, the ASIC-based ANN architecture effectively combines a single neuron with a controller so as to provide flexible reconfiguration through input ports while meeting silicon requirements and the real-time specification. Furthermore, a word-level speech recognition SOPC integrating the proposed ANN hardware is implemented to confirm its higher efficiency in comparison with other works.

The rest of the paper is organized as follows. Section 2 introduces the techniques integrated in the proposed ANN hardware, namely the IEEE 754 standard, the Booth algorithm, and the applied ASIC design flow. Section 3 proposes the ANN architecture integrated in a complete SOPC system for speech recognition, and Section 4 describes the experimental results. Finally, Section 5 concludes the paper and discusses future work.

2. Applied Techniques

2.1. IEEE 754 floating point format

The floating-point data format is widely applied in a variety of CPUs and FUs in computer systems, as shown in [17] and [18], mainly to enhance accuracy. A floating-point number following the IEEE 754 standard has a total of 32 bits comprising 1 sign bit, 8 exponent bits, and 23 mantissa bits, as in Fig.1. In this work, this format is applied to all mathematical operations, namely addition and multiplication. Fig.2 to Fig.5 present the interfaces of the 32-bit floating-point adder and multiplier and the corresponding waveforms collected from simulation.
As regards the addition and multiplication architecture, a 23-bit binary carry-save adder (CSA) is required to perform the 23-bit mantissa addition, while the Booth algorithm is applied to the 23-bit multiplication instead of conventional array multiplication. With the IEEE 754 floating-point format, high accuracy is ensured over a wide range of real numbers.
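The 1/8/23 bit layout described above can be inspected with a short Python sketch. It only illustrates the single-precision format; the helper names `float32_fields` and `fields_to_float` are hypothetical, and the reconstruction covers normalized numbers only.

```python
import struct


def float32_fields(value):
    """Return (sign, biased exponent, mantissa) of value rounded to float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 mantissa (fraction) bits
    return sign, exponent, mantissa


def fields_to_float(sign, exponent, mantissa):
    """Rebuild the value of a normalized float32 from its three fields."""
    # Normalized numbers carry an implicit leading 1 in front of the fraction.
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)
```

For example, -0.15625 equals -1.25 × 2^-3, so its fields are sign 1, biased exponent 124, and a fraction of 0.25 stored as `0x200000`.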

2.2. Booth algorithm

The Booth algorithm, invented by Andrew Donald Booth, is a multiplication algorithm for large signed binary numbers that reduces the number of calculation steps and increases calculation speed. For the multiplication of real numbers in the IEEE 754 floating-point format, if conventional array multiplication is applied to the 23 mantissa bits, a critical path inevitably appears at the synthesis step of the ASIC design flow and limits the achievable frequency. Indeed, the Booth implementations presented by Puneet Paruthi [19] and Shaifali [20] confirmed higher performance than array multiplication. Moreover, the survey by Gokul Govindu also confirmed the use of the Booth algorithm for matrix multiplication on an FPGA-based architecture [21]. As a result, the Booth algorithm is adopted in this work to increase speed and achieve the highest possible maximum frequency.
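As an illustration of the recoding idea, the following Python sketch implements radix-2 Booth multiplication on signed integers. It is a behavioral model under stated assumptions, not the mantissa datapath of this work: the function name `booth_multiply` and the choice of radix 2 are assumptions, and the paper does not state which Booth variant is used in hardware.

```python
def booth_multiply(multiplicand, multiplier, bits):
    """Multiply two signed `bits`-wide integers with radix-2 Booth recoding."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not (lo <= multiplicand <= hi and lo <= multiplier <= hi):
        raise ValueError("operands must fit in the given signed width")
    product = 0
    prev_bit = 0  # implicit bit to the right of the multiplier LSB
    for i in range(bits):
        # Python integers sign-extend, so this reads the two's-complement bit i.
        cur_bit = (multiplier >> i) & 1
        if cur_bit == 1 and prev_bit == 0:
            product -= multiplicand << i  # start of a run of ones: subtract
        elif cur_bit == 0 and prev_bit == 1:
            product += multiplicand << i  # end of a run of ones: add
        prev_bit = cur_bit
    return product
```

A run of consecutive ones in the multiplier costs one subtraction and one addition instead of one addition per bit, which is where the reduction in calculation steps comes from.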

2.3. ASIC design flow

In order to obtain an ANN architecture that meets industrial standards, the complete design follows the ASIC design flow presented in Fig.6. All steps in this figure are supported by Synopsys tools, and the Verilog hardware description language is used for the design at register transfer level (RTL). However, pre-layout static timing analysis (pre-layout STA) is not performed, mainly because design for test (DFT) is not added to the proposed design. Besides, the complete design is verified at every level, including the RTL design, the pre-layout gate-level netlist, and the post-layout gate-level netlist, to ensure matching functions and the expected frequency. The target library of the ASIC design flow applied to the ANN architecture is a 130 nm library. Notably, at the RTL verification phase, not only is the VCS tool used for simulation, but the ANN architecture is also evaluated in practice on an FPGA (Altera Cyclone II) to confirm that the functions match.

3. Proposed Architecture

3.1. Single neuron architecture

Based on the electrical pulse behavior of a biological neuron, the most popular mathematical model of an artificial neuron, widely adopted in most pattern recognition machines, is shown in Fig. 7, where $W_{ki}$ and $X$ are the weight coefficients and the input vector (the extracted sound features in a speech recognition application, or the features in other applications), $f$ is the activation function, and $y_{ki}$ is the output of single neuron $i$ at layer $k$. The relation between inputs and output is given in (3):

$$y_{ki} = f(W_{ki}^{T} X) \quad (3)$$

with $X = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$; $W_{ki} = \begin{bmatrix} w_{0} \\ w_{1} \\ \vdots \\ w_{n} \end{bmatrix}$

Based on this mathematical model, an ANN includes many nodes known as single neurons, each of which comprises two simple components named the Weight Sum (WS) and Activation Function (AF) structures. In particular, the total sum of all products of inputs and weight coefficients is computed by the WS structure, while the activation function is calculated by the AF structure, as in Fig.8. In the WS structure, one floating-point adder is combined with one floating-point multiplier, as presented in Fig.9, to reduce the total resources. As regards activation functions, the Sigmoid and Tan-sigmoid functions in (4) and (5), which play the role of the AF, are currently transformed into linear functions so that they can be implemented in hardware feasibly.

\[ f(x) = \frac{1}{1 + e^{-x}} \quad (4) \]
\[ f(x) = \frac{2}{1 + e^{-2x}} - 1 \quad (5) \]

Normally, to implement these nonlinear functions in hardware, a fixed-point format and approximation methods are applied to reduce the area of the AF design while keeping the error low in comparison with the original function. However, the critical issue is that while the AF structure is triggered only once per single neuron, the WS structure requires an enormous number of additions and multiplications if the neuron receives a hundred inputs. Consequently, the accuracy of an ANN applied to speech recognition is negatively affected by the fixed-point technique. Moreover, in a larger ANN with many layers, the AF is called many times, which increases the measurement uncertainty. Indeed, the error of one layer is propagated to the next layers and inevitably increases the error in the final results.

In this work, an improvement of Amin's method that combines the IEEE 754 floating-point data format with the Booth algorithm is proposed to satisfy both the high accuracy requirement and the calculation speed. More specifically, the Sigmoid is approximated by piecewise linear functions \( (y = ax + b) \), each comprising one multiplication and one addition with real numbers, as in (6).

\[ f(x) = \begin{cases} 
1 & (x \geq 5) \\
0.0141217245x + 0.924396007 & (3.6 \leq x < 5) \\
0.04963536x + 0.800778912 & (2.3 \leq x < 3.6) \\
0.136783431x + 0.606760369 & (1 \leq x < 2.3) \\
0.231058579x + 0.504540715 & (0 \leq x < 1) \\
1 - f(|x|) & (x < 0) 
\end{cases} \quad (6) \]

Based on the Sigmoid approximation in (6), this activation function can be mapped feasibly to a hardware architecture as in Fig.10, in which floating-point multiplication and addition are the basic structures.

Besides, the coefficient register performs the comparison that selects the \( a, b \) coefficients.
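A software model of the approximation in (6) may help to check the coefficients and breakpoints. The sketch below uses the breakpoints 1, 2.3, 3.6, and 5; taking 3.6 as the lower bound of the uppermost linear segment is an assumption of this sketch, chosen because it makes the segment ranges contiguous. The names `SEGMENTS`, `sigmoid_pwl`, and `approximation_errors` are hypothetical, and the model runs in double precision rather than on the 32-bit hardware datapath.

```python
import math

# (lower bound, slope a, intercept b) for x >= 0, coefficients as printed in (6).
SEGMENTS = (
    (3.6, 0.0141217245, 0.924396007),
    (2.3, 0.04963536, 0.800778912),
    (1.0, 0.136783431, 0.606760369),
    (0.0, 0.231058579, 0.504540715),
)


def sigmoid_pwl(x):
    """Piecewise linear Sigmoid: select (a, b) by comparison, then a*x + b."""
    if x < 0:
        return 1.0 - sigmoid_pwl(-x)  # symmetry branch of (6)
    if x >= 5:
        return 1.0
    for lower, a, b in SEGMENTS:
        if x >= lower:
            return a * x + b
    raise ValueError("input is not a number")


def sigmoid(x):
    """Reference Sigmoid from (4)."""
    return 1.0 / (1.0 + math.exp(-x))


def approximation_errors(lo=-8.0, hi=8.0, steps=1600):
    """Return (E_max, mean absolute error) on a uniform grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    errs = [abs(sigmoid_pwl(x) - sigmoid(x)) for x in xs]
    return max(errs), sum(errs) / len(errs)
```

The loop over `SEGMENTS` plays the role of the coefficient register: a chain of comparisons picks one (a, b) pair, after which a single multiplication and a single addition produce the output.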

According to the Tan-sigmoid formula in (5), this function can be mapped to a hardware architecture as in Fig. 4 following the derivation in (7).

\[ f(x) = \frac{2}{1 + e^{-2x}} - 1 = 2 \times \frac{1}{1 + e^{-y}} + (-1), \quad y = 2x \quad (7) \]
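The identity in (7) means the Tan-sigmoid can reuse a Sigmoid block with a doubled input, a doubled output, and a constant offset. A minimal Python check of this reuse is shown below; the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Reference Sigmoid from (4)."""
    return 1.0 / (1.0 + math.exp(-x))


def tan_sigmoid(x, sigmoid_fn=sigmoid):
    """Tan-sigmoid built from a Sigmoid block: f(x) = 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid_fn(2.0 * x) - 1.0
```

Passing an approximated Sigmoid as `sigmoid_fn` gives the corresponding approximated Tan-sigmoid. Because of the factor 2, an absolute error of ε in the Sigmoid block becomes an absolute error of 2ε at the Tan-sigmoid output.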

As a result, the hardware architecture of a single neuron is the combination of the WS structure and the AF structure, as in Fig.12 and Fig.13, respectively.
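The behavior of such a neuron can be modeled in software as a multiply-accumulate loop followed by one activation call. The sketch below is an assumption-laden model rather than the RTL: `to_f32` emulates single-precision rounding after every operation, and the names `weighted_sum` and `neuron` are hypothetical.

```python
import math
import struct


def to_f32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


def weighted_sum(inputs, weights):
    """WS: one multiplier and one adder reused for every input/weight pair."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    acc = 0.0
    for x, w in zip(inputs, weights):
        product = to_f32(to_f32(x) * to_f32(w))  # floating-point multiplier
        acc = to_f32(acc + product)              # floating-point adder
    return acc


def neuron(inputs, weights, activation):
    """Single neuron: WS followed by one call of the AF."""
    return to_f32(activation(weighted_sum(inputs, weights)))
```

For instance, `neuron(xs, ws, lambda s: 1.0 / (1.0 + math.exp(-s)))` models a neuron with the exact Sigmoid, and any approximated activation can be passed in the same way.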

3.2. Artificial Neural network architecture

Regarding the ANN configuration in Automatic Speech Recognition (ASR) applications, most approaches rely on software-based systems in order to configure the desired ANN dynamically. In ASIC-based ANN architectures, by contrast, configurations are always fixed to suit specific targets. Therefore, the ability to reuse an ANN configuration or extend it to other applications has become one of the critical issues. For a large ANN configuration in a hardware design, a single neuron is instantiated many times to meet the timing requirement through parallel processing, which results in an undesirably large area. Besides, reconfiguring an ANN by instantiation requires an effective compiler tool together with inconvenient changes to the design at Register Transfer Level (RTL). In this research, a solution is proposed that can be reconfigured through parameters while still meeting the resource requirement. In particular, the three-layer ANN hardware in Fig.14 is introduced with the ability of reconfiguration through input ports. According to Fig.14, only one neuron takes the role of all the neurons that belong to the same layer. Therefore, the single neuron of a layer is invoked repeatedly to execute the same function for all neurons in that layer, instead of instantiating many single neurons. As a result, while a single neuron takes the role of all neurons in one layer, its output must be stored for the calculation in the next layer. Hence, internal memories that store intermediate results are essential. Furthermore, a complex controller is also required to coordinate the read and write processes of the memories. How the ANN is controlled is specified in Fig.15, TABLE 3, and TABLE 4.

According to Fig.15, the WS structure at the hidden layer is invoked until all inputs and weight coefficients have been processed. The output of the WS structure is transferred to the AF structure to finish the calculation of one hidden neuron. In the hidden loop state, the controller checks how many single neurons have been calculated. If fewer than H neurons have been calculated, the WS structure at the hidden layer is invoked again. The steps for the output layer are similar.

Consequently, the combination of an accurate single neuron and a complex controller comprising many loop states not only makes it possible to build a large ANN that meets the targets of the ASR system but also confirms the ability of reuse. Because this dynamic ANN can be reconfigured through dedicated input ports that set the configuration parameters I, H, and O, the RTL design remains unchanged, which gives the designer flexibility and promotes reuse in an industrial environment.

### Table 3: ANN state machine description.

<table>
<thead>
<tr>
<th>State</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reset</td>
<td>Reset state.</td>
</tr>
<tr>
<td>Initial</td>
<td>Initialize the ANN functions.</td>
</tr>
<tr>
<td>Setup</td>
<td>Receive and store the setup parameters: I, H, O.</td>
</tr>
<tr>
<td>Hidden WS</td>
<td>Execute WS in hidden neuron.</td>
</tr>
<tr>
<td>Hidden AF</td>
<td>Execute AF in hidden neuron.</td>
</tr>
<tr>
<td>Hidden Loop</td>
<td>Check loop cycle at hidden neuron.</td>
</tr>
<tr>
<td>Out WS</td>
<td>Execute WS in output neuron.</td>
</tr>
<tr>
<td>Out AF</td>
<td>Execute AF in output neuron.</td>
</tr>
<tr>
<td>Out Loop</td>
<td>Check loop cycle at output neuron.</td>
</tr>
<tr>
<td>Finish</td>
<td>Finish ANN function.</td>
</tr>
</tbody>
</table>

### Table 4: Transferred condition of ANN state machine.

<table>
<thead>
<tr>
<th>State</th>
<th>Next State</th>
<th>Conditions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reset</td>
<td>Initial</td>
<td>Reset is inactive.</td>
</tr>
<tr>
<td>Initial</td>
<td>Setup</td>
<td>ANN is triggered.</td>
</tr>
<tr>
<td>Setup</td>
<td>Hidden WS</td>
<td>After loading the ANN parameters.</td>
</tr>
<tr>
<td>Hidden WS</td>
<td>Hidden AF</td>
<td>After finishing WS at hidden neuron.</td>
</tr>
<tr>
<td>Hidden AF</td>
<td>Hidden Loop</td>
<td>After finishing AF at hidden neuron.</td>
</tr>
<tr>
<td>Hidden Loop</td>
<td>Hidden WS</td>
<td>If not all neurons in hidden layer have been calculated yet.</td>
</tr>
<tr>
<td>Hidden Loop</td>
<td>Out WS</td>
<td>After finishing all neurons in hidden layer.</td>
</tr>
<tr>
<td>Out WS</td>
<td>Out AF</td>
<td>After finishing WS at output neuron.</td>
</tr>
<tr>
<td>Out AF</td>
<td>Out Loop</td>
<td>After finishing AF at output neuron.</td>
</tr>
<tr>
<td>Out Loop</td>
<td>Out WS</td>
<td>If not all neurons in output layer have been calculated yet.</td>
</tr>
<tr>
<td>Out Loop</td>
<td>Finish</td>
<td>After finishing all neurons in output layer.</td>
</tr>
<tr>
<td>Finish</td>
<td>Initial</td>
<td>After storing all final results in internal memory.</td>
</tr>
</tbody>
</table>
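The control flow of the state machine can be sketched as a small software model in which one neuron is reused for every neuron of a layer and the intermediate results are kept in lists that stand in for the internal memories. This is a behavioral illustration only: the function name `run_ann` is hypothetical, the weighted sum is computed in one step rather than cycle by cycle, and the transition from Out Loop to Finish is taken by analogy with the hidden loop.

```python
import math


def sigmoid(x):
    """Reference Sigmoid used as the activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def run_ann(inputs, w_hidden, w_out, activation=sigmoid):
    """Step through the controller states for one forward pass.

    w_hidden holds H weight rows of length I, w_out holds O rows of length H.
    Returns the O outputs and the list of visited states.
    """
    n_hidden, n_out = len(w_hidden), len(w_out)  # H and O loaded in Setup
    if n_hidden == 0 or n_out == 0:
        raise ValueError("each layer needs at least one neuron")
    hidden_mem, out_mem = [], []                 # internal memories
    trace = ["Reset", "Initial", "Setup"]
    state, acc = "Hidden WS", 0.0
    while state != "Finish":
        trace.append(state)
        if state == "Hidden WS":
            row = w_hidden[len(hidden_mem)]
            acc = sum(w * x for w, x in zip(row, inputs))
            state = "Hidden AF"
        elif state == "Hidden AF":
            hidden_mem.append(activation(acc))   # stored for the output layer
            state = "Hidden Loop"
        elif state == "Hidden Loop":
            state = "Hidden WS" if len(hidden_mem) < n_hidden else "Out WS"
        elif state == "Out WS":
            row = w_out[len(out_mem)]
            acc = sum(w * h for w, h in zip(row, hidden_mem))
            state = "Out AF"
        elif state == "Out AF":
            out_mem.append(activation(acc))
            state = "Out Loop"
        elif state == "Out Loop":
            state = "Out WS" if len(out_mem) < n_out else "Finish"
    trace.append("Finish")
    return out_mem, trace
```

Changing the configuration only changes the sizes of `inputs`, `w_hidden`, and `w_out`; the control code itself is unchanged, which mirrors the idea of reconfiguring I, H, and O without touching the RTL.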

3.3. Complete speech recognition system

Fig.16 illustrates a typical speech recognition scheme comprising training and recognition processes. To develop a high-performance speech recognition system, popular algorithms such as the Artificial Neural Network (ANN), Hidden Markov Model (HMM), Dynamic Time Warping (DTW), or Vector Quantization (VQ), which significantly affect both the training process and the recognition process, have been proposed. The ANN is applied in this work, together with the popular Back Propagation training algorithm. Most works carry out the training process on software-based systems such as high-performance computers, mainly because of the long training time and the complex training algorithms, as shown in [1] and [2]. In this work, the training process is likewise implemented in software beforehand, using Matlab.

One considerable aspect of the performance of an ASR system is the feature extraction method. According to Fig.16, before being processed by the decoding or training structure, the sound sample is preprocessed by the feature extraction structure to obtain its most distinctive features. At present, there are many solutions for feature extraction in speech recognition machines, such as Principal Component Analysis (PCA), Linear Prediction Coefficients (LPC), or Mel-Frequency Cepstral Coefficients (MFCC); of these, MFCC, which is applied in this paper, is more popular than the others. This work uses the MFCC configuration in TABLE 5, which details the parameters that strongly affect the quality of feature extraction.

Concerning the combination of MFCC and ANN, one issue is that the total number of MFCC vectors is too large to map directly onto the neurons of the input layer. Furthermore, the number of MFCC vectors is dynamic and depends on the length of the sound sample. Our own investigation found that averaging into four MFCC vectors, corresponding to 104 input neurons, gave a higher recognition probability than other MFCC vector configurations.
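One possible way to reduce a variable number of MFCC vectors to a fixed input size is sketched below. The paper does not detail the averaging scheme, so two assumptions are made here: each MFCC vector has 26 coefficients (inferred from 104 inputs for four vectors), and the frames are split into contiguous groups that are averaged element by element. The function name `pool_mfcc` is hypothetical.

```python
def pool_mfcc(frames, groups=4):
    """Average a variable number of MFCC frames into `groups` vectors, flattened."""
    if len(frames) < groups:
        raise ValueError("need at least one frame per group")
    dim = len(frames[0])
    pooled = []
    for g in range(groups):
        # Contiguous, nearly equal-sized slices of the frame sequence.
        start = g * len(frames) // groups
        stop = (g + 1) * len(frames) // groups
        chunk = frames[start:stop]
        pooled.extend(sum(f[k] for f in chunk) / len(chunk) for k in range(dim))
    return pooled
```

With 26 coefficients per frame and `groups=4`, the returned list always has 104 elements regardless of how long the sound sample is, so it can feed the 104 input neurons directly.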

In order to evaluate the performance of the proposed ANN hardware architecture as well as the complete ASR system, an SOPC configuration integrating the proposed ANN hardware is built with the Quartus II tool from Altera. Fig. 17 describes the proposed SOPC system in detail.

According to Fig.17, the ASR system is built with the SOPC utility, in which the MFCC structure is implemented in software on the Nios CPU core while the ANN-based recognition core, named the ANN decoder, is implemented in Verilog. First and foremost, the set of weight coefficients obtained from the training process in Matlab is loaded into the internal memories through the Control Panel.

As regards the real-time recognition process implemented on the SOPC system, an audio codec is integrated to record the human voice from a microphone. Next, the sound sample is filtered to reduce the number of samples and retain the useful samples for the recognition process before distinctive features are extracted by the MFCC method. In order to cut silence more efficiently, a combination of Zero-Crossing Rate (ZCR) and Zero Threshold Energy (ZTE) is proposed. After the MFCC vectors obtained from feature extraction are stored in internal memory, the ANN decoder is triggered to begin the recognition task. During decoding, the MFCC vectors and weight values are read from the internal memories and transferred to the ANN decoder block until the process finishes. Finally, feedback from the ANN decoder containing the recognition result is transferred to the Nios CPU, which reports the final result to the user through the LCD device.

4. Experiment Results
4.1. Approximated method and single neuron

### TABLE 6
**Performances of approximated methods**

<table>
<thead>
<tr>
<th>Activation Function</th>
<th>$E_{max}$</th>
<th>$E_{ave}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>M.Jame [6]</td>
<td>0.0822</td>
<td>0.0596</td>
</tr>
<tr>
<td>Zhang [5]</td>
<td>0.0215</td>
<td>0.0076</td>
</tr>
<tr>
<td>Allipi [3]</td>
<td>0.0189</td>
<td>0.0087</td>
</tr>
<tr>
<td>Plan [4]</td>
<td>0.0189</td>
<td>0.0058</td>
</tr>
<tr>
<td>Proposed Sigmoid Func.</td>
<td><strong>0.0124</strong></td>
<td><strong>0.0023</strong></td>
</tr>
</tbody>
</table>

$E_{ave}$ and $E_{max}$ are mean absolute error (MAE) and maximum absolute error, respectively.

First and foremost, the accuracy of the proposed approximate Sigmoid and Tan-Sigmoid functions is compared with other methods. Because the floating-point format is used for all operators in the WS and AF structures of the complete ANN hardware, TABLE 6 confirms that the proposed functions achieve the highest accuracy in terms of the maximum error $E_{max}$ and the mean absolute error $E_{ave}$. Nonetheless, they require more resources than the other methods, as shown in TABLE 7.

TABLE 7
Performances of approximated methods

<table>
<thead>
<tr>
<th>Activation Function</th>
<th>Slices</th>
<th>LUTs</th>
<th>DSPs</th>
<th>BRAMs</th>
</tr>
</thead>
<tbody>
<tr>
<td>M. Jame [6]</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Zhang [5]</td>
<td>93</td>
<td>86</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>Allipi [3]</td>
<td>127</td>
<td>218</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Plan [4]</td>
<td>109</td>
<td>138</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Proposed Sigmoid Func.</td>
<td>1420</td>
<td>2609</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

Although the results in TABLE 6 show low mean absolute errors for all solutions, they cover only the AF. Using a fixed-point format for weight coefficients and input vectors puts a low final error at risk, not only for one neuron but also for the complete ANN, because of the measurement uncertainty in the WS. In order to obtain a comprehensive view and a deeper comparison between the proposed neuron architecture and other approaches, random inputs comprising input vectors and weight coefficients are generated first. These inputs are then fed to different single neuron architectures, in which the approximate functions of [3], [4], [5], and [6] are applied, to estimate the differences from the original mathematical model. The experiments use 2000 random input generations, each producing 104 pairs of weight coefficients and input values. For every 20 generations, the average absolute error in (1) is calculated and plotted, which gives a total of 100 values. The weight coefficients and input vectors are constrained to the range from -1 to 1, mainly because most input vectors and weight coefficients obtained from MFCC and the training process fall within this range.
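The procedure described above can be mirrored in software as follows. This is a sketch under stated assumptions: inputs and weights are drawn uniformly from [-1, 1] with a fixed seed, the weighted sum is computed in double precision (so only the activation error is modeled, not fixed-point effects in the WS), and the activation is the piecewise linear Sigmoid of (6) with contiguous breakpoints at 1, 2.3, 3.6, and 5. The names `sigmoid_pwl` and `run_experiment` are hypothetical.

```python
import math
import random

# (lower bound, slope a, intercept b) for x >= 0, coefficients as printed in (6).
SEGMENTS = (
    (3.6, 0.0141217245, 0.924396007),
    (2.3, 0.04963536, 0.800778912),
    (1.0, 0.136783431, 0.606760369),
    (0.0, 0.231058579, 0.504540715),
)


def sigmoid_pwl(x):
    """Piecewise linear Sigmoid following (6)."""
    if x < 0:
        return 1.0 - sigmoid_pwl(-x)
    if x >= 5:
        return 1.0
    for lower, a, b in SEGMENTS:
        if x >= lower:
            return a * x + b
    raise ValueError("input is not a number")


def run_experiment(trials=2000, pairs=104, group=20, seed=0):
    """Return one averaged absolute error per `group` random trials."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(pairs)]
        ws = [rng.uniform(-1.0, 1.0) for _ in range(pairs)]
        s = sum(w * x for w, x in zip(ws, xs))    # weighted sum of one neuron
        exact = 1.0 / (1.0 + math.exp(-s))        # original mathematical model
        errors.append(abs(sigmoid_pwl(s) - exact))
    return [sum(errors[i:i + group]) / group for i in range(0, trials, group)]
```

With the default arguments the function returns 100 averaged values, one for each block of 20 random generations, which matches the number of points described in the text.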

Firstly, a single neuron with the Sigmoid function as AF is verified to compare the measurement uncertainties of the 5 approximated methods against the original mathematical model, as in Fig. 18, Fig. 19, and Fig. 20. The proposed method is evaluated on a hardware implementation at RTL, while the others are based on Matlab software.

According to Fig. 18, [5] shows much higher absolute errors than the other methods, with a maximum of 0.31, while [3], [4], and [6] differ only slightly. In more detail, Fig.19, which compares the absolute errors of [3], [4], and [5], shows maxima of 0.06, 0.08, and 0.045, respectively. The proposed method shows a maximum absolute error of 0.0015, which is nearly equal to the original model and much lower than the other methods, as in Fig.20.

Next, the Sigmoid function is replaced by the Tan-sigmoid function while the previous set of inputs is kept. The comparison results are presented in Fig. 21.

According to Fig. 21, the absolute errors are similar to those of the Sigmoid experiments, and the proposed method again has the lowest absolute error.

TABLE 6 and TABLE 8 reveal that although the maximum errors of the approximated methods do not differ much, the absolute errors through the full calculation of a single neuron integrated in a large ANN are considerable, and the proposed method has the lowest error among them. It is concluded that the accuracy of an ANN depends not only on the AF but also on the WS.

4.2. Complete Neural Network

In order to evaluate both the neural network and the full SOPC, 20 discrete Vietnamese words are first selected as in TABLE 9, and a total of 4000 sound samples are recorded, with 200 sound samples per word. The first 2000 samples are used for the training process and the remaining samples for the recognition process. The words are selected for simple control applications or for assisting blind people.

As regards the ANN configuration, a large number of configurations have been simulated to confirm that the final configuration with 104 input neurons, 60 hidden neurons, and 20 output neurons is the most suitable, as in TABLE 9. To be more specific, the input neuron number is simulated with 4 values, namely 78, 104, 130, and 156, which correspond to averaging the total MFCC vectors into 3, 4, 5, or 6 MFCC vectors. As regards hidden and output neurons, a range from 50 to 300 with a step of 10 is analyzed to determine the hidden neuron number, while the output neuron number is fixed at 20 to match the 20 discrete words of our application.

With the above configurations, experiments are simulated and the results are collected in Fig.22. According to the recognition probabilities in Fig.22, most probabilities above 94% are concentrated in the large configurations (156; 50-300; 20), while the lowest values occur at the small configurations (78; 50-300; 20), with a maximum of nearly 93%. Although the range (104; 50-300; 20) consists of intermediate configurations, it shows a high recognition probability compared with the other ranges, with maximum and minimum values of 97% and 92%, respectively. As a result, the final configuration (104; 60; 20), which belongs to the range (104; 50-300; 20) and represents the minimum configuration, is selected for the recognition process.

In order to estimate the accuracy of the proposed ANN configuration (104; 60; 20) in comparison with the mathematical model, a set of 1000 test cases, each with 104 pairs of input values and weight coefficients, is reused for the experiments. For every case, the 20 output values are compared with the original mathematical model to obtain the mean absolute error. The results in Fig.23 confirm high performance, with the measurement uncertainty between the hardware simulation of the proposed ANN and the original mathematical model staying below 0.0035.

Moreover, statistics on the internal memory used and the number of clock cycles are also collected to estimate the trade-off among silicon requirements. Increasing the ANN configuration, specifically the hidden neuron number from 20 to 250 while keeping the input and output neuron numbers constant at 104 and 20, increases both the internal memory requirement and the clock cycle count, as in Fig.24 and Fig.25.

According to Fig.24, increasing the total number of neurons through the setup parameters means that larger internal memories are required. Fig. 24 shows that the largest configuration (104; 250; 20) requires approximately 1 Mbit. The required memory capacity in bits is given by (8):

\[
\text{MemoryCapacity} = (I \times H + H \times O) \times 32 \quad (8)
\]

where I, H, and O are the numbers of neurons in the input, hidden, and output layers, respectively.
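Formula (8) is easy to evaluate for any configuration. The following helper is a small sketch; the function name `memory_bits` and the `word_bits` parameter are hypothetical, with 32 bits per stored value taken from (8).

```python
def memory_bits(n_in, n_hidden, n_out, word_bits=32):
    """Memory capacity in bits according to (8): one word per weight."""
    return (n_in * n_hidden + n_hidden * n_out) * word_bits
```

For the largest configuration (104; 250; 20) this gives 992,000 bits, which agrees with the figure of approximately 1 Mbit, and for the selected configuration (104; 60; 20) it gives 238,080 bits.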

However, configuring the ANN through input ports and using internal memories runs into the strict real-time requirement. The calculation loop needs more time as the number of neurons increases, especially the number of hidden neurons. Therefore, the number of clock cycles is investigated to confirm that the real-time requirement can be met. According to Fig.25, nearly 350000 clock cycles are required for the largest ANN configuration (104; 250; 20), which comprises 374 single neurons. However, with the obtained maximum frequency of 250 MHz, the total time for this largest configuration is only 1.4 ms, which satisfies the real-time requirement.

Besides, formula (9) for the number of clock cycles is derived from waveform simulations, where I, H, and O are the numbers of neurons in the input, hidden, and output layers, respectively.

\[
\text{No.of Clock} = 2 + (67 + 10 \times I) \times H + (67 + 10 \times H) \times O \quad (9)
\]
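Formula (9) can be evaluated directly to estimate the processing time at a given clock frequency. The following is a minimal sketch with hypothetical function names. It assumes one clock period per counted cycle at the 250 MHz maximum frequency reported in this section.

```python
def clock_cycles(i, h, o):
    """Evaluate formula (9): 2 + (67 + 10*I)*H + (67 + 10*H)*O."""
    return 2 + (67 + 10 * i) * h + (67 + 10 * h) * o


def latency_ms(i, h, o, freq_hz=250e6):
    """Processing time in milliseconds at clock frequency freq_hz."""
    return clock_cycles(i, h, o) / freq_hz * 1e3


cycles = clock_cycles(104, 250, 20)  # 328092 cycles
time_ms = latency_ms(104, 250, 20)   # about 1.31 ms
```

For (104; 250; 20), formula (9) gives 328092 cycles, or about 1.31 ms at 250 MHz. This is consistent with the figure of nearly 350000 cycles and the 1.4 ms bound quoted above.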

Next, the complete ANN is synthesized on 130 nm technology to confirm the silicon performance. TABLE 11 presents the physical results in comparison with another ANN presented in [13], an ASIC-based design on similar 130 nm technology. The achieved maximum frequency of 250 MHz exceeds the 220 MHz reported in [13]. Although the gate count of the proposed ANN is larger than that in [13], the proposed architecture supports a much larger configuration and dynamic setting, in contrast to the fixed configuration (16-4-16) in [13].

4.3. Full SOPC system integrating proposed ANN hardware

Finally, the complete SOPC integrating the proposed ANN hardware is built with Quartus II of Altera and tested on the DE2 kit, which integrates a Cyclone II FPGA, to demonstrate its recognition ability. The quality of the speech recognition system depends not only on the proposed ANN but also on the MFCC configuration. It is therefore reasonable to carry out experiments to decide MFCC parameters such as overlap rate, FFT number, or Mel filter number, as recommended in TABLE 5. The detailed information below, collected from the Quartus II synthesis reports, presents the final synthesized results of the SOPC system and confirms the recognition performance on the set of 2000 samples that was previously simulated on hardware.

Based on TABLE 12, although the proposed ANN hardware has an advantage in execution time, the complete SOPC system spends more time to finish all tasks, including the MFCC function, the audio controller, and the collection of recognition results. This shows that the Nios II performance is the drawback. For comparison with another system [22], TABLE 13 lists the differences in several aspects. Although [22] reports a higher speech recognition rate, it is achieved with a larger configuration and fewer words. Furthermore, as TABLE 14 shows, the proposed ANN not only meets silicon requirements such as high frequency, accuracy, and real-time operation on handheld devices but can also be extended to larger ANN configurations later. As for future work, a complete hardware-based ASR system is our target in order to achieve a powerful integrated circuit. Moreover, other applications integrating the proposed ANN will also be implemented to evaluate the ANN hardware performance.

### Acknowledgments

This work is funded by the Ministry of Science and Technology, State-level key program, Research for application and development of ITC, code KC.01.23/11-15.

### References


**Lam D. Pham** was born in Vung Tau city, Vietnam. He received the Bachelor of Engineering and Master of Science degrees in Electronics-Telecommunication Engineering from Ho Chi Minh City University of Technology in 2009 and 2012, respectively. During 2009-2012, he worked at Renesas VietNam Company as a system-level design engineer. Currently, he is a lecturer at the Faculty of Electrical-Electronics Engineering, Ho Chi Minh City University of Technology. His research focuses on Network on Chip, Speech Recognizer, and IC architecture.

**Hieu M. Nguyen** received the B.S. degree in Electronics and Telecommunication Engineering from Ho Chi Minh City University of Technology in 2014. During 2013 and 2014, he joined the Integrated Circuit Design Research and Education Center, where he studied analog and RF integrated circuit design. He also received the Award of Best Student in Analog IC Design for the design of a 24-bit Delta-Sigma ADC. He presently works as a Teaching and Research Assistant at the Department of Electronics Engineering, Faculty of Electrical-Electronics Engineering, Ho Chi Minh City University of Technology. His current research focuses mainly on low-power, high-speed, high-performance analog, mixed-signal, and RF integrated circuit design.

**Du N. N. T. Nguyen** is currently a senior student in Electronics and Telecommunication Engineering at Ho Chi Minh City University of Technology (HCMUT). Since 2012, he has worked as a member of the IC design laboratory in the Department of Electronics Engineering, HCMUT. His current project concentrates on Artificial Neural Network and real-time applications.

**Trang Hoang** was born in Nha Trang city, Vietnam. He received the Bachelor of Engineering and Master of Science degrees in Electronics-Telecommunication Engineering from Ho Chi Minh City University of Technology in 2002 and 2004, respectively. He received the Ph.D. degree in Microelectronics-MEMS from CEA-LETI and University Joseph Fourier, France, in 2009. From 2009 to 2010, he did postdoctoral research at Orange Lab-France Telecom. Since 2010, he has been a lecturer at the Faculty of Electrical-Electronics Engineering, Ho Chi Minh City University of Technology. His research interests include FPGA implementation, Speech Recognizer, IC architecture, MEMS, and fabrication.