ANN means Artificial Neural Network
ANN is known for its applications in Deep Learning

1 – ANN without hidden layer

This article will be illustrated with R language (installation details: https://vgir.fr/toolbox/ )

Disclaimer

An ANN without a hidden layer has no application. The aim of this chapter is:

Gentle introduction of the concept
Analogy of ANN in its less powerful « hidden layer » configuration with an already-seen model

Imaginary scenario

For example, an aeronautic company, Airbing, sells three types of planes:

Small Size: the airplane P320
Medium Size: the airplane P330
Big Size: the airplane P350

The aim is to see if the margin can be explained depending on the number of planes of each type sold each month. If there is a loss, a negative number applies.

Experiences of the scenario

Dataset

Hereunder is a generated dataset for this article corresponding to 10 months:

P320 sold	P330 sold	P350 sold	Margin (million €)
3	3	10	33
4	2	3	1
6	7	7	14
10	4	2	-38
3	8	3	22
9	5	4	-13
10	8	1	-22
7	10	4	-4
7	4	9	16
1	8	4	28

An observation is equivalent to a row of the dataset

Explanation of the first row of the dataset in the first month of the series: 3 PP320, 3 P330, and 3 P350 have been sold. The margin is +33 million dollars.

Generalization of the scenario

In a machine-learning context, we have three explicative variables: X1, X2, and X3. So, we are looking to explain variable Y.

Mapping:

Type of variable	Scenario Name	General Name
Input (or explanatory) variable	Plane P320	X1
Input (or explanatory) variable	Plane P330	X2
Input (or explanatory) variable	Plane P330	X3
Output (or target) variable	Margin	Y

Now we will use the variables X1, X2, X3, and Y.

Modelization of the scenario with an ANN without a hidden layer

Our first and most possible basic model contains four artificial neurons, represented as yellow circles:

Three input neurons
One output neuron

There is no neuron between the input and the output block, so it is called « no hidden layer ». At the very end of this article, you’ll discover two examples with 1 and 2 hidden layers.

This model aims to find four parameters:

For each row of the dataset, the operations in the figure below give a result close to Y.

When you want to distinguish a real value from an estimated value, add a circumflex accent
Example:
Ŷ : the estimated value
Y : the real value

An example

Let’s illustrate the above figure:

We arbitrarily choose a solution and then do a calculation and check if it fits nicely:

w1 = 1
w2 = 2
w3 = 3
w4 = 4

We define this model as [1;2;3;4]

The figure is updated accordingly:

The first row is:

X1	X2	X3	Y
3	3	10	33

So, let’s extract its input variables:

X1	X2	X3
3	3	10

And inject them in the figure:

This model’s result is 43. The real value from the dataset is 33. Hence, we have an error of 10 million between the modelized 43 million and the actual value of 33 million.

Application with R language

Hereunder the « .csv » source file:

ann Download

Let’s define the working directory

mydirectory <- "C:\\...my_directory..."
setwd(mydirectory)

We’ll work with data tables, so the library is required:

library(data.table)

Now we read the source file

data <- fread("ann.csv")

# definition of arbitrarily coefficients, the model is called {1,2,3,4}
w1 <- 1
w2 <- 2
w3 <- 3
w4 <- 4
# calcation of estimated values for model {1,2,3,4}
data$ESTIMATION <- data$X1 * w1 + data$X2 * w2 + data$X3 * w3 + w4
# Rename "Y" column into "REAL" for visibility
colnames(data)[colnames(data) == "Y"] <- "REAL"

The raw result is:

Metric of the scenario

Why the need for a metric?

To compare the reliability of several models, we need a metric.

The metric is a measurable value indicating the performance of a model

We have many options for determining the global difference; let’s describe three usual metrics:

Mean percentage error (called MPE)
Mean absolute percentage error (called MAPE)
Root mean square error (called RMSE)

How do we interpret the metric?

If the value is 0, then the model is perfect
The higher the value, the worse the model
Between 2 models, the one with the lowest metric is the better

Choice of a metric

Whatever of these three solutions, the aim is to find the parameters w0, w1, w2, and w4 which will minimize the result

Arbitrarily, we use the metric percentage error (called MAPE). Indeed, it is a very classical metric. https://en.wikipedia.org/wiki/Mean_absolute_percentage_error

# Use of a library in order to gain time in the calculation
library(Metrics)
metric <- mape(data$REAL, data$ESTIMATION)
metric

The MAE metric has a result of 4.47, which means we have a terrible global error of 447%.

Definitely, this model [1;2;3;4] with the hereunder coefficients gives an awful result.

w1 = 1
w2 = 2
w3 = 3
w4 = 4

We hope we can do better. Indeed, ANN aims to find the best coefficients, providing the minimum metric and the minimal error in modelization.

Optimizer of the scenario

Why the need for an optimizer

We know we need to find the values of w1, w2, w3, and w4 coefficients. It is not reasonable to try randomly a large number of values; an infinite number of combinations is possible.

Here, we need an optimizer.

An optimizer is a tool which aim is to find the optimum of a problem.

This optimum can be a minimal or a maximal solution.

It is required when no analytical solution is available (an analytical solution is applied through a precise calculation which provides the best solution.

Optimizer is looking for a numeric solution (a numeric solution may find only a local solution instead of the global solution.), usually through iterations.

We will provide the optimizer with the metric we want to minimize. As seen in the definition, we are still determining whether it will find the global minimum value. The function may be too complex, or the optimizer may not be optimized for this function. At least, it should find a local minimum.

We arbitrarily chose the optim function in the R language as an optimizer. This is a simple optimizer, which does not suit deep learning. Indeed, we only need to understand the concept of ANN and whether any optimizer is suitable.

First, we define the function to optimize.

Beware of the new paradigm
– Before, in the example w1, w2, w3, w4 were fixed values
– Now w1, w2, w3, w4 become some variables, while X1, X2, X3 are fixed

Let’s name it fmetric, and we know it depends on {w1,w2,w3,w4}

Remember, we’ll need to minimize fmetric, as we want w1, w2, w3, and w4 to make the error as small as possible.

Previously, we have written metric <- mape(data$REAL, data$ESTIMATION). data$REAL is known. We need to calculate data$ESTIMATION depending on {w1,w2,w3,w4}

Let’s detail data$ESTIMATION:

X1	X2	X3	ESTIMATION
3	3	10	W13+W23+W3*10+W4
4	2	3	W14+W22+W3*3+W4
6	7	7	…
10	4	2	…
3	8	3	…
9	5	4	…
10	8	1	…
7	10	4	…
7	4	9	…
1	8	4	…

One last thing: the function optim is working with an input vector. Instead of four input parameters w1, w2, w3, w4, we are providing it one vector w of four values

We are ready to code everything above in R:

# definition of the function fmetric. It depends on the vector w, containing 4 numeric values w[1],  w[2],  w[3],  w[4],
fmetric <- function(w) { 
  # in case of debugging, why not giving another name for the data table
  tempdata <- data
  # calculation of all estimations values
  tempdata$ESTIMATION <- w[1] * tempdata$X1 + w[2] * tempdata$X2 + w[3] * tempdata$X3 + w[4]
  # the function returns the MAPE
  return(mape(tempdata$REAL, tempdata$ESTIMATION))
  }

Let’s check the result is the same for {1,2,3,4}

myweight <- c(1,2,3,4)
fmetric(myweight)

We have the same result, and we assume fmetric is correct. We are ready to optimize it, and the R function optim requires:

The starting points w1, w2, w3, w4. We have no idea how to choose some good initial points. Hence, arbitrarily, we chose {0,0,0,0}
The function fmetric to optimize
By default, optim is minimizing; it’s what we seek

set.seed(123)
optim( c(0,0,0,0), fmetric)

The best possible metric is 0.201, indeed 20.1% of MAPE
With the values
w1 = -5.57
w2 = 0.84
w3 = 4.98
w4 = 6.64

Let’s find the estimations according to the weights:

tempdata <- data
tempdata$ESTIMATION <- tempdata$X1 * result_par[1] + tempdata$X2 * result_par[2] + tempdata$X3 * result_par[3] +  result_par[4]
tempdata

mape(tempdata$REAL, tempdata$ESTIMATION)

Analysis of the no-hidden layer ANN

Disclaimer: with no hidden layer, the analysis is oversimplified
Remind this article is a ramp-up before going deeper into the complexity of the ANN

Considering the starting hypothesis is:

This means the estimated value is calculated with

Yestimated = W1*X1+W2*X2+W3*X3+W4

Yestimated = -5.58*X1+0.84*X2+4.98*X3+6.64

The explanation given by the model is:

5.58 million€ loss per plane P320 sold each month.
0.84 million€ gain per plane P330 sold each month.
4.98 million€ gain per plane P340 sold each month.
6.64 million€ fixed gain each month: some fixed training, certification, maintenance, earned by Airbing?

We managed to have a global accuracy of 20.2%.

For this very very light ANN, the model is explainable.

A feeling of « déjà vu »?

Instead of MAPE metric, if you choose the RSE (Root Square error) metric, translate:

weight into coefficient
bias into intercept

And then you have a linear regression, described in this article:

https://vgir.fr/linear-regression/

Indeed, the linear regression provides an analytical solution:

It calculates in one time the solution.
It provides the best solution.

The no-hidden layer has provided a numerical solution

It calculates many solutions with many iterations.
At the last iteration, it is not sure to find the best solution.

Indeed, why is applying a solution less performant than the linear regression?

Remember this example has no hidden layer. We will increase the complexity

ANN is complex. An adequate learning curve is required:

By Alanf777 – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25252380

2 – One ANN with 1 hidden layer

As you know now how an ANN without a hidden layer has been built, we create an ANN with seven artificial neurons including one hidden layer:

Three input neurons
Three hidden layer neurons
One output neuron

The aim of this model is now to find 16 parameters:

w1	w5	w9	w13
w2	w6	w10	w14
w3	w7	w11	w15
w4	w8	w12	w16

Such that, for each row of the dataset, the operations in the hereunder figure give a result close to Y

For example, we arbitrarily choose a solution and then do a calculation and check if it fits nicely:

Let’s illustrate the above figure:

w1=1	w5=-1	w9=2	w13=-2
w2=2	w6=2	w10=-3	w14=2
w3=-1	w7=1	w11=2	w15=-1
w4=-2	w8=1	w12=1	w16=1

We define this model as [1,2,-1,-2,-1,2,1,1,2,-3,2,1,-2,2,-1,1]

Reminder: the dataset is

P320 sold	P330 sold	P350 sold	Margin (million €)
3	3	10	33
4	2	3	1
6	7	7	14
10	4	2	-38
3	8	3	22
9	5	4	-13
10	8	1	-22
7	10	4	-4
7	4	9	16
1	8	4	28

We extract the first row input variables:

X1	X2	X3
3	3	10

First, we set the weights:

Then, the intermediate calculations: for example, the first intermediate neuron has a value of 3*1 + 3*2 + 10*-1 -2 = -3

Then the final value: for instance, the output neuron has the value -3*-2 + 12*2 + 18*-1 +1 = 13

Hence the estimated value of the model [1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1] is 13 instead of a real value of 33 million €

The MAPE error for this value is (13-33)/33 = 60.6%.

Application with R language

Hereunder the “.csv” source file:

ann Download

Let’s define the working directory:

mydirectory <- "C:\\...my_directory..."
setwd(mydirectory)

We’ll work with data tables, so the library is required:

library(data.table)

Now we read the source file:

data <- fread("ann.csv")

# definition of arbitrarily coefficients, the model is called  [1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1]
# we directly use an vector, in order not to create 16 different variable
w <- c(1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1)
tempdata <- data
# calcation of estimated values for model [1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1]
tempdata$H1 <- ( tempdata$X1 * w[1] + tempdata$X2 * w[2] + tempdata$X3 * w[3] + w[4] )
tempdata$H2 <- ( tempdata$X1 * w[5] + tempdata$X2 * w[6] + tempdata$X3 * w[7] + w[8] )
tempdata$H3 <- ( tempdata$X1 * w[9] + tempdata$X2 * w[10] + tempdata$X3 * w[11] + w[12] )
tempdata$ESTIMATION<- ( tempdata$H1 * w[13] + tempdata$H2 * w[14] + tempdata$H3 * w[15] + w[16] )
# Rename "Y" column into "REAL" for visibility
colnames(tempdata)[colnames(tempdata) == "Y"] <- "REAL"

Choice of a metric

We keep the same metric than in zero hidden layer: the MAPE (Moving Average Percentage Error)

We check the individual MAPE:

tempdata$MAPE <- abs( (tempdata$REAL -  tempdata$ESTIMATION) / tempdata$REAL )

It is same for the first row, 60.6% than previously

library(Metrics)
metric <- mape(tempdata$REAL, tempdata$ESTIMATION)
metric

The MAPE error is 168%, the model is disappointing

Optimizer of the scenario

We keep the optim function from the R language. As we will jump from 4 variables to 16 variables:

The optimizer will need more time.
The optimizer is more likely to find a sub-optimal solution.

Remember, we’ll need to minimize fmetric, as we want w1…w16 such that error is as small as possible.

Previously, we have written metric <- mape(data$REAL, data$ESTIMATION). data$REAL is known. We need to calculate data$ESTIMATION depending on {w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12,w13,w14,w15,w16}

Let’s detail data$ESTIMATION:

X1	X2	X3	ESTIMATION
3	3	10	(W13+W23+W310+W4)W13 + (W53+W63+W710+W8)W14 + (W93+W103+W1110+W12)W15 + W16
4	2	3	(W14+W22+W33+W4)W13 + (W54+W62+W73+W8)W14 + (W94+W102+W113+W12)W15 + W16
6	7	7	…
10	4	2	…
3	8	3	…
9	5	4	…
10	8	1	…
7	10	4	…
7	4	9	…
1	8	4	…

We are ready to code everything above in R:

# definition of the function fmetric. It depends on the vector w, containing 4 numeric values w[1],  w[2],  w[3],  w[4],
fmetric <- function(w) { 
  # in case of debugging, why not giving another name for the data table
tempdata <- data
  # calculation of all estimations values
tempdata$H1 <- ( tempdata$X1 * w[1] + tempdata$X2 * w[2] + tempdata$X3 * w[3] + w[4] )
tempdata$H2 <- ( tempdata$X1 * w[5] + tempdata$X2 * w[6] + tempdata$X3 * w[7] + w[8] )
tempdata$H3 <- ( tempdata$X1 * w[9] + tempdata$X2 * w[10] + tempdata$X3 * w[11] + w[12] )
tempdata$ESTIMATION<- ( tempdata$H1 * w[13] + tempdata$H2 * w[14] + tempdata$H3 * w[15] + w[16] )
  # the function returns the MAPE
  return(mape(tempdata$REAL, tempdata$ESTIMATION))
  }

Let’s check the result is same for the model [1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1]

colnames(tempdata)[colnames(tempdata) == "Y"] <- "REAL"
myweight <- c(1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1)
fmetric(myweight)

Good, we fing back the previous globale MAPE of 168%

We have the same result; we assume fmetric is correct. We are ready to optimize it, and the R function optim requires:

starting points w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12,w13,w14,w15,w16 . Right now, we have no clue for choosing some good initial points. Hence, arbitrarily, we chose {0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0}
the function fmetric to optimize
by default, optim is minimizing, it’s what we seek

set.seed(123)
optim( rep(0,16), fmetric)

Bad luck, the error is 67.5%, it is worst than with no-hidden layer which is 20.1%!

Maybe it is linked to the starting point? Let’s try two others:

set.seed(123)
optim( rep(1,16), fmetric)

set.seed(123)
optim( rep(-1,16), fmetric)

With the starting points (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1) and (-1,-1,-1,-1,1,1-,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1) it is even worst.

Let’s have an overview:

In theory our model with 1 hidden layer cannot be worst than our model with no hidden layer.

Explanation of reality different from theory:

Set :

w1 = 1 / w2 = 0 / w3 = 0 / w4 =0 => H1 as same value than X1
w5 = 0 / w6 = 1 / w7 = 0 / w8 =0 => H2 as same value than X2
w9 = 0 / w10 = 0 / w11 = 1 / w12 =0 => H3 as same value than X3

It means this is like:

With the transcodification:

w13[1-layer] becomes w1[0-layer]
w14[1-layer] becomes w2[0-layer]
w15[1-layer] becomes w3[0-layer]
w15[1-layer] becomes w3[0-layer]

Our model with no layer is a sub-element of our model with 1 layer.
Indeed, the model with 1 hidden layer proposes maybe calculation combinations, including the the ones from the model without hidden layer

Hence, in theory, the model with one hidden layer is at least as performing as the model with two hidden layers.

Why a model with more layers can be less performing than its reference model?

The explanation is inside the optimizer: the more variables, the less effective. It’s a classical optimization issue.

Then, for ANNand indeed for deep learning, the issue is finding performant optimizers. It’s why you need to find a suitable package with a good optimizer when applied to ANN.

Spoiler: this is very complex to develop a good optimizer for ANN.

Indeed, it’s why there are so many packages available, like:

Tensor Flow
Keras
Theano
Scikit-learn
PyTorch

There are many ways for starting an optimizer package, many ways to improve it, and sometimes it comes to a dead-end, needing a fresh start.

It’s why when applied to different ANN, the result of the packages can be totally different, and why it can be an option to test many of them.

3 – ANN with 2 hidden layers

Now, we add one hidden and create an ANN with 2 hidden layers, meaning 10 artificial neurons in this example

Three input neurons
Three + three hidden layer neurons
One output neuron

This image has an empty alt attribute; its file name is image-20-1024x523.png

The aim of this model is now to find 16 parameters:

w1	w5	w9	w13	w17	w21	w25
w2	w6	w10	w14	w18	w22	w26
w3	w7	w11	w15	w19	w23	w27
w4	w8	w12	w16	w20	w24	w28

Such that, for each row of the dataset, the operations in the hereunder figure, give a result close to Y

An example

We choose arbitrarily a solution, and then do a calculation and check if it fits well:

w1=-1	w5=-1	w9=-2	w13=-2	w17=-2	w21=-2	w25=-1
w2=1	w6=1	w10=-2	w14=-1	w18=-1	w22=-2	w26=2
w3=2	w7=-1	w11=2	w15=1	w19=1	w23=1	w27=1
w4=-2	w8=2	w12=2	w16=2	w20=2	w24=3	w28=-2

Let’s illustrate the above figure:

We define this model as Model 2-0.

We extract the first row input variables:

X1	X2	X3
3	3	10

First, we set the weights:

Calculation of the first row:

Calculation with R:

# we directly use an vector, in order not to create 28 different variables
w <- c( -1, 1, 2, -2,
        -1, 1, -1, 2,
        -2, -2, 2, 2,
        -2, -1, 1, 2,
        -2, -1, 1, 2,
        -2, -2, 1, 3,
        -1, 2, 1, -2)
tempdata <- data
# calcation of estimated values for model [1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1]
tempdata$H11 <- ( tempdata$X1 * w[1] + tempdata$X2 * w[2] + tempdata$X3 * w[3] + w[4] )
tempdata$H21 <- ( tempdata$X1 * w[5] + tempdata$X2 * w[6] + tempdata$X3 * w[7] + w[8] )
tempdata$H31 <- ( tempdata$X1 * w[9] + tempdata$X2 * w[10] + tempdata$X3 * w[11] + w[12] )

tempdata$H12 <- ( tempdata$H11 * w[13] + tempdata$H21 * w[14] + tempdata$H31 * w[15] + w[16] )
tempdata$H22 <- ( tempdata$H11 * w[17] + tempdata$H21 * w[18] + tempdata$H31 * w[19] + w[20] )
tempdata$H32 <- ( tempdata$H11 * w[21] + tempdata$H21 * w[22] + tempdata$H31 * w[23] + w[24] )

tempdata$ESTIMATION<- ( tempdata$H12 * w[25] + tempdata$H22 * w[26] + tempdata$H32 * w[27] + w[28] )

# Rename "Y" column into "REAL" for visibility
colnames(tempdata)[colnames(tempdata) == "Y"] <- "REAL"

What is the error according to the MAPE metric?

metric <- mape(tempdata$REAL, tempdata$ESTIMATION)
metric

The MAPE error is 453%

Optimizer of the scenario

We keep the optim function from the R language. Now we have 28 variables instead of 16 variables.

The risk of performance degradation is indeed greater.

Remember, we’ll need to minimize fmetric, as we want w1…w28 such that error is as small as possible.

Previously we have written metric <- mape(data$REAL, data$ESTIMATION). data$REAL is known. We need to calculate data$ESTIMATION depending on {w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12,w13,w14,w15,w16,w17,w18,w19,w20,w21,w22,w23,w24,w25,w26,w27,w28}

Let’s detail data$ESTIMATION:

X1	X2	X3	ESTIMATION
3	3	10	W25(W13(W13+W23+W310+W4)+W14 (W53+W63+W710+W8)+W15 (W93+W103+W1110+W12)+W16)+ W26( W17(W13+W23+W310+W4)+W18* (W53+W63+W710+W8)+W19 (W93+W103+W1110+W12)+W20)+ W27( W21(W13+W23+W310+W4)+W22* (W53+W63+W710+W8)+W23 (W93+W103+W11*10+W12) +W24)+ W28
4	2	3	…
6	7	7	…
10	4	2	…
3	8	3	…
9	5	4	…
10	8	1	…
7	10	4	…
7	4	9	…
1	8	4	…

Yes, I know, the equation becomes quite long. It could be develop, without any real comprehension gain.

We are ready to code everything above in R:

# definition of the function fmetric. It depends on the vector w, containing 28 numeric values w[1],  w[2],  w[3],  w[4],...w[28]
fmetric <- function(w) { 
  # in case of debugging, why not giving another name for the data table
tempdata <- data
# calcation of estimated values for model [1,2,-1,-2,-1,2,1,-1,2,-3,2,1,-2,2,-1,1]
tempdata$H11 <- ( tempdata$X1 * w[1] + tempdata$X2 * w[2] + tempdata$X3 * w[3] + w[4] )
tempdata$H21 <- ( tempdata$X1 * w[5] + tempdata$X2 * w[6] + tempdata$X3 * w[7] + w[8] )
tempdata$H31 <- ( tempdata$X1 * w[9] + tempdata$X2 * w[10] + tempdata$X3 * w[11] + w[12] )

tempdata$H12 <- ( tempdata$H11 * w[13] + tempdata$H21 * w[14] + tempdata$H31 * w[15] + w[16] )
tempdata$H22 <- ( tempdata$H11 * w[17] + tempdata$H21 * w[18] + tempdata$H31 * w[19] + w[20] )
tempdata$H32 <- ( tempdata$H11 * w[21] + tempdata$H21 * w[22] + tempdata$H31 * w[23] + w[24] )

tempdata$ESTIMATION<- ( tempdata$H12 * w[25] + tempdata$H22 * w[26] + tempdata$H32 * w[27] + w[28] )
  # the function returns the MAPE
  return(mape(tempdata$REAL, tempdata$ESTIMATION))
  }

Checking the metric with the previous example:

colnames(data)[colnames(data) == "Y"] <- "REAL"
myweight <-  
     c( -1, 1, 2, -2,
        -1, 1, -1, 2,
        -2, -2, 2, 2,
        -2, -1, 1, 2,
        -2, -1, 1, 2,
        -2, -2, 1, 3,
        -1, 2, 1, -2)
fmetric(myweight)

The result is same, we can continue

set.seed(123)
optim( rep(0,28), fmetric)

Again, the model with two hidden layers, 92% MAPE error, is disappointing compared to zero hidden layer, 20%.
Nevertheless, there is a gain compared to the 1-hidden layer, 168%.

The reasons are the same as one hidden layer: the more variables, the less compelling. It’s again a classical optimization issue.

Interpretation of the results with 2 hidden layers

Here you get the 28 W1…W28 parameters:

It means the estimation of Y is equal to

Y estimated =

W25*(W13*(W1*3+W2*3+W3*10+W4)+W14* (W5*3+W6*3+W7*10+W8)+W15* (W9*3+W10*3+W11*10+W12)+W16)+
W26*( W17*(W1*3+W2*3+W3*10+W4)+W18* (W5*3+W6*3+W7*10+W8)+W19* (W9*3+W10*3+W11*10+W12)+W20)+
W27*( W21*(W1*3+W2*3+W3*10+W4)+W22* (W5*3+W6*3+W7*10+W8)+W23* (W9*3+W10*3+W11*10+W12) +W24)+
W28

With

W1= 0.025357806

W2 = 0.096776244

…

W28 = 0.723600306

As a conclusion, it is not possible to humanly interpret the results with ANN with 2 hidden layers.

We only have an estimation.

After adding only a couple of hidden layers, ANN becomes a black box. This can be an issue:

when money is invested: the less understanding by the business team, the less confidence
it could be required legally or asked by the customer to understand the model. For example, why a credit is denied by a bank?

Conclusion

The aim of these two articles about ANN was to point out:

It requires complex calculation
It is a black box about understanding

Artificial Neural Network – ANN

1 – ANN without hidden layer

Disclaimer

Imaginary scenario

Experiences of the scenario

Dataset

Generalization of the scenario

Modelization of the scenario with an ANN without a hidden layer

An example

Application with R language

Metric of the scenario

Why the need for a metric?

Choice of a metric

Optimizer of the scenario

Why the need for an optimizer

Analysis of the no-hidden layer ANN

A feeling of « déjà vu »?

Indeed, why is applying a solution less performant than the linear regression?

2 – One ANN with 1 hidden layer

For example, we arbitrarily choose a solution and then do a calculation and check if it fits nicely:

Application with R language

Choice of a metric

Optimizer of the scenario

Why a model with more layers can be less performing than its reference model?

3 – ANN with 2 hidden layers

An example

Optimizer of the scenario

Interpretation of the results with 2 hidden layers

Conclusion