This article was published as part of the Data Science Blogathon.
Credit: https://gifer.com/en/GxlE
The two main questions on my mind while working on this article were "Why am I writing this article?" and "How is it different from other articles?" The cost function is an important concept in data science, but while pursuing my graduate studies I realized that the resources available online are too general and did not fully cover my needs.
I had to consult a lot of articles and watch some videos on YouTube to build an intuition for cost functions. As a result, I wanted to gather the "what", "when", "how" and "why" of cost functions in one place to explain this topic more clearly. I hope my article acts as a one-stop shop for cost functions!!
Dummy guide to cost function 🤷♀️
Loss function: used when we refer to the error of a single training example.
Cost function: used to refer to the average of the loss functions over an entire training data set.
But *why* use a cost function?
Why do we even need a cost function? Consider a scenario in which we want to classify data. Suppose we have the height and weight details of some dogs and cats, and we want to use these two features to classify them correctly. If we plot these records, we obtain the following scatterplot:
Fig 1: Scatterplot for the height and weight of various cats and dogs
The blue dots are cats and the red dots are dogs. Below are some solutions to the classification problem above.
Fig: Probable solutions to our classification problem
Essentially, all three classifiers have very high accuracy, but the third solution is the best because it does not misclassify any point. It classifies every point perfectly because the line sits almost exactly between the two groups, no closer to either one. This is where the concept of the cost function comes in. The cost function helps us reach the optimal solution; it is the technique for evaluating "the performance of our algorithm/model".
It takes both the outputs predicted by the model and the actual outputs, and calculates how wrong the model was in its prediction, producing a higher number when our predictions differ greatly from the actual values. As we adjust our model to improve predictions, the cost function acts as an indicator of how much the model has improved. This is essentially an optimization problem: optimization strategies always aim to "minimize the cost function".
Types of cost functions
There are many cost functions in machine learning, and each has its use cases depending on whether the problem is regression or classification.
- Regression cost function
- Binary classification cost functions
- Multiple Class Classification Cost Functions
1. Regression cost function:
Regression models try to predict a continuous value, for instance, the salary of an employee, the price of a car, the amount of a loan, etc. A cost function used in a regression problem is called a "regression cost function". These are calculated from a distance-based error as follows:
Error = y − y′
Where,
y – actual value
y′ – predicted value
The most commonly used regression cost functions are below,
1.1 Mean error (ME)
- In this cost function, the error is calculated for each training data and then the mean value of all these errors is derived.
- Calculating the average of the errors is the simplest and most intuitive way possible.
- Errors can be both negative and positive, so they can cancel each other out during the sum, giving a zero mean error even for a poor model.
- Therefore, this is not a recommended cost function, but it lays the foundation for the other regression cost functions.
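A minimal sketch of this cancellation problem, using NumPy (the target and prediction values are made up purely for illustration):

```python
import numpy as np

# Hypothetical targets and predictions (illustrative values only)
y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([1.0, 4.0, 7.0])  # errors: +1, 0, -1

mean_error = np.mean(y_true - y_pred)
print(mean_error)  # 0.0 -- the +1 and -1 errors cancel, hiding real mistakes
```

Even though two of the three predictions are wrong, the mean error reports a perfect 0.0, which is exactly why this cost function is not recommended on its own.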
1.2 Mean squared error (MSE)
- This improves on the drawback we found in the mean error above. Here the difference between the actual and predicted value is squared, which removes any possibility of negative errors cancelling out.
- It is measured as the average of the sum of the squared differences between the predictions and the actual observations.

MSE = (sum of squared errors) / n
- Also known as L2 loss.
- In MSE, since each error is squared, even small deviations in prediction are penalized more heavily than with MAE. But if our dataset has outliers that produce large prediction errors, squaring magnifies those errors many times over and leads to a much higher MSE.
- Therefore, we can say that MSE is less robust to outliers.
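A short sketch of how MSE behaves, reusing the same illustrative values as before plus a copy of the predictions with one large (outlier-style) error:

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([1.0, 4.0, 7.0])           # errors: +1, 0, -1
y_pred_outlier = np.array([1.0, 4.0, 16.0])  # one large error of 10

mse = np.mean((y_true - y_pred) ** 2)                  # (1 + 0 + 1) / 3
mse_outlier = np.mean((y_true - y_pred_outlier) ** 2)  # (1 + 0 + 100) / 3
print(mse, mse_outlier)
```

Unlike the mean error, MSE is nonzero whenever any prediction is wrong; and note how a single large error dominates the second result, illustrating the sensitivity to outliers described above.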
1.3 Mean absolute error (MAE)
- In this cost function, the absolute value of the difference between the actual and predicted values is averaged, so positive and negative errors cannot cancel.
- Also known as L1 loss; because errors are not squared, it is more robust to outliers than MSE.

MAE = (sum of absolute errors) / n
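A quick comparison of MAE against MSE on the same illustrative outlier data used above, showing MAE's relative robustness:

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0])
y_pred_outlier = np.array([1.0, 4.0, 16.0])  # one large error of 10

mae = np.mean(np.abs(y_true - y_pred_outlier))  # (1 + 0 + 10) / 3
mse = np.mean((y_true - y_pred_outlier) ** 2)   # (1 + 0 + 100) / 3
print(mae, mse)
```

The outlier contributes 10 to the MAE sum but 100 to the MSE sum, so the same bad prediction inflates MSE roughly ten times more than MAE.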
2. Cost functions for classification problems
The cost functions used in classification problems are different from those we use in regression. A commonly used loss function for classification is the cross-entropy loss. Let's understand cross-entropy with a small example. Consider a classification problem with 3 classes as follows:
Classes: (Orange, Apple, Tomato)
The machine learning model will give a probability distribution of these 3 classes as output for a given input data. The class with the highest probability is considered a winning class for prediction.
Output = [P(Orange),P(Apple),P(Tomato)]
The actual probability distribution for each class is shown below.
Orange = [1,0,0]
Apple = [0,1,0]
Tomato = [0,0,1]
If, during the training phase, the input class is Tomato, the predicted probability distribution should tend towards Tomato's actual probability distribution. If the predicted probability distribution is not close to the actual one, the model must adjust its weights. This is where cross-entropy becomes a tool for calculating how far the predicted probability distribution is from the actual one. In other words, cross-entropy can be thought of as a way to measure the distance between two probability distributions. The following image illustrates the intuition behind cross-entropy:
Fig 3: Intuition behind cross-entropy (credit – machinelearningknowledge.ai)
This was just an intuition behind the cross entropy. It has its origin in information theory. Now, with this understanding of cross entropy, let's now look at the classification cost functions.
2.1 Multiple Class Classification Cost Functions
This cost function is used in classification problems where there are multiple classes and each input belongs to exactly one class. Now let's understand how the cross-entropy is calculated. Suppose the model gives the probability distribution shown below for n classes for a particular input datum D.

And the actual or target probability distribution of the data D is

Then the cross-entropy for that particular datum D is calculated as
Cross-entropy loss(y, p) = −yᵀ log(p)
= −(y₁ log(p₁) + y₂ log(p₂) + … + yₙ log(pₙ))

Let's now compute the cross-entropy using the previous example (see the cross-entropy image, Fig 3):
p(Tomato) = [0.1, 0.3, 0.6]
y(Tomato) = [0, 0, 1]
Cross-entropy(y, p) = −(0 · log(0.1) + 0 · log(0.3) + 1 · log(0.6)) ≈ 0.51
The above formula only measures the cross entropy for a single observation or input data. The error in the classification of the complete model is given by the categorical cross entropy, which is nothing more than the mean of the cross entropy for all the N training data.
Categorical cross entropy = (Cross-entropy sum for N data) / N
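The formulas above can be sketched in a few lines of NumPy; the helper function name is my own, and the input reproduces the article's single-observation Tomato example:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    """Mean cross-entropy over N one-hot encoded observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # per-observation loss: -sum over classes of y * log(p)
    per_obs = -np.sum(y_true * np.log(y_pred), axis=1)
    return np.mean(per_obs)

# Classes: (Orange, Apple, Tomato); predicted vs. one-hot actual distribution
p = [[0.1, 0.3, 0.6]]
y = [[0, 0, 1]]
print(round(categorical_cross_entropy(y, p), 2))  # 0.51
```

With a single observation the mean is just −log(0.6) ≈ 0.51, matching the hand calculation above; passing more rows averages the loss over all N training examples.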
2.2 Binary cross entropy cost function
Binary cross-entropy is a special case of categorical cross-entropy where there is only one output, which takes a binary value of 0 or 1 to denote the negative and positive class respectively. For instance, classification between cat and dog.
Suppose the actual output is denoted by a single variable y; then the cross-entropy for a particular datum D can be simplified as follows:
Cross-entropy(D) = −y · log(p) − (1 − y) · log(1 − p)
which reduces to −log(p) when y = 1, and to −log(1 − p) when y = 0.
The error in the binary classification for the complete model is given by the binary cross entropy, which is nothing more than the mean of the cross entropy for all the N training data.
Binary cross entropy = (Cross-entropy sum for N data) / N
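The binary case can be sketched the same way; the function name and the two sample observations (dog = 1, cat = 0, with assumed predicted probabilities) are illustrative, not from the original article:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy over N observations.

    y_true holds 0/1 labels; p_pred holds predicted P(positive class).
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Two observations: a dog (y=1) predicted at 0.9, a cat (y=0) predicted at 0.2
y = [1, 0]
p = [0.9, 0.2]
loss = binary_cross_entropy(y, p)
print(loss)
```

Both predictions here are fairly confident and correct, so the loss is small; pushing a predicted probability toward the wrong class drives the corresponding log term, and hence the loss, sharply upward.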
Conclusion
I hope this article has been helpful to you!! Let me know what you think, especially if there are suggestions for improvement. You can connect with me on LinkedIn: https://www.linkedin.com/in/saily-shah/ and here is my GitHub profile: https://github.com/sailyshah
The media shown in this article is not the property of DataPeaker and is used at the author's discretion.


