Overview
- SQL is a must-have language for anyone in data science or analytics.
- Here are 8 ingenious SQL techniques for data analysis that analytics and data science professionals will love working with
Introduction
SQL is a key tool in the arsenal of a data science professional. I speak from experience: you just can't hope to build a successful career in data science or analytics if you haven't learned SQL yet.
And why is SQL so important?
As we move into a new decade, the speed at which we produce and consume data is skyrocketing day by day. To make smart data-driven decisions, organizations around the world are hiring data professionals such as business analysts and data scientists to extract insights from the vast trove of data.
And one of the most important tools needed for this is, you guessed it, SQL!
Structured Query Language (SQL) has been around for decades. It is a programming language used to manage data stored in relational databases. SQL is used by most large companies around the world. A data analyst can use SQL to access, read, manipulate, and analyze the data stored in a database and generate useful insights to drive an informed decision-making process.
In this article, I will discuss 8 SQL techniques and queries that will prepare you for advanced data analysis problems. Please note that this article assumes a very basic understanding of SQL.
Table of Contents
- First let's understand the dataset
- SQL Technique #1: Counting rows and items
- SQL Technique #2: Aggregation functions
- SQL Technique #3: Identifying extreme values
- SQL Technique #4: Slicing data
- SQL Technique #5: Limiting data
- SQL Technique #6: Sorting data
- SQL Technique #7: Filtering patterns
- SQL Technique #8: Grouping, rolling up data and filtering in groups
First let's understand the dataset
What is the best way to learn data analysis? By practicing it hands-on with a dataset! For this purpose, I have created a dummy dataset of a retail store. The customer data is stored in a table named ConsumerDetails.
Our dataset consists of the following columns:
- Name – The consumer's name
- Location – The customer's location
- Total_amt_spend – The total amount of money spent by the consumer in the store.
- Industry – The industry to which the consumer belongs
Note: I will use MySQL 5.7 throughout this article. You can download it from the MySQL 5.7 downloads page.
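To follow along, a table with the columns listed above could be created roughly as follows. This is only a sketch: the table and column names come from the description above, while the column types and lengths are assumptions, and no sample rows are shown here.

```sql
-- Hypothetical schema for the dummy retail dataset.
-- Only the table and column names come from the article;
-- the data types and lengths are assumptions.
CREATE TABLE ConsumerDetails (
    Name            VARCHAR(100),  -- consumer's name
    Location        VARCHAR(100),  -- customer's location
    Total_amt_spend INT,           -- total amount spent in the store (Rs.)
    Industry        VARCHAR(100)   -- industry the consumer belongs to
);
```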
SQL Technique #1: Counting rows and items
We will start our analysis with the simplest query, namely, counting the number of rows in our table. We will do this using the COUNT() function.
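A minimal sketch of such a count, assuming the ConsumerDetails table described earlier:

```sql
-- Count every row in the table, including rows that contain NULLs.
SELECT COUNT(*)
FROM ConsumerDetails;
```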
Excellent! Now we know the number of rows in our table, which is 10. It may seem trivial to use this function on a small test dataset, but it can go a long way when your rows number in the millions!
Many times, our data table is full of duplicate values. To get only the unique values, we use the DISTINCT keyword.
In our dataset, how can we find the unique industries that customers belong to?
You guessed it right. We can do this using the DISTINCT keyword.
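The query could look something like this, assuming the Industry column from the dataset description:

```sql
-- Return each industry once, no matter how many customers share it.
SELECT DISTINCT Industry
FROM ConsumerDetails;
```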
You can even count the number of unique values by using COUNT in conjunction with DISTINCT.
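As a sketch, counting the unique industries might be written as below. The alias industry_count is a hypothetical name chosen for illustration.

```sql
-- Count the number of different industries present in the table.
-- industry_count is an illustrative alias, not from the article.
SELECT COUNT(DISTINCT Industry) AS industry_count
FROM ConsumerDetails;
```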
SQL Technique #2: Aggregation functions
Aggregation functions are the basis of any type of data analysis. They give us an overview of the dataset. Some of the functions we will discuss are SUM(), AVG(), and STDDEV().
We use the SUM() function to calculate the sum of a numeric column in a table.
Let's find the total amount spent across all customers.
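A sketch of this query, using the sum_all alias that the article refers to:

```sql
-- Add up the spending of every customer into a single value.
SELECT SUM(Total_amt_spend) AS sum_all
FROM ConsumerDetails;
```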
In the example above, sum_all is the alias under which the value of the sum is returned. The total amount of money spent by consumers is Rs. 12,560.
To calculate the average of a numeric column, we use the AVG() function. Let's find the average consumer spending for our retail store.
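The average could be computed along these lines. The alias avg_all is an assumption chosen to mirror sum_all.

```sql
-- Mean amount spent per customer.
-- avg_all is an illustrative alias, not from the article.
SELECT AVG(Total_amt_spend) AS avg_all
FROM ConsumerDetails;
```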
The average amount spent by customers in the retail store is Rs. 1256.
Calculate the standard deviation
If you have looked at the dataset and then at the average value of consumer spending, you will have noticed that something is missing. The average alone does not provide a complete picture, so let's look at another important metric: the standard deviation. The function for this is STDDEV().
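A possible form of this query is shown below. Note that in MySQL, STDDEV() returns the population standard deviation. The alias std_all is an assumption.

```sql
-- Population standard deviation of customer spending.
-- std_all is an illustrative alias, not from the article.
SELECT STDDEV(Total_amt_spend) AS std_all
FROM ConsumerDetails;
```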
The standard deviation turns out to be 829.7, which means that spending varies widely from one consumer to another.
SQL Technique #3: Identifying extreme values
The next type of analysis is to identify extreme values, which will help you understand the data better.
The maximum numeric value can be identified with the MAX() function. Let's see how to apply it.
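For instance, the highest spending figure could be retrieved like this. The alias max_spend is hypothetical.

```sql
-- Largest amount spent by any single customer.
-- max_spend is an illustrative alias, not from the article.
SELECT MAX(Total_amt_spend) AS max_spend
FROM ConsumerDetails;
```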
The maximum amount of money that a consumer has spent in the retail store is Rs. 3000.
Similar to the MAX() function, we have the MIN() function to identify the minimum numeric value in a given column.
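The corresponding sketch for the minimum, again with a hypothetical alias min_spend:

```sql
-- Smallest amount spent by any single customer.
-- min_spend is an illustrative alias, not from the article.
SELECT MIN(Total_amt_spend) AS min_spend
FROM ConsumerDetails;
```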
The minimum amount of money that a consumer has spent in the retail store is Rs. 350.
SQL Technique #4: Slicing data
Now, let's focus on one of the most important parts of data analysis: slicing the data. This section will form the basis for advanced queries and help you retrieve data based on some kind of condition.
- Let's say the retail store wants to find customers who come from specific localities, namely Shakti Nagar and Shanti Vihar. What will be the query for this?
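One way to write this, assuming the localities are stored in the Location column:

```sql
-- Keep only customers living in one of the two listed localities.
SELECT *
FROM ConsumerDetails
WHERE Location IN ('Shakti Nagar', 'Shanti Vihar');
```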
Great, we have 3 customers! We have used the WHERE clause to filter the data based on the condition that consumers should live in the localities Shakti Nagar and Shanti Vihar. I did not use the OR condition here. Instead, I have used the IN operator, which allows us to specify multiple values in the WHERE clause.
- We need to find customers who live in specific localities (Shakti Nagar and Shanti Vihar) and spend an amount greater than Rs. 2000.
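A sketch that combines both conditions, under the same assumption about the Location column:

```sql
-- Both conditions must hold: the locality matches AND spending exceeds Rs. 2000.
SELECT *
FROM ConsumerDetails
WHERE Location IN ('Shakti Nagar', 'Shanti Vihar')
  AND Total_amt_spend > 2000;
```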
In our dataset, only Shantanu and Natasha meet these conditions. Since both conditions must be met, the AND condition is best suited here. Let's look at another example of slicing our data.
- This time, the retail store wants to retrieve all consumers who spend between Rs. 1000 and Rs. 2000 in order to push special marketing offers. What will be the query for this?
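One possible version uses two comparisons joined by AND. It assumes that "between" is meant to include both Rs. 1000 and Rs. 2000, which the article does not state explicitly.

```sql
-- Range filter written with explicit comparisons.
-- Assumption: both endpoints (1000 and 2000) are included.
SELECT *
FROM ConsumerDetails
WHERE Total_amt_spend >= 1000
  AND Total_amt_spend <= 2000;
```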
Another way to write the same statement would be:
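A sketch of the range filter using the BETWEEN operator, which in MySQL includes both endpoints:

```sql
-- BETWEEN is inclusive: rows with exactly 1000 or 2000 are also returned.
SELECT *
FROM ConsumerDetails
WHERE Total_amt_spend BETWEEN 1000 AND 2000;
```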
Only Rohan meets this criterion!
Excellent! We have reached the halfway point of our journey. Let's build on the knowledge we've gained so far.
SQL Technique #5: Limiting data
Let's say we want to view a data table consisting of millions of records. We can't use the SELECT statement directly, as this would dump the complete table onto our screen, which is cumbersome and computationally intensive. Instead, we can use the LIMIT clause.
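For example, a preview of the table could be sketched as follows. Without an ORDER BY clause, which rows count as "first" is up to the database.

```sql
-- Return at most 5 rows instead of the whole table.
SELECT *
FROM ConsumerDetails
LIMIT 5;
```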
The above SQL command shows us the first 5 rows of the table.
What if you only want to select the fourth and fifth rows? For that, we use the OFFSET clause, which skips the specified number of rows. Let's see how it works:
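To get the fourth and fifth rows, we can skip the first three rows and then take two. A minimal sketch:

```sql
-- OFFSET 3 skips rows 1 to 3; LIMIT 2 then returns rows 4 and 5.
SELECT *
FROM ConsumerDetails
LIMIT 2 OFFSET 3;
```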
SQL Technique #6: Sorting data
Sorting data helps us put our data in perspective. We can sort the data using the ORDER BY keyword.
The keyword can be used to sort the data in ascending or descending order. The ORDER BY keyword sorts the data in ascending order by default.
Let's see an example in which we sort the data according to the Total_amt_spend column in ascending order:
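A sketch of the ascending sort. ASC is written out for clarity, although it is the default:

```sql
-- Lowest spenders first; ASC could be omitted because it is the default.
SELECT *
FROM ConsumerDetails
ORDER BY Total_amt_spend ASC;
```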
Awesome! To sort the dataset in descending order, we can add the DESC keyword.
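The descending version might look like this:

```sql
-- Highest spenders first.
SELECT *
FROM ConsumerDetails
ORDER BY Total_amt_spend DESC;
```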
SQL Technique #7: Filtering patterns
In the previous sections, we learned how to filter data based on one or more conditions. Here, we will learn how to filter rows whose column values match a specific pattern. To do this, we first need to understand the LIKE operator and wildcard characters.
The LIKE operator is used in a WHERE clause to find a specific pattern in a column.
A wildcard character is used to substitute for one or more characters in a string. Wildcards are used in conjunction with the LIKE operator. The two most common wildcard characters are:
- %: represents zero or more characters
- _: represents a single character
In our fictitious retail dataset, let's say we want all the localities that end with "Nagar". Take a moment to understand the problem statement and think about how we can solve it.
Let's try to solve the problem. We require all locations ending with “Nagar” and they can have any number of characters before this particular string. Therefore, we can make use of the wild card “%” before “Nagar”:
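A sketch of this pattern match, assuming the locality names are stored in the Location column:

```sql
-- '%Nagar' matches any value that ends with "Nagar",
-- with zero or more characters in front of it.
SELECT *
FROM ConsumerDetails
WHERE Location LIKE '%Nagar';
```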
Awesome, we have 6 localities that end with this name. Notice that we are using the LIKE operator to perform pattern matching.
Next, we will try to solve another pattern-based problem. We want the names of consumers whose names have “a” as the second character. Again, I suggest you take a moment to understand the problem and think of a logic to solve it.
Let's analyze the problem. Here, the second character must be “a”. The first character can be anything, so we replace it with the wildcard "_". After the second character, there can be any number of characters, so we replace those with the wildcard “%”. The final pattern match will look like this:
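Putting the pieces together, the query could be sketched as:

```sql
-- '_' stands for exactly one leading character, 'a' is the second character,
-- and '%' allows any number of characters after it.
SELECT *
FROM ConsumerDetails
WHERE Name LIKE '_a%';
```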
We have 6 people who satisfy this strange condition!
SQL Technique #8: Grouping, rolling up data and filtering in groups
We have finally arrived at one of the most powerful analysis tools in SQL: grouping data, which is done using the GROUP BY statement. The most useful application of this statement is to find the distribution of categorical variables. This is done by using the GROUP BY statement in conjunction with aggregation functions such as COUNT, SUM, and AVG.
Let's try to understand this better by taking up a problem statement. The retail store wants to find the number of customers in each industry.
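A minimal sketch of this grouping. The alias customer_count is a hypothetical name.

```sql
-- One output row per industry, with the number of customers in it.
-- customer_count is an illustrative alias, not from the article.
SELECT Industry, COUNT(*) AS customer_count
FROM ConsumerDetails
GROUP BY Industry;
```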
We observe that the count of customers belonging to the different industries is more or less the same. Next, let's find the total amount spent by customers, grouped by the industry to which they belong.
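The rolled-up totals per industry might be obtained like this. The alias total_spend is an assumption.

```sql
-- Total spending per industry.
-- total_spend is an illustrative alias, not from the article.
SELECT Industry, SUM(Total_amt_spend) AS total_spend
FROM ConsumerDetails
GROUP BY Industry;
```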
We can see that the maximum amount of money is spent by customers belonging to the Manufacturing industry. This seems a bit easy, right? Let's take a step forward and make it more complicated.
Now, the retailer wants to find the industries whose total amount spent is greater than 2500. To solve this problem, we will again group the data by industry and then use the HAVING clause.
The HAVING clause is like the WHERE clause, but it is used only to filter grouped data. Remember, it always comes after the GROUP BY statement.
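A sketch of the grouped filter. The alias total_spend is hypothetical, and the aggregate is repeated inside HAVING so the query does not depend on alias support there.

```sql
-- Keep only the industries whose combined spending exceeds 2500.
-- HAVING filters groups after aggregation; WHERE would filter individual rows.
SELECT Industry, SUM(Total_amt_spend) AS total_spend
FROM ConsumerDetails
GROUP BY Industry
HAVING SUM(Total_amt_spend) > 2500;
```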
We have only 3 categories that satisfy the conditions: Aviation, Defense, and Manufacturing. To make the result easier to read, I will also add the ORDER BY keyword.
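With sorting added, the query could look like the sketch below. The article does not say which sort direction it used, so ascending order by the hypothetical alias total_spend is an assumption.

```sql
-- Same grouped filter, with the surviving industries sorted by total spending.
-- Assumption: ascending order; append DESC to show the largest total first.
SELECT Industry, SUM(Total_amt_spend) AS total_spend
FROM ConsumerDetails
GROUP BY Industry
HAVING SUM(Total_amt_spend) > 2500
ORDER BY total_spend;
```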
Final notes
I'm so glad you made it this far. These are the building blocks of all data analysis queries in SQL. You can also write advanced queries using these basics. In this article, I used MySQL 5.7 for the examples.
I really hope these SQL queries help you in your day-to-day work when you are analyzing complex data. Do you have any tips and tricks of your own for analyzing data in SQL? Let me know in the comments!