Kaggle data sets | Top Kaggle Data Sets to Practice for Data Scientists

Introduction

Kaggle offers many online resources to help you get started with data science. It has thousands of datasets, data science competitions, code submissions attached to datasets, community discussions, and even beginner courses. Each user also gets a shareable public profile that tracks and displays all of their contributions and achievements.

The user profile shows who the user follows, who follows the user, the user's code, any datasets the user has published, and other information. There are also several ranking systems. A Kaggle profile is a great way to create shareable online projects and showcase your talent. Just as your HackerEarth or CodeChef profile shows your competitive coding skills, your Kaggle profile shows your data science skills.

To build a good Kaggle profile, you need to work with the data, create high-quality Python or R notebooks in the form of projects, and tell a story through the data. In Kaggle Notebooks you can add multiple charts, write Markdown cells, and train models. Best of all, you do not need to install Python or R on your computer to use them, and almost all major libraries can be imported directly. Kaggle also provides TPUs for free. Tensor Processing Units (TPUs) are hardware accelerators specialized for deep learning tasks. They are supported in TensorFlow 2.1 both through the high-level Keras API and, at a lower level, in models that use a custom training loop.

Working with datasets in Kaggle is therefore easy and convenient, and every beginner should give Kaggle a try to build skills and knowledge.

Here are some datasets that every beginner can try and create amazing projects:

1. Netflix Movies and TV Shows

Who doesn't like Netflix? This Kaggle dataset lists the TV shows and movies available on Netflix. It lends itself to a good exploratory data analysis (EDA) project. With it, you can find out what kind of content is produced in which country, identify similar content from the descriptions, and tackle many other interesting tasks.
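As a minimal sketch of the first question (what kind of content comes from which country), the snippet below counts titles per country and type using only the Python standard library. The sample rows and the field names `title`, `type`, and `country` are assumptions made up for illustration; check the actual file header before reusing the idea.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset's column names may differ.
rows = [
    {"title": "Show A", "type": "TV Show", "country": "United States"},
    {"title": "Movie B", "type": "Movie", "country": "India"},
    {"title": "Movie C", "type": "Movie", "country": "India, United States"},
    {"title": "Show D", "type": "TV Show", "country": ""},
]

def count_by_country(rows):
    """Count (country, type) pairs, splitting multi-country entries on commas."""
    counts = Counter()
    for row in rows:
        for country in row["country"].split(","):
            country = country.strip()
            if country:  # skip titles with no country listed
                counts[(country, row["type"])] += 1
    return counts

counts = count_by_country(rows)
for (country, kind), n in sorted(counts.items()):
    print(f"{country:15s} {kind:8s} {n}")
```

Splitting on commas is a design choice for the assumed case where one title lists several countries; a co-production then counts once for each country.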

  1. Link to dataset

My favorite notebooks

  1. EDA on Netflix notebook
  2. Netflix Data: analysis and visualization notebook

2. Student performance on exams

This data is based on population demographics. It contains several features, such as the type of lunch the student receives, the level of test preparation, the parents' education level, and the student's performance in math, reading, and writing. Various regression and classification problems can be solved with it, and it can also be used to find which factors lead to better test scores. Overall, it is an interesting dataset to work on.
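One simple way to start looking for factors linked to better scores is to compare group averages. The sketch below does this for a hypothetical `test_prep` field; the records, field names, and values are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; field names are assumptions, not the dataset's header.
students = [
    {"test_prep": "completed", "math": 78, "reading": 82, "writing": 80},
    {"test_prep": "none", "math": 65, "reading": 70, "writing": 66},
    {"test_prep": "completed", "math": 88, "reading": 90, "writing": 92},
    {"test_prep": "none", "math": 59, "reading": 61, "writing": 60},
]

def mean_score_by(records, factor, subjects=("math", "reading", "writing")):
    """Average each student's subject scores, then average within each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[factor]].append(mean(rec[s] for s in subjects))
    return {level: mean(scores) for level, scores in groups.items()}

by_prep = mean_score_by(students, "test_prep")
print(by_prep)
```

A gap between groups only hints at a relationship; a regression or classification model would be the next step to check whether it holds once other features are taken into account.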

  1. Link to dataset

My favorite notebooks

  1. Student performance in exams notebook

3. Mobile Price Classification

The Mobile Price Classification dataset has many features and a wide variety of data following different distributions. There are categorical features, continuous numeric data, and even binary data. This variety of patterns gives you plenty of data to work with and exposes you to a range of mathematical and statistical calculations.

  1. Link to dataset

My favorite notebooks

  1. Price prediction notebook for mobile devices
  2. Mobile price prediction notebook no. 2

4. Cat and dog images

The classic dogs vs. cats classification dataset. It contains many images of dogs and cats that can be used to train models and make predictions. This dataset is a must for students trying to get into image processing or computer vision. What's more, you get to see many cute pictures of cats and dogs.

  1. Link to dataset

My favorite notebooks

  1. Dog and cat image classifier notebook

5. Trip Advisor Hotel Reviews

Hotels are an important part of travel and vacations. Hotel reviews are text data that can be processed using natural language processing (NLP) methods. There are more than 20,000 hotel reviews, each accompanied by a star rating from 1 to 5. The dataset can be used to train a classification model that predicts the star rating for a given review. It can be a good stepping stone into text analysis and NLP.
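To make the idea of predicting a star rating from review text concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with the standard library only. Naive Bayes is just one possible choice of model, and the four reviews and their ratings are made up; a real project would train on the full dataset and would likely use a library implementation.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a review and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up reviews and ratings, only to show the workflow.
reviews = [
    "dirty room and rude staff",
    "terrible service, dirty bathroom",
    "great location and friendly staff",
    "wonderful stay, great breakfast",
]
stars = [1, 1, 5, 5]

model = NaiveBayes().fit(reviews, stars)
print(model.predict("friendly staff and great breakfast"))
print(model.predict("rude service and dirty room"))
```

Working in log space avoids numeric underflow on long reviews, and the add-one smoothing keeps a word that is missing from one class from zeroing out that class entirely.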

  1. Link to dataset

My favorite notebooks

  1. Hotel Opinion Prediction Notebook

6. Melbourne Housing Market

The Melbourne Housing Market dataset is an all-time favorite learning resource for data science beginners. It has many kinds of features: numerical, categorical, and even geographic (latitude and longitude). It can therefore also be used for geospatial analysis and clustering problems. Regression and classification tasks can be performed on it as well. Numerous code samples and guides are available for this dataset, which makes it ideal for students.
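A common first step in geospatial analysis with latitude and longitude is to derive a distance feature. The sketch below uses the haversine formula to compute the distance from each listing to the city centre. The listings, the field names `lat` and `lon`, and the approximate centre coordinates are assumptions for illustration, not values from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates of central Melbourne (an assumption for illustration).
CITY_CENTRE = (-37.8136, 144.9631)

# Hypothetical listings with latitude/longitude fields.
listings = [
    {"suburb": "A", "lat": -37.80, "lon": 144.99},
    {"suburb": "B", "lat": -37.90, "lon": 145.05},
]
for home in listings:
    home["km_to_centre"] = haversine_km(home["lat"], home["lon"], *CITY_CENTRE)
    print(home["suburb"], round(home["km_to_centre"], 1))
```

The resulting `km_to_centre` column could then serve as an input to a price regression or to a clustering of suburbs.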

  1. Link to dataset

My favorite notebooks

  1. Melbourne || Comprehensive analysis notebook of the housing market
  2. Melbourne real estate market comprehensive analysis notebook

7. Churn modeling

Employee churn rate indicates how often a company's employees quit their jobs within a given period. It is an important aspect of HR analytics and corporate strategy. The data contains real-life features such as age, gender, length of time with the company, and other important characteristics. It can be used to build a classification model and explore interesting patterns.
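The churn rate itself is a simple ratio: the number of employees who left during the period divided by the number of employees considered. The sketch below computes it overall and for two tenure groups. The records, the field names, and the two-year cut-off are all hypothetical.

```python
# Hypothetical records; "left" marks employees who quit during the period.
employees = [
    {"age": 29, "gender": "F", "tenure_years": 1, "left": True},
    {"age": 41, "gender": "M", "tenure_years": 8, "left": False},
    {"age": 35, "gender": "F", "tenure_years": 3, "left": False},
    {"age": 24, "gender": "M", "tenure_years": 1, "left": True},
    {"age": 52, "gender": "M", "tenure_years": 12, "left": False},
]

def churn_rate(records):
    """Share of employees in `records` who left (0.0 for an empty group)."""
    if not records:
        return 0.0
    return sum(r["left"] for r in records) / len(records)

overall = churn_rate(employees)
short_tenure = churn_rate([e for e in employees if e["tenure_years"] < 2])
long_tenure = churn_rate([e for e in employees if e["tenure_years"] >= 2])
print(f"overall={overall:.0%} short={short_tenure:.0%} long={long_tenure:.0%}")
```

Comparing rates across groups like this is a quick way to spot which features might be worth feeding into a classification model.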

  1. Link to dataset

My favorite notebooks

  1. Churn classification notebook

8. Amazon Top 50 best selling books 2009-2019

It is always interesting to work with a sales dataset and extract insights from it. Features include the rating from Amazon users, the number of reviews on Amazon, and others. This dataset can be used for EDA projects and for regression analysis. It can also serve as the basis for an interesting case study on what makes bestselling books successful.
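A regression analysis on this kind of data can start from a plain least-squares line. The sketch below fits one by hand. The price and review numbers are made up, and the choice of variables is an assumption, so treat it as an illustration of the technique and not as a finding about the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up (price, number of reviews) pairs, not taken from the dataset.
prices = [5, 10, 15, 20]
reviews = [9000, 7000, 5000, 3000]

slope, intercept = fit_line(prices, reviews)
print(f"reviews = {slope:.1f} * price + {intercept:.1f}")
```

Note that `fit_line` divides by the spread of `xs`, so it fails if every x value is identical; real data would also call for a look at residuals before trusting the fit.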

  1. Link to dataset

My favorite notebooks

  1. Amazon Best Selling Book Notebook

9. Medical Cost Personal Datasets

This dataset is used to make insurance forecasts based on various features. Interesting features include BMI, the number of children, and whether the person is a smoker. It also falls into the demographic data category and can be used to analyze an individual's insurance expenses.

  1. Link to dataset

My favorite notebooks

  1. Patient charges || Clustering and Regression Notebook

10. Kepler exoplanets search results

Kepler had verified 1,284 new exoplanets by May 2016. As of October 2017, there were more than 3,000 confirmed exoplanets in total (using all detection methods, including ground-based ones). At the time the dataset was compiled, the telescope was still active and collecting new data on its extended mission.

The data has several features, all of which can be a bit difficult to understand. A detailed guide explaining them can be found here.

  1. Link to dataset

Final notes

There are many notebooks for this dataset. It can be a bit difficult for beginners, but there is a lot of work you can do with it.

There are many more datasets and challenges available on Kaggle from which beginners can learn. Your Kaggle profile can also serve as a way to showcase your data science skills.

The media shown in this article about Kaggle datasets is not the property of DataPeaker and is used at the author's discretion.
