What does a data scientist do on a daily basis?

Overview

  • What does a data scientist do on a daily basis? It is a popular and important question.
  • We look at it through the lens of 5 detailed and insightful answers from seasoned data scientists.

Introduction

I am a curious person by nature. Every time I come across a concept I haven't heard of before, I can't wait to dig deeper and find out how it works. This has been quite useful on my own data science journey.

But before I got my first break in data science, I was always curious about what data scientists did every day. Was I supposed to just build models all the time? Or was the often-quoted saying about spending 70 to 80% of our time cleaning data actually true?

I'm sure you have wondered (or are at least wondering now) about this too. The role of a data scientist may be the “sexiest job of the 21st century”, but what does that mean day to day?

I decided to investigate this. I wanted to expand my horizons and understand how data scientists view their role in different domains (like NLP). This helped me better understand our role and why we should always read different perspectives when it comes to data science.

So, here is a list of the top 5 answers to give you an idea of what a typical data scientist's routine looks like. Prepare to be surprised: modeling is not the main (let alone the only) task in a data scientist's day!

I also encourage you to participate in a discussion on this question here. This will enrich your current understanding of what a data scientist does, and your thoughts will foster a discussion among our community!

Note: I took the answers verbatim from Quora and added my thoughts at the beginning of each answer. This will help you get a good perspective on what each answer covers without diluting the author's thoughts. Enjoy!

I like this answer because it is sharp, straightforward and simple. The author has even designed a flow chart and explained his thought process in a wonderfully illustrated way. Here is his full answer:

Machine learning is very process oriented. Therefore, I'm always somewhere in one of the images below:

[Image: machine learning process stages]

Machine learning engineers spend a lot of time on the first two images (or stages). The fun part is really in the third stage, but it's only a small part of what happens in the real world.

Some key things to pay attention to about data science in the real world:

  1. Almost all applied machine learning is supervised. That means we build models against structured data sets.
  2. Data wrangling is a big part of what happens in the real world.
  3. When you hear the word supervised, think classification and regression. Most of my models are classification problems.
  4. Model building is about 20% of my work. Yes, that is all!
  5. Many small and medium-sized companies don't use deep learning at all. Why? Because structured data algorithms like XGBoost always win.
  6. Everything I do is programmatic.
  7. Most real-world data resides in relational databases. It will be your job to build queries to extract the data you need.
  8. Big data is unstructured data. If you have to build your models against big data, you will need to learn another set of skills.
  9. The cloud is here to stay. I use BigQuery for my really large structured data. Most large models cannot be built on your laptop.
  10. Computers are monolingual. They only speak numbers. When you pass data to your model, you are passing a highly structured and well-refined numeric data set.
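
To make points 7 and 10 above a little more concrete, here is a minimal sketch in Python. It is not taken from the original answer: the `customers` table, its columns and the `encode` helper are all hypothetical, and an in-memory SQLite database stands in for a real relational database. It pulls rows out with a SQL query and then turns a text column into numbers, since a model can only consume numeric input.

```python
import sqlite3

# Hypothetical in-memory table standing in for a production relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, plan TEXT, monthly_spend REAL, churned INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "basic", 20.0, 0),
        (2, "pro", 55.5, 0),
        (3, "basic", 18.0, 1),
        (4, "enterprise", 210.0, 0),
    ],
)

# Point 7: build a query that extracts only the columns the model needs.
rows = conn.execute(
    "SELECT plan, monthly_spend, churned FROM customers ORDER BY id"
).fetchall()
conn.close()

# Point 10: models only accept numbers, so one-hot encode the text column.
plans = sorted({plan for plan, _, _ in rows})


def encode(plan, spend):
    """Return one 0/1 indicator per known plan, followed by the numeric spend."""
    return [1.0 if plan == p else 0.0 for p in plans] + [spend]


features = [encode(plan, spend) for plan, spend, _ in rows]
labels = [churned for _, _, churned in rows]
```

After this step, `features` and `labels` contain nothing but numbers, which is the form a structured-data algorithm such as XGBoost would expect.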

I really like Vinita's use of visualization. The percentage breakdown of each data science task is helpful and insightful. Vinita has also drawn on her experience to explain the step-by-step work that a data scientist does. It's a must-read answer!

Contrary to popular belief, data science isn't all glamor. The following CrowdFlower survey results accurately summarize a typical day for a data scientist:

[Image: CrowdFlower survey results]

There is a lot of backtracking involved. Sometimes, you even need to be able to predict what consequences removing or adding a variable might have.

  • Collecting data sets: Data is the lifeblood of data science, so we spend a lot of time gathering it. On rare occasions, a project may already have plenty of data.
  • Cleaning and organizing data: This is the longest and most crucial step of the entire process. It has a big impact on the end result. Usually, the amount of data shrinks considerably after this step, so we may need to collect more data for effective training.
  • Mining data for patterns: This is the practice of examining large pre-existing databases to generate new information. Once the data is organized and stored in databases, we can start to get value from it by finding patterns within the data.
  • Creating training sets and test sets: Once we have a decent amount of data, we have to divide it into a training set and a test set. A training set is a set of data used to discover potentially predictive relationships. It contains all the information about the expected output. A test set is a set of data used to assess the strength and usefulness of a predictive relationship. It contains mixed variables.
  • Refining algorithms: We start with a skeletal algorithm. It is very basic and establishes roughly what result is expected. After a few sessions, accuracy, precision, etc. are recorded and the algorithm is refined to maximize its efficiency.
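
The training set / test set step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `train_test_split` helper, its `test_fraction` and `seed` parameters and the toy `data` list are hypothetical names introduced here, not something from the original answer.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data set: (input, expected output) pairs.
data = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split(data, test_fraction=0.25, seed=42)
```

The model is fitted only on `train`; `test` is held back so that the strength of the relationship it learned can be checked on rows it has never seen.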

This is an excellent and relatable answer. Note that machine learning, the most anticipated aspect of a data scientist's job, takes up just 5% of the total time. Like Vinita, he has also explained his tasks in terms of percentages. Here's Justin's take:

  • NLP-related tasks (15%). It's no wonder that PaperRater's automatic proofreading technology requires heavy use of parsers, taggers, regular expressions and other NLP tools as part of its core algorithms and feedback modules.
  • Machine learning (5%). This is usually the most enjoyable part. Data cleansing, feature extraction / engineering / selection and model building.
  • Reports and analysis (10%). Running queries, reviewing analyses and helping with strategic decision making.
  • Data management (5%). Configuring and managing database servers, including MySQL, Redis and MongoDB. Larger projects may require Hadoop or Spark.
  • General software development (40%). Many data scientists have a background in computer science, so expect to pitch in if you have the relevant experience: API integration, web development and anywhere else you can add value. Even in an AI startup, most of the development is not going to involve AI.
  • Other (25%). This covers a wide variety of tasks, including blog posts, marketing, management, technical documentation, technical support, website copy, emails, meetings, etc.

The author, Tim Kiely, uses a Venn diagram to explain what data science is. Just take a look at the Venn diagram below: it will blow your mind. Tim goes on to discuss what data scientists are supposed to be, taking a somewhat contrarian view of the general definition. Here is Tim's answer:

The “Data Scientist” is a bit of a myth, in my opinion. That does not mean they are not out there, but they are much rarer than is popularly understood and are more the exception than the rule.

I compare it to the title of “Web Master” from the dotcom bubble: these supposed people who could do full-stack programming, front-end development, marketing, everything. All those roles / skills have always been specialized and remain so today.

"Data scientists" are supposed to be database architects, understand distributed computing, and have in-depth knowledge of statistics AND of some business area or field. That's asking a lot when any one of those skill sets can take a career to build.

[Image: Venn diagram]

The data scientists I have worked with usually have a Ph.D. in artificial intelligence or machine learning and are effective communicators, which gives them the ability to direct the analysts, DevOps people, database developers and administrators on hand to solve problems with data-driven solutions. They describe the desired answer and let their teams fill in the gaps.

Let's dive into a particular machine learning specialization, and one of my favorites: natural language processing (NLP)! I wanted to bring in the perspective of a machine learning engineer here (a role every data scientist should be familiar with). Here is Evan's full answer:

These days I work in NLP, mostly on intent classification and entity extraction. This is a typical day for me:

  • I get to work, open GitHub and check the ZenHub dashboard (something like Jira, except it's much cooler). I had some models training last night on our servers and should have received an email saying that they finished. I did!
  • I will probably spend a few minutes testing those new models, then adjust some parameters and restart the training process.
  • The rest of the day I tend to be coding, either working on a back-end Python application that will provide the artificial intelligence for one of our products, or implementing a new algorithm that I want to test.
  • As an example, I recently read a post about coupled simulated annealing (CSA) and wanted to try it for tuning XGBoost parameters as an alternative to a grid search. CSA is a generalized form of simulated annealing (SA), which is an algorithm for optimizing a function that does not use any information about the derivative of the function.
  • Unfortunately, I couldn't find an implementation in Python, so I decided to write my own. Two days later, I had pushed my first package to PyPI!
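
For readers who have not met simulated annealing before, here is a minimal sketch of the plain (uncoupled) algorithm in Python. It is not CSA and it is not Evan's package: the `simulated_annealing` function, its parameters and the toy `loss` function are hypothetical, chosen only to show how a function can be minimized without any derivative information.

```python
import math
import random


def simulated_annealing(objective, start, step=0.5, t_start=1.0,
                        t_end=1e-3, cooling=0.95, seed=0):
    """Minimize a one-dimensional objective using only function evaluations."""
    rng = random.Random(seed)
    current, current_val = start, objective(start)
    best, best_val = current, current_val
    temp = t_start
    while temp > t_end:
        for _ in range(50):
            # Propose a random neighbour of the current point.
            candidate = current + rng.uniform(-step, step)
            cand_val = objective(candidate)
            delta = cand_val - current_val
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp), which shrinks as temp cools.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_val = candidate, cand_val
                if current_val < best_val:
                    best, best_val = current, current_val
        temp *= cooling  # Geometric cooling schedule.
    return best, best_val


def loss(x):
    """Toy stand-in for a validation loss, with its minimum at x = 3."""
    return (x - 3.0) ** 2 + 1.0


best_x, best_loss = simulated_annealing(loss, start=-5.0)
```

In a hyperparameter-tuning setting, `loss` would be replaced by a function that trains a model with the candidate parameters and returns its validation error, which is far more expensive to evaluate than this toy example.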

Final notes

The role of the data scientist is truly multifaceted, isn't it? Many aspiring data scientists assume they will primarily build models around the clock, but that is not the case.

There are all kinds of tasks involved in a typical data science project that you will find yourself working on from day to day. I quite like it because it opens up ways to learn new concepts and apply them in the real world.

I'll publish some more career-related posts on DataPeaker, so stay tuned and keep learning!
