Steps to complete a machine learning project

Introduction

(Figure: a roadmap of the steps in a machine learning project)
Machine learning project workflow

1. Data collection

  1. What kind of problem are we trying to solve?
  2. What data sources already exist?
  3. What privacy issues are there?
  4. Is the data public?
  5. Where should we store the files?
  There are two main kinds of data: structured and unstructured.
  1. Structured data: appears in a tabular format (rows and columns, like what you would find in an Excel spreadsheet). It can contain many different types of data, for instance numerical, categorical and time series (see the sketch after this list).
  • Nominal/categorical: one thing or another (mutually exclusive). For instance, in car sales, color is a category: a car can be blue but not white. Order does not matter.
  • Numeric: any continuous value where the difference between values matters. For instance, when selling houses, $107,850 is more than $56,400.
  • Ordinal: data that has an order, but where the distance between values is unknown. For instance, a question like "How would you rate your health from 1 to 5?", with 1 being poor and 5 being healthy. You can answer 1, 2, 3, 4 or 5, but the distance between values does not mean that an answer of 5 is five times as good as an answer of 1.
  • Time series: data over time. For instance, the historical sale values of bulldozers from 2012 to 2018.
  2. Unstructured data: data without a rigid structure (images, video, audio, natural language text).
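To make these data types concrete, here is a minimal sketch of how they might look in a pandas DataFrame; the column names and values are made up for illustration:

```python
# A sketch of the data types above in a pandas DataFrame.
# The column names and values are made up for illustration.
import pandas as pd

cars = pd.DataFrame({
    "color": ["green", "red", "blue"],          # nominal/categorical
    "price": [56400, 107850, 91200],            # numeric
    "condition": pd.Categorical(                # ordinal: order matters,
        ["poor", "good", "excellent"],          # distances are unknown
        categories=["poor", "good", "excellent"],
        ordered=True,
    ),
    "sold_at": pd.to_datetime(                  # time series component
        ["2012-01-15", "2015-06-30", "2018-11-02"]
    ),
})

print(cars.dtypes)
```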

2. Data preparation

  • Exploratory data analysis (EDA): learning about the data you are working with (a short sketch follows this list).
  1. What are the feature variables (inputs) and the target variable (output)? For instance, to predict heart disease, the feature variables may be a person's age, weight, average heart rate and level of physical activity, and the target variable will be whether or not they have the disease.
  2. What kind of data do you have? Structured, unstructured, numerical, time series? Are there missing values? Should you remove them or fill them in with imputation?
  3. Where are the outliers? How many are there, and why are they there? Are there any questions you could ask a domain expert about the data? For instance, could a heart disease physician shed some light on your heart disease dataset?
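A minimal first EDA pass might look like the following; the file name `heart_disease.csv` and the `target` column are hypothetical placeholders for your own dataset:

```python
# A minimal EDA pass with pandas. The file `heart_disease.csv` and its
# `target` column are hypothetical placeholders for your own dataset.
import pandas as pd

df = pd.read_csv("heart_disease.csv")

print(df.info())                     # column types and non-null counts
print(df.describe())                 # summary statistics for numeric columns
print(df.isna().sum())               # missing values per column
print(df["target"].value_counts())   # class balance of the target variable
```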
  • Data preprocessing: preparing your data to be modeled.
  • Feature imputation: filling in missing values (a machine learning model cannot learn from data that is not there); see the sketch after this list.
  1. Single imputation: fill with the mean or the median of the column.
  2. Multiple imputation: model the missing values and fill them in with what your model finds.
  3. KNN (k-nearest neighbors): fill in the data with a value from another example that is similar.
  4. Many more, such as random imputation, last observation carried forward (for time series), moving window and most frequent.
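Here is a minimal sketch of single and KNN imputation with scikit-learn; the small array `X` stands in for your own feature matrix:

```python
# Single and KNN imputation with scikit-learn. The small array `X`
# stands in for your own feature matrix.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Single imputation: fill each hole with the column median
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fill each hole using the most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)
```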
  • Feature encoding (converting values into numbers; a machine learning model requires all values to be numeric); see the sketch after this list.
  • One-hot encoding: turn each unique value into a list of zeros and ones, where the target value is 1 and the rest are 0s. For instance, if a car's color can be green, red or blue, a green car would be represented as [1, 0, 0] and a red one as [0, 1, 0].
  • Label encoding: convert labels into distinct numeric values. For instance, if your target variables are different animals, such as dog, cat and bird, these could become 0, 1 and 2, respectively.
  • Embedding encoding: learn a representation among all the different data points. For instance, a language model is a representation of how different words relate to each other. Embeddings are also increasingly available for structured (tabular) data.
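A minimal sketch of one-hot and label encoding with scikit-learn; note that both encoders order categories alphabetically, so the exact positions may differ from the hand-written example above:

```python
# One-hot and label encoding with scikit-learn. Both encoders order
# categories alphabetically, so positions may differ from the prose example.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = [["green"], ["red"], ["blue"]]
onehot = OneHotEncoder(sparse_output=False)   # use sparse=False on older versions
print(onehot.fit_transform(colors))           # each color becomes a 0/1 vector

animals = ["dog", "cat", "bird"]
print(LabelEncoder().fit_transform(animals))  # bird -> 0, cat -> 1, dog -> 2
```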
  • Feature normalization (scaling) or standardization: when numerical variables are on different scales (for instance, number_of_bathrooms is between 1 and 5 and size_of_land is between 500 and 20,000 square feet), some machine learning algorithms do not perform well. Scaling and standardization help fix this (see the sketch below).
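A minimal sketch of both approaches; the two columns stand in for number_of_bathrooms and size_of_land:

```python
# Min-max scaling (normalization) and standardization with scikit-learn.
# The two columns stand in for number_of_bathrooms and size_of_land.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 500.0],
              [3.0, 12_000.0],
              [5.0, 20_000.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance
```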
  • Feature engineering: transforming the data into (potentially) more meaningful representations by adding domain knowledge; a sketch follows this list.
  1. Decompose: split a value into its meaningful parts (for instance, a date into day, month and year).
  2. Discretization: turn larger groups into smaller groups (for instance, binning a continuous variable into ranges).
  3. Crossing and interaction features: combine two or more features.
  4. Indicator features: use other parts of the data to flag something potentially significant.
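A small sketch of the first three ideas in pandas; the columns (`sold_at`, `width`, `length`) are hypothetical:

```python
# Simple feature engineering in pandas. The columns (`sold_at`, `width`,
# `length`) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "sold_at": pd.to_datetime(["2012-01-15", "2018-11-02"]),
    "width": [10.0, 20.0],
    "length": [30.0, 50.0],
})

# Decompose: split a date into potentially meaningful parts
df["sold_year"] = df["sold_at"].dt.year
df["sold_month"] = df["sold_at"].dt.month

# Crossing / interaction: combine two features into one
df["area"] = df["width"] * df["length"]

# Discretization: bin a continuous value into groups
df["size_band"] = pd.cut(df["area"], bins=[0, 400, 1000],
                         labels=["small", "large"])

print(df)
```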
  • Feature selection: selecting the most valuable features of your dataset to model on, potentially reducing training time and overfitting (less overall data and less redundant data to train on) while improving accuracy; see the sketch after this list.
  1. Dimensionality reduction: a common method is PCA (Principal Component Analysis), which takes a large number of dimensions (features) and uses linear algebra to reduce them to fewer dimensions. For instance, if you have 10 numeric features, you could run PCA to reduce them to 3.
  2. Feature importance (post-modeling): fit a model to the dataset, then inspect which features were most important to the results and remove the least important ones.
  3. Wrapper methods, such as genetic algorithms and recursive feature elimination, involve creating large subsets of feature options and then removing the ones that do not matter.
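A minimal PCA sketch; the random matrix `X` stands in for your own 10 numeric features:

```python
# Reducing 10 numeric features to 3 principal components with PCA.
# The random matrix `X` stands in for your own features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))          # 100 examples, 10 features

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # variance captured per component
```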
  • Dealing with imbalances: does your data have 10,000 examples of one class but only 100 examples of another? (A SMOTE sketch follows this list.)
  1. Collect more data (if you can).
  2. Use the scikit-learn-contrib package imbalanced-learn.
  3. Use SMOTE (synthetic minority over-sampling technique): it creates synthetic samples of your minority class to try to level the playing field.
  4. A useful paper to look at is "Learning from Imbalanced Data".
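Here is a hedged sketch of SMOTE on synthetic data, assuming the imbalanced-learn package is installed (pip install imbalanced-learn):

```python
# Balancing a roughly 99:1 dataset with SMOTE from imbalanced-learn.
# The data is synthetic, generated for illustration.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_100, weights=[0.99, 0.01],
                           random_state=42)
print(Counter(y))                        # roughly 10,000 vs 100 examples

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                    # classes are now balanced
```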
  • Data splitting: split your data into training, validation and test sets (a sketch follows this list).
  1. Training set (usually 70-80% of the data): the model learns from this.
  2. Validation set (usually 10-15% of the data): the model's hyperparameters are tuned on this.
  3. Test set (usually 10-15% of the data): the final performance of the model is evaluated on this. If you have done things well, the results on the test set should give a good indication of how the model will perform in the real world. Do not use this dataset to tune the model.
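A minimal sketch of a 70/15/15 split using two calls to scikit-learn's train_test_split; the synthetic data stands in for your own:

```python
# A 70/15/15 train/validation/test split using two calls to
# train_test_split. The synthetic data stands in for your own.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)

# Carve off the test set first (15%)...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# ...then split the rest into training (~70%) and validation (~15%)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 700, 150, 150
```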

3. Train the model on the data (3 steps: choose an algorithm, fit the model, reduce overfitting with regularization)

  1. Supervised algorithms: linear regression, logistic regression, KNN, SVMs, decision trees and random forests, AdaBoost / gradient boosting machines (boosting).
  2. Unsupervised algorithms: clustering, dimensionality reduction (PCA, autoencoders, t-SNE), anomaly detection.
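Whichever algorithm you choose, fitting follows the same pattern in scikit-learn. A minimal sketch with a random forest on synthetic data, standing in for your own training and validation sets:

```python
# Fitting one candidate algorithm (a random forest) on synthetic data;
# swap in any of the estimators above, the pattern is the same.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)            # the model learns on the training set
print(model.score(X_val, y_val))       # accuracy on the validation set
```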
  • Ways a model can learn:
  1. Batch learning
  2. Online learning
  3. Transfer learning
  4. Active learning
  5. Ensembling
  • Underfitting: occurs when your model does not perform as well as you would like on your data. Try training for longer or using a more advanced model.
  • Overfitting: occurs when the validation loss starts to increase, or when the model performs better on the training set than on the test set.
  1. Regularization: a collection of techniques to prevent/reduce overfitting (for instance, L1 and L2 penalties, dropout, early stopping, data augmentation, batch normalization).
  • Hyperparameter tuning: run a bunch of experiments with different model settings and see which one works best (see the sketch below).
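A minimal grid-search sketch; the parameter grid here is an arbitrary example, not a recommendation:

```python
# Hyperparameter tuning with a grid search; the parameter grid is an
# arbitrary example, not a recommendation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the settings that scored best in cross-validation
```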

4. Analysis / Evaluation

  1. Classification: accuracy, precision, recall, F1, confusion matrix, mean average precision (object detection); see the sketch after this list.
  2. Regression: MSE, MAE, R^2.
  3. Task-based metrics: for instance, for a self-driving car, you may want to know the number of disengagements.
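A minimal sketch of the classification and regression metrics above in scikit-learn; the labels and predictions are made up for illustration:

```python
# Common classification and regression metrics with scikit-learn;
# the labels and predictions are made up for illustration.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score)

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

y_true_r, y_pred_r = [3.0, 5.0, 2.5], [2.8, 5.4, 2.1]
print(mean_squared_error(y_true_r, y_pred_r),
      mean_absolute_error(y_true_r, y_pred_r),
      r2_score(y_true_r, y_pred_r))
```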
  • Feature importance
  • Training / inference time and cost
  • What-If Tool: how does my model compare to other models?
  • Least confident examples: where does the model get it wrong?
  • Bias/variance trade-off

5. Serve the model (deploying a model)

  • Put the model into production and see how it goes (a minimal save-and-load sketch follows this list).
  • Tools you can use: TensorFlow Serving, TorchServe (PyTorch), Google AI Platform, Sagemaker.
  • MLOps: where software engineering meets machine learning; essentially all the technology required around a machine learning model for it to work in production.
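As a first step toward serving, here is a minimal sketch of persisting a trained model and loading it back in a serving process; production serving would sit behind one of the tools above, and the model here is a stand-in trained on synthetic data:

```python
# Persisting a trained model and loading it back in a serving process.
# The model here is a stand-in trained on synthetic data.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

joblib.dump(model, "model.joblib")       # save the trained model to disk

loaded = joblib.load("model.joblib")     # load it inside the serving process
print(loaded.predict(X[:5]))             # answer incoming prediction requests
```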

6. Retrain the model

  • See how the model performs after serving (or before serving) based on various evaluation metrics, and revisit the previous steps as needed (remember, machine learning is very experimental, so this is where you will want to keep track of your data and experiments).
  • You will also find that your model's predictions begin to 'age' (usually not in a fine-wine kind of way) or 'drift', such as when data sources change or are updated (new hardware, etc.). This is when you will want to retrain it (a naive trigger sketch follows).
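One hedged way to operationalize this: compare the model's accuracy on recently labeled data against the accuracy measured at deployment time. The baseline, tolerance and function name below are illustrative, not a standard API:

```python
# A naive retraining trigger: flag the model when its accuracy on
# recently labeled data drops below the deployment-time baseline.
# The baseline, tolerance and function name are illustrative.
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.90   # measured when the model was deployed
DRIFT_TOLERANCE = 0.05     # how much degradation we accept

def needs_retraining(y_recent_true, y_recent_pred):
    """Compare live accuracy against the deployment-time baseline."""
    live_accuracy = accuracy_score(y_recent_true, y_recent_pred)
    return live_accuracy < BASELINE_ACCURACY - DRIFT_TOLERANCE

print(needs_retraining([1, 0, 1, 1], [1, 1, 0, 1]))  # accuracy 0.5 -> True
```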

7. Machine learning tools

(Figure: an overview of tools used across the machine learning workflow)

Thanks for reading. If you liked this article, share it with your friends. If you have any suggestions or doubts, comment below.

Email: [email protected]

Follow me on LinkedIn: LinkedIn

