7 Amazing Machine Learning GitHub Repositories for Data Scientists

Introduction

If I had to choose one platform that has kept me up to date with the latest developments in data science and machine learning, it would be GitHub. The sheer scale of GitHub, combined with the power of super data scientists from around the world, makes it a must-use platform for anyone interested in this field.

Can you imagine a world where machine learning libraries and frameworks like BERT, StanfordNLP, TensorFlow, PyTorch, etc. were not open source? It's unthinkable! GitHub has democratized machine learning for the masses, exactly in line with what we believe in at DataPeaker.

This was one of the main reasons we started this GitHub series covering the most useful machine learning packages and libraries in January 2018.

Along with that, we've also been covering Reddit discussions that we think are relevant to all data science professionals. This month is no different. I have selected the top five discussions from May, which focus on two things: machine learning techniques and career advice from expert data scientists.

You can also check out the GitHub repositories and Reddit discussions we've covered throughout this year:

Top GitHub repositories (May 2019)

Interpretability is a HUGE thing in machine learning right now. Being able to understand how a model produced the output it did is a fundamental aspect of any machine learning project. In fact, we even did a podcast with Christoph Molnar on interpretable ML, which you should check out.

InterpretML is an open source package from Microsoft for training interpretable models and explaining black box systems. Microsoft put it best when it explained why interpretability is essential:

  • Debugging models: Why did my model make this mistake?
  • Detecting bias: Does my model discriminate?
  • Human-AI cooperation: How can I understand and trust the decisions of the model?
  • Regulatory compliance: Does my model meet the legal requirements?
  • High-risk applications: Healthcare, finance, judicial, etc.

Interpreting the inner workings of a machine learning model becomes more difficult as its complexity increases. Have you ever tried to take apart and understand an ensemble of multiple models? It takes a lot of time and effort.

We can't just go to our client or leadership with a complex model without being able to explain how it produced a good score or accuracy. That's a one-way ticket back to the drawing board for us.

The folks at Microsoft Research have developed the Explainable Boosting Machine (EBM) algorithm to help with interpretability. This EBM technique offers both high accuracy and intelligibility: the holy grail.

InterpretML is not limited to EBM. It also supports techniques like LIME, linear models, and decision trees, among others. Comparing models and choosing the best one for our project has never been so easy!

You can install InterpretML using the following code:

pip install numpy scipy pyscaffold
pip install -U interpret

Google Research makes another appearance in our monthly GitHub series. No surprises there: they have the most computational power in the business and they are putting it to good use in machine learning.

Their latest open source release, called Tensor2Robot (T2R), is quite impressive. T2R is a library for training, evaluation, and inference of large-scale deep neural networks. But wait, it has been developed with a specific goal in mind: it is designed for neural networks related to robotic perception and control.

No prizes for guessing the deep learning framework Tensor2Robot is built on. That's right, TensorFlow. Tensor2Robot is used within Alphabet, Google's parent organization.

Here are a couple of projects implemented with Tensor2Robot:

TensorFlow 2.0, the most anticipated TensorFlow (TF) version this year, officially launched last month. And I couldn't wait to get my hands on it!

This repository contains TF implementations of multiple generative models, including:

  • Generative adversarial networks (GAN)
  • Autoencoder
  • Variational autoencoder (VAE)
  • VAE-GAN, among others.

All of these models are implemented on two datasets that you will be quite familiar with: Fashion MNIST and NSynth.

The best part? All of these implementations are available as Jupyter notebooks! So you can download them and run them on your own machine or export them to Google Colab. The choice is yours, and TensorFlow 2.0 is right here for you to understand and use.

A time series repository! I haven't come across a new time series development in quite a while.

STUMPY is a powerful and scalable library that helps us perform time series data mining tasks. STUMPY is designed to compute a matrix profile. I can see you wondering: what on earth is a matrix profile? Well, the matrix profile is a vector that stores the z-normalized Euclidean distance between each subsequence within a time series and its nearest neighbor.

Here are some time series data mining tasks that this matrix profile helps us perform:

  • Anomaly discovery
  • Semantic segmentation
  • Density estimation
  • Time series chains (temporally ordered set of subsequence patterns)
  • Pattern/motif discovery (approximately repeated subsequences within a longer time series)

Use the following code to install it directly via pip:

pip install stumpy
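
To make the matrix profile definition above concrete, here is a brute-force sketch in plain Python. It is not how STUMPY computes things (STUMPY uses far faster algorithms); the function names, the toy series, and the size of the exclusion zone that skips trivial matches with overlapping neighbors are all my own assumptions for illustration.

```python
import math

def znormalize(window):
    """Rescale a window to zero mean and unit standard deviation."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)  # a flat window has no shape to compare
    return [(x - mean) / std for x in window]

def naive_matrix_profile(series, m):
    """For each length-m subsequence, return the z-normalized Euclidean
    distance to its nearest neighbor, plus that neighbor's start index."""
    windows = [znormalize(series[i:i + m]) for i in range(len(series) - m + 1)]
    exclusion = max(1, m // 4)  # assumed zone that skips trivial self-matches
    profile, neighbors = [], []
    for i, a in enumerate(windows):
        best_dist, best_j = math.inf, -1
        for j, b in enumerate(windows):
            if abs(i - j) <= exclusion:
                continue
            dist = math.dist(a, b)
            if dist < best_dist:
                best_dist, best_j = dist, j
        profile.append(best_dist)
        neighbors.append(best_j)
    return profile, neighbors

# Toy series: the shape [2, 5, 9, 5, 2] appears at index 2 and again at index 10
series = [1, 3, 2, 5, 9, 5, 2, 4, 1, 6, 2, 5, 9, 5, 2, 3, 7, 1]
profile, neighbors = naive_matrix_profile(series, m=5)

# Lowest value = best repeated pattern (motif); highest = most unusual (discord)
motif = min(range(len(profile)), key=profile.__getitem__)
discord = max(range(len(profile)), key=profile.__getitem__)
print("motif at", motif, "matches", neighbors[motif])  # motif at 2 matches 10
print("discord at", discord)
```

The low points of the profile correspond to the pattern discovery task in the list above, and the high points to anomaly discovery. This version compares every pair of subsequences, so it is only practical for short series.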

MeshCNN is a general-purpose deep neural network for 3D triangular meshes. These meshes can be used for tasks such as 3D shape classification or segmentation. A great computer vision application.

The MeshCNN framework includes convolution, pooling, and unpooling layers that are applied directly on the mesh edges:

Convolutional Neural Networks (CNNs) are perfect for working with images and visual data. CNNs have become all the rage of late, with a boom in image-related tasks emerging from them. Object detection, image segmentation, image classification, and more are all possible thanks to the advancement of CNNs.

Deep learning in 3D is attracting industry interest, including fields such as robotics and autonomous driving. The problem with 3D shapes is that they are inherently irregular. This makes operations like convolutions difficult and challenging.

This is where MeshCNN comes in. From the repository:

Meshes are a list of vertices, edges and faces, that together define the shape of the 3D object. The problem is that each vertex has a different number of neighbors and there is no order.
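
The irregularity described above is easy to see on a toy example. The sketch below builds a hypothetical mesh (a square pyramid of my own choosing, not something from the MeshCNN repository) and counts neighbors. As I understand the MeshCNN design, working on edges helps because, in a closed triangle mesh, every edge sits between exactly two triangles and therefore always has four neighboring edges, whereas the vertex counts vary.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy mesh: a square pyramid whose base is split into two
# triangles. Vertices 0-3 are the base corners and vertex 4 is the apex.
faces = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4), (0, 1, 2), (0, 2, 3)]

vertex_neighbors = defaultdict(set)
edge_faces = defaultdict(list)
for face_id, face in enumerate(faces):
    for a, b in combinations(face, 2):
        vertex_neighbors[a].add(b)
        vertex_neighbors[b].add(a)
        edge_faces[tuple(sorted((a, b)))].append(face_id)

def edge_neighbors(edge):
    """Return the other edges of the triangles that share this edge."""
    result = set()
    for face_id in edge_faces[edge]:
        for a, b in combinations(faces[face_id], 2):
            other = tuple(sorted((a, b)))
            if other != edge:
                result.add(other)
    return result

vertex_degree = {v: len(n) for v, n in sorted(vertex_neighbors.items())}
edge_neighbor_counts = {e: len(edge_neighbors(e)) for e in edge_faces}

print(vertex_degree)                       # {0: 4, 1: 3, 2: 4, 3: 3, 4: 4}
print(set(edge_neighbor_counts.values()))  # {4}
```

Note that the fixed count of four only holds for edges shared by two triangles; an edge on an open boundary would belong to a single triangle and have just two neighbors.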

If you are a fan of computer vision and are interested in learning about or applying CNNs, this is the perfect repository for you. You can learn more about CNNs through our articles:

Decision tree algorithms are among the first advanced techniques we learn in machine learning. Honestly, I really appreciated this technique after logistic regression. I could use it on larger datasets and understand how it worked, how the splits happened, and so on.

Personally, I love this repository. It's a treasure trove for data scientists. The repository contains a collection of papers on tree-based algorithms, including decision, regression, and classification trees. It also contains the implementation for each paper. What more could we ask for?

Have you ever wondered how the training process of your machine learning algorithm works? We write the code, something complicated happens behind the scenes (the joy of programming!), and we get the results.

Microsoft Research has created a tool called TensorWatch that allows us to see real-time visualizations of the training process of our machine learning model. Amazing! See a snippet of how TensorWatch works:

TensorWatch, in simple terms, is a debugging and visualization tool for deep learning and reinforcement learning. It works in Jupyter notebooks and allows us to do many other custom visualizations of our data and our models.

Reddit discussions

Let's take a few moments to check out the most interesting Reddit discussions related to data science and machine learning from May 2019. There is something here for everyone, whether you are a data science enthusiast or a practitioner. So let's dig in!

This is a tough nut to crack. The first question is whether you should opt for a PhD before taking up a position in industry. And then, if you do opt for one, what skills should you pick up to ease your transition into industry?

I think this discussion could be helpful in deciphering one of the greatest riddles of our careers: how do we transition from one field or line of work to another? Don't look at this just from the point of view of a PhD student. It is very relevant to most of us who want to make that first leap into machine learning.

I highly recommend that you follow this thread, as many seasoned data scientists have shared their personal experiences and lessons learned.

Recently, a research paper was published that expands on the title of this thread. The paper explains the lottery ticket hypothesis, in which a smaller subnetwork, also known as a winning ticket, could train faster compared to a larger network.

This discussion focuses on that paper. To read more about the lottery ticket hypothesis and how it works, you can refer to my article, where I discuss this concept in a way that even beginners can understand:

Decoding the Best Papers from ICLR 2019: Neural Networks are Here to Rule
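
For readers who want a rough picture of how a winning ticket is found, here is a tiny sketch of one pruning round. It follows my understanding of the procedure in the paper (train the network, drop the smallest-magnitude weights, then reset the surviving weights to their original initial values); the function names and the weight values are made up for illustration, and a real experiment would apply this to every layer of an actual network.

```python
def winning_ticket_mask(trained_weights, prune_fraction):
    """Mark the smallest-magnitude trained weights for removal (0 = pruned)."""
    n_prune = int(len(trained_weights) * prune_fraction)
    order = sorted(range(len(trained_weights)),
                   key=lambda i: abs(trained_weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(trained_weights))]

def reset_to_init(initial_weights, mask):
    """Keep surviving weights at their ORIGINAL initial values, zero the rest."""
    return [w if keep else 0.0 for w, keep in zip(initial_weights, mask)]

# Made-up weights for a single layer, before and after training
initial = [0.5, -0.2, 0.1, 0.8, -0.6, 0.3]
trained = [0.9, -0.05, 0.02, 1.4, -1.1, 0.1]

mask = winning_ticket_mask(trained, prune_fraction=0.5)
ticket = reset_to_init(initial, mask)

print(mask)    # [1, 0, 0, 1, 1, 0]
print(ticket)  # [0.5, 0.0, 0.0, 0.8, -0.6, 0.0]
```

The smaller network described by ticket is what would then be retrained from scratch, with the pruned weights held at zero.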

I chose this discussion because I can totally relate to it. I used to think: I have learned so much, and yet there is so much more left. Will I ever become an expert? I made the mistake of looking only at the quantity and not the quality of what I was learning.

With technology advancing so quickly and continuously, there will always be a LOT to learn. This thread has some solid advice on how you can set priorities, stick to them, and focus on the task at hand rather than trying to become a jack of all trades.

Final notes

I had a lot of fun (and learned a lot) putting together this month's machine learning GitHub collection! I highly recommend bookmarking both platforms and checking them regularly. It's a great way to stay up to date with all the latest machine learning news.

Or you can always come back every month and see our best options. 🙂

If you think I have missed a repository or a discussion, comment below and I will be happy to discuss it.
