7 Amazing Machine Learning GitHub Repositories for Data Scientists

Introduction

If I had to choose a platform that has kept me up to date with the latest developments in Data science Y machine learning – it would be GitHub. The large scale of GitHub, combined with the power of super data scientists around the world, makes it a mandatory platform for anyone interested in this field.

Can you imagine a world where libraries and machine learning frameworks like BERT, StanfordNLP, TensorFlow, PyTorch, etc. were not open source? It's unthinkable! GitHub has democratized machine learning for the masses, exactly in line with what we believe in DataPeaker.

This was one of the main reasons we started this GitHub series covering the most useful machine learning packages and libraries in January 2018.

Along with that, we've also been covering Reddit discussions that we think are relevant to all data science professionals. This month is no different. I have selected the top five debates for May, that focus on two things: machine learning techniques and professional advice from expert data scientists.

You can also check out the GitHub repositories and Reddit discussions we've covered throughout this year.:

Top GitHub repositories (May of 2019)

Interpretability is a HUGE thing in machine learning right now. Being able to understand how a model produced the result it produced, a fundamental aspect of any machine learning project. In fact, we even did a podcast with Christoph Molar on interpretable ML which you should check out.

InterpretML is an open source package from Microsoft for training interpretable models and explaining black box systems. Microsoft put it best when it explained why interpretability is essential:

Debugging models: Why did my model make this mistake?
Detecting bias: Does my model discriminate?
Human-AI cooperation: How can I understand and trust the decisions of the model?
Normative compliance: Does my model meet the legal requirements?
High risk applications: Sanitary, financial, judicial, etc.

Interpreting the inner workings of a machine learning model becomes more difficult to interpret measureThe "measure" it is a fundamental concept in various disciplines, which refers to the process of quantifying characteristics or magnitudes of objects, phenomena or situations. In mathematics, Used to determine lengths, Areas and volumes, while in social sciences it can refer to the evaluation of qualitative and quantitative variables. Measurement accuracy is crucial to obtain reliable and valid results in any research or practical application.... that increases complexity. Have you ever tried to disassemble and understand a set of multiple models? It takes a lot of time and effort to do it.

We can't just go to our client or leadership with a complex model without being able to explain how it produced a good score. / precision. That's a one way ticket back to the drawing board for us.

The folks at Microsoft Research have developed the Explainable Boosting Machine algorithm (EBM) to help with interpretation. This MBE technique has high precision and intelligibility: The Holy Grail.

Interpreting ML is not limited to using EBM. It also supports algorithms like LIME, linear models, decision trees, among others. Comparing models and choosing the best one for our project has never been so easy!!

You can install InterpretML using the following code:

pip install numpy scipy pyscaffold
pip install -U interpret

Google Research makes another appearance in our monthly Github series. No surprises: they have the most computational power in the business and are using it in machine learning.

Your latest open source release, called Tensor2Robot (T2R) it's quite impressive. T2R is a library for trainingTraining is a systematic process designed to improve skills, physical knowledge or abilities. It is applied in various areas, like sport, Education and professional development. An effective training program includes goal planning, regular practice and evaluation of progress. Adaptation to individual needs and motivation are key factors in achieving successful and sustainable results in any discipline...., large-scale deep neural network evaluation and inference. But wait, has been developed with a specific goal in mind. It is designed for neural networks related to robotic perception and control.

There are no prizes for guessing the frame of deep learningDeep learning, A subdiscipline of artificial intelligence, relies on artificial neural networks to analyze and process large volumes of data. This technique allows machines to learn patterns and perform complex tasks, such as speech recognition and computer vision. Its ability to continuously improve as more data is provided to it makes it a key tool in various industries, from health... in which Tensor2Robot is built. That's how it is, TensorFlow. Tensor2Robot is used within Alphabet, Google's parent organization.

Here are a couple of projects implemented with Tensor2Robot:

TensorFlow 2.0, the TensorFlow version (TF) most anticipated this year, officially launched last month. And I couldn't wait to get my hands on it!!

This repository contains TF implementations of multiple generative models, including:

Antagonistic generative networks (GAN)
Car coder
Variational autoencoder (Alas)
VAE-GAN, among others.

All of these models are implemented in two data sets that you will be quite familiar with.: Fashion MNIST y NSYNTH.

The best part? All of these implementations are available on a Jupyter Notebook!! So you can download and run it on your own machine or export it to Google Colab. The choice is yours and TensorFlow 2.0 is here for you to understand and use.

A repository of time series! I have not come across a new development of Time SeriesA time series is a set of data collected or measured at successive times, usually at regular time intervals. This type of analysis allows you to identify patterns, Trends and cycles in data over time. Its application is wide, covering areas such as economics, Meteorology and public health, facilitating prediction and decision-making based on historical information.... in quite a long time.

STUMPY is a powerful and scalable library that helps us perform time series data mining tasks. STUMPY is designed to calculate a matrix profile. I can see you wondering: What the hell is a matrix profile? Good, this matrix profile is a vector that stores the normalized Euclidean distance z between any subsequence within a time series and its closest neighbor.

Here are some time series data mining tasks that this matrix profile helps us perform:

Anomaly discovery
SegmentationSegmentation is a key marketing technique that involves dividing a broad market into smaller, more homogeneous groups. This practice allows companies to adapt their strategies and messages to the specific characteristics of each segment, thus improving the effectiveness of your campaigns. Targeting can be based on demographic criteria, psychographic, geographic or behavioral, facilitating more relevant and personalized communication with the target audience.... semantics
Density estimation
Time series chains (temporally ordered set of subsequence patterns)
Pattern discovery / reason (approximately repeated subsequences within a longer time series)

Use the following code to install it directly via pepita:

pip install stumpy

MeshCNN is a red neuronalNeural networks are computational models inspired by the functioning of the human brain. They use structures known as artificial neurons to process and learn from data. These networks are fundamental in the field of artificial intelligence, enabling significant advancements in tasks such as image recognition, Natural Language Processing and Time Series Prediction, among others. Their ability to learn complex patterns makes them powerful tools.. Deep General Purpose for 3D Triangular Meshes. These meshes can be used for tasks such as 3D shape classification or segmentation. A great machine vision application.

MeshCNN framework includes convolution layers, grouping and vanishing applied directly to the edges of the mesh:

Convolutional Neural Networks (CNN) are perfect for working with images and visual data. CNNs have become all the rage of late with a boom in image-related tasks emerging from them.. Object detection, image segmentation, image classification, etc., all this is possible thanks to the advance of CNN.

Deep learning in 3D is attracting industry interest, including fields such as robotics and autonomous driving. The problem with 3D shapes is that they are inherently irregular.. This makes operations like convolutions difficult and challenging..

This is where MeshCNN comes in.. From the repository:

Meshes are a list of vertices, edges and faces, that together define the shape of the 3D object. The problem is that each vertex has a different number of neighbors and there is no order.

If you are a fan of computer vision and are interested in learning or applying CNN, this is the perfect repository for you. You can learn more about CNN through our articles:

Decision tree algorithms are among the first advanced techniques we learn in machine learning. Honestly, I really appreciate this technique after logistic regression. Could use it on larger data sets, understand how it worked, how the divisions occurred, etc.

Personally, i love this repository. It's a treasure trove for data scientists. The repository contains a collection of articles on tree-based algorithms, including decision trees, regression and classification. The repository also contains the implementation of each article. What more could we ask for?

Have you ever wondered how the training process of your machine learning algorithm works? We write the code, some complication happens behind the scenes (The pleasure of programming!), And we get the results.

Microsoft Research has created a tool called TensorWatch that allows us to see real-time visualizations of the training process of our machine learning model. Amazing! See a snippet of how TensorWatch works:

TensorWatch, in simple terms, is a debugging and visualization tool for deep learning and reinforcement learning. It works in Jupyter notebooks and allows us to do many other custom visualizations of our data and our models.

Reddit discussions

Let's take a few moments to check out the most amazing Reddit discussions related to data science and machine learning from May 2019. Here is something for everyone, whether you are a data science enthusiast or practitioner. So let's dig deeper!

This is a tough nut to crack. The first question is whether you should opt for a PhD before taking up a position in the industry. And later, if you chose one, What skills should you acquire to ease your industry transition?

I think this discussion could be helpful in deciphering one of the greatest riddles of our career: How do we transition from one field or line of work to another? Don't look at this just from the point of view of a PhD student. This is very relevant to most of us who want to get that first leap into machine learning..

I highly recommend that you follow this thread, as many seasoned data scientists have shared their personal experiences and learning.

Recently, a research article was published expanding the title of this thread. The newspaper explained the lottery ticket hypothesis in which a smaller subnet, also known as a winning ticket, could train faster compared to a larger network.

This discussion focuses on this document. To read more about the lottery ticket hypothesis and how it works, you can refer to my article where I discuss this concept so that even beginners understand:

Decoding the best ICLR articles 2019: neural networks are here to rule

I chose this discussion because I can totally relate to it. I used to think: I have learned a lot and, but nevertheless, much more remains. Will I ever become an expert? I made the mistake of looking only at the quantity and not the quality of what I was learning.

With fast and continuous advance technology, there will always be a LOT to learn. This thread has some solid advice on how you can prioritize, stick to them and focus on the task at hand rather than trying to become an expert in all trades.

Final notes

I had a lot of Fun (and i learned) when putting together this month's machine learning GitHub collection! I highly recommend bookmarking both platforms and checking them regularly. It's a great way to stay up-to-date with all the latest machine learning news..

Or you can always come back every month and see our best options. 🙂

If you think I have missed a repository or any discussion, comment below and i will be happy to have a discussion about it.

7 Amazing Machine Learning GitHub Repositories for Data Scientists

Contents

Introduction

Top GitHub repositories (May of 2019)

Reddit discussions

Final notes

Related

Recent posts

Artificial Intelligence in Video: How New Technologies Are Changing Video Production?

IT profiles you should consider

How to record a screen on Windows computer?

¿Do you know the seniority levels?

Find Your Best Slip Rings and Rotary Joints Here

Posittion Agency: Advantages of link building for an online store

Subscribe to our Newsletter

Gaming

Brands

Business

Languages

7 Amazing Machine Learning GitHub Repositories for Data Scientists

Contents

Introduction

Top GitHub repositories (May of 2019)

Reddit discussions

Final notes

Related

Related Posts:

Recent posts

Artificial Intelligence in Video: How New Technologies Are Changing Video Production?

IT profiles you should consider

How to record a screen on Windows computer?

¿Do you know the seniority levels?

Find Your Best Slip Rings and Rotary Joints Here

Posittion Agency: Advantages of link building for an online store

Subscribe to our Newsletter

Gaming

Brands

Business

Languages