This article was published as part of the Data Science Blogathon
GitHub is one of the most popular source code management and version control platforms. It is also one of the largest social media sites for programmers. Software developers use it to showcase their skills to recruiters and hiring managers. By analyzing repositories on GitHub, we can obtain valuable information such as user behavior, what makes a repository popular, which technologies are trending among developers today, and much more.
You can find the full code used in the article here.
I have used the ‘GitHub Repositories 2020’ Kaggle dataset, since it is more recent.
Implementation
Let's git this started by importing the necessary libraries and reading the input data.
The dataset contains 19 columns, of which I chose 11 based on the most popular GitHub terminology and their relevance to this discussion. Some of the column names contain typos, so I have renamed them for clarity.
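The loading, column selection and renaming steps might look roughly like the sketch below. It is not the article's original code: the inline CSV, the raw column names (including the misspelled `wacth`) and the values are all made up for illustration, and only six of the eleven columns are shown.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV. The real file has 19 columns and
# its raw column names may differ from the ones invented here.
raw_csv = io.StringIO(
    "topic,name,user,star,fork,wacth,extra\n"
    "topic_a,repo_a,user_a,1.5k,300,120,x\n"
    "topic_b,repo_b,user_b,29.8k,4.1k,900,y\n"
)
raw_df = pd.read_csv(raw_csv)  # with the real data: pd.read_csv(<path to the CSV>)

# Keep only the columns of interest, then give them clearer names.
keep = ["topic", "name", "user", "star", "fork", "wacth"]
rename_map = {
    "topic": "Topic",
    "name": "Repo_Name",
    "user": "Username",
    "star": "Star",
    "fork": "Fork",
    "wacth": "Watch",  # example of fixing a typo in a raw column name
}
github_df = raw_df[keep].rename(columns=rename_map)
print(github_df.columns.tolist())
```

Selecting the columns before renaming keeps the rename map short and makes it obvious which raw columns were dropped.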
A brief summary of the columns in the data:
- Topic – A tag that describes the repository field or domain.
- Repo_Name – Repository name (repository short name)
- Username – Repository owner name
- Star – Number of stars a repository has received
- Fork – Number of times a repository has been forked
- Watch – Number of users watching the repository
- Issues – Number of open issues
- Pull_Requests – Total pull requests generated
- Topic_Tags – List of topic tags added to that repository by the user
- Commits – Total number of commits made
- Collaborators – Number of people contributing to the repository
Notice that the Star, Fork, and Watch columns use 'k' to denote thousands, so let's convert them to multiples of 1000. We also need to remove the ',' (commas) from the Issues and Commits columns.
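A minimal sketch of this cleaning step is shown below. The helper name `parse_count` is hypothetical, and the sketch assumes every value is either a plain number, a number with thousands separators, or a number followed by 'k'.

```python
def parse_count(value):
    """Convert strings such as '1.5k' or '1,234' to an integer."""
    text = str(value).strip().lower().replace(",", "")
    if text.endswith("k"):
        # '1.5k' -> 1.5 * 1000; round() guards against floating-point error
        return int(round(float(text[:-1]) * 1000))
    return int(float(text))


for raw in ["29.8k", "1.5k", "1,234", "87"]:
    print(raw, "->", parse_count(raw))
```

With pandas, the same helper could be applied to a whole column, for example `github_df['Star'] = github_df['Star'].apply(parse_count)`.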
Now that the columns are numeric, we can obtain basic statistical information from them.
# display basic statistical details about the columns
github_df.describe()
1. Analysis of the main repositories according to their popularity
What Makes a GitHub Repository Popular? This question can be answered with 3 metrics: star, watch and fork.
- Star: when a user likes your repository or wants to show some appreciation, they mark it with a star.
- Watch: when a user wants to be notified of all activity in a repository, they watch it.
- Fork: when a user wants a copy of the repository or intends to make a contribution, they fork it.
# create a dataframe with average values of the numeric columns across all topics
# (numeric_only=True skips text columns such as Repo_Name, which cannot be averaged)
pop_mean_df = github_df.groupby('Topic').mean(numeric_only=True).reset_index()
pop_mean_df
1.1 Star analysis
Visualizing the average number of stars across each topic,
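A chart like this could be drawn roughly as follows. This is a sketch with made-up numbers, not the article's original plotting code; the topic labels, the colour and the `Agg` backend (chosen so the snippet runs without a display) are all assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers purely for illustration.
toy_df = pd.DataFrame({
    "Topic": ["A", "A", "B", "B", "C"],
    "Star": [100, 300, 50, 150, 400],
})

# Average stars per topic, sorted ascending so the largest bar ends up on top.
avg_stars = toy_df.groupby("Topic")["Star"].mean().sort_values()

fig, ax = plt.subplots(figsize=(8, 4), dpi=100)
ax.barh(list(avg_stars.index), avg_stars.values, color="purple")
ax.set_xlabel("Average stars")
ax.set_ylabel("Topic")
fig.suptitle("Average stars on each topic")
```

Swapping `"Star"` for `"Watch"` or `"Fork"` gives the corresponding charts for the other two metrics.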
# top 10 most starred repos
github_df.nlargest(n=10, columns="Star")[['Repo_Name','Topic','Star']]
# Quick tip: '\033[1m' switches the terminal output to bold and '\033[0m' switches it back to normal.
# idxmax() returns an index label, so the row is looked up with .loc
top_star = github_df.loc[github_df['Star'].idxmax()]
print('Most starred repository is {}{}{} in the topic {}{}{} with {}{}{} stars'.format(
    '\033[1m', top_star['Repo_Name'], '\033[0m',
    '\033[1m', top_star['Topic'], '\033[0m',
    '\033[1m', top_star['Star'], '\033[0m'))
Among the top 10 most starred repositories, 4 are frameworks (Vue, React, TensorFlow, Bootstrap) and 6 are about JavaScript.
1.2 Analysis of Watch
Visualizing the average number of watchers across each topic,
Note: Code for the above graph is the same as the ‘Average Stars on each topic’ except for the column names. I have not added the same to avoid redundancy.
# top 10 most watched repos
github_df.nlargest(n=10, columns="Watch")[['Repo_Name','Topic','Watch']]
# idxmax() returns an index label, so the row is looked up with .loc
top_watch = github_df.loc[github_df['Watch'].idxmax()]
print('Most watched repository is {}{}{} in the topic {}{}{}'.format(
    '\033[1m', top_watch['Repo_Name'], '\033[0m',
    '\033[1m', top_watch['Topic'], '\033[0m'))
Among the 10 most watched repositories, 4 are frameworks (TensorFlow, Bootstrap, React, Vue), 6 are about JavaScript, and 5 contain learning content for programmers.
1.3 Fork analysis
Visualizing the average number of forks across each topic,
# top 10 most forked repos
github_df.nlargest(n=10, columns="Fork")[['Repo_Name','Topic','Fork']]
# idxmax() returns an index label, so the row is looked up with .loc
top_fork = github_df.loc[github_df['Fork'].idxmax()]
print('Most forked repository is {}{}{} in the topic {}{}{}'.format(
    '\033[1m', top_fork['Repo_Name'], '\033[0m',
    '\033[1m', top_fork['Topic'], '\033[0m'))
Among the top 10 most forked repositories, 4 are frameworks (TensorFlow, Bootstrap, spring-boot, React) and 5 contain learning content for programmers.
1.4 Relationship between star, fork and watch
Often, users fork a repository when they want to contribute to it. So, let's explore the relationship between Star and Fork, and between Watch and Fork.
# set the seaborn theme first so the background grid applies to the new figure
sns.set_theme(context='paper')
# set figure size and dpi
fig, ax = plt.subplots(figsize=(8,4), dpi=100)
# plot the data with a fitted regression line
sns.regplot(data=github_df, x='Watch', y='Fork', color="purple", ax=ax)
# set x and y-axis labels and title
ax.set_xlabel('Watch', fontsize=13, color="#333F4B")
ax.set_ylabel('Fork', fontsize=13, color="#333F4B")
fig.suptitle('Relationship between Watch and Fork', fontsize=18, color="#333F4B")
The data points lie much closer to the regression line for Watch versus Fork than for Star versus Fork.
From this we can conclude that a user who watches a repository is more likely to fork it.
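How tightly points cluster around a regression line can also be expressed as a number with the Pearson correlation coefficient. The sketch below uses made-up values, not the article's dataset, and only assumes that the columns are named Star, Watch and Fork as above.

```python
import pandas as pd

# Made-up values purely for illustration.
toy_df = pd.DataFrame({
    "Star": [5, 1, 4, 2, 3],
    "Watch": [1, 2, 3, 4, 5],
    "Fork": [2, 4, 6, 8, 10],
})

# Series.corr computes the Pearson correlation by default.
r_watch_fork = toy_df["Watch"].corr(toy_df["Fork"])
r_star_fork = toy_df["Star"].corr(toy_df["Fork"])

print(f"Watch vs Fork: r = {r_watch_fork:.2f}")
print(f"Star vs Fork:  r = {r_star_fork:.2f}")
```

A coefficient closer to 1 corresponds to points lying closer to an upward-sloping line, so comparing the two values on the real data would back up the visual impression from the plots.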
2. Analysis of users with the most repositories
Let's take a look at the users who have the most popular repositories.
Among the top 10 users with the most repositories,
- Microsoft tops the list with 17 repositories.
- Google continues with 15 repositories.
- 6 of them are companies or owned by a company (Microsoft, Google, Adafruit, Alibaba, PacktPublishing, flutter)
- 3 are individual users (junyanz, rasbt, MicrocontrollersAndMore)
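A ranking like this can be obtained by counting how often each owner appears in the data. The sketch below uses invented user and repository names and assumes the owner column is called Username, as in the column summary earlier.

```python
import pandas as pd

# Invented owners and repositories purely for illustration.
toy_df = pd.DataFrame({
    "Username": ["org_a", "org_a", "org_b", "user_c", "org_a", "org_b"],
    "Repo_Name": ["r1", "r2", "r3", "r4", "r5", "r6"],
})

# value_counts() sorts by frequency in descending order; head(10) keeps the top 10.
repos_per_user = toy_df["Username"].value_counts().head(10)
print(repos_per_user)
```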
3. Understanding contribution activities in repositories
GitHub is famous for its contribution graph.
This chart is a record of all contributions a user has made. Whenever a user makes a commit, opens an issue, or proposes a pull request, it counts as a contribution. There are four columns related to contributions in our dataset: Issues, Pull_Requests, Commits, and Collaborators. Let's see if there is any real relationship between them.
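One common way to check for such relationships is a pairwise correlation matrix. The following is a sketch on made-up numbers, and the column names are assumptions based on the column summary earlier in the article.

```python
import pandas as pd

# Assumed names for the four contribution-related columns.
contribution_cols = ["Issues", "Pull_Requests", "Commits", "Collaborators"]

# Made-up values purely for illustration.
toy_df = pd.DataFrame({
    "Issues": [10, 20, 30, 40],
    "Pull_Requests": [1, 3, 2, 5],
    "Commits": [500, 100, 300, 200],
    "Collaborators": [5, 8, 2, 9],
})

# DataFrame.corr returns the Pearson correlation for every pair of columns.
corr_matrix = toy_df[contribution_cols].corr()
print(corr_matrix.round(2))
```

The resulting matrix can be passed to `sns.heatmap(corr_matrix, annot=True)` to view it as a heatmap. Running the same code on a subset, such as the 100 most starred repositories, allows the patterns to be compared.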
The number of commits does not depend on the number of issues, pull requests or contributors. There is a moderate positive relationship between issues and pull requests.
Let's explore the 100 most popular repositories and see if the same holds,
The pattern in the 100 most popular repositories is almost the same as in the overall dataset.
Let's look at the users with the most repositories,
Surprisingly, users with more repositories tend to be more active. There is a fairly strong positive correlation between
- Commits and pull requests
- Commits and issues
- Pull requests and issues
Regarding contributions,
- There is no real relationship between contribution activities in the overall dataset.
- There is also no correlation between the contribution activities in the 100 most popular repositories.
- Users who own more repositories are much more likely to be active contributors.
4. Topic tag analysis
Adding tags to a repository is a way to classify it by topic. It helps other users find and contribute to that repository and also helps you explore topics on the platform by type, domain, technology, etc.
The Topic_Tags column consists of lists. To find popular tags, convert the entire column to a list of lists and count the occurrences of each tag. With that, we can visualize some of the most popular topic tags and see which topics tend to get tagged the most.
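The counting step can be sketched with the standard library alone. The tag lists below are invented, and the sketch assumes each entry is already a Python list. If the column was read from a CSV file as strings, each cell may first need parsing, for example with `ast.literal_eval`.

```python
from collections import Counter
from itertools import chain

# Invented tag lists, one inner list per repository.
topic_tags = [
    ["python", "machine-learning"],
    ["javascript", "react"],
    ["python", "deep-learning", "machine-learning"],
    [],  # a repository without tags
]

# Flatten the list of lists and count how often each tag occurs.
tag_counts = Counter(chain.from_iterable(topic_tags))
print(tag_counts.most_common(2))
```

`most_common(n)` returns the `n` most frequent tags together with their counts, which is a convenient input for a bar chart of the most popular tags.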
Of the 15 most popular tags, 10 belong to the world of data science.
# number of tags attached to each repository (topic_tags is the list of lists built above)
len_tags = [len(tag) for tag in topic_tags]
# create a new column -> Total_Tags
github_df['Total_Tags'] = len_tags
# group by topic and sum Total_Tags within each topic
topic_wise_tags = github_df.groupby('Topic')['Total_Tags'].sum().reset_index(name="Total Tags")
# set figure size and dpi
fig, ax = plt.subplots(figsize=(7,4), dpi=100)
# remove background grids
ax.grid(False)
ax.set_facecolor('white')
sns.despine()
# plot the data
sns.barplot(data=topic_wise_tags, x='Total Tags', y='Topic', ci=None, palette="gist_rainbow")
# set x and y-axis labels and title
ax.set_xlabel('Total Tags', fontsize=13, color="#333F4B")
ax.set_ylabel('Topic', fontsize=13, color="#333F4B")
fig.suptitle('Tags distribution across topics', fontsize=18, color="#333F4B")
Repositories in the Computer Vision, Data Science and Machine Learning topics tend to carry more tags.
Let's end with a word cloud of topic_tags,
Inference:
- Among the 10 most starred, watched and forked repositories, 4 are frameworks.
- TensorFlow is the most watched and most forked repository.
- A user who watches a repository is more likely to fork it.
- Microsoft and Google are the users with the most popular repositories.
- Among the top 10 users with the most popular repositories, 6 are companies.
- There is no real relationship between contribution activities (issues, pull requests, commits).
- The most used tags are Machine Learning, Deep Learning, Python, Computer Vision, JavaScript.
- Repositories in the Computer Vision, Data Science and Machine Learning topics carry more tags.
If we had analyzed data from a decade ago, these trends would have been completely different. Data science seems to have seen tremendous growth in recent years!
Thanks for reading all the way to the end! I would love to connect on LinkedIn.
Let me know in the comment section if you have any concerns, comments or criticism. Have a nice day!
The media shown in this article is not the property of DataPeaker and is used at the author's discretion.