Analyzing popular repositories on GitHub

This article was published as part of the Data Science Blogathon

GitHub is one of the most popular source code management and version control platforms. It is also one of the largest social networking sites for programmers. Software developers use it to showcase their skills to recruiters and hiring managers. By analyzing repositories on GitHub, we can gain valuable insights such as user behavior, what makes a repository popular, which technologies are trending among developers today, and much more.

You can find the full code used in the article here.

I have used the ‘GitHub Repositories 2020’ Kaggle dataset, since it is more recent.

Implementation

Let's git this started by importing the necessary libraries and reading the input data.

[Image: info summary of the original dataset]

The dataset contains 19 columns, of which I chose 11 based on the most popular GitHub terminology and their relevance to this discussion. The original column names contain typos, so I have renamed them for clarity.

[Image: first rows of the dataframe]

A brief summary of the columns in the data:

  • Topic – A tag that describes the field or domain of the repository
  • Repo_Name – Repository name (repository short name)
  • Username – Repository owner name
  • Star – Number of stars a repository has received
  • Fork – Number of times a repository has been forked
  • Watch – Number of users watching the repository
  • Issues – Number of open issues
  • Pull_Requests – Total pull requests raised
  • Topic_Tags – List of topic tags added to the repository by the user
  • Commits – Total number of commits made
  • Collaborators – Number of people contributing to the repository

Notice that the Star, Fork, and Watch columns use 'k' to denote thousands, so let's convert them to multiples of 1000. Let's also remove the ',' (commas) from the Issues and Commits columns.
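A minimal sketch of that cleanup, assuming the counts are stored as strings such as '12.5k' or '1,234'. The helper name parse_count and the sample values are illustrative and do not come from the original notebook.

```python
def parse_count(value):
    """Convert strings like '12.5k' or '1,234' to an integer count."""
    text = str(value).strip().lower().replace(',', '')
    if text.endswith('k'):
        # 'k' denotes thousands, so scale the numeric part by 1000
        return int(round(float(text[:-1]) * 1000))
    return int(float(text))


# illustrative values only, not rows from the dataset
samples = ['12.5k', '980', '1,234', '3k']
print([parse_count(s) for s in samples])
```

With pandas, the same helper could be applied column by column, for example github_df['Star'] = github_df['Star'].apply(parse_count).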

Now that the columns are numeric, we can obtain basic statistical information from them.

# display basic statistical details about the columns
github_df.describe()
[Image: summary statistics of the numeric columns]

1. Analysis of top repositories based on popularity

What makes a GitHub repository popular? This question can be answered with 3 metrics: star, watch, and fork.

  • Star: when a user likes your repository or wants to show some appreciation, they star it.
  • Watch: when a user wants to be notified of all activity in a repository, they watch it.
  • Fork: when a user wants a copy of the repository or intends to contribute to it, they fork it.
# create a dataframe with average values of the columns across all topics
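# numeric_only=True skips text columns such as Repo_Name, which recent pandas versions refuse to average
pop_mean_df = github_df.groupby('Topic').mean(numeric_only=True).reset_index()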
pop_mean_df
[Image: pop_mean_df output]

1.1 Star analysis

Visualizing the average number of stars across each topic,

[Image: average stars per topic]
# top 10 most starred repos
github_df.nlargest(n=10, columns="Star")[['Repo_Name','Topic','Star']]
[Image: top 10 most starred repositories]
# Quick tip: '\033[1m' starts bold text in the terminal and '\033[0m' resets it to normal.
# idxmax() returns an index label, so look the row up once with .loc
top_star = github_df.loc[github_df['Star'].idxmax()]
print('Most starred repository is {}{}{} in the topic {}{}{} with {}{}{} stars'.
      format('\033[1m', top_star['Repo_Name'], '\033[0m',
             '\033[1m', top_star['Topic'], '\033[0m',
             '\033[1m', top_star['Star'], '\033[0m'))
[Image: most starred repository output]

Among the top 10 most starred repositories, 4 are frameworks (Vue, React, TensorFlow, Bootstrap) and 6 are about JavaScript.

1.2 Watch analysis

Visualizing the average number of watchers across each topic,

[Image: average watchers per topic]

Note: The code for the above graph is the same as for the ‘Average Stars on each topic’ graph except for the column names, so I have left it out to avoid redundancy.

# top 10 most watched repos
github_df.nlargest(n=10, columns="Watch")[['Repo_Name','Topic','Watch']]
[Image: top 10 most watched repositories]
# look the row up once with .loc and reset the bold formatting at the end
top_watch = github_df.loc[github_df['Watch'].idxmax()]
print('Most watched repository is {}{}{} in the topic {}{}{}'.
      format('\033[1m', top_watch['Repo_Name'], '\033[0m',
             '\033[1m', top_watch['Topic'], '\033[0m'))
[Image: most watched repository output]

Among the top 10 most watched repositories, 4 are frameworks (TensorFlow, Bootstrap, React, Vue), 6 are about JavaScript, and 5 contain learning content for programmers.

1.3 Fork analysis

Visualizing the average number of forks across each topic,

[Image: average forks per topic]
# top 10 most forked repos
github_df.nlargest(n=10, columns="Fork")[['Repo_Name','Topic','Fork']]
[Image: top 10 most forked repositories]
# look the row up once with .loc and reset the bold formatting at the end
top_fork = github_df.loc[github_df['Fork'].idxmax()]
print('Most forked repository is {}{}{} in the topic {}{}{}'.
      format('\033[1m', top_fork['Repo_Name'], '\033[0m',
             '\033[1m', top_fork['Topic'], '\033[0m'))
[Image: most forked repository output]

Among the top 10 most forked repositories, 4 are frameworks (TensorFlow, Bootstrap, spring-boot, React) and 5 contain learning content for programmers.

1.4 Relationship between Star, Fork, and Watch

Often, users fork a repository when they want to contribute to it. So, let's explore the relationship between Star and Fork, and between Watch and Fork.

[Image: relationship between Star and Fork]
# set figure size and dpi
fig, ax = plt.subplots(figsize=(8,4), dpi=100)

# set seaborn theme for background grids
sns.set_theme('paper')

# plot the data
sns.regplot(data=github_df, x='Watch', y='Fork', color="purple");

# set x and y-axis labels and title
ax.set_xlabel('Watch', fontsize=13, color="#333F4B")
ax.set_ylabel('Fork', fontsize=13, color="#333F4B")
fig.suptitle('Relationship between Watch and Fork',fontsize=18, color="#333F4B")
[Image: relationship between Watch and Fork]

The data points lie much closer to the regression line for Watch and Fork than for Star and Fork.

This suggests that a user who is watching a repository is more likely to fork it.
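The visual comparison could also be backed by a number. The following is a minimal sketch, assuming Pearson's correlation coefficient is an acceptable measure of how tightly the points follow a straight line. The function name pearson_r and the sample values are made up for illustration and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance and variances (unnormalized; the 1/n factors cancel out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# made-up numbers purely for illustration
watch = [100, 200, 300, 400, 500]
fork = [210, 390, 610, 800, 1000]
print(round(pearson_r(watch, fork), 3))
```

On the real data, pandas offers the same measure directly, for example github_df['Watch'].corr(github_df['Fork']) compared with github_df['Star'].corr(github_df['Fork']). A value closer to 1 would correspond to points hugging the regression line more tightly.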

2. Analysis of users with the most repositories

Let's take a look at the users who have the most popular repositories.

[Image: top users]

Among the top 10 users with the most repositories,

  • Microsoft tops the list with 17 repositories.
  • Google follows with 15 repositories.
  • 6 of them are companies or owned by a company (Microsoft, Google, Adafruit, Alibaba, PacktPublishing, flutter)
  • 3 are individual users (junyanz, rasbt, MicrocontrollersAndMore)
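A ranking like the one above could be produced by counting how often each owner appears. This is a minimal sketch on a toy frame, assuming the owner column is named Username as in the column summary. The owner names and rows are placeholders, not dataset values.

```python
import pandas as pd

# toy frame standing in for github_df; rows are placeholders only
toy_df = pd.DataFrame({
    'Username': ['owner_a', 'owner_b', 'owner_a', 'owner_c', 'owner_b', 'owner_a'],
    'Repo_Name': ['r1', 'r2', 'r3', 'r4', 'r5', 'r6'],
})

# count repositories per owner (sorted in descending order) and keep the first 10
top_users = toy_df['Username'].value_counts().head(10)
print(top_users)
```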

3. Understanding contribution activities in repositories

GitHub is famous for its contribution graph.

[Image: screenshot of a GitHub contribution graph]

This chart is a record of all the contributions a user has made. Whenever a user makes a commit, opens an issue, or proposes a pull request, it is counted as a contribution. There are four columns related to contributions in our dataset: Issues, Pull_Requests, Commits, and Collaborators. Let's see if there is any real relationship between them.

[Image: correlation between the contribution columns]

The number of commits does not depend on issues, pull requests, or contributors. There is a moderate positive relationship between issues and pull requests.
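Correlations like these could be computed with a pairwise correlation matrix. The sketch below uses a toy frame with invented values. It assumes the column names from the summary above and that "most popular" means "most starred", which the article does not state explicitly.

```python
import pandas as pd

# toy frame standing in for github_df; all values are invented for illustration
toy_df = pd.DataFrame({
    'Star': [500, 400, 300, 200, 100, 50],
    'Issues': [10, 40, 25, 5, 60, 30],
    'Pull_Requests': [12, 35, 20, 8, 55, 28],
    'Commits': [900, 150, 4000, 700, 1200, 300],
    'Collaborators': [30, 5, 80, 12, 45, 9],
})
contrib_cols = ['Issues', 'Pull_Requests', 'Commits', 'Collaborators']

# pairwise Pearson correlation over the whole frame
corr_all = toy_df[contrib_cols].corr()

# same matrix restricted to the n most starred rows (the article uses n=100)
corr_top = toy_df.nlargest(n=3, columns='Star')[contrib_cols].corr()

print(corr_all.round(2))
```

Values near 0 indicate little linear relationship, while values near 1 or -1 indicate a strong one. The resulting matrix could then be passed to a heatmap for display.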

Let's explore the 100 most popular repositories and see if the same holds,

[Image: correlation between the contribution columns for the 100 most popular repositories]

The picture for the 100 most popular repositories is almost the same as for the overall dataset.

Let's look at the users with the most repositories,

[Image: correlation between the contribution columns for the top users]

Surprisingly, users with more repositories tend to be more active. There is a fairly strong positive correlation between:

  • Commits and pull requests
  • Commits and issues
  • Pull requests and issues

Regarding contributions,

  • There is no strong relationship between contribution activities in the overall dataset.
  • There is also no strong correlation between the contributions in the 100 most popular repositories.
  • For users with more repositories, the contribution activities are much more strongly related.

4. Topic tag analysis

Adding tags to a repository is a way to classify it by topic. Tags help other users find and contribute to that repository, and they also help you explore topics on the platform by type, domain, technology, etc.

The Topic_Tags column consists of lists. To find popular tags, convert the entire column to a list of lists and count the occurrences of each tag. With that, we can visualize some of the most popular topic tags and see which topics tend to get tagged the most.
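The counting step described above can be sketched as follows. This assumes each cell holds a list literal stored as a string, which is how list columns usually come back from a CSV file. The sample rows are illustrative and not taken from the dataset.

```python
import ast
from collections import Counter

# illustrative cells; in the real data these would come from the tags column
raw_tags = [
    "['python', 'machine-learning']",
    "['javascript', 'react']",
    "['python', 'deep-learning', 'machine-learning']",
]

# parse each cell into a real Python list, giving a list of lists
topic_tags = [ast.literal_eval(cell) for cell in raw_tags]

# flatten the nested lists and count how often each tag occurs
tag_counts = Counter(tag for tags in topic_tags for tag in tags)
print(tag_counts.most_common(2))
```

Calling most_common(15) on the counter would give the 15 most frequent tags for plotting.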

[Image: most popular tags]

Of the 15 most popular tags, 10 belong to the world of data science.

# number of tags in each repository's tag list (topic_tags is the list of lists built from the Topic_Tags column)
len_tags = [len(tag) for tag in topic_tags]

# create a new column -> total_tags
github_df['Total_Tags'] = len_tags

# group based on topic and calculate total_tags in each topic
topic_wise_tags = github_df.groupby('Topic')['Total_Tags'].sum().reset_index(name="Total Tags")

# set figure size and dpi
fig, ax = plt.subplots(figsize=(7,4), dpi=100)

# remove background grids
ax.grid(False)
ax.set_facecolor('white')
sns.despine()

# plot the data
sns.barplot(data=topic_wise_tags,x='Total Tags', y='Topic', ci = None, palette="gist_rainbow")

# set x and y-axis labels and title
ax.set_xlabel('Total Tags', fontsize=13, color="#333F4B")
ax.set_ylabel('Topic', fontsize=13, color="#333F4B")
fig.suptitle('Tags distribution across topics',fontsize=18, color="#333F4B")
[Image: tag distribution across topics]

Repositories on the topics Computer Vision, Data Science, and Machine Learning tend to have the most tags.

Let's end with a word cloud of topic_tags,

[Image: word cloud of topic tags]

Inference:

  • Among the top 10 most starred, watched, and forked repositories, 4 are frameworks.
  • TensorFlow is the most watched and forked repository.
  • A user who is watching a repository is more likely to fork it.
  • Microsoft and Google are the users with the most popular repositories.
  • Among the top 10 users with the most popular repositories, 6 are companies.
  • There is no strong relationship between contribution activities (issues, pull requests, commits).
  • The most used tags are Machine Learning, Deep Learning, Python, Computer Vision, JavaScript.
  • Repositories on the topics Computer Vision, Data Science, and Machine Learning have the most tags.

If we had analyzed data from a decade ago, these trends would have been completely different. It looks like data science has seen tremendous growth in recent years!

Thanks for reading all the way here! I would love to connect on LinkedIn.

Let me know in the comment section if you have any concerns, comments, or criticism. Have a nice day!

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
