Machine Learning Top Topic on the Minds of GitHub Users

Machine learning (ML) and data science are becoming hot topics on GitHub, the organization reported last week. A recent analysis of the “Octoverse,” the nickname for the community of users of the popular code repository and social coding platform recently acquired by Microsoft, revealed that AI/ML development tools, such as TensorFlow and PyTorch, are among its fastest-growing projects. And Python, one of the most popular languages for AI/ML development, was the third-most popular language on GitHub.

“We looked at contributors to repositories tagged with the 'machine-learning' topic, and ranked the most common primary languages of the repositories,” explained Thomas Elliot, a data scientist in charge of product analytics at GitHub, in a blog post. Although Python is currently the most common language among ML repositories and the third-most common language on GitHub overall, he noted, not all ML happens in Python. Some of the most common languages on GitHub are also common languages for ML projects. According to Elliot, C++, JavaScript, Java, C#, Shell and TypeScript are all in the top 10 languages on GitHub and the top 10 for ML projects. Julia, R and Scala all appear in the top 10 for ML projects, though not for GitHub overall. Julia and R are both languages commonly used by data scientists, and Scala is becoming increasingly common when interacting with Big Data systems, such as Apache Spark, he added.
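The ranking Elliot describes can be illustrated with a short Python sketch. It assumes the ranking counts distinct contributors per primary language, which the blog post does not spell out, and the records, repository names and the rank_languages helper are hypothetical stand-ins for GitHub's internal data.

```python
from collections import Counter

# Hypothetical sample: (repository, primary language, contributor) records.
# These rows are invented for illustration; they are not GitHub data.
contributions = [
    ("repo-a", "Python", "alice"),
    ("repo-a", "Python", "bob"),
    ("repo-b", "C++", "alice"),
    ("repo-c", "Python", "carol"),
    ("repo-d", "JavaScript", "dave"),
    ("repo-b", "C++", "alice"),  # repeat contribution, counted once below
]


def rank_languages(records):
    """Rank primary languages by number of distinct contributors.

    Assumption: a contributor counts once per language, no matter how
    many contributions or repositories are involved.
    """
    # De-duplicate so each (language, contributor) pair counts once.
    pairs = {(language, user) for _, language, user in records}
    counts = Counter(language for language, _ in pairs)
    # Sort by count (descending), then by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_languages(contributions)
for language, contributors in ranking:
    print(f"{language}: {contributors}")
```

Counting distinct contributors rather than raw contributions keeps one very active user from inflating a language's rank; a different counting rule would give a different order.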

Four of the top contributed ML projects on GitHub focus on image processing (CMU-Perceptual-Computing-Lab/openpose, thtrieu/darkflow, ageitgey/face_recognition and tesseract-ocr/tesseract). TensorFlow, the open source library for numerical computation and large-scale machine learning, remained the most popular ML project on GitHub in 2018, with five times as many contributors as scikit-learn, the second-most popular project.

For this “State of the Octoverse” report, GitHub data scientists pulled data on contributions between Jan. 1, 2018, and Dec. 31, 2018, Elliot said. Contributions could include pushing code, opening an issue or pull request, commenting on an issue or pull request, or reviewing a pull request. “For the most imported packages, we used data from the dependency graph,” he said, “which includes all public repositories and any private repositories that have opted in to the dependency graph.”

The data scientists also found:

  • NumPy, a package with support for mathematical operations on multidimensional data, was the most imported package, used in nearly three-quarters of machine learning and data science projects.
  • SciPy, a package for scientific computation, pandas, a package for managing datasets, and matplotlib, a visualization library, are all used in over 40 percent of machine learning and data science projects.
  • Scikit-learn, a popular machine learning package containing implementations of a large number of machine learning algorithms, is used by nearly 40 percent of projects.
  • TensorFlow, a package for working with neural nets, is used in nearly a quarter of projects.

The list of the top 10 also included utility packages: six, a Python 2 and 3 compatibility library, and python-dateutil and pytz, packages for working with dates.
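The "share of projects" figures above can be read as the fraction of projects that import a given package at least once. A minimal Python sketch of that arithmetic follows; the project names, the import sets and the import_share helper are hypothetical, and GitHub's actual dependency graph pipeline is not described in the report.

```python
from collections import Counter

# Hypothetical dependency data: project name -> set of imported packages.
# Invented for illustration; not drawn from GitHub's dependency graph.
project_imports = {
    "proj-1": {"numpy", "pandas", "matplotlib"},
    "proj-2": {"numpy", "scikit-learn"},
    "proj-3": {"numpy", "tensorflow", "six"},
    "proj-4": {"pandas"},
}


def import_share(projects):
    """Return the fraction of projects that import each package."""
    # Each project's imports are a set, so a package counts once per project.
    counts = Counter(pkg for pkgs in projects.values() for pkg in pkgs)
    total = len(projects)
    return {pkg: n / total for pkg, n in counts.items()}


shares = import_share(project_imports)
for pkg, share in sorted(shares.items(), key=lambda item: (-item[1], item[0])):
    print(f"{pkg}: {share:.0%}")
```

In this made-up sample, numpy appears in three of four projects, which mirrors the "nearly three-quarters" wording only by construction.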

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at