Machine Learning Top Topic on the Minds of GitHub Users

Machine learning (ML) and data science are becoming hot topics on GitHub, the organization reported last week. A recent analysis of the “Octoverse,” the nickname for the community of users of the popular code repository and social coding platform recently acquired by Microsoft, revealed that AI/ML development tools, such as TensorFlow and PyTorch, are among its fastest-growing projects. And Python, one of the most popular languages for AI/ML development, was the third-most popular language on GitHub.

“We looked at contributors to repositories tagged with the 'machine-learning' topic, and ranked the most common primary languages of the repositories,” explained Thomas Elliot, a data scientist in charge of product analytics at GitHub, in a blog post. Although Python is currently the most common language among ML repositories and the third-most common language on GitHub overall, he noted, not all ML happens in Python. Some of the most common languages on GitHub are also common languages for ML projects. According to Elliot, C++, JavaScript, Java, C#, Shell and TypeScript are all in the top 10 languages on GitHub and the top 10 for ML projects. Julia, R and Scala all appear in the top 10 for ML projects, though not for GitHub overall. Julia and R are both languages commonly used by data scientists, and Scala is becoming increasingly common when interacting with Big Data systems, such as Apache Spark, he added.
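
As a rough illustration of the kind of ranking Elliot describes, the sketch below counts distinct contributors per primary language and sorts the result. It is not GitHub's actual analysis: the sample records, the rank_languages helper and the choice to count each contributor once per language are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample data: (contributor, repository, primary language).
# Real input would come from repositories tagged "machine-learning".
contributions = [
    ("alice", "org/vision-lib", "Python"),
    ("bob", "org/vision-lib", "Python"),
    ("alice", "org/fast-infer", "C++"),
    ("carol", "org/fast-infer", "C++"),
    ("carol", "org/stats-kit", "R"),
    ("bob", "org/notebooks", "Python"),
    ("dave", "org/vision-lib", "Python"),
]


def rank_languages(records):
    """Rank languages by their number of distinct contributors (assumed metric)."""
    contributors_by_language = defaultdict(set)
    for contributor, _repo, language in records:
        # A set ensures each contributor is counted once per language.
        contributors_by_language[language].add(contributor)
    counts = {lang: len(users) for lang, users in contributors_by_language.items()}
    # Sort by count (descending), then by name so ties are ordered predictably.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_languages(contributions)
for position, (language, count) in enumerate(ranking, start=1):
    print(f"{position}. {language}: {count} contributors")
```

Counting distinct contributors rather than raw contributions keeps a single very active user from dominating a language's rank; a different weighting would be just as easy to swap in.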

Four of the top contributed ML projects on GitHub focus on image processing (CMU-Perceptual-Computing-Lab/openpose, thtrieu/darkflow, ageitgey/face_recognition and tesseract-ocr/tesseract). TensorFlow, the open source library for numerical computation and large-scale machine learning, remained the most popular ML project on GitHub in 2018, with five times as many contributors as scikit-learn, the second-most popular project.

For this “State of the Octoverse” report, GitHub data scientists pulled data on contributions between Jan. 1, 2018, and Dec. 31, 2018, Elliot said. Contributions could include pushing code, opening an issue or pull request, commenting on an issue or pull request, or reviewing a pull request. “For the most imported packages, we used data from the dependency graph,” he said, “which includes all public repositories and any private repositories that have opted in to the dependency graph.”

The data scientists also found:

  • NumPy, a package with support for mathematical operations on multidimensional data, was the most imported package, used in nearly three-quarters of machine learning and data science projects.
  • SciPy, a package for scientific computation; pandas, a package for managing datasets; and matplotlib, a visualization library, are all used in over 40 percent of machine learning and data science projects.
  • Scikit-learn, a popular machine learning package containing implementations of a large number of machine learning algorithms, is used by nearly 40 percent of projects.
  • TensorFlow, a package for working with neural nets, is used in nearly a quarter of packages.

The top 10 list also included utility packages: six, a Python 2 and 3 compatibility library, and python-dateutil and pytz, packages for working with dates.

About the Author

John has been covering the high-tech beat from Silicon Valley and the San Francisco Bay Area for nearly two decades. He serves as Editor-at-Large for Application Development Trends and contributes regularly to Redmond Magazine, The Technology Horizons in Education Journal, and Campus Technology. He is the author of more than a dozen books, including The Everything Guide to Social Media; The Everything Computer Book; Blobitecture: Waveform Architecture and Digital Design; John Chambers and the Cisco Way; and Diablo: The Official Strategy Guide.