If software is eating the world, Python should be the mascot. This week we look into Python’s dominance of the machine learning industry and how it will only get bigger.
Hello, world. Meet Python.
How much would you pay for a machine learning tool that gives you access to almost every machine learning algorithm, has new algorithms developed for you by teams from Google/Facebook/Uber and has access to a global messaging forum where engineers would answer any of your questions? Given all the brouhaha over ML/AI, this tool would probably cost a lot.
This tool is free and is called Python.
Originally released in 1991, Python is a high-level, general-purpose, open source programming language. All of that fancy language means that it is easy to read and write, free to use, and anyone can see how the code is written. Because Python is so easy and you can do almost anything with it, it is on track to become the most used programming language and has inspired cartoons like this:
People love Python
Love, Python. Per the Economist, and proven in this link, more people searched Google for the Python programming language than for Kim Kardashian. Python is the most preferred language by a significant margin, according to a HackerRank survey. Finally, attendance at PyCon, the annual Python conference, has tripled since 2010. (As a funny aside, PyCon attendance dipped slightly from 2017 to 2018 as the conference moved from Portland, Oregon to Cleveland. Just stating the facts.)
Python leads in machine learning
Because Python is so popular, it only makes sense that it has become the language of choice for implementing machine learning code. In a 2017 Kaggle survey of data scientists, Python was the most frequently used tool, and it’s worth mentioning that “Jupyter Notebooks” and “TensorFlow” are both used with Python.
Implementing machine learning in Python is also quite easy. The leading machine learning package, called scikit-learn, gives users access to almost every machine learning algorithm under the sun (seriously, if you click on this link, it’s unbelievable), as well as error metrics and data manipulation tools. Implementing an algorithm is insanely easy: you just say “algorithm.fit(data)” and voilà, you have a machine learning model. The beauty of scikit-learn is that it is all open source, meaning that once you learn a machine learning concept, you can see how it is implemented in code to get a better understanding of how it works. Scikit-learn’s dominance is so great that when you search for “Machine Learning” on Amazon, the first match is a book called “Hands-On Machine Learning with Scikit-Learn and TensorFlow.”
Major tech companies are also contributing to the Python community and allocating talent towards developing open source solutions. Facebook released a sophisticated time series/forecasting tool in Python/R called Prophet, Google recently released its implementation of a tool for automatically tuning neural networks, and Uber released its probabilistic programming language, built on Python. It is also increasingly common for researchers who present papers at conferences to post the accompanying code in Python; as long as you know the language, you have a front row seat to the cutting edge in machine learning.
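To make the “algorithm.fit(data)” idea concrete, here is a minimal sketch of what that looks like in scikit-learn. The choice of dataset (the built-in iris flowers), algorithm (a decision tree) and the 25% hold-out split are my own assumptions for illustration, not anything prescribed by the library. Note that for a supervised model like this one, the call takes both the measurements and the labels, so it reads fit(X, y) rather than fit(data).

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset: flower measurements (X) and species labels (y).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
# The split size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# This is the "algorithm.fit(data)" step: pick an algorithm, then fit it.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Use one of the library's built-in error metrics to check the result.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm usually means changing only the import and the line that creates the model; the fit and predict calls stay the same, which is a large part of why the package feels so easy.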
Which doesn’t benefit everyone
Python has managed to thoroughly disrupt the market for data science solutions. A good proxy for this is revenue at the SAS corporation, which sells advanced analytics/machine learning solutions. In the past three years, its revenue has increased only about 2% each year, despite an explosion of data science job openings. Many other data science platforms are being developed, adding competition for SAS, but when Python for data science is free and a SAS license costs thousands of dollars a year, you’ll probably go open source. Even more worrying for SAS, in a Kaggle survey of data scientists, when asked ‘What language would you recommend new data scientists learn first?’, only 0.3% of data scientists recommended learning SAS…
Other data science startups, like Dataiku and H2O.ai, have raised tens of millions of dollars and have big-name investors, but there certainly isn’t a dominant leader. The Gartner data science magic quadrant is an odd mash-up of data engineering, visualization and machine learning companies, and if you know enough Python, you can do almost everything these companies offer.
Python takes the next generation
The next generation of deep learning tools is also primarily Python driven. Google’s TensorFlow is meant primarily to be used with Python (although it relies on C++ on the backend), and similar deep learning tools, including PyTorch, Chainer and Keras, also rely on Python. TensorFlow is so popular that it has three times as many contributors as Bitcoin, and it has more “Stars”, which is the GitHub equivalent of “likes”, than the top 100 cryptocurrency projects*. Let me say that again: more people “like” a Python-based deep learning framework than practically all of the cryptocurrency projects combined.
Will everyone learn Python?
Companies are now requiring their employees to learn Python. Per the FT, and confirmed to Cloudy by a source currently enrolled in the program, JP Morgan is putting hundreds of new investment bankers and asset managers through Python coding schools, with the intention of expanding the training to include topics like machine learning and cloud computing. It seems JP Morgan is using Python as the gateway for people to start learning and applying machine learning techniques. While this is an isolated case for now, I would be surprised if more companies did not follow suit.
“Hands-On Machine Learning with Scikit-Learn and TensorFlow” may be this year’s most popular stocking stuffer.