Better Dask Than Spark
I like to meddle with data. I wanted to be an A.I. guy when I was in college, but we were not allowed to take that elective, and A.I. then was very different from what it is now. Back in 2004, 1 GB was considered a luxury, very few people carried laptops, desktops were what we meant by computers, and there were no iPhones. People had fewer devices spying on them.
Now things have changed, thanks to the bullish push of silicon foundries around the world. The price of compute has fallen like a rock. Anyone with the right skills and a slim wallet can have banks of computers working for them in minutes, and the Raspberry Pi is far smaller, far more powerful, and far cheaper than the first computer I had.
For people who love computers, it's like being a kid in a candy store where the price of candy drops by the day. I wish I saw the same smarts and price drops in the energy, health, education, and food industries.
With computers becoming so cheap and so much data available, an A.I. project that once took months or a year can now possibly be completed in a week or two. That is how far computer hardware, and the software that runs on it, has come. Thanks to the Free Software Foundation, most of the great software available today is affordable and free of shackling licences.
Now, for people like me, doing an A.I. curiosity project is a monthly exercise rather than a dream. I use Python, NumPy, and scikit-learn for most of my stuff. Most A.I. work is just looking at data, cleaning it, fitting it to a model, and checking the accuracy of the predictions.
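To give a feel for that workflow, here is a minimal sketch of it in scikit-learn. The CSV file, the `label` column, and the choice of model are just stand-ins, not anything from a real project:

```python
# A minimal sketch of the usual workflow: look, clean, fit, score.
# "mydata.csv" and the 'label' column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Look at the data.
df = pd.read_csv("mydata.csv")
print(df.describe())

# Clean it: drop rows with missing values, split features from target.
df = df.dropna()
X = df.drop(columns=["label"])
y = df["label"]

# Fit a model on a training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Check the accuracy of the predictions on held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```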
Now let's get to the practical situation. We know that an A.I. algorithm spits out a better predictive model when it has more data to train on. So you have a toy data set in your office, and you are confident the code you wrote has spat out a model that is good enough. Then your boss tells you to scale, and your smile disappears. You may be good at looking at data and making computers predict something out of it, but when it comes to scaling, nah, that's not your game.
You need Spark, right? I don't think so. There is a newer tool called Dask (https://dask.org/). I haven't tried it yet, but reading this comparison, https://docs.dask.org/en/latest/spark.html, tells me it may be time to say goodbye to Spark. Rather than needing a separate scaling-up guy for your A.I. project, Dask might let a data scientist do it, and it seems to fit the normal data science workflow without adding much to the learning curve. So try it out if your project is getting bigger.
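Going by the Dask documentation (again, I haven't run this at scale myself), the pitch is that dask.dataframe mirrors the pandas API, so the cleaning step from earlier barely changes when the data outgrows one machine. A small sketch of what that looks like, with a hypothetical file glob:

```python
# A small sketch based on the Dask docs: dask.dataframe mirrors pandas,
# so pandas-style code scales to data that doesn't fit in memory.
import dask.dataframe as dd

# Read many CSVs lazily as one logical dataframe; "mydata-*.csv"
# is a hypothetical glob, not a real dataset.
df = dd.read_csv("mydata-*.csv")

# The same pandas-style operations, now run over parallel chunks.
df = df.dropna()
means = df.groupby("label").mean()

# Nothing computes until you ask for a result.
print(means.compute())
```

The appeal for a data scientist is exactly that: the code reads like pandas, and per the docs the same code can run on a laptop's cores or be pointed at a cluster, without learning a whole new Spark-style API.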