IRMAC Meetup: “A Brief History of Hadoop vs. The Cloud”

January 22nd, 2018

I attended the IRMAC Meetup last Wednesday, titled “A Brief History of Hadoop vs. The Cloud” and presented by Neil Hepburn. The slides are publicly available at: http://www.irmac.ca/events/list.php?event_id=119

A very relevant talk, as enterprises remain insistent on on-prem Hadoop data centers. This puzzles me, since the longer on-prem route requires a major upfront commitment: time to set up, upfront capital, dedicated sysadmins, dizzying upgrades, etc. Meanwhile, HDInsight on Azure, Microsoft's Hortonworks-based Hadoop cluster offering, can be set up in 30 minutes or so. For a PoC, this is blazing fast.

My clients voice security and privacy concerns, and I recall this bottleneck from a paper by Armbrust et al. at UC Berkeley, published roughly 10 years ago and currently cited over 9,000 times: https://dl.acm.org/citation.cfm?id=1721672. I guess the clients are not yet convinced by the advances in cloud security and privacy.

My academic colleague, Mark Shtern at York University, asked me at CASCON 2017 whether my analysis considered the cost of changing processes from on-prem to cloud. My answer: TBD :p

Hadoop in its simplest form requires managing the hardware as well as the operational services, such as Hive, that run on top of it. The talk sees Platform-as-a-Service (PaaS) offerings such as Amazon Redshift as a simpler alternative to setting up a data warehouse using Hive. Similar claims were made for other Hadoop services, including S3 in place of HDFS. I have requested a mapping of services from the speaker. Provided such a mapping, or a partial one, exists, the speaker sees dynamic Hadoop clusters being used for specific jobs such as processing unstructured data.

While I am a big fan of the cloud, I am not yet convinced by this idea, at least until I see the mapping. To me, it’s a good thought, but one that merits investigation.