I attended the IRMAC Meetup last Wednesday, a talk titled “A Brief History of Hadoop vs. The Cloud” by Neil Hepburn. The slides are publicly available at: http://www.irmac.ca/events/list.php?event_id=119
A very relevant talk, as enterprises keep insisting on on-prem Hadoop data centers. This puzzles me, since the on-prem route takes longer and requires a major upfront commitment: time to set up, upfront capital, dedicated sysadmins, dizzying upgrades, etc. Meanwhile, HDInsight in the Azure cloud, a Hortonworks Hadoop cluster offered by Microsoft, can be set up in 30 minutes or so. For a PoC, this is blazing fast.
My clients voice security and privacy concerns, and I recall Armbrust et al. at UC Berkeley raising this same bottleneck in a paper published 10 years ago, currently cited over 9,000 times: https://dl.acm.org/citation.cfm?id=1721672. I guess the clients are not yet convinced by the advances in cloud security and privacy.
My academic colleague, Mark Shtern at York University, asked me at CASCON 2017 whether I had considered the cost of changing processes from on-prem to cloud in my analysis. My refuge for now: TBD :p
Hadoop in its simplest form requires managing the hardware yourself, along with operational services such as Hive that sit on top. The talk presented Platform-as-a-Service (PaaS) offerings such as Amazon Redshift as a simpler alternative to setting up a data warehouse with Hive. Similar claims were made for other Hadoop services, including S3 in place of HDFS. I have requested a mapping of services from the speaker. Provided such a mapping, or even a partial one, exists, the speaker sees dynamic Hadoop clusters being spun up only for specific jobs such as processing unstructured data.
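To keep track of what has actually been claimed so far, here is a minimal sketch of that partial mapping in Python. It records only the two pairings I noted from the talk (Hive to Amazon Redshift, HDFS to S3); the names `HADOOP_TO_CLOUD`, `cloud_alternative` and `unmapped` are my own, hypothetical helpers, and the `YARN` entry in the usage note is just an example input, not something the speaker mapped.

```python
from typing import List, Optional

# Partial mapping: only the two pairings named in the talk.
# Everything else is deliberately left out until the speaker's mapping arrives.
HADOOP_TO_CLOUD = {
    "Hive": "Amazon Redshift",
    "HDFS": "Amazon S3",
}


def cloud_alternative(hadoop_service: str) -> Optional[str]:
    """Return the cloud service claimed as an alternative, or None if no claim was made."""
    return HADOOP_TO_CLOUD.get(hadoop_service)


def unmapped(services: List[str]) -> List[str]:
    """Return the services in a stack that still have no claimed cloud alternative."""
    return [s for s in services if s not in HADOOP_TO_CLOUD]
```

For example, `unmapped(["Hive", "HDFS", "YARN"])` leaves `["YARN"]`, which is exactly the kind of gap I am waiting on the speaker to fill.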
While I am a big fan of the cloud, I am not yet convinced by the idea of dynamic, job-specific Hadoop clusters, at least until I see the mapping. To me, it’s a good thought, but one that merits investigation.