With new analytics technologies coming out so fast, it's hard to keep up with the best tool for the job. Take the Berkeley Data Analytics Stack (BDAS), which features Spark, Shark, and Mesos for advanced analytics and mining. Should I use this or stick with Apache Hadoop, Hive, and Mahout? How do you decide? In my experience, this is the most common stack:
- Hadoop: the distributed file system (HDFS) for data collection.
- Database: HBase or Cassandra to enable random reads.
- Analysis: Hive, Pig, or Impala for advanced analysis.
- Real-Time: Storm or Spark
- Visualization: Tableau Software or, if you have programmers, D3.js
- Applications: Datameer, Alpine Data Labs, WibiData, Wise.io, others?
- Infrastructure: On-premises or hosted?
- Add-ons: Hue, Sqoop, and Flume.
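To make the choices concrete, here is a minimal Python sketch that writes the layers above down as a plain dictionary and picks one tool per layer. The layer keys, the `STACK_OPTIONS` name, and the `pick_stack` helper are my own illustrative assumptions, not part of any of these products, and the "first option wins" default is arbitrary rather than a recommendation.

```python
# Layers and candidate tools, copied from the list above.
# The key names are hypothetical labels chosen for this sketch.
STACK_OPTIONS = {
    "storage": ["Hadoop"],
    "database": ["HBase", "Cassandra"],
    "analysis": ["Hive", "Pig", "Impala"],
    "real_time": ["Storm", "Spark"],
    "visualization": ["Tableau", "D3.js"],
    "infrastructure": ["on-premises", "hosted"],
}


def pick_stack(choices=None):
    """Return one tool per layer, defaulting to the first listed candidate."""
    choices = choices or {}

    # Reject layers that are not in the list at all.
    unknown = set(choices) - set(STACK_OPTIONS)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")

    stack = {}
    for layer, options in STACK_OPTIONS.items():
        tool = choices.get(layer, options[0])
        # Reject tools that are not listed as candidates for this layer.
        if tool not in options:
            raise ValueError(f"{tool!r} is not a listed option for {layer!r}")
        stack[layer] = tool
    return stack


if __name__ == "__main__":
    print(pick_stack({"database": "Cassandra", "real_time": "Spark"}))
```

Writing the stack down this way mostly helps show where a decision is still open: any layer with more than one candidate is a choice you have to make.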
Is this generally what you see? Are there additional configurations I'm missing? Feel free to leave a comment or contact me directly.