When to Hadoop, and When Not To
There’s an abundance of interest in Hadoop, and for good reason–the open source framework has changed how we approach processing and storing very large, diverse, and fast-moving data sets, i.e. “big data.” And while there is a movement to turn Hadoop into a general-purpose processing platform, there are certain applications where Hadoop makes more sense than others. Here are five circumstances when you should use Hadoop, and five when you shouldn’t.
Five Reasons You Should Use Hadoop:
1. Your Data Sets Are Really Big
Almost everybody thinks their data is big. But don’t even think about Hadoop if the data you want to process is measured in MBs or GBs. Hadoop (at least in its current, mostly MapReduce-based incarnation) imposes limitations in terms of how applications are programmed and how quickly you can get results out of it. If the data driving the main problem you are hoping to use Hadoop to solve is measured in GBs, save yourself the hassle and use Excel, a SQL BI tool on Postgres, or some similar combination. On the other hand, if it’s several TB or (even better) measured in petabytes, Hadoop’s superior scalability will save you a considerable amount of time and money.
2. You Celebrate Data Diversity
One of the advantages of the Hadoop Distributed File System (HDFS) is that it’s really flexible in terms of data types. It doesn’t matter whether your raw data is structured (like output from an ERP system), semi-structured (like XML and log files), unstructured (like video files), or all three: Hadoop and its forgiving, schema-on-read approach will gobble it up like an insatiable beast. What’s more, Hadoop makes it really easy to continually append new data to what you’ve already stored. This is very handy if you’re mixing and matching all kinds of data–for example, transaction data, clickstream data, social sentiment data, and geo-location data–into a big old Hadoop mud pie.
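As a rough sketch of what that landing process can look like, the snippet below uses Hadoop’s Java FileSystem API to drop structured, semi-structured, and unstructured files side by side into a single HDFS directory. The cluster URI, directory, and file names here are purely illustrative, not a prescription.

// Minimal sketch: land mixed raw data in one HDFS directory.
// The namenode URI, paths, and file names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandRawData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical cluster
        FileSystem fs = FileSystem.get(conf);

        Path landing = new Path("/data/raw");               // one landing zone for everything
        fs.mkdirs(landing);

        // Structured, semi-structured, and unstructured files sit side by side;
        // no schema is imposed until the data is read.
        fs.copyFromLocalFile(new Path("erp_export.csv"),  new Path(landing, "erp_export.csv"));
        fs.copyFromLocalFile(new Path("web_server.log"),  new Path(landing, "web_server.log"));
        fs.copyFromLocalFile(new Path("promo_video.mp4"), new Path(landing, "promo_video.mp4"));

        fs.close();
    }
}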
3. You Have Mad Programming Skills
Hadoop is written in Java, and therefore requires Java programming skills to master. We’re beginning to see more turnkey applications for Hadoop–that is part and parcel of the drive to turn Hadoop into a general purpose computing framework. But currently most Hadoop apps doing work in the wild are written by in-house Java programmers or hired guns. There’s a good reason that Java developers with data science skills are in incredibly high demand.
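For a sense of what that Java work looks like, here is a minimal sketch of the canonical word-count MapReduce job, closely following the example in the standard Hadoop tutorial; input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even this toy job requires familiarity with Writable types, the Mapper and Reducer APIs, and job configuration, which is why turnkey tooling and experienced Java developers matter so much.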
4. You Are Building an ‘Enterprise Data Hub’ for the Future
If you work in a large enterprise, you might sign up for Hadoop even if your data isn’t particularly massive, diverse, or fast-moving at this point in time. It might make sense to start experimenting with Hadoop even if your data warehouse is doing its job and even if there isn’t much benefit to being an early adopter of Hadoop in your industry. If you like what Hadoop distributors like Cloudera are doing with the “enterprise data hub” vision, and you can see how Hadoop might be beneficial in the future, then it might make sense to build up your IT staff’s Hadoop skills now, so you’re ready to take advantage when the elephant really starts sizzling and goes mainstream in a few years.
5. You Find Yourself Throwing Away Perfectly Good Data
One of the great things about Hadoop is its capability to store petabytes of data. If you find that you are throwing away potentially valuable data because it costs too much to archive, you may find that setting up a Hadoop cluster allows you to retain that data and gives you the time to figure out how to best make use of it.
Five Reasons Not to Use Hadoop:
1. You Need Answers in a Hurry
Hadoop is probably not the ideal solution if you need really fast access to data. The various SQL engines for Hadoop have made big strides in the past year and will likely continue to improve, but that progress is driven largely by necessity, and by the need to let the huge number of existing business intelligence tools that speak SQL get at data stored in Hadoop. If you’re using MapReduce to crunch your data, expect to wait days or even weeks to get results back.
2. Your Queries Are Complex and Require Extensive Optimization
Hadoop is great because it gives you a massively parallel cluster built from low-cost Lintel servers (or Wintel servers in the case of Hortonworks’ distribution) and scads of cheap hard disk capacity. While the hardware and scalability are straightforward, getting the most out of Hadoop typically requires a hefty investment in the technical skills needed to optimize queries. According to a paper written by Hortonworks and Teradata, the software-based optimizers that ship with traditional data warehouse platforms can often outperform Hadoop.
3. You Require Random, Interactive Access to Data
Pushback against the limitations of the batch-oriented MapReduce paradigm in early Hadoop led the community to improve SQL performance and boost Hadoop’s capability to serve interactive queries against randomly accessed data. Products such as Cloudera’s Impala and the Hortonworks Stinger initiative to improve the Hive SQL engine have emerged and are making headway. But while SQL on Hadoop is getting better, in most cases it’s not a reason in and of itself to adopt Hadoop.
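As a rough illustration of how an existing SQL-speaking tool reaches into Hadoop, the sketch below runs a query over HiveServer2 via JDBC. The host, credentials, and the clickstream table are hypothetical; Impala exposes a very similar JDBC/ODBC interface.

// Minimal sketch: query Hadoop-resident data through HiveServer2 over JDBC.
// Host name, user, and table are made up for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM clickstream " +
                     "GROUP BY page ORDER BY hits DESC LIMIT 10")) {
            // Print the top pages by hit count.
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}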
4. You Want to Store Sensitive Data
Hadoop is evolving quickly and is able to do a lot of things that it couldn’t do just a few years ago. But one of the things that it’s not particularly good at today is storing sensitive data. Hadoop today offers only basic data and user access security. And while these features are improving by the month, the risk of accidentally exposing personally identifiable information because of Hadoop’s less-than-stellar security capabilities is probably not one worth taking.
5. You Want to Replace Your Data Warehouse
A lot has been said about how Hadoop is decimating the market for traditional data warehouse platforms. And while there may be a grain of truth to that–it appears that Teradata customers are putting off upgrades until they can figure out this Hadoop thing–most data pros will tell you that Hadoop is complementary to a traditional data warehouse, not a replacement for it. The superior economics of Hadoop-based storage make it an excellent place to land raw data and pre-process it before siphoning it over to a traditional data warehouse to run analytic workloads.