A Technologist Who Speaks Business
I’m at QConSF all this week, so you’ll get to hear my impressions of every session I go to. Lucky you!
Facebook takes in 200GB of new updates per day, and more than 12TB per day once you include derived data. That’s a lot of data, and it has no hope of fitting in a traditional data warehouse like Oracle. Consequently, they use Hadoop for both data storage and data processing, as do many organizations that work at that kind of scale. But once they started doing that, they ran into a problem: it’s very difficult, especially for analysts, to run ad-hoc queries over data stored that way.
So, the Facebook team created HIVE, a SQL-like layer over Hadoop that allows for ad-hoc analysis. HIVE is an open-source subproject of Hadoop. They spent most of the talk describing HIVE and some of the clever ways it uses Hadoop and map-reduce to execute SQL-like queries in parallel.
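To make the idea concrete, here is a minimal sketch of how a SQL-like aggregation such as `SELECT kind, COUNT(*) FROM updates GROUP BY kind` can be split into map, shuffle, and reduce steps. This is my own illustration in plain Python, not HIVE’s actual query planner; the `updates` table, the `kind` column, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical rows, split into partitions as if stored in separate blocks.
PARTITIONS = [
    [{"user": "a", "kind": "photo"}, {"user": "b", "kind": "status"}],
    [{"user": "a", "kind": "status"}, {"user": "c", "kind": "status"}],
]


def map_phase(rows):
    """Emit (group key, 1) for each row; each partition is mapped independently."""
    return [(row["kind"], 1) for row in rows]


def shuffle(pairs):
    """Collect emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_kind(partitions):
    # On a real cluster the map calls would run in parallel on different nodes;
    # here they run one after another for simplicity.
    mapped = chain.from_iterable(map_phase(p) for p in partitions)
    return reduce_phase(shuffle(mapped))


print(count_by_kind(PARTITIONS))
```

The point of the split is that the map step needs no coordination between partitions, so adding machines speeds it up, and only the much smaller per-key values have to move across the network before the reduce step.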
To give you an idea of the kind of load they’re putting through the system, they said they have a production Hadoop cluster with 5800 cores and 8.7 PB of data (divide that by 3 to account for replication). Over this cluster they run ~7500 HIVE jobs per day. Wow. That’s not just massive scale, that’s mind-blowing scale.
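A quick back-of-the-envelope calculation puts those figures in perspective. It assumes the 8.7 PB is raw storage with every block kept three times (my reading of the “/3” note) and that jobs are spread evenly over the day, which they almost certainly are not.

```python
# Figures quoted in the talk.
RAW_STORAGE_PB = 8.7
JOBS_PER_DAY = 7500

# Assumption: each block is stored three times, so unique data is raw / 3.
REPLICATION_FACTOR = 3

unique_data_pb = RAW_STORAGE_PB / REPLICATION_FACTOR

# Assumption: jobs arrive at a uniform rate around the clock.
jobs_per_minute = JOBS_PER_DAY / (24 * 60)

print(f"Unique data: about {unique_data_pb:.1f} PB")
print(f"Average load: about {jobs_per_minute:.1f} HIVE jobs per minute")
```

Under those assumptions that works out to roughly 2.9 PB of unique data and a new HIVE job starting about every 12 seconds.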