Archive for ‘Conferences’

November 22nd, 2009

Common Themes — QConSF 2009 Impressions

by Tim Cull

I’m at QConSF all this week, so you’ll get to hear my impressions of every session I go to. Lucky you!

I’ve had a good week at QConSF. I’ll try to summarize some of the themes I saw in the tracks I attended.

You don’t know scale like these guys know scale

Many of the presenters were talking about applications at truly mind-blowing scale. Historically, that kind of scale would only apply to secret government operations and serious physics research. But today we’re talking about that scale for silly consumer sites where people post pictures about their cats. The best takeaway from that is that everyone should design with serious scale in mind and no longer assume it doesn’t appply to them. If it doesn’t apply to you today, it may very well tomorrow.

Cheap and horizontal, not expensive and vertical

Every one of these guys that operate at scale do so on commodity or almost commodity hardware. I didn’t hear anyone mention 64-way Sun servers, except as a joke.

Asynchronous interaction and coupling

Applications have to be designed for asynchronous interaction. That means not only between tiers, but also with the user. To get the kinds of performance and resiliency gains many of these sessions were talking about, you can’t do it any other way. Also, asynchronicity helps take advantage of future innovations like cloud computing.

Hadoop and map reduce

Map reduce in general and Hadoop in particular were everywhere. Even the Microsoft guy made a plug for Hadoop. If you’re a technologist and looking for a good framework to spend some time learning, you can’t go wrong with Hadoop.

Measure, monitor, and profile

I remember many years ago attending an SDWest conference where every presenter said something like “and of course here you’d run your unit tests” as if using unit tests were a foregone conclusion. That’s what measuring, monitoring, and profiling were like at this conference. If you’re not doing it routinely on your applications, then you are seriously missing the boat. (that goes double for unit testing, you slackers!)

Expect failure

Enterprise developers (which is most of my background) spend a lot of time and effort designing systems so that they “never” go down. If we lose a piece of hardware, or a disk drive, or a network link, then that’s a big deal and often we’ll spend many hours in meetings explaining the root cause and what steps we’re taking to make sure it never happens again. Enterprise developers could learn a lot from the huge-scale web developers who dominated this conference: just give up! Design systems to expect failure and just retry when it happens. As systems reach a certain level of complexity, it becomes literally impossible to guarantee that all their individual components are in place and healthy at the same time, all the time. Instead we have to design systems to assume failure and survive anyway. This is the true meaning of “fault tolerant” — an often abused phrase sometimes falsely interpreted to mean “never breaks.”

Eventual consistency

To listen to the presenters at this conference, you’d think ACID is dead and buried. At massive web scale, you instead have to design for ACID’s naughty younger brother: Eventual Consistency. This means that different actors using different parts of a system might get different views of the same piece of data. Eventual Consistency’s answer to that problem?: “Get over it, they’ll be the same soon enough.” As with the Expect Failure concept, I suspect this might be a hard (but necessary) pill for enterprise developers to swallow.

November 20th, 2009

LinkedIn: Network Updates Uncovered with Ruslan Belkin and Sean Dawson — QConSF 2009 Impressions

by Tim Cull

I’m at QConSF all this week, so you’ll get to hear my impressions of every session I go to. Lucky you!

LinkedIn is a 90% Java shop with lots of memcached for caching and ActiveMQ for messaging. They said they started the traditional way with big relational databases and n-tier architectures, but quickly ran into the scale wall. To give you and idea what they’re talking about, they do 35 million updates per week and 20 million services calls per day.

Once they hit real scale, they found they had to change the way they approached updates in a user’s social network from an “inbox” like approach to more of an “activity area” type approach.

In the old inbox type approach, each time a user does something (say, update status), the system writes a notification to each of that user’s connections describing the update. This means for every one update, the system has to do N reads, where N is the number of user connections. That’s the bad part. The good part is that when that user’s connections go to their home page, only one quick read has to be done to see everything in that connection’s network.

The activity area approach turns this on its head. Instead of every update they write into an “activity area”. Then as that user’s connections log in, their home page does up to N reads to fetch updates from the social network. I say “up to” because they have a very clever filter and summary bit in front that narrows down the list of social connections to a subset the user is likely care about.

Then they went on to describe some of the infrastructure they use. Interestingly, as updates come in they are stored in two places:
level 1 storage: temporal, rolling store on Oracle containing CLOB data with varchar keys
level 2 storage: tenured data on Voldemort containing key-value pairs

And lastly some random tidbits that don’t fit anywhere else:

  • They use Zenoss for monitoring (as do lots of presenters here)
  • Even at this scale they still use xml and are happy about it
  • Information given to service is sometimes unresolved (ie. member id instead of first/last name) and
    gets resolved by a service in batch
  • They’ve optimized comment streams by duplicating the first and last comments in their update summary and the full comment thread in Tier 2 storage
November 20th, 2009

Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop with Namit Jain and Ashish Thusoo — QConSF 2009 Impressions

by Tim Cull

I’m at QConSF all this week, so you’ll get to hear my impressions of every session I go to. Lucky you!

Facebook handles 200GB/day worth of updates coming in and 12+TB per day if you include derived data. That’s a lot of data and has no hope of fitting in a traditional data warehouse like Oracle. Consequently, they use Hadoop for both data storage and data processing, as do many organizations that work at that kind of scale. But once they started doing that, they ran into the problem that it’s very difficult, especially for analysts, to conduct ad-hoc queries over the data.

So, the Facebook team created HIVE as a SQL-like layer over Hadoop to allow for ad-hoc analysis. HIVE is an open-source sub project of Hadoop. They spent most of the talk describing HIVE and some of the clever ways they use Hadoop and map-reduce to execute SQL-like queries in parallel.

To give you an idea of the kind of load they’re putting through the system, they said they have a production Hadoop cluster with 5800 cores, 8.7 PB (/3 for replication) of data. Over this cluster they run ~7500 HIVE jobs per day. Wow. That’s not just massive scale, that’s mind-blowing scale.

November 18th, 2009

Caching at Scale, Architecture Reviews, and Hadoop — QConSF 2009 Impressions

by Tim Cull

I’m at QConSF all this week, so you’ll get to hear my impressions of every session I go to. Lucky you!

Today is the start of the shorter sessions, so you get a three-for-one deal.

1) Caching at Scale with Alex Miller
Miller works for Terracotta and so most of what he concentrated on was EHCache and Terracotta. Much of the session had to do with configuring and using each of those tools, but I did get a couple of good reminders about what’s good to cache and what isn’t. Specifically, before caching something, make sure it has good “locality” (i.e. the same piece of data tends to be asked for in clumpy bursts of time) and a good distribution (i.e. the majority of people ask of a small subset of the total data universe).

2) Lessons Learned from Architecture Reviews with Rebecca Wirfs-Brock
Wirfs-Brock opened with two slides showing two different ideas of “collaborative.” In one, all the stakeholders and reviewers of an architecture gather together in harmony and all are shooting for the same goal for the common good. In the other, they only collaborate in the sense that the conquered collaborate with their occupying army. It’s important to know which kind of situation you’re in before picking your toolset to deal with it. I was a little shaken when she showed a slide of my boss’ book and said it was an example of a toolset to use in the occupying army kind of collaboration. What does that say about my day job?!

My best takeaway from the talk is that it’s useful to clearly organize your architectural feedback into buckets:
1) Recommendations — we really think you need to do these and not doing them would be a mistake
2) Suggestions — if you do these I predict they will make you happy, but you won’t miss them if you don’t do them
3) Observations — a place to put statements about perceived problems that aren’t really problems, or point out good choices that should be kept

3) Hadoop with Philip Ziegler
Hadoop is a system for running massive calculations over massive amounts of data. Ziegler took us through an overview of it that was good and engaging, but really not much different from what you can get reading the web site.

November 18th, 2009

Domain Specific Languages with Ola Bini and Martin Fowler — QConSF 2009 Impressions

by Tim Cull

I’m at QConSF all this week, so you’ll get to hear my impressions of every session I go to. Lucky you!

In college, I never took the “compilers” class that most of my other classmates took. For the first five years of my career I felt smug and superior for not having wasted the time, but I’ve spent every year since then regretting the decision.

Domain Specific Languages (or DSLs) are not a new concept (how long has “make” been around?). But they have been catching a lot of renewed attention lately. We are seeing a convergence of enabling technologies that make creating them easier: dynamic languages like ruby and python and easy-to-use parsers like antlr. This ease is important because, by definition, a DSL has to be specific to a domain and therefore you can’t spread the cost of its creation over very many projects.

Nowadays I thoroughly regret not having taken the compilers class, but Bini and Fowler’s day-long session yesterday helped fill some of the gap. The session was very technical and (frankly) very dry. They talked a great deal about parsers, parse trees and symantic models. But the density was warranted because you can’t really understand the full power of a DSL until you know those concepts. Fowler and Bini gave us just enough background not to hang ourselves with our shiny new ropes.

One point they hammered again and again was that you need to keep your syntax, your semantic model, and your execution separate. For example, if you were to write a new kind of Spring configuration file format in JSON, then JSON would be your syntax, the Spring BeanDefinition interface would be your semantic model, and the Spring GenericApplicationContext might be your executing code. Many implementations of DSLs might be tempted to leap directly from parsing the input to calling code on the fly but according to the presenters that usually leads to heartache as your DSL becomes more complex.

They also went into detail about the difference between an External DSL (something you have to write a parser for, like Ant) and an Internal DSL (basically helper functions on top of an existing language, like Rake).

November 18th, 2009

QCon 2009 Impressions: Java Performance with Kirk Pepperdine

by Tim Cull

I’m at QConSF all this week, so you’ll get to hear my impressions of every session I go to. Lucky you!

My first all-day tutorial at QCon 2009 was “Java Performance Tuning” with Kirk Pepperdine. He spent much of the day encouraging us to spend time classifying the nature of the performance problem before trying to actually solve it.

Specifically, he asserted that you can diagnose much of an application’s problem without knowing anything about its source (or even ever having seen its source). His methodology basically boils down to this: First, use monitoring tools to classify your problem:
1) Is it user cpu bound, io bound, system cpu bound or memory bound?
2) Is it Application layer, JVM layer, or OS layer?

Then, and only then, use a profiler plus your knowledge of the source to pin the problem down and solve it.

Basically, most of the morning was about codifying what’s really common sense (but what isn’t at the end of the day?). In particular, you should measure and monitor, then make hypotheses based on monitoring, then use profiling to pinpoint the problem and support your hypotheses. Fix the problem and re-sample your monitoring. Repeat until the user is happy.

I’ve seen this many times on projects that have performance problems. Each developer has his own pet theory based on a hunch and will not let go. If more projects started with measurements and testable hypotheses, they’d be in a much better place.

Pepperdine also introduced several tools, the best of which was one I’ve been searching for but couldn’t find: IronEye SQL. He confirmed what I’d suspected–that the project was dead. But I was happy to discover that he personally had the source and was trying to revive it! Yay!

The best technical tidbit I got out of the session was some insight about Collections classes that have a medium-term life. It’s possible to have a Collections class (say a LinkedList) live just long enough for some of its plumbing to make it into the “old” generation. If that happens, then even if that LinkedList goes out of scope and is eligible for garbage collection, it won’t actually get collected until the next FullGC. Partial GCs only get the “young” generations. That’s not necessarily the end of the world until you consider that many of the objects contained in the LinkedList might have been very short-lived and might be in the young generation heap. But they can’t be garbage collected because the LinkedList still refers to them. These are known as ‘zombie’ objects–objects that aren’t referred to any more, but never-the-less won’t get collected in a Partial GC.