Wednesday, March 12, 2008

Threads in a process and servers in a cluster: same issues, different context

I was just re-reading my last two posts, and it triggered an epiphany: language architectures like Scala and Erlang and server architectures like H-Store, GigaSpaces, memcached (and others) share a common model: each unit runs independently, and collaboration between units is accomplished through asynchronous messaging. This messaging can be thought of as sharing data (e.g. asynchronous replication) or as sending messages that contain data, but at a certain level the result is the same.
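To make the model concrete, here is a minimal sketch in Java (names and numbers are mine, purely illustrative): two units that never touch each other's state and collaborate only through a message channel, the way an Erlang process mailbox or a cluster node's message queue works.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch: two independent units, each owning its own state,
// collaborating only through asynchronous messages. The mailbox stands
// in for an Erlang process mailbox or a cluster node's message channel.
public class MessagePassingSketch {

    // The only thing the units share is a channel, not state.
    static final BlockingQueue<Integer> mailbox = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Consumer unit: owns the running total; nobody else touches it.
        Thread consumer = new Thread(() -> {
            int total = 0; // private state, never shared
            try {
                for (int i = 0; i < 3; i++) {
                    total += mailbox.take(); // blocks until a message arrives
                }
                System.out.println("total = " + total);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer unit: sends messages instead of mutating shared state.
        for (int amount : new int[] {10, 20, 30}) {
            mailbox.put(amount);
        }
        consumer.join();
    }
}
```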

Developers struggling with threading, consistency and locking issues in a single process are dealing with the same issues that system architects face when figuring out how to scale out a system where data must be shared across processing units. Concurrency, consistency and performance - same issues, different context.
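Here's the in-process face of that consistency problem, as a small sketch: two threads racing on one counter. Remove the lock and the total drifts, exactly as two cluster nodes racing on a shared record drift without replication control.

```java
// Two threads sharing one counter: the single-process analogue of two
// nodes updating a shared record. Without coordination the increments
// race and updates are lost.
public class SharedCounterRace {
    static int counter = 0;                 // shared mutable state
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable increment = () -> {
            for (int i = 0; i < 100_000; i++) {
                synchronized (lock) {       // remove this and the total drifts
                    counter++;              // read-modify-write must be atomic
                }
            }
        };
        Thread a = new Thread(increment), b = new Thread(increment);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("counter = " + counter); // 200000 only with the lock
    }
}
```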

Just thought I'd share that little lightbulb moment.

7 comments:

Taylor said...

Agreed! That's why Terracotta supports distributed objects and synchronization across clustered VMs as though they were a single VM. With Terracotta, clustering is just threading - latency and concurrency issues between threads are no different a problem within a single VM than across VMs. While the latencies involved are different, the theory (along with the best practices and tuning skills) remains the same.
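To show what "clustering is just threading" means in code, here is a conceptual sketch (not actual Terracotta configuration; the class and field names are mine): the code is plain single-VM Java, and in a Terracotta deployment a field like this would be declared a clustered root in external configuration so the same locking discipline applies across VMs.

```java
import java.util.HashMap;
import java.util.Map;

// Plain single-VM Java: a shared map guarded by synchronized blocks.
// Under Terracotta, the map field could be configured (externally) as
// a clustered root, so the identical code would coordinate threads in
// different JVMs. No Terracotta API appears here.
public class SharedRegistry {
    // In a Terracotta deployment this map would be configured as a root.
    private static final Map<String, String> registry = new HashMap<>();

    public static void put(String key, String value) {
        synchronized (registry) {   // same lock, in-VM or clustered
            registry.put(key, value);
        }
    }

    public static String get(String key) {
        synchronized (registry) {
            return registry.get(key);
        }
    }
}
```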

The other technologies you discussed solve the problem a different way. They say: hi, how's it going? Forget everything you ever knew about stateful threaded programs; we have a new model, and if you adopt it 100% you will get clustering services. Terracotta is the opposite - we say stick with what you have (a stateful heap and the thread model), and we will worry about how to cluster it.

Same problems, different solutions.

Taylor Gautier/Product Manager Terracotta

Anonymous said...

Van Couvering wrote:
I was just re-reading my last two posts, and it triggered an epiphany: language architectures like Scala and Erlang and server architectures like H-Store, GigaSpaces, memcached (and others) share a common model: each unit runs independently, and collaboration between units is accomplished through asynchronous messaging. This messaging can be thought of as sharing data (e.g. asynchronous replication) or as sending messages that contain data, but at a certain level the result is the same.


Van,

Interesting epiphany. I believe you have a slight misunderstanding regarding the Space-Based Architecture (SBA) supported by GigaSpaces, though. The basic idea behind SBA is that scalability, performance, resiliency, and other non-functional requirements (NFRs) should be addressed in the context of an entire, end-to-end business use case, rather than at individual tiers. In this model, the business logic, data, and messaging capabilities are co-located in a single Processing Unit (to use GigaSpaces' terminology).

In this model, communication between components, whether business services or full processing units, is richer than simple message passing. The underlying space provides an arbitrarily large distributed shared memory that allows full collaboration, not just communication.
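To give a flavor of space-style collaboration, here is a toy write/take sketch (my own simplification, not GigaSpaces' actual API): producers write entries into a shared space and consumers take them, blocking until one is available. In a real SBA deployment the space is distributed and entries are matched by template.

```java
import java.util.LinkedList;
import java.util.Queue;

// Toy in-process "space" illustrating write/take collaboration.
// Real spaces are distributed, transactional, and match entries by
// template; this sketch keeps only the blocking rendezvous semantics.
public class ToySpace<T> {
    private final Queue<T> entries = new LinkedList<>();

    // Write an entry into the space and wake any waiting takers.
    public synchronized void write(T entry) {
        entries.add(entry);
        notifyAll();
    }

    // Take removes an entry, blocking until one is available.
    public synchronized T take() throws InterruptedException {
        while (entries.isEmpty()) {
            wait();
        }
        return entries.remove();
    }
}
```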

Further, when used as an SOA framework, GigaSpaces' Service Virtualization Framework allows services to be invoked either synchronously or asynchronously. In short, GigaSpaces is far more than a simple cache.

Your point about Erlang is very interesting. I've long thought that adding some capabilities from functional languages to an SBA would be incredibly powerful. GigaSpaces will support running dynamic languages within the compute grid in the release scheduled for early May, so we'll have the opportunity to try some of these ideas then.


Developers struggling with threading, consistency and locking issues in a single process are dealing with the same issues that system architects face when figuring out how to scale out a system where data must be shared across processing units. Concurrency, consistency and performance - same issues, different context.


I agree, to an extent. However, just as "quantity has a quality all its own", distributed systems have different constraints than single-process systems. Latency, scalability, and high availability are NFRs that immediately come to mind, and Peter Deutsch's Fallacies of Distributed Computing cover other issues of concern. GigaSpaces' SBA addresses these problems directly. Experience shows that trying to build a distributed system the same way one builds a standalone application is a recipe for disaster.

Regards,

Patrick May
GigaSpaces Technologies Inc.

Unknown said...

Hi, Patrick. Great comments, thanks.

I think what you are doing is refining definitions, but the overall principles are still in play: keeping as much as you can local, and sharing nothing, improves both concurrency and reliability. I think the SBA is a step ahead of the simpler caches, I agree, because everything is kept local and you can do the full processing for an operation in a single node, on a single tier.

But I think latency, scalability and HA do need to be addressed in a single standalone system, just using different mechanisms and at a different level. I didn't really think about this until I started building code inside NetBeans.

Latency: the response time of the UI when you press a button.
Availability: is the UI available even when the underlying program is doing some kind of heavy or slow processing like I/O or computation?
Scalability: the issues of scale in NetBeans are huge; things that work for a small project fall on their face for larger projects with millions of classes and lines of code.

I believe the principles applied in server/service architectures could also be applied within a single desktop system. Parallelism, caching, shared-nothing design, and message passing come to mind.
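For instance, here's one way the availability point shows up on the desktop (a sketch using standard Java SE SwingWorker; the slow task is a stand-in for real I/O or computation): move slow work off the UI thread and deliver the result back as a message, so the UI stays responsive.

```java
import javax.swing.SwingUtilities;
import javax.swing.SwingWorker;

// Keep the UI thread free ("available") by running slow work on a
// background thread and delivering the result back as a message.
public class ResponsiveTask {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            new SwingWorker<String, Void>() {
                @Override
                protected String doInBackground() throws Exception {
                    Thread.sleep(2000);          // simulated slow I/O
                    return "done";
                }

                @Override
                protected void done() {          // runs on the UI thread
                    try {
                        System.out.println("result: " + get());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }.execute();
        });
    }
}
```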

Anonymous said...

Van Couvering wrote:
But I think latency, scalability and HA do need to be addressed in a single standalone system, just using different mechanisms and at a different level. I didn't really think about this until I started building code inside NetBeans.

Latency: the response time of the UI when you press a button.
Availability: is the UI available even when the underlying program is doing some kind of heavy or slow processing like I/O or computation?
Scalability: the issues of scale in NetBeans are huge; things that work for a small project fall on their face for larger projects with millions of classes and lines of code.


Van,

No argument there. The NFRs are constant, even if the problems in addressing them change.

Regards,

Patrick

Taylor said...

"I think the SBA is a step ahead of the the simpler caches, I agree, because everything is kept local and you can do the full processing for an operation in a single node, on a single tier."

Hmm, yes, but this isn't taking the whole picture into account. The SBA approach relies on partitioning the spaces (no data flowing between spaces == linear scale, according to pundits), but it's really not any different from tiers.

The fundamental approach has not changed - both are ways to divide the layers into discrete, non-overlapping units. Tiers. Spaces. It's just terminology.

So were tiers simply wrong? No. The problem is that tiers were never meant to handle high-volume message processing, so unsurprisingly they *don't* do it well. The lines drawn for tiers were meant to divide an application across non-overlapping concerns, and for web apps they did that. For high-volume message processing they don't.

So is introducing a different split a good thing? Absolutely. But let's not get ahead of ourselves; SBA is not all things to all apps. As soon as an app needs a new way of cutting up its data, it's going to be tiers all over again.

Gurney said...

I don't think the analogy is so obvious. Local and distributed systems have a fundamental difference, and as mentioned before, latency is not the big problem.

The real difference is link reliability. Inside a single process, it is an exceptional case when one thread cannot communicate with another, and it is usually caused by hardware failure.

By contrast, in a distributed system (i.e. processes connected by a network), communication failure is a normal case. Moreover, there is no way to learn the state of the system after a failure. So distributed coordination protocols are much more complex to implement.

So using the same protocols for distributed and local systems may be inefficient. Synchronization protocols within a single process are much simpler to implement.
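Here is what that difference looks like in code (a sketch; the timeout and retry policy are illustrative choices, not a prescription): a local call either returns or throws, but a call across a network can also hang, or fail in a way that leaves the caller unsure whether it took effect, so the caller must bound the wait and live with the ambiguity.

```java
import java.util.concurrent.*;

// A remote call can hang or fail mid-flight, so the caller must bound
// the wait and pick a retry policy; after a failure, whether the call
// took effect on the far side is unknown. Assumes attempts >= 1.
public class RemoteCallSketch {
    static <T> T callWithRetry(Callable<T> remoteCall, int attempts) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int i = 0; i < attempts; i++) {
                Future<T> future = executor.submit(remoteCall);
                try {
                    return future.get(2, TimeUnit.SECONDS); // bound the wait
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // outcome on the far side is unknown
                    last = e;
                }
            }
            throw last; // give up; the caller must handle the ambiguity
        } finally {
            executor.shutdownNow();
        }
    }
}
```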

Taylor said...

@Gurney,

I share your concern that managing distributed systems is difficult and complex; however, I disagree that the right way to manage this complexity is to dump it into the developer's lap.

I urge you to take 5 minutes to read a recent blog post I wrote that shows inter-JVM coordination that is absolutely dead simple using Terracotta. From there you can draw your own conclusions about how complicated a distributed system has to be.

The blog is here: Stupid Simple JVM Coordination

The example is in line with the original blogger's epiphany - communicating across threads and communicating across nodes share many of the same problems. Wouldn't it be nice if you could use a framework that abstracted away all of those concerns and made distributed computing simple and fun? That is one of the main principles behind Terracotta.
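To give the flavor of the kind of coordination the linked post describes, here is a plain-Java sketch (nothing Terracotta-specific appears in it): one thread waits on a shared monitor until another signals it. The claim is that the same wait/notify code can coordinate threads in different JVMs once the monitor object is configured as clustered.

```java
// Plain Java wait/notify coordination between two threads. The claim
// under discussion is that this same code, with the monitor configured
// as a clustered object, would coordinate threads across JVMs.
public class SimpleCoordination {
    private static final Object signal = new Object();
    private static boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            synchronized (signal) {
                while (!ready) {
                    try {
                        signal.wait();   // park until another thread notifies
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                System.out.println("signal received");
            }
        });
        waiter.start();

        Thread.sleep(500);               // simulate work before signaling
        synchronized (signal) {
            ready = true;
            signal.notifyAll();          // wake the waiter
        }
        waiter.join();
    }
}
```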