Asp Forum - Horizontal scaling - advice needed

Dennis S. Brown

3/7/2007 5:36:00 PM

Hi folks,

I posted this in railsweenie, but it's not really a Rails question, and the
post isn't getting any answers.

I'm not an experienced web developer. I'm still rather groping around. I
come from a client server background. I've plumped for Rails, although this
post's focus isn't really Rails.

I'm putting together a relatively simple site which I want to design from
the ground up for horizontal scalability, partly for the challenge, partly
because I need to learn and get experience. To help me do this I am going to
run at least two virtual machines to enforce the correct environment.

Currently my idea is to federate the data so that the users are divided
between the 2 or more machines, perhaps splitting alphabetically by user
name (i.e. A-G to machine 1, etc.). Where there is interaction between account
holders I am thinking of using DRb. Obviously Rails is not going to be able to
do the interaction side of things, but I am fine with that; I'm prepared for
a bit of manual labour.
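
To make the alphabetical split concrete, here is a minimal Ruby sketch of the lookup it implies. The machine names, the three letter ranges and the fallback for names that do not start with a letter are all assumptions made for illustration; the post only says "A-G to machine 1".

```ruby
# Hypothetical shard table: first-letter ranges mapped to machine names.
SHARDS = {
  ('a'..'g') => 'machine1',
  ('h'..'p') => 'machine2',
  ('q'..'z') => 'machine3'
}.freeze

# Assumed fallback for names that start with a digit or symbol.
DEFAULT_SHARD = 'machine1'

# Return the machine that would hold the account for +username+.
def shard_for(username)
  first = username.to_s.strip.downcase[0]
  SHARDS.each do |range, machine|
    return machine if first && range.cover?(first)
  end
  DEFAULT_SHARD
end

puts shard_for('Greg')   # machine1
puts shard_for('robert') # machine3
```

A fixed letter split like this is easy to reason about, but user names are rarely spread evenly over the alphabet, so the machines may end up with quite different loads.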

I would love comments/advice on my above ideas, and further insights into
horizontal scaling.

But also....

To facilitate the above I need some kind of proxy in front of the two
machines directing incoming requests to the correct machine based on the
login name which will be part of the url. Here I come unstuck since I have
no idea how to do this.

There must be proxies of this kind, but I'll be blowed if I know what an
appropriate one would be, or where to start in making it do what I want.

Can anyone give me a few pointers? Is squid the thing? Mongrel (I don't
really know what mongrel is)? Can apache be made to do this, and if so is it
a bad idea? Obviously it needs to be pluggable since I'll be using my own
code (C or Pascal) to do the lookups for the redirection.

Thanks for your words of wisdom,

Greg

19 Answers

Brian Adkins

3/7/2007 5:59:00 PM

Greg Loriman wrote:
> Hi folks,
>
> I posted this in railsweenie, but it's not really a rails question, and the
> post isn't getting any answers.

Maybe try: http://www.ruby-forum.c...

> I would love comments/advice on my above ideas, and further insights into
> horizontal scaling.
> [...]
> Can anyone give me a few pointers? Is squid the thing? Mongrel (I don't
> really know what mongrel is)? Can apache be made to do this, and if so is it
> a bad idea? Obviously it needs to be pluggable since I'll be using my own
> code (C or Pascal) to do the lookups for the redirection.

One approach that works quite well is to use Apache 2.2 with
mod_proxy_balancer in front of a cluster of Mongrel processes while
keeping session state in the database. That should scale horizontally
quite nicely until the database becomes the bottleneck - by then, I
expect you'd have a pretty respectable amount of volume. I have no idea
how much traffic it would take to swamp MySQL on a beefed up server with
super fast disks, but my guess is it's enough to allow you to pay
someone to cluster MySQL :)

http://mongrel.ruby...

http://blog.codahale.com/2006/06/19/time-for-a-grown-up-server-rails-mongrel-apache-capistra...

http://www.simplisticcomplexity.com/2006/8/13/apache-2-2-mod_proxy_balancer-mongrel-on-u...

I've been down the road of traditional clustering with appservers
sending messages to each other furiously duplicating state, and also
with session affinity to avoid some of the duplication, but presently I
feel the "shared nothing" architecture is best for me. It has some
successful precedents with high-volume sites, and you basically get it
for free with Ruby/Rails/Apache/Mongrel.

> Thanks for your words of wisdom,
>
> Greg
>
>

Robert Klemme

3/7/2007 5:59:00 PM

On 07.03.2007 18:35, Greg Loriman wrote:
> Hi folks,
>
> I posted this in railsweenie, but it's not really a rails question, and the
> post isn't getting any answers.
>
> I'm not an experienced web developer. I'm still rather groping around. I
> come from a client server background. I've plumped for Rails, although this
> post's focus isn't really Rails.
>
> I'm putting together a relatively simple site which I want to design from
> the ground up for horizontal scalability, partly for the challenge, partly
> because I need to learn and get experience. To help me do this I am going to
> run at least two virtual machines to enforce the correct environment.
>
> Currently my idea is to federate the data so that the users are divided
> between the 2 or more machines, perhaps splitting alphabetically by user
> name (ie. A-G to machine 1, etc). Where there is interaction between account
> holders I am thinking of Drb'ing. Obviously rails is not going to be able to
> do the interaction side of things, but I am fine with that; I'm prepared for
> a bit of manual labour.

I am not sure whether I get your interaction correctly. What types of
things would have to be dealt with between users that are not done
through persistent state?

> I would love comments/advice on my above ideas, and further insights into
> horizontal scaling.
>
> But also....
>
> To facilitate the above I need some kind of proxy in front of the two
> machines directing incoming requests to the correct machine based on the
> login name which will be part of the url. Here I come unstuck since I have
> no idea how to do this.
>
> There must be proxies of this kind, but I'll be blowed if I know what an
> appropriate one would be, or where to start in making it do what I want.
>
> Can anyone give me a few pointers? Is squid the thing? Mongrel (I don't
> really know what mongrel is)? Can apache be made to do this, and if so is it
> a bad idea? Obviously it needs to be pluggable since I'll be using my own
> code (C or Pascal) to do the lookups for the redirection.
>
> Thanks for your words of wisdom,

Just a quickie as I'm on my way out: your Drbing will certainly hurt
horizontal scalability - apart from the issue of finding instances etc.
If possible, you should build your app in a way that it does not need
this. Ideally you create it so that each HTTP request can be satisfied
by communicating with the backend store (database) only.

It's probably also ok to assume some session stickiness as load
balancing routers can do that (for example based on IP) and this seems a
fairly common scenario. If not, you need some mechanism to make session
information available to all app servers (either via the backend store
or via some other mechanism).
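
As a rough illustration of IP-based stickiness, here is a small Ruby sketch. It is an assumption about how such a balancer might behave, not a description of any particular router: the first request from an IP is assigned a backend round-robin, and later requests from that IP reuse the assignment. The class name and backend addresses are made up.

```ruby
# Sticky routing sketch: remember which backend each client IP was given.
class StickyBalancer
  def initialize(backends)
    @backends = backends
    @assigned = {} # client IP => backend
    @next = 0      # round-robin counter for new clients
  end

  # Return the backend for +client_ip+, assigning one on first sight.
  def backend_for(client_ip)
    @assigned[client_ip] ||= begin
      backend = @backends[@next % @backends.size]
      @next += 1
      backend
    end
  end
end

lb = StickyBalancer.new(%w[app1:8000 app2:8000])
puts lb.backend_for('10.0.0.1') # app1:8000
puts lb.backend_for('10.0.0.2') # app2:8000
puts lb.backend_for('10.0.0.1') # app1:8000 again
```

With stickiness like this, session data can stay in the memory of one app server, at the price of losing those sessions if that server goes away.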

Kind regards

robert

khaines

3/7/2007 6:09:00 PM

Greg Lorriman

3/7/2007 8:04:00 PM

>
> I am not sure whether I get your interaction correctly. What types of
> things would have to be dealt with between users that are not done
> through persistent state?

By interaction I just mean recording user relationships, like when one
user records another user as a friend on slashdot. There'll be a
'friend' table with two foreign keys.

In my naive idea of things each machine would be equivalent to the
others. They would each have Webserver->Rails->Database. Interactions
between users means recording a new relationship between two users,
normally quite straightforward on a single database, but requiring
inter-machine communication and a certain amount of fiddling about
where the user accounts are on different machines/database-backends.
Two-phase commit comes to mind, but I intend to work around the lack
of that.

In other words I'm shifting the scalability problem to the network
(routing, switches etc). I may also address the networking problem, in
the distant future, by migrating accounts based on usage patterns so
that user "interactions" will tend to be local to one machine.
Obviously that strategy has some interesting problems to solve, as
you can probably guess, and I am looking forward to them.

> > Thanks for your words of wisdom,
>
> Just a quickie as I'm on my way out: your Drbing will certainly hurt
> horizontal scalability - apart from the issue of finding instances etc.

Do you think that my answer above addresses that?

> It's probably also ok to assume some session stickiness as load
> balancing routers can do that (for example based on IP) and this seems a
> fairly common scenario. If not, you need some mechanism to make session
> information available to all app servers (either via the backend store
> or via some other mechanism).

Definitely. I have had this in mind. Perhaps ultimately I would end up
with two (or more) domain (as in data) databases, one session
database, one proxy-redirection database, and the proxy-redirector
itself. And one day each running on their own machines. Right now I am
imagining several vmware instances to allow for development.

Greg Lorriman

3/7/2007 8:16:00 PM

> Drb would rapidly become a bottleneck there.

I've not seen any articles on the characteristics and limitations of
Drb but it isn't immediately apparent to me that it should be a
bottleneck. Why do you think it would be a problem?

I can only see that in the longer term there might be a network
traffic problem if I had a lot of machines. I haven't worked it out
yet but perhaps the growth in inter-machine traffic would not be
linear.

> And explicitly federating
> your data like that seems needlessly complicated.

I would agree except that the application is simple while at the same
time such federation is the ultimate in horizontal scaling AFAIK, and
that is the exercise I would like to embark on. I want to find out the
ins and outs, costs and difficulties.

> Put a fast proxy of some sort in front of your backend processes. When
> you need more throughput, add another machine and some more processes.

Eventually the database will be the bottleneck, hence federation.

> > There must be proxies of this kind, but I'll be blowed if I know what an
> > appropriate one would be, or where to start in making it do what I want.
>
> If I were doing something that required some very specific proxying
> behavior that I couldn't get with an off-the-shelf solution (HAProxy is a
> very nice general purpose proxy with a ton of features), I'd write a
> purpose built proxy. It's not really too hard to do.

I'll keep that in mind. I hope you are right. But I've always had the
idea that TCP/IP is a somewhat painful affair, and I suspect one would
have to drop to that level to get adequate efficiency/speed to act as
a router/redirector of incoming requests.

thanks for the words.

Greg Lorriman

3/7/2007 8:24:00 PM

> I've been down the road of traditional clustering with appservers
> sending messages to each other furiously duplicating state, and also
> with session affinity to avoid some of the duplication, but presently I
> feel the "shared nothing" architecture is best for me. It has some
> successful precedents with high-volume sites, and you basically get it
> for free with Ruby/Rails/Apache/Mongrel.

I would agree in general, but as you mentioned earlier eventually
something is going to become a bottleneck if user growth is enormous.
I expressly wish to explore the clustered appserver side of things
since it does ultimately scale endlessly. This particular app is
simple enough for me to consider trying my hand at it.

So it is interesting to read your comments about "furious duplication"
and session affinity. Were there other notable features to such a
system, and to the development of it?

Greg

khaines

3/7/2007 8:42:00 PM

Brian Adkins

3/7/2007 9:02:00 PM

Greg Lorriman wrote:
>> I've been down the road of traditional clustering with appservers
>> sending messages to each other furiously duplicating state, and also
>> with session affinity to avoid some of the duplication, but presently I
>> feel the "shared nothing" architecture is best for me. It has some
>> successful precedents with high-volume sites, and you basically get it
>> for free with Ruby/Rails/Apache/Mongrel.
>
> I would agree in general, but as you mentioned earlier eventually
> something is going to become a bottleneck if user growth is enormous.
> I expressly wish to explore the clustered appserver side of things
> since it does ultimately scale endlessly. This particular app is
> simple enough for me to consider trying my hand at it.
>
> So it is interesting to read your comments about "furious duplication"
> and session affinity. Were there other notable features to such a
> system, and to the development of it?

Probably the most "notable" feature was that I really don't want to do it
again :) I don't know of any highly scalable architectures that use that
type of approach - they could very well exist, I just don't know of
them. On the other hand, I know of several using the "share nothing"
approach.

It might be enlightening for you to perform some tests to determine what
type of load is required to cause the database to become the limiting
factor in your growth. It may be premature for you to worry about it. It
might be cheaper for you to simply purchase a faster database server.

This should do:
http://www.sun.com/servers/highend/sunfire_e25k...

Or even something more attainable:
http://www.dell.com/content/products/productdetails.aspx/pedge_6800?c=us&l=en&s=bsd...

If we ignore the fact that you most likely can't create a workload large
enough to swamp a fast database server (and if you can, you'll have no
trouble hiring some brilliant performance freaks to handle the mundane
scaling task while you do the fun stuff), you may want to consider other
ways of reducing the limiting factor of the database besides
partitioning your data - maybe one database instance for insert/updates
with replication to several read/only database instances for example.
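
A minimal Ruby sketch of that read/write split, under stated assumptions: the connection names are plain strings standing in for real database handles, only statements starting with SELECT are treated as reads, and the class name is invented for this example.

```ruby
# Send writes to one primary database and spread reads over replicas.
class ReplicatedRouter
  def initialize(primary, replicas)
    @primary = primary
    @replicas = replicas
    @turn = 0 # round-robin counter for reads
  end

  # Pick a connection for +sql+. Anything that is not a SELECT goes to the
  # primary, which is the safe default for statements we do not recognise.
  def connection_for(sql)
    verb = sql.strip.split(/\s+/, 2).first.to_s.upcase
    return @primary unless verb == 'SELECT' && !@replicas.empty?

    replica = @replicas[@turn % @replicas.size]
    @turn += 1
    replica
  end
end

router = ReplicatedRouter.new('db-master', %w[db-read1 db-read2])
puts router.connection_for('SELECT * FROM users')           # db-read1
puts router.connection_for('SELECT * FROM friends')         # db-read2
puts router.connection_for("INSERT INTO users VALUES (1)")  # db-master
```

One thing the sketch ignores: with asynchronous replication a read sent to a replica right after a write may not see that write yet, so some reads may still need to go to the primary.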

Google for livejournal architecture - there is some interesting info
regarding how they scaled.

>
> Greg
>

Brian Candler

3/7/2007 9:24:00 PM

On Thu, Mar 08, 2007 at 02:40:07AM +0900, Greg Loriman wrote:
> I'm putting together a relatively simple site which I want to design from
> the ground up for horizontal scalability, partly for the challenge, partly
> because I need to learn and get experience. To help me do this I am going to
> run at least two virtual machines to enforce the correct environment.
>
> Currently my idea is to federate the data so that the users are divided
> between the 2 or more machines, perhaps splitting alphabetically by user
> name (ie. A-G to machine 1, etc). Where there is interaction between account
> holders I am thinking of Drb'ing. Obviously rails is not going to be able to
> do the interaction side of things, but I am fine with that; I'm prepared for
> a bit of manual labour.
>
> I would love comments/advice on my above ideas, and further insights into
> horizontal scaling.

Well: my advice is that the sort of "loose federation" you describe is
something which is very difficult to build. You can make it work for, say, a
proxy-based POP3/IMAP mail cluster: here the protocol is straightforward,
the session can be unambiguously proxied to the right backend server, and
there is no interaction between accounts. (When you start using IMAP shared
folders, this breaks down)

However even in such a scenario, you don't have resilience. If you lose the
machine where the A to G accounts are stored, then all those users lose
their mail. So in fact each backend machine has to be a mini-cluster, or at
least, have mirrored disks and a warm spare machine to plug them into when
disaster strikes.

Many people have resilience as high, or higher, on their agenda than
performance. So this doesn't sound like a good way to go.

My advice would be:

1. Keep your database in one place, so that all the front-ends have shared
access to the same data at all times.

To start with have a single database machine. Then expand this to a
2-machine database cluster. You can then point 2, 3, 4 or more front-ends at
this cluster; for many applications you may find that you won't need to
scale the database until later.

(Note that regardless of whether your application ends up heavier on
front-end CPU or back-end database resources, scaling the frontends and the
database cluster separately makes it much easier to monitor resource
utilisation and scale each part as necessary)

The easiest way to do database clustering is with a master-slave
arrangement: do all your updates on the master, and let these replicate out
to the slaves, where read-only queries take place. Of course, this isn't
good enough for all applications, but for others it's fine.

Full database clustering is challenging, but if your site is making you lots
of money you can always throw an Oracle 10g grid at it. If you're seriously
thinking of that route, you can start with Oracle on day one; it is now free
for a single processor with up to 1GB of RAM and 4GB of table space.

2. For transient session state, assuming your session objects aren't
enormous, use DRb to start with. Point all your front-ends at the same DRb
server. DRb is remarkably fast for what it does, since all the marshalling
is done in C.

When you outgrow that, go to memcached instead. This is actually not hard to
set up: you just run a memcached process on each server. The session data is
automatically distributed between the nodes.

Neither approach is totally bombproof: if you lose a node, you'll lose some
session data. Either put important session data in the database, or build a
bombproof memcached server [boots from flash, no hard drive, fanless]

If that's not important, then you don't need a separate memcached server. If
you have N webapp frontends, then just run memcached on each of them.
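
The way session data gets spread over several memcached processes can be sketched in Ruby as follows. This is a simplified illustration, not how any particular memcached client works: plain Hash objects stand in for the memcached processes, the node names are invented, and the CRC32-modulo scheme is an assumption (real clients use their own hashing, often consistent hashing).

```ruby
require 'zlib'

# Client-side key distribution: hash the session key, pick a node by modulo.
class SessionCluster
  def initialize(node_names)
    # One in-memory Hash per node, standing in for a memcached process.
    @nodes = node_names.each_with_object({}) { |name, h| h[name] = {} }
  end

  # Name of the node responsible for +key+ given the nodes currently alive.
  def node_for(key)
    names = @nodes.keys
    names[Zlib.crc32(key) % names.size]
  end

  def set(key, value)
    @nodes[node_for(key)][key] = value
  end

  def get(key)
    @nodes[node_for(key)][key]
  end

  # Simulate losing a node: everything stored on it is gone.
  def drop_node(name)
    @nodes.delete(name)
  end
end

cluster = SessionCluster.new(%w[web1 web2 web3])
cluster.set('session:abc123', { user: 'greg' })
home = cluster.node_for('session:abc123')
puts "stored on #{home}"
cluster.drop_node(home)
p cluster.get('session:abc123') # nil: the session went down with its node
```

With naive modulo hashing, dropping a node also changes which node most of the surviving keys map to, which is one reason to keep anything important in the database instead.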

> To facilitate the above I need some kind of proxy in front of the two
> machines directing incoming requests to the correct machine based on the
> login name which will be part of the url. Here I come unstuck since I have
> no idea how to do this.

Well, the traditional approach is to buy a web loadbalancing appliance (or
resilient pair of them), and configure it for "sticky" load balancing based
on a session cookie or some other attribute in the URL.

Hardware appliances are generally good. They are reliable over time; there's
much less to go wrong than a PC. They do a single job well.

You could instead decide to use a recent version of Apache with mod_proxy to
do the proxying for you.

But it may be better to design your app with a single shared database and a
single shared session store, such that it actually doesn't matter where each
request arrives.

> Can anyone give me a few pointers? Is squid the thing? Mongrel (I don't
> really know what mongrel is)? Can apache be made to do this, and if so is it
> a bad idea? Obviously it needs to be pluggable since I'll be using my own
> code (C or Pascal) to do the lookups for the redirection.

mod_proxy with mod_rewrite is "pluggable" in the way you describe. See
http://httpd.apache.org/docs/2.2/misc/rewrite...
and skip to the section headed "Proxy Throughput Round-Robin".

You'd use an External Rewriting Program (map type prg) to choose which
back-end server to redirect to. The example above is written in Perl, but
the same is equally possible in Ruby, C, Pascal or whatever.
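
For a feel of what such a program looks like in Ruby, here is a hedged sketch. It follows the general contract of a `prg` rewrite map (one lookup key per line on stdin, one answer per line on stdout, the string NULL when there is no match, and no output buffering); the letter ranges, hostnames and function names are made up for the example, and the Apache directives that would call it are not shown.

```ruby
require 'stringio'

# Hypothetical lookup table; a real one might query a database instead.
BACKENDS = {
  ('a'..'m') => 'backend1.example.com:8000',
  ('n'..'z') => 'backend2.example.com:8000'
}.freeze

# Map a login name to a backend, or to "NULL" when nothing matches.
def lookup(login)
  first = login.strip.downcase[0]
  pair = BACKENDS.find { |range, _| first && range.cover?(first) }
  pair ? pair.last : 'NULL'
end

# Answer one key per input line until the input is closed.
def serve(input, output)
  output.sync = true # answers must not sit in a buffer
  input.each_line { |line| output.puts lookup(line) }
end

# Demo with in-memory streams; under Apache the call would be
# serve($stdin, $stdout) instead.
out = StringIO.new
serve(StringIO.new("greg\nrobert\n42\n"), out)
print out.string
# backend1.example.com:8000
# backend2.example.com:8000
# NULL
```

Because the program is a long-running process that answers every lookup, it has to be fast and must never block or crash, otherwise every proxied request stalls with it.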

However, if you don't know anything about Apache, this is certainly not
where I'd recommend you start.

squid is a proxy cache. You can use it to accelerate static content, but it
won't help you much with dynamic pages from Rails. Mongrel is a webserver
written in Ruby, much as webrick is, although it is apparently more efficient.

In summary I'd say start your design with the KISS principle:

- one database; scale it horizontally (by database clustering) when needed

- one global session store; scale it horizontally when needed

- one frontend application server; scale horizontally when needed

In addition to that, consider:

- serve your static HTML, images and CSS from a fast webserver
(e.g. apache, lighttpd). This is easy to arrange.

- consider serving your Rails application from the same webserver using
fastcgi (e.g. Apache mod_fcgid), rather than a Ruby webserver like
mongrel or webrick. Harder to set up, but you can migrate to this later.
Then most HTTP protocol handling is being done in C.

- profile your application carefully to find out where the bottlenecks are,
before you throw hardware at performance problems.

HTH,

Brian.

Gary Wright

3/7/2007 9:27:00 PM

On Mar 7, 2007, at 3:41 PM, khaines@enigo.com wrote:
>> Eventually the database will be the bottleneck, hence federation.
>
> Again, though, you'll have to scale very, very large before you
> ever have to worry about the db being a bottleneck, provided you
> pay attention to schema design and proper db tuning.

I think that long before the database becomes a performance
bottleneck you'll be worried about it being a single point of failure.

Redundancy and availability issues are tricky even for low-volume
applications.

Gary Wright

comp.lang.ruby
