Erlang/OTP Forums

Open Telecom Platform (OTP): Design patterns for distributed applications

lgas
Posted: Wed Dec 19, 2007 12:06 pm
Joined: 19 Dec 2007 Posts: 2
Hi all,

I'm a newbie just getting into Erlang and so far I love the language. It's been fun to learn and it looks like it has a lot of potential for being a great tool in a lot of situations...

That being said, I have been having a hard time finding information on design patterns for building highly concurrent applications.

It looks to me like OTP provides a lot of great stuff... supervision trees, behaviors, etc. all of which are great and seem to address various issues that are associated with highly concurrent applications -- but I haven't seen much that addresses the heart of the concurrency issue directly.

For example, two of the applications that I am interested in using Erlang for are basically highly customized webcrawlers.

One would be used for automatically mirroring (potentially very large) sites -- think "wget -r", and the other would be for following chains of redirected URLs. E.g. given URL A, you go to it and it redirects to B, B to C, and C to D. We need to map URL A->D, once again for potentially very large pools of URLs.
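The redirect-chain half of this can be sketched without any concurrency at all. The following is only an illustration under stated assumptions: the module name `redirects`, the function `resolve/2,3`, and the hop limit of 10 are all made up here, and a plain map of `Url => NextUrl` stands in for real HTTP requests.

```erlang
-module(redirects).
-export([resolve/2, resolve/3]).

%% Follow a chain of redirects until a URL with no further redirect is found.
%% Redirects is a map Url => NextUrl that stands in for real HTTP lookups
%% (an assumption made for this sketch).
resolve(Url, Redirects) ->
    resolve(Url, Redirects, 10).

%% HopsLeft bounds the walk so a redirect loop (A -> B -> A) cannot run forever.
resolve(Url, Redirects, HopsLeft) when is_integer(HopsLeft), HopsLeft >= 0 ->
    case maps:find(Url, Redirects) of
        error ->
            %% No redirect recorded for this URL: it is the final target.
            {ok, Url};
        {ok, _Next} when HopsLeft =:= 0 ->
            {error, too_many_redirects};
        {ok, Next} ->
            resolve(Next, Redirects, HopsLeft - 1)
    end.
```

With `#{"A" => "B", "B" => "C", "C" => "D"}` as the map, resolving "A" walks the chain to "D"; a map containing a cycle ends in `{error, too_many_redirects}` instead of looping.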

If I'm not mistaken, I could build a super OTP-ish application that incorporates all the various elements from the OTP manual -- supervision trees, gen_server, etc. -- but unless I specifically design it to be concurrent, I will have a highly available sequential crawler, which obviously wouldn't be very helpful (and would also be much easier to write in a language like Ruby).

Now obviously I could take this simple sequential crawler and make it concurrent by spawning a new process to crawl and parse each URL, and then recursively spawning more processes to handle each extracted URL... the problem is that the workload would increase exponentially, and even if Erlang can scale to hundreds of thousands of processes on a single box, if my target site or sites are big enough, I would eventually run out of resources.

Even if I were to break this up across nodes I could still run into a situation where I run out of resources... but for the moment let's keep things simple and assume that I only have one box available.

(As an aside, it seems intuitive to me that you would rapidly reach a point of diminishing returns where spawning more simultaneous processes would begin to slow things down. But whether this is even true or not, I still have my fundamental issue that I could eventually exceed the capacity of my box if I just naively spawned a new process for every extracted URL).

The solution in the languages I am more familiar with (Java, Ruby) would be something like a thread pool, where you create a fixed-size pool of threads and they pull URLs off a queue to process as fast as they can. Obviously I could implement something like this in Erlang relatively easily, but I feel like this sort of situation should come up all the time in highly concurrent applications, and it should either be handled by something like OTP or there should be other libraries out there for addressing these types of issues.
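For reference, a fixed-size pool along those lines might look roughly like the sketch below. This is a hand-rolled illustration, not something OTP ships: the module name `bounded_pool`, the function `map/3`, and the message shapes are all assumptions. A coordinator process owns the queue, and each worker asks for the next item when it is free.

```erlang
-module(bounded_pool).
-export([map/3]).

%% Apply Fun to every item using at most PoolSize worker processes.
%% Results come back in completion order, not input order.
%% Workers are linked, so a crash in Fun also takes down the caller.
map(Fun, Items, PoolSize)
  when is_function(Fun, 1), is_integer(PoolSize), PoolSize > 0 ->
    Coordinator = self(),
    Workers = [spawn_link(fun() -> worker(Coordinator, Fun) end)
               || _ <- lists:seq(1, PoolSize)],
    coordinate(Items, length(Workers), []).

%% A worker announces that it is free, handles one item, reports the
%% result, and then asks again until it is told to stop.
worker(Coordinator, Fun) ->
    Coordinator ! {ready, self()},
    receive
        {work, Item} ->
            Coordinator ! {result, Fun(Item)},
            worker(Coordinator, Fun);
        stop ->
            ok
    end.

%% The coordinator hands out queued items one at a time. Once the queue
%% is empty it stops each worker that asks for more, and it returns
%% when every worker has been stopped.
coordinate(_Queue, 0, Results) ->
    Results;
coordinate(Queue, Active, Results) ->
    receive
        {result, R} ->
            coordinate(Queue, Active, [R | Results]);
        {ready, Worker} ->
            case Queue of
                [Item | Rest] ->
                    Worker ! {work, Item},
                    coordinate(Rest, Active, Results);
                [] ->
                    Worker ! stop,
                    coordinate([], Active - 1, Results)
            end
    end.
```

A real crawler would also need workers to push newly extracted URLs back onto the queue, which this sketch leaves out to keep the shape of the pool visible.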

(As another aside, this rings even more true when you start talking about distributing applications like this across nodes, but again, we don't even need to complicate things by bringing that into the equation).

So... while I would happily welcome feedback on my specific crawler applications, what I'm really interested in is general guidance: Am I missing something big? Does OTP already take care of this kind of stuff for me and I just skipped that page in the documentation? Are there other (ideally open source) libraries out there for this kind of stuff? Is my thinking just stuck in an imperative or threaded world, and do I need to shift to functional or concurrent thinking? Are there good papers to read on this topic? Are there good open source projects (ejabberd?) that I should be looking at?

Thanks in advance.
lgas
Posted: Sat Dec 22, 2007 11:59 pm
Joined: 19 Dec 2007 Posts: 2
Nothing?

I found this today:

http://code.google.com/p/erlinda/

I haven't had any time to look into it yet, but I'm hoping it will provide some of what I am looking for.
thanos
Posted: Thu Jan 10, 2008 4:41 pm
Joined: 23 Nov 2007 Posts: 5 Location: new york
http://code.google.com/p/erlinda/ is still an empty project, but it presents some interesting, if Java-esque, ideas.


I used YAWS for a similar requirement. Inefficient, maybe, but really convenient and easy. For instance, in your case I would have a YAWS handler to process a URL.
It would spawn a processor depending on the type of page (a factory). That processor would "recursively" submit each extracted URL back to the YAWS server.

Be lazy and let YAWS handle all the issues.
You can even have a YAWS page to show you the current status and results of your crawl and another to control your whole app.

You could use ErlyWeb and do everything using its optional controller hooks. Also check out what the libtre project gets up to.

Time to market is what counts, and this will work and scale well. Worry about the whole caboodle with a well-crafted framework later.

And yes, you are right: there is no consensus on distributed development, and no single Rails-like stack for it.
Nikimathew
Posted: Thu Sep 25, 2008 11:23 am
Joined: 25 Sep 2008 Posts: 1
Many people refer to Erlang as “Erlang/OTP.” OTP stands for Open Telecom Platform, and is more or less a set of libraries which come packaged with Erlang. They consist of Erlang behaviors (or behaviours, technically) for writing servers, finite state machines, event managers. But that is not all, OTP also encompasses the application behavior which allows programmers to package their Erlang code into a single “application.” And the supervisor behavior allows programmers to create a hierarchy of processes, where if one process dies, it will be restarted.
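To make the supervisor part concrete, here is a minimal sketch. The module name `crawler_sup`, the child id `worker`, and the restart limits are assumptions chosen for illustration. The worker is a bare process rather than a proper gen_server, which is enough to show restarts but is not how a production child would normally be written.

```erlang
-module(crawler_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/0]).

%% Start the supervisor and register it locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% one_for_one: only the child that died is restarted.
%% Allow at most 5 restarts within 10 seconds before giving up.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% The supervisor expects {ok, Pid} from a start function that links
%% the new process to the caller.
start_worker() ->
    {ok, spawn_link(fun worker_loop/0)}.

%% Placeholder worker: it just waits for messages until it is killed.
worker_loop() ->
    receive
        _Any -> worker_loop()
    end.
```

Killing the worker with `exit(Pid, kill)` and then calling `supervisor:which_children(crawler_sup)` should show a new pid in its place.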

iWantToKeepAnon
Posted: Sat Sep 27, 2008 4:41 am
Joined: 31 Jul 2007 Posts: 14 Location: Dallas, TX
I'm pretty new to this, so this is just a newbie's 2 cents. Processes are cheap, right? So spawn a process for every URL you're tracking. You may end up with more processes than your resources can run at once, but don't worry about that.

Now, the URL "follower" routine requests the right to run from some "throttler" process. The throttler, actually let's call it a governor, keeps a list of processes wanting to run and a count of those already running. If the running count is less than the max, it replies "go" and increments the count.

If the governor is at its max, it stores the PID of the requester in a list and just tail-recursively loops.

When the governor sees that a resource-consuming child has exited, it decrements the count and loops. The next loop sees that there is a free slot, replies "go" to the first PID in the list, increments the count, loops without that PID, and so on.

I hope that makes sense, and that's probably how I'd start on the problem.
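A rough sketch of that governor, with the caveat that the module name `governor`, the functions `start/1` and `acquire/1`, and the message shapes are all made up for illustration. It uses a monitor to notice when a running process exits, and it frees the slot and wakes the next waiter in the same step instead of on the following loop.

```erlang
-module(governor).
-export([start/1, acquire/1]).

%% Start a governor that lets at most Max processes run at once.
start(Max) when is_integer(Max), Max > 0 ->
    spawn(fun() -> loop(Max, 0, queue:new()) end).

%% Called by a follower process. Blocks until the governor replies go.
%% The slot is held until the calling process exits.
acquire(Governor) ->
    Governor ! {request, self()},
    receive
        {go, Governor} -> ok
    end.

loop(Max, Running, Waiting) ->
    receive
        {request, Pid} when Running < Max ->
            %% Free slot: let the requester run straight away.
            loop(Max, grant(Pid, Running), Waiting);
        {request, Pid} ->
            %% At the max: remember the requester and keep looping.
            loop(Max, Running, queue:in(Pid, Waiting));
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            %% A running process exited: free its slot and hand it to
            %% the longest-waiting requester, if there is one.
            case queue:out(Waiting) of
                {{value, Next}, Rest} ->
                    loop(Max, grant(Next, Running - 1), Rest);
                {empty, _} ->
                    loop(Max, Running - 1, Waiting)
            end
    end.

%% Monitor the process so its exit is noticed, tell it to go, and
%% return the new running count.
grant(Pid, Running) ->
    erlang:monitor(process, Pid),
    Pid ! {go, self()},
    Running + 1.
```

Each follower would be spawned with something like `spawn(fun() -> governor:acquire(G), follow(Url) end)`, where `follow/1` is whatever does the actual work. If a waiting process dies before its turn, the monitor fires immediately when it is granted and the slot is released again.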
zamous
Posted: Sat Sep 27, 2008 4:38 pm
Joined: 27 Sep 2008 Posts: 2 Location: Berkeley, CA
I am in the process of doing something very similar, although I am still very new to Erlang. I think two modules that would really help are pg2 and pool.

The pool module looks like it can help load balance across nodes.

pg2 is used to create groups of processes.

When developing Erlang systems like this, it really helps to think of things as services. So I would create a pool of processes that fetch pages, and maybe another pool of processes that parse them. I would probably use a gen_server to manage and dispatch jobs to each pool. When a parsing process sees another URL, it can just send a message to the fetching processes. I would use the pool module to help monitor the system and add or remove processes from the pool depending on load; again, pg2 allows you to do this.

Keep in mind that each process has a mailbox. When a message is sent to a process, it is queued in that mailbox, which acts like a mini queue for each of your processes. Some newbies fail to take this into account and assume that they need one process for every sequential bit of code.
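Here is a small sketch of the fetcher side of that idea. Some assumptions to flag: it uses the pg module rather than pg2, because pg2 was removed in OTP 24 and pg (available from OTP 23) is its replacement; the module name `fetch_service`, the group name `fetchers`, and the message shapes are invented for this example; and the "fetch" is a placeholder that does no real HTTP.

```erlang
-module(fetch_service).
-export([start/1, fetch/1]).

%% Start N fetcher processes and add them to the pg group `fetchers`.
%% Assumes OTP 23 or later, where the pg module is available.
start(N) when is_integer(N), N > 0 ->
    ok = ensure_pg(),
    Pids = [spawn(fun fetcher/0) || _ <- lists:seq(1, N)],
    ok = pg:join(fetchers, Pids),
    Pids.

%% Send a URL to one fetcher chosen at random and wait for its reply.
fetch(Url) ->
    Members = pg:get_members(fetchers),
    Fetcher = lists:nth(rand:uniform(length(Members)), Members),
    Ref = make_ref(),
    Fetcher ! {fetch, self(), Ref, Url},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.

%% Requests wait in the fetcher's mailbox and are handled one at a
%% time, so the mailbox is the per-process work queue.
fetcher() ->
    receive
        {fetch, From, Ref, Url} ->
            %% Placeholder for a real HTTP request.
            From ! {Ref, {fetched, Url}},
            fetcher()
    end.

%% Start the default pg scope unless it is already running. In a real
%% application pg would sit under a supervisor instead of being linked
%% to whichever process happens to call this first.
ensure_pg() ->
    case pg:start_link() of
        {ok, _Pid} -> ok;
        {error, {already_started, _Pid}} -> ok
    end.
```

A parser process that finds a new URL would just call `fetch_service:fetch(Url)`, or send the message without waiting for a reply if it does not need the result itself.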
All times are GMT