| Author |
Message |
|
| lgas |
Posted: Wed Dec 19, 2007 12:06 pm |
|
|
|
Joined: 19 Dec 2007
Posts: 2
|
Hi all,
I'm a newbie just getting into Erlang and so far I love the language. It's been fun to learn and it looks like it has a lot of potential for being a great tool in a lot of situations...
That being said, I have been having a hard time finding information on design patterns for building highly concurrent applications.
It looks to me like OTP provides a lot of great stuff... supervision trees, behaviors, etc. all of which are great and seem to address various issues that are associated with highly concurrent applications -- but I haven't seen much that addresses the heart of the concurrency issue directly.
For example, two of the applications that I am interested in using Erlang for are basically highly customized webcrawlers.
One would be used for automatically mirroring (potentially very large) sites -- think "wget -r", and the other would be for following chains of redirected URLs. E.g. given URL A, you go to it and it redirects to B, B to C, and C to D. We need to map URL A->D, once again for potentially very large pools of URLs.
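The A->D mapping described above can be sketched as a pure function. This is only an illustration under stated assumptions: the module name `redirect_chain`, the function `resolve/2`, and the idea of holding known redirects in a map are all hypothetical, and a real crawler would discover each hop by issuing HTTP requests rather than looking it up.

```erlang
-module(redirect_chain).
-export([resolve/2]).

%% Redirects is a map from a URL to the URL it redirects to
%% (hypothetical input; a real crawler would learn each hop over HTTP).
%% resolve/2 follows the chain until it reaches a URL with no entry,
%% giving up after 10 hops to guard against redirect loops.
resolve(Url, Redirects) ->
    resolve(Url, Redirects, 10).

resolve(Url, _Redirects, 0) ->
    {error, {too_many_redirects, Url}};
resolve(Url, Redirects, HopsLeft) ->
    case maps:find(Url, Redirects) of
        {ok, Next} -> resolve(Next, Redirects, HopsLeft - 1);
        error -> {ok, Url}
    end.
```

For the chain in the post, `redirect_chain:resolve("A", #{"A" => "B", "B" => "C", "C" => "D"})` follows three hops and stops at the first URL that has no further redirect.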
If I'm not mistaken, I could build a super OTP-ish application that incorporates all the various elements from the OTP manual -- supervision trees, gen_server, etc. -- but unless I specifically design it to be concurrent, I will have a highly available sequential crawler, which obviously wouldn't be very helpful (and would also be much easier in a language like Ruby).
Now obviously I could take this simple sequential crawler and make it concurrent by spawning a new process to crawl and parse each URL and then recursively spawn more processes to handle each extracted URL... the problem is that the workload would increase exponentially, and even if Erlang can scale to hundreds of thousands of processes on a single box, if my target site or sites are big enough, I would eventually run out of resources.
Even if I were to break this up across nodes I could still run into a situation where I run out of resources... but for the moment let's keep things simple and assume that I only have one box available.
(As an aside, it seems intuitive to me that you would rapidly reach a point of diminishing returns where spawning more simultaneous processes would begin to slow things down. But whether this is even true or not, I still have my fundamental issue that I could eventually exceed the capacity of my box if I just naively spawned a new process for every extracted URL).
The solution in the languages I am more familiar with (Java, Ruby) would be something like a thread pool, where you create a fixed-size pool of threads and they pull URLs off a queue to process as fast as they can. Obviously I could implement something like this in Erlang relatively easily, but I feel like this sort of situation should come up all of the time in highly concurrent applications, and it should either be handled by something like OTP or there should be other sets of libraries out there for addressing these types of issues.
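For comparison, here is a minimal sketch of the fixed-size pool idea in plain Erlang, with no OTP behaviours. The module name `url_pool` and the function `run/3` are made up for illustration, and the calling process itself plays the role of the queue. Workers are linked to the caller, so a crash in the supplied fun also takes the caller down; a production version would want supervision instead.

```erlang
-module(url_pool).
-export([run/3]).

%% run(Urls, PoolSize, Fun) applies Fun to every URL using at most
%% PoolSize worker processes. It returns {Url, Result} pairs in
%% completion order, which is not necessarily the input order.
run(Urls, PoolSize, Fun) when is_integer(PoolSize), PoolSize > 0 ->
    Parent = self(),
    Workers = [spawn_link(fun() -> worker(Parent, Fun) end)
               || _ <- lists:seq(1, PoolSize)],
    Results = dispatch(Urls, length(Urls), []),
    %% Every idle worker has one unanswered `ready` message; consume it
    %% and tell the worker to stop, leaving the caller's mailbox clean.
    lists:foreach(fun(W) -> receive {ready, W} -> W ! stop end end, Workers),
    Results.

%% The caller acts as the queue: it hands one URL to each worker that
%% reports ready, and collects results until none are pending.
dispatch(_Queue, 0, Acc) ->
    lists:reverse(Acc);
dispatch(Queue, Pending, Acc) ->
    receive
        {ready, Worker} when Queue =/= [] ->
            [Url | Rest] = Queue,
            Worker ! {work, Url},
            dispatch(Rest, Pending, Acc);
        {done, Url, Result} ->
            dispatch(Queue, Pending - 1, [{Url, Result} | Acc])
    end.

%% A worker announces that it is free, then waits for work or `stop`.
worker(Parent, Fun) ->
    Parent ! {ready, self()},
    receive
        {work, Url} ->
            Parent ! {done, Url, Fun(Url)},
            worker(Parent, Fun);
        stop ->
            ok
    end.
```

A call such as `url_pool:run(Urls, 10, fun my_crawler:fetch/1)` (where `my_crawler:fetch/1` is a hypothetical fetch function) would keep at most ten fetches in flight no matter how long `Urls` is.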
(As another aside, this rings even more true when you start talking about distributing applications like this across nodes, but again, we don't even need to complicate things by bringing that into the equation).
So... while I would happily welcome feedback on my specific crawler applications, what I'm really interested in is the general guidance: Am I missing something big? Does OTP already take care of this kind of stuff for me and I just skipped that page in the documentation? Are there other (ideally Open Source) libraries out there for this kind of stuff? Is my thinking just stuck in an imperative or threaded world and I need to shift to functional or concurrent thinking? Are there good papers to read on this topic? Are there good open source projects (ejabberd?) that I should be looking at?
Thanks in advance. |
|
| lgas |
Posted: Sat Dec 22, 2007 11:59 pm |
|
|
|
Joined: 19 Dec 2007
Posts: 2
|
Nothing?
I found this today:
http://code.google.com/p/erlinda/
I haven't had any time to look into it yet, but I'm hoping it will provide some of what I am looking for. |
|
| thanos |
Posted: Thu Jan 10, 2008 4:41 pm |
|
|
|
Joined: 23 Nov 2007
Posts: 5
Location: new york
|
http://code.google.com/p/erlinda/ is still a null project, but it presents some interesting, if Javaesque, ideas.
I used YAWS for a similar requirement. Inefficient, maybe, but really convenient and easy. For instance, in your case I would have a YAWS handler to process a URL.
It would spawn a processor depending on the type of page (Factory). That processor would "recursively" submit each URL to the YAWS server.
Be lazy and let YAWS handle all the issues.
You can even have a YAWS page to show you the current status and results of your crawl and another to control your whole app.
You could use erlweb and do everything using its optional Controller Hooks. Also check out what the libtre project gets up to.
Time to market is what counts, and this will work and scale well; worry about the whole caboodle with a well-crafted framework later.
And yes, you are right: there is no consensus on distributed development, nor a single Rails-like stack solution. |
|
| Nikimathew |
Posted: Thu Sep 25, 2008 11:23 am |
|
|
|
Joined: 25 Sep 2008
Posts: 1
|
Many people refer to Erlang as “Erlang/OTP.” OTP stands for Open Telecom Platform, and is more or less a set of libraries that come packaged with Erlang. They consist of Erlang behaviors (or behaviours, technically) for writing servers, finite state machines, and event managers. But that is not all: OTP also encompasses the application behavior, which allows programmers to package their Erlang code into a single “application,” and the supervisor behavior, which allows programmers to create a hierarchy of processes in which a process that dies is restarted.
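As a rough illustration of the supervisor behaviour mentioned above, the sketch below supervises a single placeholder worker and restarts it whenever it dies. The module name `crawler_sup` and the bare `spawn_link` worker are assumptions made to keep the example in one file; a real application would normally supervise a gen_server. The map-based supervisor flags assume OTP 18 or later.

```erlang
-module(crawler_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, init/1]).

%% Start the supervisor and register it locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Supervisor callback: restart the worker whenever it dies, giving up
%% if it crashes more than 5 times within 10 seconds.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Worker = #{id => worker,
               start => {?MODULE, start_worker, []},
               restart => permanent},
    {ok, {SupFlags, [Worker]}}.

%% A placeholder worker: a bare process that just waits for messages.
%% It must be linked to the supervisor and returned as {ok, Pid}.
start_worker() ->
    {ok, spawn_link(fun worker_loop/0)}.

worker_loop() ->
    receive
        _Msg -> worker_loop()
    end.
```

Killing the worker with `exit(Pid, kill)` and then calling `supervisor:which_children(crawler_sup)` should show a fresh pid in its place. Note that this gives restarts, not concurrency, which is the distinction the original question draws.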
|
| iWantToKeepAnon |
Posted: Sat Sep 27, 2008 4:41 am |
|
|
|
User
Joined: 31 Jul 2007
Posts: 14
Location: Dallas, TX
|
I'm pretty new to this, so this is just a newbie's 2 cents. Processes are cheap, right? So spawn a process for every URL you're tracking. You may get too many processes for your resources, but don't worry about that.
Now, the URL "follower" routine requests the right to run from some "throttler" process. The throttler, actually let's call it a governor, keeps a list of processes wanting to run and a count of those already running. If the running count is less than the max, it replies "go" and increments the count.
If the governor is at its max, it stores the PID of the requestor in a list and just tail-recursively loops.
When the governor notices that a resource-consuming child has exited, it decrements the count and loops. The next loop recognizes that there is a free slot and replies "go" to the first PID in the list; increments, loops w/o first PID, etc...
I hope that makes sense, and that's probably how I'd start on the problem. |
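The governor described above could be sketched roughly as follows. The module name `governor` and the functions `start/1` and `acquire/1` are hypothetical, and there is one small deviation from the description: when a running process exits and someone is waiting, the slot is handed straight to the first waiter instead of decrementing and re-incrementing the count. Exits are detected with `erlang:monitor/2`, so a slot is freed even if the holder crashes.

```erlang
-module(governor).
-export([start/1, acquire/1]).

%% start(Max) spawns a governor that lets at most Max processes run at once.
start(Max) when is_integer(Max), Max > 0 ->
    spawn(fun() -> loop(Max, 0, queue:new()) end).

%% acquire(Governor) blocks the caller until the governor replies "go".
%% The slot is released automatically when the calling process exits.
acquire(Governor) ->
    Governor ! {request, self()},
    receive
        {Governor, go} -> ok
    end.

loop(Max, Running, Waiting) ->
    receive
        {request, Pid} when Running < Max ->
            %% Free slot: let the requester run straight away.
            admit(Pid),
            loop(Max, Running + 1, Waiting);
        {request, Pid} ->
            %% At the limit: remember the requester and keep looping.
            loop(Max, Running, queue:in(Pid, Waiting));
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            %% A running process exited: pass its slot to the first
            %% waiter, or decrement the count if nobody is waiting.
            case queue:out(Waiting) of
                {{value, Next}, Rest} ->
                    admit(Next),
                    loop(Max, Running, Rest);
                {empty, _} ->
                    loop(Max, Running - 1, Waiting)
            end
    end.

%% Monitor the process so its exit frees the slot, then tell it to go.
admit(Pid) ->
    erlang:monitor(process, Pid),
    Pid ! {self(), go}.
```

Each URL process would call `governor:acquire(G)` as its first step and then do its fetch. If a waiting process dies before it is admitted, monitoring the dead pid produces an immediate 'DOWN' message, so the slot is not lost.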
|
| zamous |
Posted: Sat Sep 27, 2008 4:38 pm |
|
|
|
Joined: 27 Sep 2008
Posts: 2
Location: Berkeley, CA
|
I am in the process of doing something very similar, albeit while still very new to Erlang. I think two modules that would really help are pg2 and pool.
The pool module looks like it can help load balance across nodes.
pg2 is used to create groups of processes.
When developing Erlang systems like this, it really helps to think of things as services. So I would create a pool of processes that can actually fetch a page. Maybe another pool of processes that can parse a page. I would probably use a gen_server to manage and dispatch jobs to each pool. When your process that is parsing a page sees another URL, it can just send a message to your fetching processes. I would use the pool module to help monitor the system and add/subtract processes from the pool depending on load -- again, pg2 allows you to do this.
Keep in mind that each process has a mailbox, so when a message is sent to a process, it is queued there; the mailbox acts as a mini queue for each of your processes. Some newbies fail to take this into account and assume that they need one process for every sequential bit of code. |
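A small sketch of the process-group and mailbox points above. It uses the `pg` module rather than `pg2`, because `pg2` was removed in OTP 24 and `pg` (available from OTP 23) is its replacement, so this assumes OTP 23 or later. The module name `fetch_group`, the group name `fetchers`, and the message shapes are all made up for illustration, and the "fetch" itself is a placeholder that only reports which process handled the URL.

```erlang
-module(fetch_group).
-export([start/1, submit/2]).

%% start(N) makes sure the default pg scope is running, then spawns N
%% fetcher processes that all join the (hypothetical) group `fetchers`.
start(N) when is_integer(N), N > 0 ->
    case pg:start(pg) of
        {ok, _} -> ok;
        {error, {already_started, _}} -> ok
    end,
    Pids = [spawn(fun fetcher/0) || _ <- lists:seq(1, N)],
    ok = pg:join(fetchers, Pids),
    Pids.

%% submit(Url, ReplyTo) sends the URL to a randomly chosen group member
%% (it crashes if the group is empty). The send returns immediately; the
%% message waits in that fetcher's mailbox until the fetcher is free.
submit(Url, ReplyTo) ->
    Members = pg:get_members(fetchers),
    Pid = lists:nth(rand:uniform(length(Members)), Members),
    Pid ! {fetch, Url, ReplyTo},
    Pid.

%% A fetcher handles one message at a time, in arrival order, so its
%% mailbox acts as a per-process queue. The fetch is a placeholder.
fetcher() ->
    receive
        {fetch, Url, ReplyTo} ->
            ReplyTo ! {fetched, Url, self()},
            fetcher()
    end.
```

Because senders never block, a parser process that finds a new URL can call `fetch_group:submit(Url, self())` and carry on parsing. A random pick is the simplest dispatch policy; a gen_server dispatcher, as suggested in the post, could instead track which fetchers are idle.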
|
All times are GMT