comp.lang.ruby

net/http and rexml

Louis J Scoras

11/2/2006 6:49:00 PM

Hey all;

Net/http has a method for processing data in segments--by passing the
get method a block--but is there an easy way to get to the socket
directly?

The script I'm working on is grabbing a giant xml file from a remote
site, and I want to process it as it comes in. I've already changed
the code to use rexml's pull parser, so I'm guessing that now all I
need to do is give it the correct IO handle and let it go.

I don't see anything in the docs about exposing the socket, and I
don't want to rip open the class if there's something obvious I'm
missing. Any ideas on the best way to go about this?

--
Lou.

5 Answers

Paul Lutus

11/2/2006 9:34:00 PM

Louis J Scoras wrote:

> Hey all;
>
> Net/http has a method for processing data in segments--by passing the
> get method a block--but is there an easy way to get to the socket
> directly?
>
> The script I'm working on is grabbing a giant xml file from a remote
> site, and I want to process it as it comes in.

You mean, line by line? The socket class you are describing doesn't know
about lines, it knows about blocks. So try this: read a block, split it
into lines, do your processing. If you do this, you will discover some
blocks end in the middle of a line. Then you will say, "Gee, maybe I should
read the whole thing, then do the processing."

At that point, you will understand why the class is written as it is.

> I've already changed
> the code to use rexml's pull parser, so I'm guessing that now all I
> need to do is give it the correct IO handle and let it go.
>
> I don't see anything in the docs about exposing the socket, and I
> don't want to rip open the class if there's something obvious I'm
> missing. Any ideas on the best way to go about this?

Yep. Read the entire thing. Then process the result.

--
Paul Lutus
http://www.ara...

Louis J Scoras

11/2/2006 9:50:00 PM

On 11/2/06, Paul Lutus <nospam@nosite.zzz> wrote:

> You mean, line by line? The socket class you are describing doesn't know
> about lines, it knows about blocks.

Nope. Not line by line. All the parser should need is a token, and I
should only have to read as much data as I need to complete one, so
blocks would be fine.

> So try this: read a block, split it into lines, do your processing. If you
> do this, you will discover some blocks end in the middle of a line. Then you
> will say, "Gee, maybe I should read the whole thing, then do the
> processing."

No, I wouldn't say that ;) I'd just read enough segments into a
buffer until I could complete the next token.

>
> Yep. Read the entire thing. Then process the result.
>

Why? I should be able to start processing simultaneously. That's
what the stream paradigm was developed for. What if I got three
tokens into the xml and found that it was malformed? That would be an
awful waste of bandwidth, no?


--
Lou.

Paul Lutus

11/2/2006 10:02:00 PM

Louis J Scoras wrote:

> On 11/2/06, Paul Lutus <nospam@nosite.zzz> wrote:
>
>> You mean, line by line? The socket class you are describing doesn't know
>> about lines, it knows about blocks.
>
> Nope. Not line by line. All the parser should need is a token, and I
> should only have to read as much data as I need to complete one, so
> blocks would be fine.

And you could set things up to read more data when your block-oriented input
stream is depleted, easy to arrange. This will provide the appearance of a
local stream, a common arrangement in socket reading algorithms.

>> So try this: read a block, split it into lines, do your processing. If
>> you do this, you will discover some blocks end in the middle of a line.
>> Then you will say, "Gee, maybe I should read the whole thing, then do the
>> processing."
>
> No, I wouldn't say that ;) I'd just read enough segments into a
> buffer until I could complete the next token.

s/segments/blocks/

>
>>
>> Yep. Read the entire thing. Then process the result.
>>
>
> Why? I should be able to start processing simultaneously.

Block by block, yes. The block reading back end can be made to appear to be
a stream locally, but there are excellent reasons to read blocks at the
network-protocol level, and sometimes the bigger the better.

> That's
> what the stream paradigm was developed for.

Yes. You can always turn a block into a stream locally. And no, you don't
have to read the entire thing, I just prefer it that way. A personal
preference, nothing more, doubtless springing from my unreliable Internet
access.

--
Paul Lutus
http://www.ara...
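
One way to get the "appearance of a local stream" described above is to push each block into a pipe from a background thread and hand the read end to the parser. What follows is a minimal Ruby sketch, not code from this thread: `blocks_to_io` and `each_block` are hypothetical names, and the commented usage assumes the block form of Net::HTTP's `read_body` and a placeholder URL.

```ruby
# Turn a source of String blocks into a readable IO via a pipe.
# `each_block` (hypothetical name) is any callable that passes each
# block of data to the block it is given.
def blocks_to_io(each_block)
  reader, writer = IO.pipe
  producer = Thread.new do
    begin
      each_block.call { |block| writer.write(block) }
    ensure
      writer.close # closing the write end signals EOF to the reader
    end
  end
  [reader, producer]
end

# Local demonstration: blocks that end in the middle of a token.
demo = lambda { |&sink| ['<a><b>he', 'llo</b>', '</a>'].each { |s| sink.call(s) } }
io, producer = blocks_to_io(demo)
puts io.read
producer.join

# Assumed usage with net/http (placeholder URL, not run here):
#   require 'net/http'
#   uri = URI('http://example.com/big.xml')
#   fetch = lambda do |&sink|
#     Net::HTTP.start(uri.host, uri.port) do |http|
#       http.request_get(uri.request_uri) { |res| res.read_body(&sink) }
#     end
#   end
#   io, producer = blocks_to_io(fetch)
#   parser = REXML::Parsers::PullParser.new(io)
```

The pipe gives the consumer an ordinary IO with `read` and `readline`, so block boundaries stop mattering to the parser. The pipe's buffer is bounded, so the producer thread waits whenever the consumer falls behind.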

Vidar Hokstad

11/2/2006 10:43:00 PM


Paul Lutus wrote:
> You mean, line by line? The socket class you are describing doesn't know
> about lines, it knows about blocks. So try this: read a block, split it
> into lines, do your processing. If you do this, you will discover some
> blocks end in the middle of a line. Then you will say, "Gee, maybe I should
> read the whole thing, then do the processing."
>
> At that point, you will understand why the class is written as it is.

Class TCPSocket has both the methods each_line and readline. That isn't
the problem.

The issue with net/http is that it's an overly complicated API for
something that in most instances is very easy.

> Yep. Read the entire thing. Then process the result.

Not all network streams (or HTTP initiated transfers) ever finish. And
often the files will be too large to process that way - especially with
REXML, which is extremely memory hungry.

A better solution is to use open-uri:
http://www.ruby-doc.org/stdlib/libdoc/open...

Or use a decent HTTP API instead of net/http.

Vidar
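
A rough sketch of the open-uri route with REXML's pull parser could look like the following. `count_elements` is a hypothetical helper and the URL in the comment is a placeholder. Current Ruby versions spell the call `URI.open`; in 2006 it would have been plain `open`. As far as I know, open-uri finishes the download (into memory or a temp file) before it hands back the IO, so this avoids building a full REXML tree but probably does not overlap parsing with the transfer.

```ruby
require 'open-uri'
require 'rexml/parsers/pullparser'

# Count start tags by name, pulling one event at a time from any
# IO-like object instead of building a document tree.
def count_elements(io)
  counts = Hash.new(0)
  parser = REXML::Parsers::PullParser.new(io)
  while parser.has_next?
    event = parser.pull
    counts[event[0]] += 1 if event.start_element? # event[0] is the tag name
  end
  counts
end

# Assumed usage against a remote document (placeholder URL, not run here):
#   URI.open('http://example.com/big.xml') { |io| p count_elements(io) }
```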

Vidar Hokstad

11/2/2006 10:59:00 PM


Louis J Scoras wrote:
> The script I'm working on is grabbing a giant xml file from a remote
> site, and I want to process it as it comes in. I've already changed
> the code to use rexml's pull parser, so I'm guessing that now all I
> need to do is give it the correct IO handle and let it go.
>
> I don't see anything in the docs about exposing the socket, and I
> don't want to rip open the class if there's something obvious I'm
> missing. Any ideas on the best way to go about this?

I fully sympathize... I went through the same mess a while back.
IO is one of the spots where the Ruby standard library is fairly
messy.

net/http is overly complicated for this kind of stuff. Look at open-uri:
http://www.ruby-doc.org/stdlib/libdoc/open...

And you can't actually "expose the socket" for the HTTP stream and
expect things to work - the server might very well be using HTTP/1.1
chunked encoding, which means you'd get the data interspersed with bytes
indicating the length of the following chunk etc., or the connection
might be marked Keep-Alive and use the Content-Length to indicate
how far you should read. So to pass it to REXML you'd need a wrapper,
which is what open-uri provides you with.

Vidar
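
To make the chunked-encoding point concrete, here is a small Ruby sketch of what a raw chunked body looks like on the wire and what a wrapper has to strip before an XML parser can use it. `decode_chunked` is a hypothetical helper written for illustration; it ignores trailers and does no error recovery, and net/http already does this decoding internally.

```ruby
require 'stringio'

# Decode an HTTP/1.1 chunked body from `io`: each chunk is a hex length
# line, CRLF, that many bytes of data, CRLF. A zero length ends the body.
def decode_chunked(io)
  body = String.new
  loop do
    size_line = io.gets("\r\n") or raise EOFError, 'truncated chunk header'
    size = size_line.split(';', 2).first.strip.to_i(16) # drop chunk extensions
    break if size.zero?
    body << io.read(size)
    io.read(2) # skip the CRLF that follows the chunk data
  end
  body
end

raw = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
puts raw.inspect                        # what a raw socket read would see
puts decode_chunked(StringIO.new(raw))  # prints "Wikipedia"
```

Feeding `raw` straight to REXML would put the length lines in the middle of the markup, which is why reading the bare socket is not enough.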