
comp.lang.ruby

Marshal Pipe

Carlos J. Hernandez

1/5/2008 11:37:00 PM

I've just re-discovered pipes.
Using Linux bash... stuff like `grep zip.89433 addresses.csv | sort |
head`
Bash pipes work very well for many problems, such as mass downloads and
data filtering.
But they're simplest to implement on line-by-line text data.
This is not a true limitation of pipe architectures.

You can implement data pipes with Marshal.
Within your class, you can define a puts method for the source's
$stdout:

def self.puts(data)
  data = Marshal.dump( data )
  # tell the sink how many bytes to read
  $stdout.print [data.length].pack('l')
  # then print out data
  $stdout.print data
end

and then the sink reads from $stdin:

while data = $stdin.read(4) do
  data = data.unpack('l').shift # bytes to read
  data = $stdin.read( data )    # marshal'ed dump from stdin
  data = Marshal.load( data )   # restored data structure
  # what you do here.........
end

I don't think this is implemented in a standard way anywhere in Ruby (or
any other language), but it looks to me like a really, really good idea.
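For concreteness, here is a self-contained sketch of the length-prefix scheme above, with both ends running in one script via IO.pipe; the helper names write_record and read_record are illustrative, not from the post.

```ruby
reader, writer = IO.pipe

# source side: prefix each marshaled record with a 4-byte length header
def write_record(io, data)
  dump = Marshal.dump(data)
  io.print [dump.length].pack('l')
  io.print dump
end

# sink side: read the header, then exactly that many bytes, then restore
def read_record(io)
  header = io.read(4) or return nil   # nil at end of stream
  Marshal.load(io.read(header.unpack('l').first))
end

write_record(writer, { 'zip' => 89433, 'city' => 'Reno' })
write_record(writer, [1, 2.5, 'three'])
writer.close

while (record = read_record(reader))
  p record
end
```

In a real pipeline the two halves would live in separate programs, with $stdout on the source and $stdin on the sink, exactly as in the post.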

-Carlos

15 Answers

Eric Hodel

1/7/2008 9:59:00 PM


On Jan 5, 2008, at 15:37 PM, Carlos J. Hernandez wrote:
> I've just re-discovered pipes.
> Using Linux bash... stuff like `grep zip.89433 addresses.csv | sort |
> head`
> Bash pipes work very well for many problems, such as mass downloads
> and
> data filtering.
> But they're simplest to implement on line by line text data.
> This is not a true limitation of pipe architectures.
>
> You can implement data pipes with Marshal.
> Within your class, you can define a puts method for the source's
> $stdout:
>
>>
> [...]
>
> and then the sink reads from $stdin:
>
> [...]
>
> I don't think this is implemented in a standard way anywhere in Ruby
> (or
> any other language), but
> looks to me like a really, really good idea.

You've written the core of DRb, which is these data pipes expanded to
a multi-process, multi-machine distributed programming tool.
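A minimal DRb sketch of that idea (the Filter class is illustrative; port 0 asks the OS for any free port): the front object is served by reference, and method arguments and results travel between processes as Marshal data, much like the pipe scheme.

```ruby
require 'drb/drb'

# The object we want to expose to other processes
class Filter
  def transform(rows)
    rows.map(&:upcase)
  end
end

# Server side
DRb.start_service('druby://localhost:0', Filter.new)

# Client side (normally a separate process that knows the server's URI)
remote = DRbObject.new_with_uri(DRb.uri)
p remote.transform(%w[alpha beta])   # arguments and result cross as Marshal data

DRb.stop_service
```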

Carlos J. Hernandez

1/8/2008 5:09:00 AM


Eric, thanks for your comment.
I'll look again, but I don't think I saw in DRb the simplicity achieved
by bash as in:

cat source.txt | filter | sort > result.txt

I'm saying cat, filter, and sort could be ruby programs piping Marshal
data structures.
-Carlos

Robert Klemme

1/8/2008 8:32:00 AM


2008/1/8, Carlos J. Hernandez <carlosjhr64@fastmail.fm>:
> Eric, thanks for your comment.
> I'll look again, but I don't think I saw in DRb the simplicity achieved
> by bash as in:
>
> cat source.txt | filter | sort > result.txt

That line makes you eligible for a "useless cat award".

> I'm saying cat, filter, and sort could be ruby programs piping Marshal
> data structures.

Your solution is still too complicated: you do not need the byte
transfer - in fact, it may be disadvantageous because you need the
full marshaled representation in memory before you can send it. This
is not very nice for streaming processing. Instead, simply directly
marshal data into the pipe:

$ ruby -e '10.times {|i| Marshal.dump(i, $stdout) }' |
    ruby -e 'until $stdin.eof?; p Marshal.load($stdin) end'
0
1
2
3
4
5
6
7
8
9

The question is: how often do you actually need the processing power
of two processes? On a single-core machine the code is probably as
efficient with a single Ruby process (possibly using multiple threads),
and you do not need the piping complexity and marshaling overhead.
For tasks that involve IO, Ruby threads work pretty well. So, I'd be
interested to hear: what is the use case for your solution?
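A sketch of that in-process alternative, using a Thread and a Queue; the doubling stage and the :done sentinel are illustrative only.

```ruby
queue = Queue.new   # thread-safe FIFO, built into Ruby

# producer stage: the equivalent of the source process
producer = Thread.new do
  10.times { |i| queue << i }
  queue << :done              # sentinel marks end of stream
end

# consumer stage: the equivalent of the filter process
consumer = Thread.new do
  results = []
  while (item = queue.pop) != :done
    results << item * 2       # the "filter" step
  end
  results
end

producer.join
p consumer.value              # doubled values, no marshaling needed
```

The data never leaves the process, so no serialization happens at all; that is Robert's efficiency point.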

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Carlos J. Hernandez

1/8/2008 1:57:00 PM


Robert:
Thanks for your performance improvement suggestion.
I did not think of giving Marshal $stdout.
But the problem remains that I don't know ahead of time how many bytes
the Marshal data will have, and I can no longer use "\n", the input
line separator, as a record separator.

As for general usefulness:
If you already have general-purpose cat, filter, transform, and sort
programs, and just want to see the results of manipulating the contents
of some source file, then you just say

cat source.txt | transform | filter | sort > result.txt

I do this kind of stuff all the time; I just have not programmed that way
before. I just started because the model is useful in my data downloads,
where I download history CSVs from Finance.Yahoo.com and transform the
data on the way to appending it to my data files.
There is an impedance problem, though, in having to flatten a data
structure that contains floats, integers, and dates back to a CSV line
every time you go through the pipe, and then restore it in the receiver.
Marshal solves this, except that "\n" can no longer be used as a record
separator. Marshal is more efficient; that's why someone wrote it.

Lastly, computers will be multi-processing from here on...
Faster chips are hitting their physical limits.

BTW, I have an implementation of Marshal Pipes, just as I described in
my opening email.
It works great.

-Carlos

Robert Klemme

1/8/2008 2:22:00 PM


2008/1/8, Carlos J. Hernandez <carlosjhr64@fastmail.fm>:
> Robert:
> Thanks for your performance improvement suggestion.
> I did not think of giving Marshal $stdout.
> But the problem remains that I don't know ahead of time how many bytes

No, this is not a problem because Marshal.load will take care of this
(as you can see from the command line example I posted).

> the Marshal data will have and
> I can no longer use "\n", the input line separator, as a record
> separator.

Not needed as said before.

> As for general usefulness.
> If you already have a general purpose cat, filter, transform, and sort
> programs...
> And just want to see the results of manipulating the contents of some
> source file....
> Then just say
> cat source.txt | transform | filter | sort > result.txt

... and get another "useless cat award". :-)

> I do these kind of stuff all the time, I just have not program that way
> before.
> I just started because the model is useful in my data downloads where
> I download history CSVs from Finance.Yahoo.com and along the way to
> append to my data files,
> I transform the data.
> There is an impedance problem though,
> in having to flatten and convert a data structure that contain floats,
> integers, and dates,
> back to a CSV line every time you go through the pipe, and then restore
> it back in the receiver.
> Marshal solves this, except that "\n" can no longer be used as record
> separators.

Marshal basically just hides the conversion and makes it faster. The
conversion is still there: you have a data structure (say an array),
transform it into a sequence of bytes (either CSV or Marshal format),
send it through a pipe, transform byte sequence back (either from CSV
or Marshal format) and get out the array again. That's why I say it's
more efficient to not use two processes but do it in one Ruby process
most of the time (i.e. on single core machine or with IO bound stuff).

> Marshal is more efficient, that's why someone wrote it.

Not only that. Marshal serves a slightly different purpose, namely
converting object graphs, which can contain loops, into a byte stream
and resurrecting the graph from that byte stream.
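For example, Marshal round-trips a self-referential array, something no line-oriented format like CSV can express at all:

```ruby
a = [1, 2]
a << a                      # the array now contains itself: a cycle

# dump and restore: Marshal records the shared reference, not an infinite copy
copy = Marshal.load(Marshal.dump(a))

p copy[0, 2]                # the plain elements survive
p copy[2].equal?(copy)      # true: the cycle survived the round trip
```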

> Lastly, computer will be multi-processing from here on...
> Faster chips are finding their physical limits.

But OTOH Ruby will rather sooner than later use native threads, and a
multithreaded application is easier, and in this particular case also
more efficient (unless you use tons of memory per processing step),
because you do not need the conversion for IPC. Do you actually
/need/ that processing power?

> BTW, I have an implementation of Marshal Pipes, just as I described in
> my opening email.
> It works great.

That's nice for you. But you proposed a general solution in your
original posting. At least that's what I picked up from your last
statements. With this (public!) discussion we are trying to find out
whether it *is* actually a good idea for the general audience. So far
I haven't been convinced that it is indeed.

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Carlos J. Hernandez

1/8/2008 3:19:00 PM


Robert:

ruby -e '10.times {|i| Marshal.dump(i, $stdout) }' |
    ruby -e 'until $stdin.eof?; p Marshal.load($stdin) end'

THANKS!!!
Did not recognize it at first read, because it's a bit cryptic.
-Carlos

ara.t.howard

1/8/2008 4:59:00 PM



On Jan 7, 2008, at 10:09 PM, Carlos J. Hernandez wrote:

> I'll look again, but I don't think I saw in DRb the simplicity
> achieved
> by bash as in:
>
> cat source.txt | filter | sort > result.txt
>
> I'm saying cat, filter, and sort could be ruby programs piping Marshal
> data structures.

check out ruby queue (rq) - it uses that paradigm but, instead of
marshal'd data, it uses yaml which accomplishes the same goal without
giving up human readability. for instance one might do (simplified)

rq q query tag==foobar
---
jid: 1
tag: foobar
command: processing_stage_a input

so query is dumping a job object, as yaml. then you do

!! | rq q update priority=42 -

which is to say: take the output of the last command, a ruby object,
and feed it into the next command, which takes a job, or jobs, on stdin
when '-' is given, and updates that job in the queue

you can also do things like

rq q query priority=42 tag=foobar | rq q resubmit -

etc.

the pattern is a good one - but i wouldn't touch marshal data over
yaml for the commandline with a ten foot pole: one slip and you'll
blast out chars that will hose the display or disconnect your ssh
session. also, yaml provides natural document separators, so you can
embed more than one document in a stream separated by ---, which allows
for chunking of huge output streams
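a small sketch of that chunking, assuming ruby's standard YAML library: each call to to_yaml emits its own --- header, so concatenated dumps form a valid multi-document stream that YAML.load_stream reads back one document at a time.

```ruby
require 'yaml'

# two "jobs" dumped into one stream; each dump begins with "---"
stream = [{ 'jid' => 1 }, { 'jid' => 2 }].map(&:to_yaml).join
puts stream

# the sink recovers the documents individually
docs = YAML.load_stream(stream)
p docs
```

the stream stays human-readable the whole way, which is the point being made against marshal on the commandline.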

food for thought.

kind regards.



a @ http://codeforp...
--
we can deny everything, except that we have the possibility of being
better. simply reflect on that.
h.h. the 14th dalai lama




Carlos J. Hernandez

1/8/2008 7:02:00 PM


Ara:

Yaml is fine over internet connections, where transmission time is high
compared to CPU time and where human readability is a plus.
For my case, separate programs/processes on the same machine working
very closely, as if a single program in a pipe architecture... Marshal
is better.
In fact, if Marshal is a bit of a hybrid (I don't know the details),
then what I really want is pure binary, I think.

Anyways, for a bit more detail of my implementation, taking out the
specifics of my application and including Robert's comments, I now have:

class MarshalPipe
  def self.puts(data)
    Marshal.dump( data, $stdout )
  end

  def _pipe
    data = nil
    while data = Marshal.load($stdin) do
      pipe(data)
      break if $stdin.eof?
    end
  end
end

I don't know why this did not work:

until $stdin.eof do
  data = Marshal.load($stdin)
  pipe( data )
end

fedzor

1/8/2008 9:41:00 PM



On Jan 7, 2008, at 4:58 PM, Eric Hodel wrote:

> On Jan 5, 2008, at 15:37 PM, Carlos J. Hernandez wrote:
>> I don't think this is implemented in a standard way anywhere in
>> Ruby (or
>> any other language), but
>> looks to me like a really, really good idea.
>
> You've written the core of DRb, which is these data pipes expanded
> to a multi-process, multi-machine distributed programming tool.

I'm really looking to get into DRb, but its DSL and stuff is a
little.... daunting... Is there a slightly toned-down wrapper for it,
or an alternative?

Robert Klemme

1/8/2008 9:58:00 PM


On 08.01.2008 20:01, Carlos J. Hernandez wrote:
> Ara:
>
> Yaml is fine over internet connection where transmission time is high
> compared to cpu time, and
> where human readability is a plus.
> For my case, separate programs/processes on the same machine working
> very closely
> as if a single program in a pipe architecture... Marshal is better.
> In fact, if Marshal is a bit of a Hybrid (don't know the details), then
> what I really want is pure binary, I think.
>
> Anyways, for a bit more details of my implementation,
> taking out the specifics of my application and including Roberts'
> comments,
> I now have:
>
> class MarshalPipe
> def self.puts(data)
> Marshal.dump( data, $stdout )
> end
>
> def _pipe
> data = nil
> while data = Marshal.load($stdin) do
> pipe(data)

What does #pipe do? Why don't you use a block for the processing of the
data? For a general (aka library) solution it would also be much better
to pass the IO as an argument, in case there are more pipes to work with.

> break if $stdin.eof?
> end
> end
> end
>
> I don't know why this did not work:
>
> until $stdin.eof do
> data = Marshal.load($stdin)
> pipe( data )
> end

Probably because this is not the same as my code (hint: punctuation
matters).

Btw, I am still interested to learn the use case where your solution is
significantly better than an in-process solution with Threads and a Queue...

Regards

robert