Asp Forum - IO#Foreach -- Max line length

Tristin Davis

3/6/2008 9:06:00 PM

I'm trying to emulate, in Ruby 1.8.6, the new feature in 1.9 that allows you
to specify the maximum length of a line read. Can anyone help?
--
Posted via http://www.ruby-....
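
For reference, here is a minimal sketch of the 1.9 behaviour being asked about, assuming Ruby 1.9 or later, where File.foreach (like IO#gets and IO#each_line) accepts an integer byte limit. The helper name limited_lines and the temp-file demo are illustrative and not from the thread; Tempfile.create with a block needs Ruby 2.1 or later.

```ruby
require 'tempfile'

# Hypothetical helper (not from the thread): collects every piece that
# File.foreach yields when it is given a byte limit (Ruby 1.9+).
def limited_lines(path, limit)
  pieces = []
  # Each piece ends at a newline or after `limit` bytes, whichever comes first.
  File.foreach(path, limit) { |piece| pieces << piece }
  pieces
end

# Demo on a throwaway file containing one over-long line.
Tempfile.create('limit_demo') do |f|
  f.write("short\n" + ("x" * 25) + "\nend\n")
  f.flush
  p limited_lines(f.path, 10)
end
```

A long line comes back as several pieces of at most 10 bytes, so no single read has to hold the whole line in memory.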

12 Answers

7stud --

3/6/2008 10:38:00 PM

Tristin Davis wrote:
> I'm trying to emulate the new feature in 1.9 that allows you to specify
> the maximum length of a line read in Ruby 1.8.6. Can anyone help?

max = 3
count = 0

IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end

  puts line
end
--
Posted via http://www.ruby-....

Tristin Davis

3/6/2008 10:54:00 PM

But by the time you actually get count, isn't the line already read into
memory? If the line is 7 gigabytes, it'll probably crash the system.

7stud -- wrote:
> Tristin Davis wrote:
>> I'm trying to emulate the new feature in 1.9 that allows you to specify
>> the maximum length of a line read in Ruby 1.8.6. Can anyone help?
>
> max = 3
> count = 0
>
> IO.foreach('data.txt') do |line|
> if count == max
> break
> else
> count += 1
> end
>
> puts line
> end

--
Posted via http://www.ruby-....

Arlen Cuss

3/6/2008 11:45:00 PM

[Note: parts of this message were removed to make it a legal post.]

Hi,

On Fri, Mar 7, 2008 at 9:37 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:

> max = 3
> count = 0
>
> IO.foreach('data.txt') do |line|
> if count == max
> break
> else
> count += 1
> end
>
> puts line
> end

Not quite the solution. This reads a number of lines, as opposed to
limiting the length of a single line read.

Arlen

7stud --

3/6/2008 11:52:00 PM

Tristin Davis wrote:
> But by the time you actually get count, isn't the line already read in
> memory. So if the line is 7 gigabytes, it'll probably crash the system.
>
>

Is this what you are looking for:

max_bytes = 30
text = IO.read('data.txt', max_bytes)
puts text
--
Posted via http://www.ruby-....

Peña, Botp

3/7/2008 2:35:00 AM

On Behalf Of Tristin Davis:
# But by the time you actually get count, isn't the line
# already read in
# memory.  So if the line is 7 gigabytes, it'll probably crash
# the system.

read will accept arg on how many bytes to read.

so how about,

irb(main):040:0> File.open "test.rb" do |f| f.read end
=> "a=(1..2)\n\na\nputs a\n\nputs a.each{|x| puts x}"

irb(main):041:0> File.open "test.rb" do |f| f.read 2 end
=> "a="

irb(main):042:0> File.open "test.rb" do |f| f.read 2; f.read 2 end
=> "(1"

irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end
"a="
"(1"
".."
"2)"
"\n\n"
"a\n"
"pu"
"ts"
" a"
"\n\n"
"pu"
"ts"
" a"
".e"
"ac"
"h{"
"|x"
"| "
"pu"
"ts"
" x"
"}"
=> nil

kind regards -botp

Adam Shelly

3/7/2008 5:59:00 PM

On 3/6/08, Peña, Botp <botp@delmonte-phil.com> wrote:
> On Behalf Of Tristin Davis:
> # But by the time you actually get count, isn't the line
> # already read in
> # memory. So if the line is 7 gigabytes, it'll probably crash
> # the system.
>
> read will accept arg on how many bytes to read.
>
> so how about,
>
...
> irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end

That solution essentially ignores linebreaks.
If you want to read up to a linebreak or N characters, whichever comes
first, you could use one of these:

------
class IO
  #read by characters
  def for_eachA(linelen)
    c=0
    while (c)
      buf=''
      linelen.times {
        break unless c=getc
        buf<<c
        break if c.chr== $/
      }
      yield buf
    end
  end

  #read by lines
  def for_eachB(linelen)
    re = Regexp.new(".*?#{Regexp.escape($/)}")
    buf=''
    while (line = read(linelen-buf.length))
      buf = (buf+line).gsub(re){|l| yield l;''}
      if buf.length == linelen
        yield buf
        buf=''
      end
    end
    yield buf
  end
end

File.open("foreach.rb") do |f|
  f.for_eachA(10){|l| p l}
end

File.open("foreach.rb") do |f|
  f.for_eachB(10){|l| p l}
end
------

I'd guess the second version would be faster, but I didn't time it.

-Adam

Tristin Davis

3/8/2008 9:31:00 PM

Thanks for the ideas, Adam. I thought someone might be able to use this, so
I figured I'd post it. It processed about 675,000 records of 1100+ bytes each
in an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance then have at it. :)

module Util

  def too_large?(buffer,max=10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf=''
record = 1
frequency = 100

f = File.open(file,'r')

while c=f.getc
  buf << c

  if too_large?(buf,max=102400)
    p "record #{record} is too long, skipping to end"
    while(x=f.getc)
      if x.chr == $/
        buf=''
        record += 1
        p "At record #{record}" if( (record % frequency ) == 0 )
        break
      end
    end
  end

  if c.chr == $/
    record += 1
    print "At record #{record}" if( (record % frequency ) == 0 )
    buf = ''
  end
end

#If we still have something in the buffer, then it is probably the last record.
unless buf.empty?
  #record += 1
  p "Last record is:" + buf
end

f.close
p record

Adam Shelly wrote:
> On 3/6/08, Peña, Botp <botp@delmonte-phil.com> wrote:
>> On Behalf Of Tristin Davis:
>> # But by the time you actually get count, isn't the line
>> # already read in
>> # memory. So if the line is 7 gigabytes, it'll probably crash
>> # the system.
>>
>> read will accept arg on how many bytes to read.
>>
>> so how about,
>>
> ...
>> irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end
>
> That solution essentially ignores linebreaks.
> If you want to read up to a linebreak or N characters, whichever comes
> first, you could one of these:
>
> ------
> class IO
> #read by characters
> def for_eachA(linelen)
> c=0
> while (c)
> buf=''
> linelen.times {
> break unless c=getc
> buf<<c
> break if c.chr== $/
> }
> yield buf
> end
> end
>
> #read by lines
> def for_eachB(linelen)
> re = Regexp.new(".*?#{Regexp.escape($/)}")
> buf=''
> while (line = read(linelen-buf.length))
> buf = (buf+line).gsub(re){|l| yield l;''}
> if buf.length == linelen
> yield buf
> buf=''
> end
> end
> yield buf
> end
> end
>
> File.open("foreach.rb") do |f|
> f.for_eachA(10){|l| p l}
> end
>
> File.open("foreach.rb") do |f|
> f.for_eachB(10){|l| p l}
> end
> ------
>
> I'd guess the second version would be faster, but I didn't time it.
>
> -Adam

--
Posted via http://www.ruby-....

7stud --

3/9/2008 5:03:00 AM

Tristin Davis wrote:
> Thanks for the ideas Adam. I thought someone might be able to use it so
> I figured i'd post it. It processed about 675,000 1100+ byte records in
> an hour. Not fantastic performance, but it works. If someone can tell
> me how to improve the performance then have at it. :)
>
>
> module Util
>
> def too_large?(buffer,max=10)
> return true if buffer.length >= max
> false
> end
> end
>
> include Util
>
> file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
> buf=''
> record = 1
> frequency = 100
>
> f = File.open(file,'r')
>
> while c=f.getc
  if buf.length < max #(but what if you find a '\n' before max?)
    buf << c
  else
    buf = ''
    f.gets
  end

--
Posted via http://www.ruby-....

Tristin Davis

3/9/2008 10:46:00 AM

That's what the 2nd if statement is; for catching the delimiter if the
buffer isn't too large. I can't use gets b/c I may expend all the
memory before the actual line is read. I'm reading variable length
records, but some of them are bad data and exceed a max length of 100k.
That's what the script is scanning for. :)

7stud -- wrote:
> Tristin Davis wrote:
>> Thanks for the ideas Adam. I thought someone might be able to use it so
>> I figured i'd post it. It processed about 675,000 1100+ byte records in
>> an hour. Not fantastic performance, but it works. If someone can tell
>> me how to improve the performance then have at it. :)
>>
>>
>> module Util
>>
>> def too_large?(buffer,max=10)
>> return true if buffer.length >= max
>> false
>> end
>> end
>>
>> include Util
>>
>> file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
>> buf=''
>> record = 1
>> frequency = 100
>>
>> f = File.open(file,'r')
>>
>> while c=f.getc
> if buf.length < max #(but what if you find a '\n' before max?)
> buf << c
> else
> buf = ''
> f.gets
> end

--
Posted via http://www.ruby-....
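
On Ruby 1.9 or later (not the 1.8.6 used in this thread), the memory concern with gets can be sidestepped by passing a byte limit to it. The following is only a sketch under that assumption: the helper name each_bounded_record is hypothetical, records are assumed to end in "\n", and a record counts as too long when its content, newline excluded, exceeds max_len bytes.

```ruby
require 'stringio'

# Hypothetical helper (not from the thread): yields each record that fits in
# max_len bytes and returns the numbers of the records skipped as too long.
def each_bounded_record(io, max_len)
  skipped = []
  record = 0
  # gets(limit) stops at a newline or after roughly `limit` bytes,
  # so a huge record is never loaded in one piece.
  while (line = io.gets(max_len + 1))
    record += 1
    if line.end_with?("\n") || line.bytesize <= max_len
      yield line, record
    else
      skipped << record
      # Discard the rest of the oversized record in bounded pieces.
      line = io.gets(max_len + 1) until line.nil? || line.end_with?("\n")
    end
  end
  skipped
end

# Demo on an in-memory stream; a File opened for reading works the same way.
io = StringIO.new("ok\n" + ("x" * 20) + "\nfine\n")
skipped = each_bounded_record(io, 5) { |line, n| puts "record #{n}: #{line.inspect}" }
puts "skipped: #{skipped.inspect}"
```

The limit is max_len + 1 so that a record of exactly max_len bytes still arrives together with its newline and is not mistaken for a truncated one.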

7stud --

3/9/2008 9:35:00 PM

Tristin Davis wrote:
> That's what the 2nd if statement is; for catching the delimiter if the
> buffer isn't too large. I can't use gets b/c I may expend all the
> memory before the actual line is read.
>

Look. A string and a file are really no different--except reading from
a file is slow. Therefore, to speed things up read in the maximum every
time you read from the file, and store it in a string. Process the
string just like you would the file. Then read from the file again.

--
Posted via http://www.ruby-....
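
The suggestion above (read a large block, scan it as a string, then read again) might look roughly like the sketch below. It is an illustration under stated assumptions, not code from the thread: the name scan_records, the 64 KB default block size, and the "\n" record separator are all hypothetical choices, and record length is counted without the newline.

```ruby
require 'stringio'

# Hypothetical helper (not from the thread): scans io in fixed-size blocks and
# returns [record_count, numbers_of_records_longer_than_max_len].
def scan_records(io, max_len, chunk_size = 64 * 1024)
  record = 1        # number of the record currently being scanned
  len = 0           # bytes seen so far in the current record
  oversized = []
  while (chunk = io.read(chunk_size))   # read(n) returns nil at end of file
    pos = 0
    while (idx = chunk.index("\n", pos))
      len += idx - pos
      oversized << record if len > max_len
      record += 1
      len = 0
      pos = idx + 1
    end
    len += chunk.length - pos           # partial record carried into the next block
  end
  oversized << record if len > max_len  # final record without a trailing newline
  total = len.zero? ? record - 1 : record
  [total, oversized]
end

# Demo with a tiny block size so that records span several blocks.
p scan_records(StringIO.new("ab\n" + ("x" * 10) + "\ncd\n"), 5, 4)
```

Only one block is held in memory at a time, and only a running length is kept per record, so an arbitrarily long bad record costs no extra memory.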

comp.lang.ruby
