Asp Forum - NOT reading an entire file into memory

Daniel Brumbaugh Keeney

10/27/2007 9:07:00 PM

I am trying to write a parser for a text-based file format. Files in
this format frequently become very large. While the specification
specifically allows applications to crash on large files, I know
several people who have taken to editing these files by hand in
Notepad or other basic text editors. This format is not at all
friendly for this type of editing, and it is extremely tedious work,
but their programs all crash due to the size of these files.
What I really want to know is:
I had been using File.readline and saving a lot of temporary files via
tempfile.rb (http://www.ruby-doc.org/stdlib/libdoc/tempfile/rdoc/...).
However, I have heard that File.readline is in fact equivalent to
File.read.split('\n').each, which would really ruin my purpose of not
loading the whole file. I'd really like to keep this in ruby, as I
want to package the whole thing via the wonderful rubyscript2exe, as
well as, of course, a standard rubygem.
What I would actually really love is if there was a way to read lines
4 through 7 without reading the whole file.
My current method has made the program not nearly as beautiful as ruby
ought to be.

-------------------------------------------
Daniel Brumbaugh Keeney
Devi Web Development
Devi.WebMaster@gMail.com
-------------------------------------------

8 Answers

Konrad Meyer

10/27/2007 11:48:00 PM

Quoth Devi Web Development:
> I am trying to write a parser for a text-based file format. Files in
> this format frequently become very large. While the specification
> specifically allows applications to crash on large files, I know
> several people who have taken to editing these files by hand in
> Notepad or other basic text editors. This format is not at all
> friendly for this type of editing, and it is extremely tedious work,
> but their programs all crash due to the size of these files.
> What I really want to know is:
> I had been using File.readline and saving a lot of temporary files via
> tempfile.rb (http://www.ruby-doc.org/stdlib/libdoc/tempfile/rdoc/...).
> However, I have heard that File.readline is in fact equivalent to
> File.read.split('\n').each, which would really ruin my purpose of not
> loading the whole file. I'd really like to keep this in ruby, as I
> want to package the whole thing via the wonderful rubyscript2exe, as
> well as, of course, a standard rubygem.
> What I would actually really love is if there was a way to read lines
> 4 through 7 without reading the whole file.
> My current method has made the program not nearly as beautiful as ruby
> ought to be.
>
> -------------------------------------------
> Daniel Brumbaugh Keeney
> Devi Web Development
> Devi.WebMaster@gMail.com
> -------------------------------------------

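# the block form closes the file when done
lines = File.open("myfile") do |f|
  # skip through 3rd line
  3.times { f.readline }

  # read lines 4 through 7 (readline raises EOFError if the file is shorter)
  Array.new(4) { f.readline }
end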

--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

7stud --

10/28/2007 12:29:00 AM

Devi Web Development wrote:
> I have heard that File.readline is in fact equivalent to
> File.read.split('\n').each, which would really ruin my purpose of not
> loading the whole file.
>

I doubt that is true, but as is often the case with Ruby there is no
easily locatable documentation that describes File I/O buffering. Just
in case, here is another solution:

#create a data file containing:
#line 1
#line 2
#...
#line 10

File.open("data.txt", "w") do |file|
  10.times do |i|
    file.puts("line #{i+1}")
  end
end

#read lines 4-7 and display them:
File.open("data.txt") do |file|
  file.each_with_index do |line, i|
    i = i + 1 #i starts at 0

    if i < 4
      next
    elsif i < 8
      puts line
    else
      break
    end

  end
end
--
Posted via http://www.ruby-....

7stud --

10/28/2007 12:31:00 AM

--output:--
line 4
line 5
line 6
line 7
--
Posted via http://www.ruby-....

Konrad Meyer

10/28/2007 2:35:00 AM

Quoth 7stud --:
> Devi Web Development wrote:
> > I have heard that File.readline is in fact equivalent to
> > File.read.split('\n').each, which would really ruin my purpose of not
> > loading the whole file.
> >
>
> I doubt that is true, but as is often the case with Ruby there is no
> easily locatable documentation that describes File I/O buffering. Just
> in case, here is another solution:
>
> #create a data file containing:
> #line 1
> #line 2
> #...
> #line 10
>
> File.open("data.txt", "w") do |file|
> 10.times do |i|
> file.puts("line #{i+1}")
> end
> end
>
>
> #read lines 4-7 and display them:
> File.open("data.txt") do |file|
> file.each_with_index do |line, i|
> i = i + 1 #i starts at 0
>
> if i < 4
> next
> elsif i < 8
> puts line
> else
> break
> end
>
> end
> end

IO#each_with_index and IO#readline probably share the same internals. So
the real answer here is NO: IO#readline is NOT the same as
File.read.split('\n'). That describes IO#readlines.
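
The difference between the two methods can be seen in a small sketch. The
temporary file and its contents are made up for illustration; only
IO#readline and IO#readlines come from the discussion above.

```ruby
require 'tempfile'

# Hypothetical sample data: ten short lines in a temporary file.
sample = Tempfile.new('readline_demo')
10.times { |i| sample.puts("line #{i + 1}") }
sample.close

# IO#readline: returns the next line only, as a single String.
first_line = File.open(sample.path) { |io| io.readline }

# IO#readlines: reads to end of file and returns an Array of all lines.
all_lines = File.open(sample.path) { |io| io.readlines }

puts first_line.inspect
puts all_lines.length
```

So readline hands back one line per call, while readlines is the one that
pulls every line into memory at once.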

--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

7stud --

10/28/2007 7:30:00 AM

Konrad Meyer wrote:
> Quoth 7stud --:
>> #create a data file containing:
>>
>> else
>> break
>> end
>>
>> end
>> end
>
> IO#each_with_index and IO#readline are probably the same internally, so the
> real answer here is that NO, IO#readline is NOT the same as
> File.read.split('\n'), that's IO#readlines.
>

The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

--
Posted via http://www.ruby-....

Konrad Meyer

10/28/2007 8:18:00 AM

Quoth 7stud --:
> Konrad Meyer wrote:
> > Quoth 7stud --:
> >> #create a data file containing:
> >>
> >> else
> >> break
> >> end
> >>
> >> end
> >> end
> >
> > IO#each_with_index and IO#readline are probably the same internally, so the
> > real answer here is that NO, IO#readline is NOT the same as
> > File.read.split('\n'), that's IO#readlines.
> >
>
> The real question is: does readline do any buffering? What about
> each()? If a file has ten lines in it, does ruby access the file ten
> times? Or, does ruby read some reasonable amount of data into a buffer?

Performance isn't everything. If it was, you wouldn't be using ruby. The idea
is that this will work "well enough", shouldn't take too much thought on the
programmer's behalf, and doesn't load the entire (huge) file into ram.

--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

Robert Klemme

10/28/2007 12:51:00 PM

On 28.10.2007 08:29, 7stud -- wrote:
> Konrad Meyer wrote:
>> Quoth 7stud --:
>>> #create a data file containing:
>>>
>>> else
>>> break
>>> end
>>>
>>> end
>>> end
>> IO#each_with_index and IO#readline are probably the same internally, so the
>> real answer here is that NO, IO#readline is NOT the same as
>> File.read.split('\n'), that's IO#readlines.
>>
>
> The real question is: does readline do any buffering? What about
> each()? If a file has ten lines in it, does ruby access the file ten
> times? Or, does ruby read some reasonable amount of data into a buffer?

Ruby does buffering but will not read the whole file unless asked to do so.

There are several ways to access only lines 4 through 7. For example:

# 1
require 'enumerator' # pre 1.9
File.to_enum(:foreach, "foo.dat").each_with_index do |line,idx|
  case idx
  when 0...3
    # ignore
  when 3...7
    puts line
  else
    break # or return or exit
  end
end

# 2
File.open("foo.dat") do |io|
  io.each do |line|
    case io.lineno
    when 1...4
      # ignore
    when 4..7
      puts line
    else
      break
    end
  end
end

# 3
File.foreach "foo.dat" do |line|
  case $.
  when 1...4
    # ignore
  when 4..7
    puts line
  else
    break
  end
end
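
The same idea can be wrapped in a small reusable method. This is only a
sketch: lines_between is a made-up name, the sample file is invented, and it
assumes a Ruby where File.foreach without a block returns an enumerator and
with_index accepts a starting offset (1.9 or later, as far as I know).

```ruby
require 'tempfile'

# Hypothetical helper, not from the thread: collect lines first..last (1-based)
# and stop reading as soon as the range has been passed.
def lines_between(path, first, last)
  selected = []
  File.foreach(path).with_index(1) do |line, number|
    next if number < first
    break if number > last
    selected << line
  end
  selected
end

# Sample data: a ten-line file, "line 1" through "line 10".
data = Tempfile.new('lines_demo')
10.times { |i| data.puts("line #{i + 1}") }
data.close

puts lines_between(data.path, 4, 7)
```

Lines before the range still have to be scanned to find the newlines, but
nothing after line 7 is read beyond what is already in the buffer.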

Kind regards

robert

Ken Bloom

10/29/2007 1:38:00 PM

On Sun, 28 Oct 2007 16:29:47 +0900, 7stud -- wrote:

> Konrad Meyer wrote:
>> Quoth 7stud --:
>>> #create a data file containing:
>>>
>>> else
>>> break
>>> end
>>>
>>> end
>>> end
>>
>> IO#each_with_index and IO#readline are probably the same internally, so the
>> real answer here is that NO, IO#readline is NOT the same as
>> File.read.split('\n'), that's IO#readlines.
>>
>>
> The real question is: does readline do any buffering?

It must. There's no POSIX call that can read until the end of a line, so
you have to read(2) a bunch of data, look for a newline, and if there's
no newline in it you have to read more. If there is a newline in it, then
you have to buffer everything you read that comes after the newline.
That's life with POSIX.
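
The read, search, and keep-the-rest cycle described above can be sketched in
Ruby itself. This is an illustration of the technique, not how the
interpreter is written: each_line_by_chunks and its default chunk size are
made up, and IO#sysread is used because it bypasses Ruby's own buffering.

```ruby
require 'tempfile'

# Sketch only: line reading layered on raw, unbuffered chunk reads.
# each_line_by_chunks and the default chunk size are made up for illustration.
def each_line_by_chunks(io, chunk_size = 4096)
  buffer = String.new
  loop do
    # Hand out every complete line already sitting in the buffer.
    while (newline_at = buffer.index("\n"))
      yield buffer.slice!(0..newline_at)
    end
    begin
      # No newline left in the buffer, so read more. sysread raises
      # EOFError at end of file.
      buffer << io.sysread(chunk_size)
    rescue EOFError
      break
    end
  end
  # Whatever is left is a final line with no trailing newline.
  yield buffer unless buffer.empty?
end

chunk_demo = Tempfile.new('chunk_demo')
10.times { |i| chunk_demo.puts("line #{i + 1}") }
chunk_demo.close

chunked_lines = []
File.open(chunk_demo.path, 'rb') do |io|
  # A tiny chunk size forces lines to straddle chunk boundaries.
  each_line_by_chunks(io, 5) { |line| chunked_lines << line }
end
puts chunked_lines.length
```

Memory use is bounded by the chunk size plus the longest line, not by the
size of the file.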

The standard C library has fgets(3) which can find a newline, but it
probably does its own buffering internally, for the same reasons that
other POSIX apps would.

Ruby uses fread(3), the C library's equivalent of read(2), so ruby has to
do its own buffering.

> What about
> each()? If a file has ten lines in it, does ruby access the file ten
> times? Or, does ruby read some reasonable amount of data into a buffer?

rb_io_each_line implements IO#each_line and IO#each. It boils down to a
loop:

while (!NIL_P(str = rb_io_getline(rs, io))) {
    rb_yield(str);
}

and rb_io_getline reads only as much as it feels is necessary to find
that newline. It doesn't put the whole file in memory at once.
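
At the Ruby level, that C loop behaves roughly like the sketch below.
each_line_sketch is a made-up name and StringIO merely stands in for a real
file; the point is only that each line is fetched on demand.

```ruby
require 'stringio'

# Ruby-level analogue of the C loop: IO#gets returns nil at end of input,
# which ends the loop. each_line_sketch is a made-up name.
def each_line_sketch(io)
  while (line = io.gets)
    yield line
  end
end

# StringIO stands in for a real file here; any object with #gets would do.
source = StringIO.new("one\ntwo\nthree\n")
seen = []
each_line_sketch(source) do |line|
  seen << line
  break if seen.length == 2 # stop early; the third line is never fetched
end
puts seen.inspect
```

Breaking out of the block ends the loop, so the remaining input stays
unread until something asks for it.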

--Ken

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu...

comp.lang.ruby
