comp.lang.ruby

Does Ruby need a "line separator" class?

Wes Gamble

7/31/2006 3:40:00 PM

I've run into a problem where Ruby can't handle newlines on Windows
because the regexp is explicitly looking for \n and not \r\n.

In the Java world, there is a system property to represent line
separator so that you can write code that is cross-platform with respect
to line separation on Unix/Windows/Mac. Is there an equivalent
abstraction of the newline character in Ruby? If not, where does it
belong?

For some reason, I thought I read somewhere that sometimes the "\n"
character is overloaded in this way (to represent a "newline" regardless
of platform), but not sure if I'm misremembering.

Thanks,
Wes

--
Posted via http://www.ruby-....

9 Answers

Xavier Noria

7/31/2006 4:09:00 PM

On Jul 31, 2006, at 5:40 PM, Wes Gamble wrote:

> I've run into a problem where Ruby can't handle newlines on Windows
> because the regexp is explicitly looking for \n and not \r\n.

It shouldn't look for CRLFs. The rule of the game in languages that
inherit the newline normalization approach from C (C++ and Perl, for
instance, but not Java) is that if you work in text mode and the text
file follows the conventions of the runtime platform, you only read
and print "\n"s.

That's because there's an intermediate IO layer that transforms CRLF
into LF in CRLF platforms on reading, and LF back to CRLF on writing.

In Java this is handled in a different way: "\n" is not portable in
Java. Portable code in Java uses method calls like println. But in
Ruby, a portable regexp that assumes text mode and data that follows
the runtime platform's newline conventions has to use "\n"; no CR
ever gets into the string.
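
To make the translation concrete, here is a minimal sketch that imitates what a C-style text-mode layer does on a CRLF platform. The helper names `text_mode_read` and `text_mode_write` are hypothetical; they are not part of Ruby or of any C runtime, and they only simulate the behavior on plain strings so the example runs the same on every platform.

```ruby
# Hypothetical helpers that imitate a text-mode I/O layer on a CRLF platform.

# Reading: each CRLF pair found on disk reaches the program as a bare LF.
def text_mode_read(raw_bytes)
  raw_bytes.gsub("\r\n", "\n")
end

# Writing: each LF printed by the program is stored on disk as CRLF.
def text_mode_write(string)
  string.gsub("\n", "\r\n")
end

on_disk   = "first line\r\nsecond line\r\n"
in_memory = text_mode_read(on_disk)

# The program only ever sees "\n", so a regexp written with "\n" matches.
matched = !(in_memory =~ /first line\nsecond line\n/).nil?

# Writing the string back restores the platform convention.
round_trip = text_mode_write(in_memory)
```

Under these assumptions `matched` is true and `round_trip` equals `on_disk`, which is why the regexp itself should not mention "\r".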

-- fxn




Wes Gamble

7/31/2006 4:24:00 PM

Xavier,

That's interesting.

In a pure Ruby (Rails) app, I've had to modify regexps to handle the
\r\n sequence so that my regexps will work in a Windows environment.
I'm guessing that this is related to the "file follows runtime
conventions" in your post. Meaning that the file that I'm processing
(which is actually sourced externally) did not conform to C runtime
conventions when it was written.

In general, this seems simple enough to handle: you just allow for
optional \r and \n combinations in your regexp (assuming the
multiline flag is set for the regexp), like so:

[^\r\n]*
[\r\n]*
(\r*\n*)
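
A slightly stricter variant of the same idea, as a sketch: instead of a character class that accepts any run of \r and \n, match exactly one line terminator at a time. The constant name `ANY_NEWLINE` is made up for this example.

```ruby
# Matches one line terminator in any of the three classic conventions.
# CRLF is listed first so the pair is consumed as a single terminator.
ANY_NEWLINE = /\r\n|\r|\n/

mixed = "unix\nwindows\r\nold mac\rlast"
lines = mixed.split(ANY_NEWLINE)

# A class such as [\r\n]+ treats a blank line as part of the separator,
# while the alternation keeps it as an empty string.
with_blank_line = "a\r\n\r\nb"
strict = with_blank_line.split(ANY_NEWLINE)
loose  = with_blank_line.split(/[\r\n]+/)
```

Here `lines` holds the four pieces without any terminator, `strict` keeps the empty line between "a" and "b", and `loose` drops it. Which behavior is wanted depends on whether blank lines are significant in the data.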

Wes




Wes Gamble

7/31/2006 4:30:00 PM

FWIW, I'm pursuing this question because of the JRuby issue.


Xavier Noria

7/31/2006 4:40:00 PM

On Jul 31, 2006, at 6:15 PM, Charles O Nutter wrote:

> This has come up in the JRuby project fairly frequently since Java
> wants to
> normalize line-terminators internally to the underlying platform,
> rather
> than normalizing to \n and handling conversion on read-write.
> Xavier, are
> you saying that Ruby has in its IO layer code to convert from CRLF
> to LF on
> input/output, and this is the primary means of normalizing
> newlines? We have
> had in our bug tracker a patch that resolves JRuby's newline issues
> in a
> similar way, but had not committed it pending research into whether
> this
> would be appropriate and sufficient.

If I am not mistaken, in Ruby that is delegated to stdio. After a
quick code inspection I think the exact point where that is done is
in the call to write():

r = write(fileno(f), RSTRING(str)->ptr+offset, l);

That's in the function io_fwrite(), line 455 of io.c in Ruby 1.8.4.

In Perl that was delegated to stdio as well until 5.8.0, where the
I/O layer was replaced with PerlIO, which is now responsible for
that filtering on CRLF platforms.

-- fxn


Xavier Noria

7/31/2006 4:55:00 PM

On Jul 31, 2006, at 6:23 PM, Wes Gamble wrote:

> In a pure Ruby (Rails) app, I've had to modify regexps to handle the
> \r\n sequence so that my regexps will work in a Windows environment.
> I'm guessing that this is related to the "file follows runtime
> conventions" in your post. Meaning that the file that I'm processing
> (which is actually sourced externally) did not conform to C runtime
> conventions when it was written.

Yes, that is an important point.

When we talk about portability as far as newlines are concerned, we
are assuming that the newline conventions of the platform and of the
data match. A portable line-oriented script might fail if it is
running on Linux and processing text files from a FAT32 partition
that were generated by some Windows program. There are a lot of
common situations in which conventions may not match. A portable
line-oriented script is not supposed to handle those situations; a
robust line-oriented script should do something sensible with
foreign conventions.
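
As an illustration of what goes wrong when the conventions do not match, here is a small sketch that feeds CRLF and CR-only data to standard line-oriented idioms without any translation layer in between. The variable names are only for the example.

```ruby
crlf_data = "alpha\r\nbeta\r\n"

# String#lines splits on "\n" only, so each line keeps its "\r".
raw_lines = crlf_data.lines

# An anchored regexp fails because "\r" sits between "beta" and the "\n".
anchored_hit = crlf_data =~ /^beta$/

# chomp with no argument strips "\n", "\r" or "\r\n", so it repairs each line.
clean_lines = raw_lines.map(&:chomp)

# CR-only data (old Mac convention) is not split into lines at all.
cr_data  = "alpha\rbeta\r"
cr_lines = cr_data.lines
```

With this input `anchored_hit` is nil, `clean_lines` is the expected pair of words, and `cr_lines` is a single element containing the whole string.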

Web programming is one of them, because you cannot assume anything
about input that comes from a text area or an uploaded text file, for
instance. In that case you had better normalize first (written on the
fly):

normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")
# Now normalized_text_area contains only "\n" newlines and all
# standard line-oriented idioms will work on it.

In Ruby we are done because "\n" is "\012" everywhere; in Perl that
gets slightly more complicated because "\n" eq "\015" on Mac OS
pre-X. But you see the idea and why you do that.

-- fxn (<-- whose article about newlines for O'Reilly is about to
appear)


Xavier Noria

7/31/2006 5:39:00 PM

On Jul 31, 2006, at 7:27 PM, Charles O Nutter wrote:

> A large part of our problem is that we currently tend to normalize
> everything to \n....all the time. That has the effect of also
> writing out \n
> to the filesystem for newlines, which as you describe above causes
> problems
> when trying to re-read. So for the case in question, we run Rails...it
> generates files with newlines...we normalize those newlines to \n
> and write
> such to disk...and then future use of those files (in this case, ERB
> templates) fails because the newlines aren't handled correctly
> (i.e. we
> can't normalize \r\n to \n again because they're already \n on disk).

If those files are only handled by that application there is no
problem because \ns are precisely what the script should see.

For instance, if you pass a Unix text file to a line-oriented script
running on Windows, the script will work as long as it only reads.
That's because LFs not following a CR are left untouched by the I/O
layer, and by a happy coincidence LFs are what readline expects. So
everything works; by chance, but it works.

The problem is that the application generates text files that do not
follow the conventions of the platform, and other programs may assume
they do.

-- fxn


Xavier Noria

7/31/2006 6:22:00 PM

On Jul 31, 2006, at 6:54 PM, Xavier Noria wrote:

> normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")

Just for the archives, this normalizes in Ruby in a single pass:

normalized_text_area = text_area.gsub(/\015\012?/, "\n")

though it is less explicit. While we are at it, let me add that if
the text is Unicode it may come with a few more code points for
newlines. All in all this is a PITA, like character encodings, but it
is what we've got for historical reasons.
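
A minimal sketch of how the single-pass idea might be extended to Unicode text. It assumes the input is a UTF-8 string and that NEL (U+0085), LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029) should all become "\n"; whether that is appropriate depends on the application. The method name `normalize_newlines` is hypothetical.

```ruby
# CR optionally followed by LF, or one of three Unicode line-break characters.
# Assumption: the input string is UTF-8 (or plain ASCII).
FOREIGN_NEWLINE = /\r\n?|[\u0085\u2028\u2029]/

# Hypothetical helper: rewrite every foreign newline as "\n" in one pass.
def normalize_newlines(text)
  text.gsub(FOREIGN_NEWLINE, "\n")
end

sample     = "dos\r\nmac\rnel\u0085ls\u2028ps\u2029unix\nend"
normalized = normalize_newlines(sample)
```

After the call, `normalized.split("\n")` yields the seven words in order, and existing "\n" characters are left untouched.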

-- fxn




Wes Gamble

7/31/2006 6:42:00 PM

I was thinking about this a little more.

Why wouldn't JRuby just take advantage of the Java runtime's
normalization facility in this case, using the JVM's notion of "newline"
on the particular platform to handle I/O?

Is the JRuby issue that only _some_ of the code that is doing I/O is
pure Java and some other set of the code is Ruby so that trying to
always use the JVM "line separator" concept won't work?

Wes



Wes Gamble

7/31/2006 7:18:00 PM

In this particular case, could
java.lang.System.getProperty("line.separator") be used to handle
platform-specific reading/writing? That way, you get to piggyback on
the multiplatform support built into Java. If the low-level I/O code is
centralized, it seems like this would be the way to go.

Are there performance implications for this approach? Seems like you
could just grab all of the system specific newline properties from the
System object upon the initialization of the JRuby interpreter and just
refer to them later.
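
A rough sketch of the "look it up once at startup" idea, written in Ruby. It is not how JRuby is implemented. It assumes that under JRuby the Java system property is reachable through the Java integration layer, and on other Rubies it falls back to a guess based on `RbConfig::CONFIG["host_os"]`; the helper names and the fallback heuristic are assumptions made for this example.

```ruby
require "rbconfig"

# Hypothetical helper: determine the platform line separator.
def platform_line_separator
  if RUBY_PLATFORM == "java"
    # Under JRuby, ask the JVM (assumes JRuby's Java integration is available).
    require "java"
    java.lang.System.getProperty("line.separator")
  elsif RbConfig::CONFIG["host_os"] =~ /mswin|mingw/i
    "\r\n" # assumption: native Windows builds use CRLF
  else
    "\n"
  end
end

# Looked up once and reused, so there is no per-write property lookup.
LINE_SEPARATOR = platform_line_separator.freeze

# Hypothetical helper: convert in-memory "\n" text to the platform convention.
def to_platform(text)
  text.gsub("\n", LINE_SEPARATOR)
end

converted = to_platform("one\ntwo\n")
```

Since the separator is read a single time, the cost of the lookup should not matter; the conversion itself is one pass over the text on write.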

Wes

