Asp Forum - ruby noob - comp.lang.ruby

Vincent RICHOMME

3/5/2007 11:54:00 AM

Hi,

I would like to parse a very simple HTML file (index_msg.htm), shown
below:

<tr>
<td>WM_ACTIVATE</td>
<td>0x0006</td>
<td></td>
<td>0x0000</td>
<td>WM_NULL</td>
</tr>
<tr>
<td>WM_ACTIVATEAPP</td>
<td>0x001C</td>
<td></td>
<td>0x0001</td>
<td>WM_CREATE</td>
</tr>
....
I would like to parse this file and extract information like this:

enum foo
{
eWM_ACTIVATE = 0x0006,
eWM_ACTIVATEAPP = 0x0001,
...
};

I am starting with this:

fileIn = File.open("C:/WIKI_CE/index_msg.htm", "r")
fileOut = File.new("C:/WIKI_CE/enumWmMsg.h", "w")

begin
  while (line = fileIn.readline)
    line.chomp
    $stdout.print line
  end
rescue EOFError
  fileIn.close
  fileOut.close
end

But now I am stuck. Should I use a regex, or how can I compare two strings?

2 Answers

Peter Szinek

3/5/2007 12:12:00 PM

This should get you started:

=====================================================================
require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define do
  fetch('input.html')

  record do
    var_name 'WM_ACTIVATE'
    code '0x0006'
  end
end

result = data.to_xml.to_s
names = result.scan(/var_name>(.+?)<\/var_name/).flatten
values = result.scan(/code>(.+?)<\/code/).flatten
pairs = names.zip(values)

pairs.each do |name, value|
  puts "e#{name} = #{value}"
end
=====================================================================

The XML-to-array code kind of sucks; in the next version of scRUBYt! you
will be able to output the result directly to a hash (or CSV, YAML, or
some other format that is friendlier for a task like this).
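In the meantime, plain Ruby can already turn the two arrays into a hash. This is only a sketch: the XML string below is hypothetical, shaped to match the regexes in the snippet above, and the real scRUBYt! output may look different.

```ruby
# Hypothetical XML, written only to match the var_name/code regexes above;
# it is an assumption, not captured scRUBYt! output.
result = <<~XML
  <root>
    <record>
      <var_name>WM_ACTIVATE</var_name>
      <code>0x0006</code>
    </record>
    <record>
      <var_name>WM_ACTIVATEAPP</var_name>
      <code>0x001C</code>
    </record>
  </root>
XML

names  = result.scan(/<var_name>(.+?)<\/var_name>/).flatten
values = result.scan(/<code>(.+?)<\/code>/).flatten

# Hash[] accepts an array of [key, value] pairs, which is what zip produces.
table = Hash[names.zip(values)]

table.each { |name, value| puts "e#{name} = #{value}" }
```

A hash keeps each name next to its value, so later steps do not have to rely on two arrays staying in the same order.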

Cheers,
Peter
__
http://www.rubyra... :: Ruby and Web2.0 blog
http://s... :: Ruby web scraping framework
http://rubykitch... :: The indexed archive of all things Ruby

Jano Svitok

3/5/2007 12:14:00 PM

1. Have a look at Hpricot.
2. If that is too heavyweight for you, use a regex with the /m flag and String#scan:

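# /x ignores the whitespace in the pattern itself, so each line needs an
# explicit \s* to match the newlines between cells; /m lets . match newlines.
REGEX = /<tr>\s*
  <td>(.*?)<\/td>\s*
  <td>(.*?)<\/td>\s*
  <td>(.*?)<\/td>\s*
  <td>(.*?)<\/td>\s*
  <td>(.*?)<\/td>\s*
  <\/tr>/xm

file_in = File.read("C:/WIKI_CE/index_msg.htm")
File.open("C:/WIKI_CE/enumWmMsg.h", "w") do |file_out|
  file_in.scan(REGEX) do
    file_out.puts $1, $2, $3, $4, $5
  end
end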

Notes:
1. we_use_snake_case_for_variable_names
2. Use File.open with a block to close the file automatically.
3. You'll have the five cell values in $1..$5.
4. Your expected output seems inconsistent: for WM_ACTIVATE you took the
value from the second cell (0x0006), but for WM_ACTIVATEAPP you took it
from the fourth (0x0001).

In any case, Peter's approach will be easier, and more stable.
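
If you do go the regex route, the whole job can be sketched as below. This is a minimal sketch, not a definitive implementation: the sample HTML is inlined instead of read from disk, and the names build_enum, enum_name and value_column are made up for illustration. value_column defaults to 1 (the second cell) only because the expected output is ambiguous about which cell to use.

```ruby
# Sample rows copied from the question; use File.read(path) for a real file.
html = <<~HTML
  <tr>
  <td>WM_ACTIVATE</td>
  <td>0x0006</td>
  <td></td>
  <td>0x0000</td>
  <td>WM_NULL</td>
  </tr>
  <tr>
  <td>WM_ACTIVATEAPP</td>
  <td>0x001C</td>
  <td></td>
  <td>0x0001</td>
  <td>WM_CREATE</td>
  </tr>
HTML

# build_enum is a hypothetical helper. value_column is the zero-based index
# of the cell holding the value (assumption: 1, i.e. the second cell).
def build_enum(html, enum_name: "foo", value_column: 1)
  row_re  = %r{<tr>\s*((?:<td>.*?</td>\s*)+)</tr>}m  # one whole table row
  cell_re = %r{<td>(.*?)</td>}m                      # one cell inside a row

  entries = html.scan(row_re).map do |(row)|
    cells = row.scan(cell_re).flatten
    "e#{cells[0]} = #{cells[value_column]}"
  end

  "enum #{enum_name}\n{\n#{entries.join(",\n")}\n};\n"
end

enum_text = build_enum(html)
puts enum_text
```

Passing value_column: 3 picks the fourth cell instead. To write the header file, wrap the call in File.open("C:/WIKI_CE/enumWmMsg.h", "w") { |f| f.write(enum_text) }.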
