Jano Svitok
3/5/2007 12:14:00 PM
On 3/5/07, mosfet <richom.v@free.fr> wrote:
> Hi,
>
> I would like to parse a very simple html(index_msg.htm) file described
> below :
>
> <tr>
> <td>WM_ACTIVATE</td>
> <td>0x0006</td>
> <td></td>
> <td>0x0000</td>
> <td>WM_NULL</td>
> </tr>
> <tr>
> <td>WM_ACTIVATEAPP</td>
> <td>0x001C</td>
> <td></td>
> <td>0x0001</td>
> <td>WM_CREATE</td>
> </tr>
> ...
> I would like to parse this file and to extract information like this :
>
> enum foo
> {
> eWM_ACTIVATE = 0x0006,
> eWM_ACTIVATEAPP = 0x0001,
> ...
> };
>
> I am starting with this :
>
>
> fileIn = File.open("C:/WIKI_CE/index_msg.htm", "r")
> fileOut = File.new("C:/WIKI_CE/enumWmMsg.h", "w")
>
> begin
> while (line = fileIn.readline)
> line.chomp
> $stdout.print line
> end
> rescue EOFError
> fileIn.close
> fileOut.close
> end
>
> but now I am stuck. Should I use regex or how can I compare two string ?
1. Have a look at Hpricot.
2. If that is too big for your needs, use a regex with the /m flag (so that . also matches newlines) and String#scan:
REGEX = /<tr>\s*
<td>(.*?)<\/td>\s*
<td>(.*?)<\/td>\s*
<td>(.*?)<\/td>\s*
<td>(.*?)<\/td>\s*
<td>(.*?)<\/td>\s*
<\/tr>/xm
file_in = File.read("C:/WIKI_CE/index_msg.htm")
File.open("C:/WIKI_CE/enumWmMsg.h", "w") do |file_out|
  file_in.scan(REGEX) do
    file_out.puts $1, $2, $3, $4, $5
  end
end
Notes:
1. we_use_snake_case_for_variable_names
2. Use File.open with a block to close the file automatically.
3. Inside the scan block, the five cell values are in $1..$5.
4. Your example seems inconsistent: for the first row you took the value
from the second cell (0x0006), for the other row from the fourth cell (0x0001).
In any case, Peter's approach will be easier, and more stable.
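
For what it's worth, here is a minimal sketch of how the scan results could be turned into the enum from the question. The helper name enum_from_html, the ROW constant and the value_column parameter are made up for illustration. Because of the inconsistency in note 4, the column holding the value is only a guess: it defaults to the second cell and can be changed. The sample HTML is the two rows from the question, inlined so the snippet runs without the file.

```ruby
# One table row with five cells. %r{...} avoids escaping the slashes;
# the m flag lets . match newlines inside a cell.
ROW = %r{<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>}m

# Hypothetical helper: returns the enum source as a string.
# value_column is the 0-based index of the cell used as the value
# (assumption: 1, i.e. the second cell).
def enum_from_html(html, enum_name = "foo", value_column = 1)
  # With capture groups, String#scan returns one array of cells per row.
  entries = html.scan(ROW).map do |cells|
    "  e#{cells[0]} = #{cells[value_column]}"
  end
  "enum #{enum_name}\n{\n#{entries.join(",\n")}\n};\n"
end

html = <<~HTML
  <tr>
  <td>WM_ACTIVATE</td>
  <td>0x0006</td>
  <td></td>
  <td>0x0000</td>
  <td>WM_NULL</td>
  </tr>
  <tr>
  <td>WM_ACTIVATEAPP</td>
  <td>0x001C</td>
  <td></td>
  <td>0x0001</td>
  <td>WM_CREATE</td>
  </tr>
HTML

puts enum_from_html(html)
```

To write the header file instead of printing, pass File.read of the input to the helper and File.write the result to the output path. If the fourth cell is the one you want, call enum_from_html(html, "foo", 3).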