Phlip
12/16/2008 3:48:00 AM
Kenneth McDonald wrote:
> I'd very much like to use ReXML's XPATH features to extract info from
> Google's financial info pages, but find that Rexml chokes on the
> Javascript, here's the result of trying to read in a page with this
> bit of code:
I have studied REXML for many years, and I still can't figure out how to get it
to recognize an &mdash; or similar advanced entity.
Like the other responder said, give up while you still can. libxml-ruby is also
stable enough to give a shot - oh yeah, except it crashes on non-tiny inputs.
Aaaand...
> /usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
> #<RuntimeError: Illegal character '&' in raw string
That's because REXML and your web browser disagree on the definition of
well-formed. Your browser accepts a naked & inside a <script> tag, but REXML
does not. REXML is technically correct, and your browser would have accepted
&amp;&amp; here, but...
> a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}
...browsers cannot correctly interpret &amp; appearing inside JavaScript literal
strings, because some lowlife coder using Notepad might have actually wanted
"&amp;" when they wrote "&amp;" - such as with document.write().
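For what it's worth, here is a minimal sketch of the disagreement, assuming the REXML that ships with a reasonably current Ruby. The parses? helper and the three sample strings are made up for illustration; they are not taken from the Google page.

```ruby
require "rexml/document"

# Hypothetical helper: true if REXML accepts the string as well-formed XML.
# REXML::ParseException is a subclass of RuntimeError, so this rescue
# covers the "Illegal character" failure.
def parses?(xml)
  REXML::Document.new(xml)
  true
rescue RuntimeError
  false
end

# The same bit of script, written three ways (all illustrative).
naked   = "<script>if (a&&b) { return false }</script>"
escaped = "<script>if (a&amp;&amp;b) { return false }</script>"
cdata   = "<script><![CDATA[if (a&&b) { return false }]]></script>"

puts "naked:   #{parses?(naked)}"
puts "escaped: #{parses?(escaped)}"
puts "cdata:   #{parses?(cdata)}"
```

The naked form should be the only one REXML refuses. The escaped and CDATA forms are what an XML parser wants to see, and real-world HTML pages rarely bother with either.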
So, because REXML cannot accept normal HTML, thanks to hit-and-miss standards
compliance on all sides, you are better off with a dedicated HTML parser!
--
Phlip