Phlip
8/21/2008 8:11:00 AM
Newb Newb wrote:
> I Need to Extract Img tag Using Regular Expressions From The Html Page
> <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
> Is This Code Would be ok
> if So How it can Be Implemented?
> Any Ideas??
A Regexp is not a parser; regular expressions cope badly with nested, structured
syntax such as HTML. The pattern above, for example, misses an unquoted
src=logo.png.
You need to write unit tests so you can "see" what you are doing. They will feed
samples of input to your parser, and assert the output contains no <img> tags.
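
A minimal sketch of such a test, assuming Minitest and REXML (both ship with
Ruby) rather than any particular project setup. The strip_images method and the
sample strings are hypothetical stand-ins for your own code and data, and REXML
only accepts well-formed XHTML:

```ruby
require 'minitest/autorun'
require 'rexml/document'

# Hypothetical function under test: removes every <img> element.
# Assumes well-formed XHTML input, because REXML is an XML parser.
def strip_images(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, '//img').each(&:remove)
  doc.to_s
end

class StripImagesTest < Minitest::Test
  # Illustrative samples only; replace them with markup your users really send.
  SAMPLES = [
    '<p>Hello <img src="a.png"/> world</p>',
    "<div><a href='/'><img alt='logo' src='b.gif'/></a></div>",
    '<p>No images here</p>'
  ].freeze

  def test_output_contains_no_img_tags
    SAMPLES.each do |sample|
      refute_match(/<img/i, strip_images(sample), "img survived in: #{sample}")
    end
  end

  def test_surrounding_text_is_kept
    assert_includes strip_images(SAMPLES.first), 'Hello'
  end
end
```

Each new kind of input that surprises you becomes one more entry in SAMPLES.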
I would load these strings into libxml-ruby or Hpricot documents, then use XPath
to seek '//img', then delete their nodes from the document, then write the
documents back. But note HTML supports several other ways to inject images,
including CSS styles, <object> tags, etc.
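
The load, seek, delete, and write-back steps might look roughly like this. The
sketch uses REXML from the Ruby distribution as a stand-in for libxml-ruby or
Hpricot, so it assumes well-formed XHTML; remove_images is a hypothetical name,
and it also returns the src values in case extracting them is the real goal:

```ruby
require 'rexml/document'

# Hypothetical helper: returns [cleaned_markup, removed_src_values].
# Assumes well-formed XHTML, because REXML is an XML parser.
def remove_images(xhtml)
  doc = REXML::Document.new(xhtml)

  # Seek '//img': every <img> element anywhere in the document.
  images = REXML::XPath.match(doc, '//img')

  # Collect the src attributes before the nodes go away.
  sources = images.map { |img| img.attributes['src'] }

  # Delete each node from its parent, then serialize the document back.
  images.each(&:remove)
  [doc.to_s, sources]
end

markup = '<div><p>Logo: <img src="logo.png" alt="Logo"/></p>' \
         '<img src="/a/b.gif"/></div>'
cleaned, sources = remove_images(markup)
puts cleaned
p sources
```

This only handles <img> elements; images injected through CSS or <object> tags
pass straight through.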
You need to ask your client how clean the HTML must be. If they say to only
allow <i>, <em>, <b>, or <strong> tags, for example, you could use XPath to seek
'//*', meaning all elements, then replace the tag name of every element not on
that list with <span>, delete all attributes, and write the document back.
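
Here is a rough sketch of that whitelist pass, again assuming REXML and
well-formed XHTML. ALLOWED_TAGS and scrub are names I made up for illustration,
and the allowed list should come from your client:

```ruby
require 'rexml/document'

# Hypothetical whitelist; adjust it to whatever the client permits.
ALLOWED_TAGS = %w[i em b strong].freeze

# Hypothetical helper: renames every non-whitelisted element to <span>
# and strips all attributes. Assumes well-formed XHTML input.
def scrub(xhtml)
  doc = REXML::Document.new(xhtml)

  # '//*' selects every element in the document, including the root.
  REXML::XPath.match(doc, '//*').each do |el|
    el.name = 'span' unless ALLOWED_TAGS.include?(el.name)

    # Copy the attribute names first so we do not mutate while iterating.
    el.attributes.keys.each { |key| el.delete_attribute(key) }
  end

  doc.to_s
end

puts scrub('<div class="x"><b onclick="evil()">bold</b> ' \
           '<a href="http://example.com">link</a></div>')
```

Renaming instead of deleting keeps the text content in place, so the visible
words survive even when their markup does not.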
Next, there might be gems out there to do this (or plugins), so you could google
for [rails scrub html], to just find one, and either raid its source, or install
and use it.
--
Phlip