Asp Forum - IMG (REGULAR EXPRESSIONS

Newb Newb

8/21/2008 6:29:00 AM

Hi all,
I need to extract img tags from an HTML page using regular expressions:
<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
Would this regexp be OK?
If so, how can it be implemented?
Any ideas?
--
Posted via http://www.ruby-....

3 Answers

Robert Klemme

8/21/2008 7:46:00 AM

2008/8/21 Newb Newb <hema@angleritech.com>:
> I need to extract img tags from an HTML page using regular expressions:
> <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
> Would this regexp be OK?

I would choose a different regexp.

> If so, how can it be implemented?

What exactly?

> Any ideas?

http://code.whytheluckystiff.ne...

Cheers

robert

--
use.inject do |as, often| as.you_can - without end

Phlip

8/21/2008 8:11:00 AM

Newb Newb wrote:

> I need to extract img tags from an HTML page using regular expressions:
> <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
> Would this regexp be OK?
> If so, how can it be implemented?
> Any ideas?

A regexp is not a parser; it strongly resists matching nested, structured
syntax such as HTML.
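
To make that concrete, here is a minimal sketch of the regexp from the question applied with String#scan. The /i flag, the helper name img_srcs, and the sample markup are assumptions added for illustration; they are not from the original post.

```ruby
# The pattern from the question, with /i added (an assumption) so that
# upper-case <IMG> tags match as well.
IMG_SRC = /<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1/i

# Hypothetical helper: returns the src value of every match.
# scan yields one [quote, src] pair per match because of the two groups.
def img_srcs(html)
  html.scan(IMG_SRC).map { |_quote, src| src }
end

html = <<~HTML
  <p><img src="a.png" alt="A"> <IMG class='x' src='b.gif'></p>
  <!-- <img src="commented-out.png"> -->
  <img src=unquoted.jpg>
HTML

p img_srcs(html)
```

On this sample the pattern picks up the two ordinary tags, but it also reports the image inside the HTML comment and misses the tag whose src is unquoted. Those are the kinds of cases a real parser handles and a regexp does not.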

You need to write unit tests so you can "see" what you are doing. They will feed
samples of input to your parser, and assert the output contains no <img tags.

I would load these strings into libxml-ruby or Hpricot documents, then use XPath
to seek '//img', then delete their nodes from the document, then write the
documents back. But note HTML supports several other ways to inject images,
including CSS styles, <object> tags, etc.
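
A rough sketch of that approach, using REXML here instead of libxml-ruby or Hpricot only because it ships with Ruby. It assumes the input is well-formed XHTML (REXML will raise on sloppy real-world HTML), and strip_images is a made-up helper name.

```ruby
require "rexml/document"

# Hypothetical helper: removes every <img> element from a well-formed
# XHTML fragment and returns the remaining markup as a string.
def strip_images(xhtml)
  doc = REXML::Document.new(xhtml)
  # XPath.match returns an array, so removing nodes while iterating is safe.
  REXML::XPath.match(doc, "//img").each(&:remove)
  doc.to_s
end

cleaned = strip_images('<div><p>Hi <img src="a.png"/> there</p><img src="b.png"/></div>')
puts cleaned

# The kind of check a unit test would make on the output.
raise "img tag survived" if cleaned.include?("<img")
```

This only covers <img> elements; images injected through CSS or <object> tags would need their own handling.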

You need to consult with your client about how clean your HTML needs to be. If
they say to allow only a few specific tags, for example, you could use XPath
to seek '//*', meaning all nodes, then replace any other tag names with a
permitted one, delete all their attributes, and write the document back.
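
One possible reading of that whitelist idea, again sketched with REXML on well-formed input. The ALLOWED list, the REPLACEMENT tag, and the scrub helper are all hypothetical; the real list would come from the client.

```ruby
require "rexml/document"

ALLOWED     = %w[p b i].freeze  # hypothetical whitelist, not from the thread
REPLACEMENT = "span"            # hypothetical stand-in for disallowed tags

# Hypothetical helper: renames every element that is not on the whitelist
# and strips the attributes from all elements.
def scrub(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, "//*").each do |el|
    el.name = REPLACEMENT unless ALLOWED.include?(el.name)
    # Copy the attribute names first, then delete them one by one.
    el.attributes.keys.each { |key| el.delete_attribute(key) }
  end
  doc.to_s
end

scrubbed = scrub('<p class="x">Hello <font color="red" onclick="evil()">world</font></p>')
puts scrubbed
```

Renaming rather than deleting keeps the text content in place, which is the point of the approach: the structure survives, but nothing the client did not approve does.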

Next, there might be gems out there to do this (or plugins), so you could google
for [rails scrub html], to just find one, and either raid its source, or install
and use it.

--
Phlip

Phlip

8/21/2008 8:14:00 AM

> You need to consult with your client about how clean your HTML needs to be.
> If they say to allow only a few specific tags, for example, you could use
> XPath to seek '//*', meaning all nodes, then replace any other tag names
> with a permitted one, delete all their attributes, and write the
> document back.

Another way to scrub input is to not allow raw HTML at all. Allow only a wiki
markup, such as RedCloth. Some wikis allow ''italic'' and '''bold''' content, and very
little else. Then you don't need to scrub it; you simply let the wiki engine
convert it to harmless read-only HTML.
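
As a toy illustration of the idea (this is not RedCloth's syntax or API, just a hand-rolled converter for the two markers mentioned above, with a made-up name wiki_to_html):

```ruby
# Hypothetical toy converter: escapes all raw HTML, then turns
# ''italic'' and '''bold''' markers into <i> and <b> elements.
def wiki_to_html(text)
  # Escape & first so the entities produced below are not double-escaped.
  html = text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
  # Handle the three-quote marker before the two-quote one.
  html = html.gsub(/'''(.+?)'''/, '<b>\1</b>')
  html.gsub(/''(.+?)''/, '<i>\1</i>')
end

puts wiki_to_html("''hi'' <img src=x> '''bold'''")
```

Because every angle bracket in the input is escaped before any markup is generated, the only tags that can reach the page are the ones the converter itself emits.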

comp.lang.ruby
