Asp Forum - HTML cleanup task

Victor 'Zverok' Shepelev

11/30/2006 1:09:00 PM

Hi all.

Sorry if the question seems dumb.

My task is: I have an HTML fragment. There are no guarantees about its
correctness, except that a tag itself is never cut in the middle:
This is possible: [</tr>.......] (fragment starts with a closing tag)
This is not: [tr>...........]

I need to do the following:
* Cut some tags together with their contents, e.g. all tables:
[before<table>inside</table>after] => [before after]
* Cut some tags but leave their content:
[before<div>after] => [before after]
* Make the other tags "consistent":
[before</p>after] => [before after]
[<p>before</p>after] => [<p>before</p>after]
.....

Can it be done with Hpricot? Or any other options?

Thanks.

V.

7 Answers

Dmitry Borodaenko

11/30/2006 2:21:00 PM

On 11/30/06, Victor Zverok Shepelev <vshepelev@imho.com.ua> wrote:
> My task is: I have an HTML fragment. There are no guarantees about its
> correctness, except that a tag itself is never cut in the middle:
(...)
> Can it be done with Hpricot? Or any other options?

Tried HTMLTidy[0]? Sometimes it tries to be too smart, but it has a
lot of options. The way I do it[1] probably won't suit you, but might
give you some ideas.

[0] http://rubyforge.org/proj...

[1] http://cvs.savannah.gnu.org/viewcvs/samizdat/samizdat/lib/samizdat/sanitize.r...

--
Dmitry Borodaenko

Victor 'Zverok' Shepelev

11/30/2006 2:26:00 PM

From: Dmitry Borodaenko [mailto:angdraug@gmail.com]
Sent: Thursday, November 30, 2006 4:21 PM
>On 11/30/06, Victor Zverok Shepelev <vshepelev@imho.com.ua> wrote:
>> My task is: I have an HTML fragment. There are no guarantees about its
>> correctness, except that a tag itself is never cut in the middle:
>(...)
>> Can it be done with Hpricot? Or any other options?
>
>Tried HTMLTidy[0]?

I haven't actually tried it, but I have thought about it.
The problem is that I need something really "small, smart and simple", not
"huge and almighty" (as Tidy seems to be).

But thanks for the advice.

V.

Paul Lutus

11/30/2006 5:16:00 PM

Victor "Zverok" Shepelev wrote:

> From: Dmitry Borodaenko [mailto:angdraug@gmail.com]
> Sent: Thursday, November 30, 2006 4:21 PM
>>On 11/30/06, Victor Zverok Shepelev <vshepelev@imho.com.ua> wrote:
>>> My task is: I have an HTML fragment. There are no guarantees about its
>>> correctness, except that a tag itself is never cut in the middle:
>>(...)
>>> Can it be done with Hpricot? Or any other options?
>>
>>Tried HTMLTidy[0]?
>
> I haven't actually tried it, but I have thought about it.
> The problem is that I need something really "small, smart and simple", not
> "huge and almighty" (as Tidy seems to be).

Not "huge and almighty" but "small, smart and simple" ... I believe that's
my cue.

Have you considered writing your own miniature library? Maybe, a library
consisting of 20 lines of Ruby instructions (regulars: note the absence of
a certain trigger word)?

Why not express the problem to be solved more explicitly and clearly?

And ... were the HTML pages written by humans or a machine? I ask because
machine-generated HTML tends to be more syntactically reliable.

If I can have a sufficiently clear statement of the problem to be solved, I
can suggest a solution -- or post one.

On re-reading your first post in this thread, I venture to say that the
pages are sufficiently disorganized that an ad hoc solution is the best
approach overall, one in which various regular expression filters are used
to extract essential page data, and the pages can then be reconstructed
using stricter HTML or XHTML syntax.
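
A minimal sketch of what such regular expression filters could look like,
assuming plain Ruby with no extra libraries. The helper names drop_elements,
unwrap_tags and tag_alternation are made up for illustration, and the patterns
are deliberately naive: they do not cope with nested elements of the same name
or with a ">" inside an attribute value.

```ruby
# Hypothetical helpers for illustration; not part of Hpricot or any library.

# Build a regex alternation such as "table|img" from a list of tag names.
def tag_alternation(names)
  names.map { |name| Regexp.escape(name) }.join('|')
end

# Remove the named elements together with their contents.
# Any leftover unmatched opening or closing tag of those elements goes too.
def drop_elements(html, names)
  alt = tag_alternation(names)
  html
    .gsub(/<(#{alt})\b[^>]*>.*?<\/\1\s*>/mi, ' ')  # <table ...> ... </table>
    .gsub(/<\/?(?:#{alt})\b[^>]*>/i, ' ')          # stray <table> or </table>
end

# Remove only the named tags themselves and keep whatever is between them.
def unwrap_tags(html, names)
  html.gsub(/<\/?(?:#{tag_alternation(names)})\b[^>]*>/i, ' ')
end

puts drop_elements('before<table>inside</table>after', %w[table])
# prints: before after
puts unwrap_tags('before<div>after', %w[div])
# prints: before after
```

Each removed tag is replaced by a single space so that the words on either
side do not run together, which matches the [before after] examples from the
first post.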

So, let's write some cod ... oops, I mean let's write a small library.

--
Paul Lutus
http://www.ara...

Victor 'Zverok' Shepelev

11/30/2006 8:04:00 PM

From: Paul Lutus [mailto:nospam@nosite.zzz]
Sent: Thursday, November 30, 2006 8:20 PM
>Victor "Zverok" Shepelev wrote:
>(...)
>Why not express the problem to be solved more explicitly and clearly?
>(...)
>If I can have a sufficiently clear statement of the problem to be solved,
>I can suggest a solution -- or post one.

OK, here's a model of what I'm doing: a small app that interacts with
dictionaries like Wikipedia:
* the user inputs something like "w matz"
* the software downloads the first lines of http://en.wikipedia.org...
(the first one or two meaningful paragraphs) and displays them.

What to download and show is set by simple templates (regexps for
now, but maybe something XPath-like later).

Now we have some part of the page and need to delete all tables, images, and
so on, strip all "non-content" tags (everything but p, ul, ol, li, b, i...),
and end up with "consistent" HTML to show.

It is a task definition.
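
A possible sketch of the "strip non-content tags and make it consistent" part,
assuming plain Ruby and a fixed whitelist. The name balance_tags and the
ALLOWED_TAGS list are made up for illustration (the list above ends in "...",
so the real whitelist is an open question). Void elements such as br are not
handled, and a literal "<" in the text would confuse the tokenizer.

```ruby
# Hypothetical whitelist; extend as needed.
ALLOWED_TAGS = %w[p ul ol li b i].freeze

# Keep only whitelisted tags and make sure every kept tag is properly closed.
def balance_tags(html, allowed = ALLOWED_TAGS)
  out = String.new
  open_tags = [] # stack of element names that are currently open

  # The capture group makes split return the tags as separate tokens.
  html.split(/(<[^>]*>)/).each do |token|
    unless token.start_with?('<')
      out << token # plain text passes through unchanged
      next
    end

    m = token.match(/\A<(\/?)([a-z][a-z0-9]*)\b[^>]*>\z/i)
    name = m && m[2].downcase
    unless name && allowed.include?(name)
      out << ' ' # not whitelisted (or a comment, doctype, ...): drop it
      next
    end

    if m[1].empty?
      open_tags.push(name) # opening tag: remember it and keep it
      out << token
    elsif open_tags.include?(name)
      # Closing tag with a matching opener: first close everything that was
      # opened after it, then the element itself.
      loop do
        top = open_tags.pop
        out << "</#{top}>"
        break if top == name
      end
    else
      out << ' ' # stray closing tag: drop it
    end
  end

  # Close whatever is still open at the end of the fragment.
  out << "</#{open_tags.pop}>" until open_tags.empty?
  out
end

puts balance_tags('before</p>after')
# prints: before after
puts balance_tags('<p>before</p>after')
# prints: <p>before</p>after
```

A stack is used instead of more regular expressions because whether a closing
tag is "stray" depends on what was opened before it, which a single pattern
cannot easily express.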

The task may vary for different dictionaries. For ex., with some
dictionaries tables must not be deleted, but "normalized":
"<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"
Or even XHTMLish "<table><tr><td>text1</td><td>text2</td></tr></table>"

V.

Paul Lutus

11/30/2006 8:58:00 PM

Victor "Zverok" Shepelev wrote:

/ ...

> Now we have some part of the page and need to delete all tables, images,
> and so on, strip all "non-content" tags (everything but p, ul, ol, li, b,
> i...), and end up with "consistent" HTML to show.

Easy to say in one word, but that one word cannot be turned into code.

> It is a task definition.
>
> The task may vary for different dictionaries. For ex., with some
> dictionaries tables must not be deleted, but "normalized":
> "<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"

Both the before and after forms show big syntax errors. I hope you
understand HTML syntax; if not, this may be more difficult than I thought.

> Or even XHTMLish "<table><tr><td>text1</td><td>text2</td></tr></table>"

Well, your description of the problem is way too general for any progress
toward a solution.

Perhaps you could post what you consider to be the desired end result for a
particular entry from the "dictionary" site of your choice.

By the way (my boilerplate remark about page scraping), if this is for any
purpose other than your own personal use, it represents a copyright
problem.

I want to emphasize this is not difficult at all, once there is a clear
statement of purpose. It can be done in a few (maybe a few dozen) lines of
Ruby code.

--
Paul Lutus
http://www.ara...

Victor 'Zverok' Shepelev

11/30/2006 9:40:00 PM

From: Paul Lutus [mailto:nospam@nosite.zzz]
Sent: Thursday, November 30, 2006 11:00 PM
>Victor "Zverok" Shepelev wrote:
>
>> It is a task definition.
>>
>> The task may vary for different dictionaries. For ex., with some
>> dictionaries tables must not be deleted, but "normalized":
>> "<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"
>
>Both the before and after forms show big syntax errors. I hope you
>understand HTML syntax; if not, this may be more difficult than I thought.

I understand HTML syntax, and I see no problem in the above.
The closing tags for <tr> and <td> are both optional in the W3C HTML 4.01 spec.
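
Because those end tags are optional, the "XHTMLish" form can be derived
mechanically. A rough sketch, assuming plain Ruby; close_optional_tags and the
IMPLIED_END table are made-up names, and the table covers only td and tr, not
the full set of HTML 4.01 rules (th, tbody, li, p and so on are left out). It
does not wrap a bare "<td>..." fragment in <table><tr>.

```ruby
# Which open elements an opening tag implicitly ends (assumed, partial subset).
IMPLIED_END = {
  'td' => %w[td],
  'tr' => %w[td tr]
}.freeze

# Insert the </td> and </tr> end tags that the input left out.
def close_optional_tags(html)
  out = String.new
  stack = [] # names of the elements that are currently open

  html.split(/(<[^>]*>)/).each do |token|
    m = token.match(/\A<(\/?)([a-z][a-z0-9]*)\b[^>]*>\z/i)
    if m.nil?
      out << token # text (or a comment etc.) is copied as-is
      next
    end

    name = m[2].downcase
    if m[1].empty?
      # An opening <td> or <tr> ends any cell or row that is still open.
      closes = IMPLIED_END.fetch(name, [])
      out << "</#{stack.pop}>" while closes.include?(stack.last)
      stack.push(name)
      out << token
    elsif stack.include?(name)
      # Explicit end tag: first close everything opened inside the element.
      out << "</#{stack.pop}>" until stack.last == name
      stack.pop
      out << token
    end
    # An end tag with no matching opener is silently dropped in this sketch.
  end

  # Close whatever is still open at the end of the fragment.
  out << "</#{stack.pop}>" until stack.empty?
  out
end

puts close_optional_tags('<table><tr><td>text1<td>text2</table>')
# prints: <table><tr><td>text1</td><td>text2</td></tr></table>
```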

>Perhaps you could post what you consider to be the desired end result for a
>particular entry from the "dictionary" site of your choice.

OK. Here it is:
Source page: http://en.wikipedia.org/wi...
Start pattern: 
End pattern: <h2>
Elements to exclude: tables, images.
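
The start/end pattern step could be sketched like this, assuming plain Ruby.
The helper name extract_between and the sample page are made up, and since the
start pattern is blank above, the '<p>' used here is only a placeholder
assumption. No downloading is shown; the page is taken to be a string already.

```ruby
# Hypothetical helper: cut the part of a page between two patterns.
# Returns the text from the first match of start_pat up to (not including)
# the next match of end_pat, or up to the end of the page if end_pat never
# matches. Returns nil when start_pat is not found.
def extract_between(page, start_pat, end_pat)
  start_at = page.index(start_pat)
  return nil if start_at.nil?

  stop_at = page.index(end_pat, start_at) || page.length
  page[start_at...stop_at]
end

# Made-up page for illustration.
page = '<h1>Ukraine</h1><table><tr><td>infobox</td></tr></table>' \
       '<p><b>Ukraine</b> is a country.</p><h2>History</h2><p>Later text</p>'

puts extract_between(page, '<p>', /<h2\b/i)
# prints: <p><b>Ukraine</b> is a country.</p>
```

Both patterns may be plain strings or regexps, so the templates can stay
regexp-based as described earlier in the thread.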

Desired output (with the text in the middle of the paragraph skipped):
---------------------------
<p><b>Ukraine</b> (<a href="/wiki/Ukrainian_language" title="Ukrainian language">Ukrainian</a>: <span lang="uk" xml:lang="uk">Україна</span>, <i>Ukraina</i>, <span title="Pronunciation in IPA" class="IPA">/ukra'jina/</span>) is a <a href="/wiki/Country" title="Country">country</a> in <a href="/wiki/Eastern_Europe" title="Eastern Europe">Eastern Europe</a>.
....
It became independent again after the <a href="/wiki/History_of_the_Soviet_Union_%281985-1991%29" title="History of the Soviet Union (1985-1991)">Soviet Union's collapse</a> in 1991.</p>
---------------------------

That's all.

>By the way (my boilerplate remark about page scraping), if this is for any
>purpose other than your own personal use, it represents a copyright
>problem.

My application would be a kind of browser (a nano-browser); I don't want to "grab" dictionaries.

>I want to emphasize this is not difficult at all, once there is a clear
>statement of purpose. It can be done in a few (maybe a few dozen) lines of
>Ruby code.

I know. I'm not a nuby (my poor language in mails is due to natural-language problems, not a lack of knowledge).
I just asked about existing libraries.

V.

Paul Lutus

12/1/2006 12:52:00 AM

Victor "Zverok" Shepelev wrote:

/ ...

> That's all.

Your project is extremely ambitious, and will outstrip all but the most
ambitious, dedicated effort. Every location -- indeed, every page -- you
visit will require different filtering.

Good luck with your project.

--
Paul Lutus
http://www.ara...

comp.lang.ruby
