Asp Forum - Searching the visual appearance of a Web page?

Dr J R Stockton

2/29/2016 11:15:00 PM

I have a reference to the body element of a local Web page, and can
assume that body.onload() has finished. I also have a RegExp, which has
been defined from the value of an input type=text element.

I want to apply that RegExp to the whole displayed text, all at once or
piecemeal, and get all of the matches. I have been using the match
method on body.innerText, body.innerHTML, or body.textContent, which was
good enough to do what I wanted, but not ideal.

For example, the text up<br>on in the HTML source must be treated as the
two words "up on" and not the one word "upon". And, if practical,
"câm" should be treated as a three-letter word.

The immediate aim is to use something like /\b[A-Z]{4,}\b/gi to find all
upper-case "word"s of four or more letters, in order to discover most of
the acronyms without too many false positives or negatives, so that a
list of them can be converted into, or used to check, a Glossary.
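
One caveat about the pattern as written: the "i" flag makes [A-Z] match lower-case letters as well, so it would return every word of four or more letters rather than only the upper-case ones. A minimal sketch without that flag follows; the name findAcronymCandidates and the sample string are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: list the distinct upper-case words of 4+ letters.
function findAcronymCandidates(text) {
  // No "i" flag: with it, [A-Z] would also match lower-case letters.
  const pattern = /\b[A-Z]{4,}\b/g;
  const matches = text.match(pattern) || [];
  // A Set removes repeats; sorting makes two lists easy to compare.
  return Array.from(new Set(matches)).sort();
}

// Made-up sample text. "Nato" is mixed case and "HTML5" has no word
// boundary after "HTML", so neither is reported.
const sample = "NATO and UNESCO met; Nato was absent, but NATO was not. HTML5 too.";
console.log(findAcronymCandidates(sample)); // logs NATO and UNESCO
```

Whether something like "HTML5" ought to count as a candidate is a design choice; the pattern would need a digit class added to catch it.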

It can be assumed that the authors of the pages are not trying to delude
this searcher.

How should it best be done, in outline?
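
One possible outline for the "piecemeal" route is to walk the subtree, concatenate the text nodes, and insert a line break wherever an element such as <br> separates them. The sketch below is only that, a sketch: collectText and the two tag lists are hypothetical, the lists are far from complete, and CSS (for example display:none) is ignored. Character references need no special handling here, because the parser has already decoded them by the time a text node exists.

```javascript
// Hypothetical sketch: build a searchable string from a DOM subtree.
// Only nodeType, nodeName, nodeValue and childNodes are used, so it
// accepts real DOM nodes as well as plain objects shaped like them.
const SKIPPED = new Set(["SCRIPT", "STYLE", "NOSCRIPT"]);
const BREAKING = new Set(["BR", "P", "DIV", "LI", "TD", "TR",
                          "H1", "H2", "H3", "PRE"]);

function collectText(node) {
  if (node.nodeType === 3) {        // text node
    return node.nodeValue;
  }
  if (node.nodeType !== 1) {        // comments etc. show no text
    return "";
  }
  const name = String(node.nodeName).toUpperCase();
  if (SKIPPED.has(name)) {
    return "";
  }
  let text = "";
  for (let i = 0; i < node.childNodes.length; i++) {
    text += collectText(node.childNodes[i]);
  }
  // The added line breaks keep "up<br>on" from collapsing into "upon".
  return BREAKING.has(name) ? "\n" + text + "\n" : text;
}

// In a page, the call would be something like collectText(document.body),
// with the RegExp then applied to the returned string.
```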

--
(c) John Stockton, Surrey, UK. ¬@merlyn.demon.co.uk Turnpike v6.05 MIME.
Merlyn Web Site < > - FAQish topics, acronyms, & links.

5 Answers

Martin Honnen

3/1/2016 10:39:00 AM

Dr J R Stockton wrote:
> I have a reference to the body element of a local Web page, and can
> assume that body.onload() has finished. I also have a RegExp, which has
> been defined from the value of an input type=text element.
>
> I want to apply that RegExp to the whole displayed text, all at once or
> piecemeal, and get all of the matches. I have been using the match
> method on body.innerText, body.innerHTML, or body.textContent, which was
> good enough to do what I wanted, but not ideal.
>
> For example, the text up<br>on in the HTML source must be treated as the
> two words "up on" and not the one word "upon". And, if practical,
> "câm" should be treated as a three-letter word.

innerText should give you a plain string in which e.g. <br> has been
converted to a new line character and a character reference to its
character. The only drawback is that Firefox in its current version 44
does not support it, but according to
http://perfectionkills.com/the-poor-misunderstood-... in Firefox
45 we will see support. So doing the regular expression search on
body.innerText seems like the most promising approach. Or why did
"innerText" not give you the ideal result, unless you needed it with
Mozilla browsers?
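
Once a character reference has become a real character, the pattern still has to count it as a letter. In JavaScript, \b and [A-Z] know only ASCII, so "câm" is seen as "c" and "m" with a non-word character between them. The following is a minimal sketch, assuming an engine that supports ES2018 Unicode property escapes, which is newer than the browsers discussed here; findWords is a hypothetical name.

```javascript
// Hypothetical helper: find runs of Unicode letters of a minimum length.
// Assumes ES2018 support for \p{...} escapes (they need the "u" flag).
function findWords(text, minLength) {
  // NFC folds "a" + combining circumflex into the single letter "â",
  // so both spellings of "câm" count as three letters.
  const normalized = text.normalize("NFC");
  // The greedy quantifier takes a whole run of letters at once, so no
  // explicit word-boundary assertion is needed.
  const pattern = new RegExp("\\p{L}{" + minLength + ",}", "gu");
  return normalized.match(pattern) || [];
}

console.log(findWords("up on c\u00E2m", 3));
```

For upper-case letters only, \p{Lu} could be used in place of \p{L}.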

> The immediate aim is to use something like /\b[A-Z]{4,}\b/gi to find all
> upper-case "word"s of four or more letters, in order to discover most of
> the acronyms without too many false positives or negatives, so that a
> list of them can be converted into, or used to check, a Glossary.

> How should it best be done, in outline?

As the article above suggests, an alternative would be to get the text
of a selection, the article suggests

function getSelectionString(el, win) {
  win = win || window;
  var doc = win.document, sel, range, prevRange, selString;
  if (win.getSelection && doc.createRange) {
    // Standard Selection/Range API: remember any existing selection.
    sel = win.getSelection();
    if (sel.rangeCount) {
      prevRange = sel.getRangeAt(0);
    }
    // Select the element's contents and read them back as text.
    range = doc.createRange();
    range.selectNodeContents(el);
    sel.removeAllRanges();
    sel.addRange(range);
    selString = sel.toString();
    // Restore the previous selection, if there was one.
    sel.removeAllRanges();
    prevRange && sel.addRange(prevRange);
  }
  else if (doc.body.createTextRange) {
    // Old IE: a TextRange exposes its text directly.
    range = doc.body.createTextRange();
    range.moveToElementText(el);
    selString = range.text;
  }
  return selString;
}

as an implementation. I have tried that in
https://jsfiddle.net/v... and it seems to do the "<br>" to new-line
conversion in Mozilla, so perhaps you could check
if (typeof document.body.innerText != 'undefined')
to work with innerText where supported and use that selection approach
in Mozilla browsers to get the text.
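
That check could be wrapped up roughly as follows. This is only a sketch: getDisplayedText is a made-up name, and the fallback is passed in as a parameter so that a selection-based function such as getSelectionString can be supplied by the caller.

```javascript
// Hypothetical wrapper: prefer innerText where the browser supports it,
// otherwise hand the element to a caller-supplied fallback function.
function getDisplayedText(el, fallback) {
  if (typeof el.innerText != 'undefined') {
    return el.innerText;
  }
  return fallback(el);
}

// In a page, usage might look like this:
//   var text = getDisplayedText(document.body, getSelectionString);
//   var matches = text.match(re) || [];   // re: the user's RegExp
```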

Bart Van der Donck

3/2/2016 8:49:00 AM

Dr J R Stockton

3/3/2016 11:26:00 PM

In comp.lang.javascript message <nb3rgk$ial$1@news.albasani.net>, Tue, 1
Mar 2016 11:39:17, Martin Honnen <mahotrash@yahoo.de> posted:

>Dr J R Stockton wrote:
>> I have a reference to the body element of a local Web page, and can
>> assume that body.onload() has finished. I also have a RegExp, which has
>> been defined from the value of an input type=text element.
>>
>> I want to apply that RegExp to the whole displayed text, all at once or
>> piecemeal, and get all of the matches. I have been using the match
>> method on body.innerText, body.innerHTML, or body.textContent, which was
>> good enough to do what I wanted, but not ideal.
>>
>> For example, the text up<br>on in the HTML source must be treated as the
>> two words "up on" and not the one word "upon". And, if practical,
>> "câm" should be treated as a three-letter word.
>
>innerText should give you a plain string in which e.g. <br> has been
>converted to a new line character and a character reference to its
>character. The only drawback is that Firefox in its current version 44
>does not support it, but according to http://perfectionkill...
>poor-misunderstood-innerText/ in Firefox 45 we will see support. So
>doing the regular expression search on body.innerText seems like the
>most promising approach. Or why did "innerText" not give you the ideal
>result, unless you needed it with Mozilla browsers?

I habitually use the most recent ordinary release of Firefox - Firefox
has "Zoom Text Only" which I consider essential.

I'm using WinXP sp3, and reading files into an iframe, directly from the
disc without any server. Early Chrome was OK, but Chrome 4.0
& later does not do that, it misapplies "Same domain Policy". Something
fails in Opera 35, with a decent error message; but Opera 12.18 is OK.
IE8 is OK. Vivaldi gives a message like Opera 35, then falls over dead.

I get the content of the page variously with textContent, innerText, and
innerHTML - and should review those after Firefox 45 appears.

So, as Firefox 45 is due out on Tuesday ...

One use is to search for candidate acronyms for listing across a web
site master - upper-case words of sensible length. I should be able to
arrange to maintain a list of acronyms already known (e.g. NATO) and
non-acronyms that look like that (e.g. KABOOM), and then report only the
ones not listed there ...
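
The list-keeping step might be sketched like this. The two sets hold only the examples named above; the names knownAcronyms, knownNonAcronyms and unlistedCandidates are hypothetical, and how the lists would be stored between runs is left open.

```javascript
// Hypothetical lists; the entries are the two examples from the post.
const knownAcronyms = new Set(["NATO"]);
const knownNonAcronyms = new Set(["KABOOM"]);

// Return the candidates that appear in neither list, without repeats,
// in the order in which they were first found.
function unlistedCandidates(candidates) {
  const seen = new Set();
  return candidates.filter(function (word) {
    if (knownAcronyms.has(word) || knownNonAcronyms.has(word) || seen.has(word)) {
      return false;
    }
    seen.add(word);
    return true;
  });
}

console.log(unlistedCandidates(["NATO", "UNESCO", "KABOOM", "UNESCO"]));
```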

For amusement - a few years ago, Firefox and Opera would not load page
X.HTM into an IFRAME in page X.HTM, but others would. The workround, to
read itself rather than a copy of itself, is still in place, and I
wonder whether it is now needed in none, some, or all of the current
browsers.

Thanks.

> ...


Michael Haufe ("TNO")

3/4/2016 12:28:00 AM

On Thursday, March 3, 2016 at 5:42:47 PM UTC-6, Dr J R Stockton wrote:

> I'm using WinXP sp3, and reading files into an iframe, directly from the
> disc without any server. Early Chrome was OK, but Chrome 4.0
> & later does not do that, it misapplies "Same domain Policy". Something
> fails in Opera 35, with a decent error message; but Opera 12.18 is OK.
> IE8 is OK. Vivaldi gives a message like Opera 35, then falls over dead.

You should look at the new standard APIs available:

<https://developer.mozilla.org/en-US/docs/Using_files_from_web_applic...

Dr J R Stockton

3/5/2016 11:38:00 PM

In comp.lang.javascript message
<977b2a8d-ed38-4e1e-9509-1bf20f45936f@googlegroups.com>, Thu, 3 Mar 2016
16:27:38, "Michael Haufe (TNO)" <tno@thenewobjective.com> posted:

>On Thursday, March 3, 2016 at 5:42:47 PM UTC-6, Dr J R Stockton wrote:
>
>> I'm using WinXP sp3, and reading files into an iframe, directly from the
>> disc without any server. Early Chrome was OK, but Chrome 4.0
>> & later does not do that, it misapplies "Same domain Policy". Something
>> fails in Opera 35, with a decent error message; but Opera 12.18 is OK.
>> IE8 is OK. Vivaldi gives a message like Opera 35, then falls over dead.
>
>You should look at the new standard APIs available:
>
><https://developer.mozilla.org/en-US/docs/Using_files_from_web_applic...

Too much new stuff there for me, I fear. I cannot even see whether what
is on that page, or an associated page, allows me to open a file by code
which uses a filename string in a way I don't already know.

To be fair on Vivaldi, it does not fall over dead in Windows 7, though
it also does so when updating itself in Win XP.


comp.lang.javascript
