Asp Forum - Script to fetch Wikipedia text

Kyrre Nygård

10/11/2006 10:27:00 AM

Hey!

I'm involved in a few research projects, and like to keep my
information well organized. I get most of it from Wikipedia, but
I hate printing HTML articles to PDF. I'd rather have them as pure,
well laid out text, and I'm sure others would too. Being able to
master one's knowledge provides a warm inner peace.

Hence I've tried dumping the output from text browsers such as w3m,
elinks, lynx etc. I am, however, only interested in the articles
themselves, not their links, views, toolboxes, search bars, other
available languages and so on. I tried running a whole bunch of
regular expressions over the output, but that really felt like the hard way.

So some guy gave me this:

#!/usr/bin/env ruby

require 'rexml/document'
require 'cgi'
require 'tempfile'
require 'open-uri'

url = 'http://en.wikipedia.org/wiki/Special:Ex... +
CGI::escape(ARGV.join(" ").strip.squeeze(' ').tr(' ',
'_')).gsub(/%3[Aa]/,':').gsub(/%2[Ff]/,'/').gsub(/%23/,'#')

open(url) { |f|
puts REXML::XPath.first(REXML::Document::new(f.class == Tempfile ?
f.open : f), '//text').text
}

It seems to take advantage of Wikipedia's special export feature,
which really seems cool. However, there are a few issues. First, the
script looks kinda complex. I'm sure there's a simpler way of writing
it. Second, it does not yet output the kind of pure and well laid out
text it should. For instance, on
http://en.wikipedia.org/wik..., it outputs:

########## BEGIN

{{Infobox_Software
| name = GNU Hurd
| logo = [[Image:Hurd-logo.png]]<br />
| developer = [[Thomas Bushnell| Michael (now Thomas) Bushnell]]
(original developer) and various contributors
| latest_release_version =
| latest_release_date =
| operating_system = [[GNU]]
| genre = [[Kernel (computer science)|Kernel]]
| family = [[POSIX]]-conformant [[Unix]]-Clones
| kernel_type = [[Microkernel]]
| license = [[GNU General Public License|GPL]]
| source_model = [[Free software]]
| working_state = In production / development
| website = [http://w.../software/hurd... www.gnu.org]
}}
{{redirect|Hurd}}
'''The GNU Hurd''' is a computer operating system [[Kernel (computer
science)|kernel]]. It consists of a set of [[Server
(computing)|servers]] (or [[daemon (computer software)|daemons]], in
[[Unix]]-speak) that work on top of either the [[GNU Mach]]
[[microkernel]] or the [[L4 microkernel family|L4 microkernel]];
together, they form the [[kernel (computer science)|kernel]] of the
[[GNU]] [[operating system]]. It has been under development since
[[1990]] by the [[GNU]] Project and is distributed as [[free
software]] under the [[GNU General Public License|GPL]]. The Hurd
aims to surpass [[Unix]] kernels in functionality, security, and
stability, while remaining largely compatible with them. This is done
by having the Hurd track the [[POSIX]] specification, while avoiding
arbitrary restrictions on the user.

"HURD" is an indirectly [[recursive acronym]], standing for "HIRD of
[[Unix]]-Replacing [[Daemon (computer software)|Daemons]]", where
"HIRD" stands for "HURD of Interfaces Representing Depth". It is also
a play of words to give "[[herd]] of [[wildebeest|gnus]]" reflecting
how it works.

==Development history==
Development on the GNU operating system began in 1984 and progressed
rapidly. By the early 1990s, the only major component missing was the kernel.

Development on the Hurd began in [[1990]], after an abandoned kernel
attempt started from the finished research [[Trix (kernel)|Trix]]
operating system developed by Professor [[Steve Ward (Computer
Scientist)| Steve Ward]] and his group at [[Massachusetts Institute
of Technology| MIT]]'s [[Laboratory for Computer Science]] (LCS).
According to [[Thomas Bushnell| Michael (now
Thomas) Bushnell]], the initial Hurd architect, their early plan was
to adapt the [[BSD]] 4.4-Lite kernel and, in hindsight, "It is now
perfectly obvious to me that this would have succeeded splendidly and
the world would be a very different place today".<ref>{{cite web |
url = http://www.groklaw.net/article.php?story=2005072... |
title = The Hurd and BSDI|accessdate = 2006-08-08 | author = Peter H.
Salus | work = The Daemon, the GNU and the Penguin}}</ref> However,
due to a lack of cooperation from the [[University of California,
Berkeley|Berkeley]] programmers, [[Richard Stallman]] decided instead
to use the [[Mach microkernel]], which subsequently proved
unexpectedly difficult, and the Hurd's development proceeded slowly.

########## END

This should instead be something like:

########## BEGIN

http://en.wikipedia.org/wik...

Name = GNU Hurd
Developer = Thomas Bushnell (original developer) and various contributors
Operating_system = GNU
Genre = Kernel (computer science)
Family = POSIX-conformant Unix-Clones
Kernel type = Microkernel
License = GNU General Public License
Source model = Free software
Working state = In production / development
Website = http://w.../software/hurd...
http://w...

The GNU Hurd is a computer operating system. It consists of a set of
servers (or daemons, in Unix-speak) that work on top of either the
GNU Mach microkernel or the L4 microkernel; together, they form the
kernel of the GNU operating system. It has been under development
since 1990 by the GNU Project and is distributed as free software
under the GPL. The Hurd aims to surpass Unix kernels in
functionality, security, and stability, while remaining largely
compatible with them. This is done by having the Hurd track the POSIX
specification, while avoiding arbitrary restrictions on the user.

``HURD'' is an indirectly recursive acronym, standing for ``HIRD of
Unix-Replacing Daemons", where ``HIRD'' stands for ``HURD of
Interfaces Representing Depth". It is also a play of words to give
``herd of gnus'' reflecting how it works.

Development history

Development on the GNU operating system began in 1984 and progressed
rapidly. By the early 1990s, the only major component missing was the kernel.

Development on the Hurd began in 1990, after an abandoned kernel
attempt started from the finished research Trix operating system
developed by Professor Steve Ward and his group at MIT's Laboratory
for Computer Science (LCS). According to Michael (now Thomas)
Bushnell, the initial Hurd architect, their early plan was to adapt
the BSD 4.4-Lite kernel and, in hindsight, "It is now perfectly
obvious to me that this would have succeeded splendidly and the world
would be a very different place today". However, due to a lack of
cooperation from the Berkeley programmers, Richard Stallman decided
instead to use the Mach microkernel, which subsequently proved
unexpectedly difficult, and the Hurd's development proceeded slowly.

########## END

Looks real gorgeous, doesn't it? If only I were skilled enough to do
this myself. Which brings me to my question: Is anybody out there
willing to help me fix my script?

Thanks a lot,
Kyrre
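
On the first issue (the script looking complex), the XML step can probably be separated from the URL handling. What follows is only a minimal sketch: it assumes the export XML has already been downloaded, the helper name wikitext_from_export is made up for illustration, and the sample document is a heavily shortened, hypothetical stand-in for a real export.

```ruby
require 'rexml/document'

# Return the wikitext of the first <text> element in an export-style XML
# document, or nil if there is none. Accepts a String or an IO.
# (Hypothetical helper name, not part of any library.)
def wikitext_from_export(xml)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, '//text')
  node && node.text
end

# Hypothetical, heavily shortened stand-in for an exported page.
sample = <<~XML
  <mediawiki>
    <page>
      <title>GNU Hurd</title>
      <revision>
        <text>'''The GNU Hurd''' is a [[Kernel (computer science)|kernel]].</text>
      </revision>
    </page>
  </mediawiki>
XML

puts wikitext_from_export(sample)
```

The result is still raw wikitext, so the second issue (cleaning up the markup) remains a separate step.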

4 Answers

user@domain.invalid

10/11/2006 11:43:00 AM

Hi,

Forget regexps and use a dedicated tool like
http://rubyforge.org/projec...

I've successfully built several scraper robots with it and it works just
fine! The only drawback is too little documentation, at least for me!

Peter Szinek

10/11/2006 12:23:00 PM

Hi,

Well, in my opinion it is still far easier to work on the exported
page text (i.e. remove MediaWiki 'markup' from there) than to scrape
HTML. Tools like scrAPI (or any reasonably working web extraction tool I
know of) are great if your input page is machine-generated and more or
less regular.

However, a wiki page does not fit into this category at all, since it is
not machine-generated and can be totally irregular. For some pages you
could do well with HTML scraping (maybe even a lot of them), but IMHO it
would be *very* hard to develop a really generic solution, working for,
say, 99% of the pages. (Unless all needed text blocks of Wikipedia can be
queried with the same CSS selector, and no other blocks of HTML would be
matched - this is possible, but quite unlikely.)

OTOH, the exported text is not that hairy - at first sight you
would not need more than 10 regexes to clean up the 'markup' (like [[
]], {{ }} etc). I did not say it is a trivial task, but once you do it,
it will always work properly - whereas with HTML scraping you
can never be sure. Or did I misunderstand the problem?
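
To give a rough idea of what such a handful of regexes might look like, here is a sketch. It is not a complete MediaWiki parser: the function name strip_wiki_markup and the choice of constructs are assumptions made for illustration, and tables, captioned images with nested links and many other constructs are left alone.

```ruby
# Strip a handful of common MediaWiki constructs from wikitext.
# A rough sketch only; the function name is hypothetical.
def strip_wiki_markup(wikitext)
  text = wikitext.gsub(%r{<ref[^>]*/>}, '')          # self-closing <ref ... />
  text = text.gsub(%r{<ref[^>]*>.*?</ref>}m, '')     # <ref>...</ref> footnotes

  # Remove {{templates}} from the inside out so nested ones go too.
  loop { break unless text.gsub!(/\{\{[^{}]*\}\}/, '') }

  text.gsub!(/\[\[(?:Image|File):[^\[\]]*\]\]/i, '')                 # simple image links
  text.gsub!(/\[\[[^\[\]|]*\|([^\[\]]*)\]\]/) { $1.strip }           # [[target|label]] -> label
  text.gsub!(/\[\[([^\[\]|]*)\]\]/) { $1 }                           # [[target]] -> target
  text.gsub!(/\[https?:\/\/\S+\s+([^\]]*)\]/) { $1 }                 # [http://url label] -> label
  text.gsub!(/'{2,5}/, '')                                           # bold / italic quotes
  text.gsub!(/^(=+)[ \t]*(.*?)[ \t]*\1[ \t]*$/) { $2 }               # ==Heading== -> Heading
  text
end

sample = "'''The GNU Hurd''' is a [[Kernel (computer science)|kernel]] " \
         "of the [[GNU]] system.<ref>{{cite web | url = x}}</ref>\n" \
         "==Development history==\nBegan in [[1990]]."

puts strip_wiki_markup(sample)
```

The order matters here: footnotes are removed before templates because a footnote usually wraps a {{cite ...}} template, and piped links are handled before plain ones so the label wins over the target.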

Just to make sure: I am not speaking against scrAPI (which is cool) or
web extraction (I am working and researching Web extraction on a daily
basis) - but from what I understood I think it is easier to mine the
exported text in this case.
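
As one example of mining the exported text, the infobox at the top of the sample output could be turned into "Label = value" lines roughly as follows. This is a sketch under assumptions: the helper name infobox_fields is invented, fields are assumed to sit on lines starting with "|", and a line without a leading "|" is treated as the continuation of the previous value. The values are returned as found, so any link markup in them would still need cleaning.

```ruby
# Parse the "| key = value" lines of an infobox-style template into
# [label, value] pairs, skipping empty fields. (Hypothetical helper.)
def infobox_fields(template)
  fields = []
  template.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('{{') || line == '}}'

    if (m = line.match(/\A\|\s*([^=|]+?)\s*=\s*(.*)\z/))
      fields << [m[1], m[2]]
    elsif fields.any?
      # A line without a leading "|" continues the previous value.
      fields.last[1] = "#{fields.last[1]} #{line}".strip
    end
  end

  fields.reject { |_, value| value.empty? }
        .map { |key, value| [key.tr('_', ' ').capitalize, value] }
end

infobox = <<~TEXT
  {{Infobox_Software
  | name = GNU Hurd
  | developer = Thomas Bushnell
  (original developer) and various contributors
  | latest_release_version =
  | kernel_type = Microkernel
  }}
TEXT

infobox_fields(infobox).each { |label, value| puts "#{label} = #{value}" }
```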

Just my 2c.

Peter
http://www.rubyra...

Zouplaz wrote:
> Hi,
>
> forget regexp and use a dedicated tool like
> http://rubyforge.org/projec...
>
> I've successfully built several scraper robots with it and it works just
> fine! The only drawback is too little documentation, at least for me!
>
>

Jano Svitok

10/11/2006 12:37:00 PM

This might be of interest to you:
http://rc3.org/2006/08/fun_with_...

Timothy Goddard

10/11/2006 7:44:00 PM

It sounds like the Wikipedia API would be best for this. Visit the
script without parameters at http://en.wikipedia.org/w... for
instructions on its use. For Ruby programs you're probably best off
retrieving YAML data as it's easiest to work with.

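require 'open-uri'
require 'yaml'

module Wikipedia
  # Raised when the API response contains an 'error' entry.
  class WikipediaException < StandardError
  end

  # Fetch the content of one or more pages; titles may be given
  # as separate arguments or as an array.
  def self.query(*titles)
    url = "http://en.wikipedia.org/w...?what=content&format=yaml&titles=#{titles.flatten.join('|')}"
    # URI.open is the open-uri entry point on current Ruby versions.
    data = YAML.load(URI.open(url).read)
    raise WikipediaException, data['error'].values.first if data['error']

    data
  end
end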
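
One thing the module above does not do is escape the titles before putting them into the URL, so a title with spaces or other special characters may produce an invalid request. A possible way to build the query string is sketched below. The helper name wikipedia_query_url is made up, the base address is passed in by the caller because the endpoint is elided in the post, and turning spaces into underscores simply mirrors the script in the original question; whether the API expects that is an assumption.

```ruby
require 'uri'

# Build a query URL for the given titles. `base` is the API endpoint,
# supplied by the caller. (Hypothetical helper, not part of any library.)
def wikipedia_query_url(base, *titles)
  # Normalise whitespace and use underscores, as the original script does.
  names = titles.flatten.map { |t| t.strip.squeeze(' ').tr(' ', '_') }
  params = { 'what' => 'content', 'format' => 'yaml', 'titles' => names.join('|') }
  "#{base}?#{URI.encode_www_form(params)}"
end

# example.invalid is a placeholder host, not a real endpoint.
puts wikipedia_query_url('http://example.invalid/query', 'GNU Hurd', 'Linux kernel')
```

URI.encode_www_form percent-encodes the "|" separator as %7C along with any other reserved characters in the titles.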

comp.lang.ruby
