comp.lang.ruby
Open URI & web scraping. Part II
Jean Nibee
11/13/2007 1:21:00 PM
Hi
(Short form of a post I made yesterday that got no love; I suspect it's
because I was long-winded.)
In a nutshell: if I use open-uri (and Hpricot) to download a web page and
'scrape' all the images to write them to my local disk, dynamic images
always end up invalid (size 0), but static images are fine.
Example would be: <img
src="http://myserver:8080/Someservlet?name=blah&param=value&etc=etc">
Whether I copy/paste this URL into another browser or use open-uri to
"get" the image, I get an error of:
XML Parsing Error: no element found
Location: http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
Line Number 1, Column 1:
BUT, this image is displayed PERFECTLY in the HTML.
How can I get this image to download? (I suspect it's the MIME type
being set on the server side, but I am not 100% sure.)
***
OUTPUT
***
[[URI information...]]
Fetched document:
http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
Content Type: application/voicexml+xml
Charset:
Content-Encoding:
Last Modified:
IMAGE INFO!!! ->
Writing to file ::
D:\sandbox\auto_attendant\archive_reports\trunk\dumps\1194882652_854.gif
Thanks for your help.
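For anyone trying to reproduce this, here is a minimal diagnostic sketch, not the script that produced the output above. It reports the Content-Type the server sends, the number of bytes actually received, and whether those bytes start with a known image signature, and it writes the file in binary mode only when the body is non-empty. The names sniff_image_type and fetch_image and the optional headers argument are made up for illustration. URI.open is the spelling used by current Ruby; in 2007 the same call was a bare open(url).

```ruby
require "open-uri"

# Magic-number prefixes for common image formats (binary strings).
IMAGE_SIGNATURES = {
  gif:  ["GIF87a".b, "GIF89a".b],
  png:  ["\x89PNG\r\n\x1A\n".b],
  jpeg: ["\xFF\xD8\xFF".b]
}.freeze

# Returns :gif, :png, :jpeg, or nil when the bytes match no known signature.
def sniff_image_type(bytes)
  data = bytes.to_s.b
  IMAGE_SIGNATURES.each do |type, prefixes|
    return type if prefixes.any? { |prefix| data.start_with?(prefix) }
  end
  nil
end

# Hypothetical helper: fetches url, prints what the server actually sent,
# and writes the body to path in binary mode. Returns false (and writes
# nothing) when the body is empty, which is the "size 0" case.
def fetch_image(url, path, headers = {})
  URI.open(url, headers) do |response|
    body = response.read
    puts "Content-Type: #{response.content_type}"
    puts "Bytes read:   #{body.bytesize}"
    puts "Sniffed type: #{sniff_image_type(body).inspect}"
    return false if body.empty?

    File.binwrite(path, body)
    true
  end
end
```

If "Bytes read" is 0, the servlet is returning an empty body for this request and the declared Content-Type is a side issue. If bytes do arrive and the sniffed type is :gif, the data is fine and only the header is misleading.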
--
Posted via
http://www.ruby-...
.
3 Answers
Axel Etzold
11/13/2007 1:48:00 PM
Dear Jean,
maybe you can use Ruby's rio ( http://rio.ruby... ) to download
an entire website. I'm thinking in particular of the examples
given in http://rio.ruby... classes/RIO/Doc/INTRO.html under the
headers
"Creating a Rio that refers to a web page" and
"Creating a Rio that refers to a file or directory on a FTP server".
Otherwise, maybe you will get better responses on the Rails mailing list?
Best regards,
Axel
Jean Nibee
11/13/2007 1:59:00 PM
Same issue with RIO (albeit it's a little more complex to get the page and
parse it the way I'm doing with open-uri / Hpricot).
I didn't post to the Rails list since this isn't using the Rails framework,
but maybe they do enough web work that they will spot an issue I'm
missing.
Thanks for your reply and help!
--
Posted via
http://www.ruby-...
.
Peter Szinek
11/13/2007 2:04:00 PM
Hi Jean,
> Same issue with RIO (albeit a little more complex to get the page and
> parse it as I'm doing w/ OpenURI / Hpricot.)
What does an aggressive wget (i.e. with grab everything options) do?
Cheers,
Peter
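A related check that can be done from Ruby: a browser rendering the page normally sends the page's cookies and a Referer header along with the image request, while a bare fetch of the image URL does not. Whether this servlet depends on either is an assumption, and nothing in this thread confirms it. A minimal sketch using Net::HTTP, with the hypothetical helper names cookie_header and fetch_with_page_context, for plain http URLs only:

```ruby
require "net/http"
require "uri"

# Turns raw Set-Cookie header values into a single Cookie request header,
# keeping only the name=value pair of each and dropping attributes
# such as Path or HttpOnly.
def cookie_header(set_cookie_values)
  Array(set_cookie_values)
    .map { |value| value.split(";", 2).first.strip }
    .join("; ")
end

# Hypothetical helper: requests the page first, then requests the image
# with the page's cookies and the page URL as Referer.
# Returns the body of the image response.
def fetch_with_page_context(page_url, image_url)
  page = Net::HTTP.get_response(URI(page_url))

  image_uri = URI(image_url)
  request = Net::HTTP::Get.new(image_uri)
  request["Referer"] = page_url
  cookies = cookie_header(page.get_fields("set-cookie"))
  request["Cookie"] = cookies unless cookies.empty?

  Net::HTTP.start(image_uri.host, image_uri.port) do |http|
    http.request(request).body
  end
end
```

If the body comes back non-empty with these headers but empty without them, the servlet is tying the image to the page request, and the scraper needs to carry the same context.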
___
http://www.rubyra...
http://s...