comp.lang.ruby

Opening a large file many times / optimisation

Paul Nulty

3/11/2007 4:26:00 PM

Hello,

I have a method that searches through a largish (5 MB) text
file for a word. I need to call this method about 1400 times, and I
care about speed.

If I open the file at the start of the script and then pass the file
object as a parameter to my method each time it's called, the code runs
quite a bit faster than if I open the file inside the method each
time, but this seems ugly to me.

Is there a standard way to do this in Ruby? How much overhead is
involved in opening a large text file?

Thanks.

8 Answers

James Gray

3/11/2007 4:43:00 PM

On Mar 11, 2007, at 11:30 AM, Paul Nulty wrote:

> I have a method that basically searches through a largish (5mb) text
> file for a word. I need to call this method about 1400 times, and i
> care about speed.
>
> If i open the file at the start of the script and then pass the file
> object as a parameter to my method each time its called, the code runs
> quite a bit faster than if i open the file inside the method each
> time; but this seems ugly to me.
>
> Is there a standard way to do this in ruby? How much overhead is
> involved in opening a large text file?

Well, if you have enough RAM to support pulling it into memory,
that's certainly going to be faster. However, there are some
techniques you could use to speed up an index-and-query operation.
See this old Ruby Quiz for some ideas:

http://www.rubyquiz.com/q...

James Edward Gray II

M. Edward (Ed) Borasky

3/11/2007 4:46:00 PM

Paul Nulty wrote:
> hello,
>
> I have a method that basically searches through a largish (5mb) text
> file for a word. I need to call this method about 1400 times, and i
> care about speed.
>
> If i open the file at the start of the script and then pass the file
> object as a parameter to my method each time its called, the code runs
> quite a bit faster than if i open the file inside the method each
> time; but this seems ugly to me.
>
> Is there a standard way to do this in ruby? How much overhead is
> involved in opening a large text file?
>
> thanks.
>
1. You need to define the problem better. Are you searching for a
different word each time? Does the file change each time? Why do
you have to call it 1400 times?

2. Searching and indexing are extremely well documented areas of
computer science. Once you've correctly defined your problem, I'm sure
you'll come up with something far more efficient than a brute force
"open a five megabyte file, read the whole enchilada into RAM, and do a
text search for the word, then close the file and wait for the next
request".

3. Do you care about scalability, or is the file *never* going to get
bigger than 5 MBytes? Is the method *always* going to be called "only"
1400 times, or will someone see your success and say, "Great -- here's
20 million words!"?

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blo...

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


Paul Nulty

3/11/2007 5:24:00 PM

>
> 1. You need to define the problem better. Are you searching for a
> different word each time, does the file change each time, etc. Why do
> you have to call it 1400 times?

OK, here are a few lines from the file I'm searching (it's a WordNet file
that holds different senses of words):

concavity%1:07:00:: 05070032 2 0
concavity%1:25:00:: 13864965 1 0
concavo-concave%5:00:00:concave:00 00536008 1 0
concavo-convex%5:00:00:concave:00 00536416 1 0
conceal%2:39:00:: 02146790 2 1
conceal%2:39:01:: 02144835 1 8
concealed%3:00:00:: 02088404 2 1
concealed%5:00:00:invisible:00 02517817 1 2
concealing%1:04:00:: 01048912 1 0
concealing%3:00:00:: 02091020 1 0


I need to search for the first part (e.g. conceal%2:39:00::) and
return the second-to-last number (e.g. 2). (That is, getting the sense
number from the sense key, if you know WordNet.)

I have 1400 words, and the WordNet file will never change. I'm unlikely to
need to scale up much past 1400.

Here's my code (senseKey is e.g. "conceal%2:39:00::"):

lines = File.readlines("/usr/local/WordNet-3.0/dict/index.sense")

# gets the sense number (second-to-last field) from a sense key
def getSense(senseKey, lines)
  for line in lines
    # a match must start at the very beginning of the line
    if line.index(senseKey) == 0
      words = line.split(" ")
      return words[-2]
    end
  end
end


thanks again!




M. Edward (Ed) Borasky

3/11/2007 5:35:00 PM

Paul Nulty wrote:
>> 1. You need to define the problem better. Are you searching for a
>> different word each time, does the file change each time, etc. Why do
>> you have to call it 1400 times?
>>
>
> ok here's a few lines from the file i'm searching (its a wordnet file
> that holds different senses of words)
>
> concavity%1:07:00:: 05070032 2 0
> concavity%1:25:00:: 13864965 1 0
> concavo-concave%5:00:00:concave:00 00536008 1 0
> concavo-convex%5:00:00:concave:00 00536416 1 0
> conceal%2:39:00:: 02146790 2 1
> conceal%2:39:01:: 02144835 1 8
> concealed%3:00:00:: 02088404 2 1
> concealed%5:00:00:invisible:00 02517817 1 2
> concealing%1:04:00:: 01048912 1 0
> concealing%3:00:00:: 02091020 1 0
>
>
> i need to search for the first part (e.g. conceal%2:39:00::) and
> return the second last number (eg. 2). (getting the sense from the
> sense key, if you know wordnet)
>
> i have 1400 words, the wordnet file will never change. i'm unlikely to
> need to scale up much past 1400.
>
> here's my code: (senseKey is eg "conceal%2:39:00::")
>
> lines=File.readlines("/usr/local/WordNet-3.0/dict/index.sense")
>
> #gets a sysnet number from a sense key
> def getSense(senseKey,lines)
> for line in lines
> if line.index(senseKey)==0
> words=line.split(" ")
> return words[-2]
> end
> end
> end
>
>
> thanks again!
>
Isn't there a Ruby/Wordnet interface? Doctor Google recommended
http://www.deveiate.org/projects/Rub...



Paul Nulty

3/11/2007 5:56:00 PM

>Isn't there a Ruby/Wordnet interface? Doctor Google recommended
>http://www.deveiate.org/projects/Rub...

yep i'm using that; it's great but as far as i can tell it doesn't use
sense keys, it uses sense numbers. I only have the sense keys, so i
need to get the sense number from the sense key manually.

Robert Klemme

3/11/2007 8:42:00 PM

On 11.03.2007 18:55, Paul Nulty wrote:
>> Isn't there a Ruby/Wordnet interface? Doctor Google recommended
>> http://www.deveiate.org/projects/Rub...
>
> yep i'm using that; it's great but as far as i can tell it doesn't use
> sense keys, it uses sense numbers. I only have the sense keys, so i
> need to get the sense number from the sense key manually.

Try reading the file and storing all combinations in a Hash with sense
key as key and number as value.

robert
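
A minimal sketch of that suggestion, assuming each line has the four space-separated fields shown in the excerpt earlier in the thread (sense key, offset, sense number, tag count). The name `build_sense_table` is made up for illustration, and the sample data is held in a string so the snippet runs without a WordNet installation:

```ruby
require "stringio"

# Build a Hash mapping each sense key to its sense number.
# `io` is anything that responds to each_line (a File or a StringIO).
def build_sense_table(io)
  table = {}
  io.each_line do |line|
    key, _offset, sense_number, _tag_count = line.split
    next if key.nil? # skip blank lines
    table[key] = sense_number
  end
  table
end

# Three lines copied from the excerpt earlier in the thread.
sample = <<~DATA
  conceal%2:39:00:: 02146790 2 1
  conceal%2:39:01:: 02144835 1 8
  concealed%3:00:00:: 02088404 2 1
DATA

table = build_sense_table(StringIO.new(sample))
puts table["conceal%2:39:00::"] # prints 2
```

With a real file, the same method could be called as `File.open(path) { |f| build_sense_table(f) }`. The values stay Strings; call `to_i` on them if integers are needed.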

Brian Candler

3/11/2007 8:42:00 PM

On Mon, Mar 12, 2007 at 02:25:08AM +0900, Paul Nulty wrote:
> >
> > 1. You need to define the problem better. Are you searching for a
> > different word each time, does the file change each time, etc. Why do
> > you have to call it 1400 times?
>
> ok here's a few lines from the file i'm searching (its a wordnet file
> that holds different senses of words)
>
> concavity%1:07:00:: 05070032 2 0
> concavity%1:25:00:: 13864965 1 0
> concavo-concave%5:00:00:concave:00 00536008 1 0
> concavo-convex%5:00:00:concave:00 00536416 1 0
> conceal%2:39:00:: 02146790 2 1
> conceal%2:39:01:: 02144835 1 8
> concealed%3:00:00:: 02088404 2 1
> concealed%5:00:00:invisible:00 02517817 1 2
> concealing%1:04:00:: 01048912 1 0
> concealing%3:00:00:: 02091020 1 0
>
>
> i need to search for the first part (e.g. conceal%2:39:00::) and
> return the second last number (eg. 2). (getting the sense from the
> sense key, if you know wordnet)
>
> i have 1400 words, the wordnet file will never change. i'm unlikely to
> need to scale up much past 1400.

If you're searching a 5MB file 1400 times, it's almost certainly worth
reading it in once and building a hash as you go. Remember that on average,
each linear search reads half the lines in the file, so 1400 searches
cost about 700 full passes over the file. Building the hash takes a single
pass, and each lookup after that is effectively constant time. So the
searching itself should speed up by a factor of nearly 700 just by doing this.

If the wordnet file is too big to fit into RAM, then there are ways of
indexing the file on disk to make it quicker to search (external searching).
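
One simple form of such an index can be sketched as follows. This is only an illustration, not a full external-search scheme: it keeps a key-to-byte-offset Hash in memory and leaves the lines themselves on disk, so it only saves memory when lines are much longer than their keys. The class name `OffsetIndex` and its methods are hypothetical, and the demo writes a small temporary file instead of reading the real index.sense:

```ruby
require "tempfile"

# Hypothetical helper: maps each sense key to the byte offset of its line,
# so a lookup can seek straight to that line instead of scanning the file.
class OffsetIndex
  def initialize(path)
    @file = File.open(path, "rb") # binary mode keeps byte offsets exact
    @offsets = {}
    until @file.eof?
      pos = @file.pos
      key = @file.readline.split(" ", 2).first
      @offsets[key] = pos if key # key is nil for a blank line
    end
  end

  # Returns the second-to-last field of the matching line, or nil if absent.
  def sense_number(sense_key)
    pos = @offsets[sense_key]
    return nil unless pos
    @file.seek(pos)
    @file.readline.split[-2]
  end

  def close
    @file.close
  end
end

# Demo on a temporary file holding three lines from the excerpt above.
tmp = Tempfile.new("index.sense")
tmp.write("conceal%2:39:00:: 02146790 2 1\n" \
          "conceal%2:39:01:: 02144835 1 8\n" \
          "concealed%3:00:00:: 02088404 2 1\n")
tmp.flush

index = OffsetIndex.new(tmp.path)
found   = index.sense_number("conceal%2:39:01::")
missing = index.sense_number("no-such-key")
index.close
tmp.close!

puts found.inspect   # prints "1"
puts missing.inspect # prints nil
```

The file handle is opened once and reused for every lookup, which also avoids the repeated-open cost raised in the original question.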

> here's my code: (senseKey is eg "conceal%2:39:00::")
>
> lines=File.readlines("/usr/local/WordNet-3.0/dict/index.sense")
>
> #gets a sysnet number from a sense key
> def getSense(senseKey,lines)
> for line in lines
> if line.index(senseKey)==0
> words=line.split(" ")
> return words[-2]
> end
> end
> end

Try something like:

class Wordnet
  def initialize(filename)
    @words = {}
    # Read the file once, storing the remaining fields under each sense key
    File.open(filename) do |f|
      f.each_line do |line|
        fields = line.chomp.split(/ /)
        key = fields.shift
        @words[key] = fields
      end
    end
  end

  # fields are [offset, sense number, tag count], so [1] is the sense number
  def sysnet(senseKey)
    @words[senseKey][1]
  end
end

wn = Wordnet.new("/usr/local/WordNet-3.0/dict/index.sense")
# Now do this 1400 times for different keys
puts wn.sysnet("conceal%2:39:00::")

Paul Nulty

3/12/2007 1:26:00 PM

Thanks!

Timings in seconds (columns: user, system, total, real):

before:

142.800000 0.100000 142.900000 (156.818797)

after (with hash):

9.900000 0.100000 10.000000 ( 11.259273)

thanks again.
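
For anyone wanting to reproduce a comparison like this, here is a rough sketch using Ruby's standard `Benchmark` module. It does not use the real index.sense file: the 20,000 synthetic lines, the 200 sampled keys, and the helper name `linear_lookup` are all assumptions made so the snippet is self-contained, and the time to build the hash is deliberately left outside the measurement.

```ruby
require "benchmark"

# Synthetic stand-in for index.sense: "key offset sense_number tag_count".
lines = Array.new(20_000) do |i|
  format("word%05d%%1:00:00:: %08d %d 0", i, i, (i % 9) + 1)
end

# A fixed random sample of keys to look up (seeded so runs are repeatable).
keys = lines.sample(200, random: Random.new(42)).map { |l| l.split.first }

# Linear scan, in the spirit of the original getSense method.
def linear_lookup(lines, key)
  prefix = key + " "
  lines.each do |line|
    return line.split[-2] if line.start_with?(prefix)
  end
  nil
end

# One pass to build the key => sense number table.
table = lines.each_with_object({}) do |line, h|
  key, _offset, sense_number = line.split
  h[key] = sense_number
end

linear_results = nil
hash_results = nil
Benchmark.bm(8) do |bm|
  bm.report("linear:") { linear_results = keys.map { |k| linear_lookup(lines, k) } }
  bm.report("hash:")   { hash_results = keys.map { |k| table[k] } }
end
```

Both approaches should return identical results; only the time per lookup differs, and the gap grows with the number of lines and the number of lookups.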