Esmail
3/1/2008 12:46:00 PM
William James wrote:
>>
>> However, I have discovered a problem with the data. While the labels
>> may be different (and have different lengths, as the sequences may,
>> though my simplified example doesn't show this), the associated
>> sequences may indeed be *identical*.
>>
>> So for instance the sequences associated with label_3 and label_5 are
>> in fact identical, but will not be flagged as such since their labels
>> differ.
>>
>> My goal is to eliminate duplicate sequences (and their label) from
>> this file. So in this case either label_3 and its sequence, or label_5
>> and its sequence would be eliminated. Does anyone have a good
>> suggestion on how to accomplish this?
>>
>> My initial thought is to somehow parse the file and create a hash and
>> then use the sequences as keys and the labels as values, the idea
>> being that duplicate sequences (ie keys) would overwrite each other if
>> they occurred and I'd end up with all unique sequences.
>>
>> Does this seem like a reasonable approach? Is there a better, more
>> elegant and perhaps more efficient solution?
>>
>> Thanks!
>>
>> eb
>
> Sounds like a good idea.
>
> h = Hash[ *DATA.read.split( /^( *>label_\d+\n)/ )[1..-1].reverse ]
>
> puts h.to_a.flatten.reverse
>
> __END__
> >label_1
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
Wow, that is pretty slick. Glad my idea wasn't that far off, but it
would have taken me some time to come up with such a nice, concise way
of expressing it.

A few quick questions:

1. If my labels start with ">" in column 1 (they got indented in my
   original post) and are a single line that may contain all sorts
   of strings (not necessarily "label"), would this regular expression
   work too? .split(/>.+\n/)

2. I had never seen *DATA before. I can only find DATA in my books,
   which basically say that the data used by the program is appended
   to it. What is the significance of the * in front?

3. When I try reading the data from a file

   h = Hash[ IO.read("td").split( /^( *>label_\d+\n)/ )[1..-1].reverse ]

   I get:

   ./eSeqs.rb:7:in `[]': odd number of arguments for Hash (ArgumentError)
   from ./eSeqs.rb:7

   Clearly I need to read up on hashes more, but is there an easy way
   for me to read the file in, instead of appending it to my program?

Thanks again!!
EB
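
For reference, the three questions above can be illustrated with a small sketch. This is not the code from the thread, only one possible way to write it, and it assumes Ruby 1.9 or later (where hashes keep insertion order). The method name unique_records and the sample data are made up for illustration. Two things it relies on: String#split keeps the text matched by a capture group in its result (so a pattern without parentheses, such as />.+\n/, would throw the labels away), and the * (splat) spreads an array into separate arguments. The file-reading line in question 3 has no * in front of IO.read, so Hash[] receives a single array argument, which would be consistent with the "odd number of arguments" error.

```ruby
# A minimal sketch, not the code from the thread. The method name
# unique_records and the sample data below are made up for illustration.

# Hash[k1, v1, k2, v2] builds a hash from a flat, even-length argument list.
# The * (splat) spreads an array into separate arguments.
pairs = ["a", "1", "b", "2"]
h = Hash[*pairs]   # same as Hash["a", "1", "b", "2"]

# Returns the records of text with duplicate sequences removed,
# keeping the first label seen for each sequence.
# Assumes every label is a single line starting with ">" in column 1.
def unique_records(text)
  # The capture group makes split keep the label lines in its result.
  # Element 0 is whatever precedes the first label, so it is dropped.
  parts = text.split(/^(>.*\n)/)[1..-1] || []

  first_label = {}   # sequence text => first label seen for it
  parts.each_slice(2) do |label, seq|
    # ||= only assigns when the sequence has not been seen before.
    first_label[seq.to_s] ||= label
  end

  first_label.map { |seq, label| label + seq }.join
end

sample = ">label_1\nAAACCC\n>label_2\nGGGTTT\n>label_3\nAAACCC\n"
puts unique_records(sample)
```

To run it against a file instead of appended data, something like puts unique_records(File.read("td")) should do, with "td" being the file name used above. Note that sequences are compared exactly as they appear, line breaks included, so the same sequence wrapped at a different column would not be treated as a duplicate.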