Asp Forum - eliminate duplicate sequences from file

Esmail

3/1/2008 3:21:00 AM

A few months back I asked for suggestions to parse a file that
contained this type of data:

>label_1
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
TT
>label_2
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
CC
>label_3
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
GG
>label_4
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AA
>label_5
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
GG

I wanted to eliminate duplicate sequences. This is what I ended up
using based on some great suggestions here:

-------------

def no_dups(data)
  data.split(">").uniq.join(">")
end

data = IO.read(ARGV[0])
fixed = no_dups(data)
puts fixed

-------------

However, I have discovered a problem with the data. While the labels
may be different (and may have different lengths, as may the sequences,
though my simplified example doesn't show this), the associated
sequences may in fact be *identical*.

So for instance the sequences associated with label_3 and label_5 are
in fact identical, but will not be flagged as such since their labels
differ.

My goal is to eliminate duplicate sequences (and their label) from
this file. So in this case either label_3 and its sequence, or label_5
and its sequence would be eliminated. Does anyone have a good
suggestion on how to accomplish this?

My initial thought is to parse the file and build a hash that uses
the sequences as keys and the labels as values, the idea being that
duplicate sequences (i.e. keys) would overwrite each other if they
occurred, and I'd end up with all unique sequences.

Does this seem like a reasonable approach? Is there a better, more
elegant and perhaps more efficient solution?

Thanks!

eb
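
The hash idea from the question above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the solution settled on later in the thread: `unique_records` and the `sample` string are hypothetical names, every record is assumed to start at a line beginning with ">", and sequences are compared with their line breaks removed. It also assumes Ruby 1.9 or later, where hashes keep insertion order.

```ruby
# Hypothetical helper: keeps the first record seen for each distinct sequence.
def unique_records(text)
  seen = {}
  # Split in front of every line that starts with ">"; the lookahead
  # leaves the ">" attached to its record.
  text.split(/^(?=>)/).each do |record|
    _label, sequence = record.split("\n", 2)
    next if sequence.nil?
    # Drop line breaks so the same sequence wrapped differently still matches.
    key = sequence.delete("\n")
    # Only the first record for a given sequence is stored.
    seen[key] ||= record
  end
  seen.values
end

# Made-up sample: label_1 and label_3 carry the same sequence.
sample = ">label_1\nAAAC\nTT\n>label_2\nAAAC\nCC\n>label_3\nAAAC\nTT\n"
puts unique_records(sample).join
```

Storing the whole record as the value, rather than just the label, keeps the original line wrapping of the surviving record intact.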

3 Answers

William James

3/1/2008 5:17:00 AM

On Feb 29, 9:21 pm, Esmail <ebonak_de...@hotmail.com> wrote:
> A few months back I asked for suggestions to parse a file that
> contained this type of data:
>
> >label_1
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> TT
> >label_2
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> CC
> >label_3
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> GG
> >label_4
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> AA
> >label_5
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA
> GG
>
> I wanted to eliminate duplicate sequences. This is what I ended up
> using based on some great suggestions here:
>
> -------------
>
> def no_dups(data)
> data.split(">").uniq.join(">")
> end
>
> data = IO.read(ARGV[0])
> fixed = no_dups(data)
> puts fixed
>
> -------------
>
> However, I have discovered a problem with the data. While the labels
> may be different (and may have different lengths, as may the sequences,
> though my simplified example doesn't show this), the associated
> sequences may in fact be *identical*.
>
> So for instance the sequences associated with label_3 and label_5 are
> in fact identical, but will not be flagged as such since their labels
> differ.
>
> My goal is to eliminate duplicate sequences (and their label) from
> this file. So in this case either label_3 and its sequence, or label_5
> and its sequence would be eliminated. Does anyone have a good
> suggestion on how to accomplish this?
>
> My initial thought is to somehow parse the file and create a hash and
> then use the sequences as keys and the labels as values, the idea
> being that duplicate sequences (ie keys) would overwrite each other if
> they occurred and I'd end up with all unique sequences.
>
> Does this seem like a reasonable approach? Is there a better, more
> elegant and perhaps more efficient solution?
>
> Thanks!
>
> eb

Sounds like a good idea.

h = Hash[ *DATA.read.split( /^( *>label_\d+\n)/ )[1..-1].reverse ]

puts h.to_a.flatten.reverse

__END__
>label_1
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
TT
>label_2
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
CC
>label_3
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
GG
>label_4
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AA
>label_5
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
AAACCCCCCCTTTTTAAAAA
GG
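
For readers who want to see what the one-liner above does, here is the same chain unrolled step by step. It is a sketch on a shortened, made-up sample (the `text` string and the intermediate variable names are not from the thread), and it assumes Ruby 1.9 or later so that the hash keeps insertion order; on older versions the output order is not guaranteed.

```ruby
text = ">label_1\nAAAC\nTT\n>label_2\nAAAC\nCC\n>label_3\nAAAC\nTT\n"

# The capture group makes split keep the label lines in its result.
parts = text.split(/^( *>label_\d+\n)/)
# parts[0] is the empty string in front of the first label, so drop it.
pairs = parts[1..-1]        # [label, seq, label, seq, ...]
# Reversing puts every sequence ahead of its label: [seq, label, ...]
flat = pairs.reverse
# Hash[k, v, k, v, ...]: a repeated sequence keeps a single key, and its
# label is overwritten by the label assigned last (the earliest in the file).
h = Hash[*flat]
# Flatten back to [seq, label, ...] and reverse to get label-first order.
result = h.to_a.flatten.reverse
puts result
```

Two things are worth noticing in this sketch: the label that survives for a duplicated sequence is the one that appears first in the file, and the surviving record is emitted at the position of the last duplicate, so the records do not necessarily come out in their original order.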

Esmail

3/1/2008 12:46:00 PM

William James wrote:

>>
>> However, I have discovered a problem with the data. While the labels
>> may be different (and may have different lengths, as may the sequences,
>> though my simplified example doesn't show this), the associated
>> sequences may in fact be *identical*.
>>
>> So for instance the sequences associated with label_3 and label_5 are
>> in fact identical, but will not be flagged as such since their labels
>> differ.
>>
>> My goal is to eliminate duplicate sequences (and their label) from
>> this file. So in this case either label_3 and its sequence, or label_5
>> and its sequence would be eliminated. Does anyone have a good
>> suggestion on how to accomplish this?
>>
>> My initial thought is to somehow parse the file and create a hash and
>> then use the sequences as keys and the labels as values, the idea
>> being that duplicate sequences (ie keys) would overwrite each other if
>> they occurred and I'd end up with all unique sequences.
>>
>> Does this seem like a reasonable approach? Is there a better, more
>> elegant and perhaps more efficient solution?
>>
>> Thanks!
>>
>> eb
>
> Sounds like a good idea.
>
> h = Hash[ *DATA.read.split( /^( *>label_\d+\n)/ )[1..-1].reverse ]
>
> puts h.to_a.flatten.reverse
>
> __END__
> >label_1
> AAACCCCCCCTTTTTAAAAA
> AAACCCCCCCTTTTTAAAAA

wow .. that is pretty slick .. glad my idea wasn't that far off,
but it would have taken me some time to come up with such a nice
concise way of expressing it.

A few quick questions:

1. If my labels start with ">" in column 1 (they got indented in my
original post), and are a single line which may contain all sorts
of strings (not necessarily "label") .. would this regular expression
work too? .split(/>.+\n/)

2. I had never seen *DATA before. I can only find DATA in my books,
which basically say that it refers to data appended to the program.
What is the significance of the * in front?

3. When I try reading the data from a file

h = Hash[ IO.read("td").split( /^( *>label_\d+\n)/ )[1..-1].reverse ]

I get: ./eSeqs.rb:7:in `[]': odd number of arguments for Hash (ArgumentError)
from ./eSeqs.rb:7

Clearly I need to read up on hashes more, but is there an easy way
for me to read the file in, instead of appending it to my program?

Thanks again!!

EB
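
On questions 2 and 3: the `*` is not part of `DATA`. It is Ruby's splat operator, applied to the array that the whole `split(...)[1..-1].reverse` chain produces, so that `Hash[]` receives the elements as separate arguments. Leaving it out hands `Hash[]` the array as a single argument, which on the Ruby version used in this thread is the "odd number of arguments" the error reports. A small sketch with made-up values (`flat`, `h` and `h2` are illustrative names only):

```ruby
flat = ["seq_a", ">label_1", "seq_b", ">label_2"]

# With the splat, Hash[] gets four separate arguments and pairs them up
# as key, value, key, value.
h = Hash[*flat]
p h

# An alternative spelling without the splat: group the elements into
# [key, value] pairs first and pass the array of pairs.
h2 = Hash[flat.each_slice(2).to_a]
p h2
```

The splat form fails if the array has an odd number of elements, which is why the leading empty string from `split` has to be dropped with `[1..-1]` first.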

Esmail

3/2/2008 1:16:00 PM

Got it

h = Hash[*IO.read(ARGV[0]).split( /^(>.+\n)/ )[1..-1].reverse]
puts h.to_a.flatten.reverse

A good chance to learn more Ruby :-)

Thanks again!
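
As for question 1 above, the difference between `/>.+\n/` and the pattern in the final version comes down to the parentheses: `String#split` only keeps the matched separators in its result when the pattern contains a capture group. A brief sketch on a made-up sample (the `text` string and its labels are invented for illustration):

```ruby
text = ">first label\nAAAC\n>another one\nGGTT\n"

# Without a capture group the matched label lines are thrown away.
without_group = text.split(/^>.+\n/)
p without_group

# With a capture group each matched label line is kept in the result.
with_group = text.split(/^(>.+\n)/)
p with_group
```

In both cases the first element is an empty string, because the text starts with a label; that is the element the `[1..-1]` in the thread's code removes.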

comp.lang.ruby
