TwirlwT
12/18/2015 6:01:00 PM
El viernes, 18 de diciembre de 2015, 7:13:58 (UTC+1), WJ escribió:
> TwirlwT wrote:
>
> >
> > * (load "mosquito.lisp")
> > longest-common-substr for k = 11505 is (11508 53395)
> > longest-common-substr for k = 11506 is 2639
> > Hence, the greatest k such that f(k) >= k is 11505
> > The substrings are located at positions 27441, 15935 and 39112
> >
> > -----------------------------------------------------------------
> > Source code follows:
> >
> > (defparameter bases (with-open-file (f "mosca.txt"
> > :direction :input
> > :element-type 'base-char)
> > (read-line f)))
> >
> >
> > (defun ordena()
> > (let ((anum (coerce (loop for i below (length bases) collect i) 'vector)))
> > (sort anum (lambda(i j) (string< bases bases :start1 i :start2 j)))))
> >
> > (defparameter anum (ordena))
> >
> > (defun mis(i j)
> > (let ((i1 (aref anum i))
> > (j1 (aref anum j)))
> > (- (mismatch bases bases :start1 i1 :start2 j1) i1)))
> >
> > (defun longest-common-substr(k)
> > "Compute the longest common substring,
> > The parameter k is used to guarantee that each pair of substring are
> > at a distance greater than k. The condition f(k)>=k guarantees that
> > there is not overlapping so that the solution obtained is feasible.
> >
> > Finally, since f(k) is a decreasing function of k, the
> > longest feasible solution is the greatest k such that f(k) >= k. Such
> > k can be found with a loop such as: while f(k)>=k increase k, but is
> > better to look for k using binary search.
> >
> > Note that the ordering used for the indices implies that
> > mismatch(i, i+p) = k0 then mismatch(i,j)>= k0 for each j with i<=j<=k0.
> > "
> >
> > (loop with maxi = 0 with pos = 0 with aux = 0
> > for i upfrom 0 below (- (length anum) 3)
> > for a = (aref anum i)
> > for b = (aref anum (1+ i))
> > for c = (aref anum (+ 2 i))
> > when (and (> (abs (- a b)) k)
> > (> (abs (- a c)) k)
> > (> (abs (- b c)) k))
> > do
> > (progn
> > (setq aux (- (mismatch bases bases :start1 a :start2 c) a))
> > (when (> aux maxi)
> > (setq maxi aux
> > pos i)))
> > finally (return (values maxi pos))))
> >
> >
> > (multiple-value-bind (maxlen location)
> > (longest-common-substr 11505)
> > (format t "longest-common-substr for k = 11505 is ~S~%" (list maxlen location))
> > (format t "longest-common-substr for k = 11506 is ~S~%"
> > (longest-common-substr 11506))
> > (format t "Hence, the greatest k such that f(k) >= k is 11505~%")
> > (format t "The substrings are located at positions ~D, ~D and ~D~%"
> > (aref anum location) (aref anum (1+ location)) (aref anum (+ 2 location))))
> >
>
> MatzLisp (Ruby):
>
> text = IO.binread("Drosophila_melanogaster_chromX.txt")
>
> t = Time.now
> @suffix_array = (0 ... text.size).sort_by{|i| text[i .. -1]}
> printf "Sorting took %.1f seconds.\n", Time.now - t
>
> def suffix_size(i)
> @suffix_array.size - @suffix_array[i]
> end
>
> def overlap?(locations, pos, min_size)
> locations.map{|p| (p - pos).abs}.min < min_size
> end
>
> reps = 3
> min_size = 11506
>
> (text.size - reps + 1).times{|i|
> next if suffix_size(i) < min_size
> pos = @suffix_array[i]
> locations = [pos]
> substr = text[ pos, min_size ]
> (i+1).upto(text.size - 1){|j|
> next if suffix_size(j) < min_size
> pos = @suffix_array[j]
> next if overlap?(locations, pos, min_size)
> break if substr != text[pos,min_size]
> locations << pos }
> if locations.size >= reps
> p locations.sort
> end }
>
>
>
> I used the complete 3 megabyte file, not the 500_000 byte file that
> you mentioned in the Ruby forum.
>
>
> Output:
>
> Sorting took 17.3 seconds.
> [15935, 27441, 39112]
> [15936, 27442, 39113]
> [15937, 27443, 39114]
Hello, thanks for the ruby version, (sorry to the rest of the list if this is off-topic here)
I didn't considered sort_by because, at first sight, creating the table with 3 Millions x 3 Millions bytes appears to be a terrible idea. It happens that Ruby use a copy on write scheme, so that it only create structures with displaced pointers. I have tried to use a similar scheme in Lisp using displaced-arrays but the performance is very bad in this case.
Just for comparison, in one of my old computers, the ruby code and the Lisp code both took approximately the same amount of time, about 19 seconds, when used displaced arrays it takes 85 seconds.
So here I have to acknowledge that the performance of ruby is very good at sorting. Kudos to the ruby developers.
Just nick-picking, in your code you use: locations.map{|p| (p - pos).abs}..min it could be abbreviated to locations.min{|p| (p-pos).abs}, since you don't need to construct the list just to compute the minimum.
As I like to learn many programming languages, I like to read some of your post, the only caveat is that this isn't the right place to do so but it keeps posts flowing, what is good.
-------------------------------------------
ruby wj.rb
Sorting took 18.9 seconds.
[15935, 27441, 39112]
[15936, 27442, 39113]
[15937, 27443, 39114]
(defun ordena-old()
(let ((anum (make-array (length bases)
:element-type '(unsigned-byte 30)
:initial-contents (loop for i below (length bases) collect i))))
(declare (type (simple-array base-char) bases)
(sort anum (lambda(i j) (string< bases bases :start1 i :start2 j))))))
(defun ordena()
(let* ((len1 (length bases))
(anum (make-array len1
:element-type '(unsigned-byte 30)
:initial-contents (loop for i below len1 collect i))))
(declare (type (simple-array base-char) bases))
(sort anum
#'string<
:key (lambda(i)
(make-array (- len1 i) :element-type 'base-char
:displaced-to bases :displaced-index-offset i)))))
(defun ordena-old()
(let ((anum (make-array (length bases)
:element-type '(unsigned-byte 30)
:initial-contents (loop for i below (length bases) collect i))))
(declare (type (simple-array base-char) bases)
(sort anum (lambda(i j) (string< bases bases :start1 i :start2 j))))))
CL-USER> (time (progn (ordena-old) nil))
Evaluation took:
0.193 seconds of real time
0.192000 seconds of total run time (0.180000 user, 0.012000 system)
99.48% CPU
441,885,614 processor cycles
37,741,296 bytes consed
NIL
STYLE-WARNING: redefining COMMON-LISP-USER::ORDENA in DEFUN
CL-USER> (time (progn (ordena) nil))
Evaluation took:
84.704 seconds of real time
84.692000 seconds of total run time (83.864000 user, 0.828000 system)
[ Run times consist of 2.208 seconds GC time, and 82.484 seconds non-GC time. ]
99.99% CPU
194,353,507,546 processor cycles