comp.programming

Parallel Sort library is here...

Ramine

2/8/2015 12:23:00 AM


Hello,


My Parallel Sort library version 3.3 is here:

https://sites.google.com/site/aminer68/parallel-so...


As I have told you, my new parallel sort algorithm has become more
cache-aware, and because it has become more cache-aware it shows a super
linear speedup and super linear scalability when using more cores and
more L2 caches. I have done some benchmarks on my quad-core machine,
which uses two L2 caches, and it has given a super linear speedup of 5X
scalability when sorting strings, even though I am using only 4 cores.
That is easy to understand: when you use only one thread, it uses only
one L2 cache, but when you use more threads on multiple cores with
multiple L2 caches, it uses more L2 caches and it parallelizes the
access to those multiple L2 caches. This is why my new parallel
algorithm gives a super linear speedup when sorting strings.
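
To make the cache-awareness idea more concrete, here is a minimal C++
sketch (the library itself is written in Pascal, and the 256 KB per-core
L2 size and the element type are assumptions of mine, not measured
values) that simply picks the number of chunks so that each thread's
working set roughly fits in one L2 cache:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>

int main() {
    // Illustrative assumptions: 256 KB of L2 cache per core, doubles as elements.
    const std::size_t l2_bytes_per_core = 256 * 1024;
    const std::size_t n = 10000000;                 // number of elements to sort
    const std::size_t elem_bytes = sizeof(double);

    // Enough chunks so that one chunk roughly fits in one L2 cache.
    std::size_t chunks_for_cache =
        (n * elem_bytes + l2_bytes_per_core - 1) / l2_bytes_per_core;

    // Never use fewer chunks than available cores, so every core gets work.
    std::size_t cores = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    std::size_t chunks = std::max(chunks_for_cache, cores);

    std::cout << "cores: " << cores << ", chunks: " << chunks
              << ", elements per chunk: " << (n + chunks - 1) / chunks << "\n";
}

Each chunk is then sorted by one thread against its own L2 cache, which
is where the extra speedup beyond the core count comes from.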

I have to be frank: we have to be smart when inventing or designing
parallel sort algorithms. My new parallel sort algorithm is smart
because it is more cache-aware, but you have to be careful, because look
at my other parallel quicksort algorithm here:

https://sites.google.com/site/aminer68/parallel...

You have to know that this parallel quicksort, which uses the classical
way of sorting, is not so good: when its partition function is used
recursively, this parallel quicksort algorithm dispatches the arrays to
be partitioned each time to different threads, and this makes it less
cache-aware than my new parallel sort algorithm. Since it is less
cache-aware, you will not get more than 3X scalability when sorting
strings, and you will get less than 3X scalability when sorting integers
or doubles, for example. So I advise you to use the new parallel sort
algorithm of my Parallel Sort library version 3.3, which is more
cache-aware: it gives you super linear scalability when sorting strings
and good scalability when sorting integers and doubles, etc.

Let's talk more computer science...

Finally, I have arrived at an important subject, which is
"optimization", and you will notice that computer science teaches you
one of the most important things: how to amortize and optimize by
improving your time complexity, which we express mathematically with O()
notation. This is how I have done it with my new parallel sort
algorithm, which has a "very" good time complexity. Computer science
also teaches you to optimize the part or parts of the source code that
take the greater percentage of the running time, and to concentrate on
the optimizations that bring the greater percentage of the gain rather
than on the small optimizations; this is also a good way to optimize
your code. So if you look at my new parallel conjugate gradient solver
library, I am not using SIMD SSE2 instructions; I am just parallelizing
the memory transfers, because my parallel algorithm is NUMA-aware, and I
am also parallelizing the multiplications and additions on doubles,
etc., and since my parallel algorithm is cache-aware I am also
parallelizing the L2 cache memory transfers. This optimization
represents the greater percentage of all the optimization that can be
done on my parallel algorithm, so in large scale industrial problems I
don't think that my parallel conjugate gradient algorithm needs SIMD
SSE2 or AVX optimization, because SIMD SSE2 or AVX would give only a
small improvement to my parallel algorithm.

I have finally arrived at an important subject... it is still about
parallel optimization, but this time I will talk more about the
Cilk-style scheduler. As you have seen in my previous post, I have
talked about Dmitry Vyukov's Cilk-style scheduler, which supports
system-topology awareness, hyper-threading awareness, affinity
awareness, batch-spawn capability and manual task-depth control; here it
is:

http://www.1024cores.net/home/parallel-computing/...


And as you have seen me talking about my parallel quicksort, I have said
that it is not so good: since it dispatches the indexes of the arrays to
be partitioned each time to different threads of my classical
threadpool, and since my threadpool does not look like Dmitry Vyukov's
scheduler, this renders the parallel quicksort algorithm less
cache-aware, so it will not scale to more than 3X for strings and it
will scale less for integers and doubles. But if we were using Dmitry
Vyukov's scheduler, then as we recursively dispatch the array's indexes
to the scheduler's threads that call the partition function, the
Cilk-style scheduler would dispatch those indexes to threads that reuse
the data of the L2 caches and L3 caches "efficiently". But even though
it reuses the cached data efficiently, this optimization of Dmitry
Vyukov's scheduler will not bring more than a 2X improvement; in fact,
you will get less and less improvement as you use bigger and bigger
data, so the improvement that Dmitry Vyukov's scheduler brings is not so
important. What you have to do instead is to make your parallel sort
algorithm NUMA-aware; this is the way to make your parallel algorithm
scalable, and this is why I told you in my previous post that Dmitry's
scheduler will not bring an important improvement. Also, if you take a
look at the new parallel sort algorithm of my Parallel Sort library
version 3.3, neither its parallel sorting part nor its parallel merging
part contains parallel recursive calls, so Dmitry Vyukov's scheduler
would not bring an improvement to my parallel algorithm. For all those
reasons, I have decided not to implement a scheduler that looks like
Dmitry Vyukov's scheduler, and to use my classical threadpool engine
instead.

Finally, my Parallel Sort Library, which supports Parallel Quicksort,
Parallel HeapSort and Parallel MergeSort on multicore systems, is here.

Parallel Sort Library uses my Thread Pool Engine and sorts many parts of
your array in parallel using Quicksort, HeapSort or MergeSort, and after
that it finally merges them with the merge() procedure.

In the previous Parallel Sort version I had parallelized only the sort
part, but in this new version I have also parallelized the merge
procedure part, and it gives better performance.
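
As a rough illustration of this scheme, here is a minimal C++ sketch
that sorts the parts of an array in parallel and then merges them; it is
not the library's actual Pascal code, and the element type and chunk
count are assumptions of mine. For clarity the merging rounds here are
sequential; a parallel, reduction-style merge is sketched further down
in this post.

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sort a vector by sorting one chunk per thread, then merging the sorted
// chunks pairwise with std::inplace_merge until a single run remains.
void parallel_chunk_sort(std::vector<int>& a, std::size_t chunks) {
    const std::size_t n = a.size();
    if (chunks < 2 || n < 2) { std::sort(a.begin(), a.end()); return; }

    // Chunk boundaries: chunk i covers [bounds[i], bounds[i + 1]).
    std::vector<std::size_t> bounds;
    for (std::size_t i = 0; i <= chunks; ++i)
        bounds.push_back(i * n / chunks);

    // Phase 1: sort each chunk in its own thread.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks; ++i)
        workers.emplace_back([&a, &bounds, i] {
            std::sort(a.begin() + bounds[i], a.begin() + bounds[i + 1]);
        });
    for (auto& t : workers) t.join();

    // Phase 2: merge the sorted chunks pairwise until one sorted run remains.
    for (std::size_t width = 1; width < chunks; width *= 2)
        for (std::size_t i = 0; i + width < chunks; i += 2 * width)
            std::inplace_merge(a.begin() + bounds[i],
                               a.begin() + bounds[i + width],
                               a.begin() + bounds[std::min(i + 2 * width, chunks)]);
}

int main() {
    std::vector<int> v = {5, 3, 9, 1, 7, 2, 8, 6, 4, 0};
    parallel_chunk_sort(v, 4);
    assert(std::is_sorted(v.begin(), v.end()));
}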

My new parallel sort algorithm has become more cache-aware, and I have
done some benchmarks with it: it has given up to 5X scalability on a
quad-core when sorting strings. Other than that, I have cleaned up the
code more, and I think my Parallel Sort library has become a more
professional and industrial parallel sort library. You can be confident,
because I have tested it thoroughly and no bugs have shown up, so I hope
you will be happy with my new Parallel Sort library.


I have also included a "test.pas" example: first compile the
"gendata.pas" program inside the zip file and run it, and after that
compile the "test.pas" example, run it and do your benchmarks.

I have implemented a parallel hybrid divide-and-conquer merge algorithm
that performs 0.9 to 5.8 times better than a sequential merge on a
quad-core processor, with larger arrays outperforming it by over 5
times. Parallel processing combined with a hybrid algorithmic approach
provides a powerful high-performance result.
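
The post only describes this merge at a high level, so the following C++
sketch just shows the general divide-and-conquer idea behind a parallel
merge, not the library's own hybrid implementation; the cutoff value
below which it falls back to a plain sequential merge (the "hybrid"
part) is an assumption of mine.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Divide-and-conquer merge of two sorted ranges [a1, b1) and [a2, b2) into out.
// Split the larger run at its median, binary-search that value in the smaller
// run, and merge the two resulting halves independently, one of them in a thread.
void dc_merge(const int* a1, const int* b1,
              const int* a2, const int* b2,
              int* out, std::size_t cutoff) {
    std::size_t n1 = b1 - a1, n2 = b2 - a2;
    if (n1 < n2) { std::swap(a1, a2); std::swap(b1, b2); std::swap(n1, n2); }
    if (n1 + n2 <= cutoff) { std::merge(a1, b1, a2, b2, out); return; }

    const int* mid1 = a1 + n1 / 2;                      // median of the larger run
    const int* mid2 = std::lower_bound(a2, b2, *mid1);  // split point in the other run
    int* out_mid = out + (mid1 - a1) + (mid2 - a2);

    std::thread left(dc_merge, a1, mid1, a2, mid2, out, cutoff);  // left half in parallel
    dc_merge(mid1, b1, mid2, b2, out_mid, cutoff);                // right half on this thread
    left.join();
}

int main() {
    std::vector<int> x(50000), y(70000), out(120000);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] = 2 * (int)i;
    for (std::size_t i = 0; i < y.size(); ++i) y[i] = 3 * (int)i;
    dc_merge(x.data(), x.data() + x.size(),
             y.data(), y.data() + y.size(), out.data(), 1 << 15);
    return std::is_sorted(out.begin(), out.end()) ? 0 : 1;
}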

The best case complexity of ParallelSort using mergesort is:

O((n/p) * log(n/p)) + O(n/p)

where p is the number of cores.

The O((n/p) * log(n/p)) term is the complexity of the sorting part, and
O(n/p) is the best case complexity of the merging part, so the best case
complexity is O((n/p) * log(n/p)).

The worst case complexity of the Parallel Sort library using mergesort is:

O((n/p) * log(n/p)) + O(n)

The O((n/p) * log(n/p)) term is again the complexity of the sorting
part, and O(n) is the worst case complexity of the merging part, so the
worst case complexity of ParallelSort using mergesort is approximately O(n).

I have done some tests with my Parallel Sort library and I have noticed
that it can give up to 5X scalability with strings, and it gives 3X
scalability with integers on a quad-core.

So why does it scale to 5X with strings and only 3X with integers on a
quad-core?

I explain:

In the SequentialMerge() method and the QSort() method inside the
Parallel Sort library, I am calling the SCompare() method, and in both
of them I am also copying to system memory.

So when I am sorting strings, the SCompare() method is more expensive,
so the parallel part P in the Amdahl equation 1/(S + P/N) (S: the serial
part, P: the parallel part, N: the number of cores) is bigger than with
integers, and the Amdahl equation scales better. But when I am sorting
integers, the SCompare() method is less expensive than SCompare() with
strings, so the parallel part P in the Amdahl equation is smaller than
with strings. This is why parallel sorting with strings scales better
than with integers.
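
As a rough worked example of that equation (the serial fractions here
are illustrative, they are not measured from the library): with Amdahl's
equation the speedup on N cores is 1/(S + P/N) with S + P = 1. If the
serial part is S = 1/9, then on N = 4 cores the speedup is
1/(1/9 + (8/9)/4) = 1/(3/9) = 3, which matches the 3X observed with
integers. Even a very small serial part such as S = 0.02 gives only
1/(0.02 + 0.98/4), which is about 3.77, so anything above 4X on 4 cores,
like the 5X with strings, cannot come from this equation alone; that
extra factor is the cache effect described earlier in this post.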

I have implemented mergesort and quicksort, but as you know, the
complexity of mergesort in the worst case is better than that of
quicksort, and the mergesort that I have implemented is faster than the
quicksort, but mergesort takes more space.

The recurrence equation of the complexity of mergesort is:

T(n) = 2 * T(n/2) + n

because the merging part takes O(n).

It gives:

T(n) = 2 * T(n/2) + n

= 2 * (2*T(n/4) + n/2) + n

= 4*T(n/4) + n + n

= 4*T(n/4) + 2*n

= 4 * (2*T(n/8) + n/4) + 2*n

= 8*T(n/8) + n + 2*n

= 8*T(n/8) + 3*n

= 2^k * T(n/2^k) + k*n

We want:

n/2^k = 1

n = 2^k

log(n) = k

so the recurrence equation gives:

= n*T(1) + n*log(n)

= n + (n * log(n))

So the mergesort complexity, in both the best case and the worst case, is:

n * log(n)

But the complexity of quicksort in the worst case is:

T(n) = n + T(n-1)

which gives:

T(n) = n + (n-1) + T(n-2)

T(n) = n + (n-1) + (n-2) + T(n-3)

T(n) = 1 + 2 + 3 + ... + n

so T(n) = O(n^2)

And quicksort's best case performance is:

T(n) = T(n/2) + T(n/2) + O(n)
     = 2*T(n/2) + O(n)

which gives: T(n) = O(n * log(n))

Parallelizing the Sorts

One way to parallelize the sorts is:

Divide the data among the processors.
Sort the data on the individual processors.
Merge the various data.

Note that the merge operation is a reduction operation (see the sketch below)!
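
Here is a minimal C++ sketch of that reduction step, assuming the
per-processor parts are already sorted; the runs and the element type
are illustrative, not the library's actual data structures. At each
round, adjacent pairs of sorted runs are merged in parallel, halving the
number of runs until one sorted run remains.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Reduction-style merge: repeatedly merge adjacent pairs of sorted runs
// in parallel until a single sorted run remains.
std::vector<int> reduce_merge(std::vector<std::vector<int>> runs) {
    while (runs.size() > 1) {
        std::vector<std::vector<int>> next((runs.size() + 1) / 2);
        std::vector<std::thread> workers;
        for (std::size_t i = 0; i + 1 < runs.size(); i += 2)
            workers.emplace_back([&runs, &next, i] {
                next[i / 2].resize(runs[i].size() + runs[i + 1].size());
                std::merge(runs[i].begin(), runs[i].end(),
                           runs[i + 1].begin(), runs[i + 1].end(),
                           next[i / 2].begin());
            });
        if (runs.size() % 2 == 1)            // an odd run is carried to the next round
            next.back() = std::move(runs.back());
        for (auto& t : workers) t.join();
        runs = std::move(next);
    }
    return runs.empty() ? std::vector<int>() : std::move(runs.front());
}

int main() {
    // Four already-sorted runs, as if four cores had each sorted its own part.
    std::vector<std::vector<int>> runs = {
        {1, 4, 9}, {2, 3, 8}, {0, 5, 7}, {6, 10, 11}};
    std::vector<int> all = reduce_merge(std::move(runs));
    return std::is_sorted(all.begin(), all.end()) ? 0 : 1;
}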



You can download my Parallel Sort library, which is much faster than my
Parallel Quicksort library, from:

https://sites.google.com/site/aminer68/parallel-so...




Thank you for your time.


Amine Moulay Ramdane.
