
comp.programming

About my parallel algorithm and NUMA...

Ramine

3/7/2015 5:49:00 AM

Hello...


We have to be smart, so please follow along with me...

As you have noticed, I have designed and implemented a parallel conjugate
gradient linear system solver library...

Here it is:

https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-c...

My parallel algorithm is scalable on NUMA architecture...

But you have to understand my way of designing my NUMA-aware parallel
algorithms. The first way of implementing a NUMA-aware parallel algorithm
is to implement a threadpool that schedules a job on a given thread by
specifying the NUMA node explicitly, depending on which NUMA node's
memory you will process. This way buys you about 40% more throughput on
NUMA architecture.

The other way is to use a classical threadpool without specifying the
NUMA node explicitly, but to divide your parallel memory processing
between the NUMA nodes. This is the way I have implemented my NUMA-aware
parallel algorithms. My way is scalable on NUMA architecture, but you
will get about 40% less throughput. Even with 40% less throughput, I
think my NUMA-aware parallel algorithms are scalable on NUMA
architecture and are still good enough. My next parallel sort library
will also be scalable on NUMA architecture.

Where did I get this 40% figure from? Please read here:

"Performance impact: the cost of NUMA remote memory access

For instance, this Dell whitepaper has some test results on the Xeon
5500 processors, showing that local memory access can have 40% higher
bandwidth than remote memory access, and the latency of local memory
access is around 70 nanoseconds whereas remote memory access has a
latency of about 100 nanoseconds."


Read more here:

http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-a...


As you have noticed, in my NUMA-aware algorithms I am using my classical
threadpool, and I am not scheduling the jobs by specifying an explicit
NUMA node; instead I am dividing the parallel memory processing between
the NUMA nodes. By doing so you get a scalable algorithm, but with about
40% less throughput than if you design a more optimized parallel
algorithm and threadpool that schedules the jobs by specifying a NUMA
node explicitly, so as to avoid remote memory accesses as much as
possible; that would get you about 40% more throughput. But I think my
NUMA-aware parallel algorithms that use a classical threadpool are also
good, and still good enough. If you need me to optimize my threadpool
further to get that 40% more throughput, I will do it as a next project.



Thank you,
Amine Moulay Ramdane.