Asp Forum - We have to be smart please...

Ramine

11/19/2014 2:41:00 AM

Hello,

I have just read the following PhD paper about NUMA Cohort locks...

http://dspace.mit.edu/handle/17...

and I have also read the following Master's research paper about NUMA
cohort locks...

https://cs.brown.edu/research/pubs/theses/masters/2...

So we have to be smart, please, so follow along with me...

I have read the above papers and I have completely understood their
algorithm. It uses local locks and a global lock, and it looks like a
distributed algorithm that tries to minimize the inter-socket coherence
traffic as much as possible, so I had no problem understanding it. But
I am not satisfied with those papers. Why? If you read the PhD paper
above carefully, you will notice that their benchmarks say that the
cohort lock scales to about 6x compared with a non-NUMA lock, and they
explain this 6x scaling by the fact that their cohort lock minimizes
the inter-socket coherence traffic.

I am not convinced by their explanation, because I think this 6x
scaling comes instead from Amdahl's law. There is a parallel part
inside the function that lets us enter the local locks first, which
takes around 6 CPU clocks or so. There is a serial part in their other
function, the cache-line transfer of two integers from the L2 cache to
the CPU, which takes around two CPU clocks, and there is a serial part
that spins for about 4 microseconds. So by Amdahl's law this will scale
to around 6x. The scaling does not come from minimizing the
inter-socket cache coherence traffic, as they say, but from Amdahl's
law, as I have just explained.

So if you transfer more data from the L2 cache to the CPU inside the
critical section of the cohort lock, their benchmarks will scale much
less than 6x. What I want to say is that the cohort lock doesn't bring
you much scalability if you transfer more than 4 bytes from the L2
cache to the CPU inside the critical section. So I think the non-NUMA
locks are still useful, and my scalable MLock can also be used in
realtime systems; the NUMA cohort lock cannot.
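
For readers who have not seen the papers, here is a minimal sketch of
the local-lock plus global-lock structure described above. It is not
the papers' implementation: the names `CohortLock`, `node_id` and
`max_handoffs`, and the waiter counter, are assumptions made for
illustration, and Python's `threading.Lock` stands in for the real spin
locks.

```python
import threading


class _Node:
    """Per-NUMA-node state (hypothetical helper for this sketch)."""

    def __init__(self):
        self.local = threading.Lock()  # local lock, one per node
        self.meta = threading.Lock()   # protects the waiter counter
        self.waiters = 0               # threads waiting on the local lock
        self.owns_global = False       # only touched while holding `local`
        self.handoffs = 0              # only touched while holding `local`


class CohortLock:
    def __init__(self, num_nodes, max_handoffs=64):
        self._global = threading.Lock()
        self._nodes = [_Node() for _ in range(num_nodes)]
        self._max_handoffs = max_handoffs

    def acquire(self, node_id):
        node = self._nodes[node_id]
        with node.meta:
            node.waiters += 1
        node.local.acquire()
        with node.meta:
            node.waiters -= 1
        # If a peer on the same node handed the global lock over,
        # skip the inter-node acquisition entirely.
        if not node.owns_global:
            self._global.acquire()
            node.owns_global = True
            node.handoffs = 0

    def release(self, node_id):
        node = self._nodes[node_id]
        with node.meta:
            has_waiters = node.waiters > 0
        if has_waiters and node.handoffs < self._max_handoffs:
            # Hand both locks to a waiting peer on the same node.
            node.handoffs += 1
            node.local.release()
        else:
            # Give other nodes a chance: release global, then local.
            node.owns_global = False
            self._global.release()
            node.local.release()
```

A thread running on node 0 would call `lock.acquire(0)` before its
critical section and `lock.release(0)` after it. The handoff to a
waiting thread on the same node is what keeps the lock on one socket,
and it is also where the unfairness toward other nodes comes from;
`max_handoffs` only bounds it.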

You can download my scalable MLock from

https://sites.google.com/site/aminer68/scal...

Thank you for your time.

Amine Moulay Ramdane.

3 Answers

Ramine

11/19/2014 3:10:00 AM

Hello,

Please read the PhD paper. About the benchmark that scales to 6x, it
says:

"The graph shows the average throughput in terms of
number of critical and non-critical section pairs executed per second.
The critical section accesses two distinct cache blocks (increments
4 integer counters on each block), and the non-critical section
is an idle spin loop of up to 4 microseconds."

So, as I have just explained, the serial part inside the critical
section takes around 1 clock and the parallel part inside the function
that enters the local locks takes around 6 clocks. This is why the
Amdahl's law calculation gives 6x scalability.
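
The Amdahl's law calculation mentioned above can be sketched as
follows. The clock counts (1 serial, 6 parallel) are the estimates from
this post, not measurements from the paper, and the function name
`amdahl_speedup` and the worker counts are chosen only for
illustration.

```python
from fractions import Fraction


def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / workers)


# Assumed figures from the post: 1 clock serial, 6 clocks parallel.
serial_clocks, parallel_clocks = 1, 6
p = Fraction(parallel_clocks, serial_clocks + parallel_clocks)  # p = 6/7

for n in (1, 8, 36, 256):
    print(n, float(amdahl_speedup(p, n)))
```

With these assumed figures the upper bound 1 / (1 - p) is 7x, and the
formula gives exactly 6x at 36 workers. The numbers only illustrate the
formula; they say nothing about the thread counts used in the paper.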

I hope you have understood what I want to say: the 6x scalability is
not the result of minimizing the inter-socket coherence traffic, but
the result of the parallel part inside the function that enters the
local locks and the serial part inside the critical section of the
cohort lock. This is what the PhD paper doesn't explain to you. You
also have to know that if you transfer more than 4 bytes from the L2
cache (local or remote) to the CPU, the cohort lock will scale less
and less than 6x. This is why my scalable MLock is still useful, and
my scalable MLock can be used in realtime critical systems; the cohort
lock cannot, because it is unfair.

Thank you,
Amine Moulay Ramdane.

Ramine

11/19/2014 3:46:00 AM

Hello,

About the benchmark that gives 6x scalability, the PhD paper says:

"The graph shows the average throughput in terms of
number of critical and non-critical section pairs executed per second.
The critical section accesses two distinct cache blocks (increments
4 integer counters on each block), and the non-critical section
is an idle spin loop of up to 4 microseconds."

So I think, from the benchmark, that the idle loop doesn't count as a
serial part, because it is not a critical section, and it doesn't
affect Amdahl's law by much. The critical section that accesses the two
distinct cache blocks and increments 4 integer counters on each block
will use the ILP (instruction-level parallelism) of the processor to
lower the number of CPU clocks it takes, which means it lowers the time
of the serial part in Amdahl's law. The parallel part is when you enter
the local locks, and this takes a number of CPU clocks. I think this is
what gives the 6x scaling in the Amdahl's law calculation. What I want
to explain is that it is not the mimization of the inter-socket
coherence traffic that gives the 6x scalability; it is Amdahl's law
that gives the 6x scalability.
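
To see how sensitive this Amdahl's law estimate is to the length of the
serial part, here is a small sketch. The 6-clock parallel part is the
estimate from these posts, the serial clock counts are hypothetical,
and `amdahl_limit` is a name made up for this example.

```python
def amdahl_limit(serial_clocks, parallel_clocks):
    """Upper bound on speedup with unlimited workers: (s + p) / s."""
    return (serial_clocks + parallel_clocks) / serial_clocks


PARALLEL_CLOCKS = 6  # assumed, taken from the estimate in the posts

# Hypothetical serial costs: a longer critical section lowers the bound.
for serial in (1, 2, 4, 8):
    print(serial, amdahl_limit(serial, PARALLEL_CLOCKS))
```

Under these assumptions the bound falls as the critical section gets
longer, and rises when something like ILP shortens it.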

Thank you,
Amine Moulay Ramdane.

Ramine

11/19/2014 4:02:00 AM

On 11/18/2014 7:45 PM, Ramine wrote:
>
> Hello,
>
>
> The PhD paper says about the benchmarks that gives 6x scalability that:
>
>
> "The graph shows the average throughput in terms of
> number of critical and non-critical section pairs executed per second.
> The critical section accesses two distinct cache blocks (increments
> 4 integer counters on each block), and the non-critical section
> is an idle spin loop of up to 4 microseconds."
>
>
> So i think from the benchmark that the idle loop doen't count as
> a serial part cause it is not a critical section and it doesn't affect
> by much the Amdahl's law, so the critical section that accesses
> the two distinct cache blocks and increments 4 integer counters on each
> block will use ILP (instruction level parallelism) of the processor
> to lower the amount of CPU clocks that it takes in the critical section
> that means it will lower the time of serial part of the Amdahl's law,
> and the parallel part is when you enter the the local locks and this
> will take a number of CPU clocks, and i think this is what is giving the
> 6x scaling with the Amdahl's law calculation, and what i want to explain
> is that it is not the mimization of the inter-socket coherence traffic

I mean "minimization", not mimization.

> that is giving the 6x scalability, but it's the Amdahl's law that is
> giving 6x scalability.
>
>
>
> Thank you,
> Amine Moulay Ramdane.

comp.programming
