Wednesday, February 3, 2016

[Links] multicore programming in lock contention phenomenon





Note: The full text of this article posted from drzhouweiming column (http://blog.csdn.net/drzhouweiming), there is an illustration of this article, you can view the original here: http: //blog.csdn.net/drzhouweiming/archive/2007 /04/10/1559718.aspx




Multicore Programming lock contention phenomenon

The first few puzzles an explanation multicore programming and Countermeasures (a problem) mentioned in the article will lock competition with the auditor CPU serialization exacerbated by the proliferation of the phenomenon, the article came to multicore programming lock contention in-depth analysis.
For simplicity, we look at a simple case, assume that there are four pairs and other tasks at the same time start to run, assuming that the beginning of each mission there is a need to protect the operation of the lock, time-consuming to 1, the rest of each task Processed 25. This operation started after running several tasks as shown below:

Figure 1: Lock competition for tasks such as schematic
In the figure above, we can see that the first direct execution of a task to the end, with no waiting, waiting for the first two tasks one time unit, the first three tasks to wait two time units, the first three tasks waiting 3 time units.
Thus there are a total of three CPU wait six time units, if these tasks are using OpenMP where all tasks are performed at the same point on waiting to perform all the tasks completed when then execute down, then the total running time will The fourth task is the same 29 time units, the acceleration factor: (1 + 4 × 25) / 29 = 3.48
Even with an average time of 27.5 four tasks to be calculated, acceleration factor = 101 / 27.5 = 3.67
Eminem is calculated in accordance with the laws Jorda acceleration factor, then the above applications, the serial time is 1, the total time of parallel processing into the time after a serial of 100 units, if placed on a 4-core CPU running it, acceleration factor = p / (1 + (p-1) * f) = 4 / (1+ (4-1) * 1/101) = 404/104 = 3.88
This creates a strange question, then use the lock, even Eminem Jorda Law acceleration factor calculated acceleration factor not like, let alone with the acceleration factor of the Gustafson's Law.

In fact, the top four lock competition task can be extended to the more general case, assuming that there is a lock to protect the serialization time is 1, can be parallelized to run part time on a single core CPU for t, CPU core number of p , then in the p run simultaneously on a task, the total waiting time caused by lock contention is: 1 + 2 + ... + p = p * (p-1) / 2
One of the most time-consuming task time used as: (p-1) + t / p
Use the most time-consuming task that takes time to run in parallel as time, acceleration factor as
S (p) = t / (p-1 + t / p) = p * t / (p * (p-1) + t) (acceleration factor formula under lock contention)
This formula shows that there is a lock in the competition, if a nuclear fixed number of cases, the larger part of parallelized, then the acceleration factor will be greater. In parallel of time is fixed, if the CPU core a few more, then the acceleration factor will be smaller.
Or calculate several practical examples to illustrate the effect of the above formula:
So that t = 100, p = 4, the acceleration coefficient = 4 × 100 / (4 * (4-1) +100) = 3.57
So that t = 100, p = 16, the acceleration factor = 16 × 100 / (16 * (16-1) +100) = 4.7
So that t = 100, p = 64, the acceleration factor = 64 × 100 / (64 * (64-1) +100) = 1.54
So that t = 100, p = 128, acceleration factor = 128 × 100 / (128 * (128-1) +100) = 0.78
As it can be seen from the above calculation, when the number of multi-core to a certain time, acceleration factor not only increase but decreased, the number increased to 128 nuclear, acceleration coefficient of only 0.78, not as in the single-core CPU running speed.
The above example, the lock protection resulting serial code is invoked when the task starts, in fact, a serial code lock protection and other missions in other parts of the call is the same.
Competition phenomena lock type tasks in practical situations are very common, such as server software, usually each client processing tasks are equal, if a lock on the inside, then it is likely to cause the above said acceleration factor With an increase in the number of CPU cores and the phenomenon of decline.
Previous server software generally runs on a dual CPU or quad CPU machine, so acceleration factor resulting decline in lock contention is not obvious, after entering the multicore era, with the increase in the number of CPU core, this issue will become very serious, so multicore Times of programming presented new challenges. Programming ideas previously put under multitasking on multicore programming is not necessarily feasible.
So simply that previous multicore programming and multi-tasking programming or parallel computing equivalent words are unrealistic, speaking serialization problems that article put forward some countermeasures settlement, but those have yet to be the industry's continued efforts Countermeasures in order to do it.
Of course, due to the current market sales of multi-core CPU or dual-core and quad-core, until the 16-core CPU more mass to enter the market and possibly for several years, I believe lock industry in the next few years be able to face other tasks on the Competition to find a better solution.

The Author: Wei-Ming Zhou, Freelance, in the software industry more than ten years. Currently focused on the content of software testing, multi-core programming, software design and other basic aspects. Written "multi-task under Data Structures and Algorithms," a book, and is currently writing "Software Testing Practice," a book, plans to write a multi-core programming books in the near future.

Reply:

1) multi-core parallel processing needs to lock data. Is single Nucleation can save some time and do not need to lock code?

2)
So that t = 100, p = 4, the acceleration coefficient = 4 × 100 / (4 * (4-1) +100) = 3.57
So that t = 100, p = 16, the acceleration factor = 16 × 100 / (16 * (16-1) +100) = 4.7
So that t = 100, p = 64, the acceleration factor = 64 × 100 / (64 * (64-1) +100) = 1.54
So that t = 100, p = 128, acceleration factor = 128 × 100 / (128 * (128-1) +100) = 0.78

These data can not be used to illustrate the multi-core itself, the problem lies in the time t = 100 the number is too small, if taken t = 10000
See what happens:
Order t = 100, p = 4, acceleration factor = 3.995
So that t = 100, p = 16, the acceleration coefficient = 15.625
Order t = 100, p = 64, acceleration factor = 41.984
Order t = 100, p = 128, acceleration factor = 48.751

Here 128 nuclear has shown great efficiency! Thus, there can be noted that:

If the time is not long serial processing of data, use of multi-core parallel processing, and will not necessarily significantly reduce processing time, and may even increase the time! If the time series data processing is not very long, there is no need to parallel processing of many nuclear, because the efficiency is not linearly increase efficiency, serial processing of data corresponding to different times, there will be an ideal number of nuclei.

3) In any case, here not because of the poor (the case) the efficiency of the parallel processing of data and a negative efficiency of the entire value of the multi-core, in fact, a lot of multi-core parallel processing also can not directly relevant, no lock to protect data.

4) Maybe I'm on the "lock" different understanding with the author? Here the "lock" refers to multi-threaded programming in the "lock" or refers to multi-core internal processes required for the operation and use of any special coordination "lock" (in this regard I do not know ^ _ ^)?
































Reply:

Sorry, the above data (t = 10000) entered incorrectly, correct as follows:

These data can not be used to illustrate the multi-core itself, the problem lies in the time t = 100 the number is too small, if taken t = 10000
See what happens:
Order t = 10000, p = 4, acceleration factor = 3.995
Order t = 10000, p = 16, acceleration factor = 15.625
Order t = 10000, p = 64, acceleration factor = 41.984
Order t = 10000, p = 128, acceleration factor = 48.751
Reply:
top!!!!
Reply:
It said "spin lock."
Reply:
The landlord does not know how to sell books, I also want a book.

Task management Time = Time + computation time
Acceleration rate = original computation time / (task time × auditor)
Where the original computation time is fixed.

Management time, the smaller the acceleration rate.
When managing time & gt; when computing time, accelerate the rate of less than 1. In this case, each task waiting time to unlock greater than computation time.

The problem of declining acceleration factor of lock contention lead author says, in fact, does not exist. Server-side processing of a document sheet Processed 10,000 units (with lock as a unit consuming) or more, relatively speaking, very little administrative overhead. And it has a large number of documents, as well as a single file on a different piece of the server; a plurality of service end while reading a file a document with the possibility of very small pieces.
Reply:
Add that, for the read-only file, it can not be locked.
Reply:
mark
Reply:
mark
Reply:

Linux multicore debugging environment- Intel + Totalview combination
!Currently, in the software development industry, a variety of excellent performance debugging tools abound. However, most of them only support windows environment. Even if we can support linux platform, the operation is also very convenient. Thus, for long-term program to write on linux developers, how to debug it becomes a matter of headache! Intel software and Totalview Debugger is in this case came into being!
Intel software can produce excellent application performance on Intel architecture, and can take advantage of the advanced features of the latest Intel multi-core processors. Combine TotalView Debugger and Intel software debugging tools will set off a revolution under linux!
TotalView Debugger is a debugging tool for linux platform parallel environment, its IDE environment, multithreading (process) debug capabilities, memory debugging capabilities, cluster debugging capabilities are unmatched in the industry!
XLsoft to join Intel, TotalView company on October 30, 2008 held a "multicore debugging Linux environment" free training seminar in Shanghai. We are very pleased to invite you to participate in, and offers free software trial CD!

I. Registration
Online registration page:
http://www.xlsoft.com.cn/TotalView/TotalView_download.asp

Registration Hotline: 021-62128912 / 010-84492749
Registration Email: Marketing@xlsoft.com.cn

Second, Lectures:
1. Linux platform under debugging tools overview
2. Intel Software Features
3. Totalview Debugger Features

Third, talk time:
2008 年 10 月 30 日 (星期四) 14:00 ~ 17:00

Fourth, Venue:
Shanghai Pine City Hotel, 3rd Floor, Hyatt long hall
(Xujiahui Zhao Jia Bang Road, No. 777 Dongan junction, about 15 minutes from Hengshan Road Station)

Fourth, the activity details:
Contact: Juan
Tel: 021-62128916 Mobile: 15000262606
E-mail: kiko.wang@xlsoft.com.cn

Hotline:
021-62128912 010-84492749
More service information, please contact us Marketing@xlsoft.com.cn or contact information.

Shanghai Information Technology Co., Ltd. World-wide software
Shanghai Tel: 021-62128912 Beijing: 010-84492749

1 comment:

  1. Casino of the Day in Las Vegas | Mapyro
    Welcome 강릉 출장마사지 to 김해 출장샵 the Casino of the Day in 순천 출장샵 Las Vegas, Nevada. The Casino of the Day 안동 출장안마 is an 18-story high-rise building in Paradise, Nevada, 통영 출장안마 U.S.A.. View detailed

    ReplyDelete