Because the concurrent L1-minimization (L1-min) problem arises in many real applications, in this paper we investigate how to solve it in parallel on GPUs. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplications $Ax$ (GEMV) and $A^{T}x$ (GEMV-T).

Because the solution of the L1-min problem is sparse, the L1-min model has been successfully applied in various fields such as signal processing [

where

Given their many-core structure, graphics processing units (GPUs) have sufficient computation power for scientific computations. Processing big data on GPUs has drawn much attention over recent years [

Due to the high compute capacity of GPUs, accelerating algorithms for solving the L1-min problem on the GPU has attracted considerable attention recently [

However, for the implementations of $Ax$ and $A^{T}x$

In this paper, we further investigate the design of effective algorithms that are used to solve the L1-min problem on GPUs. Different from other publications [

- Two novel adaptive, optimized GPU-accelerated implementations of the matrix-vector multiplication are proposed.

- Two optimized concurrent L1-min solvers on a GPU are presented, designed from the perspective of streams and thread blocks, respectively.

- Utilizing new GPU features and the technique of merging kernels, an optimized concurrent L1-min solver on multiple GPUs is proposed.

The remainder of this paper is organized as follows. In Section 2, we describe the fast iterative shrinkage-thresholding algorithm. Two adaptive optimization implementations of the matrix-vector multiplication on the GPU and the vector-operation and inner-product decision trees are described in Section 3. Sections 4 and 5 give two concurrent L1-min solvers on a GPU and a concurrent L1-min solver on multiple GPUs, respectively. Experimental results are presented in Section 6. Section 7 contains our conclusions and points to our future research directions.

The L1-min problem is known as the basis pursuit (BP) problem [
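For reference, the BP formulation is standard in the compressed-sensing literature; the notation below ($A \in \mathbb{R}^{m \times n}$ with $m \ll n$, observation $b \in \mathbb{R}^{m}$) is assumed here rather than taken from the truncated text:

```latex
\min_{x \in \mathbb{R}^{n}} \; \|x\|_{1}
\quad \text{subject to} \quad Ax = b .
```

Iterative shrinkage methods such as FISTA typically address the unconstrained counterpart $\min_{x} \tfrac{1}{2}\|Ax - b\|_{2}^{2} + \lambda \|x\|_{1}$.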

The fast iterative shrinkage-thresholding algorithm (FISTA) is an accelerated variant of the iterative shrinkage-thresholding algorithm (ISTA), and achieves a non-asymptotic convergence rate of $O(1/k^{2})$, where $k$ is the iteration counter.

where $A^{T}$ denotes the transpose of the matrix $A$.

The main components of FISTA include the matrix-vector multiplications $Ax$ and $A^{T}x$, vector operations, and inner products of vectors.
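The iteration can be sketched in a few lines of NumPy. This is a minimal sketch of standard FISTA for the $\ell_1$-regularized least-squares form, not the paper's GPU code; the step size $1/L$ with $L = \|A\|_{2}^{2}$, the function names, and the parameter `lam` are assumptions. The comments mark where the GEMV and GEMV-T building blocks appear:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (elementwise shrinkage).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(A, b, lam=0.01, iters=200):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 with FISTA."""
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(n)
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)           # GEMV (A @ y) and GEMV-T (A.T @ r)
        x_new = soft_threshold(y - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum step
        x, t = x_new, t_new
    return x
```

Each iteration is dominated by the two matrix-vector products, which is why the paper focuses on optimizing GEMV and GEMV-T.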

Symbol | Remark |
---|---|
 | Number of threads per block |
 | Number of blocks per grid |
 | Number of streaming multiprocessors |
 | Maximum number of 32-bit registers per multiprocessor |
 | Maximum amount of shared memory per multiprocessor |
 | Maximum number of blocks per multiprocessor |
 | Maximum number of threads per multiprocessor |

Given that the GEMV, ^{i} (the

where

The GEMV kernel is mainly composed of the following three steps:

where
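The warp organization behind the kernel can be illustrated with a small host-side simulation. This is a sketch, not the paper's CUDA code: the warp size of 32 and the lane-strided (coalesced) element assignment are standard CUDA facts, while the exact partitioning of the paper's self-adaptive scheme is assumed:

```python
import numpy as np

WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs

def warp_gemv_row(A_row, x):
    """Simulate one warp computing a single row of y = A @ x.

    Lane l handles elements l, l + 32, l + 64, ... (coalesced access);
    the 32 partial sums are then combined by a tree reduction, as a
    warp-shuffle reduction would do on the GPU.
    """
    partial = np.zeros(WARP_SIZE)
    for lane in range(WARP_SIZE):
        partial[lane] = A_row[lane::WARP_SIZE] @ x[lane::WARP_SIZE]
    offset = WARP_SIZE // 2
    while offset > 0:                      # tree reduction: 16, 8, 4, 2, 1
        partial[:offset] += partial[offset:2 * offset]
        offset //= 2
    return partial[0]

def gemv(A, x):
    # One warp per row; on the GPU many warps run concurrently.
    return np.array([warp_gemv_row(row, x) for row in A])
```

Assigning one warp per row works well for the wide, short matrices used here (few rows, millions of columns), since each warp then streams over a long contiguous row.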

The GEMV-T, ^{T}^{T}^{i} (the

Similar to the GEMV kernel, our proposed GEMV-T kernel is also composed of the

$A^{T}x$

For these elements of


When parallelizing FISTA on the GPU, vector-operation and inner-product kernels are needed. Although CUBLAS shows good performance for vector operations and inner products of vectors, using CUBLAS does not allow grouping several operations into a single kernel. To optimize these operations, we group several of them into a single kernel. Therefore, we adopt the idea of constructing the vector-operation and inner-product decision trees suggested by Gao et al. [
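The benefit of grouping can be shown with a sketch. The fused update below is hypothetical (the function name and the particular grouped operations are illustrative, not the paper's kernels); it combines a FISTA-style momentum step with the two inner products a stopping test would need, reading each vector once instead of launching four separate BLAS kernels:

```python
import numpy as np

def fused_update(x_new, x_old, coef):
    """One fused pass: momentum update plus two reductions.

    Separate BLAS calls (axpy, copy, dot, dot) would launch four kernels
    and stream the vectors through memory several times; a single fused
    kernel makes one pass. On the GPU this loop is one thread per
    element followed by a block-level reduction.
    """
    y = np.empty_like(x_new)
    diff_sq = 0.0      # ||x_new - x_old||^2, for a relative-change test
    norm_sq = 0.0      # ||x_new||^2
    for i in range(len(x_new)):
        d = x_new[i] - x_old[i]
        y[i] = x_new[i] + coef * d
        diff_sq += d * d
        norm_sq += x_new[i] * x_new[i]
    return y, diff_sq, norm_sq
```

The saving comes from memory traffic and kernel-launch overhead, not arithmetic: each fused operation is bandwidth-bound.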

Assume that

When

Otherwise, on the basis of the GEMV kernel, we construct a new kernel GEMV Kernel-I to calculate the GEMV on the GPU. In this kernel, each row of the first

Similar to the GEMV kernel, for the GEMV-T kernel, we assume that

When

Otherwise, on the basis of the GEMV-T kernel, we construct a new kernel GEMV-T Kernel-I to calculate the GEMV-T on the GPU. In this kernel, each row of the first

In this section, based on FISTA, we present two concurrent L1-min solvers on a GPU, which are designed from the perspective of the streams and the thread blocks, respectively.

Utilizing the multi-stream feature of the GPU, on the basis of FISTA, we design a concurrent L1-min solver, called CFISTASOL-SM, to solve the concurrent L1-min problem. Given that the L1-min problems included in the concurrent L1-min problem can be computed independently, each of them is assigned to a stream in the proposed CFISTASOL-SM.

For

For a specific GPU, the maximum number of resident thread blocks can be calculated as
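A sketch of such a bound, using the per-SM limits listed in the symbol table; the bound is the tightest of the per-SM limits multiplied by the number of SMs. The per-block resource usage in the example call is hypothetical (the SM limits follow the GTX1070's published specifications, but verify against your target device):

```python
def max_resident_blocks(num_sms,
                        regs_per_sm, smem_per_sm,
                        max_blocks_per_sm, max_threads_per_sm,
                        threads_per_block, regs_per_thread, smem_per_block):
    """Blocks simultaneously resident on the whole GPU: the tightest
    of the four per-SM limits, times the number of SMs."""
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    by_threads = max_threads_per_sm // threads_per_block
    per_sm = min(by_regs, by_smem, max_blocks_per_sm, by_threads)
    return num_sms * per_sm

# Hypothetical example: a GTX1070-like GPU (15 SMs, 64K registers and
# 96 KB shared memory per SM, at most 32 blocks / 2048 threads per SM)
# running 256-thread blocks using 32 registers/thread and 8 KB shared memory.
blocks = max_resident_blocks(15, 65536, 96 * 1024, 32, 2048, 256, 32, 8 * 1024)
```

Here the register and thread limits both cap residency at 8 blocks per SM, so 120 blocks fit on the whole device.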

When the

For the concurrent L1-min problem, if it includes a great number of L1-min problems, we can easily construct a solver on multiple GPUs by letting each GPU execute CFISTASOL-SM or CFISTASOL-TB. Here, however, we design a concurrent L1-min solver on multiple GPUs, called CFISTASOL-MGPU, where each GPU solves only one L1-min problem at a time instead of solving multiple L1-min problems via streams or thread blocks. This solver is suited to the case where the number of L1-min problems included in the concurrent L1-min problem is much smaller than the number of streams or thread blocks.
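The dispatch policy can be sketched on the host side as follows. This is a Python sketch with one worker thread per GPU; `solve_one` is a hypothetical stand-in for a single-GPU FISTA solve, and the shared queue ensures each GPU works on exactly one problem at a time:

```python
from queue import Queue, Empty
from threading import Thread

def solve_concurrent(problems, num_gpus, solve_one):
    """Each GPU worker pulls the next problem from a shared queue when
    it finishes the current one, so no GPU ever runs two solves at once.
    `solve_one(gpu_id, problem)` is a placeholder for a single solve."""
    q = Queue()
    for item in enumerate(problems):
        q.put(item)                      # (index, problem) pairs
    results = [None] * len(problems)

    def worker(gpu_id):
        while True:
            try:
                idx, prob = q.get_nowait()
            except Empty:
                return                   # no problems left for this GPU
            results[idx] = solve_one(gpu_id, prob)

    workers = [Thread(target=worker, args=(g,)) for g in range(num_gpus)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

With far fewer problems than streams or thread blocks, this one-problem-per-GPU policy keeps every device fully occupied by a single large solve rather than fragmenting it.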

In this section, we first investigate the effectiveness of our proposed GEMV and GEMV-T kernels by comparing them with GEMV and GEMV-T implementations in the CUBLAS library [

Hardware | K40c | GTX1070 |
---|---|---|
Cores | 2880 | 1920 |
Clock speed (GHz) | 0.74 | 1.56 |
Memory type | GDDR5 | GDDR5 |
Memory size (GB) | 12 | 8 |
Max-bandwidth (GB/s) | 288 | 256 |
Compute capability | 3.5 | 6.1 |

Seq | Matrix | Rows | Columns |
---|---|---|---|
1 | Mat01 | 32 | 8,388,608 |
2 | Mat02 | 50 | 5,368,709 |
3 | Mat03 | 64 | 4,194,304 |
4 | Mat04 | 100 | 2,684,350 |
5 | Mat05 | 128 | 2,097,152 |
6 | Mat06 | 200 | 1,342,200 |
7 | Mat07 | 256 | 1,048,576 |
8 | Mat08 | 400 | 671,100 |
9 | Mat09 | 512 | 524,288 |
10 | Mat10 | 800 | 335,850 |
11 | Mat11 | 1024 | 262,144 |
12 | Mat12 | 1600 | 166,900 |

First, we compare GEMV and GEMV-T kernels with the implementations in the CUBLAS library [

Second, we use the GTX1070 to investigate whether the proposed GEMV and GEMV-T kernels can alleviate the performance fluctuations of CUBLAS. The test matrix sets are as follows: 1) Set 1:

Based on the above observations, we conclude that our proposed GEMV and GEMV-T kernels improve on those suggested by Gao et al., achieve high performance, and are able to alleviate the performance fluctuations of CUBLAS.

In this section, we test the performance of our proposed CFISTASOL-TB and CFISTASOL-SM. Given that the L1-min problems included in the concurrent L1-min problem can be computed independently, for comparison we use the FISTA implementation on the CPU using the BLAS library (denoted by BLAS), the FISTA implementation using the CUBLAS library (denoted by CUBLAS), and the FISTA solver (denoted by GAO) that is proposed in [

Prob | BLAS | CUBLAS | GAO | TB | SM | TB vs BLAS | TB vs CUBLAS | TB vs GAO | SM vs BLAS | SM vs CUBLAS | SM vs GAO |
---|---|---|---|---|---|---|---|---|---|---|---|
01 | 2962.98 | 1307.39 | 256.26 | 39.28 | 58.95 | 75.43 | 33.28 | 6.52 | 50.26 | 22.18 | 4.35 |
02 | 2203.57 | 847.17 | 167.35 | 38.42 | 40.05 | 57.35 | 22.05 | 4.36 | 55.02 | 21.15 | 4.18 |
03 | 2371.43 | 641.59 | 190.00 | 32.83 | 32.84 | 72.23 | 19.54 | 5.79 | 72.22 | 19.54 | 5.79 |
04 | 1832.85 | 436.34 | 134.57 | 31.11 | 35.29 | 58.92 | 14.03 | 4.33 | 51.93 | 12.36 | 3.81 |
05 | 2050.07 | 342.21 | 158.41 | 29.46 | 29.40 | 69.58 | 11.61 | 5.38 | 69.72 | 11.64 | 5.39 |
06 | 1670.28 | 246.36 | 120.09 | 29.03 | 29.42 | 57.53 | 8.49 | 4.14 | 56.77 | 8.37 | 4.08 |
07 | 1925.79 | 206.70 | 141.73 | 27.55 | 27.47 | 69.90 | 7.50 | 5.14 | 70.11 | 7.53 | 5.16 |
08 | 1568.12 | 223.16 | 126.04 | 28.12 | 29.09 | 55.77 | 7.94 | 4.48 | 53.90 | 7.67 | 4.33 |
09 | 1851.47 | 188.22 | 147.64 | 26.81 | 26.71 | 69.05 | 7.02 | 5.51 | 69.31 | 7.05 | 5.53 |
10 | 1558.87 | 159.74 | 115.13 | 28.09 | 28.74 | 55.49 | 5.69 | 4.10 | 54.24 | 5.56 | 4.01 |
11 | 1753.49 | 151.16 | 111.93 | 28.71 | 28.63 | 61.07 | 5.27 | 3.90 | 61.24 | 5.28 | 3.91 |
12 | 1487.32 | 121.96 | 114.12 | 29.15 | 29.16 | 51.03 | 4.18 | 3.92 | 51.02 | 4.18 | 3.92 |

Prob | BLAS | CUBLAS | GAO | TB | SM | TB vs BLAS | TB vs CUBLAS | TB vs GAO | SM vs BLAS | SM vs CUBLAS | SM vs GAO |
---|---|---|---|---|---|---|---|---|---|---|---|
01 | 2962.98 | 413.46 | 156.09 | 18.84 | 29.76 | 157.29 | 21.95 | 8.29 | 99.57 | 13.89 | 5.25 |
02 | 2203.57 | 345.98 | 123.19 | 19.98 | 21.26 | 110.29 | 17.32 | 6.17 | 103.66 | 16.28 | 5.8 |
03 | 2371.43 | 300.18 | 129.08 | 16.24 | 17.38 | 146.03 | 18.48 | 7.95 | 136.47 | 17.28 | 7.43 |
04 | 1832.85 | 269.14 | 110.25 | 16.11 | 19.84 | 113.74 | 16.7 | 6.84 | 92.36 | 13.56 | 5.56 |
05 | 2050.07 | 259.65 | 116.27 | 14.99 | 16.2 | 136.73 | 17.32 | 7.75 | 126.56 | 16.03 | 7.18 |
06 | 1670.28 | 416.32 | 102.91 | 15.02 | 16.61 | 111.18 | 27.71 | 6.85 | 100.54 | 25.06 | 6.19 |
07 | 1925.79 | 473.56 | 101.74 | 14.04 | 15.24 | 137.19 | 33.74 | 7.25 | 126.37 | 31.07 | 6.68 |
08 | 1568.12 | 461.44 | 99.67 | 14.63 | 16.81 | 107.21 | 31.55 | 6.81 | 93.26 | 27.44 | 5.93 |
09 | 1851.47 | 99.26 | 100.29 | 13.66 | 14.57 | 135.55 | 7.27 | 7.34 | 127.04 | 6.81 | 6.88 |
10 | 1558.87 | 130.62 | 97.37 | 13.61 | 15.39 | 114.55 | 9.6 | 7.16 | 101.26 | 8.48 | 6.33 |
11 | 1753.49 | 95.84 | 96.58 | 13.29 | 14.22 | 131.97 | 7.21 | 7.27 | 123.32 | 6.74 | 6.79 |
12 | 1487.32 | 95.05 | 95.7 | 13.25 | 14.66 | 112.25 | 7.17 | 7.22 | 101.47 | 6.48 | 6.53 |

On both GPUs, TB and SM outperform BLAS, CUBLAS and GAO for all test cases (

We take two GPUs and four GPUs as examples to test the performance of our proposed CFISTASOL-MGPU. The test setting is the same as in Section 6.2.

From

From the experimental results, we observe that CFISTASOL-TB is slightly better than CFISTASOL-SM, and that CFISTASOL-SM has an advantage over CFISTASOL-MGPU. In fact, each of these solvers has its own advantages.

Num | CFISTASOL-TB | CFISTASOL-SM | CFISTASOL-MGPU |
---|---|---|---|
30 | 6.6251 | 7.3302 | 7.8149 |
15 | 6.4342 | 3.6253 | 3.8074 |
4 | 6.0126 | 3.1398 | 0.9768 |

We investigate how to solve the concurrent L1-min problem in this paper, and present two concurrent L1-min solvers on a GPU and a concurrent L1-min solver on multiple GPUs. Experimental results show that our proposed concurrent L1-min solvers are effective, and have high parallelism.

In future work, we will continue our research in this field and apply the proposed algorithms to more practical problems in order to further improve them.

$\ell_{1}$-norm minimization problem when the solution may be sparse

$\ell_{1}$-minimization algorithms for robust face recognition