Parallelism#

Throughout adelie, there are multiple places where the user can specify the number of (OpenMP) threads. We have made the package flexible enough that the user has fine-grained control over the number of threads in certain parts of the computation, e.g. linear algebra routines within the solver or inner products with a matrix. However, this flexibility also opens the door to common pitfalls. In this notebook, we cover some tips on how to use parallelism properly.

[1]:
import adelie as ad
import numpy as np

Examples of Parallelism#

In adelie, the two most common places where parallelism occurs are the solver and the matrix class. There is no strict requirement that a matrix class have parallelized routines; however, most of the provided matrix classes do support parallelism.

As an example, we show below how to specify the number of threads for a dense matrix class.

[2]:
n = 100         # number of observations
p = 1000        # number of features
n_threads = 4   # number of threads
seed = 0        # seed

np.random.seed(seed)
X = np.random.normal(0, 1, (n, p))
X_wrap_1 = ad.matrix.dense(X, method="naive")   # default n_threads=1
X_wrap_4 = ad.matrix.dense(X, method="naive", n_threads=n_threads)

How multithreading is used internally is an implementation-specific detail and is generally not something the average user needs to worry about.

The following shows how to specify the number of threads for the solver.

[3]:
# create response vector
y = X[:, -1] * np.random.normal(0, 1) + np.random.normal(0, 1, n)
state = ad.grpnet(
    X=X,
    glm=ad.glm.gaussian(y=y),
    n_threads=n_threads,
)
 46%|     | 46/100 [00:00:00<00:00:00, 1339.68it/s] [dev:90.6%]

Common Number of Threads#

One pitfall is specifying different numbers of threads (greater than 1) in different sections of the algorithm. OpenMP incurs a large cost when switching the number of threads (at least on some machines), so we advise users to only ever specify two possible values: 1 or n_threads, where n_threads is some user-specified number. For example, we advise against the following:

[4]:
%%time
X_wrap = ad.matrix.dense(X, method="naive", n_threads=2)
state = ad.grpnet(
    X=X_wrap,
    glm=ad.glm.gaussian(y=y),
    n_threads=4,
)
 46%|     | 46/100 [00:00:00<00:00:00, 72.68it/s] [dev:90.6%]
CPU times: user 927 ms, sys: 344 ms, total: 1.27 s
Wall time: 636 ms

Instead, we advise any of the following configurations:

[5]:
%%time
X_wrap = ad.matrix.dense(X, method="naive", n_threads=1)
state = ad.grpnet(
    X=X_wrap,
    glm=ad.glm.gaussian(y=y),
    n_threads=4,
)
 46%|     | 46/100 [00:00:00<00:00:00, 1442.81it/s] [dev:90.6%]
CPU times: user 99.3 ms, sys: 11.3 ms, total: 111 ms
Wall time: 35.8 ms
[6]:
%%time
X_wrap = ad.matrix.dense(X, method="naive", n_threads=4)
state = ad.grpnet(
    X=X_wrap,
    glm=ad.glm.gaussian(y=y),
    n_threads=1,
)
 46%|     | 46/100 [00:00:00<00:00:00, 1544.51it/s] [dev:90.6%]
CPU times: user 92.8 ms, sys: 6.47 ms, total: 99.3 ms
Wall time: 32.3 ms
[7]:
%%time
X_wrap = ad.matrix.dense(X, method="naive", n_threads=4)
state = ad.grpnet(
    X=X_wrap,
    glm=ad.glm.gaussian(y=y),
    n_threads=4,
)
 46%|     | 46/100 [00:00:00<00:00:00, 1219.56it/s] [dev:90.6%]
CPU times: user 132 ms, sys: 9.6 ms, total: 142 ms
Wall time: 40.1 ms

We clearly see about a 20x speedup simply by keeping the number of threads consistent throughout the algorithm!

Optimal Number of Threads#

There remains the question of the optimal number of threads to use. This is completely application- and hardware-dependent and cannot be answered universally. Our only advice is to experiment with the number of threads (while following the rest of the tips). In our experience, single-threaded runs are actually the fastest most of the time, i.e. with small to moderately sized data, because the OpenMP thread management cost dominates. We only recommend increasing the number of threads for large data. Even then, you may see negligible improvements, since multithreading is often bottlenecked by memory bandwidth. Machines with more NUMA nodes, larger RAM, or faster RAM access (e.g. large machines in a cluster) will therefore reap more benefit from parallelism.
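As a starting point for such an experiment, one could simply time the solver across a few thread counts on one's own data. Below is a minimal sketch that reuses X and y from above; the particular thread counts are arbitrary, and the results are entirely machine-dependent.

[8]:
import time

# Sketch of a benchmark loop over a few thread counts (machine-dependent results).
# We keep the matrix and solver thread counts equal, per the advice above.
for nt in [1, 2, 4, 8]:
    X_wrap = ad.matrix.dense(X, method="naive", n_threads=nt)
    start = time.perf_counter()
    state = ad.grpnet(
        X=X_wrap,
        glm=ad.glm.gaussian(y=y),
        n_threads=nt,
    )
    print(f"n_threads={nt}: {time.perf_counter() - start:.3f}s")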

Thread Safety#

We do not assume that class member functions are thread safe!

This means that calling .bmul() on a matrix class or .gradient() on a GLM class, for example, from multiple threads simultaneously is undefined behavior. This is worth mentioning for those who wish to interact with the matrix and GLM classes beyond the simple use-case of passing them into the solver. It is guaranteed that the solver will not invoke such member functions in a parallel fashion. This design was chosen mostly for simplicity, under the assumption that the primary use-case is to pass these objects to the solver. Many of the provided classes internally contain buffers used to accelerate the member function computations by minimizing on-the-fly allocation costs, and it would take a considerable amount of extra work to make them thread safe.
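If you do need to drive these classes from multiple Python threads, e.g. to fit several responses at once, a safe pattern is to give each worker its own objects rather than sharing one instance across threads. Below is a minimal sketch under that assumption; the helper fit_one and the use of ThreadPoolExecutor are illustrative only, and any actual speedup depends on factors outside this pattern (such as whether the solver releases the GIL on your build).

[9]:
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: fit several responses concurrently by giving each
# worker its OWN matrix and GLM objects (shared instances are not thread safe).
def fit_one(y_k):
    X_wrap = ad.matrix.dense(X, method="naive", n_threads=1)  # fresh object per worker
    return ad.grpnet(
        X=X_wrap,
        glm=ad.glm.gaussian(y=y_k),
        n_threads=1,  # avoid nesting OpenMP threads inside Python threads
    )

ys = [X[:, k] + np.random.normal(0, 1, n) for k in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    states = list(pool.map(fit_one, ys))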