Core concepts
Transfer Tuning
This section explains Transfer Tuning, a core concept that Daisytuner uses to optimize performance.
Loop Scheduling
Modern processors have many cores and a hierarchy of caches to speed up access to comparatively slow main memory. Utilizing the cores and caches efficiently is paramount for optimal performance; otherwise, the processor idles most of the time, waiting for data from main memory.
Since loops and loop nests are usually the hotspots of parallel computation and data processing, they are the primary target of optimizations. Loop scheduling describes precisely this problem: optimizing the loop structure for optimal performance using code transformations.
Complexity of Loop Scheduling
Depending on how many loops and arrays are involved in a loop nest, the complexity of the optimization grows exponentially. Typical optimizations include finding the optimal order of loops, parallelization, vectorization, and deciding whether to store parts of arrays in temporary variables for repeated accesses.
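As a small, self-contained illustration of why loop order matters (this sketch is ours, not part of Daisytuner), the two functions below compute the same matrix-vector product with swapped loop orders. Both are correct; on row-major storage, the i-j order walks each row of A contiguously and is typically cache-friendly, while the j-i order strides across rows:

```python
def mxv_ij(A, x):
    """Matrix-vector product, i-j loop order (row-wise traversal)."""
    n, m = len(A), len(A[0])
    y = [0.0] * n
    for i in range(n):          # outer loop over rows
        for j in range(m):      # inner loop walks one row contiguously
            y[i] += A[i][j] * x[j]
    return y

def mxv_ji(A, x):
    """Same product, j-i loop order (column-wise, strided traversal)."""
    n, m = len(A), len(A[0])
    y = [0.0] * n
    for j in range(m):          # outer loop over columns
        for i in range(n):      # inner loop jumps between rows
            y[i] += A[i][j] * x[j]
    return y
```

Both variants return identical results; only their memory-access patterns, and hence their performance, differ. Choosing between such equivalent orderings is exactly the decision loop scheduling makes.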
Modern auto-schedulers tackle this problem with a local search that applies loop transformations to the initial loop nest. Because of the large search space, this is expensive and requires an accurate yet hard-to-define performance model. Autotuners additionally benchmark each search hypothesis, which is usually intractable for an extensive search.
The Idea of Transfer Tuning
Transfer tuning is an approach to loop scheduling that asks a different question: Have I seen a similar loop nest before? A loop nest's performance mainly depends on how arrays are accessed and how parallelism can be exploited. Hence, the performance is largely independent of the computed function.
Our transfer tuner therefore defines a metric that compares loop nests in terms of performance-relevant aspects. Based on this distance metric, it can search for similar loop nests and their best schedules in an online database, much like a search engine retrieves similar images. This search finds a good loop schedule efficiently, usually among the three to five most similar loop nests in the database.
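Conceptually, this lookup is a nearest-neighbor search over feature vectors. The sketch below is purely illustrative: the feature choices, the entry names, and the top_k helper are our own invention and not Daisytuner's actual model or API. Each loop nest is described by a small vector of performance-relevant properties, and the closest entries in the collection are returned:

```python
import math

# Hypothetical feature vectors describing loop nests by
# performance-relevant aspects: [loop depth, arrays accessed, parallel loops].
# These features are illustrative, not Daisytuner's actual representation.
database = {
    "matmul":  [3, 3, 2],
    "mxv":     [2, 3, 1],
    "stencil": [2, 2, 1],
    "reduce":  [1, 1, 0],
}

def top_k(query, db, k=3):
    """Return the names of the k entries closest to the query vector,
    using Euclidean distance as a stand-in for the real metric."""
    return sorted(db, key=lambda name: math.dist(query, db[name]))[:k]
```

For a query resembling a matrix-vector multiplication, `top_k([2, 3, 1], database, k=2)` would rank "mxv" first; the tuner would then try the stored schedules of the closest matches in order.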
Transfer Tuning
To learn more about transfer tuning, check out the paper.
Collections
Collections are at the core of our implementation of transfer tuning. A collection is a namespace of loop nests from a specific domain. In the example of the Getting Started section, you have already used our default collection for optimization. The default collection comprises a starting set of loop nests from domains such as linear algebra and statistics.
Using a Specific Collection
Recall our initial example, where we optimized a matrix-vector multiplication with transfer tuning.
import dace

from daisytuner.benchmarking import CPUBenchmark
from daisytuner.normalization import MapExpandedForm
from daisytuner.pipelines import TransferTuningPipeline


@dace.program
def mxv(A: dace.float32[1024, 1024], x: dace.float32[1024], y: dace.float32[1024]):
    for i in dace.map[0:1024]:
        for j in dace.map[0:1024]:
            with dace.tasklet:
                a << A[i, j]
                b << x[j]
                c >> y(1, lambda x1, x2: x1 + x2)[i]
                c = a * b


sdfg = mxv.to_sdfg()

# Benchmark your processor
benchmark = CPUBenchmark.measure()

# SDFG preprocessing
preprocess = MapExpandedForm()
preprocess.apply_pass(sdfg, {})

# Transfer tuning
pipeline_results = {}
pipeline = TransferTuningPipeline(
    benchmark=benchmark,
    device="cpu",
    # HERE: Specify the collection
    collection="default"
)
pipeline.apply_pass(sdfg, pipeline_results)
The Transfer Tuning pipeline accepts an optional collection argument. If omitted, it falls back to the default collection.
Defining a Custom Collection
When working on a large code base, defining a custom collection for your domain of loop nests is essential. This focuses the search on your domain and enables custom optimizations. Contact us via our contact form to create a custom collection for your application. We offer several search methods to seed your collection with good tuning strategies.