Getting started

Learn how to get Daisytuner set up in your project.


Installation

The following sections guide you through the installation of Daisytuner and its dependencies.

Dependencies

Daisytuner needs to benchmark your processor to recommend the best tuning strategies. To derive architecture details and run measurements, we use the performance tool suite Likwid.

Likwid can be installed in several ways; the simplest is to download and build the sources directly.

wget http://ftp.fau.de/pub/likwid/likwid-5.2.2.tar.gz
tar -xaf likwid-5.2.2.tar.gz
cd likwid-5.2.2
make ACCESSMODE=perf_event
sudo make ACCESSMODE=perf_event install

Likwid provides different access modes for performing measurements. The perf_event access mode uses the Linux kernel's built-in perf_event interface to read performance counters from the processor.

Linux Perf API

Reading performance counters via the Linux perf_event API requires elevated privileges. These can be granted by setting the perf_event_paranoid kernel parameter to 0.
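
For example, using the standard sysctl tool (the configuration file name below is just an example):

sudo sysctl -w kernel.perf_event_paranoid=0

# Optionally persist the setting across reboots
echo 'kernel.perf_event_paranoid=0' | sudo tee /etc/sysctl.d/99-perf.conf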

Daisytuner

The Daisytuner SDK is the core component handling the interaction with our Co-Pilot API. The SDK is a Python package easily installed via pip.

pip install daisytuner-sdk[profiling]

We are almost ready to use the Co-Pilot and tune our first program.


Register an Account

To use the API of our Co-Pilot, you must first register an account. The account is free and includes a basic quota for testing. You can sign up via the CLI that ships with the SDK.

daisytuner signup

If you signed up via the website instead, you must additionally log in to your account on your local machine.

daisytuner login

Great, we are now ready to tune our first program!


Tune Your First Program

Our Co-Pilot speaks the language of compilers, and this language is called the intermediate representation (IR) of a program. Based on the IR, compilers reason about a program and apply different code transformations, producing fast instructions for your processor. We use the stateful dataflow multigraph (SDFG) implemented in the DaCe framework to optimize programs.

When using Daisytuner with a compiler such as clang, we must first translate the compiler's IR or the source code into an SDFG. For more information, check out the section Compiler Plugins.

DaCe Framework

To learn more about SDFGs and the DaCe framework, check out the documentation.

Matrix-Vector Multiplication

Our first program will be a classical scientific algorithm, the matrix-vector multiplication (MxV). To obtain the SDFG and run the Co-Pilot, we will implement the MxV with the DaCe Python frontend.

import dace

@dace.program
def mxv(A: dace.float32[1024, 1024], x: dace.float32[1024], y: dace.float32[1024]):
    # Parallel map over the rows of A
    for i in dace.map[0:1024]:
        # Parallel map over the columns of A
        for j in dace.map[0:1024]:
            with dace.tasklet:
                a << A[i, j]                          # read A[i, j]
                b << x[j]                             # read x[j]
                c >> y(1, lambda x1, x2: x1 + x2)[i]  # sum-reduce into y[i]

                c = a * b

This code specifies two parallel loops (maps) that multiply elements of A and x and reduce the products into y. The frontend can directly translate the Python code into an SDFG.

sdfg = mxv.to_sdfg()
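
Before tuning, it is worth verifying that the program computes what we expect. As a quick sanity check (not part of the original walkthrough), a @dace.program can be called directly with NumPy arrays, which JIT-compiles and runs it:

import numpy as np

# Random inputs; y starts at zero because the tasklet sum-reduces into it
A = np.random.rand(1024, 1024).astype(np.float32)
x = np.random.rand(1024).astype(np.float32)
y = np.zeros(1024, dtype=np.float32)

mxv(A, x, y)
assert np.allclose(y, A @ x, atol=1e-3)  # compare against NumPy's MxV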

Loop Scheduling

On modern processors, the MxV, as written above, will not be very fast because it does not utilize the cache hierarchy optimally. However, we can improve the performance by changing how the loops iterate over the arrays. Such techniques are commonly known as loop scheduling.
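
To build intuition for what a tiled schedule looks like, here is a plain-Python sketch of a manually tiled MxV (illustrative only; the tile size of 128 is an assumption, and the Co-Pilot performs such transformations on the SDFG for you):

TILE = 128  # assumed tile size; the tuner picks this per architecture

for ti in range(0, 1024, TILE):             # loop over row tiles
    for tj in range(0, 1024, TILE):         # loop over column tiles
        for i in range(ti, ti + TILE):      # rows within the tile
            for j in range(tj, tj + TILE):  # columns within the tile
                # x[tj:tj+TILE] is reused while it is still in cache
                y[i] += A[i, j] * x[j]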

Optimizing the loop schedule is straightforward with Daisytuner: the SDK provides the TransferTuningPipeline, which asks the Co-Pilot for the optimal loop schedule. Before running the pipeline, we must benchmark our current processor and provide this information to the tuner. Afterward, we can call the tuner and get the new schedule.

from daisytuner.benchmarking import CPUBenchmark
from daisytuner.normalization import MapExpandedForm
from daisytuner.pipelines import TransferTuningPipeline

# Benchmark your processor
benchmark = CPUBenchmark.measure()

# SDFG Preprocessing
preprocess = MapExpandedForm()
preprocess.apply_pass(sdfg, {})

# Transfer Tuning
pipeline_results = {}
pipeline = TransferTuningPipeline(
    benchmark=benchmark,
    device="cpu"
)
pipeline.apply_pass(sdfg, pipeline_results)
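
Optionally, you can save the tuned SDFG to disk and inspect the new schedule visually, for instance in DaCe's SDFG viewer (sdfg.save is standard DaCe API):

# Optional: save the tuned SDFG for visual inspection
sdfg.save("mxv_tuned.sdfg")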

Based on the SDFG, DaCe creates a C++ library that can be called from Python. We can inspect the Co-Pilot's result by compiling the SDFG and looking at the generated code.

sdfg.compile()

The generated C++ code shows the schedule chosen by the Co-Pilot: the loops are tiled (here with a tile size of 128), and the outer loop is parallelized with OpenMP.

#pragma omp parallel for
for (auto tile_i = 0; tile_i < 1024; tile_i += 128) {
    for (auto tile_j = 0; tile_j < 1024; tile_j += 128) {
        for (auto i = tile_i; i < (tile_i + 128); i += 1) {
            for (auto j = tile_j; j < (tile_j + 128); j += 1) {
                float a = A[((1024 * i) + j)];
                float b = x[j];
                float c;

                ///////////////////
                // Tasklet code (mxv_5_4_7)
                c = (a * b);
                ///////////////////

                dace::wcr_fixed<dace::ReductionType::Sum, float>::reduce(y + i, c);
            }
        }
    }
}
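
As a rough sanity check (not part of the original walkthrough), you can call the compiled SDFG from Python and time it; compiled SDFGs are invoked with keyword arguments matching the array names:

import time
import numpy as np

A = np.random.rand(1024, 1024).astype(np.float32)
x = np.random.rand(1024).astype(np.float32)
y = np.zeros(1024, dtype=np.float32)

csdfg = sdfg.compile()  # returns a callable compiled SDFG

start = time.perf_counter()
csdfg(A=A, x=x, y=y)
elapsed = time.perf_counter() - start

print(f"Tuned MxV took {elapsed:.4f} s")
assert np.allclose(y, A @ x, atol=1e-3)  # verify against NumPy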