Build fast software, independent of hardware

Daisytuner provides the optimization platform and developer tools to run software across CPUs, GPUs, and accelerators without rewriting a single line.

Quick start

$ pip install docc-compiler
Browse full docs

Build with Frameworks You Love

PyTorch
NumPy
OpenMP

Optimization Platform

Compile and optimize software
across hardware

NVIDIA
AMD
ARM
Q.ANT

Run on CPUs, GPUs, and accelerators

Your application, fast and portable

Compatible with major frameworks. Fast optimizations across processors.
Zero runtime dependencies.

Sensor fusion, video analytics, and more

Daisytuner supports applications built on PyTorch, NumPy, or C/C++. Browse the docs to see all supported workloads and hardware targets.

Why Daisytuner

The Optimization Platform

Write your code once. Daisytuner compiles it into optimized native code, retargets it across processors, and packages it for deployment.

01 Deployment

One native artifact, no runtime stack

The compiler packages your entire application into a single native library with generated bindings for Python, C++, and other languages. No interpreter, no framework, no system dependencies.

Runtime stack

Python
Runtime
Libraries
System

Native artifact

Bindings
Native Library
02 Performance

Faster than hand-tuned frameworks

The compiler analyzes your full application end-to-end and generates optimized native code, not just at the kernel level but across the entire data flow, including pre- and post-processing.

End-to-end runtime

Daisytuner
PyTorch
NumPy
OpenMP

More in docs
03 Portability

Switch processors, keep your code

Retarget the same application to GPUs, RISC-V accelerators, or photonic processors. One source, compiled for each target. No vendor-specific rewrites.

GPU

Cheaper GPUs

Same code, different vendor. Cut hardware costs without rewriting code.

NVIDIA
AMD
RISC-V Accelerator

Modern RISC-V accelerators

Run on next-generation silicon designed for AI and HPC workloads.

Tenstorrent
Photonic

Compute at the speed of light

Target photonic processors for fundamentally new performance frontiers.

Q.ANT

How It Works

From your code to optimized deployment

You write the code. We handle compilation, optimization, benchmarking, and deployment.

01

Start with your code

You write your application using the frameworks you already know: PyTorch, NumPy, C/C++. Nothing changes about how you build software.

Application components

Camera
YOLOv8
Resize
Object Tracker
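The pipeline above can be written as ordinary framework code. A minimal, hypothetical NumPy version of the Camera → Resize → YOLOv8 → Tracker flow (all function names here are illustrative stand-ins, not part of Daisytuner's API):

```python
import numpy as np

def capture_frame(h=480, w=640):
    # Stand-in for a camera source: one random RGB frame.
    return np.random.rand(h, w, 3).astype(np.float32)

def resize(frame, factor=2):
    # Naive strided downsample as a placeholder for real resizing.
    return frame[::factor, ::factor]

def detect(frame):
    # Stand-in for a YOLOv8 forward pass: one dummy box (x, y, w, h).
    return np.array([[10.0, 20.0, 50.0, 50.0]], dtype=np.float32)

def track(boxes, history):
    # Toy tracker: record the mean box center for each frame.
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0
    history.append(centers.mean(axis=0))
    return history

history = []
for _ in range(3):
    frame = capture_frame()
    boxes = detect(resize(frame))
    history = track(boxes, history)

print(len(history))  # 3 tracked observations
```

Nothing about this code is Daisytuner-specific; that is the point of step 01.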
02

Daisyflow captures the graph

Your multi-framework application is captured as a single dataflow graph. Daisyflow sees across framework boundaries, so the full pipeline can be optimized as one unit.

Daisyflow

Dataflow graph

Camera
YOLOv8
Resize
Object Tracker

connected across frameworks
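Conceptually, the captured pipeline is a directed graph of operators that can be scheduled and optimized as one unit. A hypothetical sketch in plain Python, assuming an adjacency-list view of the graph (the real Daisyflow representation is far richer):

```python
# Hypothetical dataflow graph for the pipeline above; node names are
# illustrative, and Daisyflow's actual IR is not a simple adjacency list.
graph = {
    "Camera": ["Resize"],
    "Resize": ["YOLOv8"],
    "YOLOv8": ["Object Tracker"],
    "Object Tracker": [],
}

def topological_order(graph):
    # Kahn's algorithm: emit a node once all of its producers are done.
    indegree = {node: 0 for node in graph}
    for consumers in graph.values():
        for c in consumers:
            indegree[c] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for c in graph[node]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

print(topological_order(graph))
# ['Camera', 'Resize', 'YOLOv8', 'Object Tracker']
```

Because the whole pipeline is visible as one graph, optimizations can cross the boundaries between frameworks instead of stopping at each library call.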

03

docc compiles and optimizes

Each node in the graph is compiled into an optimized native kernel using Transfer Tuning. The compilers query a cloud-connected optimization database to find the best configuration for your target hardware.

docc compilers

AI-optimized compilation

Optimization space

AI-guided search

04

Track performance on every change

Each pull request runs on our benchmarking infrastructure. You get performance numbers, regression alerts, and bottleneck analysis right in GitHub.

Performance dashboard

PR #142: +12% throughput
PR #143: -8% latency regression
Bottleneck: Memory-bound in Resize kernel
05

Deploy with zero overhead

The output is a single native artifact with no runtime dependencies. Ship it to your own machines or run jobs on our cloud with AMD, Tenstorrent, and more.

Native artifact

app.so

No runtime. Small deployment.

Self-hosted

Your hardware

Runs

Our cloud

Beyond GPU

Production workloads on next-generation processors

First from PyTorch
Q.ANT

Object detection on Q.ANT NPU Gen 2

Daisytuner compiled and deployed an object detection model directly from PyTorch onto Q.ANT's photonic processor, the first time a standard ML framework has targeted photonic hardware.

  • PyTorch to Photonic: Standard model format compiled directly to Q.ANT hardware.

  • Full Pipeline: Pre-processing, inference, and post-processing run end-to-end.

  • No Custom Code: Our platform handled all hardware-specific translation automatically.

World first
SC '25
Tenstorrent

OpenFOAM on Tenstorrent Blackhole

In collaboration with Tenstorrent, we cross-compiled the industry-standard CFD toolkit OpenFOAM to Tenstorrent's RISC-V based Blackhole accelerator.

  • Zero Code Changes: Original C++ source code compiled directly with docc.

  • Automatic Optimization: docc identifies offloadable sections and moves them to the accelerator.

  • Full Portability: Seamlessly switch between Wormhole, Blackhole, or hardware from other vendors.

Start building on any processor

Install the tools, point at your code, and get optimized native artifacts in minutes.

$ pip install docc-compiler