GPU Implementation
This section applies the XXZ workflow on GPU.
The key point is that the user-facing API stays almost identical to CPU runs: you still call polfed, but your matrix and vector inputs are moved to GPU arrays, and the heavy mapping work then executes on the GPU.
Workflow overview:
- Simple Run on GPU
- Automatic Optimization on GPU
- Custom Mapping Construction on GPU
- Pass Custom Clenshaw Kernels Through MappingConfig
Simple Run on GPU
Start from the usual XXZ construction, then transfer matrix and vectors to GPU and run POLFED in the same way as on CPU.
using Polfed
using Polfed.Models: xxz_hamiltonian
using CUDA
using LinearAlgebra
using SparseArrays  # spdiagm is used in the custom-mapping section below
L = 18
Lup = L ÷ 2
Delta = 0.55
howmany = 120
target = 0.0
mat = xxz_hamiltonian(L, Lup, 1.0, Delta, 0.0; boundary=:periodic, field=0.0, use_sparse=true)
x0 = rand(size(mat, 1)); x0 ./= norm(x0)
# Transfer to GPU
mat_gpu = CUDA.CUSPARSE.CuSparseMatrixCSR(mat)
x0_gpu = CuVector(x0)
vals_gpu, vecs_gpu, report_gpu = polfed(mat_gpu, x0_gpu, howmany, target; produce_report=true)
display_report(report_gpu)
Automatic Optimization on GPU
The interface is the same as on CPU: set optimize_mapping=true. On GPU, NoParallel() is typically the correct strategy, since parallelism is handled by the GPU backend itself. NoParallel() is already the default on GPUs, so the explicit setting is not strictly necessary, but it makes the intended execution path clear.
mapping_gpu_auto = MappingConfig(
    parallel_strategy=NoParallel(),
    optimize_mapping=true,
)
vals_gpu_auto, vecs_gpu_auto, report_gpu_auto = polfed(
    mat_gpu,
    x0_gpu,
    howmany,
    target;
    produce_report=true,
    mapping=mapping_gpu_auto,
)
display_report(report_gpu_auto)
Custom Mapping Construction on GPU
You can also pass a custom rescaled mapping on GPU, exactly as in CPU workflows. A practical option is to precompute a rescaled matrix and use GPU mul! for that mapping.
# Estimate bounds on CPU (same workflow as in previous sections)
f_mul_cpu! = (Y, X) -> mul!(Y, mat, X)
Emin = first(collect(Polfed.Lanczos.lanczos(f_mul_cpu!, x0, 1; which=:SR, maxdim=1000)[1]))
Emax = last(collect(Polfed.Lanczos.lanczos(f_mul_cpu!, x0, 1; which=:LR, maxdim=1000)[1]))
spread = (Emax - Emin) / 2
center = (Emax + Emin) / 2
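To see why these two quantities are enough: the affine map E -> (E - center) / spread sends [Emin, Emax] onto [-1, 1], the interval on which the Chebyshev-type polynomial mapping operates. A minimal standalone check of that arithmetic (the demo bounds below are hypothetical, not taken from the XXZ run):

```julia
# Hypothetical spectral bounds, chosen to be exactly representable
Emin_demo, Emax_demo = -12.0, 10.0
spread_demo = (Emax_demo - Emin_demo) / 2
center_demo = (Emax_demo + Emin_demo) / 2
rescale(E) = (E - center_demo) / spread_demo
rescale(Emin_demo), rescale(Emax_demo)  # (-1.0, 1.0)
```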
# Build the rescaled matrix (spectrum mapped into [-1, 1]), then move it to GPU
mat_rescaled = copy(mat)
mat_rescaled .*= 1 / spread
mat_rescaled += spdiagm(0 => fill(-center / spread, size(mat, 1)))
mat_gpu_rescaled = CUDA.CUSPARSE.CuSparseMatrixCSR(mat_rescaled)
f_rescaled_gpu! = (Y, X) -> mul!(Y, mat_gpu_rescaled, X)
mapping_gpu_rescaled = MappingConfig(
    parallel_strategy=NoParallel(),
    f!_rescaled=f_rescaled_gpu!,
    Emin=Emin,
    Emax=Emax,
)
vals_gpu_rescaled, vecs_gpu_rescaled, report_gpu_rescaled = polfed(
    mat_gpu,
    x0_gpu,
    howmany,
    target;
    produce_report=true,
    mapping=mapping_gpu_rescaled,
)
display_report(report_gpu_rescaled)
Pass Custom Clenshaw Kernels Through MappingConfig
For model-specific tuning, you can also provide custom Clenshaw recurrence/final-sum callbacks. A practical template keeps data on GPU and calls GPU mul! inside these callbacks.
function make_gpu_clenshaw_kernels(mat_gpu)
    # Recurrence step: b1 = c*X + 2*A*b2 - b3
    crr = (b1, b2, b3, c, X) -> begin
        tmp = similar(X)  # scratch buffer; preallocate outside the closure in production
        mul!(tmp, mat_gpu, b2)
        @. b1 = c * X + 2 * tmp - b3
        return nothing
    end
    # Final sum: Y = c*X + A*b1 - b2
    cfs = (b1, b2, c, Y, X) -> begin
        tmp = similar(X)
        mul!(tmp, mat_gpu, b1)
        @. Y = c * X + tmp - b2
        return nothing
    end
    return crr, cfs
end
crr_gpu!, cfs_gpu! = make_gpu_clenshaw_kernels(mat_gpu)
mapping_gpu_custom = MappingConfig(
    parallel_strategy=NoParallel(),
    optimize_mapping=true,
    clenshaw_recurrence=crr_gpu!,
    clenshaw_finalsum=cfs_gpu!,
)
vals_gpu_custom, vecs_gpu_custom, report_gpu_custom = polfed(
    mat_gpu,
    x0_gpu,
    howmany,
    target;
    produce_report=true,
    mapping=mapping_gpu_custom,
)
display_report(report_gpu_custom)
In production, you would replace these callbacks with fused, model-specific GPU kernels. The integration point stays the same: MappingConfig.
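As a sketch of what "fused" means here: the elementwise part of the recurrence step, b1 = c*X + 2*(A*b2) - b3, can be collapsed into a single broadcast after the sparse matrix-vector product, and on GPU arrays that broadcast compiles to one kernel. The version below runs on plain CPU arrays so it stays self-contained; the toy matrix, buffer handling, and function name are illustrative, not part of the Polfed API.

```julia
using LinearAlgebra, SparseArrays

# One Clenshaw recurrence step with a preallocated scratch buffer:
# a sparse matrix-vector product followed by a single fused elementwise pass.
function fused_clenshaw_step!(b1, tmp, A, b2, b3, c, X)
    mul!(tmp, A, b2)               # tmp = A * b2
    @. b1 = c * X + 2 * tmp - b3   # one fused pass (one kernel on GPU arrays)
    return b1
end

n = 4
A_demo = spdiagm(0 => fill(0.5, n))   # toy stand-in for the rescaled operator
X, b2, b3 = ones(n), fill(2.0, n), zeros(n)
b1, tmp = similar(X), similar(X)
fused_clenshaw_step!(b1, tmp, A_demo, b2, b3, 0.25, X)
b1  # every entry equals 0.25*1 + 2*(0.5*2) - 0 = 2.25
```

The same function body works unchanged on CuVector/CuSparseMatrixCSR inputs, since both mul! and the fused broadcast dispatch to GPU implementations.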
This page was generated using Literate.jl.