GPU Implementation
This section applies the XXZ workflow on GPU.
The key point is that the user-facing API stays almost identical to CPU runs: you still call polfed, but your matrix and vector inputs are moved to GPU arrays, and the heavy mapping work then executes on the GPU.
Workflow overview:
- Simple Run on GPU
- Automatic Optimization on GPU
- Custom Mapping Construction on GPU
- Pass Custom Clenshaw Kernels Through MappingConfig
Simple Run on GPU
Start from the usual XXZ construction, then transfer matrix and vectors to GPU and run POLFED in the same way as on CPU.
using Polfed
using Polfed.Models: xxz_hamiltonian
using CUDA
using LinearAlgebra
using SparseArrays  # spdiagm is used in the custom-mapping section below
L = 18
Lup = L ÷ 2
Delta = 0.55
howmany = 120
target = 0.0
mat = xxz_hamiltonian(L, Lup, 1.0, Delta, 0.0; boundary=:periodic, field=0.0, use_sparse=true)
x0 = rand(size(mat, 1)); x0 ./= norm(x0)
# Transfer to GPU
mat_gpu = CUDA.CUSPARSE.CuSparseMatrixCSR(mat)
x0_gpu = CuVector(x0)
vals_gpu, vecs_gpu, report_gpu = polfed(mat_gpu, x0_gpu, howmany, target; produce_report=true)
display_report(report_gpu)
Automatic Optimization on GPU
The interface is the same as on CPU: set optimize_mapping=true. On GPU, NoParallel() is typically the correct strategy, since parallelism is handled by the GPU backend itself. NoParallel() is already the default on GPUs, so the explicit setting is not strictly necessary, but it makes the intended execution path clear.
mapping_gpu_auto = MappingConfig(
    parallel_strategy=NoParallel(),
    optimize_mapping=true,
)
vals_gpu_auto, vecs_gpu_auto, report_gpu_auto = polfed(
    mat_gpu,
    x0_gpu,
    howmany,
    target;
    produce_report=true,
    mapping=mapping_gpu_auto,
)
display_report(report_gpu_auto)
Custom Mapping Construction on GPU
You can also pass a custom rescaled mapping on GPU, exactly as in CPU workflows. A practical option is to precompute a rescaled matrix and use GPU mul! for that mapping.
# Estimate bounds on CPU (same workflow as in previous sections)
f_mul_cpu! = (Y, X) -> mul!(Y, mat, X)
Emin = first(collect(Polfed.Lanczos.lanczos(f_mul_cpu!, x0, 1; which=:SR, maxdim=1000)[1]))
Emax = last(collect(Polfed.Lanczos.lanczos(f_mul_cpu!, x0, 1; which=:LR, maxdim=1000)[1]))
spread = (Emax - Emin) / 2
center = (Emax + Emin) / 2
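To see why these two quantities are enough: the affine map E -> (E - center) / spread sends [Emin, Emax] onto [-1, 1], the interval on which the Chebyshev-type polynomial mapping operates. A minimal standalone check of that arithmetic (the demo bounds below are hypothetical, not taken from the XXZ run):

```julia
# Hypothetical spectral bounds, chosen to be exactly representable
Emin_demo, Emax_demo = -12.0, 10.0
spread_demo = (Emax_demo - Emin_demo) / 2
center_demo = (Emax_demo + Emin_demo) / 2
rescale(E) = (E - center_demo) / spread_demo
rescale(Emin_demo), rescale(Emax_demo)  # (-1.0, 1.0)
```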
# Build the rescaled matrix (spectrum mapped into [-1, 1]), then move it to GPU
mat_rescaled = copy(mat)
mat_rescaled .*= 1 / spread
mat_rescaled += spdiagm(0 => fill(-center / spread, size(mat, 1)))
mat_gpu_rescaled = CUDA.CUSPARSE.CuSparseMatrixCSR(mat_rescaled)
f_rescaled_gpu! = (Y, X) -> mul!(Y, mat_gpu_rescaled, X)
mapping_gpu_rescaled = MappingConfig(
    parallel_strategy=NoParallel(),
    f!_rescaled=f_rescaled_gpu!,
    Emin=Emin,
    Emax=Emax,
)
vals_gpu_rescaled, vecs_gpu_rescaled, report_gpu_rescaled = polfed(
    mat_gpu,
    x0_gpu,
    howmany,
    target;
    produce_report=true,
    mapping=mapping_gpu_rescaled,
)
display_report(report_gpu_rescaled)
Pass Custom Clenshaw Kernels Through MappingConfig
For model-specific tuning, you can also provide custom Clenshaw recurrence/final-sum callbacks. A practical template keeps data on GPU and calls GPU mul! inside these callbacks.
function make_gpu_clenshaw_kernels(mat_gpu)
    # Recurrence step: b1 = c*X + 2*A*b2 - b3
    crr = (b1, b2, b3, c, X) -> begin
        tmp = similar(X)  # scratch buffer; preallocate outside the closure in production
        mul!(tmp, mat_gpu, b2)
        @. b1 = c * X + 2 * tmp - b3
        return nothing
    end
    # Final sum: Y = c*X + A*b1 - b2
    cfs = (b1, b2, c, Y, X) -> begin
        tmp = similar(X)
        mul!(tmp, mat_gpu, b1)
        @. Y = c * X + tmp - b2
        return nothing
    end
    return crr, cfs
end
crr_gpu!, cfs_gpu! = make_gpu_clenshaw_kernels(mat_gpu)
mapping_gpu_custom = MappingConfig(
    parallel_strategy=NoParallel(),
    optimize_mapping=true,
    clenshaw_recurrence=crr_gpu!,
    clenshaw_finalsum=cfs_gpu!,
)
vals_gpu_custom, vecs_gpu_custom, report_gpu_custom = polfed(
    mat_gpu,
    x0_gpu,
    howmany,
    target;
    produce_report=true,
    mapping=mapping_gpu_custom,
)
display_report(report_gpu_custom)
In production, you would replace these callbacks with fused, model-specific GPU kernels. The integration point stays the same: MappingConfig.
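As a sketch of what "fused" means here: the elementwise part of the recurrence step, b1 = c*X + 2*(A*b2) - b3, can be collapsed into a single broadcast after the sparse matrix-vector product, and on GPU arrays that broadcast compiles to one kernel. The version below runs on plain CPU arrays so it stays self-contained; the toy matrix, buffer handling, and function name are illustrative, not part of the Polfed API.

```julia
using LinearAlgebra, SparseArrays

# One Clenshaw recurrence step with a preallocated scratch buffer:
# a sparse matrix-vector product followed by a single fused elementwise pass.
function fused_clenshaw_step!(b1, tmp, A, b2, b3, c, X)
    mul!(tmp, A, b2)               # tmp = A * b2
    @. b1 = c * X + 2 * tmp - b3   # one fused pass (one kernel on GPU arrays)
    return b1
end

n = 4
A_demo = spdiagm(0 => fill(0.5, n))   # toy stand-in for the rescaled operator
X, b2, b3 = ones(n), fill(2.0, n), zeros(n)
b1, tmp = similar(X), similar(X)
fused_clenshaw_step!(b1, tmp, A_demo, b2, b3, 0.25, X)
b1  # every entry equals 0.25*1 + 2*(0.5*2) - 0 = 2.25
```

The same function body works unchanged on CuVector/CuSparseMatrixCSR inputs, since both mul! and the fused broadcast dispatch to GPU implementations.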
This page was generated using Literate.jl.