- Integrated occupancy hints from vendor APIs (CUDA, HIP) to set a
dynamic threads-per-block limit per kernel instead of using static
values.
- Added `find_tuning_function()` to identify the relevant kernel.
- Autotuner now runs in three stages: threads -> loops -> accel. The
first two stages now stop increasing when the tested kernel runtime
gets too close to the target runtime (96ms for `-w 3`), leaving
headroom for the next stage to adjust in a finer sense.
- Accel tuning now uses a capped floating-point multiplier instead of
powers of two.
- Removed workarounds for missing thread autotuning in plugins.
- Removed the hardcoded 4GiB host memory limit for accel. Added a
cross-platform `get_free_memory()` to check actual free RAM during GPU
initialization, preventing underutilization of high-end GPUs like the
4090. If needed, users can still cap memory usage with `-T` or `-n`.
- Updated enums for ROCm 6.4.x and CUDA 12.9.
- Added code to detect kernel register spilling. That's relevant so we
can keep free enough global memory on the runtime for the runtime to
handle spills efficiently.