1574671 : Cannot find GPU using CUDA

Created: 2026-02-06T22:54:49Z - current status: new

Anonymized Summary:

A user is encountering an error while attempting to run the PyNX module in a JupyterHub notebook on the Maxwell cluster (specifically on the allgpu partition). The error indicates that the GPU (NVIDIA A100-PCIE-40GB) cannot be initialized, despite the notebook running on a GPU-enabled node. The issue appears to stem from OpenCL/CUDA compatibility or misconfiguration in the custom kernel (pynx-py312-env).


Core Issue:

The PyNX module fails to detect or initialize the GPU, returning:

Failed initialising GPU. Please check GPU name [NVIDIA A100-PCIE-40GB] or CUDA/OpenCL installation.

Even though the job is running on the allgpu partition, the environment may not be properly configured for CUDA/OpenCL access.


Possible Solutions/Next Steps:

1. Verify CUDA/OpenCL Environment

  • Ensure the custom kernel (pynx-py312-env) has the correct CUDA toolkit and OpenCL drivers installed.
  • Load the appropriate CUDA module (e.g., module load maxwell cuda/11.8) before launching the notebook.
  • Check if the kernel includes PyCUDA or PyOpenCL dependencies.
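
The last check can be done from inside the notebook by asking Python whether the GPU bindings are importable in the active kernel. The sketch below is only illustrative: the default package list is an assumption based on the dependencies named above, and check_gpu_packages is a hypothetical helper, not part of PyNX.

```python
import importlib.util
import sys


def check_gpu_packages(names=("pycuda", "pyopencl", "pynx")):
    """Return {package: bool} telling whether each package is importable."""
    # find_spec only locates the package; it does not import or initialise it.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # sys.executable shows which environment the kernel actually runs in.
    print("Python executable:", sys.executable)
    for name, found in check_gpu_packages().items():
        print(f"{name:10s} {'found' if found else 'MISSING'}")
```

If the executable path does not point into pynx-py312-env, the notebook is probably not using the intended kernel.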

2. Check GPU Visibility

  • Run a simple test in the notebook to confirm GPU visibility (Python):

      import torch  # or tensorflow/cupy
      print(torch.cuda.is_available())      # should return True
      print(torch.cuda.get_device_name(0))  # should match the GPU name (e.g., A100)

  • If this fails, the environment may lack GPU support.
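
If PyTorch is not installed in the kernel, the test above fails for a reason unrelated to the GPU. A possible alternative, sketched below with only the standard library, queries the NVIDIA driver directly through ctypes. It assumes a Linux node where the driver library is available as libcuda.so.1; cuda_driver_device_count is a hypothetical helper name.

```python
import ctypes


def cuda_driver_device_count():
    """Return the number of CUDA devices the driver reports.

    Returns None if the driver library cannot be loaded or initialised.
    """
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not found on this node
    if libcuda.cuInit(0) != 0:  # 0 is CUDA_SUCCESS
        return None
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value


if __name__ == "__main__":
    print("CUDA devices seen by the driver:", cuda_driver_device_count())
```

A result of None or 0 on a GPU node suggests the problem lies below the Python environment (driver or device allocation) rather than in the kernel's packages.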

3. Kernel Configuration

  • If the kernel was created manually, ensure it includes:
    • CUDA-compatible Python packages (e.g., cupy, pycuda).
    • Correct paths to CUDA libraries (e.g., LD_LIBRARY_PATH).
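  • Example for a conda environment (bash); confirm first that the pynx and cuda-toolkit packages are available in this version from the chosen channel:

      mamba create -n pynx-env python=3.12 pynx cuda-toolkit=11.8 -c conda-forge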
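
To see whether LD_LIBRARY_PATH actually exposes CUDA libraries to the kernel, the directories on it can be scanned from the notebook. This is a minimal sketch: find_cuda_libs is a hypothetical helper, and the library name patterns are illustrative assumptions rather than a list of what PyNX requires.

```python
import os
from pathlib import Path


def find_cuda_libs(search_path=None, patterns=("libcudart.so*", "libnvrtc.so*")):
    """Map each directory on the search path to the CUDA libraries found in it."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = {}
    # Skip empty entries, which appear when the variable is unset or has "::".
    for entry in filter(None, search_path.split(os.pathsep)):
        directory = Path(entry)
        if not directory.is_dir():
            continue
        found = sorted(p.name for pat in patterns for p in directory.glob(pat))
        if found:
            hits[str(directory)] = found
    return hits


if __name__ == "__main__":
    libs = find_cuda_libs()
    if not libs:
        print("No CUDA libraries found on LD_LIBRARY_PATH")
    for directory, names in libs.items():
        print(directory, "->", ", ".join(names))
```

An empty result is not conclusive on its own, since conda environments can also locate their libraries through RPATH, but libraries from several different CUDA versions on the path are worth investigating.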

4. JupyterHub-Specific Checks

  • Confirm the notebook is running on a GPU node (run !nvidia-smi in a cell and check that the A100 is listed).
  • If using a batch job, ensure the script includes (bash):

      #SBATCH --partition=allgpu
      #SBATCH --gres=gpu:1   # Request 1 GPU
      module load maxwell cuda/11.8
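
The same checks can be scripted inside the notebook. The sketch below reads the Slurm environment of the running job and lists the GPUs with nvidia-smi -L. The helper names are hypothetical, and whether each variable is set depends on how the JupyterHub session was launched.

```python
import os
import shutil
import subprocess


def describe_allocation(env=None):
    """Return the Slurm/CUDA variables that describe the current allocation."""
    env = os.environ if env is None else env
    keys = ("SLURM_JOB_ID", "SLURM_JOB_PARTITION", "SLURM_JOB_NODELIST",
            "SLURM_JOB_GPUS", "CUDA_VISIBLE_DEVICES")
    return {key: env.get(key, "<unset>") for key in keys}


def list_gpus():
    """Return the lines printed by `nvidia-smi -L`, or [] if it is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    return result.stdout.splitlines() if result.returncode == 0 else []


if __name__ == "__main__":
    for key, value in describe_allocation().items():
        print(f"{key} = {value}")
    print("GPUs:", list_gpus() or "none reported")
```

A partition other than allgpu, or an empty CUDA_VISIBLE_DEVICES, would point to the session's allocation rather than to the kernel.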

5. PyNX-Specific Debugging

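  • PyNX may require explicit CUDA/OpenCL flags. The variable name below is unverified, so check the PyNX documentation for the exact name. Set it before importing PyNX so that it is in place when the backend is selected (Python):

      import os
      # Force CUDA backend (if supported)
      os.environ["PYNX_CUDA"] = "1"
      from pynx import *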
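
Since the error message names both CUDA and OpenCL, it can help to list the devices each binding sees, independently of PyNX. The sketch below assumes PyCUDA and PyOpenCL are the bindings in use, as the dependency check in step 1 suggests. list_backend_devices is a hypothetical helper, and a backend that is missing or fails is reported as a string instead of raising.

```python
def list_backend_devices():
    """Return {"cuda": ..., "opencl": ...} with device names or an error string."""
    report = {}
    try:
        import pycuda.driver as cuda
        cuda.init()  # must be called before any other driver function
        report["cuda"] = [cuda.Device(i).name() for i in range(cuda.Device.count())]
    except Exception as exc:  # ImportError, driver errors, ...
        report["cuda"] = f"unavailable: {exc!r}"
    try:
        import pyopencl as cl
        report["opencl"] = [dev.name
                            for platform in cl.get_platforms()
                            for dev in platform.get_devices()]
    except Exception as exc:
        report["opencl"] = f"unavailable: {exc!r}"
    return report


if __name__ == "__main__":
    for backend, devices in list_backend_devices().items():
        print(f"{backend}: {devices}")
```

If both backends report the A100 here but PyNX still fails, the cause is more likely in PyNX's own device selection than in the CUDA/OpenCL installation.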

6. Fallback: Use Pre-Installed Environments

  • If the custom kernel is problematic, test PyNX in a pre-installed environment (e.g., tensorflow-2.11 or rapids-22.04) (bash):

      module load maxwell conda/3.9 cuda/11.8
      . mamba-init
      mamba activate tensorflow-2.11
