Motivation

We are building a tensor interoperability layer in `cuda.core` (`StridedMemoryView`) that creates lightweight read-only views over GPU arrays from CuPy, PyTorch, Numba, etc. Today, all paths to extract CuPy ndarray metadata go through Python:
- `__cuda_array_interface__` allocates a fresh dict, tuples for shape/strides, and Python ints on every access (~5 µs)
- `__dlpack__` allocates a PyCapsule + mallocs a `DLManagedTensor` struct (~3–4 µs)
- Direct attribute access (`arr.data.ptr`, `arr.shape`) involves multiple Python property dispatches and tuple allocations
For PyTorch, we adopted the AOTI stable C ABI — an opaque handle type plus C getter functions. This reduced metadata extraction from ~5 µs to ~14 ns at the C level (~350x faster), because we bypass all Python object allocation. We'd like the same for CuPy.
Scope
This RFC proposes read-only introspection only. All functions return metadata about an existing ndarray. There is no mechanism to create, modify, resize, or free ndarrays through this API — and we have no plan to expand it in that direction.
Proposal
An opaque handle type and a small set of C functions, exported from the CuPy shared library:
```c
/* Opaque handle — consumers must not inspect the struct. */
struct CuPyArrayOpaque;
typedef struct CuPyArrayOpaque* CuPyArrayHandle;

typedef int CuPyError;  /* 0 = success */

/* ---- handle acquisition ---- */

/* Extract a handle from a PyObject* pointing to a cupy.ndarray.
 * The handle borrows the ndarray's lifetime — it is valid as long
 * as the PyObject is alive. This is the function that turns a
 * Python object into something C code can work with. */
CuPyError cupy_ndarray_get_handle(void* pyobj, CuPyArrayHandle* out);

/* ---- read-only metadata getters ---- */
CuPyError cupy_ndarray_get_data_ptr(CuPyArrayHandle arr, void** out);
CuPyError cupy_ndarray_get_ndim(CuPyArrayHandle arr, int64_t* out);
CuPyError cupy_ndarray_get_shape(CuPyArrayHandle arr, int64_t** out);
CuPyError cupy_ndarray_get_strides(CuPyArrayHandle arr, int64_t** out);
CuPyError cupy_ndarray_get_device_id(CuPyArrayHandle arr, int* out);

/* dtype — returns itemsize and a DLPack-compatible type code */
CuPyError cupy_ndarray_get_dtype(CuPyArrayHandle arr,
                                 int* out_typecode, int* out_itemsize);
```
All functions are read-only — this API does not provide any mechanism to create, modify, or free ndarrays.
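For illustration, a minimal consumer-side extraction could look like the sketch below. The header name `cupy_c_api.h` is hypothetical; everything else follows the declarations above.

```c
#include <stdint.h>
#include <stdio.h>
#include "cupy_c_api.h"  /* hypothetical header carrying the declarations above */

/* Pull the metadata a strided-view consumer needs from a PyObject*
 * that may be a cupy.ndarray. The shape/strides pointers are borrowed
 * from the ndarray, so nothing here allocates or needs to be freed. */
static int inspect(void* pyobj) {
    CuPyArrayHandle arr;
    if (cupy_ndarray_get_handle(pyobj, &arr) != 0)
        return -1;  /* not a cupy.ndarray */

    void* data;
    int64_t ndim;
    int64_t *shape, *strides;  /* borrowed; valid while the ndarray lives */
    int device_id, typecode, itemsize;

    if (cupy_ndarray_get_data_ptr(arr, &data) != 0 ||
        cupy_ndarray_get_ndim(arr, &ndim) != 0 ||
        cupy_ndarray_get_shape(arr, &shape) != 0 ||
        cupy_ndarray_get_strides(arr, &strides) != 0 ||
        cupy_ndarray_get_device_id(arr, &device_id) != 0 ||
        cupy_ndarray_get_dtype(arr, &typecode, &itemsize) != 0)
        return -1;

    printf("device %d, ptr %p, %lld-d, itemsize %d\n",
           device_id, data, (long long)ndim, itemsize);
    return 0;
}
```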
Design rationale
- Opaque handle: consumers don't depend on CuPy's internal struct layout. CuPy is free to change its internals without breaking the ABI.
- `cupy_ndarray_get_handle` from `PyObject*`: this is the critical piece. For our PyTorch tensor bridge (NVIDIA/cuda-python#1894) we had to resort to pointer arithmetic on PyTorch's internal `THPVariable` struct because no such `PyObject*` → handle function exists in the AOTI stable C ABI — and we've filed an RFC with PyTorch requesting one (Allow getting `AtenTensorHandle` from the tensor object via AOTI stable C API, pytorch/pytorch#180107). Having it as a first-class API makes the whole pattern safe and stable.
- Borrowed pointers for shape/strides: `get_shape`/`get_strides` return pointers into the ndarray's own storage — zero-copy, no malloc. Valid as long as the ndarray is alive and not reshaped.
- Individual getters: a consumer that only needs `data_ptr` and `device_id` doesn't pay for shape/strides extraction. Each getter is a trivial struct field read (~2 ns).
- No PyCapsule, no version struct: the functions are exported directly from the shared library. Versioning is handled by adding new functions (fully backward-compatible); a runtime feature-detection sketch follows this list.
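As a sketch of that symbol-based versioning (assuming a POSIX `dlsym` environment and the declarations above; `cupy_ndarray_get_strides_in_bytes` is the hypothetical future getter mentioned under Compatibility below):

```c
#define _GNU_SOURCE  /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <stdint.h>

typedef CuPyError (*strides_in_bytes_fn)(CuPyArrayHandle, int64_t**);

/* Feature-detect a newer getter at runtime. Older CuPy builds simply
 * don't export the symbol and the consumer falls back gracefully;
 * no version struct or capsule negotiation is needed. */
static strides_in_bytes_fn resolve_strides_in_bytes(void) {
    return (strides_in_bytes_fn)dlsym(RTLD_DEFAULT,
                                      "cupy_ndarray_get_strides_in_bytes");
    /* NULL => fall back to cupy_ndarray_get_strides() */
}
```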
Comparison with `__dlpack_c_exchange_api__`
The DLPack C exchange API is a complementary standard that provides dltensor_from_py_object_no_sync to fill a DLTensor from C. We'd love to see CuPy adopt it too (separate request). The key differences:
|                      | DLPack C exchange API                         | This proposal                            |
| -------------------- | --------------------------------------------- | ---------------------------------------- |
| Scope                | Framework-agnostic standard                    | CuPy-specific                             |
| Allocation           | Copies shape/strides into `DLTensor` (malloc)  | Borrows pointers (zero-copy)              |
| Granularity          | One call fills entire struct                   | Individual getters, pay-for-what-you-use  |
| `PyObject*` → handle | Via PyCapsule function pointer                 | Direct C function export                  |
| Perf (PyTorch ref.)  | ~75 ns                                         | ~14 ns (7 getters)                        |
The two serve different use cases: the DLPack C API is best for cross-framework exchange, while this proposal is best for CuPy-aware consumers that need maximum throughput.
Benchmarks (PyTorch, for reference)
| Path                              | Time     | Notes                               |
| --------------------------------- | -------- | ----------------------------------- |
| Python `__cuda_array_interface__` | ~5000 ns | dict + tuples + ints                |
| Python `__dlpack__(stream=-1)`    | ~5000 ns | PyCapsule + `DLManagedTensor`       |
| DLPack C exchange API             | ~75 ns   | C function pointers, fills `DLTensor` |
| AOTI individual C getters         | ~14 ns   | 7 direct struct field reads         |
Measured on PyTorch 2.11, CPU tensors, 1M iterations (timeit).
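For the C-level numbers, a harness along these lines could reproduce the per-extraction cost against the proposed CuPy getters (a sketch, not the harness we used; assumes the declarations above and a live cupy.ndarray `PyObject*`):

```c
#include <stdint.h>
#include <time.h>

/* Average cost, in ns, of one full 7-call metadata extraction.
 * A real harness should also consume the outputs (e.g. accumulate
 * into a volatile sink) so the calls cannot be optimized away. */
static double bench_extract(void* pyobj, long iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; ++i) {
        CuPyArrayHandle arr;
        void* data;
        int64_t ndim, *shape, *strides;
        int dev, typecode, itemsize;
        cupy_ndarray_get_handle(pyobj, &arr);
        cupy_ndarray_get_data_ptr(arr, &data);
        cupy_ndarray_get_ndim(arr, &ndim);
        cupy_ndarray_get_shape(arr, &shape);
        cupy_ndarray_get_strides(arr, &strides);
        cupy_ndarray_get_device_id(arr, &dev);
        cupy_ndarray_get_dtype(arr, &typecode, &itemsize);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```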
What this enables
- `cuda.core.StridedMemoryView` fast path for CuPy (matching our existing PyTorch tensor bridge)
- Any C, C++, or Cython extension that needs to inspect CuPy array metadata in a hot loop without Python overhead
Compatibility
Fully backward-compatible — old consumers don't see the new functions. Adding future getters (e.g. cupy_ndarray_get_strides_in_bytes) is also backward-compatible since it's just a new exported symbol.
-- Leo's bot