
RFC: stable C API for read-only access to ndarray metadata #9881

@leofang

Description


Motivation

We are building a tensor interoperability layer in cuda.core (StridedMemoryView) that creates lightweight read-only views over GPU arrays from CuPy, PyTorch, Numba, etc. Today, all paths to extract CuPy ndarray metadata go through Python:

  • __cuda_array_interface__ allocates a fresh dict, tuples for shape/strides, and Python ints on every access (~5 µs)
  • __dlpack__ allocates a PyCapsule + mallocs a DLManagedTensor struct (~3–4 µs)
  • Direct attribute access (arr.data.ptr, arr.shape) involves multiple Python property dispatches and tuple allocations

For PyTorch, we adopted the AOTI stable C ABI — an opaque handle type plus C getter functions. This reduced metadata extraction from ~5 µs to ~14 ns at the C level (~350x faster), because we bypass all Python object allocation. We'd like the same for CuPy.

Scope

This RFC proposes read-only introspection only. All functions return metadata about an existing ndarray. There is no mechanism to create, modify, resize, or free ndarrays through this API — and we have no plan to expand it in that direction.

Proposal

An opaque handle type and a small set of C functions, exported from the CuPy shared library:

/* Opaque handle — consumers must not inspect the struct. */
struct CuPyArrayOpaque;
typedef struct CuPyArrayOpaque *CuPyArrayHandle;

typedef int CuPyError;   /* 0 = success */

/* ---- handle acquisition ---- */

/* Extract a handle from a PyObject* pointing to a cupy.ndarray.
 * The handle borrows the ndarray's lifetime — it is valid as long
 * as the PyObject is alive.  This is the function that turns a
 * Python object into something C code can work with. */
CuPyError cupy_ndarray_get_handle(void *pyobj, CuPyArrayHandle *out);

/* ---- read-only metadata getters ---- */

CuPyError cupy_ndarray_get_data_ptr(CuPyArrayHandle arr, void **out);
CuPyError cupy_ndarray_get_ndim(CuPyArrayHandle arr, int64_t *out);
CuPyError cupy_ndarray_get_shape(CuPyArrayHandle arr, int64_t **out);
CuPyError cupy_ndarray_get_strides(CuPyArrayHandle arr, int64_t **out);
CuPyError cupy_ndarray_get_device_id(CuPyArrayHandle arr, int *out);

/* dtype — returns itemsize and a DLPack-compatible type code */
CuPyError cupy_ndarray_get_dtype(CuPyArrayHandle arr,
                                  int *out_typecode, int *out_itemsize);

All functions are read-only — this API does not provide any mechanism to create, modify, or free ndarrays.
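
For illustration, a consumer written in C might use the API roughly as follows. This is only a sketch, not part of the proposal: "cupy_c_api.h" is a hypothetical header name for the declarations above, and it assumes any nonzero CuPyError indicates failure.

/* Sketch only (not part of the proposal). Assumes the declarations above
 * are available via a hypothetical "cupy_c_api.h" header. */
#include <Python.h>
#include <stdint.h>
#include <stdio.h>
#include "cupy_c_api.h"

/* Reads basic metadata from a PyObject* known to be a cupy.ndarray.
 * The caller must keep pyobj alive while the handle and the borrowed
 * shape pointer are in use. */
static int describe_ndarray(PyObject *pyobj)
{
    CuPyArrayHandle arr;
    void *data;
    int64_t ndim;
    int64_t *shape;
    int device_id, typecode, itemsize;

    if (cupy_ndarray_get_handle(pyobj, &arr) != 0) return -1;
    if (cupy_ndarray_get_data_ptr(arr, &data) != 0) return -1;
    if (cupy_ndarray_get_ndim(arr, &ndim) != 0) return -1;
    if (cupy_ndarray_get_shape(arr, &shape) != 0) return -1;   /* borrowed, no malloc */
    if (cupy_ndarray_get_device_id(arr, &device_id) != 0) return -1;
    if (cupy_ndarray_get_dtype(arr, &typecode, &itemsize) != 0) return -1;

    printf("ptr=%p device=%d typecode=%d itemsize=%d ndim=%lld\n",
           data, device_id, typecode, itemsize, (long long)ndim);
    for (int64_t i = 0; i < ndim; ++i)
        printf("  shape[%lld] = %lld\n", (long long)i, (long long)shape[i]);
    return 0;
}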

Design rationale

  • Opaque handle: consumers don't depend on CuPy's internal struct layout. CuPy is free to change its internals without breaking the ABI.
  • cupy_ndarray_get_handle from PyObject*: this is the critical piece. For our PyTorch tensor bridge (NVIDIA/cuda-python#1894) we had to resort to pointer arithmetic on PyTorch's internal THPVariable struct because no such PyObject* → handle function exists in the AOTI stable C ABI — and we've filed an RFC with PyTorch requesting one (Allow getting AtenTensorHandle from the tensor object via AOTI stable C API pytorch/pytorch#180107). Having it as a first-class API makes the whole pattern safe and stable.
  • Borrowed pointers for shape/strides: get_shape/get_strides return pointers into the ndarray's own storage (zero-copy, no malloc). They remain valid only as long as the ndarray is alive and not reshaped; see the sketch after this list.
  • Individual getters: a consumer that only needs data_ptr and device_id doesn't pay for shape/strides extraction. Each getter is a trivial struct field read (~2 ns).
  • No PyCapsule, no version struct: the functions are exported directly from the shared library. Versioning is handled by adding new functions (fully backward-compatible).
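
To make the borrowing rules concrete, here is a hedged sketch of a consumer that keeps the ndarray alive while it reads the borrowed shape pointer, and copies the values into its own storage before releasing the reference. It reuses the hypothetical "cupy_c_api.h" header from the sketch above.

/* Sketch: the shape pointer is borrowed from the ndarray, so copy the
 * values out before dropping the reference that keeps the ndarray alive. */
#include <Python.h>
#include <stdint.h>
#include <string.h>
#include "cupy_c_api.h"   /* hypothetical header, as above */

static int snapshot_shape(PyObject *pyobj, int64_t *out_shape,
                          int64_t max_ndim, int64_t *out_ndim)
{
    CuPyArrayHandle arr;
    int64_t ndim;
    int64_t *shape;

    Py_INCREF(pyobj);   /* keep the ndarray (and thus the handle) alive */
    if (cupy_ndarray_get_handle(pyobj, &arr) != 0 ||
        cupy_ndarray_get_ndim(arr, &ndim) != 0 ||
        cupy_ndarray_get_shape(arr, &shape) != 0 ||
        ndim > max_ndim) {
        Py_DECREF(pyobj);
        return -1;
    }
    memcpy(out_shape, shape, (size_t)ndim * sizeof(int64_t));  /* caller now owns a copy */
    *out_ndim = ndim;
    Py_DECREF(pyobj);   /* the borrowed pointers must not be used past this point
                           if this was the last reference */
    return 0;
}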

Comparison with __dlpack_c_exchange_api__

The DLPack C exchange API is a complementary standard that provides dltensor_from_py_object_no_sync to fill a DLTensor from C. We'd love to see CuPy adopt it too (separate request). The key differences:

|  | DLPack C exchange API | This proposal |
| --- | --- | --- |
| Scope | Framework-agnostic standard | CuPy-specific |
| Allocation | Copies shape/strides into DLTensor (malloc) | Borrows pointers (zero-copy) |
| Granularity | One call fills entire struct | Individual getters, pay-for-what-you-use |
| PyObject* → handle | Via PyCapsule function pointer | Direct C function export |
| Perf (PyTorch ref.) | ~75 ns | ~14 ns (7 getters) |

Both serve different use cases. The DLPack C API is best for cross-framework exchange; this proposal is best for CuPy-aware consumers that need maximum throughput.

Benchmarks (PyTorch, for reference)

| Path | Time | Notes |
| --- | --- | --- |
| Python __cuda_array_interface__ | ~5000 ns | dict + tuples + ints |
| Python __dlpack__(stream=-1) | ~5000 ns | PyCapsule + DLManagedTensor |
| DLPack C exchange API | ~75 ns | C function pointers, fills DLTensor |
| AOTI individual C getters | ~14 ns | 7 direct struct field reads |

Measured on PyTorch 2.11, CPU tensors, 1M iterations (timeit).

What this enables

  • cuda.core.StridedMemoryView fast path for CuPy (matching our existing PyTorch tensor bridge)
  • Any C, C++, or Cython extension that needs to inspect CuPy array metadata in a hot loop without Python overhead

Compatibility

Fully backward-compatible: this proposal only adds new exported symbols, so existing consumers are unaffected. Adding future getters (e.g. cupy_ndarray_get_strides_in_bytes) is likewise backward-compatible, since each one is just another exported symbol.
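
As an illustration of the versioning-by-new-symbols approach, a consumer could probe for a newer getter at runtime and fall back when it is absent. This is a hedged sketch using POSIX dlsym; cupy_ndarray_get_strides_in_bytes is only the hypothetical example name from the paragraph above.

/* Sketch: optional-feature detection on POSIX. A missing symbol simply means
 * an older CuPy; callers fall back to cupy_ndarray_get_strides + itemsize. */
#define _GNU_SOURCE   /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <stdint.h>

typedef int CuPyError;
typedef struct CuPyArrayOpaque *CuPyArrayHandle;
typedef CuPyError (*strides_in_bytes_fn)(CuPyArrayHandle, int64_t **);

static strides_in_bytes_fn probe_strides_in_bytes(void)
{
    /* RTLD_DEFAULT searches symbols already loaded into the process,
     * e.g. after `import cupy` has loaded the CuPy shared library. */
    return (strides_in_bytes_fn)dlsym(RTLD_DEFAULT,
                                      "cupy_ndarray_get_strides_in_bytes");
}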

-- Leo's bot
