
RFC: stable C API for read-only access to ndarray metadata #9881

@leofang

Description


Motivation

We are building a tensor interoperability layer in cuda.core (StridedMemoryView) that creates lightweight read-only views over GPU arrays from CuPy, PyTorch, Numba, etc. Today, all paths to extract CuPy ndarray metadata go through Python:

  • __cuda_array_interface__ allocates a fresh dict, tuples for shape/strides, and Python ints on every access (~5 µs)
  • __dlpack__ allocates a PyCapsule + mallocs a DLManagedTensor struct (~3–4 µs)
  • Direct attribute access (arr.data.ptr, arr.shape) involves multiple Python property dispatches and tuple allocations

For PyTorch, we adopted the AOTI stable C ABI — an opaque handle type plus C getter functions. This reduced metadata extraction from ~5 µs to ~14 ns at the C level (~350x faster), because we bypass all Python object allocation. We'd like the same for CuPy.

Scope

This RFC proposes read-only introspection only. All functions return metadata about an existing ndarray. There is no mechanism to create, modify, resize, or free ndarrays through this API — and we have no plan to expand it in that direction.

Proposal

An opaque handle type and a small set of C functions, exported from the CuPy shared library:

/* Opaque handle — consumers must not inspect the struct. */
struct CuPyArrayOpaque;
typedef struct CuPyArrayOpaque *CuPyArrayHandle;

typedef int CuPyError;   /* 0 = success */

/* ---- handle acquisition ---- */

/* Extract a handle from a PyObject* pointing to a cupy.ndarray.
 * The handle borrows the ndarray's lifetime — it is valid as long
 * as the PyObject is alive.  This is the function that turns a
 * Python object into something C code can work with. */
CuPyError cupy_ndarray_get_handle(void *pyobj, CuPyArrayHandle *out);

/* ---- read-only metadata getters ---- */

CuPyError cupy_ndarray_get_data_ptr(CuPyArrayHandle arr, void **out);
CuPyError cupy_ndarray_get_ndim(CuPyArrayHandle arr, int64_t *out);
CuPyError cupy_ndarray_get_shape(CuPyArrayHandle arr, int64_t **out);
CuPyError cupy_ndarray_get_strides(CuPyArrayHandle arr, int64_t **out);
CuPyError cupy_ndarray_get_device_id(CuPyArrayHandle arr, int *out);

/* dtype — returns itemsize and a DLPack-compatible type code */
CuPyError cupy_ndarray_get_dtype(CuPyArrayHandle arr,
                                  int *out_typecode, int *out_itemsize);

All functions are read-only — this API does not provide any mechanism to create, modify, or free ndarrays.
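
For illustration, a consumer written in C might use the API roughly as follows. This is only a sketch, not part of the proposal: "cupy_c_api.h" is a hypothetical header name for the declarations above, and it assumes any nonzero CuPyError indicates failure.

/* Sketch only (not part of the proposal). Assumes the declarations above
 * are available via a hypothetical "cupy_c_api.h" header. */
#include <Python.h>
#include <stdint.h>
#include <stdio.h>
#include "cupy_c_api.h"

/* Reads basic metadata from a PyObject* known to be a cupy.ndarray.
 * The caller must keep pyobj alive while the handle and the borrowed
 * shape pointer are in use. */
static int describe_ndarray(PyObject *pyobj)
{
    CuPyArrayHandle arr;
    void *data;
    int64_t ndim;
    int64_t *shape;
    int device_id, typecode, itemsize;

    if (cupy_ndarray_get_handle(pyobj, &arr) != 0) return -1;
    if (cupy_ndarray_get_data_ptr(arr, &data) != 0) return -1;
    if (cupy_ndarray_get_ndim(arr, &ndim) != 0) return -1;
    if (cupy_ndarray_get_shape(arr, &shape) != 0) return -1;   /* borrowed, no malloc */
    if (cupy_ndarray_get_device_id(arr, &device_id) != 0) return -1;
    if (cupy_ndarray_get_dtype(arr, &typecode, &itemsize) != 0) return -1;

    printf("ptr=%p device=%d typecode=%d itemsize=%d ndim=%lld\n",
           data, device_id, typecode, itemsize, (long long)ndim);
    for (int64_t i = 0; i < ndim; ++i)
        printf("  shape[%lld] = %lld\n", (long long)i, (long long)shape[i]);
    return 0;
}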

Design rationale

  • Opaque handle: consumers don't depend on CuPy's internal struct layout. CuPy is free to change its internals without breaking the ABI.
  • cupy_ndarray_get_handle from PyObject*: this is the critical piece. For our PyTorch tensor bridge (NVIDIA/cuda-python#1894) we had to resort to pointer arithmetic on PyTorch's internal THPVariable struct because no such PyObject* → handle function exists in the AOTI stable C ABI — and we've filed an RFC with PyTorch requesting one (Allow getting AtenTensorHandle from the tensor object via AOTI stable C API pytorch/pytorch#180107). Having it as a first-class API makes the whole pattern safe and stable.
  • Borrowed pointers for shape/strides: get_shape/get_strides return pointers into the ndarray's own storage (zero-copy, no malloc). They remain valid only as long as the ndarray is alive and not reshaped; see the sketch after this list.
  • Individual getters: a consumer that only needs data_ptr and device_id doesn't pay for shape/strides extraction. Each getter is a trivial struct field read (~2 ns).
  • No PyCapsule, no version struct: the functions are exported directly from the shared library. Versioning is handled by adding new functions (fully backward-compatible).
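
To make the borrowing rules concrete, here is a hedged sketch of a consumer that keeps the ndarray alive while it reads the borrowed shape pointer, and copies the values into its own storage before releasing the reference. It reuses the hypothetical "cupy_c_api.h" header from the sketch above.

/* Sketch: the shape pointer is borrowed from the ndarray, so copy the
 * values out before dropping the reference that keeps the ndarray alive. */
#include <Python.h>
#include <stdint.h>
#include <string.h>
#include "cupy_c_api.h"   /* hypothetical header, as above */

static int snapshot_shape(PyObject *pyobj, int64_t *out_shape,
                          int64_t max_ndim, int64_t *out_ndim)
{
    CuPyArrayHandle arr;
    int64_t ndim;
    int64_t *shape;

    Py_INCREF(pyobj);   /* keep the ndarray (and thus the handle) alive */
    if (cupy_ndarray_get_handle(pyobj, &arr) != 0 ||
        cupy_ndarray_get_ndim(arr, &ndim) != 0 ||
        cupy_ndarray_get_shape(arr, &shape) != 0 ||
        ndim > max_ndim) {
        Py_DECREF(pyobj);
        return -1;
    }
    memcpy(out_shape, shape, (size_t)ndim * sizeof(int64_t));  /* caller now owns a copy */
    *out_ndim = ndim;
    Py_DECREF(pyobj);   /* the borrowed pointers must not be used past this point
                           if this was the last reference */
    return 0;
}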

Comparison with __dlpack_c_exchange_api__

The DLPack C exchange API is a complementary standard that provides dltensor_from_py_object_no_sync to fill a DLTensor from C. We'd love to see CuPy adopt it too (separate request). The key differences:

|  | DLPack C exchange API | This proposal |
| --- | --- | --- |
| Scope | Framework-agnostic standard | CuPy-specific |
| Allocation | Copies shape/strides into DLTensor (malloc) | Borrows pointers (zero-copy) |
| Granularity | One call fills entire struct | Individual getters, pay-for-what-you-use |
| PyObject* → handle | Via PyCapsule function pointer | Direct C function export |
| Perf (PyTorch ref.) | ~75 ns | ~14 ns (7 getters) |

Both serve different use cases. The DLPack C API is best for cross-framework exchange; this proposal is best for CuPy-aware consumers that need maximum throughput.

Benchmarks (PyTorch, for reference)

| Path | Time | Notes |
| --- | --- | --- |
| Python __cuda_array_interface__ | ~5000 ns | dict + tuples + ints |
| Python __dlpack__(stream=-1) | ~5000 ns | PyCapsule + DLManagedTensor |
| DLPack C exchange API | ~75 ns | C function pointers, fills DLTensor |
| AOTI individual C getters | ~14 ns | 7 direct struct field reads |

Measured on PyTorch 2.11, CPU tensors, 1M iterations (timeit).

What this enables

  • cuda.core.StridedMemoryView fast path for CuPy (matching our existing PyTorch tensor bridge)
  • Any C, C++, or Cython extension that needs to inspect CuPy array metadata in a hot loop without Python overhead

Compatibility

Fully backward-compatible: this proposal only adds new exported symbols, so existing consumers are unaffected. Adding future getters (e.g. cupy_ndarray_get_strides_in_bytes) is likewise backward-compatible, since each one is just another exported symbol.
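
As an illustration of the versioning-by-new-symbols approach, a consumer could probe for a newer getter at runtime and fall back when it is absent. This is a hedged sketch using POSIX dlsym; cupy_ndarray_get_strides_in_bytes is only the hypothetical example name from the paragraph above.

/* Sketch: optional-feature detection on POSIX. A missing symbol simply means
 * an older CuPy; callers fall back to cupy_ndarray_get_strides + itemsize. */
#define _GNU_SOURCE   /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <stdint.h>

typedef int CuPyError;
typedef struct CuPyArrayOpaque *CuPyArrayHandle;
typedef CuPyError (*strides_in_bytes_fn)(CuPyArrayHandle, int64_t **);

static strides_in_bytes_fn probe_strides_in_bytes(void)
{
    /* RTLD_DEFAULT searches symbols already loaded into the process,
     * e.g. after `import cupy` has loaded the CuPy shared library. */
    return (strides_in_bytes_fn)dlsym(RTLD_DEFAULT,
                                      "cupy_ndarray_get_strides_in_bytes");
}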

-- Leo's bot
