Building Tensors from Scratch in Rust (Part 1.3): Data Operations
In the previous part, we built view operations for our tensor library. Now we're ready to implement data operations, the operations that actually transform values in memory and form the computational backbone of any tensor library.
Data operations like map, reduce, and broadcast are where the real work happens. They're the operations that GPUs accelerate, and they do the overwhelming majority of the work in any tensor library.
Understanding Tensor Dimensionality
Before we implement our data operations, we need to establish a clear understanding of zero-dimensional tensors, as they'll play a crucial role in our operations.
Think of tensor dimensionality as a hierarchy:
| Dimension | Name | Description | Structure |
|---|---|---|---|
| 0D | Scalar | Number | Single value |
| 1D | Vector | List of numbers | List of scalars |
| 2D | Matrix | List of lists of numbers | List of vectors |
| 3D | Tensor | List of lists of lists of numbers | List of matrices |
Following this pattern, a zero-dimensional tensor is simply a scalar, a single number. This understanding will be essential as we build our operations, since many of our functions will work with or produce scalars.
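To make this concrete, here is a small sketch using the TensorShape type from the previous parts. It assumes TensorShape::new accepts an empty shape and that size() returns the product of the dimensions, so an empty shape has size 1:

// 0D: a scalar has an empty shape and holds exactly one value.
let scalar_shape = TensorShape::new(vec![]);
assert_eq!(scalar_shape.size(), 1);

// 1D: a vector of 3 scalars.
let vector_shape = TensorShape::new(vec![3]);
assert_eq!(vector_shape.size(), 3);

// 2D: a matrix, i.e. 2 vectors of length 3.
let matrix_shape = TensorShape::new(vec![2, 3]);
assert_eq!(matrix_shape.size(), 6);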
Map
The first and most straightforward operation is map.
Map applies a function to each element of a tensor independently. In machine learning frameworks, think of it as applying element-wise functions like ReLU, tanh, or sigmoid to a tensor.
A nice way to represent map is using einop notation:
b h w c -func> b h w c
The function that is applied to every element must take in a scalar (zero-dimensional tensor) and output a scalar.
func(Scalar) -> Scalar
This is easy to implement in our tensor library. We just take in a function and apply it to every element in our data.
We add the map function to TensorStorage:
impl<T> TensorStorage<T> {
fn map<F, U>(&self, f: F) -> TensorStorage<U>
where
F: Fn(&T) -> U,
{
TensorStorage {
data: self.data.iter().map(f).collect(),
}
}
}
This has some new type constraints: notice that the map function doesn't have to return the same type as the input elements; the element type of the output tensor depends on the output type of the function we pass in.
And of course, add it to our Tensor impl:
impl<T> Tensor<T> {
fn map<F, U>(&self, f: F) -> Tensor<U>
where
F: Fn(&T) -> U,
{
Tensor {
shape: self.shape.clone(),
storage: self.storage.map(f),
}
}
}
Notice we're pushing most of the computation into TensorStorage. That'll pay off when we explore running ops on devices other than the CPU, but that's for a later part; right now we're focusing only on the CPU.
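As a quick usage sketch (assuming the struct fields from the previous parts are accessible, e.g. from a test inside the crate), here is ReLU implemented with map:

let t = Tensor {
    shape: TensorShape::new(vec![2, 3]),
    storage: TensorStorage {
        data: vec![-1.0_f32, 2.0, -3.0, 4.0, -5.0, 6.0],
    },
};

// ReLU: clamp every negative element to zero.
let relu = t.map(|x| x.max(0.0));

// The shape is unchanged; only the values are transformed.
assert_eq!(relu.shape.shape, vec![2, 3]);
assert_eq!(relu.storage.data, vec![0.0, 2.0, 0.0, 4.0, 0.0, 6.0]);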
In Production: Candle
Candle does not have a map operation that can take in an arbitrary function like the one we wrote above, because it gets challenging to translate arbitrary functions to other devices. Instead, it has an enum called UnaryOp which describes all the different functions that can be applied to a tensor in this fashion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnaryOp {
Exp,
Log,
Sin,
Cos,
Abs,
Neg,
Recip,
Sqr,
Sqrt,
Gelu,
GeluErf,
Erf,
Relu,
Silu,
Tanh,
Floor,
Ceil,
Round,
Sign,
}
And here is the unary_map function that actually performs the computation:
pub fn unary_map<T: Copy, U: Copy, F: FnMut(T) -> U>(
vs: &[T],
layout: &Layout,
mut f: F,
) -> Vec<U> {
match layout.strided_blocks() {
crate::StridedBlocks::SingleBlock { start_offset, len } => vs
[start_offset..start_offset + len]
.iter()
.map(|&v| f(v))
.collect(),
crate::StridedBlocks::MultipleBlocks {
block_start_index,
block_len,
} => {
let mut result = Vec::with_capacity(layout.shape().elem_count());
// Specialize the case where block_len is one to avoid the second loop.
if block_len == 1 {
for index in block_start_index {
let v = unsafe { vs.get_unchecked(index) };
result.push(f(*v))
}
} else {
for index in block_start_index {
for offset in 0..block_len {
let v = unsafe { vs.get_unchecked(index + offset) };
result.push(f(*v))
}
}
}
result
}
}
}
Reduce
The reduce operation takes a tensor and combines all the subtensors over specified axes using a (usually associative) function. An example of this in machine learning frameworks is sum or product over an axis.
A nice way to represent reduce is using einop notation:
b h w c -func> b w c
In this case, we have a four-dimensional tensor of shape [b, h, w, c], and we would like to extract every subtensor of shape [b, w, c] (one per index along h) and combine them together elementwise.
Each entry in our output tensor is the output of a function:
func(Tensor) -> Scalar
In this case, for each entry in the output tensor, the function is applied to the corresponding subtensor of shape [h]. For example, sum adds up every element of that subtensor. Equivalently, you can describe the reduction with any associative function of the form:
func(Scalar, Scalar) -> Scalar
Usually it's best for this function to be associative so we can perform the reduction more efficiently in parallel, but this isn't a huge concern right now.
Associativity means it doesn't matter how you group the operations.
For example, addition:
a + (b + c) = (a + b) + c
Having an associative function lets us easily parallelize our computation.
For example, if we want to sum the list of numbers [a, b, c, d, e, f, g, h], we can do it in parallel in 3 steps:
Step 1: a + b = ab, c + d = cd, e + f = ef, g + h = gh
Step 2: ab + cd = abcd, ef + gh = efgh
Step 3: abcd + efgh = abcdefgh = a + b + c + d + e + f + g + h
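Here's a small standalone sketch (not part of the tensor library) of that idea: pairing up adjacent elements round by round, so an associative function over n elements finishes in roughly log2(n) rounds, and every pair within a round could run on a separate core:

fn tree_reduce(mut values: Vec<f64>, f: impl Fn(f64, f64) -> f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    while values.len() > 1 {
        // One round: combine adjacent pairs. The pairs in a round are
        // independent, so a parallel implementation could split them across cores.
        values = values
            .chunks(2)
            .map(|pair| {
                if pair.len() == 2 {
                    f(pair[0], pair[1])
                } else {
                    pair[0]
                }
            })
            .collect();
    }
    Some(values[0])
}

// Summing [1..=8] finishes in 3 rounds instead of 7 sequential additions.
assert_eq!(
    tree_reduce((1..=8).map(|x| x as f64).collect(), |a, b| a + b),
    Some(36.0)
);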
We are going to implement this for our TensorStorage, using the binary form: a function that takes two scalars of the same type and returns a scalar of that type.
impl<T: Clone> TensorStorage<T> {
fn reduce<F>(&self, shape: &TensorShape, dim: usize, f: F) -> TensorStorage<T>
where
F: Fn(&T, &T) -> T,
T: Zero,
{
// Calculate result shape by removing the reduction dimension
let mut result_shape_vec = shape.shape.clone();
result_shape_vec.remove(dim);
if result_shape_vec.is_empty() {
// Reducing to scalar
let v = self
.data
.iter()
.cloned()
.reduce(|a, b| f(&a, &b))
.expect("Cannot reduce an empty tensor storage");
return TensorStorage { data: vec![v] };
}
let result_shape = TensorShape::new(result_shape_vec);
let mut result_storage = TensorStorage::zeros(result_shape.size());
// Iterate through all positions in the result tensor
for result_flat_idx in 0..result_shape.size() {
let result_multi_idx = result_shape.unravel_index(result_flat_idx);
// For each result position, reduce along the specified dimension
let mut accumulated_value: Option<T> = None;
for dim_idx in 0..shape.shape[dim] {
// Reconstruct the full multi-index by inserting the dimension index
let mut full_multi_idx = Vec::with_capacity(shape.shape.len());
let mut result_idx_pos = 0;
for d in 0..shape.shape.len() {
if d == dim {
full_multi_idx.push(dim_idx);
} else {
full_multi_idx.push(result_multi_idx[result_idx_pos]);
result_idx_pos += 1;
}
}
let source_flat_idx = shape.ravel_index(&full_multi_idx);
let value = &self[source_flat_idx];
accumulated_value = match accumulated_value {
None => Some(value.clone()),
Some(acc) => Some(f(&acc, value)),
};
}
result_storage[result_flat_idx] = accumulated_value.unwrap();
}
result_storage
}
}
And then add it to our Tensor implementation as well:
impl<T: Zero + Clone> Tensor<T> {
// Rest of implementation
fn reduce<F>(&self, dim: usize, f: F) -> Tensor<T>
where
F: Fn(&T, &T) -> T,
{
if dim >= self.shape.shape.len() {
panic!(
"dim {} out of bounds for tensor with {} dimensions",
dim,
self.shape.shape.len()
);
}
let result_storage = self.storage.reduce(&self.shape, dim, f);
// Calculate result shape by removing the reduction dimension
let mut result_shape_vec = self.shape.shape.clone();
result_shape_vec.remove(dim);
let result_tensor_shape = TensorShape::new(result_shape_vec);
Tensor {
shape: result_tensor_shape,
storage: result_storage,
}
}
}
So now we can reduce any tensor along any dimension.
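For example (again assuming field access as in the earlier sketch, and that f32 implements the library's Zero trait), summing a [2, 3] tensor along each axis:

let t = Tensor {
    shape: TensorShape::new(vec![2, 3]),
    storage: TensorStorage {
        data: vec![1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0],
    },
};

// Sum over dim 1 (within each row): [2, 3] -sum> [2]
let row_sums = t.reduce(1, |a, b| a + b);
assert_eq!(row_sums.storage.data, vec![6.0, 15.0]);

// Sum over dim 0 (within each column): [2, 3] -sum> [3]
let col_sums = t.reduce(0, |a, b| a + b);
assert_eq!(col_sums.storage.data, vec![5.0, 7.0, 9.0]);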
In Production: Candle
Candle's CPU implementation of reduce only supports a fixed set of reductions, described by the ReduceOp enum. As with map, this is because arbitrary functions are hard to translate between devices.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReduceOp {
Sum,
Min,
Max,
ArgMin,
ArgMax,
}
fn reduce_op(&self, op: ReduceOp, layout: &Layout, reduce_dims: &[usize]) -> Result<Self> {
match op {
ReduceOp::Sum => {
let src_dims = layout.dims();
let mut dst_dims = src_dims.to_vec();
for &dim in reduce_dims.iter() {
dst_dims[dim] = 1;
}
let dst_shape = Shape::from(dst_dims);
let mut reduce_dims = reduce_dims.to_vec();
// Sort the reduce_dims as they have to be processed from left to right when converting the
// indexes.
reduce_dims.sort();
let reduce_dims_and_stride: Vec<_> = reduce_dims
.iter()
.map(|&d| (src_dims[d], src_dims[d + 1..].iter().product::<usize>()))
.collect();
ReduceSum {
dst_shape: &dst_shape,
reduce_dims: &reduce_dims,
reduce_dims_and_stride,
}
.map(self, layout)
}
ReduceOp::Min | ReduceOp::ArgMin | ReduceOp::Max | ReduceOp::ArgMax => {
let reduce_dim_index = match reduce_dims {
[reduce_dim_index] => *reduce_dim_index,
_ => {
let op = match op {
ReduceOp::Min => "min",
ReduceOp::ArgMin => "argmin",
ReduceOp::Max => "max",
ReduceOp::ArgMax => "argmax",
_ => unreachable!(),
};
let dims = reduce_dims.to_vec();
Err(Error::OnlySingleDimension { op, dims })?
}
};
let (use_min, return_index) = match op {
ReduceOp::Min => (true, false),
ReduceOp::ArgMin => (true, true),
ReduceOp::Max => (false, false),
ReduceOp::ArgMax => (false, true),
_ => unreachable!(),
};
ReduceIndex {
reduce_dim_index,
use_min,
return_index,
}
.map(self, layout)
}
}
}
Broadcast
The broadcast (or Amplect) operation is the first operation that operates on multiple tensors. It applies a binary scalar function elementwise over a given set of corresponding dimensions. In machine learning frameworks, this is element-wise multiplication or addition.
A nice way to represent broadcast is using einop notation:
b h w c, c -func> b h w c
In this case, we have a four-dimensional tensor of shape [b, h, w, c], and we would like to combine it with a tensor of shape [c] using func.
Each entry in our output tensor is the output of a function similar to reduce:
func(Scalar, Scalar) -> Scalar
It's helpful to think of the labels in the einop notation as a formula for how each output index maps back to the input indices.
So if T1 has shape [b, h, w, c] and T2 has shape [c], where we explicitly state that T1's c dimension and T2's c dimension correspond, then:
Output[i, j, k, l] = func(T1[i, j, k, l], T2[l])
For the implementation, we first need a way to compute the output shape for the two tensors, so we implement a broadcast_shape function on TensorShape.
Instead of manually labeling each dimension as we did above with einop notation, we will simply provide the pairs of corresponding dimensions, and each pair collapses into the position of the left-hand tensor's dimension.
If you are familiar with broadcasting in most frameworks (including Candle), you might notice we are taking a different strategy from the usual broadcasting rules. I discuss this in my article about broadcasting; one of my main points is that standard broadcasting rules don't allow for the expressive and useful outer product. Here is an excerpt of the main point:
Broadcasting compares tensors dimension by dimension, starting from the rightmost:
- If both dimensions have the same size -> Continue
- If either dimension has size 1 -> Stretch it to match the other -> Continue
- If a tensor runs out of dimensions -> Treat missing dimensions as size 1
- Otherwise -> Broadcasting fails
For example: T1 with shape [a b c] and T2 with shape [b]
- Compare c vs b -> Different sizes, neither is 1 -> Broadcasting fails...
Why should [a b c] and [c] be compatible, but [a b c] and [b] incompatible? Both operations make mathematical sense. In fact, any two tensors can be combined through their outer product.
Take tensors A with shape [a b c] and B with shape [d e f]. Broadcasting calls these "unbroadcastable," but if we add enough singleton dimensions to A so that it has shape [a b c 1 1 1], they can suddenly broadcast to shape [a b c d e f], which is the outer product.
impl TensorShape {
// Rest of implementation
fn broadcast_shape(
&self,
other: &TensorShape,
corresponding_dimensions: &[(usize, usize)],
) -> TensorShape {
// Create mapping for corresponding dimensions
let mut dim_correspondence = std::collections::HashMap::new();
let mut other_dim_used = vec![false; other.shape.len()];
for &(self_dim, other_dim) in corresponding_dimensions {
if self_dim >= self.shape.len() || other_dim >= other.shape.len() {
panic!("Dimension index out of bounds");
}
dim_correspondence.insert(self_dim, other_dim);
other_dim_used[other_dim] = true;
}
// Build output shape: LHS dimensions (with broadcasting) + remaining RHS dimensions
let mut output_shape = Vec::new();
// Process LHS dimensions in order
for (self_dim, &self_size) in self.shape.iter().enumerate() {
if let Some(&other_dim) = dim_correspondence.get(&self_dim) {
let other_size = other.shape[other_dim];
if self_size == other_size {
output_shape.push(self_size);
} else if self_size == 1 {
output_shape.push(other_size);
} else if other_size == 1 {
output_shape.push(self_size);
} else {
panic!(
"Cannot broadcast dimensions: {} and {}",
self_size, other_size
);
}
} else {
output_shape.push(self_size);
}
}
// Add remaining RHS dimensions that weren't used in correspondence
for (other_dim, &other_size) in other.shape.iter().enumerate() {
if !other_dim_used[other_dim] {
output_shape.push(other_size);
}
}
TensorShape::new(output_shape)
}
}
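A couple of quick examples of what broadcast_shape produces (reading the shape field directly, as the implementation above does):

let lhs = TensorShape::new(vec![2, 3, 4]);
let rhs = TensorShape::new(vec![4]);

// Pair the last LHS dimension with the only RHS dimension: the shared
// dimension collapses into its position on the left-hand side.
let matched = lhs.broadcast_shape(&rhs, &[(2, 0)]);
assert_eq!(matched.shape, vec![2, 3, 4]);

// With no corresponding dimensions at all, every RHS dimension is appended
// after the LHS dimensions: the outer-product shape.
let outer = lhs.broadcast_shape(&rhs, &[]);
assert_eq!(outer.shape, vec![2, 3, 4, 4]);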
Then implement the broadcast operation on TensorStorage:
impl<T: Clone> TensorStorage<T> {
// Rest of the implementation
fn broadcast_op<F, U, T2>(
&self,
self_shape: &TensorShape,
other: &TensorStorage<T2>,
other_shape: &TensorShape,
corresponding_dimensions: &[(usize, usize)],
f: F,
) -> TensorStorage<U>
where
F: Fn(&T, &T2) -> U,
U: Zero + Clone,
{
let result_shape = self_shape.broadcast_shape(other_shape, corresponding_dimensions);
let mut result = TensorStorage::<U>::zeros(result_shape.size());
// Create mapping for corresponding dimensions
let mut dim_correspondence = std::collections::HashMap::new();
let mut other_dim_used = vec![false; other_shape.shape.len()];
for &(self_dim, other_dim) in corresponding_dimensions {
dim_correspondence.insert(self_dim, other_dim);
other_dim_used[other_dim] = true;
}
// Perform the element-wise operation
for flat_idx in 0..result_shape.size() {
let output_multi_idx = result_shape.unravel_index(flat_idx);
// Map output indices to input tensor indices
let mut self_idx = vec![0; self_shape.shape.len()];
let mut other_idx = vec![0; other_shape.shape.len()];
// Map LHS dimensions
for (self_dim, &self_size) in self_shape.shape.iter().enumerate() {
let output_val = output_multi_idx[self_dim];
if let Some(&other_dim) = dim_correspondence.get(&self_dim) {
if self_size == 1 {
self_idx[self_dim] = 0;
} else {
self_idx[self_dim] = output_val;
}
if other_shape.shape[other_dim] == 1 {
other_idx[other_dim] = 0;
} else {
other_idx[other_dim] = output_val;
}
} else {
self_idx[self_dim] = output_val;
}
}
// Map remaining RHS dimensions
let mut rhs_output_offset = self_shape.shape.len();
for (other_dim, _) in other_shape.shape.iter().enumerate() {
if !other_dim_used[other_dim] {
other_idx[other_dim] = output_multi_idx[rhs_output_offset];
rhs_output_offset += 1;
}
}
// Get values from input tensors using proper indexing
let self_flat = self_shape.ravel_index(&self_idx);
let other_flat = other_shape.ravel_index(&other_idx);
let self_val = &self[self_flat];
let other_val = &other[other_flat];
// Apply operation and store result
result[flat_idx] = f(self_val, other_val);
}
result
}
}
And finally add it to our Tensor implementation:
impl<T: Zero + Clone> Tensor<T> {
// Rest of the implementation
fn broadcast_op<F, U, T2>(
&self,
other: &Tensor<T2>,
corresponding_dimensions: &[(usize, usize)],
f: F,
) -> Tensor<U>
where
F: Fn(&T, &T2) -> U,
U: Zero + Clone,
{
let result_shape = self
.shape
.broadcast_shape(&other.shape, corresponding_dimensions);
let result_storage = self.storage.broadcast_op(
&self.shape,
&other.storage,
&other.shape,
corresponding_dimensions,
f,
);
Tensor {
shape: result_shape,
storage: result_storage,
}
}
}
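Here's the bias-add pattern from the einop example at the top of this section, a [1, 2, 2, 3] "image" combined with a [3] bias along the channel dimension (same field-access assumptions as the earlier sketches):

let image = Tensor {
    shape: TensorShape::new(vec![1, 2, 2, 3]),
    storage: TensorStorage {
        data: vec![0.0_f32; 12],
    },
};
let bias = Tensor {
    shape: TensorShape::new(vec![3]),
    storage: TensorStorage {
        data: vec![1.0_f32, 2.0, 3.0],
    },
};

// b h w c, c -add> b h w c: pair dim 3 of `image` with dim 0 of `bias`.
let biased = image.broadcast_op(&bias, &[(3, 0)], |a, b| a + b);
assert_eq!(biased.shape.shape, vec![1, 2, 2, 3]);

// Every group of 3 channel values now carries the bias.
assert_eq!(&biased.storage.data[0..3], &[1.0, 2.0, 3.0]);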
In Production: Candle
For broadcast, Candle uses this macro to wrap a given inner function: it first finds a common shape to broadcast each tensor to, then runs the inner function.
macro_rules! broadcast_binary_op {
($fn_name:ident, $inner_fn_name:ident) => {
pub fn $fn_name(&self, rhs: &Self) -> Result<Self> {
let lhs = self;
let shape = lhs
.shape()
.broadcast_shape_binary_op(rhs.shape(), stringify!($fn_name))?;
let l_broadcast = shape != *lhs.shape();
let r_broadcast = shape != *rhs.shape();
match (l_broadcast, r_broadcast) {
(true, true) => lhs
.broadcast_as(&shape)?
.$inner_fn_name(&rhs.broadcast_as(&shape)?),
(false, true) => lhs.$inner_fn_name(&rhs.broadcast_as(&shape)?),
(true, false) => lhs.broadcast_as(&shape)?.$inner_fn_name(rhs),
(false, false) => lhs.$inner_fn_name(rhs),
}
}
};
}
As with map and reduce, only a few pre-chosen functions can be applied this way, and for the same reason: it is difficult to run arbitrary functions on arbitrary devices.
Matrix Multiplication
Now that we have the 3 core data operations, we can build some compound ops.
We can create matrix multiplication using broadcast and reduce ops.
For a matrix A with shape [a b] and a matrix B with shape [b c], we have:
a b, b c -multiply> a b c -sum> a c
The first arrow represents a broadcast multiplication; the second represents a reduction that sums over the b dimension (the middle dimension of the intermediate [a b c] result).
In our tensor framework:
let intermediate = matrix_a.broadcast_op(&matrix_b, &[(1, 0)], |a, b| a * b);
let result = intermediate.reduce(1, |a, b| a + b);
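As a concrete check with 2x2 matrices (same assumptions as the earlier sketches), A = [[1, 2], [3, 4]] times B = [[5, 6], [7, 8]] should give [[19, 22], [43, 50]]:

let matrix_a = Tensor {
    shape: TensorShape::new(vec![2, 2]),
    storage: TensorStorage {
        data: vec![1.0_f32, 2.0, 3.0, 4.0],
    },
};
let matrix_b = Tensor {
    shape: TensorShape::new(vec![2, 2]),
    storage: TensorStorage {
        data: vec![5.0_f32, 6.0, 7.0, 8.0],
    },
};

// a b, b c -multiply> a b c
let intermediate = matrix_a.broadcast_op(&matrix_b, &[(1, 0)], |a, b| a * b);
assert_eq!(intermediate.shape.shape, vec![2, 2, 2]);

// a b c -sum> a c
let result = intermediate.reduce(1, |a, b| a + b);
assert_eq!(result.storage.data, vec![19.0, 22.0, 43.0, 50.0]);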
Batched Matrix Multiplication
Batched matrix multiplication uses the exact same two operations.
For a tensor A with shape [n a b] and a tensor B with shape [n b c], we have:
n a b, n b c -multiply> n a b c -sum> n a c
In our tensor framework:
let batch_intermediate = batch_a.broadcast_op(&batch_b, &[(0, 0), (2, 1)], |a, b| a * b);
let batch_result = batch_intermediate.reduce(2, |a, b| a + b);
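A quick shape trace of the batched version using broadcast_shape directly, e.g. a batch of 4 with a = 2, b = 3, c = 5:

let a_shape = TensorShape::new(vec![4, 2, 3]); // [n, a, b]
let b_shape = TensorShape::new(vec![4, 3, 5]); // [n, b, c]

// Pair n with n and b with b: the intermediate shape is [n, a, b, c],
// and reducing dim 2 (the shared b) leaves [n, a, c].
let intermediate_shape = a_shape.broadcast_shape(&b_shape, &[(0, 0), (2, 1)]);
assert_eq!(intermediate_shape.shape, vec![4, 2, 3, 5]);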
In Production: Candle
Candle has a dedicated matmul op, because matmul is ubiquitous and heavily optimized. It can handle an arbitrary number of batch dimensions.
Here is the Tensor-level implementation:
pub fn matmul(&self, rhs: &Self) -> Result<Self> {
let a_dims = self.shape().dims();
let b_dims = rhs.shape().dims();
let dim = a_dims.len();
if dim < 2 || b_dims.len() != dim {
Err(Error::ShapeMismatchBinaryOp {
lhs: self.shape().clone(),
rhs: rhs.shape().clone(),
op: "matmul",
}
.bt())?
}
let m = a_dims[dim - 2];
let k = a_dims[dim - 1];
let k2 = b_dims[dim - 2];
let n = b_dims[dim - 1];
let c_shape = Shape::from(&a_dims[..dim - 2]).extend(&[m, n]);
if c_shape.elem_count() == 0 || k == 0 {
return Tensor::zeros(c_shape, self.dtype(), self.device());
}
let batching: usize = a_dims[..dim - 2].iter().product();
let batching_b: usize = b_dims[..dim - 2].iter().product();
if k != k2 || batching != batching_b {
Err(Error::ShapeMismatchBinaryOp {
lhs: self.shape().clone(),
rhs: rhs.shape().clone(),
op: "matmul",
}
.bt())?
}
let storage = self.storage().matmul(
&rhs.storage(),
(batching, m, n, k),
self.layout(),
rhs.layout(),
)?;
let op = BackpropOp::new2(self, rhs, Op::Matmul);
Ok(from_storage(storage, c_shape, op, false))
}
which calls the CPU storage matmul:
fn matmul(
&self,
rhs: &Self,
bmnk: (usize, usize, usize, usize),
lhs_l: &Layout,
rhs_l: &Layout,
) -> Result<Self> {
MatMul(bmnk).map(self, lhs_l, rhs, rhs_l)
}
Softmax
We can create softmax over the last dimension with three steps:
- map exp: A: a b c -exp> a b c
- reduce sum: B: a b c -sum> a b
- broadcast divide: C: a b c, a b -div> a b c
So for a tensor T1 with shape [a, b, c], we have:
softmax(T1) = C(A(T1), B(A(T1)))
let exp_3d = tensor_3d.map(|x| x.exp());
let sum_3d = exp_3d.reduce(2, |a, b| a + b);
let softmax_3d = exp_3d.broadcast_op(&sum_3d, &[(0, 0), (1, 1)], |a, b| a / b);
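Continuing from the snippet above, a quick sanity check: after dividing by the per-row sums, reducing over the last dimension again should give (approximately) 1 everywhere:

// Each slice along the last dimension of a softmax output sums to 1.
let row_sums = softmax_3d.reduce(2, |a, b| a + b);
for s in &row_sums.storage.data {
    assert!((s - 1.0).abs() < 1e-6);
}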
In Production: Candle
Candle does something similar in candle-nn's softmax function:
pub fn softmax<D: candle::shape::Dim>(xs: &Tensor, dim: D) -> Result<Tensor> {
let dim = dim.to_index(xs.shape(), "softmax")?;
let max = xs.max_keepdim(dim)?;
let diff = xs.broadcast_sub(&max)?;
let num = diff.exp()?;
let den = num.sum_keepdim(dim)?;
num.broadcast_div(&den)
}
Code
To run the code from this blog post:
First, clone the repository:
git clone git@github.com:greenrazer/easytensor.git
Then check out the tagged commit for this part:
git checkout part-3
Then run the tests:
cargo test
Look at the code in src/.
Next Steps
We've now implemented the three core data operations that form the foundation of all tensor libraries: map, reduce, and broadcast. These operations are the computational workhorses that make libraries like Candle and PyTorch possible.
This concludes section 1 of our series. In section 2, we'll shift our focus to performance optimization, making our operations fast enough for real-world use.
In the meantime, I encourage you to experiment with these operations. Try implementing other compound operations like attention mechanisms, or explore how different combinations of our core operations can create complex behaviors. The beauty of this compositional approach is that once you understand these three building blocks, you can construct almost any tensor operation you need.