Module Ocannl.Nn_blocks

Neural Network Building Blocks

This file contains basic building blocks for neural networks, with limited functionality. Feel free to copy-paste and modify as needed.

Design principles, OCANNL fundamentals, and common patterns:

val kaiming_impl : ?scale_sq:Base.float -> Ocannl_tensor.Tensor.grad_spec -> (unit -> Ocannl_tensor.Tensor.t) -> unit -> Ocannl_tensor.Tensor.op_fun
val xavier_impl : ?scale_sq:Base.float -> Ocannl_tensor.Tensor.grad_spec -> (unit -> Ocannl_tensor.Tensor.t) -> unit -> Ocannl_tensor.Tensor.op_fun
module DSL_modules : sig ... end
val class_ids_of_int_list : ?label:Base.string -> int list -> Ocannl_tensor.Tensor.t

Convert a list of integers to a compact tensor of class IDs (no num_classes allocation).

  • parameter lst

    List of integer class indices (0-based)

  • returns

    A tensor of shape len (a len-sized batch axis) holding the IDs as floats.

val one_hot_of_ids : num_classes:Base__Int.t -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Build a logical one-hot tensor from a tensor of class IDs, using only existing operations (range + equality) so the compiler keeps the proof that the result is one-hot (enabling the gh-343 embedding gather optimization). No dense len * num_classes data is materialized on the host. With ids shaped as a len batch (output rank 0), the result has shape len; num_classes, where one_hot[i, k] = (k == ids[i]).

  • parameter num_classes

    The number of classes (size of the one-hot dimension).
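The range + equality construction behind one_hot_of_ids can be illustrated on plain OCaml arrays. This is only a sketch of the arithmetic one_hot[i, k] = (k == ids[i]); it uses the standard library rather than OCANNL tensors, and the name one_hot_sketch is hypothetical.

```ocaml
(* Plain-OCaml illustration of one_hot[i, k] = (k == ids[i]).
   Not the OCANNL implementation: it works on arrays, not tensors. *)
let one_hot_sketch ~num_classes (ids : int array) : float array array =
  (* range: the class indices 0 .. num_classes - 1 *)
  let range = Array.init num_classes (fun k -> k) in
  (* equality: compare every class index against each ID *)
  Array.map
    (fun id -> Array.map (fun k -> if k = id then 1.0 else 0.0) range)
    ids

let () =
  (* Two IDs with three classes give a 2 x 3 table. *)
  let rows = one_hot_sketch ~num_classes:3 [| 2; 0 |] in
  Array.iter
    (fun row ->
      Array.iter (fun v -> Printf.printf "%.0f " v) row;
      print_newline ())
    rows
```

Unlike this sketch, the documented function keeps the result logical, so the len * num_classes table is never stored on the host.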

val one_hot_of_int_list : num_classes:Base__Int.t -> int list -> Ocannl_tensor.Tensor.t

Convert a list of integers to a logical one-hot encoded tensor of shape len; num_classes. This composes class_ids_of_int_list and one_hot_of_ids: it stores only len compact IDs on the host and expresses the one-hot logically, rather than allocating a dense len * num_classes Bigarray. See dense_one_hot_of_int_list if a materialized host one-hot is genuinely required.

  • parameter num_classes

    The number of classes (size of the one-hot dimension)

  • parameter lst

    List of integer class indices (0-based)

val dense_one_hot_of_int_list : num_classes:Base__Int.t -> int Base.List.t -> Ocannl_tensor.Tensor.t

Convert a list of integers to a dense, host-materialized one-hot Bigarray-backed tensor of shape len; num_classes. Prefer one_hot_of_int_list (logical) unless a dense host fixture is needed; a materialized Bigarray carries no proof that it is one-hot, so it cannot be optimized into an embedding gather.

val mlp_layer : label:Base.string Base.list -> hid_dim:Base.int -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val dropout : rate:Base.Float.t -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Dropout: during training, randomly masks elements and scales the kept ones by 1/keep_prob (keep_prob = 1 - rate) so that the expected value is unchanged. When train_step = None, the dropout rate is ignored and the tensor is returned unmodified.
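The masking and rescaling can be sketched on a plain float array. The sketch assumes keep_prob = 1 - rate and draws the mask with the standard Random module; the function names and the ~training flag (standing in for train_step being Some or None) are hypothetical.

```ocaml
(* Inverted dropout on a plain float array; mask.(i) = true means "keep".
   Assumption: keep_prob = 1 - rate. *)
let dropout_with_mask ~rate (mask : bool array) (x : float array) : float array =
  let keep_prob = 1.0 -. rate in
  Array.mapi (fun i v -> if mask.(i) then v /. keep_prob else 0.0) x

(* Keep each element with probability 1 - rate. *)
let random_mask ~rate n = Array.init n (fun _ -> Random.float 1.0 >= rate)

(* ~training is a hypothetical stand-in for train_step being Some _ or None. *)
let dropout_sketch ~rate ~training (x : float array) : float array =
  if not training then x (* rate ignored, input returned unmodified *)
  else dropout_with_mask ~rate (random_mask ~rate (Array.length x)) x
```

Dividing by keep_prob at training time means no rescaling is needed at inference time.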

val mlp : label:Base.string Base.list -> hid_dims:Base.int Base.List.t -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Multi-layer perceptron of depth List.length hid_dims + 1, with a linear output layer.

val reduce_specified_axes : Base.String.t -> Base.String.t
val softmax : spec:Base.String.t -> ?temperature:Base.Float.t -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Softmax across specified axes. Does not support non-default row variables.
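For reference, a softmax over a single axis can be sketched on a plain float array. The sketch assumes that temperature divides the logits and subtracts the maximum for numerical stability; neither detail is confirmed by the text above, and softmax_sketch is a hypothetical name.

```ocaml
(* Softmax over one axis, represented as a float array.
   Assumptions: logits are divided by the temperature, and the maximum
   is subtracted before exponentiation for numerical stability. *)
let softmax_sketch ?(temperature = 1.0) (logits : float array) : float array =
  let scaled = Array.map (fun v -> v /. temperature) logits in
  let m = Array.fold_left max neg_infinity scaled in
  let exps = Array.map (fun v -> exp (v -. m)) scaled in
  let total = Array.fold_left ( +. ) 0.0 exps in
  Array.map (fun e -> e /. total) exps

let () =
  let probs = softmax_sketch ~temperature:2.0 [| 1.0; 2.0; 3.0 |] in
  Array.iter (fun p -> Printf.printf "%.4f " p) probs;
  print_newline ()
```

Under this assumption, a higher temperature flattens the distribution and a lower one sharpens it.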

Position Embedding Strategies

type position_embedding =
  | Learned_additive
      (* Current default: learned parameter added to input embeddings. *)
  | Sinusoidal_additive of {
      enc_encoding : DSL_modules.Tensor.t;
      dec_encoding : DSL_modules.Tensor.t;
    }
      (* Fixed sinusoidal encoding added to input embeddings. Use separate tensors for encoder and decoder when d_enc <> d_dec. For equal widths, the same tensor can be passed for both. Build with sinusoidal_position_encoding. *)
  | RoPE of {
      freqs : DSL_modules.Tensor.t;
      positions : DSL_modules.Tensor.t;
    }
      (* Rotary embeddings applied to Q/K inside self-attention. No additive component. *)
  | No_pos_embed
      (* No position information. *)

Strategy for positional encoding in attention / transformer blocks.

val rope_frequencies : half_d:Base.int -> ?base:Base.Float.t -> unit -> Ocannl_tensor.Tensor.t

RoPE inverse frequencies: theta_k = base^(-2k/d) for k = 0..half_d-1, where d = 2 * half_d.

  • parameter half_d

    Per-head key dimension divided by 2 (i.e. d_k / 2, NOT d_model / 2).

  • parameter base

    Default 10000.0; some models use 500000 for long contexts.
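The frequency formula can be checked numerically with a plain-OCaml sketch. It takes d = 2 * half_d and returns a float array rather than a tensor; the name rope_freqs_sketch is hypothetical.

```ocaml
(* theta_k = base^(-2k/d) for k = 0 .. half_d - 1, with d = 2 * half_d.
   Returns a plain array; the documented function returns a tensor. *)
let rope_freqs_sketch ?(base = 10000.0) ~half_d () : float array =
  let d = float_of_int (2 * half_d) in
  Array.init half_d (fun k -> base ** (-2.0 *. float_of_int k /. d))

let () =
  (* For a per-head key dimension d_k = 8, pass half_d = 4. *)
  Array.iter (fun f -> Printf.printf "%g " f) (rope_freqs_sketch ~half_d:4 ());
  print_newline ()
```

The first frequency is always 1, and later ones decay geometrically towards 1/base.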

val position_indices : seq_len:Base.int -> unit -> Ocannl_tensor.Tensor.t

Position indices 0, 1, ..., seq_len-1 as a non-learned batch-dim tensor.

val sinusoidal_position_encoding : d_model:Base.int -> max_len:Base.int -> unit -> Ocannl_tensor.Tensor.t

Sinusoidal positional encoding (Vaswani et al. 2017). Non-learned, shape: batch_dims=max_len, output_dims=d_model. Matches model width at the transformer input level, NOT per-head width.
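A sinusoidal table in the style of Vaswani et al. can be sketched as a max_len x d_model array. The sketch assumes the interleaved layout from the paper (sine at even indices, cosine at odd indices) and the constant 10000; whether the library interleaves or concatenates the two halves is not stated above, and sinusoidal_table_sketch is a hypothetical name.

```ocaml
(* PE(pos, 2i)   = sin (pos / 10000^(2i / d_model))
   PE(pos, 2i+1) = cos (pos / 10000^(2i / d_model))
   Assumption: interleaved sine/cosine layout, as in the original paper. *)
let sinusoidal_table_sketch ~d_model ~max_len : float array array =
  Array.init max_len (fun pos ->
      Array.init d_model (fun j ->
          let i = j / 2 in
          let angle =
            float_of_int pos
            /. (10000.0 ** (2.0 *. float_of_int i /. float_of_int d_model))
          in
          if j mod 2 = 0 then sin angle else cos angle))

let () =
  let table = sinusoidal_table_sketch ~d_model:4 ~max_len:3 in
  Printf.printf "%d positions x %d features\n" (Array.length table)
    (Array.length table.(0))
```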

Apply RoPE rotation to tensor x whose last output axis has even size d. Rotates within the last output axis (per-head width d) without crossing head boundaries. freqs has output=d/2, positions has batch=seq_len.
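The rotation can be sketched for a single position and a single head on plain arrays. The sketch assumes adjacent elements (2k, 2k+1) form the rotated pairs; the pairing used by the library (adjacent or split-half) is not stated above, and rope_rotate_sketch is a hypothetical name.

```ocaml
(* Rotate each pair (x.(2k), x.(2k+1)) by the angle position * freqs.(k).
   Assumption: adjacent elements are paired; x has even length d and
   freqs has length d / 2. *)
let rope_rotate_sketch ~(freqs : float array) ~(position : int) (x : float array)
    : float array =
  let half = Array.length freqs in
  if Array.length x <> 2 * half then
    invalid_arg "rope_rotate_sketch: x must have twice the length of freqs";
  let out = Array.make (2 * half) 0.0 in
  for k = 0 to half - 1 do
    let angle = float_of_int position *. freqs.(k) in
    let c = cos angle and s = sin angle in
    let a = x.(2 * k) and b = x.((2 * k) + 1) in
    out.(2 * k) <- (a *. c) -. (b *. s);
    out.((2 * k) + 1) <- (a *. s) +. (b *. c)
  done;
  out
```

Because each pair undergoes a pure rotation, the norm of x is preserved and position 0 leaves x unchanged.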

val multi_head_attention : label:Base.string Base.list -> num_heads:Base.int -> d_k:Base.Int.t -> d_v:Base.int -> ?temperature:Base.Float.t -> ?dropout_rate:Base.Float.t -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> ?mask:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val multi_head_att_workshop : num_heads:Base.int -> d_k:Base.int -> d_v:Base.int -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val layer_norm : label:Base.string Base.list -> ?epsilon:Base.float -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val transformer_encoder_block : label:Base.string list -> num_heads:Base.int -> d_k:Base.Int.t -> d_v:Base.int -> d_ff:Base.int -> ?epsilon:Base.float -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val decoder_only_block : label:Base.string list -> num_heads:Base.int -> d_k:Base.Int.t -> d_v:Base.int -> d_ff:Base.int -> ?epsilon:Base.float -> ?dropout_rate:Base.Float.t -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> mask:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Decoder-only transformer block: masked self-attention + FFN with post-norm LayerNorm. Like transformer_encoder_block but accepts a ~mask parameter for causal masking. No cross-attention — suitable for autoregressive language models.
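A causal mask of the kind passed as ~mask can be sketched as a lower-triangular table. The convention used here (1.0 where a query position may attend to a key position, 0.0 otherwise) and the query-by-key layout are assumptions, not taken from the text above; causal_mask_sketch is a hypothetical name.

```ocaml
(* Lower-triangular causal mask: the entry at (q, k) is 1.0 when key
   position k is at or before query position q.
   Assumptions: 1.0 means "may attend"; rows are queries, columns keys. *)
let causal_mask_sketch seq_len : float array array =
  Array.init seq_len (fun q ->
      Array.init seq_len (fun k -> if k <= q then 1.0 else 0.0))

let () =
  Array.iter
    (fun row ->
      Array.iter (fun v -> Printf.printf "%.0f " v) row;
      print_newline ())
    (causal_mask_sketch 3)
```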

val decoder_only : label:Base.string list -> num_layers:int -> num_heads:Base.int -> d_k:Base.Int.t -> d_v:Base.int -> d_ff:Base.int -> ?epsilon:Base.float -> ?dropout_rate:Base.Float.t -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> mask:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Stack of decoder_only_block layers.

val cross_attention : label:Base.string Base.list -> num_heads:Base.int -> d_k:Base.int -> d_v:Base.int -> ?temperature:Base.Float.t -> ?dropout_rate:Base.Float.t -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> enc_output:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val transformer_decoder_block : label:Base.string list -> num_heads:Base.int -> d_k:Base.Int.t -> d_v:Base.int -> d_ff:Base.int -> ?epsilon:Base.float -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> enc_output:Ocannl_tensor.Tensor.t -> mask:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val transformer_encoder : label:Base.string list -> num_layers:int -> num_heads:Base.int -> d_k:Base.Int.t -> d_v:Base.int -> d_ff:Base.int -> ?epsilon:Base.float -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val transformer_decoder : label:Base.string list -> num_layers:int -> num_heads:Base.int -> d_k:Base.Int.t -> d_v:Base.int -> d_ff:Base.int -> ?epsilon:Base.float -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> Ocannl_tensor.Tensor.t -> enc_output:Ocannl_tensor.Tensor.t -> mask:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val transformer : label:Base.string Base.list -> num_encoder_layers:int -> num_decoder_layers:int -> num_heads:Base__Int.t -> d_enc:Base.int -> d_dec:Base.int -> d_ff:Base.int -> ?epsilon:Base.float -> ?pos_embed:position_embedding -> unit -> train_step:Ir.Indexing.static_symbol option -> src:Ocannl_tensor.Tensor.t -> tgt:Ocannl_tensor.Tensor.t -> mask:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t
val transformer_with_loss : label:'a -> model: (train_step:'b -> src:'c -> tgt:'d -> mask:'e -> Ocannl_tensor.Tensor.t) -> unit -> train_step:'b -> src:'c -> tgt_input:'d -> tgt_target:Ocannl_tensor.Tensor.t -> mask:'e -> Ocannl_tensor.Tensor.t * Ocannl_tensor.Tensor.t

Transformer with teacher forcing for autoregressive training.

TODO: Simplify once tensor shifting/slicing is better supported in shape inference. Currently requires pre-shifted tgt_input (all but last token) and tgt_target (all but first token). During training, the model learns to predict tgt_target given tgt_input.
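The pre-shifting described above can be sketched on a plain token list before any tensors are built. The helper name shift_for_teacher_forcing is hypothetical and is not part of this module.

```ocaml
(* Split one target sequence into the two pre-shifted views:
   tgt_input  = all but the last token,
   tgt_target = all but the first token. *)
let shift_for_teacher_forcing (tokens : 'a list) : 'a list * 'a list =
  match tokens with
  | [] -> invalid_arg "shift_for_teacher_forcing: empty sequence"
  | _ :: tgt_target ->
      let n = List.length tokens in
      let tgt_input = List.filteri (fun i _ -> i < n - 1) tokens in
      (tgt_input, tgt_target)

let () =
  let tgt_input, tgt_target =
    shift_for_teacher_forcing [ "<s>"; "a"; "b"; "</s>" ]
  in
  Printf.printf "input:  %s\ntarget: %s\n"
    (String.concat " " tgt_input)
    (String.concat " " tgt_target)
```

At each position the model sees tgt_input up to that position and is trained to predict the token at the same position in tgt_target.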

Convolutional Neural Network Building Blocks

val conv2d : label:Base.string Base.list -> ?kernel_size:Base.int -> ?stride:Base.Int.t -> ?use_padding:bool -> ?out_channels:Base.int -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

2D convolution layer with flexible padding and stride options.

When use_padding=false and stride > 1, the input spatial dimensions must satisfy: (input_size - kernel_size) mod stride = 0, otherwise shape inference will fail with "incompatible stride" error. The output size is (input_size - kernel_size) / stride + 1.

When use_padding=true, there is no such restriction and output size is input_size / stride.

  • parameter out_channels

    Optional number of output channels. If not provided, must be inferred from context (e.g., from a downstream operation that constrains the output shape).
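The output-size rules for conv2d can be checked ahead of time with a small helper. This is a plain-OCaml sketch of the arithmetic only: the name conv_output_size is hypothetical, and using integer division for padded inputs whose size is not a multiple of stride is an assumption.

```ocaml
(* Output size along one spatial dimension, following the rules above.
   Assumption: with padding, input_size / stride is integer division. *)
let conv_output_size ~use_padding ~kernel_size ~stride input_size
    : (int, string) result =
  if use_padding then Ok (input_size / stride)
  else if input_size < kernel_size then Error "kernel larger than input"
  else if (input_size - kernel_size) mod stride <> 0 then
    Error "incompatible stride"
  else Ok (((input_size - kernel_size) / stride) + 1)

let () =
  let show = function
    | Ok n -> string_of_int n
    | Error msg -> "error: " ^ msg
  in
  print_endline
    (show (conv_output_size ~use_padding:false ~kernel_size:3 ~stride:2 7));
  print_endline
    (show (conv_output_size ~use_padding:false ~kernel_size:3 ~stride:2 8))
```

Running such a check on the input dimensions before building the model makes a stride mismatch easier to spot than a later shape-inference failure.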

val depthwise_separable_conv2d : label:Base.string Base.list -> ?kernel_size:Base.int -> ?stride:Base.Int.t -> ?use_padding:bool -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Depthwise separable convolution - more efficient for mobile/edge devices. Consists of depthwise conv (spatial filtering per channel) followed by pointwise conv (1x1 conv for channel mixing).

See conv2d for dimension constraints when use_padding=false.
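The efficiency gain can be made concrete by counting weights. The sketch below ignores biases and assumes one depthwise filter per input channel; both function names are hypothetical.

```ocaml
(* Weight count of a standard k x k convolution. *)
let standard_conv_params ~kernel_size ~in_channels ~out_channels =
  kernel_size * kernel_size * in_channels * out_channels

(* Depthwise (one k x k filter per input channel) plus pointwise (1x1)
   convolution. Assumptions: no biases, channel multiplier of 1. *)
let depthwise_separable_params ~kernel_size ~in_channels ~out_channels =
  (kernel_size * kernel_size * in_channels) + (in_channels * out_channels)

let () =
  let kernel_size = 3 and in_channels = 32 and out_channels = 64 in
  Printf.printf "standard: %d, separable: %d\n"
    (standard_conv_params ~kernel_size ~in_channels ~out_channels)
    (depthwise_separable_params ~kernel_size ~in_channels ~out_channels)
```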

val max_pool2d : ?stride:Base.Int.t -> ?window_size:Base.int -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Max pooling for 2D spatial data - reduces spatial dimensions by taking maximum values.

The input spatial dimensions must satisfy: (input_size - window_size) mod stride = 0, otherwise shape inference will fail. The output size is (input_size - window_size) / stride + 1.

Note: The < in the einsum spec indicates no-padding mode (indices stay within bounds).
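The windowing and the size constraint can be sketched for a single channel on a plain 2D array. This illustrates the arithmetic only and is not the einsum-based implementation; max_pool2d_sketch is a hypothetical name, and the input is assumed to be non-empty and rectangular.

```ocaml
(* No-padding max pooling over one non-empty, rectangular 2D feature map. *)
let max_pool2d_sketch ~stride ~window_size (img : float array array)
    : float array array =
  let h = Array.length img and w = Array.length img.(0) in
  if (h - window_size) mod stride <> 0 || (w - window_size) mod stride <> 0 then
    invalid_arg "max_pool2d_sketch: (input_size - window_size) mod stride <> 0";
  let out_h = ((h - window_size) / stride) + 1 in
  let out_w = ((w - window_size) / stride) + 1 in
  Array.init out_h (fun oh ->
      Array.init out_w (fun ow ->
          (* Take the maximum over the window anchored at (oh, ow) * stride. *)
          let best = ref neg_infinity in
          for dh = 0 to window_size - 1 do
            for dw = 0 to window_size - 1 do
              best := max !best img.((oh * stride) + dh).((ow * stride) + dw)
            done
          done;
          !best))
```

Every index read stays inside the input, which is what the no-padding mode requires.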

val avg_pool2d : ?stride:Base.Int.t -> ?window_size:Base.int -> unit -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Average pooling for 2D spatial data - reduces spatial dimensions by averaging values.

See max_pool2d for dimension constraints.

val global_avg_pool2d : DSL_modules.Tensor.t -> Ocannl_tensor.Tensor.t

Global average pooling - reduces each feature map to a single value by averaging. Commonly used before final classification layer.
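On plain arrays, the reduction amounts to one mean per feature map. The sketch assumes a channels x height x width layout with non-empty maps, which is an illustrative choice rather than the library's axis convention; global_avg_pool_sketch is a hypothetical name.

```ocaml
(* One mean per feature map.
   Assumption: layout is channels x height x width, maps are non-empty. *)
let global_avg_pool_sketch (feature_maps : float array array array)
    : float array =
  Array.map
    (fun fm ->
      let sum = ref 0.0 and count = ref 0 in
      Array.iter
        (fun row ->
          Array.iter
            (fun v ->
              sum := !sum +. v;
              incr count)
            row)
        fm;
      !sum /. float_of_int !count)
    feature_maps
```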

val batch_norm2d : label:Base.string Base.list -> ?epsilon:Base.float -> ?momentum:float -> unit -> train_step:'a option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Batch normalization for CNN layers - normalizes across the batch dimension for each channel. Typically applied after convolutions and before activations.

val batch_norm1d : label:Base.string Base.list -> ?epsilon:Base.float -> ?momentum:float -> unit -> train_step:'a option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Batch normalization for MLP layers - normalizes across the batch axis only. Unlike batch_norm2d there are no spatial axes to reduce over; channel axes are carried through unchanged via the ..c.. row variable.

See the FIXME on batch_norm2d: running statistics are not implemented, so momentum is ignored and inference falls back to the learned gamma/beta parameters rather than population statistics. Acceptable for tutorial examples; do not rely on inference correctness for distribution-shifted inputs.
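The training-time normalization across the batch axis can be sketched for one feature on a plain array. The default epsilon of 1e-5, the use of the biased (divide by n) variance, and the name batch_norm_sketch are assumptions for illustration.

```ocaml
(* Normalize the values of one feature across the batch, then apply the
   learned scale gamma and shift beta.
   Assumptions: biased variance, epsilon default of 1e-5. *)
let batch_norm_sketch ?(epsilon = 1e-5) ~gamma ~beta (batch : float array)
    : float array =
  let n = float_of_int (Array.length batch) in
  let mean = Array.fold_left ( +. ) 0.0 batch /. n in
  let var =
    Array.fold_left (fun acc v -> acc +. ((v -. mean) ** 2.0)) 0.0 batch /. n
  in
  Array.map
    (fun v -> (gamma *. (v -. mean) /. sqrt (var +. epsilon)) +. beta)
    batch

let () =
  Array.iter
    (fun v -> Printf.printf "%.3f " v)
    (batch_norm_sketch ~gamma:1.0 ~beta:0.0 [| 1.0; 3.0; 5.0 |]);
  print_newline ()
```

A full implementation would also track running means and variances for use at inference time, which is the part the note above describes as not implemented.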

val conv_bn_relu : label:Base.string list -> ?kernel_size:Base.int -> ?stride:Base.Int.t -> unit -> train_step:'a option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Convolution block following the conv -> batch norm -> activation pattern.

val resnet_block : label:Base.string list -> ?stride:Base.Int.t -> unit -> train_step:'a option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Residual block for ResNet-style architectures. Features skip connections that help with gradient flow in deep networks.

val lenet : ?label:Base.string list -> ?out_channels1:Base.int -> ?out_channels2:Base.int -> unit -> train_step:'a -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

LeNet-style architecture for simple image classification (e.g., MNIST). Classic architecture: conv -> pool -> conv -> pool -> fc layers. Output shape is inferred from training data.

val vgg_block : label:Base.string list -> num_convs:int -> ?kernel_size:Base.int -> unit -> train_step:'a option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

VGG-style block: multiple convolutions with the same filter count, followed by pooling.

val sokoban_cnn : label:Base.string Base.list -> ?num_actions:Base.int -> unit -> train_step:'a option -> grid_state:Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t * Ocannl_tensor.Tensor.t

Simple CNN for Sokoban-like grid environments. Processes grid states with multiple conv layers and outputs action logits.

val mobile_cnn : label:Base.string Base.list -> ?num_classes:Base.int -> ?width_mult:float -> unit -> train_step:'a option -> Ocannl_tensor.Tensor.t -> Ocannl_tensor.Tensor.t

Modern CNN with depthwise separable convolutions for efficiency. Suitable for mobile/edge deployment.