bio-attention.positional Reference

This module provides classes for adding positional encodings or positional biases to attention modules.

class bio_attention.positional.Sinusoidal(dim: int, divide: float = 1.0, learned_div: bool = False)

Sinusoidal positional embedding as in Vaswani et al. (2017). Supports specifying positions, masking, and division of positional range.

Parameters:
  • dim (int) – Hidden size of the embeddings

  • divide (float, optional) – Divide positions by this factor. Useful for large (or small) numerical ranges. By default 1.0.

  • learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.

mod_x(x: Tensor, pos: Tensor | None = None, mask: Tensor | None = None, **kwargs) -> Tensor

Modify x by adding positional encodings.

Parameters:
  • x (torch.Tensor) – (B,*,L,H)

  • pos (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None, in which case positions are computed from 0 to L-1. If pos is shorter than the sequence length of x, it is padded on the left-hand side so that the padded tokens do not have any positional encoding added.

  • mask (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None. A boolean mask can be used to explicitly indicate which positions should not have positional encodings added. If mask is shorter than the sequence length of x, it is padded on the left-hand side so that the padded tokens do have a positional encoding added.

Returns:

(B,*,L,H)

Return type:

torch.Tensor
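
To illustrate the encoding this class is based on, here is a minimal pure-Python sketch of the sinusoidal scheme from Vaswani et al. (2017) for a single sequence. The helper name sinusoidal_encoding, the interleaved sin/cos layout, the base of 10000, and the way divide is applied (plain division of the positions before encoding) are assumptions made for illustration. The library's actual implementation may differ.

```python
import math


def sinusoidal_encoding(positions, dim, divide=1.0):
    """Return one sinusoidal vector of size `dim` per position.

    Hypothetical helper for illustration; not part of bio_attention.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # One inverse frequency per (sin, cos) pair: 1 / 10000^(2i / dim).
    inv_freq = [1.0 / (10000 ** (2 * i / dim)) for i in range(dim // 2)]
    encodings = []
    for p in positions:
        scaled = p / divide  # assumed effect of the `divide` argument
        row = []
        for f in inv_freq:
            # Even indices hold the sine, odd indices the cosine.
            row.extend((math.sin(scaled * f), math.cos(scaled * f)))
        encodings.append(row)
    return encodings


# Default positions 0 .. L-1 for a sequence of length 4.
enc = sinusoidal_encoding(range(4), dim=8)

# Large coordinates brought back into a small range via `divide`.
enc_scaled = sinusoidal_encoding([0, 1000, 2000, 3000], dim=8, divide=1000.0)
```

In this sketch, positions 0, 1000, 2000 and 3000 with divide=1000.0 yield the same vectors as positions 0 to 3, which is the kind of rescaling the divide argument is described as being useful for.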

class bio_attention.positional.LearnedVocab(dim: int, max_seq_len: int)

Learned positional embedding vocabulary as in Devlin et al. (2018). Supports specifying positions. Only works for discrete positional indices.

Parameters:
  • dim (int) – Hidden size of the embeddings

  • max_seq_len (int) – Maximum sequence length or vocab size of the learned embeddings

mod_x(x: Tensor, pos: Tensor | None = None, mask: Tensor | None = None, **kwargs) -> Tensor

Modify x by adding positional encodings.

Parameters:
  • x (torch.Tensor) – (B,*,L,H)

  • pos (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None, in which case positions are computed from 0 to L-1. If pos is shorter than the sequence length of x, it is padded on the left-hand side so that the padded tokens do not have any positional encoding added.

  • mask (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None. A boolean mask can be used to explicitly indicate which positions should not have positional encodings added. If mask is shorter than the sequence length of x, it is padded on the left-hand side so that the padded tokens do have a positional encoding added.

Returns:

(B,*,L,H)

Return type:

torch.Tensor
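
The restriction to discrete indices and the role of max_seq_len follow from the lookup-table nature of a learned vocabulary. The following pure-Python sketch shows the idea for a single sequence. The class PositionTable, its add_to method, and the random initialization are hypothetical and only stand in for a trainable embedding table; they are not the library's API.

```python
import random


class PositionTable:
    """Lookup table with one vector per discrete position.

    Hypothetical stand-in for illustration; not part of bio_attention.
    """

    def __init__(self, dim, max_seq_len, seed=0):
        rng = random.Random(seed)
        # In a real model these rows would be trainable parameters.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(max_seq_len)]

    def add_to(self, x, pos=None):
        """Add the row for each position to the matching token vector in x."""
        if pos is None:
            pos = range(len(x))  # default positions 0 .. L-1
        out = []
        for token, p in zip(x, pos):
            # Only integer indices below max_seq_len have a row in the table.
            if not isinstance(p, int) or not 0 <= p < len(self.table):
                raise IndexError(f"position {p!r} is not a valid table index")
            out.append([t + e for t, e in zip(token, self.table[p])])
        return out


table = PositionTable(dim=4, max_seq_len=16)
x = [[0.0] * 4 for _ in range(3)]         # one sequence with L=3, H=4
shifted = table.add_to(x, pos=[5, 6, 7])  # explicit positions
```

A fractional position such as 2.5, or an index of 16 or more with this table, has no row to look up, which is why the sketch raises an error in those cases.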

class bio_attention.positional.LearnedContinuous(dim: int, depth: int = 1, norm: bool = False, divide: float = 1.0, learned_div: bool = False)

Learned embeddings with a continuity between absolute positional indices, as learned by a series of linear layers. Supports specifying positions, and division of positional range.

Parameters:
  • dim (int) – Hidden size of the embeddings

  • depth (int, optional) – Number of hidden layers in the positional embedding network. The network follows this structure: Linear -> (Norm if norm) { -> Swish -> Linear -> (Norm if norm) } * (depth-1). By default 1, for a linear embedding.

  • norm (bool, optional) – Whether to use LayerNorms in the embedding net, by default False

  • divide (float, optional) – Divide positions by this factor. Useful for large (or small) numerical ranges. By default 1.0.

  • learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.

mod_x(x: Tensor, pos: Tensor | None = None, mask: Tensor | None = None, **kwargs) -> Tensor

Modify x by adding positional encodings.

Parameters:
  • x (torch.Tensor) – (B,*,L,H)

  • pos (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None, in which case positions are computed from 0 to L-1. If pos is shorter than the sequence length of x, it is padded on the left-hand side so that the padded tokens do not have any positional encoding added.

  • mask (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None. A boolean mask can be used to explicitly indicate which positions should not have positional encodings added. If mask is shorter than the sequence length of x, it is padded on the left-hand side so that the padded tokens do have a positional encoding added.

Returns:

(B,*,L,H)

Return type:

torch.Tensor
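
The layer ordering given for the depth and norm arguments can be spelled out with a small helper. This is only a reading aid for the documented structure. The function embedding_net_layout is hypothetical, and the layer names are labels rather than actual modules.

```python
def embedding_net_layout(depth=1, norm=False):
    """List the layers of the documented embedding network, in order.

    Hypothetical helper for illustration; not part of bio_attention.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    maybe_norm = ["Norm"] if norm else []
    # First block: Linear -> (Norm if norm)
    layers = ["Linear"] + maybe_norm
    # Each extra level of depth appends: Swish -> Linear -> (Norm if norm)
    for _ in range(depth - 1):
        layers += ["Swish", "Linear"] + maybe_norm
    return layers


print(embedding_net_layout(depth=1))
# ['Linear']
print(embedding_net_layout(depth=3, norm=True))
# ['Linear', 'Norm', 'Swish', 'Linear', 'Norm', 'Swish', 'Linear', 'Norm']
```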

class bio_attention.positional.Rotary(head_dim: int, n_dims: int | None = None, divide: float = 1.0, learned_div: bool = False)

Rotary embedding as in RoFormer / RoPE. Supports specifying positions and division of positional range.

Parameters:
  • head_dim (int) – Hidden dimensions per head

  • n_dims (Optional[int], optional) – Number of dimensions (per head) to apply rotations on. Can be used to control how strong the positional bias should be. By default None to use all dims.

  • divide (float, optional) – Divide positions by this factor. Useful for large (or small) numerical ranges. By default 1.0.

  • learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.

mod_qkv(q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, self_attn_mode: bool = True, **kwargs) -> Tuple[Tensor, Tensor, Tensor]

Modify Q, K and V

Parameters:
  • q (torch.Tensor) – (B, *, L1, NH, H)

  • k (torch.Tensor) – (B, *, L2, NH, H)

  • v (torch.Tensor) – (B, *, L2, NH, H)

  • pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None, in which case positions are computed from 0 to L-1.

  • pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. By default None, in which case positions are computed from 0 to L1-1.

  • pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. By default None, in which case positions are computed from 0 to L2-1.

  • self_attn_mode (bool, optional) – Whether to use the same positions for q and k (pos_q_k) or different positions for each (pos_q and pos_k). By default True.

Returns:

q, k, v [(B, *, L1, NH, H), (B, *, L2, NH, H), (B, *, L2, NH, H)]

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
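
As a rough illustration of the rotary technique, the following pure-Python sketch rotates a single head vector. The helper names rotate and dot, the pairing of adjacent dimensions, the base of 10000, and the way divide and n_dims are applied are assumptions made for illustration and may not match the library's implementation.

```python
import math


def rotate(vec, pos, n_dims=None, divide=1.0, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`.

    Hypothetical helper for illustration; not part of bio_attention.
    """
    n = len(vec) if n_dims is None else n_dims
    if n % 2 != 0 or n > len(vec):
        raise ValueError("n_dims must be even and at most len(vec)")
    out = list(vec)  # dimensions beyond n_dims are left untouched
    scaled = pos / divide  # assumed effect of the `divide` argument
    for i in range(0, n, 2):
        theta = scaled * base ** (-i / n)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def dot(a, b):
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The same offset (2) at different absolute positions gives the same score.
score_a = dot(rotate(q, 5), rotate(k, 3))
score_b = dot(rotate(q, 12), rotate(k, 10))

# Rotating only the first two dimensions leaves the rest unchanged.
partial = rotate(q, 7, n_dims=2)
```

The two scores agree up to floating-point error because the query-key dot product after rotation depends only on the difference between the two positions. Restricting the rotation to fewer dimensions leaves the remaining dimensions position-independent, which is one way to read the remark that n_dims controls the strength of the positional bias.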

class bio_attention.positional.ALiBi(n_heads: int, use_n_heads: bool | None = None, asymmetric: bool = False, divide: float = 1.0, learned_div: bool = False)

Attention with linear biases (ALiBi) as in Press et al. (2022). Supports specifying positions, division of positional range, using only a fraction of the heads, and asymmetric bias for bidirectional cases.

Parameters:
  • n_heads (int) – Number of heads

  • use_n_heads (bool, optional) – Number of heads to use. Can be used to control how strong the positional bias should be. By default None to use all heads.

  • asymmetric (bool, optional) – Whether to use asymmetric positional biases to differentiate between negative and positive relative positions. Implemented according to solution #3 proposed here: https://github.com/ofirpress/attention_with_linear_biases/issues/5. By default False.

  • divide (float, optional) – Divide positions by this factor. Useful for large (or small) numerical ranges. By default 1.0.

  • learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.

mod_mask(mask: Tensor | None, q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, **kwargs) -> Tensor

Modify mask

Parameters:
  • mask (Optional[torch.Tensor]) – (B, *, NH, L1, L2), can pass None for no pre-existing mask.

  • q (torch.Tensor) – (B, *, L1, NH, H)

  • k (torch.Tensor) – (B, *, L2, NH, H)

  • v (torch.Tensor) – (B, *, L2, NH, H)

  • pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None, in which case pos_q and pos_k are used instead.

  • pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. Ignored if pos_q_k is specified. By default None, in which case positions are computed from 0 to L1-1.

  • pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. Ignored if pos_q_k is specified. By default None, in which case positions are computed from 0 to L2-1.

Returns:

(B, *, NH, L1, L2)

Return type:

torch.Tensor
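
For orientation, here is a minimal pure-Python sketch of the symmetric linear bias from Press et al. for one sequence. It uses the paper's geometric slope sequence for power-of-two head counts. The helper names alibi_slopes and alibi_bias and the way divide is applied are assumptions for illustration. The asymmetric variant and the use_n_heads option are not covered here.

```python
def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for power-of-two n_heads.

    Hypothetical helper for illustration; not part of bio_attention.
    """
    if n_heads < 1 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


def alibi_bias(pos_q, pos_k, n_heads, divide=1.0):
    """Return a (NH, L1, L2) nested list of biases -slope * |pos_q - pos_k|."""
    return [
        [[-slope * abs(pq - pk) / divide for pk in pos_k] for pq in pos_q]
        for slope in alibi_slopes(n_heads)
    ]


# Default positions 0 .. L-1 for a self-attention block with L=3 and 8 heads.
bias = alibi_bias([0, 1, 2], [0, 1, 2], n_heads=8)

# Explicit, widely spaced positions scaled back with `divide`.
bias_scaled = alibi_bias([0, 100], [0, 100], n_heads=8, divide=100.0)
```

The resulting bias has the same (NH, L1, L2) layout as the mask returned by mod_mask for a single sequence, and in the original ALiBi formulation it is added to the attention scores before the softmax.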

class bio_attention.positional.DPB(dim: int, n_heads: int, depth: int = 1, norm: bool = False, divide: float = 1.0, learned_div: bool = False)

Dynamic positional bias. Computes a bias for every element of the attention matrix based on the relative position of the corresponding query and key, via a parameterized MLP. See https://arxiv.org/abs/2108.00154, https://arxiv.org/abs/2111.09883 and https://github.com/lucidrains/x-transformers#dynamic-positional-bias for details. Supports specifying positions and division of positional range.

Parameters:
  • dim (int) – Number of hidden dimensions of the MLP

  • n_heads (int) – Number of heads in the model

  • depth (int, optional) – Number of hidden layers in the positional embedding network. The network follows this structure: Linear -> (Norm if norm) { -> Swish -> Linear -> (Norm if norm) } * (depth-1). By default 1, for a linear embedding.

  • norm (bool, optional) – Whether to use LayerNorms in the embedding net, by default False

  • divide (float, optional) – Divide positions by this factor. Useful for large (or small) numerical ranges. By default 1.0.

  • learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.

mod_mask(mask: Tensor | None, q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, **kwargs) -> Tensor

Modify mask

Parameters:
  • mask (Optional[torch.Tensor]) – (B, *, NH, L1, L2), can pass None for no pre-existing mask.

  • q (torch.Tensor) – (B, *, L1, NH, H)

  • k (torch.Tensor) – (B, *, L2, NH, H)

  • v (torch.Tensor) – (B, *, L2, NH, H)

  • pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None, in which case pos_q and pos_k are used instead.

  • pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. Ignored if pos_q_k is specified. By default None, in which case positions are computed from 0 to L1-1.

  • pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. Ignored if pos_q_k is specified. By default None, in which case positions are computed from 0 to L2-1.

Returns:

(B, *, NH, L1, L2)

Return type:

torch.Tensor
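
The idea of computing a bias from relative positions through an MLP can be sketched in pure Python as follows. The class TinyDPB, its fixed architecture (one Swish-activated hidden layer mapping a scalar offset to one value per head), the random weights, and the way divide is applied are all assumptions chosen for illustration. They do not reproduce the library's network or its depth and norm options.

```python
import math
import random


def swish(v):
    """Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


class TinyDPB:
    """Toy MLP mapping a relative position to one bias value per head.

    Hypothetical stand-in for illustration; not part of bio_attention.
    """

    def __init__(self, dim, n_heads, seed=0):
        rng = random.Random(seed)
        # Hidden layer: scalar offset -> dim hidden units.
        self.w_in = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        self.b_in = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Output layer: dim hidden units -> n_heads bias values.
        self.w_out = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                      for _ in range(n_heads)]

    def head_biases(self, rel):
        """Bias per head for a single relative position `rel`."""
        hidden = [swish(w * rel + b) for w, b in zip(self.w_in, self.b_in)]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]

    def bias(self, pos_q, pos_k, divide=1.0):
        """Return a (NH, L1, L2) nested list of biases."""
        pairs = [[self.head_biases((pq - pk) / divide) for pk in pos_k]
                 for pq in pos_q]
        return [[[cell[h] for cell in row] for row in pairs]
                for h in range(len(self.w_out))]


dpb = TinyDPB(dim=4, n_heads=2)
bias = dpb.bias([0, 1, 2], [0, 1, 2])
```

Entries with the same offset between query and key, such as bias[0][0][1] and bias[0][1][2], are identical, while the MLP is free to assign different values to different offsets and different heads.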

class bio_attention.positional.XL(dim: int, n_heads: int, divide: float = 1.0, learned_div: bool = False)

Relative positional biases as in Transformer-XL (Dai et al., 2019). Supports specifying positions and division of positional range.

Parameters:
  • dim (int) – Number of hidden dimensions of x (total, not per head)

  • n_heads (int) – Number of heads in the model

  • divide (float, optional) – Divide positions by this factor. Useful for large (or small) numerical ranges. By default 1.0.

  • learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.

mod_qkv(q: Tensor, k: Tensor, v: Tensor, **kwargs) -> Tuple[Tensor, Tensor, Tensor]

Modify Q, K and V

Parameters:
  • q (torch.Tensor) – (B, *, L1, NH, H)

  • k (torch.Tensor) – (B, *, L2, NH, H)

  • v (torch.Tensor) – (B, *, L2, NH, H)

Returns:

q, k, v [(B, *, L1, NH, H), (B, *, L2, NH, H), (B, *, L2, NH, H)]

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

mod_mask(mask: Tensor | None, q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, **kwargs) -> Tensor

Modify mask

Parameters:
  • mask (Optional[torch.Tensor]) – (B, *, NH, L1, L2), can pass None for no pre-existing mask.

  • q (torch.Tensor) – (B, *, L1, NH, H)

  • k (torch.Tensor) – (B, *, L2, NH, H)

  • v (torch.Tensor) – (B, *, L2, NH, H)

  • pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None, in which case pos_q and pos_k are used instead.

  • pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. Ignored if pos_q_k is specified. By default None, in which case positions are computed from 0 to L1-1.

  • pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. Ignored if pos_q_k is specified. By default None, in which case positions are computed from 0 to L2-1.

Returns:

(B, *, NH, L1, L2)

Return type:

torch.Tensor
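
The precedence between pos_q_k, pos_q and pos_k described for the mod_mask parameters can be summarized with a small pure-Python sketch for one sequence. The helpers resolve_positions and relative_positions are hypothetical, and the sign convention of the offsets (key position minus query position) is an assumption. The sketch only restates the documented defaults and does not reproduce the Transformer-XL bias itself.

```python
def resolve_positions(len_q, len_k, pos_q_k=None, pos_q=None, pos_k=None):
    """Pick query and key positions following the documented precedence.

    Hypothetical helper for illustration; not part of bio_attention.
    """
    if pos_q_k is not None:
        # Self-attention mode: shared positions, pos_q and pos_k are ignored.
        if len_q != len_k:
            raise ValueError("pos_q_k requires L1 == L2")
        return list(pos_q_k), list(pos_q_k)
    # Cross-attention mode: fall back to 0 .. L-1 where nothing is given.
    q = list(range(len_q)) if pos_q is None else list(pos_q)
    k = list(range(len_k)) if pos_k is None else list(pos_k)
    return q, k


def relative_positions(pos_q, pos_k):
    """(L1, L2) matrix of offsets between each query and key position."""
    return [[pk - pq for pk in pos_k] for pq in pos_q]


# No positions given: defaults 0 .. L1-1 and 0 .. L2-1.
q_pos, k_pos = resolve_positions(2, 3)
offsets = relative_positions(q_pos, k_pos)

# pos_q_k takes precedence over pos_q.
shared_q, shared_k = resolve_positions(2, 2, pos_q_k=[4, 9], pos_q=[0, 1])
```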