bio_attention.positional Reference
Modules for adding positional encodings or biases to attention modules.
- class bio_attention.positional.Sinusoidal(dim: int, divide: float = 1.0, learned_div: bool = False)
Sinusoidal positional embedding as in Vaswani et al. (2017). Supports specifying positions, masking, and division of the positional range.
- Parameters:
dim (int) – Hidden size of the embeddings
divide (float, optional) – divide positions by this factor, useful for large (or small) numerical ranges, by default 1.0
learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.
- mod_x(x: Tensor, pos: Tensor | None = None, mask: Tensor | None = None, **kwargs) → Tensor
Modify X
- Parameters:
x (torch.Tensor) – (B,*,L,H)
pos (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None, in which case positions are computed from 0 to L-1. If the sequence length of pos is smaller than that of x, the missing tokens are padded on the left-hand side and receive no positional encoding.
mask (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None. A boolean mask that explicitly indicates which positions should not have positional encodings added. If the sequence length of mask is smaller than that of x, the missing tokens are padded on the left-hand side so that they do have a positional encoding added.
- Returns:
(B,*,L,H)
- Return type:
torch.Tensor
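The behaviour described above can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: the helper names `sinusoidal_encoding` and `add_positions` are hypothetical, the interleaved sin/cos layout and the base of 10000 follow Vaswani et al. and are assumptions about this module, and nested lists stand in for tensors. It shows how `divide` rescales large positions and how a `pos` shorter than `x` leaves the leading tokens unchanged.

```python
import math


def sinusoidal_encoding(position, dim, divide=1.0):
    """Encode one (possibly fractional) position as a vector of length dim.

    Assumption: sin/cos pairs are interleaved; the library may order them differently.
    """
    p = position / divide
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(p * freq))
        enc.append(math.cos(p * freq))
    return enc[:dim]


def add_positions(x, pos=None, divide=1.0):
    """x: list of L token vectors of size H; pos: list of at most L positions."""
    seq_len, hidden = len(x), len(x[0])
    if pos is None:
        pos = list(range(seq_len))
    offset = seq_len - len(pos)  # leading tokens without a position stay unchanged
    out = [list(tok) for tok in x]
    for j, p in enumerate(pos):
        enc = sinusoidal_encoding(p, hidden, divide)
        out[offset + j] = [a + b for a, b in zip(x[offset + j], enc)]
    return out


# Three tokens, but positions for only the last two (e.g. a prepended special token).
x = [[0.0] * 4 for _ in range(3)]
y = add_positions(x, pos=[100, 200], divide=100.0)
```

Here `y[0]` is left as zeros, while `y[1]` and `y[2]` carry the encodings of the rescaled positions 1.0 and 2.0.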
- class bio_attention.positional.LearnedVocab(dim: int, max_seq_len: int)
Learned vocab as in Devlin et al. (2018). Supports specifying positions. Only works for discrete positional indices.
- Parameters:
dim (int) – Hidden size of the embeddings
max_seq_len (int) – Maximum sequence length or vocab size of the learned embeddings
- mod_x(x: Tensor, pos: Tensor | None = None, mask: Tensor | None = None, **kwargs) → Tensor
Modify X
- Parameters:
x (torch.Tensor) – (B,*,L,H)
pos (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None, in which case positions are computed from 0 to L-1. If the sequence length of pos is smaller than that of x, the missing tokens are padded on the left-hand side and receive no positional encoding.
mask (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None. A boolean mask that explicitly indicates which positions should not have positional encodings added. If the sequence length of mask is smaller than that of x, the missing tokens are padded on the left-hand side so that they do have a positional encoding added.
- Returns:
(B,*,L,H)
- Return type:
torch.Tensor
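Why a learned vocabulary only accepts discrete indices can be seen in a small plain-Python sketch of an embedding table. The class `PositionTable` and its initialization are hypothetical stand-ins, not the library's code: each integer position selects one learned row, so fractional positions or indices at or beyond `max_seq_len` have no row to select.

```python
import random


class PositionTable:
    """Toy lookup table: one vector of size dim per integer position."""

    def __init__(self, dim, max_seq_len, seed=0):
        rng = random.Random(seed)
        # Random values stand in for learned parameters.
        self.rows = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(max_seq_len)
        ]

    def lookup(self, pos):
        for p in pos:
            if not isinstance(p, int):
                raise TypeError("positions must be discrete integer indices")
            if not 0 <= p < len(self.rows):
                raise IndexError("position outside the table (max_seq_len)")
        return [self.rows[p] for p in pos]


table = PositionTable(dim=4, max_seq_len=10)
vectors = table.lookup([0, 3, 3])  # equal positions share the same vector
```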
- class bio_attention.positional.LearnedContinuous(dim: int, depth: int = 1, norm: bool = False, divide: float = 1.0, learned_div: bool = False)
Learned embeddings with a continuity between absolute positional indices, as learned by a series of linear layers. Supports specifying positions, and division of positional range.
- Parameters:
dim (int) – Hidden size of the embeddings
depth (int, optional) – Number of hidden layers in the positional embedding network. The network follows this structure: Linear -> (Norm if norm) { -> Swish -> Linear -> (Norm if norm) } * (depth-1). By default 1, for a linear embedding.
norm (bool, optional) – Whether to use LayerNorms in the embedding net, by default False
divide (float, optional) – divide positions by this factor, useful for large (or small) numerical ranges, by default 1.0
learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.
- mod_x(x: Tensor, pos: Tensor | None = None, mask: Tensor | None = None, **kwargs) → Tensor
Modify X
- Parameters:
x (torch.Tensor) – (B,*,L,H)
pos (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None, in which case positions are computed from 0 to L-1. If the sequence length of pos is smaller than that of x, the missing tokens are padded on the left-hand side and receive no positional encoding.
mask (Optional[torch.Tensor], optional) – (B,*,L) or (B,*,L-x). By default None. A boolean mask that explicitly indicates which positions should not have positional encodings added. If the sequence length of mask is smaller than that of x, the missing tokens are padded on the left-hand side so that they do have a positional encoding added.
- Returns:
(B,*,L,H)
- Return type:
torch.Tensor
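The layer structure given for the depth parameter can be expanded mechanically. The helper below is a hypothetical illustration that only lists layer names in order, following the documented pattern; it does not build the actual network and makes no claim about layer sizes.

```python
def layer_plan(depth=1, norm=False):
    """List the layers implied by the documented structure, in order."""
    layers = ["Linear"]
    if norm:
        layers.append("Norm")
    # Each additional level of depth appends Swish -> Linear (-> Norm).
    for _ in range(depth - 1):
        layers += ["Swish", "Linear"]
        if norm:
            layers.append("Norm")
    return layers


print(layer_plan(depth=1))
# ['Linear']
print(layer_plan(depth=3, norm=True))
# ['Linear', 'Norm', 'Swish', 'Linear', 'Norm', 'Swish', 'Linear', 'Norm']
```

With the default depth of 1 the plan is a single linear layer, which matches the "linear embedding" wording in the parameter description.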
- class bio_attention.positional.Rotary(head_dim: int, n_dims: int | None = None, divide: float = 1.0, learned_div: bool = False)
Rotary embedding as in RoFormer / RoPE. Supports specifying positions and division of the positional range.
- Parameters:
head_dim (int) – Hidden dimensions per head
n_dims (Optional[int], optional) – Number of dimensions (per head) to apply rotations on. Can be used to control how strong the positional bias should be. By default None to use all dims.
divide (float, optional) – divide positions by this factor, useful for large (or small) numerical ranges, by default 1.0
learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.
- mod_qkv(q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, self_attn_mode: bool = True, **kwargs) → Tuple[Tensor, Tensor, Tensor]
Modify Q, K and V
- Parameters:
q (torch.Tensor) – (B, *, L1, NH, H)
k (torch.Tensor) – (B, *, L2, NH, H)
v (torch.Tensor) – (B, *, L2, NH, H)
pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None for computing positions from 0 to L-1.
pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. By default None for computing positions from 0 to L1-1.
pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. By default None for computing positions from 0 to L2-1.
self_attn_mode (bool, optional) – Whether to use the same positions for q and k (pos_q_k) or different positions for each (pos_q and pos_k). By default True.
- Returns:
q, k, v [(B, *, L1, NH, H), (B, *, L2, NH, H), (B, *, L2, NH, H)]
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
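As a rough illustration of what rotating q and k achieves, the following plain-Python sketch applies a RoPE-style rotation to single head vectors. The function `rotate`, the pairing of adjacent dimensions, and the base of 10000 are assumptions taken from the RoFormer formulation, not from this library's code. The point it demonstrates is that the q·k score depends only on the difference between positions, and that `n_dims` limits the rotation to the leading dimensions.

```python
import math


def rotate(vec, position, n_dims=None, divide=1.0, base=10000.0):
    """Rotate adjacent pairs among the first n_dims entries by position-dependent angles.

    Assumes n_dims is even; entries beyond n_dims are left untouched.
    """
    n_dims = len(vec) if n_dims is None else n_dims
    p = position / divide
    out = list(vec)
    for i in range(0, n_dims, 2):
        theta = p * base ** (-i / n_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at different absolute positions gives the same score.
score_near = dot(rotate(q, 5), rotate(k, 2))
score_far = dot(rotate(q, 105), rotate(k, 102))

# With n_dims=2 only the first pair carries positional information.
partial = rotate(q, 7, n_dims=2)
```

`score_near` and `score_far` agree up to floating-point error, and `partial[2:]` equals `q[2:]`, which is one way to read "control how strong the positional bias should be".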
- class bio_attention.positional.ALiBi(n_heads: int, use_n_heads: bool | None = None, asymmetric: bool = False, divide: float = 1.0, learned_div: bool = False)
Attention with linear biases as in Press et al. (2022). Supports specifying positions, division of the positional range, using only a fraction of the heads, and asymmetric bias for bidirectional cases.
- Parameters:
n_heads (int) – Number of heads
use_n_heads (bool, optional) – Number of heads to use. Can be used to control how strong the positional bias should be, by default None to use all heads
asymmetric (bool, optional) – Whether to use asymmetric positional biases to differentiate between negative and positive relative positions. Implemented according to solution #3 proposed at https://github.com/ofirpress/attention_with_linear_biases/issues/5. By default False.
divide (float, optional) – divide positions by this factor, useful for large (or small) numerical ranges, by default 1.0
learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.
- mod_mask(mask: Tensor | None, q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, **kwargs) → Tensor
Modify mask
- Parameters:
mask (Optional[torch.Tensor]) – (B, *, NH, L1, L2), can pass None for no pre-existing mask.
q (torch.Tensor) – (B, *, L1, NH, H)
k (torch.Tensor) – (B, *, L2, NH, H)
v (torch.Tensor) – (B, *, L2, NH, H)
pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None for using pos_q and pos_k instead.
pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. Ignored if pos_q_k is specified. By default None for computing positions from 0 to L1-1.
pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. Ignored if pos_q_k is specified. By default None for computing positions from 0 to L2-1.
- Returns:
(B, *, NH, L1, L2)
- Return type:
torch.Tensor
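The (NH, L1, L2) bias that ALiBi contributes can be sketched in plain Python for the symmetric case. The helpers `alibi_slopes` and `alibi_bias` are hypothetical; the slope formula is the geometric sequence from Press et al. for a power-of-two number of heads, and how this library handles other head counts, `use_n_heads`, or the asymmetric variant is not shown and not assumed.

```python
def alibi_slopes(n_heads):
    """Geometric per-head slopes, valid as written for power-of-two n_heads."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(pos_q, pos_k, n_heads, divide=1.0):
    """Return a nested list of shape (n_heads, L1, L2).

    Each entry is -slope * |pos_q[i] - pos_k[j]| / divide, so distant pairs
    are penalized linearly, more steeply for heads with larger slopes.
    """
    return [
        [[-slope * abs(pq - pk) / divide for pk in pos_k] for pq in pos_q]
        for slope in alibi_slopes(n_heads)
    ]


bias = alibi_bias([0, 1, 2], [0, 1, 2], n_heads=8)
print(bias[0][0][2])  # -1.0 (slope 0.5, distance 2)
print(bias[7][2][0])  # -0.0078125 (slope 2**-8, distance 2)
```

In an attention layer this bias would be added to the pre-softmax scores, which is why the method takes and returns a mask-shaped tensor.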
- class bio_attention.positional.DPB(dim: int, n_heads: int, depth: int = 1, norm: bool = False, divide: float = 1.0, learned_div: bool = False)
Dynamic positional bias. Computes a bias for every element of the attention matrix based on relative position, via a parameterized MLP. Described in https://arxiv.org/abs/2108.00154, https://arxiv.org/abs/2111.09883, and https://github.com/lucidrains/x-transformers#dynamic-positional-bias. Supports specifying positions and division of the positional range.
- Parameters:
dim (int) – number of hidden dimensions of the MLP
n_heads (int) – Number of heads in the model
depth (int, optional) – Number of hidden layers in the positional embedding network. The network follows this structure: Linear -> (Norm if norm) { -> Swish -> Linear -> (Norm if norm) } * (depth-1). By default 1, for a linear embedding.
norm (bool, optional) – Whether to use LayerNorms in the embedding net, by default False
divide (float, optional) – divide positions by this factor, useful for large (or small) numerical ranges, by default 1.0
learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.
- mod_mask(mask: Tensor | None, q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, **kwargs) → Tensor
Modify mask
- Parameters:
mask (Optional[torch.Tensor]) – (B, *, NH, L1, L2), can pass None for no pre-existing mask.
q (torch.Tensor) – (B, *, L1, NH, H)
k (torch.Tensor) – (B, *, L2, NH, H)
v (torch.Tensor) – (B, *, L2, NH, H)
pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None for using pos_q and pos_k instead.
pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. Ignored if pos_q_k is specified. By default None for computing positions from 0 to L1-1.
pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. Ignored if pos_q_k is specified. By default None for computing positions from 0 to L2-1.
- Returns:
(B, *, NH, L1, L2)
- Return type:
torch.Tensor
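The idea of a dynamic positional bias, an MLP that maps each relative position to one bias value per head, can be sketched in plain Python. Everything here is illustrative: `make_mlp` and `dynamic_bias` are hypothetical names, the random weights stand in for learned parameters, and the layout (one hidden layer of size dim, Swish, then a projection to n_heads) is an assumption rather than the library's exact network.

```python
import math
import random


def swish(z):
    return z / (1.0 + math.exp(-z))


def make_mlp(dim, n_heads, seed=0):
    """Build a toy MLP: scalar relative position -> n_heads bias values."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(dim)]
    b1 = [rng.uniform(-1, 1) for _ in range(dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_heads)]

    def mlp(rel):
        hidden = [swish(w * rel + b) for w, b in zip(w1, b1)]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return mlp


def dynamic_bias(pos_q, pos_k, mlp, divide=1.0):
    """Return a nested list of shape (n_heads, L1, L2)."""
    per_pair = [[mlp((pq - pk) / divide) for pk in pos_k] for pq in pos_q]
    n_heads = len(per_pair[0][0])
    return [
        [[per_pair[i][j][h] for j in range(len(pos_k))] for i in range(len(pos_q))]
        for h in range(n_heads)
    ]


positions = [0, 1, 2, 3]
bias = dynamic_bias(positions, positions, make_mlp(dim=8, n_heads=2))
```

Because the bias is a function of the relative position only, entries on the same diagonal coincide, for example `bias[0][1][0]` and `bias[0][3][2]`. Unlike the symmetric ALiBi case, the signed offset is fed to the network, so offsets of +1 and -1 may receive different biases.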
- class bio_attention.positional.XL(dim: int, n_heads: int, divide: float = 1.0, learned_div: bool = False)
Relative positional biases as in Transformer-XL (Dai et al., 2019). Supports specifying positions and division of the positional range.
- Parameters:
dim (int) – Number of hidden dimensions of x (total, not per head)
n_heads (int) – Number of heads in the model
divide (float, optional) – divide positions by this factor, useful for large (or small) numerical ranges, by default 1.0
learned_div (bool, optional) – Whether to learn the division factor. If True, the factor is initialized to the value of the divide argument. By default False.
- mod_qkv(q: Tensor, k: Tensor, v: Tensor, **kwargs) → Tuple[Tensor, Tensor, Tensor]
Modify Q, K and V
- mod_mask(mask: Tensor | None, q: Tensor, k: Tensor, v: Tensor, pos_q_k: Tensor | None = None, pos_q: Tensor | None = None, pos_k: Tensor | None = None, **kwargs) → Tensor
Modify mask
- Parameters:
mask (Optional[torch.Tensor]) – (B, *, NH, L1, L2), can pass None for no pre-existing mask.
q (torch.Tensor) – (B, *, L1, NH, H)
k (torch.Tensor) – (B, *, L2, NH, H)
v (torch.Tensor) – (B, *, L2, NH, H)
pos_q_k (Optional[torch.Tensor], optional) – (B, *, L). Positions of q and k in self-attention mode. Requires L1 = L2. By default None for using pos_q and pos_k instead.
pos_q (Optional[torch.Tensor], optional) – (B, *, L1). Positions of q in cross-attention mode. Ignored if pos_q_k is specified. By default None for computing positions from 0 to L1-1.
pos_k (Optional[torch.Tensor], optional) – (B, *, L2). Positions of k in cross-attention mode. Ignored if pos_q_k is specified. By default None for computing positions from 0 to L2-1.
- Returns:
(B, *, NH, L1, L2)
- Return type:
torch.Tensor