import math

import torch
import torch.nn.functional as F

try:
    import flash_attn
    from flash_attn.flash_attn_interface import (
        _flash_attn_forward,
        flash_attn_func,
        flash_attn_varlen_func,
    )
except ImportError:
    flash_attn = None
    _flash_attn_forward = None
    flash_attn_func = None
    flash_attn_varlen_func = None

# Each mode maps to a (pre_attn_layout, post_attn_layout) pair that converts
# between the flash layout [batch, seq_len, num_heads, head_dim] and the
# torch/vanilla layout [batch, num_heads, seq_len, head_dim].
MEMORY_LAYOUT = {
    "flash": (
        lambda x: x,
        lambda x: x,
    ),
    "torch": (
        lambda x: x.transpose(1, 2),
        lambda x: x.transpose(1, 2),
    ),
    "vanilla": (
        lambda x: x.transpose(1, 2),
        lambda x: x.transpose(1, 2),
    ),
}


def attention(q, k, v, mode="flash", drop_rate=0, attn_mask=None, causal=False):
    """Perform QKV attention.

    Args:
        q (torch.Tensor): Query tensor of shape [batch_size, seq_len, num_heads, head_dim].
        k (torch.Tensor): Key tensor of shape [batch_size, seq_len_kv, num_heads, head_dim].
        v (torch.Tensor): Value tensor of shape [batch_size, seq_len_kv, num_heads, head_dim].
        mode (str): Attention mode, one of 'flash', 'torch', or 'vanilla'.
        drop_rate (float): Dropout probability applied to the attention matrix.
        attn_mask (torch.Tensor): Attention mask; its expected shape depends on the mode.
        causal (bool): Whether to use causal attention (each position attends only to earlier positions).

    Returns:
        torch.Tensor: Attention output of shape [batch_size, seq_len, num_heads * head_dim].
    """
    pre_attn_layout, post_attn_layout = MEMORY_LAYOUT[mode]
    q = pre_attn_layout(q)
    k = pre_attn_layout(k)
    v = pre_attn_layout(v)

    if mode == "torch":
        # Non-boolean masks are added to the attention scores, so they must match q's dtype.
        if attn_mask is not None and attn_mask.dtype != torch.bool:
            attn_mask = attn_mask.to(q.dtype)
        x = F.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask, dropout_p=drop_rate, is_causal=causal
        )
    elif mode == "flash":
        assert flash_attn_func is not None, "flash_attn_func is not available (flash_attn is not installed)"
        assert attn_mask is None, "attention masks are not supported in flash mode"
        x = flash_attn_func(
            q, k, v, dropout_p=drop_rate, causal=causal, softmax_scale=None
        )
    elif mode == "vanilla":
        # Explicit softmax(QK^T / sqrt(d) + bias) V computation.
        scale_factor = 1 / math.sqrt(q.size(-1))

        b, a, s, _ = q.shape
        s1 = k.size(2)
        attn_bias = torch.zeros(b, a, s, s1, dtype=q.dtype, device=q.device)
        if causal:
            assert attn_mask is None, "causal mask and attn_mask cannot be used at the same time"
            temp_mask = torch.ones(b, a, s, s, dtype=torch.bool, device=q.device).tril(
                diagonal=0
            )
            attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
            attn_bias = attn_bias.to(q.dtype)

        if attn_mask is not None:
            if attn_mask.dtype == torch.bool:
                attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
            else:
                attn_bias += attn_mask

        attn = (q @ k.transpose(-2, -1)) * scale_factor
        attn += attn_bias
        attn = attn.softmax(dim=-1)
        attn = torch.dropout(attn, p=drop_rate, train=True)
        x = attn @ v
    else:
        raise NotImplementedError(f"Unsupported attention mode: {mode}")

    x = post_attn_layout(x)
    b, s, a, d = x.shape
    out = x.reshape(b, s, -1)
    return out
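

# --- Usage sketch (illustrative addition, not part of the original module) ---
# A minimal smoke test assuming CPU execution: it runs the 'torch' and 'vanilla'
# backends on random tensors and checks that they agree. The shapes and the
# tolerance below are arbitrary demonstration values.
if __name__ == "__main__":
    torch.manual_seed(0)
    b, s, h, d = 2, 16, 4, 32  # batch, sequence length, num_heads, head_dim
    q = torch.randn(b, s, h, d)
    k = torch.randn(b, s, h, d)
    v = torch.randn(b, s, h, d)

    out_torch = attention(q, k, v, mode="torch", causal=True)
    out_vanilla = attention(q, k, v, mode="vanilla", causal=True)

    print(out_torch.shape)  # expected: torch.Size([2, 16, 128])
    # The two backends should match up to floating-point precision.
    print(torch.allclose(out_torch, out_vanilla, atol=1e-5))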