from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import sys

import librosa
import numpy as np
import torch
import torch.nn as nn

from utils.misc import power_compress, power_uncompress, stft, istft, compute_fbank
from utils.bandwidth_sub import bandwidth_sub
from dataloader.meldataset import mel_spectrogram

# Constant for scaling audio to/from the int16 sample range
MAX_WAV_VALUE = 32768.0


def decode_one_audio(model, device, inputs, args):
    """Decodes audio using the specified model based on the provided network type.

    This function selects the appropriate decoding function based on the specified
    network in the arguments and processes the input audio data accordingly.

    Args:
        model (nn.Module): The trained model used for decoding.
        device (torch.device): The device (CPU or GPU) to perform computations on.
        inputs (torch.Tensor): Input audio tensor.
        args (Namespace): Contains arguments for network configuration.

    Returns:
        list: A list of decoded audio outputs for each speaker.
    """
    # Select the decoding function based on the specified network
    if args.network == 'FRCRN_SE_16K':
        return decode_one_audio_frcrn_se_16k(model, device, inputs, args)
    elif args.network == 'MossFormer2_SE_48K':
        return decode_one_audio_mossformer2_se_48k(model, device, inputs, args)
    elif args.network == 'MossFormerGAN_SE_16K':
        return decode_one_audio_mossformergan_se_16k(model, device, inputs, args)
    elif args.network == 'MossFormer2_SS_16K':
        return decode_one_audio_mossformer2_ss_16k(model, device, inputs, args)
    elif args.network == 'MossFormer2_SR_48K':
        return decode_one_audio_mossformer2_sr_48k(model, device, inputs, args)
    else:
        print('No network found!')
        return


def decode_one_audio_mossformer2_ss_16k(model, device, inputs, args):
    """Decodes audio using the MossFormer2 model for speech separation at 16kHz.

    This function handles the audio decoding process by processing the input tensor
    in segments, if necessary, and applies the model to obtain separated audio outputs.
    Args:
        model (nn.Module): The trained MossFormer2 model for decoding.
        device (torch.device): The device (CPU or GPU) to perform computations on.
        inputs (torch.Tensor): Input audio tensor of shape (B, T), where B is the
            batch size and T is the number of time steps.
        args (Namespace): Contains arguments for decoding configuration.

    Returns:
        list: A list of decoded audio outputs for each speaker.
    """
    out = []  # Collected outputs, one entry per speaker
    decode_do_segment = False  # Whether to decode segment by segment
    window = args.sampling_rate * args.decode_window  # Window length in samples
    stride = int(window * 0.75)  # Hop between windows (75% of the window length)
    b, t = inputs.shape  # Batch size and number of time steps
    rms_input = (inputs ** 2).mean() ** 0.5  # RMS of the input, used to rescale outputs

    # Long inputs are decoded segment by segment
    if t > args.sampling_rate * args.one_time_decode_length:
        decode_do_segment = True

    # Pad the input so it covers an integer number of windows
    if t < window:
        inputs = np.concatenate([inputs, np.zeros((inputs.shape[0], window - t))], 1)
    elif t < window + stride:
        padding = window + stride - t
        inputs = np.concatenate([inputs, np.zeros((inputs.shape[0], padding))], 1)
    elif (t - window) % stride != 0:
        padding = stride - (t - window) % stride
        inputs = np.concatenate([inputs, np.zeros((inputs.shape[0], padding))], 1)

    inputs = torch.from_numpy(np.float32(inputs)).to(device)
    b, t = inputs.shape  # Updated length after padding

    if decode_do_segment:
        outputs = np.zeros((args.num_spks, t))  # One output track per speaker
        give_up_length = (window - stride) // 2  # Unreliable samples at segment edges
        current_idx = 0
        while current_idx + window <= t:
            tmp_input = inputs[:, current_idx:current_idx + window]
            tmp_out_list = model(tmp_input)
            for spk in range(args.num_spks):
                tmp_out_list[spk] = tmp_out_list[spk][0, :].detach().cpu().numpy()
                if current_idx == 0:
                    # First segment: keep everything except the trailing edge
                    outputs[spk, :window - give_up_length] = tmp_out_list[spk][:-give_up_length]
                else:
                    # Later segments: keep only the reliable center of the window
                    outputs[spk, current_idx + give_up_length:current_idx + window - give_up_length] = \
                        tmp_out_list[spk][give_up_length:-give_up_length]
            current_idx += stride
        for spk in range(args.num_spks):
            out.append(outputs[spk, :])
    else:
        out_list = model(inputs)
        for spk in range(args.num_spks):
            out.append(out_list[spk][0, :].detach().cpu().numpy())

    # Rescale each separated output so its RMS matches the input RMS
    for spk in range(args.num_spks):
        rms_out = (out[spk] ** 2).mean() ** 0.5
        out[spk] = out[spk] / rms_out * rms_input
    return out


def decode_one_audio_frcrn_se_16k(model, device, inputs, args):
    """Decodes audio using the FRCRN model for speech enhancement at 16kHz.

    This function processes the input audio tensor either in segments or as a whole,
    depending on the length of the input. The model's inference method is applied
    to obtain the enhanced audio output.

    Args:
        model (nn.Module): The trained FRCRN model used for decoding.
        device (torch.device): The device (CPU or GPU) to perform computations on.
        inputs (torch.Tensor): Input audio tensor of shape (B, T), where B is the
            batch size and T is the number of time steps.
        args (Namespace): Contains arguments for decoding configuration.

    Returns:
        numpy.ndarray: The decoded audio output, which has been enhanced by the model.
    """
    decode_do_segment = False
    window = args.sampling_rate * args.decode_window
    stride = int(window * 0.75)
    b, t = inputs.shape

    if t > args.sampling_rate * args.one_time_decode_length:
        decode_do_segment = True

    # Pad the input so it covers an integer number of windows
    if t < window:
        inputs = np.concatenate([inputs, np.zeros((inputs.shape[0], window - t))], 1)
    elif t < window + stride:
        padding = window + stride - t
        inputs = np.concatenate([inputs, np.zeros((inputs.shape[0], padding))], 1)
    elif (t - window) % stride != 0:
        padding = stride - (t - window) % stride
        inputs = np.concatenate([inputs, np.zeros((inputs.shape[0], padding))], 1)

    inputs = torch.from_numpy(np.float32(inputs)).to(device)
    b, t = inputs.shape

    if decode_do_segment:
        outputs = np.zeros(t)
        give_up_length = (window - stride) // 2  # Unreliable samples at segment edges
        current_idx = 0
        while current_idx + window <= t:
            tmp_input = inputs[:, current_idx:current_idx + window]
            tmp_output = model.inference(tmp_input).detach().cpu().numpy()
            if current_idx == 0:
                outputs[:window - give_up_length] = tmp_output[:-give_up_length]
            else:
                outputs[current_idx + give_up_length:current_idx + window - give_up_length] = \
                    tmp_output[give_up_length:-give_up_length]
            current_idx += stride
        return outputs
    else:
        return model.inference(inputs).detach().cpu().numpy()


def decode_one_audio_mossformergan_se_16k(model, device, inputs, args):
    """Decodes audio using the MossFormerGAN model for speech enhancement at 16kHz.

    This function processes the input audio tensor either in segments or as a whole,
    depending on the length of the input. The `_decode_one_audio_mossformergan_se_16k`
    function is called to perform the model inference and return the enhanced audio output.

    Args:
        model (nn.Module): The trained MossFormerGAN model used for decoding.
        device (torch.device): The device (CPU or GPU) for computation.
        inputs (torch.Tensor): Input audio tensor of shape (B, T), where B is the
            batch size and T is the number of time steps.
        args (Namespace): Contains arguments for decoding configuration.

    Returns:
        numpy.ndarray: The decoded audio output, which has been enhanced by the model.
    """
    decode_do_segment = False
    window = args.sampling_rate * args.decode_window
    stride = int(window * 0.75)
    b, t = inputs.shape

    if t > args.sampling_rate * args.one_time_decode_length:
        decode_do_segment = True

    inputs = torch.from_numpy(np.float32(inputs)).to(device)
    # Normalization factor that regularizes the input amplitude
    norm_factor = torch.sqrt(inputs.size(-1) / torch.sum(inputs ** 2.0, dim=-1))
    b, t = inputs.shape

    if decode_do_segment:
        outputs = np.zeros(t)
        give_up_length = (window - stride) // 2
        current_idx = 0
        while current_idx + window <= t:
            tmp_input = inputs[:, current_idx:current_idx + window]
            tmp_output = _decode_one_audio_mossformergan_se_16k(model, device, tmp_input, norm_factor, args)
            if current_idx == 0:
                outputs[:window - give_up_length] = tmp_output[:-give_up_length]
            else:
                outputs[current_idx + give_up_length:current_idx + window - give_up_length] = \
                    tmp_output[give_up_length:-give_up_length]
            current_idx += stride
        return outputs
    else:
        return _decode_one_audio_mossformergan_se_16k(model, device, inputs, norm_factor, args)


@torch.no_grad()
def _decode_one_audio_mossformergan_se_16k(model, device, inputs, norm_factor, args):
    """Processes audio inputs through the MossFormerGAN model for speech enhancement.

    This function performs the following steps:
    1. Pads the input audio tensor to fit the model requirements.
    2. Computes a normalization factor for the input tensor.
    3. Applies Short-Time Fourier Transform (STFT) to convert the audio into the frequency domain.
    4. Processes the STFT representation through the model to predict the real and imaginary components.
    5. Uncompresses the predicted spectrogram and applies Inverse STFT (iSTFT) to convert back to time-domain audio.
    6.
       Normalizes the output audio.

    Args:
        model (nn.Module): The trained MossFormerGAN model used for decoding.
        device (torch.device): The device (CPU or GPU) for computation.
        inputs (torch.Tensor): Input audio tensor of shape (B, T), where B is the
            batch size and T is the number of time steps.
        norm_factor (torch.Tensor): A norm tensor to regularize the input amplitude.
        args (Namespace): Contains arguments for STFT parameters and normalization.

    Returns:
        numpy.ndarray: The decoded audio output, which has been enhanced by the model.
    """
    input_len = inputs.size(-1)
    nframe = int(np.ceil(input_len / args.win_inc))
    padded_len = int(nframe * args.win_inc)
    padding_len = padded_len - input_len
    # Pad by repeating the head of the signal so it fills an integer number of frames
    inputs = torch.cat([inputs, inputs[:, :padding_len]], dim=-1)

    # Apply the amplitude normalization before the STFT
    inputs = torch.transpose(inputs, 0, 1)
    inputs = torch.transpose(inputs * norm_factor, 0, 1)

    inputs_spec = stft(inputs, args, center=True, periodic=True, onesided=True)
    inputs_spec = inputs_spec.to(torch.float32)
    inputs_spec = power_compress(inputs_spec).permute(0, 1, 3, 2)

    out_list = model(inputs_spec)
    pred_real, pred_imag = out_list[0].permute(0, 1, 3, 2), out_list[1].permute(0, 1, 3, 2)
    pred_spec_uncompress = power_uncompress(pred_real, pred_imag).squeeze(1)
    outputs = istft(pred_spec_uncompress, args, padded_len)

    # Undo the amplitude normalization
    outputs = outputs / norm_factor
    return outputs[0][:input_len].detach().cpu().numpy()


def decode_one_audio_mossformer2_se_48k(model, device, inputs, args):
    """Decodes audio using the MossFormer2 model for speech enhancement at 48kHz.

    The input is scaled to the int16 range, converted to fbank features, and passed
    through the model to predict a spectral mask; the mask is applied to the input
    spectrum and converted back to a waveform with the inverse STFT. Long inputs are
    processed segment by segment with overlapped windows.
    """
    inputs = inputs[0, :]  # Process the first utterance in the batch
    input_len = inputs.shape[0]
    inputs = inputs * MAX_WAV_VALUE  # Scale to the int16 value range

    if input_len > args.sampling_rate * args.one_time_decode_length:
        # Segment-wise (online) decoding for long inputs
        window = int(args.sampling_rate * args.decode_window)
        stride = int(window * 0.75)
        t = inputs.shape[0]

        # Pad the input so it covers an integer number of windows
        if t < window:
            inputs = np.concatenate([inputs, np.zeros(window - t)], 0)
        elif t < window + stride:
            inputs = np.concatenate([inputs, np.zeros(window + stride - t)], 0)
        elif (t - window) % stride != 0:
            inputs = np.concatenate([inputs, np.zeros(stride - (t - window) % stride)], 0)

        audio = torch.from_numpy(inputs).type(torch.FloatTensor)
        t = audio.shape[0]
        outputs = torch.from_numpy(np.zeros(t))
        give_up_length = (window - stride) // 2
        current_idx = 0
        while current_idx + window <= t:
            audio_segment = audio[current_idx:current_idx + window]
            # Predict a spectral mask from fbank features and apply it to the spectrum
            fbanks = compute_fbank(audio_segment.unsqueeze(0), args)
            fbank_input = torch.unsqueeze(torch.transpose(fbanks, 0, 1), 0).to(device)
            out_list = model(fbank_input)
            pred_mask = out_list[-1].permute(2, 1, 0)
            spectrum = stft(audio_segment, args)
            masked_spec = spectrum.cpu() * pred_mask.detach().cpu()
            masked_spec_complex = masked_spec[:, :, 0] + 1j * masked_spec[:, :, 1]
            output_segment = istft(masked_spec_complex, args, len(audio_segment))
            if current_idx == 0:
                outputs[:window - give_up_length] = output_segment[:-give_up_length]
            else:
                outputs[current_idx + give_up_length:current_idx + window - give_up_length] = \
                    output_segment[give_up_length:-give_up_length]
            current_idx += stride
    else:
        # One-shot decoding for short inputs
        audio = torch.from_numpy(inputs).type(torch.FloatTensor)
        fbanks = compute_fbank(audio.unsqueeze(0), args)
        fbank_input = torch.unsqueeze(torch.transpose(fbanks, 0, 1), 0).to(device)
        out_list = model(fbank_input)
        pred_mask = out_list[-1].permute(2, 1, 0)
        spectrum = stft(audio, args)
        masked_spec = spectrum.cpu() * pred_mask.detach().cpu()
        masked_spec_complex = masked_spec[:, :, 0] + 1j * masked_spec[:, :, 1]
        outputs = istft(masked_spec_complex, args, len(audio))

    return outputs.numpy() / MAX_WAV_VALUE


def get_mel(x, args):
    """
    Calls mel_spectrogram() and returns the mel-spectrogram output
    """
    return mel_spectrogram(x, args.n_fft, args.num_mels, args.sampling_rate,
                           args.hop_size, args.win_size, args.fmin, args.fmax)


def decode_one_audio_mossformer2_sr_48k(model, device, inputs, args):
    """
    This function decodes a single audio input using a two-stage speech
    super-resolution model. Supports both offline decoding (for short audio) and
    online decoding (for long audio) with a sliding window approach.
    Parameters:
    -----------
    model : list
        A list of two-stage models:
        - model[0]: The transformer-based Mossformer model for feature enhancement.
        - model[1]: The vocoder for generating high-resolution waveforms.
    device : str or torch.device
        The computation device ('cpu' or 'cuda') where the models will run.
    inputs : torch.Tensor
        A tensor of shape (batch_size, num_samples) containing low-resolution audio signals.
        Only the first audio (inputs[0, :]) is processed.
    args : Namespace
        An object containing the following attributes:
        - sampling_rate: Sampling rate of the input audio (e.g., 48,000 Hz).
        - one_time_decode_length: Maximum duration (in seconds) for offline decoding.
        - decode_window: Window size (in seconds) for sliding window processing.
        - Other optional attributes used for Mel spectrogram extraction.

    Returns:
    --------
    numpy.ndarray
        The high-resolution audio waveform as a NumPy array, refined and upsampled.
    """
    inputs = inputs[0, :]  # Only the first audio in the batch is processed
    input_len = inputs.shape[0]

    if input_len > args.sampling_rate * args.one_time_decode_length:
        # Online decoding: process long audio with a sliding window
        window = int(args.sampling_rate * args.decode_window)
        stride = int(window * 0.75)
        t = inputs.shape[0]

        # Pad the input so it covers an integer number of windows
        if t < window:
            inputs = np.concatenate([inputs, np.zeros(window - t)], 0)
        elif t < window + stride:
            inputs = np.concatenate([inputs, np.zeros(window + stride - t)], 0)
        elif (t - window) % stride != 0:
            inputs = np.concatenate([inputs, np.zeros(stride - (t - window) % stride)], 0)

        audio = torch.from_numpy(inputs).type(torch.FloatTensor)
        t = audio.shape[0]
        outputs = torch.from_numpy(np.zeros(t))
        give_up_length = (window - stride) // 2
        current_idx = 0
        while current_idx + window <= t:
            audio_segment = audio[current_idx:current_idx + window]
            # Stage 1: the Mossformer refines the mel features;
            # Stage 2: the vocoder generates the high-resolution waveform
            mel_segment = get_mel(audio_segment.unsqueeze(0), args)
            mossformer_output_segment = model[0](mel_segment.to(device))
            generator_output_segment = model[1](mossformer_output_segment)
            generator_output_segment = torch.squeeze(generator_output_segment.cpu())
            # The vocoder output may be slightly shorter than the input segment
            offset = len(audio_segment) - len(generator_output_segment)
            if current_idx == 0:
                outputs[:window - give_up_length] = generator_output_segment[:window - give_up_length]
            else:
                generator_output_segment = generator_output_segment[give_up_length:-give_up_length + offset]
                outputs[current_idx + give_up_length:current_idx + window - give_up_length + offset] = \
                    generator_output_segment
            current_idx += stride
    else:
        # Offline decoding: process the whole audio in one pass
        audio = torch.from_numpy(inputs).type(torch.FloatTensor)
        mel_input = get_mel(audio.unsqueeze(0), args)
        mossformer_output = model[0](mel_input.to(device))
        generator_output = model[1](mossformer_output)
        outputs = torch.squeeze(generator_output.cpu())

    outputs = outputs.detach().numpy()
    # Substitute the original low-frequency band back into the generated waveform
    outputs = bandwidth_sub(inputs, outputs)
    return outputs


def decode_one_audio_AV_MossFormer2_TSE_16K(model, inputs, args):
    """Processes audio-visual inputs through the AV MossFormer2 model with target
    speaker extraction (TSE) for decoding at 16kHz.

    This function decodes audio input using the following steps:
    1. Checks if the input audio length requires segmentation or can be processed in one go.
    2. If the input audio is long enough, processes it in overlapping segments using a sliding window approach.
    3. Applies the model to each segment or the entire input, and collects the output.

    Args:
        model (nn.Module): The trained AV MossFormer2 TSE model for speech enhancement.
        inputs (numpy.ndarray): Input audio and visual data.
        args (Namespace): Contains arguments for sampling rate, window size, and other parameters.

    Returns:
        numpy.ndarray: The decoded audio output as a NumPy array.
    """
    audio, visual = inputs
    max_val = np.max(np.abs(audio))
    if max_val > 1:
        audio /= max_val  # Normalize the amplitude to avoid clipping

    b, t = audio.shape
    decode_do_segment = False
    if t > args.sampling_rate * args.one_time_decode_length:
        decode_do_segment = True

    audio = torch.from_numpy(np.float32(audio)).to(args.device)
    visual = torch.from_numpy(np.float32(visual)).to(args.device)

    if decode_do_segment:
        outputs = np.zeros(t)
        window = args.sampling_rate * args.decode_window  # Audio window (samples)
        window_v = 25 * args.decode_window                # Visual window (frames at 25 fps)
        stride = int(window * 0.6)
        give_up_length = (window - stride) // 2
        current_idx = 0
        while current_idx + window <= t:
            tmp_audio = audio[:, current_idx:current_idx + window]
            current_idx_v = int(current_idx / args.sampling_rate * 25)
            tmp_video = visual[:, current_idx_v:current_idx_v + window_v, :, :]
            tmp_output = model(tmp_audio, tmp_video).detach().squeeze().cpu().numpy()
            if current_idx == 0:
                outputs[:window - give_up_length] = tmp_output[:-give_up_length]
            else:
                outputs[current_idx + give_up_length:current_idx + window - give_up_length] = \
                    tmp_output[give_up_length:-give_up_length]
            current_idx += stride
        # Process the trailing samples with the last full window
        tmp_audio = audio[:, -window:]
        tmp_video = visual[:, -window_v:, :, :]
        tmp_output = model(tmp_audio, tmp_video).detach().squeeze().cpu().numpy()
        outputs[current_idx + give_up_length:] = tmp_output[-(t - current_idx - give_up_length):]
        return outputs
    else:
        return model(audio, visual).detach().squeeze().cpu().numpy()
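# All of the segment-wise decoders above share the same sliding-window pattern:
# pad the signal to an integer number of windows, slide an overlapped window, and
# discard `give_up_length` samples at each segment edge when stitching. The sketch
# below isolates that stitching logic with plain NumPy; `overlap_stitch` and the
# identity `process` callback are illustrative stand-ins (not part of this module),
# and the padding rule is a simplified version of the one used above.

```python
import numpy as np


def overlap_stitch(x, window, stride, process):
    """Sketch of segment-wise decoding: run `process` on overlapping windows of
    `x` and keep only the reliable center of each segment in the output."""
    t = x.shape[0]
    # Pad so that (t - window) is a multiple of stride (simplified padding rule).
    if t < window:
        x = np.concatenate([x, np.zeros(window - t)])
    elif (t - window) % stride != 0:
        x = np.concatenate([x, np.zeros(stride - (t - window) % stride)])
    t = x.shape[0]

    out = np.zeros(t)
    give_up_length = (window - stride) // 2  # unreliable edge region per segment
    current_idx = 0
    while current_idx + window <= t:
        seg = process(x[current_idx:current_idx + window])
        if current_idx == 0:
            # First segment: keep everything except the trailing edge.
            out[:window - give_up_length] = seg[:window - give_up_length]
        else:
            # Later segments: keep only the center, dropping both edges.
            out[current_idx + give_up_length:current_idx + window - give_up_length] = \
                seg[give_up_length:window - give_up_length]
        current_idx += stride
    return out
```

# With `process` set to the identity, the stitched output reproduces the input
# everywhere the sliding windows provide a reliable center, which is a quick
# sanity check that the index arithmetic drops exactly the edge samples.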