import numpy as np
import os
import sys
import csv
import math
import random

import torch
import torch.nn as nn
import torch.utils.data as utils
import torch.distributed as dist
from torch.utils.data import Dataset
import torchaudio
import soundfile as sf
import librosa

sys.path.append(os.path.dirname(__file__))
from dataloader.misc import read_and_config_file

# Small constant to avoid division by zero, and the full-scale value of 16-bit PCM audio.
EPS = 1e-6
MAX_WAV_VALUE = 32768.0


def audioread(path, sampling_rate):
    """
    Reads an audio file from the specified path, normalizes the audio,
    resamples it to the desired sampling rate (if necessary), and ensures it is single-channel.

    Parameters:
    path (str): The file path of the audio file to be read.
    sampling_rate (int): The target sampling rate for the audio.

    Returns:
    numpy.ndarray: The processed audio data, normalized, resampled (if necessary),
    and converted to mono (if the input audio has multiple channels).
    """
    data, fs = sf.read(path)
    data = audio_norm(data)
    if fs != sampling_rate:
        data = librosa.resample(data, orig_sr=fs, target_sr=sampling_rate)
    if len(data.shape) > 1:
        # Keep only the first channel of multi-channel audio.
        data = data[:, 0]
    return data


def audio_norm(x):
    """
    Normalizes the input audio signal to a target Root Mean Square (RMS) level,
    applying two stages of scaling. This ensures the audio signal is neither too quiet
    nor too loud, keeping its amplitude consistent.

    Parameters:
    x (numpy.ndarray): Input audio signal to be normalized.

    Returns:
    numpy.ndarray: Normalized audio signal.
    """
    # Stage 1: scale the whole signal towards a -25 dB RMS level.
    rms = (x ** 2).mean() ** 0.5
    scalar = 10 ** (-25 / 20) / (rms + EPS)
    x = x * scalar
    # Stage 2: rescale using the RMS of the higher-energy (above-average power) samples only.
    pow_x = x ** 2
    avg_pow_x = pow_x.mean()
    rmsx = pow_x[pow_x > avg_pow_x].mean() ** 0.5
    scalarx = 10 ** (-25 / 20) / (rmsx + EPS)
    x = x * scalarx
    return x


class DataReader(object):
    """
    A class for reading audio data from a list of files, normalizing it,
    and extracting features for further processing. It supports extracting
    features from each file, reshaping the data, and returning metadata
    like utterance ID and data length.

    Parameters:
    args: Arguments containing the input path and target sampling rate.

    Attributes:
    file_list (list): A list of audio file paths to process.
    sampling_rate (int): The target sampling rate for audio files.
    """

    def __init__(self, args):
        self.file_list = read_and_config_file(args, args.input_path, decode=True)
        self.sampling_rate = args.sampling_rate
        self.args = args

    def __len__(self):
        """
        Returns the number of audio files in the file list.

        Returns:
        int: Number of files to process.
        """
        return len(self.file_list)

    def __getitem__(self, index):
        """
        Retrieves the features of the audio file at the given index.

        Parameters:
        index (int): Index of the file in the file list.

        Returns:
        tuple: Features (inputs, utterance ID, data length) for the selected audio file.
        """
        # Lip-cued target speaker extraction consumes the raw file entry directly.
        if self.args.task == 'target_speaker_extraction' and self.args.network_reference.cue == 'lip':
            return self.file_list[index]
        return self.extract_feature(self.file_list[index])

    def extract_feature(self, path):
        """
        Extracts features from the given audio file path.

        Parameters:
        path (str): The file path of the audio file.

        Returns:
        inputs (numpy.ndarray): Reshaped audio data for further processing.
        utt_id (str): The unique identifier of the audio file, usually the filename.
        length (int): The length of the original audio data.
        """
        utt_id = path.split('/')[-1]
        data = audioread(path, self.sampling_rate).astype(np.float32)
        inputs = np.reshape(data, [1, data.shape[0]])
        return inputs, utt_id, data.shape[0]


class Wave_Processor(object):
    """
    A class for processing audio data, specifically for reading input and label audio files,
    segmenting them into fixed-length segments, and applying padding or trimming as necessary.

    Methods:
    process(path, segment_length, sampling_rate):
        Processes audio data by reading, padding, or segmenting it to match the specified segment length.

    Parameters:
    path (dict): A dictionary containing file paths for 'inputs' and 'labels' audio files.
    segment_length (int): The desired length of audio segments to extract.
    sampling_rate (int): The target sampling rate for reading the audio files.
    """
    def process(self, path, segment_length, sampling_rate):
        """
        Reads input and label audio files, and ensures the audio is segmented into
        the desired length, padding if necessary or extracting random segments if
        the audio is longer than the target segment length.

        Parameters:
        path (dict): Dictionary containing the paths to 'inputs' and 'labels' audio files.
        segment_length (int): Desired length of the audio segment in samples.
        sampling_rate (int): Target sample rate for the audio.

        Returns:
        tuple: A pair of numpy arrays representing the processed input and label audio,
        either padded to the segment length or trimmed.
        """
        wave_inputs = audioread(path['inputs'], sampling_rate)
        wave_labels = audioread(path['labels'], sampling_rate)
        len_wav = wave_labels.shape[0]

        if wave_inputs.shape[0] < segment_length:
            # Audio is shorter than the segment length: zero-pad both waveforms.
            padded_inputs = np.zeros(segment_length, dtype=np.float32)
            padded_labels = np.zeros(segment_length, dtype=np.float32)
            padded_inputs[:wave_inputs.shape[0]] = wave_inputs
            padded_labels[:wave_labels.shape[0]] = wave_labels
        else:
            # Audio is long enough: crop a random segment of the desired length.
            st_idx = random.randint(0, len_wav - segment_length)
            padded_inputs = wave_inputs[st_idx:st_idx + segment_length]
            padded_labels = wave_labels[st_idx:st_idx + segment_length]
        return padded_inputs, padded_labels


class Fbank_Processor(object):
    """
    A class for processing input audio data into mel-filterbank (Fbank) features,
    including the computation of delta and delta-delta features.

    Methods:
    process(inputs, args):
        Processes the raw audio input and returns the mel-filterbank features
        along with delta and delta-delta features.
    """

    def process(self, inputs, args):
        # Convert window length/shift from samples to milliseconds for the Kaldi-style fbank.
        frame_length = int(args.win_len / args.sampling_rate * 1000)
        frame_shift = int(args.win_inc / args.sampling_rate * 1000)
        fbank_config = {
            "dither": 1.0,
            "frame_length": frame_length,
            "frame_shift": frame_shift,
            "num_mel_bins": args.num_mels,
            "sample_frequency": args.sampling_rate,
            "window_type": args.win_type,
        }

        # Scale the waveform to 16-bit range, as expected by torchaudio's Kaldi frontend.
        inputs = torch.FloatTensor(inputs * MAX_WAV_VALUE)
        fbank = torchaudio.compliance.kaldi.fbank(inputs.unsqueeze(0), **fbank_config)

        # Compute delta and delta-delta features along the time axis.
        fbank_tr = torch.transpose(fbank, 0, 1)
        fbank_delta = torchaudio.functional.compute_deltas(fbank_tr)
        fbank_delta_delta = torchaudio.functional.compute_deltas(fbank_delta)
        fbank_delta = torch.transpose(fbank_delta, 0, 1)
        fbank_delta_delta = torch.transpose(fbank_delta_delta, 0, 1)

        fbanks = torch.cat([fbank, fbank_delta, fbank_delta_delta], dim=1)
        return fbanks.numpy()


class AudioDataset(Dataset):
    """
    A dataset class for loading and processing audio data from different data types
    (train, validation, test). Supports audio processing and feature extraction
    (e.g., waveform processing, Fbank feature extraction).

    Parameters:
    args: Arguments containing dataset configuration (paths, sampling rate, etc.).
    data_type (str): The type of data to load (train, val, test).
    """

    def __init__(self, args, data_type):
        self.args = args
        self.sampling_rate = args.sampling_rate

        if data_type == 'train':
            self.wav_list = read_and_config_file(args, args.tr_list)
        elif data_type == 'val':
            self.wav_list = read_and_config_file(args, args.cv_list)
        elif data_type == 'test':
            self.wav_list = read_and_config_file(args, args.tt_list)
        else:
            print(f'Data type: {data_type} is unknown!')

        self.wav_processor = Wave_Processor()
        self.fbank_processor = Fbank_Processor()
        # Segment length in samples = sampling rate * maximum segment duration (seconds).
        self.segment_length = self.sampling_rate * self.args.max_length
        print(f'No. {data_type} files: {len(self.wav_list)}')

    def __len__(self):
        return len(self.wav_list)

    def __getitem__(self, index):
        data_info = self.wav_list[index]
        inputs, labels = self.wav_processor.process(
            {'inputs': data_info['inputs'], 'labels': data_info['labels']},
            self.segment_length,
            self.sampling_rate)

        if self.args.load_fbank is not None:
            fbanks = self.fbank_processor.process(inputs, self.args)
            return inputs * MAX_WAV_VALUE, labels * MAX_WAV_VALUE, fbanks
        return inputs, labels


def zero_pad_concat(inputs):
    """
    Concatenates a list of input arrays, applying zero-padding as needed to ensure
    they all match the length of the longest input.

    Parameters:
    inputs (list of numpy arrays): List of input arrays to be concatenated.

    Returns:
    numpy.ndarray: A zero-padded array with concatenated inputs.
    """
    max_t = max(inp.shape[0] for inp in inputs)
    shape = None
    if len(inputs[0].shape) == 1:
        shape = (len(inputs), max_t)
    elif len(inputs[0].shape) == 2:
        shape = (len(inputs), max_t, inputs[0].shape[1])

    input_mat = np.zeros(shape, dtype=np.float32)
    for e, inp in enumerate(inputs):
        if len(inp.shape) == 1:
            input_mat[e, :inp.shape[0]] = inp
        elif len(inp.shape) == 2:
            input_mat[e, :inp.shape[0], :] = inp
    return input_mat


def collate_fn_2x_wavs(data):
    """
    A custom collate function for combining batches of waveform input and label pairs.

    Parameters:
    data (list): List of tuples (inputs, labels).

    Returns:
    tuple: Batched inputs and labels as torch.FloatTensors.
    """
    inputs, labels = zip(*data)
    x = torch.FloatTensor(zero_pad_concat(inputs))
    y = torch.FloatTensor(zero_pad_concat(labels))
    return x, y


def collate_fn_2x_wavs_fbanks(data):
    """
    A custom collate function for combining batches of waveform inputs, labels, and Fbank features.

    Parameters:
    data (list): List of tuples (inputs, labels, fbanks).

    Returns:
    tuple: Batched inputs, labels, and Fbank features as torch.FloatTensors.
    """
    inputs, labels, fbanks = zip(*data)
    x = torch.FloatTensor(zero_pad_concat(inputs))
    y = torch.FloatTensor(zero_pad_concat(labels))
    z = torch.FloatTensor(zero_pad_concat(fbanks))
    return x, y, z


class DistributedSampler(utils.Sampler):
    """
    Sampler for distributed training. Divides the dataset among multiple replicas (processes),
    ensuring that each process gets a unique subset of the data. It also supports shuffling
    and managing epochs.

    Parameters:
    dataset (Dataset): The dataset to sample from.
    num_replicas (int): Number of processes participating in the training.
    rank (int): Rank of the current process.
    shuffle (bool): Whether to shuffle the data or not.
    seed (int): Random seed for reproducibility.
    """

    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0):
        if num_replicas is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            num_replicas = dist.get_world_size()
        if rank is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            rank = dist.get_rank()

        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank
        self.epoch = 0
        # Each replica draws the same number of samples; the index list is padded to total_size.
        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
        self.total_size = self.num_samples * self.num_replicas
        self.shuffle = shuffle
        self.seed = seed

    def __iter__(self):
        if self.shuffle:
            # Deterministic shuffling based on the seed and the current epoch.
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            ind = torch.randperm(int(len(self.dataset) // self.num_replicas), generator=g) * self.num_replicas
            indices = []
            for i in range(self.num_replicas):
                indices = indices + (ind + i).tolist()
        else:
            indices = list(range(len(self.dataset)))

        # Pad the index list so it is evenly divisible across replicas.
        indices += indices[:(self.total_size - len(indices))]
        assert len(indices) == self.total_size

        # Take the contiguous block of indices assigned to this rank.
        indices = indices[self.rank * self.num_samples:(self.rank + 1) * self.num_samples]
        assert len(indices) == self.num_samples
        return iter(indices)

    def __len__(self):
        return self.num_samples

    def set_epoch(self, epoch):
        self.epoch = epoch


def get_dataloader(args, data_type):
    """
    Creates and returns a data loader and sampler for the specified dataset type
    (train, validation, or test).

    Parameters:
    args (Namespace): Configuration arguments containing details such as batch size,
    sampling rate, network type, and whether distributed training is used.
    data_type (str): The type of dataset to load ('train', 'val', 'test').

    Returns:
    sampler (DistributedSampler or None): The sampler for distributed training, or None if not used.
    generator (DataLoader): The PyTorch DataLoader for the specified dataset.
    """
    datasets = AudioDataset(args=args, data_type=data_type)
    sampler = DistributedSampler(
        datasets,
        num_replicas=args.world_size,
        rank=args.local_rank) if args.distributed else None

    # Choose the collate function according to the network type.
    if args.network == 'FRCRN_SE_16K' or args.network == 'MossFormerGAN_SE_16K':
        collate_fn = collate_fn_2x_wavs
    elif args.network == 'MossFormer2_SE_48K':
        collate_fn = collate_fn_2x_wavs_fbanks
    else:
        print('in dataloader, please specify a correct network type using args.network!')
        return None, None

    generator = utils.DataLoader(
        datasets,
        batch_size=args.batch_size,
        shuffle=(sampler is None),
        collate_fn=collate_fn,
        num_workers=args.num_workers,
        sampler=sampler)
    return sampler, generator
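
# ---------------------------------------------------------------------------
# Minimal usage sketch (not part of the training pipeline above): it shows how
# `get_dataloader` is typically wired up for a non-distributed run. The
# attribute values on `args` below (list-file paths, segment length, batch
# size, etc.) are illustrative assumptions; in practice they are populated
# from the project's configuration by the training entry point.
if __name__ == '__main__':
    from argparse import Namespace

    args = Namespace(
        tr_list='data/tr_list.csv',   # assumed path to a training file list
        cv_list='data/cv_list.csv',   # assumed path to a validation file list
        tt_list='data/tt_list.csv',   # assumed path to a test file list
        sampling_rate=16000,
        max_length=4,                 # segment length in seconds
        load_fbank=None,              # waveform-only pipeline (no Fbank features)
        network='FRCRN_SE_16K',
        batch_size=2,
        num_workers=0,
        distributed=False,
        world_size=1,
        local_rank=0,
    )

    sampler, train_loader = get_dataloader(args, 'train')
    for inputs, labels in train_loader:
        # Both tensors have shape (batch_size, segment_length).
        print(inputs.shape, labels.shape)
        break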