o Hre! @sddlZddlZedddlZddlZddlZddlm Z ddl Z ddl Z ddl mZddlmZddlZddlZddlmZmZddlmZddlZddlmZedZddlmZddlZddlZddl Z e!dd  Z"e"#Z$Wdn1s~wYe%e$Z&d e'd e'fd d Z(de)de*d ej+fddZ,e&fde'de*de-d ej.fddZ/Gdddej0Z1dZ2eGdddZ3e3e4e&ddddde2dZ5e1e5j6e5j7e5j8e5j9e5j:d Z;e;d&d'Z?dS)(N stopwords)Counter)r) DataLoader TensorDataset) dataclassrussianzmodel/vocab_to_int.jsonrtextreturncCsJ|}tdd|}ddd|D}dd|D}d|}|S)zpreprocessing string: lowercase, removing html-tags, punctuation and stopwords Args: text (str): input string for preprocessing Returns: str: preprocessed string z<.*?>cSsg|] }|tjvr|qS)string punctuation).0cr r f/Users/Anastasia/ds_bootcamp/Проекты ds-phase-2 /03_nlp_lstm_project_streamlit_app/model/rnn.py 2sz&data_preprocessing..cSsg|]}|tvr|qSr )russian_stopwords)rwordr r rr3s )lowerresubjoinsplit)r splitted_textr r rdata_preprocessing&s   r review_intseq_lencCsvtjd|ftd}t|D]+\}}t||kr'tt|t|}||}n|d|}t|||ddf<q |S)a$Make left-sided padding for input list of tokens Args: review_int (list): input list of tokens seq_len (int): max length of sequence, it len(review_int[i]) > seq_len it will be trimmed, else it will be padded by zeros Returns: np.array: padded sequences i)dtypeN)npzerosint enumeratelenlistarray)rrfeaturesireviewr!newr r rpadding7s    r+ input_string vocab_to_intc Csxt|}g}|D]$}z |||Wq ty.}z t|dWYd}~q d}~wwt|g|d}t|S)aFunction for all preprocessing steps on a single string Args: input_string (str): input single string for preprocessing seq_len (int): max length of sequence, it len(review_int[i]) > seq_len it will be trimmed, else it will be padded by zeros vocab_to_int (dict, optional): word corpus {'word' : int index}. Defaults to vocab_to_int. Returns: list: preprocessed string z: not in dictionary!Nr)rrappendKeyErrorprintr+torchTensor)r,rr-preprocessed_string result_listre result_paddedr r rpreprocess_single_stringLs  r7c sTeZdZdZ ddedededededd f fd d Zd ejdejfd dZZ S)RNNNetuv vocab_size: int, размер словаря (аргумент embedding-слоя) emb_size: int, размер вектора для описания каждого элемента последовательности hidden_dim: int, размер вектора скрытого состояния, default 0 batch_size: int, размер batch'а  vocab_sizeemb_size hidden_dimrn_layersr Ncst||_||_||_||_||_t|j|j|_ tj |j|jd|d|_ t t |j|jdtt dd|_dS)NT) input_size hidden_size batch_first num_layers)super__init__rr;r<r=r:nn Embedding embeddingRNNrnn_cell SequentialLinearTanhlinear)selfr:r;r<rr= __class__r rrEps$   zRNNNet.__init__xcCsJ||tj}||\}}||dd}|| d}|S)Nr) rHtornn_confdevicerJ contiguousviewsizerNsqueeze)rOrRoutput_outr r rforwards zRNNNet.forward)r9) __name__ __module__ __qualname____doc__r"rEr1r2r^ __classcell__r r rPrr8gs"r8dc@s>eZdZUeed<eed<eed<eed<eed<eed<dS) ConfigRNNr:rVr= embedding_dimr?rN)r_r`rar"__annotations__strr r r rres  rer9cpurB)r:rVr=rfr?r)r:r;r<rr=zmodel/weights.ptuНейтральныйuПоложительныйuОтрицательный)r9rcCs`t}tt|tddtj }t}||}dt t | d|ddS)N)rrzRNN: ***u-***, время предсказания: ***z.4fu сек***.)time rnn_modelr7SEQ_LEN unsqueezelongrTrUrVsigmoidresultr1argmaxitem)r start_time probabilityend_timeinference_timer r rpreds &"ry)@r1nltkdownloadosnumpyr pandaspdmatplotlib.pyplotpyplotpltrr collectionsr nltk.corpusr streamlitsttorch.utils.datarrtorch.nnrF torchutilstu dataclassesrwordsrjsonrljoblibopen json_fileread json_dataloadsr-rhrr%r"r&r+dictr2r7Moduler8rnrer$rUr:rfr?rr=rmload_state_dictloadrrryr r r rsx            1