a ,d @sLddlZddlZddlmZddlmZddlmZm Z ddl Z ddZ dS)N)TfidfVectorizer)cosine_similarity) AutoTokenizer AutoModelcCs*dd}dd}|||}||||}|S)NcsltdtdfddddfddtD}fddtD}t||}|S) Nz"distilbert-base-multilingual-casedcsp|d|d|d|d|d}j|ddddd }fi|}|j}ttj|dd }|S) NrTpti)add_special_tokensreturn_tensors max_length truncation)dim)Z encode_pluslast_hidden_statenpsqueezetorchmeandetachnumpy)sentencecontexttokensoutputs hidden_statesvector)model tokenizer.compute_similarity..sentence_to_vectorcSs|dkrd}d}n2|dkr,||d}d}n||d}||d}|t|dkr^d}d}n:|t|dkr||d}d}n||d}||d}||||fS)Nrrr)len) sentencesindexZ prev_sentenceZpprev_sentenceZ next_sentenceZnnext_sentencerrr get_contexts"      zBclassify_by_topic..compute_similarity..get_contextcs g|]\}}||qSrr.0ir)articlesr&r!rr 4szAclassify_by_topic..compute_similarity..cs g|]\}}||qSrrr')central_topicsr&r!rr r+6s)rfrom_pretrainedr enumerater)r*r,Z doc_vectorsZ topic_vectorsZcos_sim_matrixr)r*r,r&rr!rr compute_similarity s   z-classify_by_topic..compute_similarityc SsLg}|}t||D]0\}}t|}||}||||fq|S)N)copyzipmaxtolistr%append) r*r,similarity_matrixgroupZoriginal_articlesarticle similarityZmax_similarity max_indexrrr group_by_topic>sz)classify_by_topic..group_by_topicr)r*r,r/r:r5groupsrrr classify_by_topic s 2  r<) gensimrrZsklearn.feature_extraction.textrZsklearn.metrics.pairwiser transformersrrrr<rrrr s