import html
import logging
import re

import pyarabic.araby as araby

ACCEPTED_MODELS = [
    "bert-base-arabertv01",
    "bert-base-arabert",
    "bert-base-arabertv02",
    "bert-base-arabertv2",
    "bert-large-arabertv02",
    "bert-large-arabertv2",
    "araelectra-base",
    "araelectra-base-discriminator",
    "araelectra-base-generator",
    "aragpt2-base",
    "aragpt2-medium",
    "aragpt2-large",
    "aragpt2-mega",
]

SEGMENTED_MODELS = [
    "bert-base-arabert",
    "bert-base-arabertv2",
    "bert-large-arabertv2",
]


class ArabertPreprocessor:
    """
    A preprocessor class that cleans and preprocesses text for all models in the AraBERT repo.
    It can also unprocess the text output of the generated text.

    Args:

        model_name (:obj:`str`): model name from the HuggingFace Models page, with or without
            the aubmindlab tag. Unrecognized names fall back to "bert-base-arabertv02".
            Current accepted models are:

            - :obj:`"bert-base-arabertv01"`: No farasa segmentation.
            - :obj:`"bert-base-arabert"`: with farasa segmentation.
            - :obj:`"bert-base-arabertv02"`: No farasa segmentation.
            - :obj:`"bert-base-arabertv2"`: with farasa segmentation.
            - :obj:`"bert-large-arabertv02"`: No farasa segmentation.
            - :obj:`"bert-large-arabertv2"`: with farasa segmentation.
            - :obj:`"araelectra-base"`: No farasa segmentation.
            - :obj:`"araelectra-base-discriminator"`: No farasa segmentation.
            - :obj:`"araelectra-base-generator"`: No farasa segmentation.
            - :obj:`"aragpt2-base"`: No farasa segmentation.
            - :obj:`"aragpt2-medium"`: No farasa segmentation.
            - :obj:`"aragpt2-large"`: No farasa segmentation.
            - :obj:`"aragpt2-mega"`: No farasa segmentation.

        keep_emojis(:obj:`bool`): don't remove emojis while preprocessing. Defaults to False

        remove_html_markup(:obj:`bool`): whether to remove HTML artifacts;
            should be set to False when preprocessing TyDi QA. Defaults to True

        replace_urls_emails_mentions(:obj:`bool`): whether to replace URLs, emails and mentions
            by special tokens. Defaults to True

        strip_tashkeel(:obj:`bool`): remove diacritics (FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA,
            KASRA, SUKUN, SHADDA). Defaults to True

        strip_tatweel(:obj:`bool`): remove tatweel '\\u0640'. Defaults to True

        insert_white_spaces(:obj:`bool`): insert whitespace around every character that is not an
            Arabic or English digit, an Arabic or English letter, or a square bracket, then insert
            whitespace between adjacent words and numbers. Defaults to True

        remove_elongation(:obj:`bool`): collapse any non-digit character repeated more than twice
            in a row into a single occurrence. Defaults to True

    Returns:

        ArabertPreprocessor: the preprocessor class

    Example:

        from preprocess import ArabertPreprocessor

        arabert_prep = ArabertPreprocessor("aubmindlab/bert-base-arabertv2")

        arabert_prep.preprocess("SOME ARABIC TEXT")
    """

    def __init__(
        self,
        model_name,
        keep_emojis=False,
        remove_html_markup=True,
        replace_urls_emails_mentions=True,
        strip_tashkeel=True,
        strip_tatweel=True,
        insert_white_spaces=True,
        remove_elongation=True,
    ):
        """See the class docstring for a description of every argument."""
        model_name = model_name.replace("aubmindlab/", "")

        if model_name not in ACCEPTED_MODELS:
            logging.warning(
                "Model provided is not in the accepted model list. Assuming you don't want Farasa Segmentation"
            )
            self.model_name = "bert-base-arabertv02"
        else:
            self.model_name = model_name

        self.keep_emojis = keep_emojis
        self.remove_html_markup = remove_html_markup
        self.replace_urls_emails_mentions = replace_urls_emails_mentions
        self.strip_tashkeel = strip_tashkeel
        self.strip_tatweel = strip_tatweel
        self.insert_white_spaces = insert_white_spaces
        self.remove_elongation = remove_elongation

    def preprocess(self, text):
        """
        Preprocess takes an input text line and applies the same preprocessing used in AraBERT
        pretraining.

        Args:

            text (:obj:`str`): input text string

        Returns:

            string: A preprocessed string depending on which model was selected
        """
        text = str(text)
        text = html.unescape(text)
        if self.strip_tashkeel:
            text = araby.strip_tashkeel(text)
        if self.strip_tatweel:
            text = araby.strip_tatweel(text)

        if self.replace_urls_emails_mentions:
            # replace every URL with the [رابط] token
            for reg in url_regexes:
                text = re.sub(reg, " [رابط] ", text)
            # replace every email with the [بريد] token
            for reg in email_regexes:
                text = re.sub(reg, " [بريد] ", text)
            # replace every mention with the [مستخدم] token
            text = re.sub(user_mention_regex, " [مستخدم] ", text)

        if self.remove_html_markup:
            # remove HTML line breaks, then any remaining HTML tags
            text = re.sub("<br />", " ", text)
            text = re.sub("</?[^>]+>", " ", text)

        if self.remove_elongation:
            text = self._remove_elongation(text)

        if self.insert_white_spaces:
            # whitespace around anything that is not a digit, a letter or a square bracket
            text = re.sub(
                r"([^0-9\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z\[\]])",
                r" \1 ",
                text,
            )
            # whitespace between numbers and words, in either order
            text = re.sub(r"(\d+)([\u0621-\u063A\u0641-\u064A\u0660-\u066C]+)", r" \1 \2 ", text)
            text = re.sub(r"([\u0621-\u063A\u0641-\u064A\u0660-\u066C]+)(\d+)", r" \1 \2 ", text)

        # drop every character outside the accepted set
        text = re.sub(rejected_chars_regex, " ", text)

        # remove the variation selector U+FE0F and collapse extra spaces
        text = " ".join(text.replace("\uFE0F", "").split())

        return text

    def unpreprocess(self, text, desegment=True):
        """Re-formats the text to a classic format where punctuation, brackets and parentheses
        are not separated by whitespace. The objective is to make the generated text of any
        model appear natural and not preprocessed.

        Args:
            text (str): input text to be un-preprocessed
            desegment (bool, optional): whether or not to remove farasa pre-segmentation before.
                Defaults to True.

        Returns:
            str: The unpreprocessed (and possibly Farasa-desegmented) text.
        """
        # remove the whitespace padding inside quotation marks, back quotes and em dashes
        text = re.sub(white_spaced_double_quotation_regex, r'"\1"', text)
        text = re.sub(white_spaced_single_quotation_regex, r"'\1'", text)
        text = re.sub(white_spaced_back_quotation_regex, r"`\1`", text)
        text = re.sub(white_spaced_em_dash, r"—\1—", text)

        # normalize the spacing around periods, then re-attach decimals and thousands separators
        text = text.replace(".", " . ")
        text = " ".join(text.split())
        text = re.sub(r"(\d+) \. (\d+)", r"\1.\2", text)
        text = re.sub(r"(\d+) \, (\d+)", r"\1,\2", text)

        text = re.sub(left_and_right_spaced_chars, r"\1", text)
        text = re.sub(left_spaced_chars, r"\1", text)
        text = re.sub(right_spaced_chars, r"\1", text)

        return text

    def _remove_elongation(self, text):
        """
        :param text: the input text to remove elongation
        :return: delongated text
        """
        # loop once per elongation found in the text
        for index_ in range(len(re.findall(regex_tatweel, text))):
            elongation = re.search(regex_tatweel, text)
            if elongation:
                elongation_pattern = elongation.group()
                elongation_replacement = elongation_pattern[0]
                elongation_pattern = re.escape(elongation_pattern)
                text = re.sub(
                    elongation_pattern, elongation_replacement, text, flags=re.MULTILINE
                )
            else:
                break
        return text

    def _remove_redundant_punct(self, text):
        text_ = text
        result = re.search(redundant_punct_pattern, text)
        dif = 0
        while result:
            # keep one copy of each punctuation mark in the run, in order of first appearance
            sub = result.group()
            sub = sorted(set(sub), key=sub.index)
            sub = " " + "".join(list(sub)) + " "
            text = "".join(
                (text[: result.span()[0] + dif], sub, text[result.span()[1] + dif :])
            )
            text_ = "".join(
                (text_[: result.span()[0]], text_[result.span()[1] :])
            ).strip()
            dif = abs(len(text) - len(text_))
            result = re.search(redundant_punct_pattern, text_)
        text = re.sub(r"\s+", " ", text)
        return text.strip()


prefix_list = [
    "ال",
    "و",
    "ف",
    "ب",
    "ك",
    "ل",
    "لل",
    "\u0627\u0644",
    "\u0648",
    "\u0641",
    "\u0628",
    "\u0643",
    "\u0644",
    "\u0644\u0644",
    "س",
]
suffix_list = [
    "ه",
    "ها",
    "ك",
    "ي",
    "هما",
    "كما",
    "نا",
    "كم",
    "هم",
    "هن",
    "كن",
    "ا",
    "ان",
    "ين",
    "ون",
    "وا",
    "ات",
    "ت",
    "ن",
    "ة",
    "\u0647",
    "\u0647\u0627",
    "\u0643",
    "\u064a",
    "\u0647\u0645\u0627",
    "\u0643\u0645\u0627",
    "\u0646\u0627",
    "\u0643\u0645",
    "\u0647\u0645",
    "\u0647\u0646",
    "\u0643\u0646",
    "\u0627",
    "\u0627\u0646",
    "\u064a\u0646",
    "\u0648\u0646",
    "\u0648\u0627",
    "\u0627\u062a",
    "\u062a",
    "\u0646",
    "\u0629",
]
other_tokens = ["[رابط]", "[مستخدم]", "[بريد]"]

# the never_split list is used with the transformers library
prefix_symbols = [x + "+" for x in prefix_list]
suffix_symblos = ["+" + x for x in suffix_list]
never_split_tokens = list(set(prefix_symbols + suffix_symblos + other_tokens))

url_regexes = [
    r"(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)",
    r"@(https?|ftp)://(-\.)?([^\s/?\.#-]+\.?)+(/[^\s]*)?$@iS",
    r"http[s]?://[a-zA-Z0-9_\-./~\?=%&]+",
    r"www[a-zA-Z0-9_\-?=%&/.~]+",
    r"[a-zA-Z]+\.com",
    r"(?=http)[^\s]+",
    r"(?=www)[^\s]+",
    r"://",
]
user_mention_regex = r"@[\w\d]+"
email_regexes = [r"[\w-]+@([\w-]+\.)+[\w-]+", r"\S+@\S+"]
redundant_punct_pattern = (
    r"([!\"#\$%\'\(\)\*\+,\.:;\-<=·>?@\[\\\]\^_ـ`{\|}~—٪’،؟`୍“؛”ۚ【»؛\s+«–…‘]{2,})"
)
regex_tatweel = r"(\D)\1{2,}"
rejected_chars_regex = r"[^0-9\u0621-\u063A\u0640-\u066C\u0671-\u0674a-zA-Z\[\]!\"#\$%\'\(\)\*\+,\.:;\-<=·>?@\[\\\]\^_ـ`{\|}~—٪’،؟`୍“؛”ۚ»؛\s+«–…‘]"

regex_url_step1 = r"(?=http)[^\s]+"
regex_url_step2 = r"(?=www)[^\s]+"
regex_url = r"(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)"
regex_mention = r"@[\w\d]+"
regex_email = r"\S+@\S+"

chars_regex = r"0-9\u0621-\u063A\u0640-\u066C\u0671-\u0674a-zA-Z\[\]!\"#\$%\'\(\)\*\+,\.:;\-<=·>?@\[\\\]\^_ـ`{\|}~—٪’،؟`୍“؛”ۚ»؛\s+«–…‘"

white_spaced_double_quotation_regex = r'\"\s+([^"]+)\s+\"'
white_spaced_single_quotation_regex = r"\'\s+([^']+)\s+\'"
white_spaced_back_quotation_regex = r"\`\s+([^`]+)\s+\`"
white_spaced_em_dash = r"\—\s+([^—]+)\s+\—"

left_spaced_chars = r" ([\]!#\$%\),\.:;\?}٪’،؟”؛…»·])"
right_spaced_chars = r"([\[\(\{“«‘*\~]) "
left_and_right_spaced_chars = r" ([\+\-\<\=\>\@\\\^\_\|\–]) "
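
# The replace_urls_emails_mentions step described in the class docstring can be
# illustrated in isolation. The sketch below is a simplified, standalone version:
# it uses only one URL pattern and one email pattern from the lists above, and the
# function name replace_urls_emails_mentions_sketch is hypothetical, not part of
# this module. It assumes the same order as preprocess(): URLs, then emails, then
# mentions.

```python
import re

# A subset of the module's patterns, copied here so the sketch runs on its own.
URL_PATTERN = r"http[s]?://[a-zA-Z0-9_\-./~\?=%&]+"
EMAIL_PATTERN = r"\S+@\S+"
MENTION_PATTERN = r"@[\w\d]+"


def replace_urls_emails_mentions_sketch(text: str) -> str:
    """Replace URLs, emails and mentions with the special Arabic tokens."""
    text = re.sub(URL_PATTERN, " [رابط] ", text)         # URL token
    text = re.sub(EMAIL_PATTERN, " [بريد] ", text)        # email token
    text = re.sub(MENTION_PATTERN, " [مستخدم] ", text)    # mention token
    # collapse the extra spaces introduced by the padded tokens
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "visit https://example.com or mail me@example.org cc @user"
    print(replace_urls_emails_mentions_sketch(sample))
```

# Emails are replaced before mentions so that the "@" inside an address is not
# consumed by the mention pattern first.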