import os
import pandas as pd
import textwrap
import random
from datasets import load_dataset
from IPython.display import display


class LMSYSChat1MHandler:
    def __init__(self, hf_token, streaming=False, verbose=True):
        self.hf_token = hf_token
        self.streaming = streaming
        self.lmsys_dataset = load_dataset(
            'lmsys/lmsys-chat-1m',
            revision="main",
            token=self.hf_token,
            streaming=self.streaming
        )
        self.verbose = verbose
        if verbose:
            print(self.lmsys_dataset)
        self.df_sample = None
        self.df_prompts = None
        self.unwrapped_turns_df = None

        if not self.streaming and verbose:
            print('Data is cached at:\n')
            for file_info in self.lmsys_dataset['train'].cache_files:
                filename = file_info['filename']
                file_size = os.path.getsize(filename)
                # Abbreviate long cache paths: keep the head and the last 41 characters.
                # The exact slice bounds are an assumption.
                i = int((len(filename) - 41) / 2)
                print(f"Filename: {filename[:i]}*{filename[-41:]}\nSize: {file_size} bytes")

    def extract_df_sample(self, n_samples=None, conversation_ids=None):
        """
        Extracts a sample of conversations or specific conversations based on their conversation IDs.

        Parameters:
        - n_samples (int): Number of random samples to extract. Ignored if `conversation_ids` is provided.
        - conversation_ids (list): List of conversation IDs to extract. If provided, this takes precedence over `n_samples`.

        Returns:
        - pd.DataFrame: A DataFrame containing the extracted conversations.
        """
        if conversation_ids:
            # Filter the full dataset by the requested conversation IDs
            df_sample = self.lmsys_dataset['train'].to_pandas()
            df_sample = df_sample[df_sample['conversation_id'].isin(conversation_ids)]
            print(f"Retrieved {len(df_sample)} conversations based on specified IDs")
        else:
            if not self.streaming:
                df_sample = self.lmsys_dataset['train'].to_pandas().sample(n_samples)
                print(f"Retrieved {len(df_sample)} random conversations from lmsys/lmsys-chat-1m")
            else:
                # Streaming mode: take the first n_samples rows from the stream, then shuffle them
                streamed_samples = []
                for i, row in enumerate(self.lmsys_dataset['train']):
                    streamed_samples.append(row)
                    if i + 1 >= n_samples:
                        break
                random.shuffle(streamed_samples)
                df_sample = pd.DataFrame(streamed_samples)

        self.df_sample = df_sample
        # Preview thresholds (4 rows, head/tail of 2) are assumptions
        if self.verbose and len(df_sample) > 4:
            display(df_sample.head(2))
            print('...')
            display(df_sample.tail(2))
        return df_sample

    def parquet_sampling(self, n_samples):
        base_url = "https://huggingface.co/datasets/lmsys/lmsys-chat-1m/resolve/main/data/"
        data_files = [
            "train-00000-of-00006-4feeb3f83346a0e9.parquet",
            "train-00001-of-00006-4030672591c2f478.parquet",
            "train-00002-of-00006-1779b7cec9462180.parquet",
            "train-00003-of-00006-2fa862bfed56af1f.parquet",
            "train-00004-of-00006-18f4bdd50c103e71.parquet",
            "train-00005-of-00006-fe1acc5d10a9f0e2.parquet"
        ]
        # Load a single randomly chosen parquet shard instead of the whole dataset
        sample_file = random.choice(data_files)
        print(f"Sampling from {sample_file}")
        data_files = {"train": base_url + sample_file}
        parquet_sample = load_dataset("parquet", data_files=data_files, split="train")
        df_sample = parquet_sample.to_pandas().sample(n_samples)
        print(f"Retrieved {len(df_sample)} random conversations from lmsys/lmsys-chat-1m/{sample_file}")
        self.df_sample = df_sample
        # Preview thresholds (4 rows, head/tail of 2) are assumptions
        if self.verbose and len(df_sample) > 4:
            display(df_sample.head(2))
            print('...')
            display(df_sample.tail(2))
        return df_sample

    def add_turns_to_conversations(self):
        """
        Adds 'turn' keys to each conversation in the 'conversation' column of the dataframe.
        """
        self.df_sample['conversation'] = self.df_sample['conversation'].apply(
            lambda conv: Conversation(conv).add_turns()
        )
        df_with_turns = self.df_sample
        return df_with_turns

    def unwrap_turns(self):
        """
        Creates a dataframe where each row corresponds to a pair of user-assistant messages in a conversation and turn.
        The 'prompt' column contains the user's message, and the 'response' column contains the assistant's message.
        Each row includes a 'turn_id' column, which numbers the turns uniquely per conversation.
        """
        paired_data = []
        for _, row in self.df_sample.iterrows():
            conversation_id = row['conversation_id']
            row_data = row.to_dict()
            row_data.pop('conversation')  # Drop the nested messages; they are flattened below

            current_prompt = None
            turn_id = None

            for message in row['conversation']:
                if message['role'] == 'user':
                    current_prompt = message['content']
                    # Conversation ID followed by the zero-padded turn number
                    turn_id = f"{conversation_id}{message['turn']:03}"
                elif message['role'] == 'assistant' and current_prompt is not None:
                    paired_row = {
                        **row_data,
                        'turn_id': turn_id,
                        'prompt': current_prompt,
                        'response': message['content']
                    }
                    paired_data.append(paired_row)
                    current_prompt = None

        unwrapped_turns_df = pd.DataFrame(paired_data)
        unwrapped_turns_df.rename(columns={'turn': 'conversation_turns'}, inplace=True)
        self.unwrapped_turns_df = unwrapped_turns_df
        return unwrapped_turns_df

    def extract_prompts(self, filter_language=None, min_char_length=20, max_char_length=500, exclusions=None):
        """
        Extracts user prompts from the sample dataframe, optionally filtering by language and limiting the character length.

        Parameters:
        - filter_language (list of str or None): A list of specific languages to filter prompts by. If None, no language filter is applied.
          Examples of valid values include ['English'], ['English', 'Portuguese'], or ['Spanish', 'French', 'German'].
        - min_char_length (int): The minimum character length for user prompts to include. Defaults to 20.
        - max_char_length (int): The maximum character length for user prompts to include. Defaults to 500.
        - exclusions (str or None): Path to a text file containing phrases. Prompts containing any of these phrases will be excluded from the results.
          If None, no exclusions are applied.

        Returns:
        - pd.DataFrame: A DataFrame containing extracted prompts with columns 'prompt' and 'language'.
        """
        df_sample = self.df_sample
        if filter_language:
            extracted_data = df_sample[df_sample['language'].isin(filter_language)].apply(
                lambda row: [
                    {'content': entry['content'], 'language': row['language']}
                    for entry in row['conversation']
                    if entry['role'] == 'user' and min_char_length <= len(entry['content']) <= max_char_length
                ],
                axis=1
            ).explode().dropna()
        else:
            extracted_data = df_sample.apply(
                lambda row: [
                    {'content': entry['content'], 'language': row['language']}
                    for entry in row['conversation']
                    if entry['role'] == 'user' and min_char_length <= len(entry['content']) <= max_char_length
                ],
                axis=1
            ).explode().dropna()

        df_prompts = pd.DataFrame(extracted_data.tolist())
        df_prompts.rename(columns={'content': 'prompt'}, inplace=True)

        orig_length = len(df_prompts)
        if exclusions:
            # One exclusion phrase per line
            with open(exclusions, 'r') as f:
                exclusions = [line.strip() for line in f.readlines()]
            df_prompts = df_prompts[
                ~df_prompts['prompt'].apply(lambda x: any(exclusion in x for exclusion in exclusions))
            ]
            print(f"Excluded {orig_length - len(df_prompts)} prompts.")

        self.df_prompts = df_prompts
        # Preview thresholds (4 rows, head/tail of 2) are assumptions
        if self.verbose and len(df_prompts) > 4:
            display(df_prompts.head(2))
            print('...')
            display(df_prompts.tail(2))
        return df_prompts

    def extract_prompt_sample(self):
        prompt_sample = self.df_prompts.sample(1)['prompt'].values[0]
        if self.verbose:
            wrapped_message = textwrap.fill(prompt_sample, width=120)
            print(wrapped_message)
        return prompt_sample

    def search_conversations(self, search_term):
        """
        Searches the dataset for a given string and returns a DataFrame with matching records.

        Parameters:
        - search_term (str): The string to search for in the dataset.

        Returns:
        - pd.DataFrame: A DataFrame containing conversations where the search term is found.
        """
        if self.streaming:
            raise ValueError("Search is not supported in streaming mode.")

        df = self.lmsys_dataset['train'].to_pandas()
        # Case-insensitive substring match over every message in each conversation
        matching_records = df[df['conversation'].apply(
            lambda conv: any(search_term.lower() in message['content'].lower() for message in conv)
        )]
        if self.verbose:
            print(f"Found {len(matching_records)} matching conversations for search term: '{search_term}'")
        return matching_records

    def print_language_counts(self, df):
        language_counts = df['language'].value_counts()
        print("Language Record Counts:")
        print(language_counts.to_frame('Count').reset_index().rename(columns={'index': 'Language'}))


class Conversation:
    def __init__(self, conversation_data):
        """
        Initializes the Conversation object with the conversation data.

        Parameters:
        - conversation_data (list): A list of dictionaries representing a conversation.
        """
        self.conversation_data = conversation_data

    def add_turns(self):
        """
        Adds a 'turn' key to each dictionary in the conversation,
        identifying the turn (pair of user and assistant messages).

        Returns:
        - list: The updated conversation with 'turn' keys added.
        """
        turn_counter = 0
        for message in self.conversation_data:
            # Each user message opens a new turn; the reply that follows shares its number
            if message['role'] == 'user':
                turn_counter += 1
            message['turn'] = turn_counter
        return self.conversation_data

    def pretty_print(self, user_prefix, assistant_prefix, width=80):
        """
        Prints the conversation with specified prefixes and wrapped text.

        Parameters:
        - user_prefix (str): Prefix to prepend to user messages.
        - assistant_prefix (str): Prefix to prepend to assistant messages.
        - width (int): Maximum characters per line for wrapping.
        """
        wrapper = textwrap.TextWrapper(width=width)
        for message in self.conversation_data:
            if message['role'] == 'user':
                prefix = user_prefix
            elif message['role'] == 'assistant':
                prefix = assistant_prefix
            else:
                continue  # Skip messages with any other role
            # Wrap each line separately so existing line breaks are preserved
            wrapped_content = "\n".join(
                wrapper.fill(line) for line in message['content'].splitlines()
            )
            print(f"{prefix} {wrapped_content}\n")