from PIL import Image
from io import BytesIO
import base64
import torch
import math
import ast

from transformers import StoppingCriteria
from llava.constants import IMAGE_TOKEN_INDEX


def select_best_resolution(original_size, possible_resolutions):
    """
    Selects the best resolution from a list of possible resolutions based on the original size.

    Args:
        original_size (tuple): The original size of the image in the format (width, height).
        possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].

    Returns:
        tuple: The best fit resolution in the format (width, height).
    """
    original_width, original_height = original_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float('inf')

    for width, height in possible_resolutions:
        # Downscale the original image so it fits inside the candidate resolution.
        scale = min(width / original_width, height / original_height)
        downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
        # Effective resolution: how many original pixels survive the downscale.
        effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
        wasted_resolution = (width * height) - effective_resolution

        if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (width, height)

    return best_fit


def resize_and_pad_image(image, target_resolution):
    """
    Resize and pad an image to a target resolution while maintaining aspect ratio.

    Args:
        image (PIL.Image.Image): The input image.
        target_resolution (tuple): The target resolution (width, height) of the image.

    Returns:
        PIL.Image.Image: The resized and padded image.
    """
    original_width, original_height = image.size
    target_width, target_height = target_resolution

    scale_w = target_width / original_width
    scale_h = target_height / original_height

    if scale_w < scale_h:
        new_width = target_width
        new_height = min(math.ceil(original_height * scale_w), target_height)
    else:
        new_height = target_height
        new_width = min(math.ceil(original_width * scale_h), target_width)

    # Resize the image, then paste it centered on a black canvas of the target size.
    resized_image = image.resize((new_width, new_height))

    new_image = Image.new('RGB', (target_width, target_height), (0, 0, 0))
    paste_x = (target_width - new_width) // 2
    paste_y = (target_height - new_height) // 2
    new_image.paste(resized_image, (paste_x, paste_y))

    return new_image


def divide_to_patches(image, patch_size):
    """
    Divides an image into patches of a specified size.

    Args:
        image (PIL.Image.Image): The input image.
        patch_size (int): The size of each patch.

    Returns:
        list: A list of PIL.Image.Image objects representing the patches.
    """
    patches = []
    width, height = image.size
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            box = (j, i, j + patch_size, i + patch_size)
            patch = image.crop(box)
            patches.append(patch)

    return patches


def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    """
    Calculate the shape of the image patch grid after the preprocessing for images of any resolution.

    Args:
        image_size (tuple): The size of the input image in the format (width, height).
        grid_pinpoints (str): A string representation of a list of possible resolutions.
        patch_size (int): The size of each image patch.

    Returns:
        tuple: The shape of the image patch grid in the format (width, height).
    """
    if type(grid_pinpoints) is list:
        possible_resolutions = grid_pinpoints
    else:
        possible_resolutions = ast.literal_eval(grid_pinpoints)
    width, height = select_best_resolution(image_size, possible_resolutions)
    return width // patch_size, height // patch_size


def process_anyres_image(image, processor, grid_pinpoints):
    """
    Process an image with variable resolutions.

    Args:
        image (PIL.Image.Image): The input image to be processed.
        processor: The image processor object.
        grid_pinpoints (str): A string representation of a list of possible resolutions.

    Returns:
        torch.Tensor: A tensor containing the processed image patches.
    """
    if type(grid_pinpoints) is list:
        possible_resolutions = grid_pinpoints
    else:
        possible_resolutions = ast.literal_eval(grid_pinpoints)
    best_resolution = select_best_resolution(image.size, possible_resolutions)
    image_padded = resize_and_pad_image(image, best_resolution)

    patches = divide_to_patches(image_padded, processor.crop_size['height'])

    # Keep a globally downscaled view of the image alongside the high-resolution patches.
    image_original_resize = image.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))

    image_patches = [image_original_resize] + patches
    image_patches = [processor.preprocess(image_patch, return_tensors='pt')['pixel_values'][0]
                     for image_patch in image_patches]
    return torch.stack(image_patches, dim=0)


def load_image_from_base64(image):
    """Decode a base64-encoded image string into a PIL image."""
    return Image.open(BytesIO(base64.b64decode(image)))


def expand2square(pil_img, background_color):
    """Pad a PIL image to a square canvas, centering it on the given background color."""
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
def process_images(images, image_processor, model_cfg):
    """Preprocess a list of PIL images according to the model's image_aspect_ratio setting."""
    image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", None)
    new_images = []
    # Prompt templates present in the original module, used in the 'pad' branch below.
    task = [
        "Describe in detail what is shown in the image.",
        "What is the text in the image?",
        "Locate the objects in the image, with their descriptions.",
    ]
    if image_aspect_ratio == 'pad':
        for image in images:
            image = expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean))
            # NOTE: only the argument names images, image_mean, image_std, padding and
            # return_tensors are recoverable here; passing the prompts as `text=` and
            # using `padding=True` are assumptions.
            image = image_processor.preprocess(
                text=task,
                images=image,
                return_tensors='pt',
                image_mean=image_processor.image_mean,
                image_std=image_processor.image_std,
                padding=True,
            )['pixel_values'][0]
            new_images.append(image)
    elif image_aspect_ratio == "anyres":
        for image in images:
            image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints)
            new_images.append(image)
    else:
        return image_processor(images, return_tensors='pt')['pixel_values']
    # Stack into a single tensor only when every image produced the same shape
    # (anyres images may yield a variable number of patches).
    if all(x.shape == new_images[0].shape for x in new_images):
        new_images = torch.stack(new_images, dim=0)
    return new_images


def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
    """Tokenize a prompt, replacing each '<image>' placeholder with the image token index."""
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]

    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

    input_ids = []
    offset = 0
    # Keep a single BOS token at the start if the tokenizer emitted one.
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])

    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])

    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids


def get_model_name_from_path(model_path):
    """Derive a readable model name from a local path, keeping checkpoint suffixes."""
    model_path = model_path.strip("/")
    model_paths = model_path.split("/")
    if model_paths[-1].startswith('checkpoint-'):
        return model_paths[-2] + "_" + model_paths[-1]
    else:
        return model_paths[-1]


class KeywordsStoppingCriteria(StoppingCriteria):
    """Stop generation as soon as any of the given keywords appears in the output."""

    def __init__(self, keywords, tokenizer, input_ids):
        self.keywords = keywords
        self.keyword_ids = []
        self.max_keyword_len = 0
        for keyword in keywords:
            cur_keyword_ids = tokenizer(keyword).input_ids
            # Drop a leading BOS token so the keyword ids match raw generated ids.
            if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
                cur_keyword_ids = cur_keyword_ids[1:]
            if len(cur_keyword_ids) > self.max_keyword_len:
                self.max_keyword_len = len(cur_keyword_ids)
            self.keyword_ids.append(torch.tensor(cur_keyword_ids))
        self.tokenizer = tokenizer
        self.start_len = input_ids.shape[1]

    def call_for_batch(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        offset = min(output_ids.shape[1] - self.start_len, self.max_keyword_len)
        self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
        # Fast path: compare the tail of the generated ids against each keyword's ids.
        for keyword_id in self.keyword_ids:
            truncated_output_ids = output_ids[0, -keyword_id.shape[0]:]
            if torch.equal(truncated_output_ids, keyword_id):
                return True
        # Fallback: decode the recent tokens and search for the keyword as text.
        outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
        for keyword in self.keywords:
            if keyword in outputs:
                return True
        return False

    def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        outputs = []
        for i in range(output_ids.shape[0]):
            outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
        return all(outputs)
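

# ---------------------------------------------------------------------------
# Illustrative usage sketch (not part of the original module): a minimal demo
# of the pure helpers above that needs no tokenizer or HuggingFace image
# processor. The pinpoint list, image sizes, patch size of 336 and the model
# path below are made-up example values, not project defaults.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    # Hypothetical anyres resolutions, written as the string form a config might store.
    grid_pinpoints = "[(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]"

    # Pick the candidate resolution that preserves the most pixels for an 800x600 input.
    best = select_best_resolution((800, 600), ast.literal_eval(grid_pinpoints))
    print("best resolution:", best)

    # Patch-grid shape for an assumed 336-pixel vision-tower patch size.
    print("grid shape:", get_anyres_image_grid_shape((800, 600), grid_pinpoints, 336))

    # Pad a non-square dummy image to a square, and cut the padded anyres image into patches.
    dummy = Image.new("RGB", (800, 600), (127, 127, 127))
    square = expand2square(dummy, (0, 0, 0))
    patches = divide_to_patches(resize_and_pad_image(dummy, best), 336)
    print("square size:", square.size, "num patches:", len(patches))

    # Model-name helper on a hypothetical checkpoint directory.
    print(get_model_name_from_path("/models/llava-example-7b/checkpoint-1000"))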