import atexit
import logging
import os
import time
from concurrent.futures import Future
from dataclasses import dataclass
from io import SEEK_END, SEEK_SET, BytesIO
from pathlib import Path
from threading import Lock, Thread
from typing import Dict, List, Optional, Union

from huggingface_hub.hf_api import DEFAULT_IGNORE_PATTERNS, CommitInfo, CommitOperationAdd, HfApi
from huggingface_hub.utils import filter_repo_objects

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class _FileToUpload:
    """Temporary dataclass to store info about files to upload. Not meant to be used directly."""

    local_path: Path
    path_in_repo: str
    size_limit: int
    last_modified: float


class CommitScheduler:
    """
    Scheduler to upload a local folder to the Hub at regular intervals (e.g. push to hub every 5 minutes).

    The recommended way to use the scheduler is to use it as a context manager. This ensures that the scheduler is
    properly stopped and the last commit is triggered when the script ends. The scheduler can also be stopped manually
    with the `stop` method. Check out the [upload guide](https://huggingface.co/docs/huggingface_hub/guides/upload#scheduled-uploads)
    to learn more about how to use it.

    Args:
        repo_id (`str`):
            The id of the repo to commit to.
        folder_path (`str` or `Path`):
            Path to the local folder to upload regularly.
        every (`int` or `float`, *optional*):
            The number of minutes between each commit. Defaults to 5 minutes.
        path_in_repo (`str`, *optional*):
            Relative path of the directory in the repo, for example: `"checkpoints/"`. Defaults to the root folder
            of the repository.
        repo_type (`str`, *optional*):
            The type of the repo to commit to. Defaults to `model`.
        revision (`str`, *optional*):
            The revision of the repo to commit to. Defaults to `main`.
        private (`bool`, *optional*):
            Whether to make the repo private. If `None` (default), the repo will be public unless the organization's
            default is private. This value is ignored if the repo already exists.
        token (`str`, *optional*):
            The token to use to commit to the repo. Defaults to the token saved on the machine.
        allow_patterns (`List[str]` or `str`, *optional*):
            If provided, only files matching at least one pattern are uploaded.
        ignore_patterns (`List[str]` or `str`, *optional*):
            If provided, files matching any of the patterns are not uploaded.
        squash_history (`bool`, *optional*):
            Whether to squash the history of the repo after each commit. Defaults to `False`. Squashing commits is
            useful to avoid degraded performance on the repo when it grows too large.
        hf_api (`HfApi`, *optional*):
            The [`HfApi`] client to use to commit to the Hub. Can be set with custom settings (user agent, token,...).

    Example:
    ```py
    >>> from pathlib import Path
    >>> from huggingface_hub import CommitScheduler

    # Scheduler uploads every 10 minutes
    >>> csv_path = Path("watched_folder/data.csv")
    >>> CommitScheduler(repo_id="test_scheduler", repo_type="dataset", folder_path=csv_path.parent, every=10)

    >>> with csv_path.open("a") as f:
    ...     f.write("first line")

    # Some time later (...)
    >>> with csv_path.open("a") as f:
    ...     f.write("second line")
    ```

    Example using a context manager:
    ```py
    >>> from pathlib import Path
    >>> from huggingface_hub import CommitScheduler

    >>> with CommitScheduler(repo_id="test_scheduler", repo_type="dataset", folder_path="watched_folder", every=10) as scheduler:
    ...     csv_path = Path("watched_folder/data.csv")
    ...     with csv_path.open("a") as f:
    ...         f.write("first line")
    ...     (...)
    ...     with csv_path.open("a") as f:
    ...         f.write("second line")

    # Scheduler is now stopped and last commit has been triggered
    ```
    """

    def __init__(
        self,
        *,
        repo_id: str,
        folder_path: Union[str, Path],
        every: Union[int, float] = 5,
        path_in_repo: Optional[str] = None,
        repo_type: Optional[str] = None,
        revision: Optional[str] = None,
        private: Optional[bool] = None,
        token: Optional[str] = None,
        allow_patterns: Optional[Union[List[str], str]] = None,
        ignore_patterns: Optional[Union[List[str], str]] = None,
        squash_history: bool = False,
        hf_api: Optional["HfApi"] = None,
    ) -> None:
        self.api = hf_api or HfApi(token=token)

        # Folder
        self.folder_path = Path(folder_path).expanduser().resolve()
        self.path_in_repo = path_in_repo or ""
        self.allow_patterns = allow_patterns

        if ignore_patterns is None:
            ignore_patterns = []
        elif isinstance(ignore_patterns, str):
            ignore_patterns = [ignore_patterns]
        self.ignore_patterns = ignore_patterns + DEFAULT_IGNORE_PATTERNS

        if self.folder_path.is_file():
            raise ValueError(f"'folder_path' must be a directory, not a file: '{self.folder_path}'.")
        self.folder_path.mkdir(parents=True, exist_ok=True)

        # Repository
        repo_url = self.api.create_repo(repo_id=repo_id, private=private, repo_type=repo_type, exist_ok=True)
        self.repo_id = repo_url.repo_id
        self.repo_type = repo_type
        self.revision = revision
        self.token = token

        # Keep track of already uploaded files (key is local path, value is last-modified timestamp)
        self.last_uploaded: Dict[Path, float] = {}

        # Scheduler
        if not every > 0:
            raise ValueError(f"'every' must be a positive integer, not '{every}'.")
        self.lock = Lock()
        self.every = every
        self.squash_history = squash_history

        logger.info(f"Scheduled job to push '{self.folder_path}' to '{self.repo_id}' every {self.every} minutes.")
        self._scheduler_thread = Thread(target=self._run_scheduler, daemon=True)
        self._scheduler_thread.start()
        atexit.register(self._push_to_hub)

        self.__stopped = False

    def stop(self) -> None:
        """Stop the scheduler.

        A stopped scheduler cannot be restarted. Mostly for test purposes.
        """
        self.__stopped = True

    def __enter__(self) -> "CommitScheduler":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Upload last changes before exiting
        self.trigger().result()
        self.stop()
        return

    def _run_scheduler(self) -> None:
        """Dumb thread waiting between each scheduled push to Hub."""
        while True:
            self.last_future = self.trigger()
            time.sleep(self.every * 60)
            if self.__stopped:
                break

    def trigger(self) -> Future:
        """Trigger a `push_to_hub` and return a future.

        This method is automatically called every `every` minutes. You can also call it manually to trigger a commit
        immediately, without waiting for the next scheduled commit.
        """
        return self.api.run_as_future(self._push_to_hub)

    def _push_to_hub(self) -> Optional[CommitInfo]:
        if self.__stopped:  # If stopped, already scheduled commits are ignored
            return None

        logger.info("(Background) scheduled commit triggered.")
        try:
            value = self.push_to_hub()
            if self.squash_history:
                logger.info("(Background) squashing repo history.")
                self.api.super_squash_history(repo_id=self.repo_id, repo_type=self.repo_type, branch=self.revision)
            return value
        except Exception as e:
            logger.error(f"Error while pushing to Hub: {e}")  # Depending on the setup, the error might be silenced
            raise

    def push_to_hub(self) -> Optional[CommitInfo]:
        """
        Push folder to the Hub and return the commit info.

        This method is not meant to be called directly. It is run in the background by the scheduler, respecting a
        queue mechanism to avoid concurrent commits. Making a direct call to the method might lead to concurrency
        issues.

        The default behavior of `push_to_hub` is to assume an append-only folder. It lists all files in the folder and
        uploads only changed files. If no changes are found, the method returns without committing anything. If you
        want to change this behavior, you can inherit from [`CommitScheduler`] and override this method. This can be
        useful for example to compress data together in a single file before committing. For more details and
        examples, check out our [integration guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads).
        """
        # Check files to upload (with lock)
        with self.lock:
            logger.debug("Listing files to upload for scheduled commit.")

            # List files from folder
            relpath_to_abspath = {
                path.relative_to(self.folder_path).as_posix(): path
                for path in sorted(self.folder_path.glob("**/*"))  # sorted to be deterministic
                if path.is_file()
            }
            prefix = f"{self.path_in_repo.strip('/')}/" if self.path_in_repo else ""

            # Filter with patterns, drop unchanged files and record the current file size
            files_to_upload: List[_FileToUpload] = []
            for relpath in filter_repo_objects(
                relpath_to_abspath.keys(), allow_patterns=self.allow_patterns, ignore_patterns=self.ignore_patterns
            ):
                local_path = relpath_to_abspath[relpath]
                stat = local_path.stat()
                if self.last_uploaded.get(local_path) is None or self.last_uploaded[local_path] != stat.st_mtime:
                    files_to_upload.append(
                        _FileToUpload(
                            local_path=local_path,
                            path_in_repo=prefix + relpath,
                            size_limit=stat.st_size,
                            last_modified=stat.st_mtime,
                        )
                    )

        # Return if nothing to upload
        if len(files_to_upload) == 0:
            logger.debug("Dropping schedule commit: no changed file to upload.")
            return None

        # Convert `_FileToUpload` to `CommitOperationAdd`
        logger.debug("Removing unchanged files since previous scheduled commit.")
        add_operations = [
            CommitOperationAdd(
                # Cap the file to the size recorded above, even if the user appends data during the commit
                path_or_fileobj=PartialFileIO(file_to_upload.local_path, size_limit=file_to_upload.size_limit),
                path_in_repo=file_to_upload.path_in_repo,
            )
            for file_to_upload in files_to_upload
        ]

        # Upload files (append mode expected, so no lock is needed)
        logger.debug("Uploading files for scheduled commit.")
        commit_info = self.api.create_commit(
            repo_id=self.repo_id,
            repo_type=self.repo_type,
            operations=add_operations,
            commit_message="Scheduled Commit",
            revision=self.revision,
        )

        # Successful commit: keep track of the latest "last_modified" for each file
        for file in files_to_upload:
            self.last_uploaded[file.local_path] = file.last_modified
        return commit_info


class PartialFileIO(BytesIO):
    """A file-like object that reads only the first part of a file.

    Useful to upload a file to the Hub when the user might still be appending data to it. Only the first part of the
    file is uploaded (i.e. the part that was available when the filesystem was first scanned).

    In practice, only used internally by the CommitScheduler to regularly push a folder to the Hub with minimal
    disturbance for the user. The object is passed to `CommitOperationAdd`.

    Only supports `read`, `tell` and `seek` methods.

    Args:
        file_path (`str` or `Path`):
            Path to the file to read.
        size_limit (`int`):
            The maximum number of bytes to read from the file. If the file is larger than this, only the first part
            will be read (and uploaded).
    """

    def __init__(self, file_path: Union[str, Path], size_limit: int) -> None:
        self._file_path = Path(file_path)
        self._file = self._file_path.open("rb")
        self._size_limit = min(size_limit, os.fstat(self._file.fileno()).st_size)

    def __del__(self) -> None:
        self._file.close()
        return super().__del__()

    def __repr__(self) -> str:
        return f"<PartialFileIO file_path={self._file_path} size_limit={self._size_limit}>"

    def __len__(self) -> int:
        return self._size_limit

    def __getattribute__(self, name: str):
        if name.startswith("_") or name in ("read", "tell", "seek"):  # only 3 public methods are supported
            return super().__getattribute__(name)
        raise NotImplementedError(f"PartialFileIO does not support '{name}'.")

    def tell(self) -> int:
        """Return the current file position."""
        return self._file.tell()

    def seek(self, __offset: int, __whence: int = SEEK_SET) -> int:
        """Change the stream position to the given offset.

        Behavior is the same as a regular file, except that the position is capped to the size limit.
        """
        if __whence == SEEK_END:
            # SEEK_END: the offset is relative to the truncated end
            __offset = len(self) + __offset
            __whence = SEEK_SET

        pos = self._file.seek(__offset, __whence)
        if pos > self._size_limit:
            return self._file.seek(self._size_limit)
        return pos

    def read(self, __size: Optional[int] = -1) -> bytes:
        """Read at most `__size` bytes from the file.

        Behavior is the same as a regular file, except that it is capped to the size limit.
        """
        current = self._file.tell()
        if __size is None or __size < 0:
            # Read until the size limit
            truncated_size = self._size_limit - current
        else:
            # Read until the size limit or `__size`, whichever comes first
            truncated_size = min(__size, self._size_limit - current)
        return self._file.read(truncated_size)
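
# Illustration only, not part of the module: a minimal, self-contained sketch of the
# size-capping idea that the `PartialFileIO` docstring describes (upload only the part of
# a file that existed when the folder was scanned). `CappedReader` and
# `snapshot_then_append` are hypothetical names introduced for this example; the sketch
# uses only the standard library and makes no claim about how the Hub client consumes
# the file object.

```python
import os
import tempfile
from pathlib import Path


class CappedReader:
    """Hypothetical stand-in: read at most `size_limit` bytes of a file."""

    def __init__(self, file_path, size_limit):
        self._file = Path(file_path).open("rb")
        # Never promise more bytes than the file holds right now.
        self._size_limit = min(size_limit, os.fstat(self._file.fileno()).st_size)

    def read(self, size=-1):
        # Bytes left before the cap, measured from the current position.
        remaining = self._size_limit - self._file.tell()
        if size is None or size < 0:
            size = remaining
        return self._file.read(max(0, min(size, remaining)))

    def close(self):
        self._file.close()


def snapshot_then_append():
    """Record a file's size, append to the file, then read only the recorded part."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.csv"
        path.write_bytes(b"first line\n")
        size_at_scan = path.stat().st_size  # size recorded during the folder scan

        with path.open("ab") as f:  # the user keeps appending after the scan
            f.write(b"second line\n")

        reader = CappedReader(path, size_limit=size_at_scan)
        try:
            return reader.read()
        finally:
            reader.close()
```

# Calling `snapshot_then_append()` returns only the bytes written before the size was
# recorded; the appended line is left for a later read.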