Cache-system reference

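The cache system was updated in v0.8.0 to become the central cache system shared by libraries that depend on the Hub. For a detailed introduction to caching at HF, read the cache-system guide.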

Helpers

try_to_load_from_cache

huggingface_hub.try_to_load_from_cache

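( repo_id: str, filename: str, cache_dir: typing.Union[str, pathlib.Path, NoneType] = None, revision: typing.Optional[str] = None, repo_type: typing.Optional[str] = None ) → Optional[str] or _CACHED_NO_EXIST

Parameters

cache_dir (str or os.PathLike) — The folder where the cached files lie.
repo_id (str) — The ID of the repo on huggingface.co.
filename (str) — The filename to look for inside repo_id.
revision (str, optional) — The specific model version to use. Defaults to "main" if it is not provided and no commit_hash is provided either.
repo_type (str, optional) — The type of the repository. Defaults to "model".

Returns

Optional[str] or _CACHED_NO_EXIST

Returns None if the file is not cached. Otherwise:

the exact path to the cached file, if the file was found in the cache;
the special value _CACHED_NO_EXIST, if the file does not exist at the given commit hash and this fact was cached.

Explores the cache to return the latest cached file for a given revision, if found.

This function does not raise any exception if the file is not cached.

Example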

from huggingface_hub import try_to_load_from_cache, _CACHED_NO_EXIST

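# Illustrative arguments: substitute your own repo ID and filename.
filepath = try_to_load_from_cache(repo_id="bert-base-cased", filename="config.json")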
if isinstance(filepath, str):
    # file exists and is cached
    ...
elif filepath is _CACHED_NO_EXIST:
    # non-existence of file is cached
    ...
else:
    # file is not cached
    ...
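
The three outcomes can be reproduced with a small stdlib-only sketch. It assumes the hub/ layout shown in the expected tree of this reference (refs/ and snapshots/ under models--<id>) plus a `.no_exist/` folder used to record missing files. The names `lookup` and `NO_EXIST` and the `.no_exist/` folder are assumptions made for illustration, not the library's implementation.

```python
import tempfile
from pathlib import Path

NO_EXIST = object()  # hypothetical sentinel, standing in for _CACHED_NO_EXIST


def lookup(cache_dir, repo_id, filename, revision="main", repo_type="model"):
    """Return a path (str), NO_EXIST, or None. Illustrative only."""
    repo = Path(cache_dir) / f"{repo_type}s--{repo_id.replace('/', '--')}"
    ref_file = repo / "refs" / revision
    if ref_file.is_file():
        # A ref such as "main" resolves to a commit hash.
        revision = ref_file.read_text().strip()
    if (repo / ".no_exist" / revision / filename).is_file():
        return NO_EXIST  # assumed marker: the file is known to be absent
    cached = repo / "snapshots" / revision / filename
    return str(cached) if cached.is_file() else None


# Build a tiny fake cache to exercise the three branches.
root = Path(tempfile.mkdtemp())
repo = root / "models--julien-c--EsperBERTo-small"
commit = "2439f60ef33a0d46d85da5001d52aeda5b00ce9f"
(repo / "refs").mkdir(parents=True)
(repo / "refs" / "main").write_text(commit)
(repo / "snapshots" / commit).mkdir(parents=True)
(repo / "snapshots" / commit / "config.json").write_text("{}")
(repo / ".no_exist" / commit).mkdir(parents=True)
(repo / ".no_exist" / commit / "missing.bin").touch()

found = lookup(root, "julien-c/EsperBERTo-small", "config.json")
absent = lookup(root, "julien-c/EsperBERTo-small", "missing.bin")
unknown = lookup(root, "julien-c/EsperBERTo-small", "other.txt")
```

Checking `is NO_EXIST` before falling through to `None` mirrors the `isinstance` / `is _CACHED_NO_EXIST` / `else` order used in the example above.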

cached_assets_path

huggingface_hub.cached_assets_path

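( library_name: str, namespace: str = 'default', subfolder: str = 'default', assets_dir: typing.Union[str, pathlib.Path, NoneType] = None )

Parameters

library_name (str) — Name of the library that will manage the cache folder. Example: "datasets".
namespace (str, optional, defaults to "default") — Namespace to which the data belongs. Example: "SQuAD".
subfolder (str, optional, defaults to "default") — Subfolder in which the data will be stored. Example: "extracted".
assets_dir (str, Path, optional) — Path to the folder where assets are cached. This must not be the same folder where Hub files are cached. Defaults to HF_HOME / "assets" if not provided. Can also be set with the HF_ASSETS_CACHE environment variable.

Returns a folder path in which to cache arbitrary files.

huggingface_hub provides a canonical folder path to store assets. This is the recommended way to integrate caching in a downstream library, as it benefits from the built-in tools to scan and delete the cache properly.

A distinction is made between files cached from the Hub and assets. Files from the Hub are cached in a Git-aware manner and entirely managed by huggingface_hub; see the related documentation. All other files that a downstream library caches are considered "assets" (files downloaded from external sources, files extracted from a .tar archive, files preprocessed for training, and so on).

Once the folder path is generated, it is guaranteed to exist and to be a directory. The path is based on three levels of depth: the library name, a namespace, and a subfolder. These three levels give flexibility while letting huggingface_hub know which folders to expect when scanning or deleting parts of the assets cache. Within a library, all namespaces are expected to share the same subset of subfolder names, but this is not a mandatory rule. The downstream library then has full control over the file structure it adopts within its cache. Namespace and subfolder are optional (they default to a "default/" subfolder), but the library name is mandatory, since each downstream library should manage its own cache.

Expected tree: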

    assets/
    ├── datasets/
    │   ├── SQuAD/
    │   │   ├── downloaded/
    │   │   ├── extracted/
    │   │   └── processed/
    │   └── Helsinki-NLP--tatoeba_mt/
    │       ├── downloaded/
    │       ├── extracted/
    │       └── processed/
    └── transformers/
        ├── default/
        │   └── something/
        └── bert-base-cased/
            ├── default/
            └── training/
    hub/
    └── models--julien-c--EsperBERTo-small/
        ├── blobs/
        │   ├── (...)
        │   └── (...)
        ├── refs/
        │   └── (...)
        └── snapshots/
            ├── 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
            │   └── (...)
            └── bbc77c8132af1cc5cf678da3f1ddf2de43606d48/
                └── (...)

示例

>>> from huggingface_hub import cached_assets_path

>>> cached_assets_path(library_name="datasets", namespace="SQuAD", subfolder="download")
PosixPath('/home/wauplin/.cache/huggingface/assets/datasets/SQuAD/download')

>>> cached_assets_path(library_name="datasets", namespace="SQuAD", subfolder="extracted")
PosixPath('/home/wauplin/.cache/huggingface/assets/datasets/SQuAD/extracted')

>>> cached_assets_path(library_name="datasets", namespace="Helsinki-NLP/tatoeba_mt")
PosixPath('/home/wauplin/.cache/huggingface/assets/datasets/Helsinki-NLP--tatoeba_mt/default')

>>> cached_assets_path(library_name="datasets", assets_dir="/tmp/tmp123456")
PosixPath('/tmp/tmp123456/datasets/default/default')
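
The three-level composition described for cached_assets_path can be sketched with the standard library alone. This is a minimal illustration, not the library's code: `sketch_assets_path` is a hypothetical helper, and flattening "/" to "--" is inferred from the "Helsinki-NLP/tatoeba_mt" example rather than taken from a specification.

```python
import tempfile
from pathlib import Path


def sketch_assets_path(assets_dir, library_name, namespace="default", subfolder="default"):
    """Compose <assets_dir>/<library>/<namespace>/<subfolder> and create it."""
    # Assumption: "/" inside a level is flattened to "--", as the
    # "Helsinki-NLP/tatoeba_mt" example suggests.
    parts = [p.replace("/", "--") for p in (library_name, namespace, subfolder)]
    path = Path(assets_dir).joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)  # the folder is guaranteed to exist
    return path


base = Path(tempfile.mkdtemp())
p = sketch_assets_path(base, "datasets", namespace="Helsinki-NLP/tatoeba_mt")
```

Keeping exactly three levels means a scanner can treat every third-level folder as one deletable unit, whatever the downstream library stores inside it.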

scan_cache_dir

huggingface_hub.scan_cache_dir

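( cache_dir: typing.Union[str, pathlib.Path, NoneType] = None )

Parameters

cache_dir (str or Path, optional) — Cache directory to scan. Defaults to the default HF cache directory.

Raises

CacheNotFound or ValueError

CacheNotFound — If the cache directory does not exist.
ValueError — If the cache directory is a file instead of a directory.

Scans the entire HF cache system and returns an HFCacheInfo structure.

Use scan_cache_dir to scan your cache system programmatically. The cache is scanned repo by repo. If a repo is corrupted, a CorruptedCacheException is raised internally, then caught and returned in the HFCacheInfo structure. Only valid repos get a proper report.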

>>> from huggingface_hub import scan_cache_dir

>>> hf_cache_info = scan_cache_dir()
>>> hf_cache_info
HFCacheInfo(
    size_on_disk=3398085269,
    repos=frozenset({
        CachedRepoInfo(
            repo_id='t5-small',
            repo_type='model',
            repo_path=PosixPath(...),
            size_on_disk=970726914,
            nb_files=11,
            revisions=frozenset({
                CachedRevisionInfo(
                    commit_hash='d78aea13fa7ecd06c29e3e46195d6341255065d5',
                    size_on_disk=970726339,
                    snapshot_path=PosixPath(...),
                    files=frozenset({
                        CachedFileInfo(
                            file_name='config.json',
                            size_on_disk=1197,
                            file_path=PosixPath(...),
                            blob_path=PosixPath(...),
                        ),
                        CachedFileInfo(...),
                        ...
                    }),
                ),
                CachedRevisionInfo(...),
                ...
            }),
        ),
        CachedRepoInfo(...),
        ...
    }),
    warnings=[
        CorruptedCacheException("Snapshots dir doesn't exist in cached repo: ..."),
        CorruptedCacheException(...),
        ...
    ],
)

You can also print a detailed report directly from the hf command line:

> hf cache scan
REPO ID                     REPO TYPE SIZE ON DISK NB FILES REFS                LOCAL PATH
--------------------------- --------- ------------ -------- ------------------- -------------------------------------------------------------------------
glue                        dataset         116.3K       15 1.17.0, main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue
google/fleurs               dataset          64.9M        6 main, refs/pr/1     /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs
Jean-Baptiste/camembert-ner model           441.0M        7 main                /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner
bert-base-cased             model             1.9G       13 main                /Users/lucain/.cache/huggingface/hub/models--bert-base-cased
t5-base                     model            10.1K        3 main                /Users/lucain/.cache/huggingface/hub/models--t5-base
t5-small                    model           970.7M       11 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/models--t5-small

Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 warning(s) while scanning. Use -vvv to print details.

Returns: an HFCacheInfo object.
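
The "scan repo by repo, collect corruption as warnings" behavior can be illustrated with a stdlib-only sketch. It assumes the folder naming visible in the report above (for example datasets--google--fleurs); `CorruptedRepo`, `scan_repo`, and `scan` are hypothetical names, and the only corruption check shown is a missing snapshots folder.

```python
import tempfile
from pathlib import Path


class CorruptedRepo(Exception):
    """Hypothetical stand-in for CorruptedCacheException."""


def scan_repo(repo_path):
    # Folder names look like "<type>s--<org>--<name>".
    if "--" not in repo_path.name:
        raise CorruptedRepo(f"Unexpected folder name: {repo_path.name}")
    repo_type, repo_id = repo_path.name.split("--", 1)
    if not (repo_path / "snapshots").is_dir():
        raise CorruptedRepo(f"Snapshots dir doesn't exist in cached repo: {repo_path.name}")
    # Drop the plural "s" ("models" -> "model") and restore "/" in the ID.
    return repo_type[:-1], repo_id.replace("--", "/")


def scan(cache_dir):
    repos, warnings = [], []
    for repo_path in sorted(Path(cache_dir).iterdir()):
        try:
            repos.append(scan_repo(repo_path))
        except CorruptedRepo as err:
            warnings.append(err)  # recorded, and the scan continues
    return repos, warnings


root = Path(tempfile.mkdtemp())
(root / "models--t5-small" / "snapshots").mkdir(parents=True)
(root / "datasets--google--fleurs" / "snapshots").mkdir(parents=True)
(root / "models--t5-base").mkdir()  # corrupted: no snapshots folder

repos, warnings = scan(root)
```

Catching the exception per repo is what lets one damaged folder show up as a warning while the remaining repos are still reported.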

Data structures

All structures are built and returned by scan_cache_dir() and are immutable.

HFCacheInfo

class huggingface_hub.HFCacheInfo

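( size_on_disk: int, repos: typing.FrozenSet[huggingface_hub.utils._cache_manager.CachedRepoInfo], warnings: typing.List[huggingface_hub.errors.CorruptedCacheException] )

Parameters

size_on_disk (int) — Sum of the sizes of all valid repos in the cache system.
repos (FrozenSet[CachedRepoInfo]) — Set of CachedRepoInfo describing all valid cached repos found in the cache system at scan time.
warnings (List[CorruptedCacheException]) — List of CorruptedCacheException that occurred while scanning the cache. These exceptions are caught so that the scan can continue. Corrupted repos are skipped by the scan.

Frozen data structure holding information about the entire cache system.

This data structure is returned by scan_cache_dir() and is immutable.

Here size_on_disk is equal to the sum of all repo sizes (blobs only). However, if some cached repos are corrupted, their sizes are not taken into account.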

delete_revisions

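( *revisions: str )

Prepares the strategy to delete one or more revisions cached locally.

Input revisions can be any revision hashes. If a revision hash is not found in the local cache, a warning is logged but no error is raised. Revisions can come from different cached repos, since hashes are unique across repos.

Examples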

>>> from huggingface_hub import scan_cache_dir
>>> cache_info = scan_cache_dir()
>>> delete_strategy = cache_info.delete_revisions(
...     "81fd1d6e7847c99f5862c9fb81387956d99ec7aa"
... )
>>> print(f"Will free {delete_strategy.expected_freed_size_str}.")
Will free 7.9K.
>>> delete_strategy.execute()
Cache deletion done. Saved 7.9K.

>>> from huggingface_hub import scan_cache_dir
>>> scan_cache_dir().delete_revisions(
...     "81fd1d6e7847c99f5862c9fb81387956d99ec7aa",
...     "e2983b237dccf3ab4937c97fa717319a9ca1a96d",
...     "6c0e6080953db56375760c0471a8c5f2929baf11",
... ).execute()
Cache deletion done. Saved 8.6G.

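delete_revisions returns a DeleteCacheStrategy object that needs to be executed. The DeleteCacheStrategy is not meant to be modified, but it allows a dry run before the deletion is actually executed.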
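
To make the dry-run idea concrete, here is a hedged sketch of how a deletion plan could be computed before anything is removed. It assumes that a blob is freed only when no remaining revision still references it, which follows from blobs being shared between revisions but is not confirmed here as the library's exact rule. `PlanSketch`, `plan_deletion`, the shortened hashes, and the blob names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class PlanSketch:
    """Hypothetical, simplified stand-in for DeleteCacheStrategy."""
    blobs: FrozenSet[str]
    expected_freed_size: int


def plan_deletion(revisions: Dict[str, Dict[str, int]], to_delete: Set[str]) -> PlanSketch:
    """revisions maps commit hash -> {blob name: size in bytes}."""
    kept = set()
    for commit, blobs in revisions.items():
        if commit not in to_delete:
            kept.update(blobs)
    doomed = {}
    for commit in to_delete:
        # Unknown hashes are simply skipped in this sketch.
        for blob, size in revisions.get(commit, {}).items():
            if blob not in kept:
                doomed[blob] = size
    return PlanSketch(frozenset(doomed), sum(doomed.values()))


revisions = {
    "81fd1d6e": {"blob_a": 5_000, "blob_b": 2_900},
    "e2983b23": {"blob_a": 5_000, "blob_c": 700},
}
plan = plan_deletion(revisions, {"81fd1d6e", "unknown"})
```

Because the plan is a frozen value, it can be inspected (for instance to print the expected freed size) and then either executed or discarded without side effects.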

export_as_table

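( verbosity: int = 0 ) → str

Parameters

verbosity (int, optional) — The verbosity level. Defaults to 0.

Returns

str

The table as a string.

Generates a table from the HFCacheInfo object.

Pass verbosity=0 to get a table with a single row per repo, with columns "repo_id", "repo_type", "size_on_disk", "nb_files", "last_accessed", "last_modified", "refs", "local_path".

Pass verbosity=1 to get a table with one row per repo and revision (so a single repo can appear several times), with columns "repo_id", "repo_type", "revision", "size_on_disk", "nb_files", "last_modified", "refs", "local_path".

Example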

>>> from huggingface_hub import scan_cache_dir

>>> hf_cache_info = scan_cache_dir()
>>> hf_cache_info
HFCacheInfo(...)

>>> print(hf_cache_info.export_as_table())
REPO ID                                             REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH
--------------------------------------------------- --------- ------------ -------- ------------- ------------- ---- --------------------------------------------------------------------------------------------------
roberta-base                                        model             2.7M        5 1 day ago     1 week ago    main ~/.cache/huggingface/hub/models--roberta-base
suno/bark                                           model             8.8K        1 1 week ago    1 week ago    main ~/.cache/huggingface/hub/models--suno--bark
t5-base                                             model           893.8M        4 4 days ago    7 months ago  main ~/.cache/huggingface/hub/models--t5-base
t5-large                                            model             3.0G        4 5 weeks ago   5 months ago  main ~/.cache/huggingface/hub/models--t5-large

>>> print(hf_cache_info.export_as_table(verbosity=1))
REPO ID                                             REPO TYPE REVISION                                 SIZE ON DISK NB FILES LAST_MODIFIED REFS LOCAL PATH
--------------------------------------------------- --------- ---------------------------------------- ------------ -------- ------------- ---- -----------------------------------------------------------------------------------------------------------------------------------------------------
roberta-base                                        model     e2da8e2f811d1448a5b465c236feacd80ffbac7b         2.7M        5 1 week ago    main ~/.cache/huggingface/hub/models--roberta-base/snapshots/e2da8e2f811d1448a5b465c236feacd80ffbac7b
suno/bark                                           model     70a8a7d34168586dc5d028fa9666aceade177992         8.8K        1 1 week ago    main ~/.cache/huggingface/hub/models--suno--bark/snapshots/70a8a7d34168586dc5d028fa9666aceade177992
t5-base                                             model     a9723ea7f1b39c1eae772870f3b547bf6ef7e6c1       893.8M        4 7 months ago  main ~/.cache/huggingface/hub/models--t5-base/snapshots/a9723ea7f1b39c1eae772870f3b547bf6ef7e6c1
t5-large                                            model     150ebc2c4b72291e770f58e6057481c8d2ed331a         3.0G        4 5 months ago  main ~/.cache/huggingface/hub/models--t5-large/snapshots/150ebc2c4b72291e770f58e6057481c8d2ed331a

CachedRepoInfo

class huggingface_hub.CachedRepoInfo

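( repo_id: str, repo_type: typing.Literal['model', 'dataset', 'space'], repo_path: Path, size_on_disk: int, nb_files: int, revisions: typing.FrozenSet[huggingface_hub.utils._cache_manager.CachedRevisionInfo], last_accessed: float, last_modified: float )

Parameters

repo_id (str) — Repo ID of the repo on the Hub. Example: "google/fleurs".
repo_type (Literal["dataset", "model", "space"]) — Type of the cached repo.
repo_path (Path) — Local path to the cached repo.
size_on_disk (int) — Sum of the blob file sizes in the cached repo.
nb_files (int) — Total number of blob files in the cached repo.
revisions (FrozenSet[CachedRevisionInfo]) — Set of CachedRevisionInfo describing all revisions cached in the repo.
last_accessed (float) — Timestamp of the last time a blob file of the repo was accessed.
last_modified (float) — Timestamp of the last time a blob file of the repo was modified or created.

Frozen data structure holding information about a cached repository.

size_on_disk is not necessarily the sum of all revision sizes, because files can be duplicated across revisions. Besides, only blobs are taken into account, not the (negligible) size of folders and symlinks.

The reliability of last_accessed and last_modified can depend on the OS you are using. See the Python documentation for more details.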
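
A small worked example shows why a repo's size_on_disk need not equal the sum of its revision sizes. The blob names, sizes, and revision labels below are made up for illustration; the only assumption is that two revisions can reference the same blob.

```python
# Blob sizes in bytes, keyed by a hypothetical blob name.
blobs = {"blob_a": 1_000, "blob_b": 250, "blob_c": 300}

# Two revisions that share blob_a (a file unchanged between commits).
revisions = {
    "rev_1": {"blob_a", "blob_b"},
    "rev_2": {"blob_a", "blob_c"},
}

# Per-revision size: every blob the revision points to.
revision_sizes = {rev: sum(blobs[b] for b in used) for rev, used in revisions.items()}

# Repo size: each blob counted once, however many revisions use it.
unique_blobs = set().union(*revisions.values())
repo_size = sum(blobs[b] for b in unique_blobs)
```

Here the revisions weigh 1250 and 1300 bytes, yet the repo occupies 1550 bytes rather than 2550, because blob_a is stored once.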

size_on_disk_str

( )

(property) Sum of the blob file sizes as a human-readable string.

Example: "42.2K".
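
The human-readable strings can be approximated with a short formatter. This sketch assumes decimal (1000-based) units, which is consistent with the sizes shown in this reference (970726914 bytes reported as 970.7M, 3398085269 bytes as 3.4G); `format_size` is a hypothetical name, not the library's function.

```python
def format_size(num_bytes: int) -> str:
    """Sketch of a human-readable size with decimal (1000-based) units."""
    value = float(num_bytes)
    for unit in ("", "K", "M", "G", "T"):
        if abs(value) < 1000.0:
            return f"{value:.1f}{unit}"
        value /= 1000.0
    return f"{value:.1f}P"  # anything beyond terabytes


small = format_size(42_200)
repo = format_size(970_726_914)
total = format_size(3_398_085_269)
```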

refs

( )

(property) Mapping between refs and revision data structures.
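
One plausible way to derive such a mapping is to invert the set of refs carried by each revision. The sketch below uses a reduced, hypothetical `RevisionSketch` type and shortened hashes; it illustrates the idea and is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class RevisionSketch:
    """Hypothetical, reduced stand-in for CachedRevisionInfo."""
    commit_hash: str
    refs: FrozenSet[str]


def refs_mapping(revisions) -> Dict[str, RevisionSketch]:
    # Invert "revision -> refs" into "ref -> revision".
    return {ref: rev for rev in revisions for ref in rev.refs}


revisions = frozenset({
    RevisionSketch("d78aea13", frozenset({"main", "refs/pr/1"})),
    RevisionSketch("9338f7b6", frozenset()),  # detached: no ref points to it
})
mapping = refs_mapping(revisions)
```

A detached revision contributes no key, so it can only be reached through the revisions set, never through the mapping.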

CachedRevisionInfo

class huggingface_hub.CachedRevisionInfo

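( commit_hash: str, snapshot_path: Path, size_on_disk: int, files: typing.FrozenSet[huggingface_hub.utils._cache_manager.CachedFileInfo], refs: typing.FrozenSet[str], last_modified: float )

Parameters

commit_hash (str) — Hash of the revision (unique). Example: "9338f7b671827df886678df2bdd7cc7b4f36dffd".
snapshot_path (Path) — Path to the revision directory in the snapshots folder. It contains the exact same tree structure as the repo on the Hub.
files (FrozenSet[CachedFileInfo]) — Set of CachedFileInfo describing all files contained in the snapshot.
refs (FrozenSet[str]) — Set of refs pointing to this revision. If the revision has no refs, it is considered detached. Example: {"main", "2.4.0"} or {"refs/pr/1"}.
size_on_disk (int) — Sum of the sizes of the blob files that the revision symlinks to.
last_modified (float) — Timestamp of the last time the revision was created or modified.

Frozen data structure holding information about a revision.

A revision corresponds to a folder in the snapshots folder and is populated with the exact same tree structure as the repo on the Hub, but it contains only symlinks. A revision can be referenced by one or more refs, or be "detached" (no refs).

last_accessed cannot be determined correctly for a single revision, as blob files are shared across revisions.

size_on_disk is not necessarily the sum of all file sizes, because of possible duplicated files. Besides, only blobs are taken into account, not the (negligible) size of folders and symlinks.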

size_on_disk_str

( )

(property) Sum of the blob file sizes as a human-readable string.

Example: "42.2K".

nb_files

( )

(property) Total number of files in the revision.

CachedFileInfo

class huggingface_hub.CachedFileInfo

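( file_name: str, file_path: Path, blob_path: Path, size_on_disk: int, blob_last_accessed: float, blob_last_modified: float )

Parameters

file_name (str) — Name of the file. Example: config.json.
file_path (Path) — Path of the file in the snapshots directory. The file path is a symlink referring to a blob in the blobs folder.
blob_path (Path) — Path of the blob file. This is equivalent to file_path.resolve().
size_on_disk (int) — Size of the blob file in bytes.
blob_last_accessed (float) — Timestamp of the last time the blob file was accessed (from any revision).
blob_last_modified (float) — Timestamp of the last time the blob file was modified or created.

Frozen data structure holding information about a single cached file.

The reliability of blob_last_accessed and blob_last_modified can depend on the OS you are using. See the Python documentation for more details.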
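
The relationship between file_path and blob_path can be reproduced in a temporary folder. This sketch assumes a platform where creating symlinks is permitted; the blob name is made up, and an absolute symlink is used for brevity, which may differ from how the real cache creates its links.

```python
import os
import tempfile
from pathlib import Path

repo = Path(tempfile.mkdtemp()).resolve()
commit = "9338f7b671827df886678df2bdd7cc7b4f36dffd"
content = '{"model_type": "bert"}'

# The real content lives once, in blobs/ (the blob name here is made up).
blob_path = repo / "blobs" / "0123abcd"
blob_path.parent.mkdir(parents=True)
blob_path.write_text(content)

# The snapshot mirrors the repo tree but only holds symlinks.
file_path = repo / "snapshots" / commit / "config.json"
file_path.parent.mkdir(parents=True)
os.symlink(blob_path, file_path)

# The size is measured on the blob, not on the symlink itself.
size_on_disk = blob_path.stat().st_size
```

Since the snapshot entry is only a link, reading through file_path touches the blob, which is why access and modification times are reported per blob rather than per revision.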

size_on_disk_str

( )

(property) Size of the blob file as a human-readable string.

Example: "42.2K".

DeleteCacheStrategy

class huggingface_hub.DeleteCacheStrategy

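( expected_freed_size: int, blobs: typing.FrozenSet[pathlib.Path], refs: typing.FrozenSet[pathlib.Path], repos: typing.FrozenSet[pathlib.Path], snapshots: typing.FrozenSet[pathlib.Path] )

Parameters

expected_freed_size (int) — Expected freed size once the strategy is executed.
blobs (FrozenSet[Path]) — Set of paths of the blob files to be deleted.
refs (FrozenSet[Path]) — Set of paths of the reference files to be deleted.
repos (FrozenSet[Path]) — Set of paths of the entire repos to be deleted.
snapshots (FrozenSet[Path]) — Set of snapshots to be deleted (directories of symlinks).

Frozen data structure holding the strategy to delete cached revisions.

This object is not meant to be instantiated programmatically but to be returned by delete_revisions(). See the documentation for usage examples.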

expected_freed_size_str

( )

(property) Expected size that will be freed, as a human-readable string.

Example: "42.2K".

Exceptions

CorruptedCacheException

class huggingface_hub.CorruptedCacheException

( )

Exception raised for any unexpected structure in the Hugging Face cache system.
