egrace479 committed
Commit 040d081 (verified) · 1 Parent(s): 1e93828

Add metadata file for sample image retrieval and description of it

Files changed (3)
  1. README.md +14 -1
  2. app.py +1 -7
  3. components/metadata.parquet +3 -0
README.md CHANGED
@@ -16,6 +16,19 @@ datasets:
 
  This app is modified from the original [BioCLIP Demo](https://huggingface.co/spaces/imageomics/bioclip-demo) to run inference with [BioCLIP 2](https://huggingface.co/imageomics/bioclip-2) and uses [pybioclip](https://github.com/Imageomics/pybioclip).
 
- Due to space persistent storage limitations, embeddings are fetched from the [TreeOfLife-200M repo](https://huggingface.co/datasets/imageomics/TreeOfLife-200M) and metadata for the images comes from [demo-data](https://huggingface.co/datasets/imageomics/demo-data) (a private Institute dataset repo). The images will be retrieved from an S3 bucket, as with the original BioCLIP demo.
+ Due to space persistent storage limitations, embeddings are fetched from the [TreeOfLife-200M repo](https://huggingface.co/datasets/imageomics/TreeOfLife-200M). The images will be retrieved from an S3 bucket, as with the original BioCLIP demo, using the metadata described below.
 
  Note that if this space is duplicated, the sample image portion **will not work**.
+
+ **components/metadata.parquet:** metadata file for fetching [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M) sample images (up to 3 available per taxon) from an S3 bucket.
+ - `uuid`: unique identifier for the image within the TreeOfLife-200M dataset.
+ - `eol_page_id`: identifier of the EOL page for the most specific taxon of the image (where available). Note that an image's association with a particular page ID may change with updates to the EOL (or image provider's) hierarchy; however, EOL taxon page IDs are stable. "https://eol.org/pages/" + `eol_page_id` links to the page.
+ - `gbif_id`: identifier used by GBIF for the most specific taxon of the image (where available). "https://gbif.org/species/" + `gbif_id` links to the page.
+ - `kingdom`: kingdom to which the subject of the image belongs (all `Animalia`).
+ - `phylum`: phylum to which the subject of the image belongs.
+ - `class`: class to which the subject of the image belongs.
+ - `order`: order to which the subject of the image belongs.
+ - `family`: family to which the subject of the image belongs.
+ - `genus`: genus to which the subject of the image belongs.
+ - `species`: species to which the subject of the image belongs.
+ - `file_path`: image filepath for fetching the image from the S3 bucket (`<folder>/<uuid>.jpg`, where the folder is the first two characters of the `uuid`).
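
For orientation, here is a minimal sketch of how the fields described above could be combined to build an image URL and taxon-page links. It assumes the metadata file is already at `components/metadata.parquet`; the S3 bucket URL is a placeholder (the real bucket is configured in the Space), and `build_links` is a hypothetical helper, not part of the app:

```python
import polars as pl

METADATA_PATH = "components/metadata.parquet"
S3_BUCKET_URL = "https://<bucket>.s3.amazonaws.com"  # placeholder; the actual bucket is configured in the Space

# Load the metadata and cast the page IDs to integers, as app.py does.
df = pl.read_parquet(METADATA_PATH, low_memory=False)
df = df.with_columns(pl.col(["eol_page_id", "gbif_id"]).cast(pl.Int64))

def build_links(row: dict) -> dict:
    """Assemble the image URL and taxon-page links for one metadata row (hypothetical helper)."""
    return {
        # `file_path` is `<folder>/<uuid>.jpg`, where the folder is the first two characters of the uuid.
        "image_url": f"{S3_BUCKET_URL}/{row['file_path']}",
        "eol_page": f"https://eol.org/pages/{row['eol_page_id']}" if row["eol_page_id"] is not None else None,
        "gbif_page": f"https://gbif.org/species/{row['gbif_id']}" if row["gbif_id"] is not None else None,
    }

# Example: links for the first sample image of a given genus (value chosen for illustration only).
sample = df.filter(pl.col("genus") == "Danaus").head(1).to_dicts()
if sample:
    print(build_links(sample[0]))
```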
app.py CHANGED
@@ -14,7 +14,6 @@ from torchvision import transforms
 
  from components.query import get_sample
  from bioclip import CustomLabelsClassifier
- from huggingface_hub import hf_hub_download
 
  log_format = "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s"
  logging.basicConfig(level=logging.INFO, format=log_format)
@@ -23,12 +22,7 @@ logger = logging.getLogger()
  hf_token = os.getenv("HF_TOKEN")
 
  # For sample images
- hf_hub_download(repo_id="imageomics/demo-data",
-                 filename="bioclip-2/metadata.parquet",
-                 repo_type="dataset",
-                 local_dir = "components",
-                 token = hf_token)
- METADATA_PATH = "components/bioclip-2/metadata.parquet"
+ METADATA_PATH = "components/metadata.parquet"
  # Read page IDs as int
  metadata_df = pl.read_parquet(METADATA_PATH, low_memory = False)
  metadata_df = metadata_df.with_columns(pl.col(["eol_page_id", "gbif_id"]).cast(pl.Int64))
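
The removed `hf_hub_download` call above shows the pattern the app previously used to pull the metadata from a Hub dataset repo; the README notes that the embeddings themselves are still fetched from the TreeOfLife-200M repo. A minimal sketch of that pattern, with a placeholder filename (the actual embeddings path is not specified in this commit):

```python
from huggingface_hub import hf_hub_download

# Sketch only: fetch a file from the TreeOfLife-200M dataset repo at startup,
# mirroring the hf_hub_download pattern removed from app.py above.
# The filename is a placeholder; the real embeddings path is not given in this commit.
embeddings_path = hf_hub_download(
    repo_id="imageomics/TreeOfLife-200M",
    filename="<path/to/embeddings-file>",  # placeholder
    repo_type="dataset",
)
print(embeddings_path)  # local path to the cached download
```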
components/metadata.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6af05f1f8f08b0d447b9a4c18680c7de39551a05318f026d30c224a9bbe5283e
+ size 121162891