Felladrin committed
Commit 74e0446 · verified · 1 Parent(s): 1e89932

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,260 @@
---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: zero-shot-image-classification
widget:
- src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/nagasaki.jpg
  candidate_labels: China, South Korea, Japan, Philippines, Taiwan, Vietnam, Cambodia
  example_title: Countries
- src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg
  candidate_labels: San Jose, San Diego, Los Angeles, Las Vegas, San Francisco, Seattle
  example_title: Cities
library_name: transformers.js
tags:
- geolocalization
- geolocation
- geographic
- street
- climate
- clip
- urban
- rural
- multi-modal
- geoguessr
base_model:
- geolocal/StreetCLIP
---

# StreetCLIP (ONNX)

This is an ONNX version of [geolocal/StreetCLIP](https://huggingface.co/geolocal/StreetCLIP). It was automatically converted and uploaded using [this Hugging Face Space](https://huggingface.co/spaces/onnx-community/convert-to-onnx).

## Usage with Transformers.js

See the pipeline documentation for `zero-shot-image-classification`: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotImageClassificationPipeline
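
Below is a minimal usage sketch with Transformers.js. The model id is left as a placeholder for this repository's id, and the candidate labels are taken from the widget examples above.

```js
import { pipeline } from "@huggingface/transformers";

// Placeholder: replace with this repository's id.
const classifier = await pipeline(
  "zero-shot-image-classification",
  "<this-repo-id>"
);

const url = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg";
const labels = ["San Jose", "San Diego", "Los Angeles", "Las Vegas", "San Francisco", "Seattle"];

// Returns one { label, score } object per candidate label.
const output = await classifier(url, labels);
console.log(output);
```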

---

# Model Card for StreetCLIP

StreetCLIP is a robust foundation model for open-domain image geolocalization and other geographic and climate-related tasks.

Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves state-of-the-art performance on multiple open-domain image geolocalization benchmarks in a zero-shot setting, outperforming supervised models trained on millions of images.

# Model Description

StreetCLIP is pretrained by deriving image captions synthetically from image class labels using a domain-specific caption template. This allows StreetCLIP to transfer its generalized zero-shot learning capabilities to a specific domain (here, the domain of image geolocalization). StreetCLIP builds on OpenAI's pretrained large version of CLIP ViT, using 14x14 pixel patches and images with a 336-pixel side length.

## Model Details

- **Model type:** [CLIP](https://openai.com/blog/clip/)
- **Language:** English
- **License:** Creative Commons Attribution Non-Commercial 4.0
- **Trained from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

## Model Sources

- **Paper:** [Preprint](https://arxiv.org/abs/2302.00275)
- **Cite preprint as:**

```bibtex
@misc{haas2023learning,
      title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization},
      author={Lukas Haas and Silas Alberti and Michal Skreta},
      year={2023},
      eprint={2302.00275},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

# Uses

StreetCLIP has a deep understanding of the visual features found in street-level urban and rural scenes and knows how to relate these concepts to specific countries, regions, and cities. Given its training setup, the following use cases are recommended for StreetCLIP.

## Direct Use

StreetCLIP can be used out-of-the-box with zero-shot learning to infer the geolocation of images at the country, region, or city level. Given that StreetCLIP was pretrained on a dataset of street-level urban and rural images, the best performance can be expected on images from a similar distribution.

Broader direct use cases include any zero-shot image classification task that relies on street-level urban or rural scene understanding, or on geographic information relating visual clues to their region of origin.

## Downstream Use

StreetCLIP can be finetuned for any downstream application that requires geographic or street-level urban or rural scene understanding. Examples of use cases are the following:

**Understanding the Built Environment**

- Analyzing building quality
- Building type classification
- Building energy efficiency classification

**Analyzing Infrastructure**

- Analyzing road quality
- Utility pole maintenance
- Identifying damage from natural disasters or armed conflicts

**Understanding the Natural Environment**

- Mapping vegetation
- Vegetation classification
- Soil type classification
- Tracking deforestation

**General Use Cases**

- Street-level image segmentation
- Urban and rural scene classification
- Object detection in urban or rural environments
- Improving navigation and self-driving car technology

## Out-of-Scope Use

Any use cases attempting to geolocate users' private images are out-of-scope and discouraged.

# Bias, Risks, and Limitations

StreetCLIP was deliberately not trained on social media images or images of identifiable people. As such, any use case attempting to geolocalize users' private images is out-of-scope, discouraged, and carries privacy risks.

## Recommendations

We encourage the community to apply StreetCLIP to applications with significant social impact, of which there are many. The first three categories listed under Downstream Use describe use cases with social impact that are worth exploring.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

url = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

choices = ["San Jose", "San Diego", "Los Angeles", "Las Vegas", "San Francisco"]
inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over labels gives per-label probabilities
```

# Training Details

## Training Data

StreetCLIP was trained on an original, unreleased street-level dataset of 1.1 million real-world urban and rural images. The data used to train the model comes from 101 countries, biased towards Western countries and not including India and China.

## Preprocessing

Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336).
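
For reference, the `preprocessor_config.json` shipped in this repository (shown further below) implies the usual CLIP image pipeline: resize the shortest edge to 336, center-crop to 336x336, rescale to [0, 1], and normalize with CLIP's channel mean and standard deviation. Below is a minimal sketch of the normalization step, assuming raw 8-bit RGB values; the helper name is ours, and Transformers.js does all of this automatically through its processor.

```js
// Sketch of the per-pixel normalization implied by preprocessor_config.json.
// Values are taken from the config file in this repository.
const IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073];
const IMAGE_STD = [0.26862954, 0.26130258, 0.27577711];

// rgb: [r, g, b] channel values in 0..255 (hypothetical helper, not a library API)
function normalizePixel(rgb) {
  return rgb.map((v, c) => (v / 255 - IMAGE_MEAN[c]) / IMAGE_STD[c]);
}
```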

## Training Procedure

StreetCLIP is initialized with OpenAI's pretrained large version of CLIP ViT and then pretrained using the synthetic caption domain-specific pretraining method described in the paper corresponding to this work. StreetCLIP was trained for 3 epochs using an AdamW optimizer with a learning rate of 1e-6 on 3 NVIDIA A100 80GB GPUs, with a batch size of 32 and gradient accumulation of 12 steps.

StreetCLIP was trained with the goal of matching images in the batch with the caption corresponding to the correct city, region, and country of the images' origins.
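
To illustrate the idea, synthetic captions are built from the geographic class labels. The template below is a hypothetical stand-in; the exact template used for pretraining is described in the paper.

```js
// Hypothetical caption template, for illustration only; the exact
// template used for pretraining is described in the paper.
function makeCaption({ city, region, country }) {
  return `A street-level photo taken in ${city}, ${region}, ${country}.`;
}

console.log(makeCaption({ city: "Nagasaki", region: "Kyushu", country: "Japan" }));
// -> "A street-level photo taken in Nagasaki, Kyushu, Japan."
```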

# Evaluation

StreetCLIP was evaluated zero-shot on two open-domain image geolocalization benchmarks using a technique called hierarchical linear probing. Hierarchical linear probing sequentially attempts to identify first the correct country and then the correct city of a geographical image's origin.
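
Below is a minimal sketch of the two-stage idea using the Transformers.js pipeline; the model id, label sets, and `best` helper are illustrative, not the paper's evaluation setup.

```js
import { pipeline } from "@huggingface/transformers";

// Illustrative two-stage (hierarchical) inference: predict the country
// first, then choose only among candidate cities of that country.
const classifier = await pipeline("zero-shot-image-classification", "<this-repo-id>");
const image = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg";

const countries = ["United States", "Japan", "Mexico"];
const citiesByCountry = {
  "United States": ["San Francisco", "Seattle", "Las Vegas"],
  "Japan": ["Tokyo", "Nagasaki"],
  "Mexico": ["Mexico City", "Guadalajara"],
};

// Pick the highest-scoring { label, score } entry.
const best = (results) => results.reduce((a, b) => (b.score > a.score ? b : a));

const country = best(await classifier(image, countries));
const city = best(await classifier(image, citiesByCountry[country.label]));
console.log(country.label, city.label);
```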

## Testing Data and Metrics

### Testing Data

StreetCLIP was evaluated on the following two open-domain image geolocalization benchmarks:

* [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/)
* [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps)

### Metrics

The objective on the listed benchmark datasets is to predict an image's coordinates of origin with as little deviation as possible. A common metric set forth in prior literature is Percentage at Kilometer (% @ KM): it first calculates the distance in kilometers between the predicted and ground-truth coordinates, and then reports what percentage of error distances fall below a given kilometer threshold.
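
A small illustrative sketch of the metric (the function names are ours):

```js
// Illustrative sketch of Percentage at Kilometer (% @ KM).
// Haversine great-circle distance between two [lat, lon] points, in km.
function haversineKm([lat1, lon1], [lat2, lon2]) {
  const toRad = (d) => (d * Math.PI) / 180;
  const R = 6371; // mean Earth radius in km
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Percentage of predictions whose error distance is below each threshold.
function percentageAtKm(predicted, truth, thresholdsKm = [25, 200, 750, 2500]) {
  const errors = predicted.map((p, i) => haversineKm(p, truth[i]));
  return thresholdsKm.map(
    (t) => (100 * errors.filter((e) => e <= t).length) / errors.length
  );
}
```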

## Results

**IM2GPS**

| Model | 25km | 200km | 750km | 2,500km |
|----------|:-------------:|:------:|:------:|:------:|
| PlaNet (2016) | 24.5 | 37.6 | 53.6 | 71.3 |
| ISNs (2018) | 43.0 | 51.9 | 66.7 | 80.2 |
| TransLocator (2022) | **48.1** | **64.6** | **75.6** | 86.7 |
| **Zero-Shot CLIP (ours)** | 27.0 | 42.2 | 71.7 | 86.9 |
| **Zero-Shot StreetCLIP (ours)** | 28.3 | 45.1 | 74.7 | **88.2** |

Metric: Percentage at Kilometer (% @ KM)

**IM2GPS3K**

| Model | 25km | 200km | 750km | 2,500km |
|----------|:-------------:|:------:|:------:|:------:|
| PlaNet (2016) | 24.8 | 34.3 | 48.4 | 64.6 |
| ISNs (2018) | 28.0 | 36.6 | 49.7 | 66.0 |
| TransLocator (2022) | **31.1** | **46.7** | 58.9 | 80.1 |
| **Zero-Shot CLIP (ours)** | 19.5 | 34.0 | 60.0 | 78.1 |
| **Zero-Shot StreetCLIP (ours)** | 22.4 | 37.4 | **61.3** | **80.4** |

Metric: Percentage at Kilometer (% @ KM)

### Summary

Our experiments demonstrate that our synthetic caption pretraining method significantly improves CLIP's generalized zero-shot capabilities on open-domain image geolocalization, achieving state-of-the-art performance on a selection of benchmark metrics.

# Environmental Impact

- **Hardware Type:** 4 NVIDIA A100 GPUs
- **Hours used:** 12

# Citation

Cite preprint as:

```bibtex
@misc{haas2023learning,
      title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization},
      author={Lukas Haas and Silas Alberti and Michal Skreta},
      year={2023},
      eprint={2302.00275},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
config.json ADDED
@@ -0,0 +1,34 @@
{
  "_attn_implementation_autoset": true,
  "_name_or_path": "geolocal/StreetCLIP",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 768,
  "text_config": {
    "dropout": 0.0,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "model_type": "clip_text_model",
    "num_attention_heads": 12,
    "projection_dim": 768,
    "torch_dtype": "float32"
  },
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "vision_config": {
    "dropout": 0.0,
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "torch_dtype": "float32"
  }
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cb60bb52e9f4f54111d9c4db66c934970746ab770b2237ba3f5c99c961763c22
size 1712529037
onnx/model_bnb4.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2d11444aaf34f3f070cf4a948ab4416cf677916112567876da43ce2e15ad02f1
size 377779159
onnx/model_fp16.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:60d47e9d9c61b55e5f8b66d24e27d1be13428eca33a78368db970e3ade36aada
size 856641119
onnx/model_int8.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:90f3c46c95caaec38e4e3f2ac2779643778468861bd24ccf2548e77a5103ea1f
size 433623188
onnx/model_q4.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3e55198ff4108ac95c6781f5d6c61e546bdf4c04374ac1a34913a89865df34fb
size 402046241
onnx/model_q4f16.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3c00f511530563d49494655fa6c0356047b383a5cffdb2c88adda9a6b23845e8
size 298491223
onnx/model_quantized.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e3dda43fc568cdc7157d18df1e6e0913abbd7aecca2fe8c02ee842f2c459384e
size 433623188
onnx/model_uint8.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e3dda43fc568cdc7157d18df1e6e0913abbd7aecca2fe8c02ee842f2c459384e
size 433623188
preprocessor_config.json ADDED
@@ -0,0 +1,28 @@
{
  "crop_size": {
    "height": 336,
    "width": 336
  },
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "CLIPFeatureExtractor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "processor_class": "CLIPProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "shortest_edge": 336
  }
}
quantize_config.json ADDED
@@ -0,0 +1,18 @@
{
  "modes": [
    "fp16",
    "q8",
    "int8",
    "uint8",
    "q4",
    "q4f16",
    "bnb4"
  ],
  "per_channel": true,
  "reduce_range": true,
  "block_size": null,
  "is_symmetric": true,
  "accuracy_level": null,
  "quant_type": 1,
  "op_block_list": null
}
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
{
  "bos_token": {
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,32 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "49406": {
      "content": "<|startoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49407": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|startoftext|>",
  "clean_up_tokenization_spaces": false,
  "do_lower_case": true,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 77,
  "pad_token": "<|endoftext|>",
  "processor_class": "CLIPProcessor",
  "tokenizer_class": "CLIPTokenizer",
  "unk_token": "<|endoftext|>"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff