GeoGPT-Research-Project/GeoGPT-QA
How to use yasserrmd/geo-gemma-300m-emb with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("yasserrmd/geo-gemma-300m-emb")
sentences = [
"How does precipitation influence the water use efficiency and carbon isotopes of Picea meyeri, and what are the implications for climate change studies?",
"In the study of starry flounders (Platichthys stellatus), cortisol levels increased with increasing water temperature and then gradually decreased. This suggests that cortisol, a stress hormone, is elevated as a response to higher water temperatures, indicating that the fish experience stress under these conditions. The increase in cortisol levels is part of the fish's physiological response to environmental stressors, such as temperature changes, which can affect their survival and overall health.",
"The FY-4A/AGRI LST products effectively capture surface temperatures in Hunan Province, with a correlation coefficient (R) of 0.893. However, they exhibit a relatively high error level, with a bias of −6.295 °C and a root mean square error (RMSE) of 8.58 °C, particularly in capturing high LST values. The performance of this product is superior in the eastern flat terrain area of Hunan Province compared to the western mountainous region. Environmental conditions in the mountainous areas cause systematic errors that contribute to instability in detection deviation. Surface heat resources are more abundant in eastern Hunan Province than in the mountainous areas located to the west and south, and their detailed distribution at finer scales is mainly influenced by terrain and climate conditions. There is no obvious seasonal difference in the distribution of heat resources except in winter, and rapid urbanization within the Chang–Zhu–Tan urban agglomeration over two years has significantly altered the spatial distribution pattern of surface heat resources across Hunan Province.",
"The water use efficiency (WUE) of Picea meyeri is significantly influenced by precipitation, along with temperature. The study found that there is a significant positive correlation between the WUE sequence and temperature. However, due to the combined effects of precipitation and temperature, Picea meyeri is subject to drought stress to some extent. This indicates that while temperature is the main climatic factor affecting the δ13C and WUE of Picea meyeri, precipitation also plays a crucial role in the plant's response to climate change. These findings are important for understanding the impacts of climate change on tree species and their ability to adapt to changing environmental conditions."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
This is a sentence-transformers model finetuned from google/embeddinggemma-300m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(4): Normalize()
)
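The Pooling and Normalize modules in the printout above can be illustrated with a minimal sketch. It assumes that mean pooling averages only the non-padding token vectors and that Normalize rescales the result to unit length. The toy 4-dimensional vectors and the helper names `mean_pool` and `l2_normalize` are hypothetical, and the two bias-free Dense projections (768 to 3072 and back to 768) are omitted.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy input (hypothetical): three 4-dimensional token vectors, the last is padding.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # one vector per input text
embedding = l2_normalize(pooled)      # unit-length sentence embedding
print(pooled)
print(sum(x * x for x in embedding))  # squared length, approximately 1.0
```

Because the final embedding has unit length, the cosine similarity between two embeddings reduces to their dot product.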
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("yasserrmd/geo-gemma-300m-emb")
# Run inference
queries = [
"Based on the Brine Shrimp Lethality Test (BSLT), what are the toxicity levels of liquid smoke from cocoa pod skin at various pyrolysis temperatures and water contents?",
]
documents = [
'The Brine Shrimp Lethality Test (BSLT) was used to determine the toxicity levels of liquid smoke from cocoa pod skin at various pyrolysis temperatures and water contents. The results showed that the LC50 values (the concentration required to kill 50% of the test organisms) were as follows: at 200°C and 10% water content, 11,858.58 ppm; at 200°C and 15% water content, 13,094.23 ppm; at 200°C and 20% water content, 13,373.94 ppm; at 200°C and 25% water content, 15,703.52 ppm. At 300°C and 10% water content, 11,604.26 ppm; at 300°C and 15% water content, 11,673.05 ppm; at 300°C and 20% water content, 13,373.94 ppm; at 300°C and 25% water content, 13,373.94 ppm. At 400°C and 10% water content, 9,213.73 ppm; at 400°C and 15% water content, 13,094.237 ppm; at 400°C and 20% water content, 13,373.94 ppm; at 400°C and 25% water content, 12,493.63 ppm. All the results indicate that the liquid smoke from cocoa pod skin at different pyrolysis temperatures and water contents is classified as non-toxic.',
'The estimated annual flood damage for agriculture and built-up areas in the Tajan watershed, northern Iran, is projected to surge from USD 162 million to USD 376 million and USD 91 million to USD 220 million, respectively, by 2040, considering the land use change scenarios from 2021 to 2040.',
'The distribution of PM2.5 in Santa Ana, CA, tends to be higher in socioeconomically disadvantaged communities compared to other areas, highlighting environmental health inequities that persist in urban areas. This can inform policy decisions related to health equity and community access to resources.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 768] [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.5805, 0.0253, 0.0709]])
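For semantic search, the similarity row is typically turned into a ranking of the documents. The sketch below does this in plain Python with the three scores copied from the printed tensor; the variable names are illustrative, and the scores are assumed to be cosine similarities (the library default).

```python
# Scores copied from the printed tensor: one query against three documents.
scores = [0.5805, 0.0253, 0.0709]

# Sort document indices from most to least similar to the query.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

for rank, doc_index in enumerate(ranking, start=1):
    print(rank, doc_index, scores[doc_index])
```

The first document, which answers the BSLT question directly, ranks well ahead of the two unrelated passages.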
Columns: sentence_0 and sentence_1

|  | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details |  |  |

| sentence_0 | sentence_1 |
|---|---|
| How does plastic debris from land-based sources impact the ocean, particularly in the context of First Long Beach, China? | Plastic debris from land-based sources can significantly impact the ocean, as seen in the study conducted at First Long Beach (FLB), China. The study found that plastic debris amounts ranged from 2 to 82 particles per square meter on this marine sand beach. The most common size of plastics was 0.5–2.5 cm (44.4%), and the most common color was white (60.9%). The most abundant shape of plastic debris was fragments (76.2%). The amount of plastic debris varied significantly between different transects along the land-based source input zone due to the impacts of wind, ocean currents, and waves. Land-based wastewater discharge was identified as a major source of plastic debris on FLB, influenced by coastal water tide variations. Reduction strategies should focus on tracing and managing these land-based sources to mitigate the impact of plastic debris on the ocean. |
| How does the concentration of SO2 in urban areas of Nanjing correlate with the normalized difference vegetation index (NDVI), and what does this imply for public health? | The concentration of SO2 in urban areas of Nanjing exhibits a strong correlation (coefficient of determination, R2 > 0.5) with the normalized difference vegetation index (NDVI) within a radial distance of 2 km from the air pollutant monitoring sites. This indicates that NDVI can be an effective indicator for assessing the distribution and concentrations of air pollutants such as SO2. Negative correlations between NDVI and socio-economic indicators are observed under relatively consistent natural conditions, including climate and terrain. Therefore, the spatiotemporal distribution patterns of NDVI can provide valuable insights not only into socio-economic growth but also into the levels and locations of air pollution concentrations, which is crucial for public health interventions and policies. |
| How has the rise of user-generated geodata impacted the role of traditional map producers? | The rise of user-generated geodata has transformed ordinary citizens into neogeographers, blurring the boundaries between traditional map producers, such as national mapping agencies and local authorities, and citizens as consumers of this information. Citizens now actively participate in mapping different types of features on the Earth’s surface as volunteers, either by providing observations on the ground or tracing data from other sources, such as aerial photographs or satellite imagery. This has resulted in a significant increase in the availability of rich spatial datasets, which are often openly accessible through platforms like OpenStreetMap (OSM) and Ushahidi. |

MultipleNegativesRankingLoss with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false
}
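To make the loss parameters concrete, here is a minimal plain-Python sketch of a multiple-negatives ranking objective with cosine similarity and a scale of 20.0. It assumes the usual in-batch-negatives formulation: for each anchor, the paired text is the positive and every other paired text in the batch acts as a negative. The function names and toy 2-dimensional vectors are hypothetical, and this is not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other entry of positives serves as an in-batch negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch (hypothetical): two anchors with well-matched and mismatched pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The scale acts as an inverse temperature: multiplying cosine scores (bounded by 1) by 20 sharpens the softmax, so well-matched pairs drive the loss close to zero while mismatched pairs are penalized heavily.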
- num_train_epochs: 1
- multi_dataset_batch_sampler: round_robin

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}

| Epoch | Step | Training Loss |
|---|---|---|
| 0.0965 | 500 | 0.012 |
| 0.1931 | 1000 | 0.006 |
| 0.2896 | 1500 | 0.0057 |
| 0.3862 | 2000 | 0.0045 |
| 0.4827 | 2500 | 0.0024 |
| 0.5793 | 3000 | 0.0013 |
| 0.6758 | 3500 | 0.0025 |
| 0.7723 | 4000 | 0.0029 |
| 0.8689 | 4500 | 0.0012 |
| 0.9654 | 5000 | 0.0004 |
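The Epoch and Step columns can be related with a little arithmetic. The sketch below assumes the Epoch value is simply the step count divided by the number of steps per epoch, and that training ran on a single device with gradient_accumulation_steps of 1 and a per-device batch size of 8, as listed in the hyperparameters. Under those assumptions it gives a rough estimate of the training set size; the variable names are illustrative.

```python
# Values copied from the card: last logged row and the per-device batch size.
last_epoch_fraction = 0.9654
last_step = 5000
per_device_train_batch_size = 8

# If epoch = step / steps_per_epoch, the steps per epoch follow directly.
steps_per_epoch = last_step / last_epoch_fraction

# Each step consumes one batch (assuming one device, no gradient accumulation).
approx_train_examples = steps_per_epoch * per_device_train_batch_size

print(round(steps_per_epoch), round(approx_train_examples))
```

The estimate is only approximate because the logged epoch fractions are rounded to four decimal places.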
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
google/embeddinggemma-300m