SentenceTransformer based on sentence-transformers/all-distilroberta-v1
This is a sentence-transformers model finetuned from sentence-transformers/all-distilroberta-v1. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: sentence-transformers/all-distilroberta-v1
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
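The modules above mean-pool the token embeddings and then L2-normalize the result. As a rough illustration of those two steps, here is a minimal NumPy sketch with toy inputs (not the library's actual implementation):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)         # avoid division by zero
    return summed / counts

def l2_normalize(x):
    """Scale each vector to unit length, as the Normalize() module does."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy batch: 1 sentence, 3 token slots (the last one is padding), 4-dim embeddings
tokens = np.array([[[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [9.0, 9.0, 9.0, 9.0]]])  # padding token, masked out below
mask = np.array([[1, 1, 0]])

pooled = mean_pool(tokens, mask)           # -> [[0.5, 0.5, 0.0, 0.0]]
sentence_embedding = l2_normalize(pooled)  # unit-length sentence vector
```

Because of the final normalization step, downstream cosine similarity between two sentence embeddings reduces to a dot product.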
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Vishnu7796/my-finetuned-model")

# Run inference
sentences = [
    'Marketing effectiveness measurement, content performance analysis, A/B testing for social media',
    'Skills:5+ years of marketing or business analytics experience with synthesizing large-scale data sets to generate insights and recommendations.5+ years of working experience using SQL, Excel, Tableau, and/or Power B. R & Python knowledge are preferred.Understanding of the data science models used for measuring marketing incrementality, e.g. multi-touch attribution, marketing mix models, causal inference, time-series regression, match market test, etc....Understanding of the full-funnel cross-platform marketing and media landscape and experience evolving analytics and measurement capabilities.Flexibility in priority shifts and fast iterations/agile working environment.Strong problem-solving skills, and ability to structure problems into an analytics plan.\nPride Global offers eligible employee’s comprehensive healthcare coverage (medical, dental, and vision plans), supplemental coverage (accident insurance, critical illness insurance and hospital indemnity), 401(k)-retirement savings, life & disability insurance, an employee assistance program, legal support, auto, home insurance, pet insurance and employee discounts with preferred vendors.',
    'Hi All,\nThis is Nithya from TOPSYSIT, We have a job requirement for Data Scientist with GenAI. If anyone interested please send me your updated resume along with contact details to [email protected]\nAny Visa is Fine on W2 except H1B ,OPT and CPT.If GC holders who can share PPN along with proper documentation are eligible\nJob Title Data Scientist with GenAILocation: Plano, TX-OnsiteEXP: 10 Years Description:Competencies: SQL, Natural Language Processing (NLP), Python, PySpark/ApacheSpark, Databricks.Python libraries: Numpy, Pandas, SK-Learn, Matplotlib, Tensorflow, PyTorch.Deep Learning: ANN, RNN, LSTM, CNN, Computer vision.NLP: NLTK, Word Embedding, BOW, TF-IDF, World2Vec, BERT.Framework: Flask or similar.\nThanks & Regards,Nithya Kandee:[email protected]:678-899-6898',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
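Since the model's final Normalize() module produces unit-length embeddings, the cosine similarity that `model.similarity` computes by default reduces to a dot product, and rows can be sorted to rank candidates against a query. A self-contained sketch of that ranking step, using stand-in vectors rather than real model outputs:

```python
import numpy as np

def normalize(v):
    """L2-normalize each row, mimicking the model's Normalize() module."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in for model.encode output: three already-normalized embeddings,
# where the second vector is deliberately close to the first.
embeddings = normalize(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]))

# Cosine similarity of normalized vectors is just the dot product.
similarities = embeddings @ embeddings.T

# Rank all sentences against the first one (treated as the query)
ranking = np.argsort(-similarities[0])
print(ranking)  # the query itself first, then the closest match
```

The same pattern applies to real embeddings: encode a query and a corpus, take the query row of the similarity matrix, and sort descending.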
Evaluation
Metrics
Triplet
- Datasets: `ai-job-validation` and `ai-job-test`
- Evaluated with `TripletEvaluator`
| Metric | ai-job-validation | ai-job-test |
|---|---|---|
| cosine_accuracy | 0.9875 | 0.9756 |
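Here `cosine_accuracy` is the fraction of triplets for which the anchor is closer (by cosine similarity) to its positive than to its negative. A small illustrative sketch of that computation, with toy vectors rather than the actual evaluation data:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_cosine_accuracy(anchors, positives, negatives):
    """Fraction of triplets where the anchor is more similar to its
    positive than to its negative -- the quantity TripletEvaluator reports."""
    hits = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return hits / len(anchors)

# Two toy triplets, both correctly ordered -> accuracy 1.0
anchors   = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
positives = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
negatives = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(triplet_cosine_accuracy(anchors, positives, negatives))  # 1.0
```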
Training Details
Training Dataset
Unnamed Dataset
- Size: 647 training samples
- Columns: `query`, `job_description_pos`, and `job_description_neg`
- Approximate statistics based on the first 647 samples:
| | query | job_description_pos | job_description_neg |
|---|---|---|---|
| type | string | string | string |
| min tokens | 8 | 7 | 7 |
| mean tokens | 15.05 | 350.34 | 352.82 |
| max tokens | 40 | 512 | 512 |
- Samples (the first rows; long fields are truncated with `...` as in the original card):

  **Sample 1**
  - query: healthcare data analytics, pregnancy identification algorithms, causal modeling techniques
  - job_description_pos: experience in using, manipulating, and extracting insights from healthcare data with a particular focus on using machine learning with claims data. The applicant will be driven by curiosity, collaborating with a cross-functional team of Product Managers, Software Engineers, and Data Analysts. Responsibilities: Apply data science, machine learning, and healthcare domain expertise to advance and oversee Lucina’s pregnancy identification and risk-scoring algorithms. Analyze healthcare data to study patterns of care and patient conditions which correlate to specific outcomes. Collaborate on clinical committee research and development work. Complete ad hoc analyses and reports from internal or external customers prioritized by management throughout the year. Qualifications: Degree or practical experience in Applied Math, Statistics, Engineering, Information Management with 3 or more years of data analytics experience, Masters degree a plus. Experience manipulating and analyzing healthcare dat...
  - job_description_neg: Experience of Delta Lake, DWH, Data Integration, Cloud, Design and Data Modelling. Proficient in developing programs in Python and SQL. Experience with Data warehouse Dimensional data modeling. Working with event based/streaming technologies to ingest and process data. Working with structured, semi structured and unstructured data. Optimize Databricks jobs for performance and scalability to handle big data workloads. Monitor and troubleshoot Databricks jobs, identify and resolve issues or bottlenecks. Implement best practices for data management, security, and governance within the Databricks environment. Experience designing and developing Enterprise Data Warehouse solutions. Proficient writing SQL queries and programming including stored procedures and reverse engineering existing process. Perform code reviews to ensure fit to requirements, optimal execution patterns and adherence to established standards. Requirements: You are: Minimum 9+ years of experience is required. 5+ years...

  **Sample 2**
  - query: Data Engineer Python Azure API integration
  - job_description_pos: experience preferred but not required. Must-Have Skills: 10+ years of total IT experience required. of 4 years of proven and relevant experience in a similar Data Engineer role and/or Python Dev role. Strong proficiency in Python programming is essential for data manipulation, pipeline development, and integration tasks. In-depth knowledge of SQL for database querying, data manipulation, and performance optimization. Experience working with RESTful APIs and integrating data from external sources using API calls. Azure: Proficiency in working with Microsoft Azure cloud platform, including services like Azure Data Factory, Azure Databricks, and Azure Storage.
  - job_description_neg: requirements; Research & implement new data products or capabilities. Automate data visualization and reporting capabilities that empower users (both internal and external) to access data on their own thereby improving quality, accuracy and speed. Synthesize raw data into actionable insights to drive business results, identify key trends and opportunities for business teams and report the findings in a simple, compelling way. Evaluate and approve additional data partners or data assets to be utilized for identity resolution, targeting or measurement. Enhance PulsePoint's data reporting and insights generation capability by publishing internal reports about Health data. Act as the “Subject Matter Expert” to help internal teams understand the capabilities of our platforms, how to implement & troubleshoot. Requirements: What are the ‘must haves’ we’re looking for? Minimum 3-5 years of relevant experience in: Creating SQL queries from scratch using real business data; Highly proficient knowledge of Excel (...

  **Sample 3**
  - query: Data Engineer big data technologies, cloud data warehousing, real-time data streaming
  - job_description_pos: experience in machine learning, distributed microservices, and full stack systems. Utilize programming languages like Java, Scala, Python and Open Source RDBMS and NoSQL databases and Cloud based data warehousing services such as Redshift and Snowflake. Share your passion for staying on top of tech trends, experimenting with and learning new technologies, participating in internal & external technology communities, and mentoring other members of the engineering community. Collaborate with digital product managers, and deliver robust cloud-based solutions that drive powerful experiences to help millions of Americans achieve financial empowerment. Perform unit tests and conduct reviews with other team members to make sure your code is rigorously designed, elegantly coded, and effectively tuned for performance. Basic Qualifications: Bachelor’s Degree. At least 2 years of experience in application development (Internship experience does not apply). At least 1 year of experience in big d...
  - job_description_neg: requirements of analyses and reports. Transform requirements into actionable, high-quality deliverables. Perform periodic and ad-hoc operations data analysis to measure performance and conduct root cause analysis for Claims, FRU, G&A, Provider and UM data. Compile, analyze and provide reporting that identifies and defines actionable information or recommends possible solutions for corrective actions. Partner with other Operations areas as needed to provide technical and other support in the development, delivery, maintenance, and enhancement of analytical reports and analyses. Collaborate with Operations Tower Leaders in identifying and recommending operational performance metrics; map metrics against targets and the company’s operational plans and tactical/strategic goals to ensure alignment and focus. Serve as a liaison with peers in other departments to ensure data integrity. Code and schedule reports using customer business requirements from Claims, FRU, G&A, Provider and UM data. Princi...

- Loss: `MultipleNegativesRankingLoss` with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
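MultipleNegativesRankingLoss treats every other positive in the batch as a negative for each anchor, then applies cross-entropy over scaled similarities. A hedged NumPy sketch of the idea (not the library's implementation), using the scale of 20.0 and cosine similarity from the parameters above:

```python
import numpy as np

def mnr_loss(anchor_emb, positive_emb, scale=20.0):
    """Sketch of MultipleNegativesRankingLoss: for each anchor, all other
    positives in the batch act as in-batch negatives. Cross-entropy over
    scaled cosine similarities, with the matching positive as the label."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p = normalize(anchor_emb), normalize(positive_emb)
    scores = scale * (a @ p.T)                    # (batch, batch) cosine sims
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # diagonal = correct pairs

# Well-separated pairs give near-zero loss, since each anchor's
# matching positive scores far above the in-batch negatives.
anchors   = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[1.0, 0.0], [0.0, 1.0]])
print(mnr_loss(anchors, positives))  # close to 0
```

This in-batch-negatives construction is why the `no_duplicates` batch sampler matters during training: duplicate positives in one batch would become false negatives.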
Evaluation Dataset
Unnamed Dataset
- Size: 80 evaluation samples
- Columns: `query`, `job_description_pos`, and `job_description_neg`
- Approximate statistics based on the first 80 samples:
| | query | job_description_pos | job_description_neg |
|---|---|---|---|
| type | string | string | string |
| min tokens | 8 | 14 | 31 |
| mean tokens | 14.9 | 354.31 | 334.05 |
| max tokens | 25 | 512 | 512 |
- Samples (the first rows; long fields are truncated with `...` as in the original card):

  **Sample 1**
  - query: Data analysis, operations reporting, SQL expertise
  - job_description_pos: requirements, determine technical issues, and design reports to meet data analysis needs. Developing and maintaining web-based dashboards for real-time reporting of key performance indicators for Operations. Dashboards must be simple to use, easy to understand, and accurate. Maintenance of current managerial reports and development of new reports. Develop and maintain reporting playbook and change log. Other duties in the PUA department as assigned. What YOU Will Bring To C&F: Solid analytical and problem solving skills. Intuitive, data-oriented with a creative, solutions-based approach. Ability to manage time, multi-task and prioritizes multiple assignments effectively. Ability to work independently and as part of a team. Able to recognize and analyze business and data issues with minimal supervision, ability to escalate when necessary. Able to identify cause and effect relationships in data and work process flows. Requirements: 3 years in an Analyst role is required. A Bachelor’s degree in associated f...
  - job_description_neg: experience in data engineering, software engineering, data analytics, or machine learning. Strong expertise working with one or more cloud data platforms (Snowflake, Sagemaker, Databricks, etc.) Experience managing Snowflake infrastructure with terraform. Experience building batch, near real-time, and real-time data integrations with multiple sources including event streams, APIs, relational databases, noSQL databases, graph databases, document stores, and cloud object stores. Strong ability to debug, write, and optimize SQL queries in dbt. Experience with dbt is a must. Strong programming experience in one or more modern programming languages (Python, Clojure, Scala, Java, etc.) Experience working with both structured and semi-structured data. Experience with the full software development lifecycle including requirements gathering, design, implementation, testing, deployment, and iteration. Strong understanding of CI/CD principles. Strong ability to document, diagram, and deliver detailed pres...

  **Sample 2**
  - query: AWS Sagemaker, ML Model Deployment, Feedback Loop Automation
  - job_description_pos: Qualifications: AWS tools and solutions including Sagemaker, Redshift, Athena. Experience with Machine learning libraries such as PyTorch. Hands-on experience with designing, developing and deploying workflows with ML models with feedback loops; Uses Bitbucket workflows and has experience with CI/CD. Deep experience in at least two of the following languages: PySpark/Spark, Python, C. Working knowledge of AI/ML algorithms. Large language models (LLMs), Retrieval-augmented generation (RAN), Clustering algorithms (such as K-Means), Binary classifiers (such as XGBoost). High level of self-starter, learning, and initiative behaviors. Preferred: Background as a software engineer and experience as a data scientist. Features Stores. Why Teaching Strategies: At Teaching Strategies, our solutions and services are only as strong as the teams that create them. By bringing passion, dedication, and creativity to your job every day, there's no telling what you can do and where you can go! We provide a competitive...
  - job_description_neg: requirements and metrics. Provide training and support to end-users on data quality best practices and tools. Develop and maintain documentation related to data quality processes. Education Qualification: Bachelor's degree in a related field such as Data Science, Computer Science, or Information Systems. Required Skills: Experience working as a BA/Data Analyst in a Data warehouse/Data governance platform. Strong analytical and problem-solving skills. Proficiency in SQL, data analysis, and data visualization tools. Critical thinking. Ability to understand and examine complex datasets. Ability to interpret Data quality results and metrics. Desired Skills: Knowledge of Data quality standards and processes. Proven experience in a Data Quality Analyst or similar role. Experience with data quality tools such as Informatica, PowerCurve, or Collibra DQ is preferred. Certifications in data management or quality assurance (e.g. Certified Data Management Professional, Certified Quality ...

  **Sample 3**
  - query: Financial analysis, process re-engineering, client relationship management
  - job_description_pos: skills: BA/BS degree in finance-related field and/or 2+ years working in finance or related field. Strong working knowledge of Microsoft Office (especially Excel). Ability to work in a fast-paced environment and attention to detail. This role includes reviews and reconciliation of financial information. General Position Summary: The Business Analyst performs professional duties related to the review, assessment and development of business systems and processes as well as new client requirements. This includes reviewing existing processes to develop strong QA procedures as well as maximizing review efficiencies and internal controls through process re-engineering. The Business Analyst will assist with the development of seamless solutions for unique requirements of new clients, delivered and implemented on time and within scope. This role will ensure that all activity, reconciliation, reporting, and analysis is carried out in an effective, timely and accurate manner and will look for cont...
  - job_description_neg: Skills / Experience: Required: Proficiency with Python, pyTorch, Linux, Docker, Kubernetes, Jupyter. Expertise in Deep Learning, Transformers, Natural Language Processing, Large Language Models. Preferred: Experience with genomics data, molecular genetics. Distributed computing tools like Ray, Dask, Spark. Thanks & Regards, Bharat Priyadarshan Gunti, Head of Recruitment & Operations, Stellite Works LLC, 4841 W Stonegate Circle Lake Orion MI - 48359. Contact: 313 221 [email protected]

- Loss: `MultipleNegativesRankingLoss` with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 2e-05
- num_train_epochs: 1
- warmup_ratio: 0.1
- batch_sampler: no_duplicates
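These non-default values would typically be passed through `SentenceTransformerTrainingArguments`. A hedged sketch of that configuration (the `output_dir` is a placeholder, and this is a plausible reconstruction of the setup rather than the exact training script):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="models/my-finetuned-model",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    warmup_ratio=0.1,
    # no_duplicates avoids in-batch false negatives with
    # MultipleNegativesRankingLoss
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```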
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
| Epoch | Step | ai-job-validation_cosine_accuracy | ai-job-test_cosine_accuracy |
|---|---|---|---|
| 0 | 0 | 0.85 | - |
| 1.0 | 41 | 0.9875 | 0.9756 |
Framework Versions
- Python: 3.11.12
- Sentence Transformers: 3.3.1
- Transformers: 4.48.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 2.14.4
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}