---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
datasets:
- mserras/alpaca-es-hackaton
- somosnlp/somos-clean-alpaca-es
language:
- es
---

# mserras/setfit-alpaca-es-unprocessable-sample-detection

This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset. 

The base model is the multilingual [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model from Sentence Transformers.

This model was developed during the 2023 Hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF Card](https://huggingface.co/somosnlp)), with GPUs provided by [Q Blocks](https://www.qblocks.cloud).
 
This model was trained on samples tagged "unprocessable" in the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset, taken from 
the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.

To this end, a custom tag is proposed: "unprocessable". It marks instruction/input/output triplets that require processing images, fetching information from the 
open web, or similar tasks that the LLM has no capability to perform, and which therefore end in hallucinations or strange outputs.


## Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
from setfit import SetFitModel
import argilla as rg


# Download from Hub and run inference
model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")

def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str):
    """Given the instruction, input and output fields, return a text to be used by setfit"""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"

def sample_to_text(sample: rg.TextClassificationRecord) -> str:
    """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
    return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])

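# For a given Argilla record (argilla_record is assumed to be an
# rg.TextClassificationRecord loaded beforehand), take the probability at
# index 1, which this card treats as the "unprocessable" score:
unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]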
```
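
If the samples are not stored in Argilla, the same scoring can be applied to plain dictionaries. The following is a minimal sketch, not part of the original training or annotation code: the helper names `triplet_to_text`, `make_setfit_scorer` and `filter_processable`, the dictionary keys `instruction`/`input`/`output`, and the `0.5` threshold are all assumptions chosen for illustration. The text layout mirrors `instruct_fields_to_text` from the snippet above.

```python
from typing import Callable, Dict, List, Sequence


def triplet_to_text(instruction: str, input_text: str, output: str) -> str:
    """Build the text in the same layout as instruct_fields_to_text above."""
    return f"INSTRUCTION:\n{instruction}\nINPUT:\n{input_text}\nOUTPUT:\n{output}\n"


def make_setfit_scorer(model) -> Callable[[Sequence[str]], List[float]]:
    """Wrap an already loaded SetFitModel into a function returning one score per text.

    Index 1 of each probability row is taken as the "unprocessable" score,
    as in the usage snippet above.
    """
    def score(texts: Sequence[str]) -> List[float]:
        return [float(row[1]) for row in model.predict_proba(list(texts)).tolist()]
    return score


def filter_processable(
    samples: Sequence[Dict[str, str]],
    score_fn: Callable[[Sequence[str]], List[float]],
    threshold: float = 0.5,  # hypothetical cut-off, tune it on your own data
) -> List[Dict[str, str]]:
    """Keep only the samples whose "unprocessable" score is below the threshold."""
    texts = [triplet_to_text(s["instruction"], s["input"], s["output"]) for s in samples]
    scores = score_fn(texts)
    return [s for s, score in zip(samples, scores) if score < threshold]
```

With the `model` loaded as shown earlier, `filter_processable(samples, make_setfit_scorer(model))` would return the samples that are not flagged. Passing the scorer as a function keeps the filtering logic independent of SetFit, so it can be checked with a stub before downloading the model.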

## Evaluation

*Disclaimer*: No formal evaluation was performed; the results were only checked by manually inspecting the data and the model's outputs.

## Changelog

- [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
- [06/04/2023] Password generation is no longer detected as unprocessable.


## BibTeX entry and citation info

```bibtex
@article{https://doi.org/10.48550/arxiv.2209.11055,
doi = {10.48550/ARXIV.2209.11055},
url = {https://arxiv.org/abs/2209.11055},
author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
title = {Efficient Few-Shot Learning Without Prompts},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
```