---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
datasets:
- mserras/alpaca-es-hackaton
- somosnlp/somos-clean-alpaca-es
language:
- es
---

# mserras/setfit-alpaca-es-unprocessable-sample-detection

This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset. The base model is the multilingual [Paraphrase mpnet base v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model from Sentence Transformers.

This model was developed during the 2023 Hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF Card](https://huggingface.co/somosnlp)), using GPUs provided by [Q Blocks](https://www.qblocks.cloud).

The model was trained on samples tagged "unprocessable" in the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset from the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es. The custom "unprocessable" tag marks instruction/input/output triplets that require processing images, fetching information from the open web, or similar tasks that the LLM has no capability to perform and that therefore end in hallucinations or strange outcomes.

## Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
from setfit import SetFitModel
import argilla as rg

# Download from Hub and run inference
model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")


def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str) -> str:
    """Given the instruction, input and output fields, return a text to be used by setfit"""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"


def sample_to_text(sample: rg.TextClassificationRecord) -> str:
    """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
    return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])


# For a given Argilla record (`argilla_record` must be defined by the caller):
# index 1 of the predicted probabilities is the "unprocessable" score
unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
```

## Evaluation

*Disclaimer*: No formal evaluation was done; a few people manually inspected the data and the model's outcomes.

## Changelog

- [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
- [06/04/2023] Password generation is no longer detected as unprocessable.

## BibTeX entry and citation info

```bibtex
@article{https://doi.org/10.48550/arxiv.2209.11055,
  doi = {10.48550/ARXIV.2209.11055},
  url = {https://arxiv.org/abs/2209.11055},
  author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
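
## Example: filtering samples by score

Building on the Usage snippet, the following is a minimal sketch of how the "unprocessable" score might be used to filter a dataset. It is not part of the original training or annotation code. The `split_by_score` helper, the plain-dict sample layout with `instruction`/`input`/`output` keys, and the `0.5` threshold are all assumptions made for illustration; the scorer is passed in as a function so the helper can be tried without downloading the model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical sample layout: a plain dict with "instruction", "input" and "output" keys.
Sample = Dict[str, str]


def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str) -> str:
    """Same text layout as in the Usage section."""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"


def split_by_score(
    samples: Sequence[Sample],
    score_fn: Callable[[List[str]], List[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the model card
) -> Tuple[List[Sample], List[Sample]]:
    """Return (kept, flagged); flagged samples score at or above the threshold."""
    texts = [
        instruct_fields_to_text(s["instruction"], s.get("input", ""), s["output"])
        for s in samples
    ]
    scores = score_fn(texts)
    kept: List[Sample] = []
    flagged: List[Sample] = []
    for sample, score in zip(samples, scores):
        (flagged if score >= threshold else kept).append(sample)
    return kept, flagged


# Assumed way to plug in the model (needs setfit and a model download, so it is commented out):
# from setfit import SetFitModel
# model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
# score_fn = lambda texts: [row[1] for row in model.predict_proba(texts).tolist()]
# kept, flagged = split_by_score(my_samples, score_fn)
```

Since no formal evaluation was done, any threshold should probably be checked by hand against a small set of flagged and kept samples before filtering the full dataset.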