<a href="https://colab.research.google.com/github/huggingface/data-is-better-together/blob/main/prompt_translation/02_upload_prompt_translation_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. Uploading prompts to be translated to an Argilla Space 

This notebook focuses on the steps involved in uploading prompts to be translated to an Argilla Space. It assumes you have already created an Argilla Space and have the Space ID and API key. If you haven't created an Argilla Space yet, please refer to the previous notebooks and the overall README for instructions on how to do so.

## Steps

This notebook picks up from the previous notebook, in which you set up an Argilla Space with OAuth authentication and requested an upgrade to persistent storage. In this notebook we'll finish the setup by covering the following steps:

1. Loading the DIBT data into the Argilla Space
2. (Optional) Machine-translating the prompts into the target language as a starting point

Install the required libraries by running the cell below.

In [None]:
%pip install huggingface_hub argilla datasets openai -qqq

## Load the DIBT data into the Argilla Space


First, we need to set up the Argilla SDK client with the URL and owner credentials for our Space.

<div class="alert alert-warning">
  <strong>Warning!</strong> Make sure you have persistent storage enabled before you proceed to the next steps. If you haven't done this, there is a strong danger of losing data. Please reach out on Discord to make sure this step has been done!
</div>

In [22]:
import json

In [16]:
SPACE_ID =  "DIBT-for-Esperanto/prompt-translation-for-Esperanto"
HOMEPAGE_URL = "https://dibt-for-esperanto-prompt-translation-for-esperanto.hf.space"
LANGUAGE = None  # set this to your target language, e.g. "French"

In [27]:
assert SPACE_ID and HOMEPAGE_URL and LANGUAGE, "Please set SPACE_ID, HOMEPAGE_URL and LANGUAGE to your Space ID, homepage URL and target language"
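
If you are unsure what `HOMEPAGE_URL` should be, the example values above suggest a simple pattern. The sketch below is a guess based only on that example: it assumes the direct URL is the lowercased Space ID with every run of non-alphanumeric characters turned into a hyphen. `space_homepage_url` is a hypothetical helper, not part of `huggingface_hub`, so check the result against the URL your Space actually serves.

```python
import re


def space_homepage_url(space_id: str) -> str:
    """Guess the direct URL of a Space from its `owner/name` ID (hypothetical helper)."""
    # Assumed convention, inferred from the example values in this notebook:
    # lowercase the ID and replace each run of characters other than
    # letters and digits with a single hyphen.
    subdomain = re.sub(r"[^a-z0-9]+", "-", space_id.lower()).strip("-")
    return f"https://{subdomain}.hf.space"


print(space_homepage_url("DIBT-for-Esperanto/prompt-translation-for-Esperanto"))
# prints https://dibt-for-esperanto-prompt-translation-for-esperanto.hf.space
```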

In [18]:
from huggingface_hub import space_info


assert space_info(SPACE_ID).runtime.storage.get("current") == "small", "Please ensure you have set up persistent storage for your Space (see the steps above)"

In [None]:
import argilla as rg

OWNER_API_KEY = "owner.apikey"  # default owner API key, used if you haven't set the secret
assert OWNER_API_KEY is not None, "Please set OWNER_API_KEY to the API token you just set in the Space settings"

rg.init(api_url=HOMEPAGE_URL, api_key=OWNER_API_KEY)

Finally, we're ready to create our dataset in the `admin` workspace. To test that everything is working, let's upload the original dataset (without translation). You can delete this dataset later from the UI or via the SDK.

In [None]:
from datasets import load_dataset

# load the dataset from the Hub
ds = load_dataset('data-is-better-together/prompts_ranked_multilingual_benchmark')

In [21]:
# create the dataset with a pre-built template
argilla_ds = rg.FeedbackDataset.for_translation(
    use_markdown=True,
    guidelines=None,
    metadata_properties=None,
    vectors_settings=None,
)
argilla_ds

FeedbackDataset(
   fields=[TextField(name='source', title='Source', required=True, type='text', use_markdown=True)]
   questions=[TextQuestion(name='target', title='Target', description='Translate the text.', required=True, type='text', use_markdown=True)]
   guidelines=This is a translation dataset that contains texts. Please translate the text in the text field.)
   metadata_properties=[])
   vectors_settings=[])
)

In [23]:
# create records
records = []
for row in ds["train"]:
    record = rg.FeedbackRecord(
        fields={"source": row["prompt"]},
        metadata=json.loads(row["metadata"]),
        external_id=row["row_idx"],
    )
    records.append(record)

In [24]:
# add records to the dataset
argilla_ds.add_records(records)

In [None]:
# push the dataset to Argilla
argilla_ds.push_to_argilla(f"DIBT Translation for {LANGUAGE}", workspace="admin")

At this point, the dataset is available in the UI. To delete it from the UI, log in as the user `owner` with the password you set in the secrets, or with the default password `12345678` if you haven't added the secret.
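
Deleting the test dataset via the SDK can be sketched as follows. This is a minimal sketch that assumes the Argilla 1.x `FeedbackDataset` API and a client already initialised with `rg.init` using the owner API key. `delete_test_dataset` is a hypothetical helper name, not part of Argilla.

```python
def delete_test_dataset(name: str, workspace: str = "admin") -> None:
    """Delete a dataset that was pushed to Argilla earlier (hypothetical helper)."""
    # Imported lazily so the helper can be defined before Argilla is installed.
    import argilla as rg

    # Assumes rg.init(...) has already been called with the owner API key.
    remote_ds = rg.FeedbackDataset.from_argilla(name, workspace=workspace)
    # Assumption: in Argilla 1.x the remote dataset exposes delete().
    remote_ds.delete()
```

Calling `delete_test_dataset(f"DIBT Translation for {LANGUAGE}")` would remove the dataset pushed above. Deletion cannot be undone, so only run it on the test upload.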

## Translate the source dataset and push it to Argilla

The only remaining step is to translate the dataset and create the final dataset your contributors will be annotating.

There are different options to translate the dataset, such as:

- Using open source models such as nllb-200, Google-T5 or OPUS-MT
- Using closed LLM API providers, such as OpenAI with gpt-4-turbo or Mistral with mistral-large

### Translation models

#### Open Source models

We will start with an example of a translation pipeline built on open source models. Even though these models can run on a CPU, using a GPU is highly recommended to speed up inference.

We will use a model from the [No Language Left Behind (NLLB) initiative from Meta](https://ai.meta.com/blog/nllb-200-high-quality-machine-translation/). A distilled version of this [model is available on Hugging Face](https://huggingface.co/facebook/nllb-200-distilled-600M). This model works across 200 different languages, and their language codes can be found in [this readme](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/README.md).
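
NLLB language codes combine a three-letter language code with a script name, for example `eng_Latn` or `spa_Latn`. As a convenience, a small lookup table can map the human-readable `LANGUAGE` name to its code. The sketch below covers only a handful of languages; `NLLB_CODES` and `nllb_code` are hypothetical names, and you should confirm the code for your language in the model readme.

```python
# A few NLLB language codes; extend this table from the model readme.
NLLB_CODES = {
    "English": "eng_Latn",
    "Spanish": "spa_Latn",
    "French": "fra_Latn",
    "German": "deu_Latn",
    "Dutch": "nld_Latn",
    "Esperanto": "epo_Latn",
}


def nllb_code(language: str) -> str:
    """Return the NLLB code for a language name (hypothetical helper)."""
    key = language.strip().title()  # tolerate "french", " French ", ...
    if key not in NLLB_CODES:
        raise ValueError(f"No NLLB code registered for {language!r}; add it to NLLB_CODES")
    return NLLB_CODES[key]


print(nllb_code("French"))  # prints fra_Latn
```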

First, we will initialize the model and tokenizer with the source language code `src_lang` (here `eng_Latn`, for English).

In [31]:
# !pip install 'transformers[torch]'


In [32]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_path = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_path, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Check if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move the model to the GPU if available (the tokenized inputs are moved at translation time)
model = model.to(device)

Next, we will define a translation function that takes `texts: Union[str, List[str]]` and the target language code `trg_lang`.

In [None]:
def open_translate(texts, trg_lang):
    """Translate a string or a list of strings into the NLLB language `trg_lang`."""
    single = isinstance(texts, str)
    if single:  # If a single text is provided, convert it to a list
        texts = [texts]

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    translated_tokens = model.generate(
        **inputs.to(device),
        # Force the target-language token to be the first generated token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(trg_lang),
    )
    translations = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

    # Return a string for string input and a list for list input
    return translations[0] if single else translations


example = "We will first start with an example of a translation pipeline with open source models. Even though these models are able to run on CPU it is highly recommended to use a GPU in order to speed up inference."
open_translate(example, "spa_Latn")
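
Because the translation function accepts a list of texts, translating in small batches is usually faster than calling it once per prompt, especially on a GPU. The helper below is a minimal sketch of that idea. `translate_in_batches` and its `batch_size` default are assumptions, and it takes the translation function as an argument so that it does not depend on the model being loaded. It also tolerates a function that returns a bare string for a single-element batch.

```python
from typing import Callable, List, Sequence


def translate_in_batches(
    texts: Sequence[str],
    translate_fn: Callable,
    trg_lang: str,
    batch_size: int = 8,
) -> List[str]:
    """Translate `texts` in chunks of `batch_size`, preserving order (hypothetical helper)."""
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start : start + batch_size])
        output = translate_fn(batch, trg_lang)
        # Normalize a bare-string return (single-element batch) to a list.
        if isinstance(output, str):
            output = [output]
        results.extend(output)
    return results
```

With the model loaded, `translate_in_batches(prompts, open_translate, "spa_Latn")` would return one translation per prompt, in the original order. Lower `batch_size` if you run out of memory.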

#### Other models and closed source LLM API providers

Depending on the language you are working in, the NLLB model used above might not work well. One alternative is to use a translation model built specifically for your language; you can find many of these on the Hugging Face Hub, with examples showing how to use them. Alternatively, you might use a closed LLM provider. We provide a separate notebook showing this approach [here](https://github.com/huggingface/data-is-better-together/blob/main/prompt_translation/Translation_with_distilabel_gpt_4_turbo.ipynb). If you want to use this alternative approach, you may want to jump straight to that notebook and skip the rest of this one!

### Add translations as suggestions

Now, we will use the translation function defined above to add pre-filled translation suggestions to the Argilla dataset.

In [None]:
argilla_ds = rg.FeedbackDataset.from_argilla(f"DIBT Translation for {LANGUAGE}", workspace="admin")
argilla_ds

Next, we will loop through the records and add a translation suggestion to each one. Replace `spa_Latn` with the NLLB code of your target language.

In [None]:
from tqdm.auto import tqdm

In [None]:
altered_records = []
for rec in tqdm(argilla_ds.records):
    rec.suggestions = [
        {
            "question_name": "target",
            "value": open_translate(rec.fields["source"], "spa_Latn")
        }
    ]
    altered_records.append(rec)

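Lastly, we will update these records within Argilla. The commented-out cell below shows the same loop for a closed model; it relies on a `closed_translate` function that this notebook does not define.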

In [None]:
# Example of doing this with a closed model
# altered_records = []
# for rec in tqdm(argilla_ds.records):
#     rec.suggestions = [
#         {
#             "question_name": "target",
#             "value": closed_translate(
#                 rec.fields["source"],
#                 "spa_Latn",
#                 max_tokens=len(rec.fields["source"]) + 10,
#             ),
#         }
#     ]
#     altered_records.append(rec)

In [None]:
argilla_ds.update_records(altered_records)