# Multilingual vector search with E5 embedding models

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/supporting-blog-content/multilingual-e5/multilingual-e5.ipynb)

In this example we'll use the multilingual embedding model
[multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) to search a toy dataset of
mixed-language documents. The examples in this notebook follow the blog post of the same title: Multilingual vector search with E5 embedding models.

# 🧰 Requirements

For this example, you will need:

- An Elastic Cloud deployment with an ML node (min. 8 GB memory)
   - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook))


## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page
   - Select **Create deployment**
   - Use the default node types for Elasticsearch and Kibana
   - Add an ML node with **8 GB memory** (the multilingual E5 base model is larger than most)

# Set up Elasticsearch environment

To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to `pip` install the packages we need for this example.

In [None]:
!pip install elasticsearch "eland[pytorch]"

Next we need to import the `elasticsearch` module and the `getpass` module.
`getpass` is part of the Python standard library and is used to securely prompt for credentials.

In [3]:
import getpass

from elasticsearch import Elasticsearch

Now we can instantiate the Python Elasticsearch client.
First we prompt the user for their password and Cloud ID.

🔐 NOTE: `getpass` lets us prompt the user for credentials without echoing them to the terminal. The values are still held in memory as Python variables for the rest of the session.

Then we create a `client` object, an instance of the `Elasticsearch` class.

In [None]:
# Found in the "Manage Deployment" page
CLOUD_ID = getpass.getpass("Enter Elastic Cloud ID: ")

# Password for the "elastic" user generated by Elasticsearch
ELASTIC_PASSWORD = getpass.getpass("Enter Elastic password: ")

# Create the client instance
client = Elasticsearch(cloud_id=CLOUD_ID, basic_auth=("elastic", ELASTIC_PASSWORD))

client.info()

# Set up embedding model

Next we upload the E5 multilingual embedding model into Elasticsearch and create an ingest pipeline to automatically create embeddings when ingesting documents. For more details on this process, please see the blog post: [How to deploy NLP: Text Embeddings and Vector Search](https://www.elastic.co/blog/how-to-deploy-nlp-text-embeddings-and-vector-search)

In [None]:
MODEL_ID = "multilingual-e5-base"

!eland_import_hub_model \
    --cloud-id $CLOUD_ID \
    --es-username elastic \
    --es-password $ELASTIC_PASSWORD \
    --hub-model-id intfloat/$MODEL_ID \
    --es-model-id $MODEL_ID \
    --task-type text_embedding \
    --start
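
Before creating the pipeline, it can help to confirm that the model deployment has actually started. The sketch below is a hypothetical helper, `deployment_state`, that reads the state from a trained model statistics response, such as the body returned by `client.ml.get_trained_models_stats(model_id=MODEL_ID)`. The response shape used here (`trained_model_stats[].deployment_stats.state`) is an assumption based on the Elasticsearch trained models statistics API, so check it against your cluster's version.

```python
def deployment_state(stats, model_id):
    """Return the deployment state for model_id, or None if it is not deployed.

    `stats` is assumed to look like the body of the trained models stats API:
    {"trained_model_stats": [{"model_id": ..., "deployment_stats": {"state": ...}}]}
    """
    for entry in stats.get("trained_model_stats", []):
        if entry.get("model_id") == model_id:
            # Models that were imported but never started have no deployment_stats.
            return entry.get("deployment_stats", {}).get("state")
    return None


# Hand-written sample response (not real cluster output), for illustration only.
sample_stats = {
    "trained_model_stats": [
        {"model_id": "multilingual-e5-base", "deployment_stats": {"state": "started"}}
    ]
}
print(deployment_state(sample_stats, "multilingual-e5-base"))
```

Against a live deployment, the same helper could be called as `deployment_state(client.ml.get_trained_models_stats(model_id=MODEL_ID).body, MODEL_ID)`.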

In [None]:
client.ingest.put_pipeline(
    id="pipeline",
    processors=[
        {
            "inference": {
                "model_id": MODEL_ID,
                "field_map": {"passage": "text_field"},  # field to embed: passage
                "target_field": "passage_embedding",  # embedded field: passage_embedding
            }
        }
    ],
)

# Index documents

We need to add a field to support dense vector storage and search.
Note the `passage_embedding.predicted_value` field below: it stores the dense vector representation of the `passage` field and is populated automatically by the inference processor in the pipeline created above. The `passage_embedding` field also stores metadata from the inference process.

In [None]:
# Define the mapping and settings
mapping = {
    "properties": {
        "id": {"type": "keyword"},
        "language": {"type": "keyword"},
        "passage": {"type": "text"},
        "passage_embedding.predicted_value": {
            "type": "dense_vector",
            "dims": 768,
            "index": "true",
            "similarity": "cosine",
        },
    },
    "_source": {"excludes": ["passage_embedding.predicted_value"]},
}

settings = {
    "index": {
        "number_of_replicas": "1",
        "number_of_shards": "1",
        "default_pipeline": "pipeline",
    }
}

# Create the index (deleting any existing index)
client.indices.delete(index="passages", ignore_unavailable=True)
client.indices.create(index="passages", mappings=mapping, settings=settings)
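
The `dims` value must match the size of the vectors the model produces; 768 is the output size of `multilingual-e5-base`. As a sketch of how the mapping could be parameterized if you swap in another multilingual E5 variant, the hypothetical helper below looks up the dimension from a small table (384 for `multilingual-e5-small`, 1024 for `multilingual-e5-large`, per their model cards). The helper name and the table are illustrative and not part of the original notebook.

```python
# Embedding sizes of the multilingual E5 variants (from their Hugging Face model cards).
E5_DIMS = {
    "multilingual-e5-small": 384,
    "multilingual-e5-base": 768,
    "multilingual-e5-large": 1024,
}


def embedding_mapping(model_id, field="passage_embedding.predicted_value"):
    """Build a dense_vector field mapping sized for the given E5 model."""
    if model_id not in E5_DIMS:
        raise ValueError(f"Unknown embedding size for model: {model_id}")
    return {
        field: {
            "type": "dense_vector",
            "dims": E5_DIMS[model_id],
            "index": True,
            "similarity": "cosine",
        }
    }


print(embedding_mapping("multilingual-e5-base"))
```

The returned dictionary could be merged into the `properties` section of the mapping in place of the hard-coded entry.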

Now that we have the pipeline and mappings ready, we can index our documents. Since this is just a demo, we index only the few toy examples from the blog post.

In [None]:
passages = [
    {
        "id": "doc1",
        "language": "en",
        "passage": """I sat on the bank of the river today.""",
    },
    {
        "id": "doc2",
        "language": "de",
        "passage": """Ich bin heute zum Flussufer gegangen.""",
    },
    {
        "id": "doc3",
        "language": "en",
        "passage": """I walked to the bank today to deposit money.""",
    },
    {
        "id": "doc4",
        "language": "de",
        "passage": """Ich saß heute bei der Bank und wartete auf mein Geld.""",
    },
]

# Index passages, first adding the "passage: " prefix that E5 expects
for doc in passages:
    doc["passage"] = f"""passage: {doc["passage"]}"""
    client.index(index="passages", document=doc)
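
One caveat with the loop above: it modifies `passages` in place, so re-running the cell would add the prefix a second time. A minimal sketch of a non-mutating alternative follows; `with_e5_prefix` is a hypothetical helper name, not part of the E5 or Elasticsearch APIs.

```python
def with_e5_prefix(doc, field="passage", prefix="passage: "):
    """Return a copy of doc whose text field carries the E5 prefix exactly once."""
    text = doc[field]
    if not text.startswith(prefix):
        text = prefix + text
    # Copy the dict so the original document is left untouched.
    return {**doc, field: text}


original = {"id": "doc1", "language": "en", "passage": "I sat on the bank of the river today."}
prefixed = with_e5_prefix(original)

print(prefixed["passage"])  # passage: I sat on the bank of the river today.
print(original["passage"])  # I sat on the bank of the river today.
```

With this helper, the indexing call could become `client.index(index="passages", document=with_e5_prefix(doc))`. Note that a text that happens to begin with "passage: " on its own would be left without an added prefix.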

# Multilingual semantic search

In [None]:
def query(q):
    """Query with embeddings, first adding the "query: " prefix that E5 expects."""

    return client.search(
        index="passages",
        knn={
            "field": "passage_embedding.predicted_value",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": MODEL_ID,
                    "model_text": f"query: {q}",
                }
            },
            "k": 2,  # for the demo, we're always just searching for pairs of passages
            "num_candidates": 5,
        },
    )


def pretty_response(response):
    """Pretty print search responses."""

    for hit in response["hits"]["hits"]:
        score = hit["_score"]
        doc_id = hit["_source"]["id"]
        language = hit["_source"]["language"]
        passage = hit["_source"]["passage"]
        print()
        print(f"ID: {doc_id}")
        print(f"Language: {language}")
        print(f"Passage: {passage}")
        print(f"Score: {score}")
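
The `knn` clause in `query` hard-codes `k` and `num_candidates`. As a sketch of how it could be parameterized, the hypothetical `build_knn` helper below builds the same clause as a plain dictionary and can optionally restrict results to one language through the `filter` option of kNN search. It also checks that `num_candidates` is at least `k`, which Elasticsearch requires. The field and model names are the ones used in this notebook; the helper itself is only an illustration.

```python
def build_knn(q, model_id="multilingual-e5-base", k=2, num_candidates=5, language=None):
    """Build a kNN search clause for an E5 query, optionally filtered by language."""
    if num_candidates < k:
        raise ValueError("num_candidates must be greater than or equal to k")

    knn = {
        "field": "passage_embedding.predicted_value",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": model_id,
                "model_text": f"query: {q}",  # E5 expects the "query: " prefix
            }
        },
        "k": k,
        "num_candidates": num_candidates,
    }
    if language is not None:
        # Only documents whose `language` keyword matches are considered.
        knn["filter"] = {"term": {"language": language}}
    return knn


print(build_knn("riverside", language="de"))
```

A search restricted to German passages could then be written as `client.search(index="passages", knn=build_knn("riverside", language="de"))`.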

In [None]:
# Example 1
pretty_response(query("riverside"))


ID: doc1
Language: en
Passage: passage: I sat on the bank of the river today.
Score: 0.88001645

ID: doc2
Language: de
Passage: passage: Ich bin heute zum Flussufer gegangen.
Score: 0.87662137
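
The scores above are not raw cosine similarities. For a `dense_vector` field with `cosine` similarity, Elasticsearch documents the kNN `_score` as `(1 + cosine) / 2`, which keeps scores non-negative. The sketch below, with hypothetical helper names, converts in both directions and shows the cosine computation on two small made-up vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_to_score(cos):
    """Map a cosine similarity in [-1, 1] to a kNN _score in [0, 1]."""
    return (1 + cos) / 2


def score_to_cosine(score):
    """Invert cosine_to_score."""
    return 2 * score - 1


# Made-up 2-d vectors, for illustration only.
print(cosine_to_score(cosine([1.0, 0.0], [0.0, 1.0])))  # orthogonal vectors -> 0.5

# The top hit for "riverside" scored 0.88001645.
print(round(score_to_cosine(0.88001645), 4))  # 0.76
```

Read this way, the two "riverside" hits have cosine similarities of roughly 0.76 and 0.75 with the query, so the gap between them is small.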


In [None]:
# Example 2
pretty_response(query("Geldautomat"))


ID: doc4
Language: de
Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.
Score: 0.8967148

ID: doc3
Language: en
Passage: passage: I walked to the bank today to deposit money.
Score: 0.8863925


In [None]:
# Example 3a
pretty_response(query("movement"))


ID: doc3
Language: en
Passage: passage: I walked to the bank today to deposit money.
Score: 0.87475425

ID: doc2
Language: de
Passage: passage: Ich bin heute zum Flussufer gegangen.
Score: 0.8741033


In [None]:
# Example 3b
pretty_response(query("stillness"))


ID: doc4
Language: de
Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.
Score: 0.85991657

ID: doc1
Language: en
Passage: passage: I sat on the bank of the river today.
Score: 0.8561436
