+ "cells": [
+ "source": [
+ "# How to implement Image search using Elasticsearch"
+ "source": [
+ "This notebook shows how to implement image search using Elasticsearch. You will index documents with image embeddings (generated or pre-generated) and then use an NLP model to search for images with a natural-language description.\n",
+ "\n",
+ "## Prerequisites\n",
+ "Before you begin, create an Elastic Cloud deployment and [autoscale](https://www.elastic.co/guide/en/cloud/current/ec-autoscaling.html) it so that it has at least one machine learning (ML) node with enough memory (4GB). Also ensure that the Elasticsearch cluster is running.\n",
+ "\n",
+ "If you don't already have an Elastic deployment, you can sign up for a free [Elastic Cloud trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook)."
+ "source": [
+ "### Install Python requirements\n",
+ "Before you start, install all required Python dependencies."
+ "outputs": [],
+ "source": [
+ "!pip install sentence-transformers eland elasticsearch transformers torch tqdm Pillow streamlit"
+ "from elasticsearch import Elasticsearch\n",
+ "from elasticsearch.helpers import parallel_bulk\n",
+ "import requests\n",
+ "import os\n",
+ "import sys\n",
+ "\n",
+ "import zipfile\n",
+ "from tqdm.auto import tqdm\n",
+ "import pandas as pd\n",
+ "from PIL import Image\n",
+ "from sentence_transformers import SentenceTransformer\n",
+ "import urllib.request\n",
+ "\n",
+ "# import urllib.error\n",
+ "import json\n",
+ "from getpass import getpass"
+ "source": [
+ "### Upload NLP model for querying\n",
+ "\n",
+ "Using the [`eland_import_hub_model`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch) script, download and install the [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) model. It converts your search query into a vector, which is then used to search the set of images stored in Elasticsearch.\n",
+ "\n",
+ "To get your Cloud ID, go to [Elastic Cloud](https://cloud.elastic.co) and copy the Cloud ID from the deployment overview page.\n",
+ "\n",
+ "To authenticate your request, you can use an [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key). Alternatively, you can use your cloud deployment username and password."
+ "source": [
+ "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
+ "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
+ "\n",
+ "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
+ "ELASTIC_API_KEY = getpass(\"Elastic API Key: \")"
+ "!eland_import_hub_model --cloud-id $ELASTIC_CLOUD_ID --hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 --task-type text_embedding --es-api-key $ELASTIC_API_KEY --start --clear-previous"
+ "source": [
+ "### Connect to Elasticsearch cluster\n",
+ "Use your own cluster details: `ELASTIC_CLOUD_ID` and `ELASTIC_API_KEY`."
+ "source": [
+ "es = Elasticsearch(\n",
+ " cloud_id=ELASTIC_CLOUD_ID,\n",
+ " api_key=ELASTIC_API_KEY,\n",
+ " request_timeout=600,\n",
+ ")\n",
+ "\n",
+ "es.info() # should return cluster info"
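As an alternative to the API key used in the connection code, the client can authenticate with the deployment username and password. The sketch below only assembles the keyword arguments, so it can be inspected without a live cluster. `connection_kwargs` is a hypothetical helper, not part of the Elasticsearch client, and the `basic_auth` parameter assumes the 8.x Python client.

```python
from typing import Any, Dict, Optional


def connection_kwargs(
    cloud_id: str,
    api_key: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
    request_timeout: int = 600,
) -> Dict[str, Any]:
    """Build keyword arguments for Elasticsearch(...) (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "cloud_id": cloud_id,
        "request_timeout": request_timeout,
    }
    if api_key:
        # Prefer the API key when one is supplied.
        kwargs["api_key"] = api_key
    elif username and password:
        # Assumption: 8.x client, which accepts a (username, password) tuple.
        kwargs["basic_auth"] = (username, password)
    else:
        raise ValueError("Provide either an API key or a username and password")
    return kwargs


# Usage with a reachable deployment (not executed here):
# es = Elasticsearch(**connection_kwargs(ELASTIC_CLOUD_ID, username="elastic",
#                                        password=getpass("Password: ")))
```

Keeping credential selection in one place makes it easy to switch between the two authentication methods without touching the rest of the notebook.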
+ "source": [
+ "### Create Index and mappings for Images\n",
+ "Before you can index documents into Elasticsearch, you need to create an index with the correct mappings."
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "id": "xAkc1OVcOxy3"
+ {
+ "data": {
+ "text/plain": [
+ "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'images'})"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ "source": [
+ "# Destination Index name\n",
+ "INDEX_NAME = \"images\"\n",
+ "\n",
+ "INDEX_MAPPING = {\n",
+ " \"properties\": {\n",
+ " \"image_embedding\": {\n",
+ " \"type\": \"dense_vector\",\n",
+ " \"dims\": 512,\n",
+ " \"index\": True,\n",
+ " \"similarity\": \"cosine\",\n",
+ " },\n",
+ " \"photo_id\": {\"type\": \"keyword\"},\n",
+ " \"photo_image_url\": {\"type\": \"keyword\"},\n",
+ " \"ai_description\": {\"type\": \"text\"},\n",
+ " \"photo_description\": {\"type\": \"text\"},\n",
+ " \"photo_url\": {\"type\": \"keyword\"},\n",
+ " \"photographer_first_name\": {\"type\": \"keyword\"},\n",
+ " \"photographer_last_name\": {\"type\": \"keyword\"},\n",
+ " \"photographer_username\": {\"type\": \"keyword\"},\n",
+ " \"exif_camera_make\": {\"type\": \"keyword\"},\n",
+ " \"exif_camera_model\": {\"type\": \"keyword\"},\n",
+ " \"exif_iso\": {\"type\": \"integer\"},\n",
+ " }\n",
+ "}\n",
+ "\n",
+ "# Index settings\n",
+ "INDEX_SETTINGS = {\n",
+ " \"index\": {\n",
+ " \"number_of_replicas\": \"1\",\n",
+ " \"number_of_shards\": \"1\",\n",
+ " \"refresh_interval\": \"5s\",\n",
+ " }\n",
+ "}\n",
+ "\n",
+ "# Delete the index first if it already exists\n",
+ "if es.indices.exists(index=INDEX_NAME):\n",
+ "    print(\"Deleting existing %s\" % INDEX_NAME)\n",
+ "    es.indices.delete(index=INDEX_NAME, ignore=[400, 404])\n",
+ "\n",
+ "print(\"Creating index %s\" % INDEX_NAME)\n",
+ "es.indices.create(\n",
+ " index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS, ignore=[400, 404]\n",
+ ")"
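Once the index holds image embeddings, a natural-language query can be run as a kNN search against the `image_embedding` field. The following is a minimal sketch rather than the notebook's own search code: `build_knn_search` is a hypothetical helper, the model ID assumes that eland registered the uploaded model as `sentence-transformers__clip-vit-b-32-multilingual-v1`, and the `k` and `num_candidates` defaults are illustrative.

```python
from typing import Any, Dict

# Assumption: ID under which eland registered the uploaded model.
MODEL_ID = "sentence-transformers__clip-vit-b-32-multilingual-v1"


def build_knn_search(
    query_text: str, k: int = 5, num_candidates: int = 50
) -> Dict[str, Any]:
    """Return the `knn` section of a search request (hypothetical helper)."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "field": "image_embedding",  # dense_vector field from the mapping
        "k": k,
        "num_candidates": num_candidates,
        # Let Elasticsearch embed the query text with the deployed model.
        "query_vector_builder": {
            "text_embedding": {"model_id": MODEL_ID, "model_text": query_text}
        },
    }


knn = build_knn_search("a dog playing in the snow")

# With a live cluster and a populated index, the search might look like:
# response = es.search(
#     index=INDEX_NAME,
#     knn=knn,
#     source=["photo_id", "photo_description", "photo_image_url"],
# )
```

Because the query vector is built server-side by the text embedding model, the client never has to load the model itself. This only works if the model's output matches the 512 dimensions declared in the mapping.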
+ "source": [
+ "### Get image dataset and embeddings\n",
+ "Download the following files:\n",
+ "- The example image dataset from [Unsplash](https://github.com/unsplash/datasets)\n",
+ "- The [image embeddings](https://github.com/radoondas/flask-elastic-nlp/blob/main/embeddings/images/image-embeddings.json.zip), pre-generated using the CLIP model\n",
+ "\n",
+ "Then unzip both files."
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "zFGaPDRR5mqT",
+ "outputId": "0114cdd6-a714-41ab-9b46-3013bd36698a"
+ "source": [
+ "!curl -L https://unsplash.com/data/lite/1.2.0 -o unsplash-research-dataset-lite-1.2.0.zip\n",
+ "!curl -L https://raw.githubusercontent.com/radoondas/flask-elastic-nlp/main/embeddings/images/image-embeddings.json.zip -o image-embeddings.json.zip"
+ "source": [
+ "# Unzip downloaded files\n",
+ "UNSPLASH_ZIP_FILE = \"unsplash-research-dataset-lite-1.2.0.zip\"\n",
+ "EMBEDDINGS_ZIP_FILE = \"image-embeddings.json.zip\"\n",
+ "\n",
+ "with zipfile.ZipFile(UNSPLASH_ZIP_FILE, \"r\") as zip_ref:\n",
+ "    print(f\"Extracting file {UNSPLASH_ZIP_FILE}.\")\n",
+ "    zip_ref.extractall(\"data/unsplash/\")\n",
+ "\n",
+ "with zipfile.ZipFile(EMBEDDINGS_ZIP_FILE, \"r\") as zip_ref:\n",
+ "    print(f\"Extracting file {EMBEDDINGS_ZIP_FILE}.\")\n",
+ "    zip_ref.extractall(\"data/embeddings/\")"
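After extraction, the embeddings need to be sent to Elasticsearch in bulk. The sketch below shows one way to turn records into bulk actions suitable for `parallel_bulk`. The record layout (one JSON object per line with `photo_id` and `image_embedding` keys) and the helper name `generate_actions` are assumptions for illustration, and the sample values are made up.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def generate_actions(
    lines: Iterable[str], index_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per non-empty JSON line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        # Assumption: every record carries a unique `photo_id`.
        yield {"_index": index_name, "_id": doc["photo_id"], "_source": doc}


# Tiny in-memory sample standing in for the extracted file.
# Vectors are shortened here; the mapping expects 512 dimensions.
sample = [
    '{"photo_id": "abc", "image_embedding": [0.1, 0.2]}',
    "",
    '{"photo_id": "def", "image_embedding": [0.3, 0.4]}',
]
actions = list(generate_actions(sample, "images"))

# With a live cluster, the actions could be streamed like this:
# with open(path) as f:
#     for ok, result in parallel_bulk(es, generate_actions(f, INDEX_NAME),
#                                     chunk_size=100):
#         if not ok:
#             print(result)
```

Using a generator keeps memory use flat, since documents are parsed and handed to the bulk helper one at a time instead of being loaded all at once.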
+ "source": [
+ "# Import all pregenerated image embeddings\n",
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "id": "32xrbSUXTODQ"
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 375
+ },
+ "id": "wdicpvRlzmXG",
+ "outputId": "00550041-0aed-4f51-ccd3-18eb705ff7ed"
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iUAbRqr8II-x"
+ },
+ "source": [
+ "### Install tunnel library"
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "9Wb7GOWMXFnF",
+ "outputId": "6db23ef3-b25e-4f80-a3cb-6d08c1c78c16"
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "851CeYi8jvuF",
+ "outputId": "46a64023-e990-4900-f482-5558237f08cc"
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "inF7ceBmjyE3",
+ "outputId": "559ce180-3f0f-4475-c9a9-46dc91389276"
