πŸ† AI Search Providers Benchmark

[🌐 Website] [📂 Dataset] [🏅 Leaderboard]

📚 Introduction

The AI Search Wars Have Begun: ChatGPT Search, Andi Search, Gemini Grounding, and more. AI-powered search engines are evolving rapidly, offering users richer search capabilities and more personalized experiences. This benchmark fetches, prepares, and analyzes data from the leading AI search providers currently on the market. Our focus includes:

  1. Datura (Meta - Bittensor Network, Subnet 22) - Visit Datura
  2. You.com - Visit You.com
  3. OpenAI ChatGPT - Learn about ChatGPT Search
  4. Perplexity - Visit Perplexity
  5. Google Gemini - Visit Google Gemini
  6. Andi Search - Visit Andi Search
  7. X Grok - Visit X Grok

Through this benchmark, we aim to provide insights into the capabilities and performance of these AI search engines, helping users and developers make informed decisions about which platforms best meet their needs.

📊 Provider Performance Comparison

Below is a table showcasing the results of each provider in various aspects of our scoring mechanism:

| Provider | Summary Text Relevance | Link Content Relevance | Performance (s) | Embedding Similarity | Expected Answer Relevance |
| --- | --- | --- | --- | --- | --- |
| Datura Horizon 1.0 | 92.57% | 80.46% | 38.35 | 72.60% | 69.69% |
| Datura Orbit 1.0 | 88.03% | 77.32% | 19.40 | 72.53% | 69.86% |
| Datura Nova 1.0 | 83.80% | 71.22% | 8.84 | 72.20% | 69.56% |
| Andi Search | 23.30% | 62.06% | 6.68 | 20.70% | 18.88% |
| OpenAI ChatGPT | 90.70% | 61.36% | 2.23 | 73.27% | 71.38% |
| You.com | 37.64% | 60.11% | 1.73 | 54.77% | 50.40% |
| Perplexity | 91.82% | 59.63% | 5.68 | 73.30% | 70.29% |
| Google Gemini | 71.58% | 53.42% | 1.81 | 60.92% | 56.90% |

πŸ† Top Model Analysis

Charts: 📊 Best in Search Summary · 🌐 Best in Web Link Content Relevance

🐦 Twitter Relevance Results

In this section, we present results for Summary and Twitter Content Relevance. Only Datura and X Grok are included, as they are the only providers with access to Twitter data; the remaining providers cannot return Twitter-related insights and are therefore excluded.

Below is a table showcasing the results of each provider in terms of Summary and Twitter Content Relevance:

| Provider | Summary Text Relevance | Link Content Relevance | Performance (s) | Embedding Similarity | Expected Answer Relevance |
| --- | --- | --- | --- | --- | --- |
| Datura Horizon 1.0 | 79.00% | 44.92% | 31.45 | 71.72% | 65.47% |
| Datura Orbit 1.0 | 66.19% | 40.63% | 24.59 | 68.18% | 62.37% |
| Grok 2 | 92.68% | 29.95% | 30.20 | 76.44% | 75.37% |
| Datura Nova 1.0 | 52.68% | 23.43% | 8.59 | 66.00% | 59.20% |

📊 Top Models Chart

Charts: 🥇 Best in Twitter Summary · 🔍 Best in Twitter Link Content Relevance

📈 Scoring Mechanism

To evaluate the performance and relevance of AI search providers, we have developed a comprehensive scoring mechanism. This mechanism assesses the quality of responses based on several key factors:

  1. Summary Text Relevance:

    • We utilize a language model (LLM) to evaluate the relevance of the summary text provided by each AI search provider in relation to the question asked.
    • In our initial tests, we employed GPT-4o to assess the relevance between the summary text and the questions.
  2. Link Title and Description Relevance:

    • Each link's title and description returned by the provider are checked for contextual relevance to the question.
    • This involves analyzing whether the content of the links is pertinent to the query, ensuring that the search results are not only accurate but also useful.
  3. Performance:

    • This metric evaluates the response time of each provider, measured in seconds.
    • Faster response times contribute positively to the overall score, as they enhance the user experience by providing timely information.
  4. Embedding Similarity:

    • We calculate the similarity between the embeddings of the question and the returned results.
    • This involves using vector representations to measure how closely related the question is to the content of the results, providing an additional layer of relevance assessment. A minimal sketch of this check is shown below.

By incorporating these factors, our scoring mechanism provides a holistic view of each provider's capabilities, helping users and developers make informed decisions based on both qualitative and quantitative data.
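
To make the embedding-similarity component concrete, here is a minimal sketch. It assumes the sentence-transformers library and a results file in the response format documented later in this README; the model name (all-MiniLM-L6-v2) and the file path are illustrative placeholders, not the exact setup behind the published scores.

# embedding_similarity_sketch.py -- illustrative only
# Computes the cosine similarity between each question and the summary a
# provider returned for it, then averages the scores for that provider.
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def embedding_similarity(question: str, summary: str) -> float:
    """Cosine similarity between the question and the returned summary text."""
    question_vec, summary_vec = model.encode([question, summary], convert_to_tensor=True)
    return float(util.cos_sim(question_vec, summary_vec))

if __name__ == "__main__":
    # results/example_provider.jsonl is a hypothetical path; each line is one
    # record in the response structure shown under "Example Response" below.
    with open("results/example_provider.jsonl", encoding="utf-8") as fh:
        scores = [
            embedding_similarity(record["question"], record["result"])
            for record in map(json.loads, fh)
        ]
    print(f"Mean embedding similarity: {sum(scores) / len(scores):.2%}")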

Chart: Contextual Relevance

⚡ Fastest and Most Affordable Models

In this section, we evaluate the AI search providers based on their speed and cost-effectiveness. The following chart illustrates the latency of each model, providing insights into which models offer the fastest response times at the most affordable rates.

Charts: 🥇 Web Performance · 🥇 Twitter Performance

🧰 Scraper Scripts

All scraper scripts are located in the ./scraper directory. These scripts collect data directly from each provider. We run them on our local machines, and the collected data is stored in the results directory. The sketch below illustrates the general pattern they follow.
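
The individual scripts differ per provider, but the overall flow is the same: send each benchmark question to the provider, time the call, normalize the response, and write it to a file under results/. The sketch below shows that pattern against a hypothetical HTTP endpoint; the URL, payload shape, and field names are assumptions, not a real provider API.

# scraper/example_provider.py -- sketch of the general scraper pattern only;
# the endpoint and payload fields below are hypothetical.
import json
import time
from pathlib import Path

import requests

ENDPOINT = "https://api.example-provider.com/search"  # placeholder URL

def fetch(question: str) -> dict:
    """Query the provider once and normalize its answer into our record format."""
    started = time.perf_counter()
    response = requests.post(ENDPOINT, json={"query": question}, timeout=60)
    response.raise_for_status()
    payload = response.json()
    return {
        "question": question,
        "result": payload.get("summary", ""),
        "search_results": payload.get("links", []),
        "response_time": round(time.perf_counter() - started, 2),  # seconds
    }

if __name__ == "__main__":
    output = Path("results/example_provider.jsonl")
    output.parent.mkdir(exist_ok=True)
    # Assumes each line of dataset/data.jsonl carries a "question" field,
    # as in the example response shown later in this README.
    with open("dataset/data.jsonl", encoding="utf-8") as src, output.open("w", encoding="utf-8") as dst:
        for line in src:
            dst.write(json.dumps(fetch(json.loads(line)["question"])) + "\n")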

🛠️ Methodology

The Talc AI team creates datasets by:

  1. Scraping various trustworthy sources for interesting facts and lessons.
  2. Creating sets of Q&A to represent those facts.
  3. Adjusting the tone, style, and distribution of queries to best match production users.

They categorize their dataset into four major categories:

  1. Simple: Basic questions requiring minimal analysis.
  2. Complex: Questions needing synthesis across multiple sources.
  3. Hallucination Inducing: Questions with false premises to test AI's factual accuracy.
  4. News: Questions with answers that change due to recent developments.

This approach ensures that the dataset is both comprehensive and adaptable to real-world scenarios, making it an excellent resource for benchmarking AI search providers.

🔍 Data Fetch Process

In our project, we aim to gather data from various AI search providers. However, many of these providers do not offer APIs for direct data retrieval. To address this, we have developed simple scraper scripts to extract data from each provider for our dataset.

📊 Dataset

In our first test, we decided to utilize a dataset from the Talc AI SearchBench repository. We extend our gratitude to the Talc AI team for their valuable contribution.

The SearchBench repository addresses common issues with traditional benchmarks by focusing on practical, everyday use cases rather than theoretical limits. It emphasizes realistic user queries and the incorporation of new knowledge over time, ensuring that AI products remain relevant and useful.

📂 Dataset Source

Our dataset is sourced from the Talc AI Search Bench repository and is stored in the dataset/data.jsonl file.
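
A few lines of Python are enough to load dataset/data.jsonl and see how the questions break down. Note that the "category" field name below is an assumption based on the categories described in the Methodology section; adjust it if the file uses a different key.

# Quick look at the benchmark questions stored in dataset/data.jsonl.
import json
from collections import Counter

with open("dataset/data.jsonl", encoding="utf-8") as fh:
    records = [json.loads(line) for line in fh if line.strip()]

print(f"{len(records)} questions loaded")
# "category" is an assumed field name; see the Methodology section above.
print(Counter(record.get("category", "unknown") for record in records))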

Charts: 📈 Areas (subcategory percentages) · 📉 Categories (category percentages)

By using this approach, we ensure that we can gather the necessary data for our analysis, even in the absence of direct API access from the providers.

📝 Structure of Responses

The structure of the responses is as follows:

  1. Question: This is the query we send to the AI search provider.
  2. Result: This field contains the summary text returned by the provider.
  3. Search Results: This includes web links, titles, and descriptions returned by the providers.
  4. Response Time: The time taken to receive the response from the provider, measured in seconds.

📝 Example Response

{
    "id": "c0683ac6-baee-4e2a-9290-8b734b777301",
    "question": "What did safety reviews conclude about the danger of experiments at the Large Hadron Collider?",
    "result": "Safety reviews have consistently concluded that the experiments at the Large Hadron Collider pose no significant risk to the public or the environment.",
    "search_results": [
        {
            "title": "CERN's Safety Assessment",
            "url": "https://home.cern/science/experiments/safety",
            "description": "An overview of the safety measures and assessments conducted by CERN regarding the LHC experiments."
        },
        {
            "title": "LHC Safety: Public Concerns Addressed",
            "url": "https://www.scientificamerican.com/article/lhc-safety-public-concerns/",
            "description": "This article addresses public concerns about the safety of the LHC and explains why these fears are unfounded."
        }
    ],
    "response_time": 10
}
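
For readers who prefer a typed view of the same structure, the dataclasses below mirror those fields. They are a convenience sketch for working with stored results, not part of the benchmark code itself.

# Typed mirror of the response structure shown above (sketch only).
from dataclasses import dataclass, field

@dataclass
class SearchResult:
    title: str
    url: str
    description: str

@dataclass
class ProviderResponse:
    id: str
    question: str
    result: str
    search_results: list[SearchResult] = field(default_factory=list)
    response_time: float = 0.0  # seconds

def parse_response(raw: dict) -> ProviderResponse:
    """Build a ProviderResponse from one parsed JSON record."""
    return ProviderResponse(
        id=raw["id"],
        question=raw["question"],
        result=raw["result"],
        search_results=[SearchResult(**item) for item in raw.get("search_results", [])],
        response_time=raw.get("response_time", 0.0),
    )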

πŸƒβ€β™‚οΈ Running benchmarks

🛠️ Preparing results

Before running benchmarks locally, you need to prepare the results from the different providers. You can do this by running the scraper scripts located in the ./scraper directory. These scripts collect data from each provider and store the results in the results directory. If you want to run them all in one go, a small driver script like the sketch below can help.
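
The driver below assumes each scraper is a standalone Python file in the ./scraper directory, which may not match the repository's exact layout; treat it as a convenience sketch rather than part of the project.

# run_all_scrapers.py -- convenience sketch, not part of the repository.
import subprocess
import sys
from pathlib import Path

for script in sorted(Path("scraper").glob("*.py")):
    print(f"Running {script} ...")
    subprocess.run([sys.executable, str(script)], check=True)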

🚀 Running the benchmark

After preparing the results, you can score them locally by following the scoring guide.

🚀 Future Directions

We are committed to regularly updating this benchmark to reflect the latest advancements in AI search technologies. Your feedback is invaluable to us, as we strive to make this benchmark as practical and user-focused as possible.

If you encounter any inaccuracies or areas for improvement, please share your thoughts with us. We are eager to enhance the benchmark based on your insights.

For inquiries, suggestions, or contributions, feel free to contact us.