5 Simple Questions to Test Generative AI Chatbots' Web Search Accuracy, Updated Version

by Yuveganlife, January 3, 2024
[ Also See Updated Version: Evaluating the Web Search Accuracy of 16 AI Chatbots in Mid-2024 ]

Introduction

This post is the updated version of "How Accurate Are AI Chatbots at Web Search? 5 Simple Questions to Test Their Search Savvy".

After the original version of the chatbot web search test was posted on December 26, new findings and corrections emerged that made it necessary to revise the test questions and the product candidates for retesting.

Test candidate updates:
  • Removed Poe (Claude-instant; Gemini-Pro; Llama-2-70b; GPT-3.5-Turbo-Instruct; Google PaLM) as they did not appear to be able to search the web.
  • Removed Bing Chat as it is almost the same as Bing Copilot
  • Added Poe (Web-Search)
  • Added HuggingChat (Mixtral-8x7B-Instruct-v0.1; Llama-2-70b-chat-hf)
Test question updates:
  • Combined the When and Where questions to test whether chatbots can answer two questions in one round
  • Added a new, simple question about the current local date and time for a city
  • Replaced "How many recipe sites" with "How many podcasts" for Question 3, because Google had already indexed the original test post with a high ranking, causing a kind of test data contamination. It is interesting to see how each chatbot's new answer was affected by the new search results.

Revision History

  • 2024.01.06: Retested the 5 questions against iAsk's new software release of Jan 6, 2024; the new release passed Question 5.

Evaluation Goals

Our evaluation goals are to assess the chatbots' ability to:
  • accurately understand user questions
  • retrieve relevant, up-to-date, and accurate information from the web
  • provide correct and proper responses through inference and reasoning
Furthermore, we aim to measure the frequency of chatbot hallucinations, i.e., instances of inaccurate or fabricated responses. These evaluation goals will help us determine which chatbots are more effective and intelligent at providing accurate and useful information to users.

Criteria for the Test Candidates

We used the following criteria to select the AI-powered chatbot candidates for testing:
  • They had to be generative, meaning they could understand and generate responses in natural language
  • They had to have a web search feature
  • They had to be free and widely accessible, so that a broad range of users could test them and provide feedback
These criteria allowed us to narrow our focus to chatbots that were most relevant to our research goals, while also ensuring that our testing was inclusive and representative. We focus solely on free generative AI chatbots for our testing of everyday web search tasks because: (1) they're accessible to everyone, similar to using Google or other search engines, which have been free for a long time; and (2) as a new technology, free access allows more people to try the products, which can help accelerate their advancement and refinement.

Selected AI-Powered Chatbots

Based on the above criteria, the following 12 chatbot models from 9 websites were selected for testing:
  • Bing Copilot (free ChatGPT-4)
  • HuggingChat (Mixtral-8x7B-Instruct-v0.1; Llama-2-70b-chat-hf)
  • iAsk
  • Komo
  • Perplexity (ChatGPT-3.5 and Free ChatGPT-4 with Copilot: 5 queries every 4 hours)
  • Phind (Phind V9 and Free ChatGPT-4: 10 queries per day)
  • Pi
  • Poe (Web-Search)
  • You (Smart)
The links to these chatbots can be found here.

Evaluation Scope

By focusing our evaluation on 5 simple factual questions requiring inference, we can establish a controlled and reliable baseline for assessing chatbot performance and identifying areas for improvement.

Yuveganlife.com, a new website launched in the summer of 2023, is relatively unknown but well indexed by search engines. By contrast, Eat Just has established itself over the past 10 years, and its plant-based and cell-based food products are well known around the world.

This comparison offers a unique opportunity to evaluate the chatbot's ability to accurately search for and infer information based on newly updated web data, while also testing for potential errors in the LLM training process.

By asking for the current local date and time of a city, we can test whether each product can get the correct date and time without time zone issues, which can directly affect the quality of practical advice, such as travel schedules, appointment booking, or financial statements, that the chatbot provides to the user.

We did not give the chatbots any feedback during testing, regardless of whether their answers to these 5 questions were as expected.

Multiple LLMs from the same website (such as Phind.com) were tested using two different user accounts in two different browsers.

Limitation of Keyword-Based Search Engines

While these questions are simple enough to be handled easily by a well-designed chatbot, they often cannot be answered by a traditional keyword-based search engine.

For example, when asking "Is Yuveganlife.com a vegan NPO?" to Google and Bing, both search engines simply returned web pages containing some of the keywords but did not provide a clear answer to the question.

As the screenshot below shows, Google search cannot retrieve the relevant information:
screenshot for Google search result


Bing search cannot retrieve the relevant information either:
screenshot for Bing search result

Test Execution Summary

We asked each chatbot the following 5 simple questions:
  • Q1: Is Yuveganlife.com a vegan NPO?
  • Q2: When and where was yuveganlife.com established?
  • Q3: How many vegan podcasts are listed on yuveganlife.com?
  • Q4: What is the current local date and time in Vancouver, Canada?
  • Q5: Does Eat Just manufacture Just Meat, a plant-based meat substitute made from pea protein?
Each chatbot receives three points if the answer is correct, partial credit if the answer is partially correct (see each question's scoring criteria), and zero points if the answer is totally wrong.

| Chatbot | Q1: What | Q2: When and where | Q3: How many | Q4: What date/time | Q5: Yes/No | Total Score |
| --- | --- | --- | --- | --- | --- | --- |
| Bing Copilot | 2 | 0 | 0 | 3 | 3 | 8 |
| HuggingChat (Mixtral-8x7B-Instruct-v0.1) | 2 | 0 | 0 | 1 | 0 / h | 3 |
| HuggingChat (Llama-2-70b-chat-hf) | 2 | 0 | 0 | 1 | 0 / h | 3 |
| iAsk | 3 | 0 | 0 | 0 / h | 3 | 6 |
| Komo | 3 | 2 / h | 0 / h | 0 / h | 0 / h | 5 |
| Perplexity (ChatGPT-3.5) | 3 | 3 | 2 | 1 | 2 | 11 |
| Perplexity (ChatGPT-4) | 3 | 3 | 2 | 1 | 3 | 12 |
| Phind (Phind V9) | 3 | 3 | 2 | 1 | 1 / h | 10 |
| Phind (ChatGPT-4) | 3 | 3 | 2 | 3 | 3 | 14 |
| Pi | 3 | 3 | 2 | 1 | 0 / h | 9 |
| Poe (Web-Search) | 3 | 0 | 0 | 1 | 0 / h | 4 |
| You (Smart) | 3 | 1.5 | 0 | 0 / h | 0 / h | 4.5 |
Note: Pi and Poe do not provide web search result references. " / h" means the answer included a hallucination.
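
As a cross-check, the per-question scores in the table can be re-aggregated programmatically. The short Python sketch below is an illustration written for this post, not part of the original test procedure; it simply transcribes the table above and re-derives the totals and the aggregate statistics quoted later in the Conclusion:

```python
# Per-question scores (Q1..Q5) transcribed from the results table.
scores = {
    "Bing Copilot":                        [2, 0, 0, 3, 3],
    "HuggingChat (Mixtral-8x7B-Instruct)": [2, 0, 0, 1, 0],
    "HuggingChat (Llama-2-70b-chat-hf)":   [2, 0, 0, 1, 0],
    "iAsk":                                [3, 0, 0, 0, 3],
    "Komo":                                [3, 2, 0, 0, 0],
    "Perplexity (ChatGPT-3.5)":            [3, 3, 2, 1, 2],
    "Perplexity (ChatGPT-4)":              [3, 3, 2, 1, 3],
    "Phind (Phind V9)":                    [3, 3, 2, 1, 1],
    "Phind (ChatGPT-4)":                   [3, 3, 2, 3, 3],
    "Pi":                                  [3, 3, 2, 1, 0],
    "Poe (Web-Search)":                    [3, 0, 0, 1, 0],
    "You (Smart)":                         [3, 1.5, 0, 0, 0],
}
# Number of answers flagged "/ h" (hallucination) per chatbot in the table.
hallucinations = {
    "HuggingChat (Mixtral-8x7B-Instruct)": 1, "HuggingChat (Llama-2-70b-chat-hf)": 1,
    "iAsk": 1, "Komo": 4, "Phind (Phind V9)": 1, "Pi": 1,
    "Poe (Web-Search)": 1, "You (Smart)": 2,
}

n_answers = len(scores) * 5                          # 60 answers in total
max_total = n_answers * 3                            # 180 possible points
grand_total = sum(sum(s) for s in scores.values())   # 89.5
n_halluc = sum(hallucinations.values())              # 12

# Top 3 chatbots by total score.
for bot, s in sorted(scores.items(), key=lambda kv: -sum(kv[1]))[:3]:
    print(f"{bot}: {sum(s)}/15")
print(f"Overall accuracy: {grand_total / max_total:.1%}")        # 49.7%
print(f"Hallucinations: {n_halluc}/{n_answers} answers "
      f"({n_halluc / n_answers:.0%}) from {len(hallucinations)} of {len(scores)} chatbots")
```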

Q1: Is Yuveganlife.com a vegan NPO?

There is no explicit statement on the Yuveganlife.com website or its social media, such as LinkedIn, that addresses this question. This was also mentioned in the original chatbot web search test post, which has been indexed by Google.

Although the About page mentions "To ensure the platform's sustainable operation, ..., we may seek funding through grants, donation, or other channels.", this statement alone is not sufficient to infer that Yuveganlife.com is a non-profit organization.

The answers provided by AI chatbots were as follows:
  • Bing Copilot (free ChatGPT-4): Correct. Provided 1 search result from Yuveganlife.com's Facebook page, same as in the original post. Score: 2 ( screenshot )
  • HuggingChat (Mixtral-8x7B-Instruct-v0.1): Correct, but the provided search results were not the most relevant. Score: 2 ( screenshot )
  • HuggingChat (Llama-2-70b-chat-hf): Correct, but the provided search results were not the most relevant. Score: 2 ( screenshot )
  • iAsk: Correct. Provided many Yuveganlife.com pages as search results. Score: 3 ( screenshot )
  • Komo: Correct. Provided 4 Yuveganlife.com pages as search results. Score: 3 ( screenshot )
  • Perplexity (ChatGPT-3.5): Correct. Provided 5 Yuveganlife.com pages and 1 LinkedIn page as search results. Score: 3 ( screenshot )
  • Perplexity (free ChatGPT-4): Correct. Provided 7 Yuveganlife.com pages and 1 LinkedIn page as search results. Score: 3 ( screenshot )
  • Phind (Phind V9): Correct. Provided 2 Yuveganlife.com pages as search results. Score: 3 ( screenshot )
  • Phind (free ChatGPT-4): Correct. Provided 2 Yuveganlife.com pages. Score: 3 ( screenshot )
  • Pi: Correct. Score: 3 ( screenshot )
  • Poe (Web-Search): Correct. Provided 2 Yuveganlife.com pages and 1 LinkedIn page as search results. Score: 3 ( screenshot )
  • You (Smart): Correct. Provided 2 Yuveganlife.com pages as search results. Score: 3 ( screenshot )

Q2: When and where was yuveganlife.com established?

Although the website does not explicitly provide this information, the About page mentions that "... in Beautiful British Columbia, Yuveganlife was officially launched in the summer of 2023", which provides a clue to infer that it was established in 2023 in BC, Canada. Similar information can also be found on Yuveganlife's LinkedIn About page: "Primary Headquarters: Greater Vancouver, Canada".

The answers provided by AI chatbots were as follows:
  • Bing Copilot (free ChatGPT-4): Cannot answer. Provided 1 search result from Yuveganlife.com's Facebook page, same as in the original post. Score: 0 ( screenshot )
  • HuggingChat (Mixtral-8x7B-Instruct-v0.1): Cannot answer. No info was found in the first 5 Google search results. Score: 0 ( screenshot )
  • HuggingChat (Llama-2-70b-chat-hf): Cannot answer. No info was found in the first 5 Google search results. Score: 0 ( screenshot )
  • iAsk: Cannot answer. Provided the most relevant Yuveganlife.com pages as search results, including the original test post. Score: 0 ( screenshot )
  • Komo: Mostly correct. Provided 4 Yuveganlife.com pages as search results, but generated a hallucination concluding that the launch date was June 1, 2022 with two founders. Score: 2 ( screenshot )
  • Perplexity (ChatGPT-3.5): Correct. Provided 5 Yuveganlife.com pages as search results, including the original test post. Score: 3 ( screenshot )
  • Perplexity (free ChatGPT-4): Correct. Provided 4 Yuveganlife.com pages as search results, including the original test post. "Greater Vancouver, Canada" from the test post was used as the answer. Score: 3 ( screenshot )
  • Phind (Phind V9): Correct. Provided 4 Yuveganlife.com pages as search results, including the original test post. Score: 3 ( screenshot )
  • Phind (free ChatGPT-4): Correct. Provided 5 Yuveganlife.com pages, including the original test post. Score: 3 ( screenshot )
  • Pi: Correct. Appears to have used the answer from the original test post. Score: 3 ( screenshot )
  • Poe (Web-Search): Cannot answer. Score: 0 ( screenshot )
  • You (Smart): Half correct. Provided 1 Yuveganlife.com page and its LinkedIn page as search results. Could only answer the "When" question. Score: 1.5 ( screenshot )
The results show that Perplexity Copilot, Phind V9, and Pi can pick up the latest updates from Google search results and use the new information to answer a similar question.

Q3: How many vegan podcasts are listed on yuveganlife.com?

Yuveganlife.com has a dedicated vegan podcast list resource page, which has been constantly updated over time.

One website update post mentioned the number of podcasts included: the third update listed 70+.

The candidate will receive three points if it can count and provide the most recent number of podcasts listed on the page based on a real-time search, two points if it uses the information from the third update (70+), and zero points if it cannot provide an accurate answer.

The answers provided by AI chatbots were as follows:
  • Bing Copilot (free ChatGPT-4): Cannot answer. Provided 3 unrelated search results. Score: 0 ( screenshot )
  • HuggingChat (Mixtral-8x7B-Instruct-v0.1): Wrong. Provided unrelated search results. Score: 0 ( screenshot )
  • HuggingChat (Llama-2-70b-chat-hf): Wrong. Provided unrelated search results. Score: 0 ( screenshot )
  • iAsk: Cannot answer, although it provided the most relevant page as one of the search results. Score: 0 ( screenshot )
  • Komo: Wrong. Provided 3 Yuveganlife.com pages and 1 unrelated website as search results, and generated a hallucination listing 10 imagined podcasts. Score: 0 ( screenshot )
  • Perplexity (ChatGPT-3.5): Mostly Correct. It found the number of 70+ from the 3rd update post. Score: 2 ( screenshot )
  • Perplexity (free ChatGPT-4): Mostly Correct. It found the number of 70+ from the 3rd update post. Score: 2 ( screenshot )
  • Phind (Phind V9): Mostly Correct. It found the number of 70+ from the 3rd update post. Score: 2 ( screenshot )
  • Phind (free ChatGPT-4): Mostly Correct. It found the number of 70+ from the 3rd update post. Score: 2 ( screenshot )
  • Pi: Mostly Correct. It found the number of 70+ from the 3rd update post. Score: 2 ( screenshot )
  • Poe (Web-Search): Cannot answer. Score: 0 ( screenshot )
  • You (Smart): Cannot answer, while providing the 2 most relevant Yuveganlife.com pages. Score: 0 ( screenshot )
This time, Komo simply provided directions on how to navigate to the podcast resources page.

When each chatbot was asked the original Question 3, "How many recipe websites and blogs are listed on yuveganlife.com?", again, it was interesting to see how their responses changed after they had read the original Question 3's test criteria in the initial testing post.
  • Perplexity (ChatGPT-3.5): Same as the previous answer. It returned 225+ ( screenshot )
  • Perplexity (free ChatGPT-4): Same as the previous answer. It returned 225+ ( screenshot )
  • Bing Copilot (free ChatGPT-4): Cannot answer, same as last time. ( screenshot )
  • iAsk: Cannot answer, same as last time. ( screenshot )
  • You (Smart): Cannot answer, same as last time. ( screenshot )
  • Komo: Provided the steps for getting the number with a chatbot, based on Question 3's scoring criteria in the initial post. ( screenshot )
  • Phind (Phind V9): More accurate than the previous answer. It returned 225+ based on the original test criteria. ( screenshot )
  • Phind (free ChatGPT-4): More accurate than the previous answer. It returned 225+ based on the original test criteria. ( screenshot )
  • Pi: Less accurate than the previous answer. It returned 185+ this time. ( screenshot )

Q4: What is the current local date and time in Vancouver, Canada?

For a long time, everyone has been able to confidently get the correct current local date and time by asking Google or Bing.
screenshot for Current Date and Time by Google


screenshot for Current Date and Time by Bing


But how accurately can chatbot search engines deliver this information to the user? Although this is a simple task, it must be implemented correctly; otherwise, it may cause errors when the result is used for other tasks such as travel scheduling or financial statements.

Many applications use the local date and time of the host server as a baseline reference. This worked fine when the server was located in a fixed data center, but it produces inaccurate values when the software is deployed in a cloud spanning multiple data centers in different time zones. Such edge cases are also hard to catch when testing only during regular office hours.
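
To illustrate the pitfall, here is a minimal Python sketch, an illustration written for this post rather than code from any of the tested products, contrasting naive server-local time with an explicit time zone conversion. It assumes Python 3.9+ for the standard-library zoneinfo module:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Naive approach: whatever time zone the host server happens to run in.
# The same code returns different answers in different data centers.
server_now = datetime.now()

# Robust approach: anchor on UTC, then convert to the requested city's zone,
# which also handles daylight saving transitions automatically.
vancouver_now = datetime.now(timezone.utc).astimezone(ZoneInfo("America/Vancouver"))

print("Server-local time (deployment-dependent):", server_now)
print("Vancouver local time:", vancouver_now.strftime("%A, %B %d, %Y, %I:%M %p %Z"))
```

A chatbot backend that answers Q4 from server-local time will drift by one or more hours whenever its servers sit outside the Pacific time zone.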

The candidate will receive 3 points if it can provide the correct local date and time in Vancouver, Canada, 1 point if it can get only the correct date or is sometimes wrong and sometimes correct, and zero points if it cannot answer or is totally wrong.

The following tests were conducted in the PST time zone, and the local date and time in PST is shown at the top of each screenshot.

The answers provided by AI chatbots were as follows:

Q5: Does Eat Just manufacture Just Meat, a plant-based meat substitute made from pea protein?

"Eat Just was founded in 2011 by Josh Tetrick and Josh Balk. In July 2017, it started selling a substitute for scrambled eggs called Just Egg that is made from mung beans. It released a frozen version in January 2020. In December 2020, the Government of Singapore approved cultivated meat created by Eat Just, branded as GOOD Meat. A restaurant in Singapore called 1880 became the first place to sell Eat Just's cultured meat." -- source from Wikipedia.

" Beyond Meat, Inc. is a Los Angeles–based producer of plant-based meat substitutes founded in 2009 by Ethan Brown. The company's initial products were launched in the United States in 2012... The burgers are made from pea protein isolates... " -- source from Wikipedia.

Both Eat Just's and Beyond Meat's revolutionary food products have been widely reported around the world and should be included in each LLM's training dataset.

This question was designed as a yes/no question about an imagined product brand name, Just Meat, which mixes Eat Just's cultivated meat product with Beyond Meat's plant-based product, in order to test the LLM hallucination issue.

This question and answer have been indexed by Google but were ranked very low in a keyword search ( Screenshot ), so the indexing should not affect the new test results. The candidate will receive three points if it correctly identifies that no such product exists, and zero points if it provides incorrect information based on a hallucination.

  • Bing Copilot (free ChatGPT-4): Correct. Score: 3 ( screenshot )
  • HuggingChat (Mixtral-8x7B-Instruct-v0.1): Wrong with hallucination. Score: 0 ( screenshot )
  • HuggingChat (Llama-2-70b-chat-hf): Wrong with hallucination. Score: 0 ( screenshot )
  • iAsk: Correct. Score: 3 ( screenshot )
  • Komo: Wrong with hallucination. Score: 0 ( screenshot )
  • Perplexity (ChatGPT-3.5): Mostly Correct. Score: 2 ( screenshot )
  • Perplexity (free ChatGPT-4): Correct. Score: 3 ( screenshot )
  • Phind (Phind V9): Mostly Wrong. Generated a hallucination that Just Meat is owned by Good Catch. Score: 1 ( screenshot )
  • Phind (free ChatGPT-4): Correct. Score: 3 ( screenshot )
  • Pi: Wrong with hallucination. Score: 0 ( screenshot )
  • Poe (Web-Search): Wrong with hallucination. Score: 0 ( screenshot )
  • You (Smart): Wrong with hallucination. Score: 0 ( screenshot )
An interesting observation is that HuggingChat (Mixtral-8x7B-Instruct-v0.1) could answer it correctly without web search.

The Mixtral model can infer the correct conclusion from its original training data, which dates to June 2021 ( Screenshot ), but it inferred the wrong answer when given new information found on the web.

Conclusion

Based on the results of our updated testing, we found that the free chatbot Phind (ChatGPT-4) was the most effective at providing accurate responses to our 5 updated test questions, demonstrating its ability to understand user queries and retrieve relevant and reliable information.

There are additional observations from the updated test result:
  • The top 3 chatbots, ranked by their scores in this round of testing, were: Phind (ChatGPT-4): 14/15, Perplexity (ChatGPT-4): 12/15, Perplexity (ChatGPT-3.5): 11/15
  • The overall web search accuracy of the 12 AI chatbots was 49.7% (89.5 of 180 possible points)
  • Out of the 12 chatbots tested, 8 (67%) generated a total of 12 hallucinations across 60 answers, i.e., 20% of all answers
  • Two chatbots, Komo and You (Smart), generated hallucinations on two or more questions
  • The updated test cases provided an interesting opportunity to observe how new, related information available on the web affects each chatbot's reasoning
We are confident that chatbots' web search will become significantly more popular in 2024.

Disclaimer

This test study was designed and conducted independently by Yuveganlife.com, with no affiliation or involvement with any other organizations.

The test results for each chatbot are not indicative of the product's accuracy on other search tasks, as they are limited to these five updated test questions.

About the Author

This test suite was authored by Bruce Yu, founder of Yuveganlife.com, as part of evaluating different chatbots' search capabilities to help automate and verify the recording of vegan company and NPO information in Yuveganlife.com's headless CRM system.

Bruce gained experience evaluating software products on Health Canada's eReview project, where he provided key technical support and advice for eReview Project Stream 1. He created the technical requirements for the RFP, set up applied COTS eCTD viewing tools and evaluated them against user requirements, and conducted quality verification of the interim and final candidate products. The project successfully provided the technical infrastructure to replace the paper-based drug review process with an electronic review process at HC.