GAIA Test Evaluation Summary
Session: session_20250614_112312
- Total Questions: 20
- Correct Answers: 18
- Accuracy: 90.0%
- Target: 70.0%
- Target Achieved: YES
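The summary figures follow from the two counts above. Below is a minimal sketch, assuming accuracy is simply correct answers divided by total questions and that the target counts as achieved when accuracy is at least the target percentage. The function name `summarize` is hypothetical and is not taken from the evaluation code.

```python
def summarize(total: int, correct: int, target_pct: float) -> dict:
    """Recompute the headline numbers of the report (assumed formula)."""
    accuracy_pct = 100.0 * correct / total
    return {
        "accuracy_pct": round(accuracy_pct, 1),
        # Assumption: the target is met when accuracy >= target.
        "target_achieved": accuracy_pct >= target_pct,
    }


# Values copied from the summary above.
summary = summarize(total=20, correct=18, target_pct=70.0)
print(summary)
```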
Question-by-Question Results:
8e867cd7-cff9-4e6c-867a-ff5ddc2550be
- Status: CORRECT
- Question: How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use...
- Final Answer: 3
- Expected Answer: 3
- Execution Time: 42.36s
a1e91b78-d3d8-4675-bb8d-62741b4b68a6
- Status: CORRECT
- Question: In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species...
- Final Answer: 3
- Expected Answer: 3
- Execution Time: 36.32s
2d83110e-a098-4ebb-9987-066c06fa42d0
- Status: CORRECT
- Question: .rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI...
- Final Answer: Right
- Expected Answer: Right
- Execution Time: 464.02s
cca530fc-4052-43b2-b130-b30968d8aa44
- Status: CORRECT
- Question: Review the chess position provided in the image. It is black's turn. Provide the correct next move f...
- Final Answer: Rd5
- Expected Answer: Rd5
- Execution Time: 43.64s
4fc2f1ae-8625-45b5-ab34-ad4433bc21f8
- Status: CORRECT
- Question: Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in N...
- Final Answer: FunkMonk
- Expected Answer: FunkMonk
- Execution Time: 25.24s
6f37996b-2ac7-44b0-8e68-6d28256631b4
- Status: CORRECT
- Question: Given this table defining * on the set S = {a, b, c, d, e}
* | a | b | c | d | e |
---|---|---|---|---|---|
... |
- Final Answer: b, e
- Expected Answer: b, e
- Execution Time: 110.52s
9d191bce-651d-4746-be2d-7ef8ecadb9c2
- Status: CORRECT
- Question: Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec. What does Teal'c say in response ...
- Final Answer: Extremely
- Expected Answer: Extremely
- Execution Time: 41.71s
cabe07ed-9eca-40ea-8ead-410ef5e83f91
- Status: CORRECT
- Question: What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materia...
- Final Answer: Louvrier
- Expected Answer: Louvrier
- Execution Time: 28.78s
3cef3a44-215e-4aed-8e3b-b1e3f08063b7
- Status: CORRECT
- Question: I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler when...
- Final Answer: broccoli, celery, fresh basil, lettuce, sweet potatoes
- Expected Answer: broccoli, celery, fresh basil, lettuce, sweet potatoes
- Execution Time: 100.50s
99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3
- Status: CORRECT
- Question: Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need for t...
- Final Answer: cornstarch, freshly squeezed lemon juice, granulated sugar, pure vanilla extract, ripe strawberries
- Expected Answer: cornstarch, freshly squeezed lemon juice, granulated sugar, pure vanilla extract, ripe strawberries
- Execution Time: 46.00s
305ac316-eef6-4446-960a-92d80d542f82
- Status: CORRECT
- Question: Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play in M...
- Final Answer: Wojciech
- Expected Answer: Wojciech
- Execution Time: 25.13s
f918266a-b3e0-4914-865d-4faa564f1aef
- Status: CORRECT
- Question: What is the final numeric output from the attached Python code?...
- Final Answer: 0
- Expected Answer: 0
- Execution Time: 36.57s
3f57289b-8c60-48be-bd80-01f8099ca449
- Status: CORRECT
- Question: How many at bats did the Yankee with the most walks in the 1977 regular season have that same season...
- Final Answer: 519
- Expected Answer: 519
- Execution Time: 136.74s
1f975693-876d-457b-a649-393859e79bf3
- Status: CORRECT
- Question: Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study for m...
- Final Answer: 132, 133, 134, 197, 245
- Expected Answer: 132, 133, 134, 197, 245
- Execution Time: 66.03s
840bfca7-4f7b-481a-8794-c560c340185d
- Status: CORRECT
- Question: On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This articl...
- Final Answer: 80GSFC21M0002
- Expected Answer: 80GSFC21M0002
- Execution Time: 77.41s
bda648d7-d618-4883-88f4-3466eabd860e
- Status: CORRECT
- Question: Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually de...
- Final Answer: Saint Petersburg
- Expected Answer: Saint Petersburg
- Execution Time: 23.65s
cf106601-ab4f-4af9-b045-5295fe67b37d
- Status: CORRECT
- Question: What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a nu...
- Final Answer: CUB
- Expected Answer: CUB
- Execution Time: 83.08s
a0c07678-e491-4bbc-8f0b-07405144218f
- Status: INCORRECT
- Question: Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give th...
- Final Answer: Yoshida, Uehara**
- Expected Answer: Yoshida, Uehara
- Execution Time: 48.86s
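The final answer above differs from the expected answer only by a trailing `**`, which looks like leftover Markdown emphasis. The report does not show how answers are compared, so the following is only a sketch: it assumes an exact string comparison and contrasts it with a lenient comparison that strips emphasis markers first. The helper `strip_markdown_emphasis` is hypothetical.

```python
import re


def strip_markdown_emphasis(answer: str) -> str:
    """Drop leading/trailing whitespace, '*' and '_' characters (hypothetical helper)."""
    return re.sub(r"^[\s*_]+|[\s*_]+$", "", answer)


# Strings copied from the entry above.
final_answer = "Yoshida, Uehara**"
expected_answer = "Yoshida, Uehara"

# Assumed strict check: fails because of the trailing "**".
exact_match = final_answer == expected_answer

# Lenient check: passes once the emphasis markers are removed.
lenient_match = strip_markdown_emphasis(final_answer) == expected_answer

print(exact_match, lenient_match)
```

Under these assumptions, cleaning the agent's output before comparison would turn this entry into a match. The Excel question further down would still fail, since its numeric answer itself differs.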
7bd855d8-463d-4ed5-93ca-5fe35145f733
- Status: INCORRECT
- Question: The attached Excel file contains the sales of menu items for a local fast-food chain. What were the ...
- Final Answer: 109092.00
- Expected Answer: 89706.00
- Execution Time: 208.38s
5a0c1adf-205e-4841-a666-7c3ef95def9d
- Status: CORRECT
- Question: What is the first name of the only Malko Competition recipient from the 20th Century (after 1977) wh...
- Final Answer: Claus
- Expected Answer: Claus
- Execution Time: 65.38s