Last active
December 19, 2023 17:40
Demonstrating what a machine learning evaluation is like.
Ground truth here means that I, a human, asserted the answer. GPT 4.0 scored 10/10 (over ChatGPT voice on the mobile app, no less).
This is a really terrible ML eval: it measures accuracy only. A more realistic version would be hundreds of questions,
lots of data and lots of work. If you aren't putting the work in, then your users are. It's the same as functional
testing except with a new-ish data concept; many things in AI/ML are like this. The data is the part that is foreign to
general software dev.
Perplexity is https://labs.perplexity.ai/
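The functional-testing comparison can be made concrete. What follows is a minimal sketch, not how this gist was produced (the answers below were collected by hand). Eval cases are plain data, and a loop compares each model answer with the ground truth. The names `ask_model`, `extract_name` and `run_eval` are hypothetical, and `ask_model` is a stub standing in for whatever API or chat window you actually use.

```python
import re
from typing import Callable, Optional

NAMES = ["Alice", "Bob", "Charles", "Dave"]

# Each case is (question, ground truth). Only the first two questions are shown.
CASES = [
    ("Alice has a ball. She then hands the ball to Bob. Who has the ball?", "Bob"),
    ("Alice has a ball and looks at Bob and Charles. She gives it to no one. "
     "Who has the ball?", "Alice"),
]

def extract_name(answer: str) -> Optional[str]:
    """Return the one known name mentioned in the answer.

    Returns None when the answer mentions no known name or more than one,
    because such an answer cannot be graded automatically.
    """
    found = {n for n in NAMES if re.search(rf"\b{re.escape(n)}\b", answer)}
    return found.pop() if len(found) == 1 else None

def run_eval(ask_model: Callable[[str], str]) -> float:
    """Ask every question and return the fraction answered correctly."""
    correct = sum(extract_name(ask_model(q)) == truth for q, truth in CASES)
    return correct / len(CASES)

def ask_model(question: str) -> str:
    # Stub (assumption): a "model" that always gives the same hedged answer.
    return "It is possible that Bob has the ball."

print(run_eval(ask_model))  # 0.5: the stub is right on Q1 and wrong on Q2
```

Pulling the name out of a wordy reply is one possible way to grade hedged answers without re-prompting. It is an assumption of this sketch and it would fail on answers that mention several people.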
Q1: Alice has a ball. She then hands the ball to Bob. Who has the ball?
Ground Truth: Bob
Perplexity: Bob
Q2: Alice has a ball and looks at Bob and Charles. She gives it to no one. Who has the ball?
Ground Truth: Alice
Perplexity: Alice
Q3: Alice has a ball. She looks at Bob. She then hands the ball to Charles. Who has the ball?
Ground Truth: Charles
Perplexity: Charles
Q4: Alice has a ball. She looks at Bob and thinks about handing the ball to Bob. She decides against it and instead
gives it to Charles. Who has the ball?
Ground Truth: Charles
Perplexity: Charles
Q5: Alice has a ball. She looks at Bob and then falls asleep. While sleeping, she hands the ball in the real world to
Charles involuntarily. Alice dreams about handing the ball to Bob and in the dream hands the ball to Bob. Who has the
ball in the real world?
Ground Truth: Charles
Perplexity: Charles
Q6: Alice has a ball. She looks at Bob and dreams about handing the ball to him. While sleeping, she hands the ball to
Charles involuntarily. In the dream she gives the ball to Bob. When Alice wakes up, Charles returns it. Who has the
ball in the real world?
Ground Truth: Alice
Perplexity: Alice
Q7: Alice has a ball. She looks at Bob and dreams about handing the ball to him. While sleeping, she drops the ball in
her bed. She turns over in the bed and her hand happens to land on the ball and grasp it by sheer luck. In her dream,
she hands the ball to Charles and Charles hands it to Bob. When Alice wakes up, the ball is handed to the first person
in the dream who was handed the ball. Who has the ball in the real world?
Ground Truth: Charles
Perplexity: Alice
Q8: Alice has a ball. She dreams about handing the ball to Bob because he is so nice. While sleeping, she drops the
ball in her bed and it rolls to Charles and he picks it up. Charles then gives it to the person previously that Alice
dreamt is nice. In her dream, Bob asks for the ball and Alice hands the ball to him. Then in the dream, Charles asks
Bob for the ball but Bob says "no" and keeps it. Who has the ball in the real world?
Ground Truth: Bob
Perplexity: Bob
(but it got fancy and answered "it is possible that Bob has the ball", so I had to re-prompt)
Q9: Alice has a ball. Bob tells Charles to ask Alice for it. Alice agrees and does what Charles asks. Who has the
ball?
Ground Truth: Charles
Perplexity: Charles
Q10: Alice has a ball. Bob tells Charles to ask Alice for it. Alice pretends to do what Charles asks but instead goes
with Dave. Who has the ball?
Ground Truth: Dave
Perplexity: Alice
Scores:
- Perplexity: 8/10
- ChatGPT 4.0: 10/10 (uncountable bonus points: over voice)
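The Perplexity score can be re-derived from the answers recorded above. This is a small sketch; the list and function names are mine, and the values are copied from Q1 through Q10 in order. ChatGPT's individual answers were not recorded here, only the 10/10 total, so it is left out.

```python
# Answers for Q1..Q10, copied from the transcript above.
GROUND_TRUTH = ["Bob", "Alice", "Charles", "Charles", "Charles",
                "Alice", "Charles", "Bob", "Charles", "Dave"]
PERPLEXITY = ["Bob", "Alice", "Charles", "Charles", "Charles",
              "Alice", "Alice", "Bob", "Charles", "Alice"]

def score(truth, answers):
    """Return (number correct, list of missed question labels)."""
    misses = [f"Q{i}" for i, (t, a) in enumerate(zip(truth, answers), start=1)
              if t != a]
    return len(truth) - len(misses), misses

correct, misses = score(GROUND_TRUTH, PERPLEXITY)
print(f"Perplexity: {correct}/{len(GROUND_TRUTH)}, missed {misses}")
# Perplexity: 8/10, missed ['Q7', 'Q10']
```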
More complicated use cases would use something beyond accuracy, like F-score or some other metric.
You can look at the Llama 2 paper from Meta for the names of different scoring methods. There is nuance in choosing
a metric, and in other parts of evaluation too.
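To give a feel for what "beyond accuracy" looks like, here is a rough sketch of a per-class F-score (F1) on the Perplexity answers above. Treating each person's name as a class is my assumption for illustration, and ten questions are far too few for the numbers to mean much. The formula used is the standard F1 = 2*TP / (2*TP + FP + FN).

```python
# Answers for Q1..Q10, copied from the transcript above.
GROUND_TRUTH = ["Bob", "Alice", "Charles", "Charles", "Charles",
                "Alice", "Charles", "Bob", "Charles", "Dave"]
PERPLEXITY = ["Bob", "Alice", "Charles", "Charles", "Charles",
              "Alice", "Alice", "Bob", "Charles", "Alice"]

def f1_per_class(truth, answers):
    """Return {label: F1}, treating each distinct answer as its own class."""
    pairs = list(zip(truth, answers))
    scores = {}
    for label in sorted(set(truth) | set(answers)):
        tp = sum(t == label and a == label for t, a in pairs)  # right, said label
        fp = sum(t != label and a == label for t, a in pairs)  # said label wrongly
        fn = sum(t == label and a != label for t, a in pairs)  # missed the label
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

per_class = f1_per_class(GROUND_TRUTH, PERPLEXITY)
macro_f1 = sum(per_class.values()) / len(per_class)

for label, value in per_class.items():
    print(f"{label}: {value:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

Accuracy is 0.80, but the macro F1 comes out lower (about 0.64) because "Dave" is never predicted and "Alice" is over-predicted. That kind of per-class detail is what a single accuracy number hides.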