seed
[
{
"category": "Physical and Spatial Reasoning",
"overview": "Large language models (LLMs), especially transformer-based models, typically struggle with physical and spatial reasoning due to their associative rather than causal or simulation-based internal representations. They lack grounded understanding or internal simulations of real-world physics, instead relying solely on statistical associations learned from textual data. Without explicit mental models or sensory experiences of spatial relations, gravity, friction, containment, and object permanence, LLMs default to pattern-based associations and linguistic heuristics rather than accurate physical logic. Thus, when confronted with scenarios that require concrete reasoning about physical interactions, spatial positioning, or hidden-object inference, LLMs often provide incorrect or illogical responses.\n\nThis limitation arises fundamentally because LLMs do not possess innate spatial or physical intuitions, nor do they internally simulate reality. They lack the structured internal representations humans naturally utilize to mentally visualize and infer physical consequences. Additionally, LLMs' training data rarely explicitly captures detailed spatial or physical reasoning scenarios clearly enough to be generalized to novel cases, especially when scenarios are uncommon, ambiguous, or counterintuitive.",
"questions": [
"A tennis ball is placed on top of an inverted bowl. If the bowl is lifted straight upward quickly, where does the tennis ball end up?",
"A rubber duck floats on water inside a sealed jar. If the jar is flipped upside down underwater, what happens to the duck and why?",
"If you stack four identical chairs on top of each other upside-down, and the bottom chair is removed rapidly without disturbing the others, what happens to the stack?",
"A glass filled completely with water is turned upside down, with a thin sheet of paper placed over its opening. What happens to the water when the hand holding the paper is removed, and why?",
"If you spin a boiled egg and a raw egg on a flat table simultaneously, which egg stops spinning first, and explain your reasoning.",
"You have a helium balloon tied to a string inside a car. The car suddenly accelerates forward. Which way does the balloon move inside the car, forward or backward?",
"A key is dropped into a partially filled jar of honey and sinks slowly to the bottom. If the jar is shaken vigorously, describe the final position of the key after it settles.",
"Imagine a helium balloon floating inside a sealed, accelerating car. If the car suddenly accelerates forward, in which direction does the balloon move relative to the car, and why?",
"Two identical cups filled with coffee—one hot and one cold—are placed simultaneously into a freezer. Which cup freezes first, and why?",
"You drop a feather and a coin inside a sealed, vacuumed glass tube at the same time. Which one reaches the bottom first, and why?"
]
},
{
"category": "Backward Constraint Satisfaction",
"overview": "Transformer-based LLMs struggle with backward constraint satisfaction tasks because their architecture predominantly relies on forward sequential prediction based on statistical associations. These models predict subsequent tokens by leveraging learned patterns from vast training datasets, which typically follow forward-flowing sentence structures. When explicitly required to reverse-engineer sentences from a strict terminal constraint—such as ending every sentence with a specific word—LLMs falter because they lack explicit mechanisms for planning, backward reasoning, and constraint management. Their associative nature is inherently forward-directed, lacking the cognitive processes humans naturally employ when reasoning backward from goals or constraints. Consequently, transformers find it challenging to reconcile backward-generated constraints with grammatically and semantically coherent output, often resulting in incomplete or nonsensical sentences.",
"questions": [
"Write three short stories where every sentence ends with the word 'Ocean.'",
"List four questions that each end exactly with the word 'Yellow.'",
"Generate five coherent questions about history, each precisely 7 words long, ending with the word 'Empire.'",
"Write three sentences, each exactly 10 words long, all ending in the word 'Elephant.'",
"Formulate four rhyming sentences, each ending with the word 'Circle.'",
"Provide three distinct sentences describing emotions, each ending with the word 'Empty.'",
"Construct five complete questions, each ending with the word 'Zebra,' without repetition.",
"Write three sentences about technology, each exactly 9 words long, all ending in the word 'Computer.'",
"Produce two logically coherent sentences that each end with the nonsensical word 'Splork.'",
"Craft five different riddles, each sentence ending with the word 'Shadow.'"
]
},
{
"category": "Procedural Reasoning & Step-by-Step Logic",
"overview": "Transformer-based Large Language Models (LLMs) often struggle with questions requiring procedural reasoning and step-by-step logical deductions because their internal mechanisms primarily rely on associative pattern recognition and probabilistic predictions rather than explicit symbolic logic or causal reasoning. Unlike human cognition, which can methodically track sequential dependencies and systematically infer outcomes, transformer models tend to oversimplify sequential or conditional problems, often defaulting to associative memory recall or linear extrapolation without truly understanding underlying processes or constraints.\n\nAdditionally, these models lack specialized structures for maintaining intermediate state information or internal 'working memory' required for complex multi-step logical tasks. They have difficulty maintaining accurate intermediate logical states or consistently tracking complex, intertwined conditions across multiple inference steps. Consequently, questions that demand rigorous, step-by-step deductions or precise procedural logic frequently expose these limitations, resulting in superficial, incorrect, or overly simplistic answers.",
"questions": [
"If drying 5 shirts simultaneously in sunlight takes 30 minutes, how long will it take to dry 10 shirts simultaneously? Explain step-by-step.",
"You have three ovens. One pizza takes 10 minutes to bake. How long does it take to bake six pizzas at once?",
"A car travels from town A to town B at 60 mph, then immediately returns at 30 mph. Is the average speed for the entire trip exactly halfway between these two speeds? Explain your reasoning clearly.",
"If a bathtub fills with water at a rate of 1 gallon per minute and drains at half a gallon per minute, how long will it take to fill a 30-gallon bathtub if both filling and draining start simultaneously? Explain your reasoning step-by-step.",
"John is twice as old as Mary. Mary is 10 years younger than Tom. If Tom is currently 30, how old will John be when Tom is exactly twice Mary’s current age? Explain your reasoning clearly and step-by-step.",
"You have a battery that takes exactly one hour to charge fully. If charging three identical batteries takes three separate hours individually, how long will it take to charge all three simultaneously if you have three identical chargers? Explain your reasoning.",
"A tap fills a bucket in 4 minutes. Another tap fills the same bucket in 6 minutes. If both taps run simultaneously, how many buckets can they fill in 12 minutes? Explain your reasoning step-by-step.",
"A train takes 2 minutes to pass through a tunnel that is 1 mile long when traveling at constant speed. If the train’s length doubles and its speed halves, how long will it take to pass completely through the same tunnel? Explain step-by-step.",
"It takes 3 gardeners 3 hours to mow a field. How long would it take 6 gardeners to mow the same field, assuming they all work at the same rate? Explain your reasoning clearly.",
"A printer prints pages at a constant rate of 10 pages per minute. If printing 200 pages takes 20 minutes, how long will it take to print 400 pages using two identical printers simultaneously? Explain your reasoning step-by-step."
]
},
{
"category": "Hypothetical and Counterintuitive Scenarios",
"overview": "LLMs struggle significantly with hypothetical and counterintuitive scenarios because their training primarily relies on identifying statistical patterns from common human discourse. They lack internal mechanisms to explicitly simulate radically altered physical realities or unfamiliar conditions that deviate significantly from typical experiences. Consequently, these models often rely on associative reasoning, which breaks down in contexts where common patterns no longer apply. They have difficulty extrapolating from abstract hypothetical constraints to realistic ecological, societal, or logical implications, because they lack genuine internal simulations, symbolic reasoning modules, or built-in mechanisms for imaginative extrapolation beyond learned textual patterns.",
"questions": [
"If humans never needed sleep, describe the most significant social and economic impacts on modern society.",
"Suppose humans were naturally transparent; how might fashion and social interactions evolve differently?",
"Imagine gravity suddenly reverses every midnight for exactly one hour; how would city infrastructure and daily routines adapt?",
"If sound traveled faster than light, explain how human communication would likely evolve.",
"If fire caused objects to become colder rather than hotter, how would human civilization have developed differently?",
"Assume trees emit loud noises when touched. How would urban planning or human behavior change?",
"If mirrors showed images from exactly one hour in the past, describe how society's interaction with mirrors would evolve.",
"Imagine ice sinks instead of floats in water. How would Earth's ecosystems and human life change?",
"If all liquids instantly solidified upon exposure to air, describe how human societies would manage drinking water.",
"If shadows were tangible and could physically interact with objects, what new challenges or opportunities would emerge for humanity?"
]
},
{
"category": "Implicit Constraints & Missing Information",
"overview": "Large Language Models (LLMs) struggle with questions involving implicit constraints and missing information because they rely on associative learning from explicit textual patterns rather than possessing inherent intuitive or causal reasoning abilities. These models excel at tasks with clearly articulated constraints but falter when forced to infer unstated details or when the scenario requires implicit understanding of physical laws, probabilities, or everyday assumptions. Since LLMs do not have inherent experiential knowledge or grounded world models, they fail to accurately interpret situations where humans effortlessly make reasonable assumptions.\n\nFurthermore, these models typically rely on statistical associations within the training data. Consequently, when confronted with vague, underspecified, or implicitly constrained scenarios, they tend to produce answers that reflect surface-level associations rather than genuinely reasoned or logically consistent conclusions. This leads to superficial or incorrect responses, particularly evident when precise implicit reasoning about physical dynamics, randomness, or hidden conditions is required.",
"questions": [
"A sealed bottle of water is placed in a freezer for 3 hours. Describe its state precisely after removal.",
"A helium balloon is released inside a closed, empty elevator. What happens to the balloon when the elevator starts accelerating upward?",
"An unpeeled banana is submerged in boiling water for 30 minutes. What will the inside of the banana be like afterward?",
"If you flip a coin 1,000 times inside a completely dark, soundproof room and then open your hand, how likely is it that the coin lands on its edge?",
"You drop a raw egg from a 3-foot height onto soft sand. Describe the egg's condition immediately after.",
"A dry sponge is placed outside during a heavy rainstorm for exactly 5 seconds. How saturated will the sponge become?",
"A sealed empty metal can is heated strongly, then quickly immersed in cold water. Describe exactly what happens to the can’s shape.",
"You vigorously shake a container filled half with water and half with oil, then let it sit still. After one hour, describe the exact layering of liquids.",
"A small magnet is attached inside a spinning, opaque sphere. After spinning stops, describe the position of the magnet.",
"A raw egg is spun rapidly on a smooth, flat table then briefly touched and released. Describe its immediate motion afterward."
]
},
{
"category": "Mathematical & Logical Puzzles",
"overview": "Transformer-based LLMs generally exhibit difficulty with mathematical and logical puzzles that require explicit symbolic manipulation, abstract reasoning, and systematic logical deductions. These models primarily rely on pattern recognition and statistical associations derived from extensive textual training data rather than formal symbolic logic or rigorous deductive reasoning. As a result, they frequently fail when tasked with problems requiring step-by-step proofs, logical consistency checks, symbolic manipulation, or handling nuanced edge cases. The limitations stem from their inherent architecture, which lacks explicit logic modules or formal rule evaluation mechanisms necessary for reliably performing structured mathematical and logical tasks.",
"questions": [
"Prove step-by-step whether the statement \"Every prime number greater than 2 is odd\" holds true. Justify your answer explicitly.",
"If today is Monday, what day of the week will it be exactly 1,000,001 days from now? Explain each step clearly.",
"A man says: 'If today is Wednesday, I lie; otherwise, I tell the truth.' Is it possible to determine whether today is Wednesday? Justify your reasoning logically.",
"Demonstrate step-by-step whether the number 2027 is prime or composite without computational aids.",
"Two cyclists are on a circular track. One completes a lap every 6 minutes, the other every 8 minutes. If they both start together, after how many minutes will they next cross the starting point simultaneously? Provide a detailed reasoning process.",
"Is the following statement logically valid or invalid? 'All fruits are sweet. No sweet thing is spicy. Therefore, no fruit is spicy.' Explain your reasoning step-by-step.",
"Consider three integers: if the product of these three integers is negative, what can be concluded about the count of negative numbers among them? Prove your conclusion logically.",
"Explain step-by-step how you determine the remainder when \(7^{77}\) is divided by 5, without using computational tools.",
"Can the numbers from 1 to 16 be placed in a circle such that the sum of any two adjacent numbers is a prime number? Explain your reasoning explicitly and step-by-step.",
"Two doors are guarded by two guards: one always lies and the other always tells the truth. You may ask only one question to one guard to find the door to freedom. What exact question should you ask, and why will it always lead you to the correct door? Provide a logical, step-by-step justification."
]
},
{
"category": "Meta-cognitive Reflection and Error Correction",
"overview": "Large Language Models (LLMs) typically generate responses based on pattern recognition and statistical associations learned during training, rather than true logical reasoning or introspection. When asked to reflect on their previous answers, detect logical inconsistencies, and explicitly correct errors, they struggle due to their inherently forward-only inference process. Transformers lack an internal 'metacognitive loop'—they can't inherently evaluate their own reasoning pathways or systematically correct previously made errors. Instead, their answers are strongly influenced by context, prompt phrasing, and immediate textual associations rather than deeper reflective cognition.\n\nMoreover, LLMs often confabulate justifications when asked to correct themselves because they attempt to satisfy prompts linguistically rather than logically reconsider their prior responses. They rely on probabilistic text continuations rather than internal self-monitoring or introspective reasoning about correctness. This significantly limits their reliability in tasks that require nuanced introspection or identification of subtle logical contradictions within their own previously generated responses.",
"questions": [
"You previously stated that the capital of France is London. Why was that answer incorrect, and what is the correct answer?",
"Earlier, you explained that 2 liters of water weigh twice as much as 1 liter of water, but then you said volume doesn't affect weight. Can you explain and correct this contradiction?",
"You claimed that ice melts faster at lower temperatures. Reflect on why this reasoning was flawed and provide the accurate explanation.",
"Previously, you asserted that airplanes can fly in outer space without modification. Reflect on why this reasoning is wrong and provide the correct reasoning.",
"In an earlier answer, you stated humans can breathe normally at the summit of Mount Everest without supplemental oxygen. Identify and correct your mistake.",
"You previously concluded that a triangle can have two obtuse angles. Explain why this conclusion was incorrect and clarify the correct geometry.",
"You recently answered that shadows appear because objects emit darkness rather than block light. Recognize the logical error in your previous explanation and correct it.",
"Earlier, you argued that ice cubes melt faster in cold water than in hot water. Reflect on why your reasoning was incorrect and state the proper reasoning clearly.",
"You previously claimed that doubling the width of a rectangle doubles its area, regardless of height. Reflect on why this reasoning is incorrect, and correct your mistake.",
"You previously stated that heavier objects fall faster than lighter ones when dropped simultaneously from the same height in a vacuum. Why was that reasoning incorrect, and what is the accurate explanation?"
]
},
{
"category": "Constraint Satisfaction and Goal-Directed Reasoning",
"overview": "Large language models struggle significantly with questions involving explicit, unusual, or arbitrary structural constraints because they primarily generate outputs based on learned statistical associations rather than explicit symbolic or logical reasoning. While LLMs excel at pattern recognition, these questions require deliberate planning, iterative checks, and precise adherence to multiple simultaneous conditions. Transformer models inherently lack the mechanism to consistently maintain complex, interdependent constraints throughout generation, often defaulting to plausible but incorrect outputs that violate specified conditions.",
"questions": [
"Compose a five-line poem about oceans, with each line containing exactly seven words and each word starting with the letter O.",
"List three English sentences exactly fifteen words long, each sentence ending with the word 'zebra'.",
"Write a coherent short story exactly 100 words long, in which no word is repeated more than once.",
"Describe the concept of artificial intelligence using a paragraph of precisely 50 words, with each sentence alternating between beginning with a vowel and a consonant.",
"Provide instructions for making tea using exactly six sentences, each sentence containing exactly twelve words.",
"Create a dialogue between two people where each statement alternates between exactly five words and exactly seven words.",
"Write a short summary of World War II using only sentences of exactly ten words each, and every sentence must end with the letter D.",
"Explain photosynthesis in three sentences where every sentence includes precisely three occurrences of the word 'energy.'",
"Craft a paragraph describing winter, where the first word of each sentence follows alphabetical order (first sentence starts with A, second with B, and so on).",
"Generate a three-sentence description of a cat, with each sentence exactly double the length (in words) of the previous sentence."
]
},
{
"category": "Temporal and Causal Reasoning",
"overview": "LLMs struggle with temporal and causal reasoning because their underlying architectures primarily rely on associative patterns derived from vast textual datasets, rather than explicit causal logic or temporal sequencing mechanisms. They often fail to correctly interpret scenarios involving nuanced sequences of events, especially when the temporal or causal relationships are implicit, contradictory, or require deeper inference beyond simple chronological ordering. Since transformer-based models lack explicit internal temporal and causal representations, they default to surface-level associations, frequently overlooking logical consistency and causative relationships that humans implicitly understand.",
"questions": [
"John arrived at the train station at 9:05 AM, five minutes after his train departed. However, he still managed to board that same train. Explain how this is possible.",
"The sun had already set completely, yet Mike clearly saw the sunset. How can this happen?",
"A glass window shattered moments before the baseball hit it. Explain how this could logically happen.",
"Laura watered her plants every day, but they died because of lack of water. How could this occur?",
"James baked a cake perfectly, but when he served it moments later, it was raw in the center. How might this be possible?",
"The clock struck midnight, and then an hour later it showed 11:30 PM. Explain how this situation can logically occur.",
"Emma received a package today that was shipped tomorrow. How could this happen?",
"Alex saw lightning flash after he heard thunder. How could this sequence of events happen, given normal physics?",
"A tree fell onto Mike's car overnight, yet in the morning, the tree was still standing perfectly upright. What scenario explains this?",
"Maria took her medicine exactly as prescribed but fell ill because she didn't take her medication correctly. How could this contradiction be explained?"
]
},
{
"category": "Social and Theory-of-Mind Reasoning",
"overview": "Transformer-based large language models (LLMs) struggle with social and theory-of-mind reasoning primarily because they lack genuine understanding of human psychology and emotional states, instead relying heavily on statistical associations derived from textual data. While humans intuitively draw on nuanced social experiences, context clues, non-verbal signals, and shared cultural knowledge, LLMs rely solely on learned textual patterns, which often fail to capture subtle intentions, hidden motivations, or complex social dynamics. Additionally, accurately interpreting deceptive behavior or ambiguous social cues demands sophisticated inferential reasoning, something transformers lack in their architecture.\n\nFurthermore, theory-of-mind reasoning involves attributing mental states—beliefs, desires, or intentions—to individuals based on context and behavior, requiring explicit representation of others' mental states and perspective-taking. Current transformer models lack specialized cognitive structures to explicitly manage or reflect upon such psychological constructs, instead generating outputs based on statistical correlations. As a result, when faced with nuanced or ambiguous scenarios, LLMs frequently hallucinate, give vague responses, or confidently assert incorrect motivations.",
"questions": [
"John smiled when he heard that his rival got promoted. Is he genuinely happy for his rival, or could there be another reason?",
"A teenager deliberately avoids looking at his phone when receiving a notification in front of his parents. What might his motivation be?",
"Maria received a gift from her friend, glanced quickly at the tag, and suddenly forced a smile. What might Maria truly be feeling?",
"Two friends stop talking abruptly when a third friend enters the room. What might be the reason?",
"When the teacher praised Maria loudly in class, Maria looked away and blushed. Why might she have reacted this way?",
"A man repeatedly checks his phone during a date but denies that anything is wrong. What could he really be feeling?",
"Why might someone laugh nervously when confronted about a serious mistake they made?",
"Susan enthusiastically agreed to help Mike move houses but has been avoiding setting a firm date. What is her likely intention?",
"A politician publicly praises their opponent's idea, though privately they disagree strongly. What motivation might they have for doing so?",
"A child insists they're not scared of the dark, yet they hesitate before entering a dark room alone. What explains this behavior?"
]
},
{
"category": "Constraint Satisfaction and Rule-based Logic",
"overview": "Transformer-based large language models (LLMs) excel at recognizing patterns and associations based on vast amounts of text data but typically fail when required to explicitly manage multiple complex logical constraints simultaneously. These models do not have dedicated internal structures for systematically checking or enforcing explicit rules, which is necessary in constraint satisfaction problems. Consequently, when asked to solve puzzles, logical riddles, or structured rule-based scenarios, transformers frequently produce incorrect or nonsensical answers. This deficiency arises because transformers rely on statistical patterns rather than explicit symbolic reasoning, iterative backtracking, or systematic exploration of logical possibilities.",
"questions": [
"You have four jars labeled incorrectly: sugar, salt, flour, and rice. By sampling exactly one jar, determine how you can correctly label all jars.",
"Given five colored houses in a row, where the red house is immediately to the left of the blue one, and the green house is not next to the red house, what color is the second house if the last house is green?",
"In a Sudoku puzzle, a certain 3x3 block has all digits except 2, 4, and 7 filled in. If each row and column connected to this block also has these three digits missing exactly once, determine the placement of digit '4' in the block.",
"A logic puzzle states: Alice, Bob, and Carol each have different pets—cat, dog, bird. Alice doesn't own the bird. Carol owns the dog. Bob doesn’t own the cat. Who owns the bird?",
"On a chessboard, white has a rook, bishop, and king, and black only has a king. If white is to deliver checkmate in exactly two moves from a random position, what should white's first move be?",
"You have exactly three weighing operations to find the heavier counterfeit coin from among 27 identical-looking coins. Explain step by step how you accomplish this.",
"A seating arrangement puzzle requires that Sarah cannot sit next to Jake, Tom must sit to the right of Sarah, and Joe refuses to sit at either end. Given these constraints, who sits directly to the right of Sarah if there are exactly five seats?",
"Given a maze with exactly one correct path, describe the shortest systematic rule-based strategy you would use to ensure you find the exit without retracing unnecessary steps.",
"If three friends each bought different fruits (apple, banana, cherry) and each has exactly one friend who hates the fruit they bought, determine which friend bought the banana if Alice hates cherries and Carol bought apples.",
"Given a calendar where certain days are marked as holidays, weekends marked in red, and weekdays marked in blue, what explicit rules would guarantee that no three consecutive days off occur, assuming every holiday must be separated by at least two working days?"
]
},
{
"category": "Grounded Reasoning in Sensory Contexts",
"overview": "Transformer-based LLMs fundamentally lack direct sensory input, relying exclusively on textual data for training. Consequently, these models have no genuine sensory memory or embodied experience to draw upon when asked to describe nuanced details from senses such as smell, taste, tactile sensations, precise visual dynamics, or auditory subtleties. They approximate answers based solely on textual patterns and probabilistic inference rather than actual sensory perception. This absence of direct experiential grounding means they frequently provide generic, inaccurate, or implausible descriptions when asked about nuanced sensory experiences.",
"questions": [
"Describe the exact sensation of walking barefoot through fresh snow, including temperature, texture, and sound.",
"If you lick a 9-volt battery, describe exactly what you'd taste and feel on your tongue.",
"Explain precisely how the smell changes when bread dough begins to burn slightly in the oven.",
"What specific tactile sensations occur when you accidentally bite into aluminum foil?",
"If someone poured vinegar into baking soda, describe exactly how it sounds and feels when you touch the mixture.",
"What exactly would you see and feel if you rubbed your eyes vigorously for 30 seconds?",
"Describe the precise smell of a freshly extinguished candle wick.",
"Explain the exact sensory experience of biting into an underripe banana compared to an overripe one.",
"Describe the specific taste and sensation of drinking orange juice immediately after brushing your teeth.",
"Explain precisely how it feels on your tongue if you accidentally sip very hot coffee."
]
},
{
"category": "Constraint Satisfaction Problems (Explicit & Implicit)",
"overview": "Constraint satisfaction problems involve reasoning tasks that require satisfying multiple constraints simultaneously, either explicitly stated or implicitly implied. Large Language Models (LLMs), particularly transformer-based architectures, struggle with these problems because their reasoning is fundamentally associative rather than explicitly logical or combinatorial. Transformers excel at recognizing patterns and recalling typical associations, but they lack the explicit internal logic or symbolic reasoning needed to methodically evaluate complex combinational conditions and rules. Additionally, these models usually generate responses sequentially without an iterative backtracking or constraint-checking mechanism, making it challenging to manage conditions that depend on multiple simultaneous constraints. Consequently, LLMs often fail to solve or consistently generate correct answers to explicit constraint satisfaction problems, especially when these constraints are unusual, rare, or non-obvious.",
"questions": [
"List four 5-letter English words that each have a unique vowel at the second position, a consonant as the fourth letter, and end with the letter 'Y.'",
"Provide two different English words where each letter is alphabetically after the letter preceding it (strictly increasing alphabetical order).",
"Identify a valid 5-letter English word that contains exactly three vowels, placed consecutively, with the remaining two consonants being identical.",
"Name two English words, each exactly 6 letters long, where the second letter is the same as the fifth letter, and the third letter is 'M.'",
"Give an English word of at least six letters, which is a palindrome, contains at least two vowels, and includes the letter 'K.'",
"Find an English word of at least six letters that begins and ends with the same vowel, has exactly two vowels, and includes the letters 'R' and 'T.'",
"Provide a valid English word where all letters appear twice consecutively (for example: 'aabbcc', though that's not a valid English word).",
"List three distinct English words that each contain exactly five consonants in a row and no vowels adjacent to each other.",
"Name an English word with at least seven letters, exactly three distinct consonants, and letters arranged in alphabetical order.",
"Give three distinct English words where the second letter is 'Q' and the fourth letter is 'Z.'"
]
},
{
"category": "Hybrid Logical-Linguistic Constraints",
"description": "Questions requiring simultaneous reasoning over linguistic meaning and formal logical structures. These challenges demand LLMs to coherently integrate linguistic constructs with rigorous logical precision, requiring the synthesis of natural language fluency and explicit logical or paradoxical reasoning.",
"questions": [
"Write a grammatically correct sentence that logically implies it is false.",
"Construct a sentence that asserts it cannot be logically analyzed by any language model.",
"Provide a sentence describing itself as simultaneously true and false, without contradiction in wording.",
"Create a sentence that logically disproves its own existence.",
"Formulate a grammatically correct sentence that states an impossible condition about its own meaning.",
"Compose a sentence that claims every word within it is false, including the statement itself.",
"Generate a grammatically coherent sentence that logically entails infinite regress upon analysis.",
"Write a single sentence asserting it is both entirely meaningful and completely nonsensical.",
"Provide a sentence that truthfully declares itself incapable of making truthful statements.",
"Generate a single grammatical sentence that logically implies it cannot be understood by any intelligence."
]
},
{
"category": "Combinational and Constraint Satisfaction Reasoning",
"overview": "Large language models (LLMs) generally struggle with combinational and constraint satisfaction tasks because these problems require methodical exploration of multiple possibilities, the elimination of logically impossible scenarios, and systematic reasoning through a set of conditions. Transformer-based models rely heavily on statistical patterns and learned associations rather than explicit logical inference or iterative deduction. As a result, they frequently misinterpret, oversimplify, or overlook crucial constraints, leading to incorrect answers. Additionally, these models lack internal representations of logical states or structured working memory that would be necessary to manage the combinational complexity effectively.\n\nWhen given puzzles or problems requiring explicit constraint satisfaction—such as sorting through multiple conditions, eliminating invalid options, and carefully choosing solutions—LLMs often fail due to their inability to simulate multiple scenarios simultaneously or conduct explicit step-by-step logical deductions internally. The complexity of combinational logic often exceeds their associative, attention-based architecture, resulting in inconsistent, illogical, or confidently incorrect outputs.",
"questions": [
"Five friends—Anna, Bella, Clara, Donna, and Emma—have different pets: a cat, dog, bird, hamster, and turtle. Bella dislikes dogs, Clara is allergic to cats, Emma doesn't have a bird, and Donna's pet doesn't have fur. Which pet belongs to Bella?",
"Four students—Alice, Brian, Chris, and Diana—each won exactly one race: running, swimming, cycling, or rowing. Diana finished right after the swimmer. Bella is neither the runner nor the cyclist. Clara finished first. Which sport did Anna compete in?",
"Three boxes each contain a different fruit: apples, oranges, or bananas. Box A is not apples or bananas. Box B is not oranges. Box C is not bananas. Which fruit is in each box?",
"Four students—Liam, Jake, Paul, and Mike—sit in a row. Mike sits immediately to the right of Jake. Jake sits next to neither Mike nor Liam. Liam sits immediately to the left of Mike. Who is sitting in the middle?",
"Four athletes—Max, Tim, Luke, and Ryan—are ranked from first to fourth place. Ryan is faster than Tim, but slower than Luke. Luke is not first. Max is slower than Tim. Who finished third?",
"Three houses are painted different colors: red, blue, and yellow. The red house is directly to the left of the blue one. The yellow house is not next to the blue house. Which house is in the middle?",
"Four sisters—Amy, Beth, Cindy, and Dana—baked four different desserts: cake, pie, brownies, and cookies. Anna baked neither brownies nor pie. Donna baked cookies. Clara didn't bake pie or cookies. Beth didn't bake cookies. Who baked pie?",
"Four people are seated around a square table. Amy sits opposite Bob. Clara sits to the left of Amy. David is not next to Bob. Who sits directly opposite David?",
"Three cars—green, white, and black—park in a straight line. The green car is not in the middle. The black car is to the immediate right of the blue car. The blue car is not at either end. In what order are the cars parked?",
"Three friends—Alice, Bob, and Carol—each study one of three subjects: math, history, or biology. Alice does not study biology. The history student and Alice both dislike math. Which subject does Bob study?"
]
},
{
"category": "Linguistic and Syntactic Precision",
"overview": "Transformer-based LLMs primarily rely on statistical associations derived from training on large-scale textual data. This reliance makes them proficient in producing text that follows common linguistic patterns, but weak at generating outputs governed by precise or unusual syntactic constraints, especially those rarely encountered during training. Such challenges involve explicit structural rules that are non-statistical in nature, like acrostics, precise positional constraints, palindromes, or character-level requirements. Without explicit symbolic reasoning or strict constraint enforcement mechanisms, transformer models struggle to reliably align generated text with very specific, non-statistically driven linguistic conditions.",
"questions": [
"Write a sentence exactly 20 words long, with every fifth word beginning with the letter 'K'.",
"Generate three separate sentences that are perfect palindromes.",
"Create a paragraph in which every word alternates between vowels and consonants.",
"Formulate a paragraph where each sentence contains exactly three verbs and no adjectives.",
"Create a five-line acrostic poem where the first letters spell 'TRUTH' and the last letters spell 'LIES'.",
"Write a haiku using only words containing exactly two vowels each.",
"Compose a four-line rhyme where each line contains exactly ten syllables, and every third word is 'moon'.",
"Construct a sentence in which each consecutive word has one more letter than the previous word, starting from two letters and ending at seven letters.",
"Create a grammatically correct English sentence where every word is alphabetically ordered from shortest to longest.",
"Write two sentences that use the exact same letters in the exact same quantity but have completely different meanings."
]
},
{
"category": "Counterfactual Reasoning",
"overview": "Large Language Models struggle with counterfactual reasoning questions because these scenarios require explicit internal simulation of hypothetical conditions that significantly diverge from reality. Such reasoning demands an intricate exploration of causal relationships, dependencies, and outcomes that are not grounded in typical learned associations. Transformers inherently rely on patterns learned from real-world text data, thus lacking robust mechanisms for dynamically simulating and evaluating the complex, branching causal consequences that arise from conditions directly contradicting known reality.",
"questions": [
"If gravity suddenly reversed its direction, what would happen to Earth's oceans over the next 24 hours?",
"If humans had evolved to breathe underwater instead of air, describe how modern architecture would differ.",
"If all metals suddenly became transparent overnight, what short-term impacts would we observe on transportation systems?",
"How would society have evolved differently if the average human lifespan was only 10 years?",
"If humans could instantly teleport anywhere on Earth, how would urban planning and infrastructure change globally?",
"Imagine fire no longer produced heat; how would human civilization adapt to cooking and heating?",
"If humans naturally reproduced by laying eggs instead of live birth, what significant cultural changes would emerge?",
"Assume Earth had no moon. How would historical maritime exploration and navigation differ from what we know today?",
"If all electricity stopped functioning permanently tomorrow, describe the long-term societal and economic transformations that would occur globally.",
"If humans had wings capable of flight, explain how geopolitical boundaries and national borders might change."
]
}
]
Use "make 20 more of these" for each category, eliminate dupes with cosine similarity. repeat until there's 2k per category.
Now, for each question, use a smart model to generate a rubric for evaluating answers to that question.
A rubric looks like this:
{
"metrics": [
{
"name": "<metric1>",
"criteria": [
"<traits of a poor-quality answer for metric1>",
"<traits of an average-quality answer for metric1>",
"<traits of an excellent quality for metric1>"
]
},
{
"name": "<metric2>",
"criteria": [
"<traits of a poor-quality answer for metric2>",
"<traits of an average-quality answer for metric2>",
"<traits of an excellent quality for metric2>"
]
},
{
"name": "<metric3>",
"criteria": [
"<traits of a poor-quality answer for metric3>",
"<traits of an average-quality answer for metric3>",
"<traits of an excellent quality for metric3>"
]
}
]
}
Please create a rubric of 3 metrics by which to measure the quality of answers to the following question. Make the metrics EXTREMELY SPECIFIC to THIS EXACT QUESTION.
<question>
"Five friends—Anna, Bella, Clara, Donna, and Emma—have different pets: a cat, dog, bird, hamster, and turtle. Bella dislikes dogs, Clara is allergic to cats, Emma doesn't have a bird, and Donna's pet doesn't have fur. Which pet belongs to Bella?",
</question>
Return the JSON object of the rubric as described previously.
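A sketch of this per-question rubric step; call_smart_model is assumed to be a caller-supplied function wrapping whatever "smart model" API is chosen and is not defined in this document:

import json
from typing import Callable

def generate_rubric(question: str, call_smart_model: Callable[[str], str]) -> dict:
    """Build the rubric prompt for one question and parse the model's JSON reply."""
    prompt = (
        "Please create a rubric of 3 metrics by which to measure the quality of "
        "answers to the following question. Make the metrics EXTREMELY SPECIFIC "
        "to THIS EXACT QUESTION.\n"
        f"<question>\n{question}\n</question>\n"
        "Return the JSON object of the rubric as described previously."
    )
    # Assumes the model returns only the {"metrics": [...]} object shown above.
    return json.loads(call_smart_model(prompt))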
Finally, for each question, generate a reasoning trace with R1 and then an answer with Claude 3.7 or GPT-4.5, and score the answer with the rubric. Drop any question that does not score "excellent" on all 3 metrics.
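A sketch of this final trace-answer-score-filter step, under the assumption that generate_trace, generate_answer, and score_with_rubric are caller-supplied wrappers around R1, Claude 3.7 / GPT-4.5, and a rubric-grading judge; none are defined in this document, and score_with_rubric is assumed to return a metric-name-to-grade mapping:

from typing import Callable

def filter_questions(
    questions: list[str],
    generate_trace: Callable[[str], str],           # e.g. R1 reasoning trace for the question
    generate_answer: Callable[[str, str], str],     # e.g. Claude 3.7 / GPT-4.5, given question + trace
    score_with_rubric: Callable[[str, str], dict],  # returns {metric_name: grade} per the rubric
) -> list[str]:
    """Keep only the questions whose answers score "excellent" on every rubric metric."""
    kept = []
    for q in questions:
        trace = generate_trace(q)
        answer = generate_answer(q, trace)
        scores = score_with_rubric(q, answer)
        if all(grade == "excellent" for grade in scores.values()):
            kept.append(q)
    return kept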