@thomasdavis
Created March 28, 2025 05:55

Got it. I’ll dig into whether others have encountered similar issues with OpenAI models (especially GPT-4 and 4o) returning excessively long or hallucinated UUID-like lists or arrays, and if this issue is more formally known or documented. I’ll also look for workarounds developers have successfully used, especially in tagging/filtering tasks like yours.

I’ll update you as soon as I have a clear picture.

Issues with Long UUID Lists in GPT Models

Observations of Hallucinated or Duplicated IDs

Developers have indeed reported that when asking GPT models to output a long list of IDs (e.g. UUIDs or database keys), the model can produce incorrect or repeated entries. In structured JSON outputs, the model sometimes invents new IDs or duplicates existing ones instead of sticking to the provided list. For example, one OpenAI forum user noted that about 10% of the time GPT-4 would return a randomly generated ingredient ID that did not match any ID in the provided list, even though the ingredient name was correct (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community). In another case, a developer’s chatbot for a pastry shop was supposed to return product ID and Name from a given catalog; it usually worked, but occasionally the model would output a completely made-up ID (a random hexadecimal string) for an item instead of the real ID (Assistant going through a database sometimes hallucinates IDs - API - OpenAI Developer Community). Reports on community forums and Reddit echo this pattern – the longer the list of items requested, the higher the chance the AI will start to hallucinate entries or repeat patterns instead of producing a perfectly faithful list (often well before hitting token limits).

These issues manifest as the model:

  • inventing IDs that are not in the provided list (plausible-looking but fabricated strings),
  • duplicating or repeating IDs it has already output, and
  • pairing a correct item name with the wrong ID.

Notably, when the list of allowed values is very small, one might expect the model to always choose a valid entry – but even then it can err. An OpenAI developer mentioned that with a list of just 3 valid enum strings, GPT-4 sometimes output an entirely made-up value instead of using one of the three options (Structured Outputs not reliable with GPT-4o-mini and GPT-4o - API). All these reports underline that large enumerations and ID lists are not a strength of GPT’s generative approach.

Numeric Strings vs. Richer Context Formats

There is anecdotal evidence that the format of the data matters. Purely numeric or opaque identifiers (like GUIDs/UUIDs, product codes, etc.) are essentially meaningless to the language model except as token sequences. The model doesn’t truly understand them as distinct entities – it only knows what it saw during training. So, when asked to produce a long list of such IDs, it may start to rely on learned patterns (or even training data examples) rather than faithfully reproducing the exact strings required. In the ingredient list example, each item had both an ID and a name; interestingly, the names were almost always correct (showing the model understood which ingredients were valid) but the IDs could be wrong (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community). This suggests that providing descriptive context (like a name/description alongside each ID) anchors the model somewhat – it can keep the name consistent – but it might still guess the associated numeric ID if it fails to recall it exactly.

When only raw numbers or ID strings are present, the model has even less context to latch onto. Developers have observed that if you ask for a list of, say, 100 random UUIDs, the model might output a few valid-looking ones and then inadvertently start repeating patterns or earlier IDs. The lack of semantic meaning in a long sequence of random digits/characters means the model may default to what it “knows” (for example, common UUID patterns or fragments seen in training data) or it may accidentally reuse a token sequence it already produced because it has no true random generator or memory of “used” values beyond its immediate context. In short, IDs presented as pure numbers or strings can be harder for the model to handle than IDs accompanied by names or other data – though as we saw, even additional context doesn't guarantee correctness.

Another factor is that OpenAI’s structured output system (JSON mode with schemas) currently cannot enforce certain constraints that would help in this scenario. For instance, the JSON Schema support doesn’t include minItems/maxItems for arrays or strict enumeration of allowed string values (Diving Deeper with Structured Outputs | by Armin Catovic | TDS Archive | Medium). So, you cannot tell the model “these are the only valid IDs” except by literally listing them in the prompt (which still isn’t foolproof). One experienced user pointed out that even if your schema says a field is a string, the content of that string isn’t validated by the model – e.g. if you expect a product_ids: List[str], the model will output a list of strings in the correct format, but those strings might not correspond to real IDs (Diving Deeper with Structured Outputs | by Armin Catovic | TDS Archive | Medium). In practice, this means numeric lists or codes are treated like any other text: the model will fill the slot with something that looks plausible. Without true reference checking, it may produce IDs that fit the pattern but aren’t from your provided set. Developers are advised to validate the model’s output against their source of truth (e.g. check each returned ID against the database and correct or reject any that don’t match) (Diving Deeper with Structured Outputs | by Armin Catovic | TDS Archive | Medium).
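
As a concrete illustration of that advice, here is a minimal validation sketch in plain Python; the product_ids field name and the ALLOWED_IDS set are hypothetical stand-ins for your actual schema and catalog. It flags both hallucinated IDs (not in the source of truth) and duplicates.

```python
import json

# Hypothetical source of truth: the IDs you actually sent to the model.
ALLOWED_IDS = {
    "4f3c2a1e-0b6d-4d2a-9c8e-1a2b3c4d5e6f",
    "7a9b8c7d-6e5f-4a3b-2c1d-0e9f8a7b6c5d",
    # ... the rest of your catalog
}

def validate_ids(model_json: str) -> dict:
    """Check a model response like {"product_ids": [...]} against the catalog."""
    data = json.loads(model_json)
    returned = data.get("product_ids", [])

    unknown = [i for i in returned if i not in ALLOWED_IDS]  # hallucinated IDs
    seen, duplicates = set(), []
    for i in returned:
        if i in seen:
            duplicates.append(i)  # repeated IDs
        seen.add(i)

    return {
        "valid": [i for i in returned if i in ALLOWED_IDS],
        "unknown": unknown,
        "duplicates": duplicates,
    }

# report = validate_ids(response_text)
# if report["unknown"] or report["duplicates"]:
#     ...re-prompt the model, or repair the list from your own database
```

If anything comes back in `unknown` or `duplicates`, you can either re-prompt with a correction or silently repair the list from your database, since you hold the ground truth.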

Why This Happens: Model Quirks with Lists and Numbers

Large Language Models (LLMs) like GPT-3.5/4 don’t have an innate concept of unique randomness or factual consistency – they generate the next token based on probabilities. For tasks like “list 100 unique identifiers”, the model isn’t executing a program or doing a database lookup; it’s essentially trying to predict a plausible-looking sequence of ID strings. After producing many items, it can run into issues with repetition or nonsense because it has no algorithmic method to ensure uniqueness. As one AI researcher noted, models have been shown to hallucinate a lot when generating numerals (e.g. dates, quantities, or other numeric strings) because they only have a limited ability to handle precise sequences that weren’t memorized (Study suggests that even the best AI models hallucinate a bunch | TechCrunch) (Diving Deeper with Structured Outputs | by Armin Catovic | TDS Archive | Medium). The longer the sequence (or list), the greater the chance of a slip-up propagating. In fact, if a single incorrect item appears early, it might even throw off subsequent items (a phenomenon where one hallucination can beget more if the model treats its own earlier output as context) – though in a structured list, the impact is mostly that one entry is wrong or out-of-place, not that the whole list derails.

It’s also worth noting there are no hard-coded limits in the model like “50 items is the max reliable list length”. Ranges like 50-100 items are practical observations, not documented thresholds. The model has a finite token window and will try to comply with the request, but error rates seem to climb as the list grows. One plugin developer observed that beyond a certain prompt size (~6000-7000 tokens of input), GPT-4’s outputs suddenly became much less reliable (ChatGPT halluciating with very long plugin response - Plugins / Actions builders - OpenAI Developer Community). In our case, a long list of output tokens (especially if the prompt context is already large) increases complexity. Essentially, long enumerations push the model into parts of its distribution it’s less certain about, leading to more filler or guessed content. This is a general behavior across LLMs – not a strict cutoff, but a gradient of increasing difficulty as the task length and requirement for consistency grow.

Workarounds and Best Practices

Developers have experimented with various techniques to mitigate these problems. Here are some effective strategies suggested by experienced users and OpenAI staff:

  • Reduce the temperature: When you need determinism and to avoid creative “guesses”, setting a low temperature helps. A temperature of 0 (or near-zero, like 0.1) makes the model more likely to output the highest-probability completion each time. This can prevent some randomness that leads to hallucinated IDs. An OpenAI staff member recommended lowering temperature if you’re seeing ID hallucinations (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community). (Note: GPT-4 structured output mode sometimes had issues even at temp=0 due to internal handling, but generally, low temp reduces variability.)

  • Provide examples or formatting cues: If possible, show the model a few examples of correct ID usage in your prompt (few-shot learning). You can even show a “bad” example and a “corrected” example. One developer suggested explicitly including examples of incorrect outputs (e.g., an ingredient with a wrong ID) and explaining why it’s wrong, so the model learns what not to do (Assistant going through a database sometimes hallucinates IDs - API - OpenAI Developer Community). This approach isn’t foolproof, but it can reinforce the pattern that IDs must come from the given set.

  • Break the task into chunks: Instead of asking for 100 IDs at once, you might prompt the model for 20 at a time, or some manageable batch, especially if you’re generating synthetic data. Smaller outputs are easier for the model to get right, and you can always combine them. This also keeps each response within a reasonable token count. Some users find that beyond a certain length, lists tend to go off-track, so requesting smaller lists and then merging can improve accuracy (at the cost of multiple API calls). A batching sketch follows this list.

  • Use intermediary identifiers (or just names) and post-process: A very robust solution is not to have the model output the database IDs at all. Instead, ask it for something easier (like the names of the items, or an index). As one forum expert put it: “Why bother sending IDs to the model?... Simply send the list of ingredients (names) and get the answer in a JSON list, then lookup the ID for every ingredient name you receive.” (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community). Many developers have adopted this strategy: the model selects items by name/description, and then the application code maps those names to the actual IDs from a database. This completely circumvents the issue of the model making up IDs (since you’re only relying on it for the name, which it’s less likely to hallucinate if your prompt is clear). In the pastry shop example, the developer ultimately did exactly this as a fallback – if the model gave a wrong ID, they would correct it by matching the product name to the real ID in their backend (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community). A name-to-ID lookup sketch also follows this list.

  • Include both ID and name in the output: Similar to the above, you can explicitly ask the model to return redundant information, like each entry as an object with both the name and id. This way, if it does hallucinate an ID, you still have the name to use as a key. One community member suggested sending both to be safe (Assistant going through a database sometimes hallucinates IDs - API - OpenAI Developer Community). This doesn’t prevent the model from making a mistake, but it gives you a way to detect or fix it (if the name and id don’t match, you know there’s an error and can reconcile it by trusting the name or doing a secondary lookup).

  • Constrain via functions or tools: If you’re using the OpenAI API with function calling, you can offload the ID resolution to a function. For example, have the model output a list of names (or some reference), and call a function get_ids(names_list) which returns the IDs from your database. The model then doesn’t need to produce the IDs at all – it just triggers the function. This way, the final answer to the user can include the IDs filled in by your function result. Anthropic’s Claude and other models similarly allow tool use or plugins which could fetch real IDs. This approach treats the model more as a reasoning engine than a data source, eliminating hallucination for those fields entirely. A minimal tool-calling sketch is included after this list.

  • Data formatting tricks: Some developers found that how you provide the data to the model can affect its fidelity. In OpenAI’s new “assistant with retrieval” (file-based knowledge) system, one user had poor results embedding product info as JSON, but got no hallucinations after switching to a plain text table format (Using Assistant API Retrieval Hallucinates - API - OpenAI Developer Community). They converted the product list into a CSV-like text (using uncommon delimiters to avoid confusion) and included that in the prompt or system message. The structured JSON input might have confused the model’s narrative understanding, whereas a simpler text list was easier for it to reference correctly. The takeaway: if one format isn’t working (e.g. a big JSON blob of IDs), try formatting your list differently – sometimes a bulleted list of “ID – Name” pairs in plain language works better than a raw JSON array. It gives the model a more narrative-friendly context to draw from.

  • Validate and correct programmatically: Ultimately, the safest fix is post-validation. Treat the model’s output as a first draft. You can write a small routine to check each UUID in the response: is it in the allowed list? Does it appear more than once? If something’s off, you can either prompt the model again to fix the mistake or just fix it yourself (since you have the source data). One Medium article on structured outputs explicitly recommends this – if you expect a list of product IDs from the model, be prepared to compare them against your actual product catalog and flag or replace any unknown IDs (Diving Deeper with Structured Outputs | by Armin Catovic | TDS Archive | Medium). This adds a bit of work, but in production it may be necessary to guarantee correctness. In practice, many developers combine approaches: they use low temperature, careful prompting, and a validation layer as backup.

  • Adjust model parameters for repetition: When duplication of entries is the problem (model repeats some IDs or list items), you can try increasing the presence_penalty or frequency_penalty on the API. These penalties discourage the model from outputting the same token or phrase multiple times (You can pretty much duplicate any custom GPT by asking it ... - Reddit). For instance, a presence_penalty makes it less likely for the model to reuse an exact ID string once it’s already appeared. This might help ensure more variety in a long list. However, use this carefully – too high a penalty might make the model avoid legitimately reusing a token and could lead to bizarre output. In general, a small positive penalty could reduce verbatim duplicates in the list.
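
To make the chunking idea concrete, here is a minimal sketch assuming the official OpenAI Python SDK, a hypothetical tagging task, and a batch size of 20; the model name, prompt wording, and JSON shape are illustrative, not prescriptive. It also applies the low-temperature advice from the first bullet, and it asks the model for names and tags rather than raw IDs.

```python
import json
from openai import OpenAI

client = OpenAI()
BATCH_SIZE = 20  # smaller batches keep each response short and easier to verify

def tag_items(items: list[dict]) -> list[dict]:
    """Ask the model to tag items in small batches and merge the results."""
    results = []
    for start in range(0, len(items), BATCH_SIZE):
        batch = items[start:start + BATCH_SIZE]
        prompt = (
            "Assign a category tag to each item below. "
            'Respond in JSON as {"items": [{"name": "...", "tag": "..."}]}.\n'
            + "\n".join(f"- {item['name']}" for item in batch)
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,                            # determinism over creativity
            response_format={"type": "json_object"},  # JSON mode
            messages=[{"role": "user", "content": prompt}],
        )
        # Assumes the model honors the requested {"items": [...]} shape;
        # validate before trusting it, as discussed above.
        results.extend(json.loads(response.choices[0].message.content)["items"])
    return results
```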
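
For the “don’t make the model output IDs at all” strategy, the post-processing side is just a lookup table. The sketch below uses a hypothetical CATALOG dictionary and a resolve_ids helper; in a real application the mapping would come from your database.

```python
# Hypothetical catalog keyed by the ID your database actually uses.
CATALOG = {
    "prod_001": "Croissant",
    "prod_002": "Pain au Chocolat",
    "prod_003": "Almond Tart",
}
NAME_TO_ID = {name.lower(): pid for pid, name in CATALOG.items()}

def resolve_ids(names_from_model: list[str]) -> list[str]:
    """Map the names the model returned back to trusted database IDs."""
    resolved, unmatched = [], []
    for name in names_from_model:
        pid = NAME_TO_ID.get(name.strip().lower())
        if pid is None:
            unmatched.append(name)  # the model may still misspell or invent a name
        else:
            resolved.append(pid)
    if unmatched:
        # fall back to fuzzy matching, re-prompting, or human review
        raise ValueError(f"Unrecognized item names from model: {unmatched}")
    return resolved
```

The model never sees or emits a UUID, so there is nothing for it to hallucinate on that field; the only failure mode left is a misspelled or invented name, which is easy to detect.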
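
And here is a minimal sketch of the tool-calling variant, again assuming the OpenAI Python SDK; the get_ids tool name and parameter schema are illustrative. The model only emits item names as tool arguments, and your own code (for example, the resolve_ids helper from the previous sketch) performs the authoritative ID lookup.

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_ids",
        "description": "Look up database IDs for a list of product names.",
        "parameters": {
            "type": "object",
            "properties": {
                "names": {"type": "array", "items": {"type": "string"}}
            },
            "required": ["names"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Which pastries pair well with coffee?"}],
    tools=tools,
    # force the model to answer via the tool rather than free text
    tool_choice={"type": "function", "function": {"name": "get_ids"}},
)

call = response.choices[0].message.tool_calls[0]
names = json.loads(call.function.arguments)["names"]
ids = resolve_ids(names)  # your code does the authoritative name-to-ID lookup
```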

In summary, the best practice is to not solely trust the model to faithfully reproduce long lists of arbitrary IDs. Either structure your prompt to avoid needing that (using names or smaller chunks) or implement checks. As one OpenAI forum commenter quipped, the question should be “why rely on the model for this at all?” when simple deterministic post-processing can do a better job (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community). Use the model for what it’s good at (semantic understanding, following instructions in general) and let your own code handle the exact ID mappings and validations whenever possible.

Model-Specific Differences (GPT-3.5 vs GPT-4 vs Others)

GPT-3.5 vs GPT-4: Generally, GPT-4 is more capable and reliable than GPT-3.5 in following instructions and producing structured output. Developers have found GPT-4 to adhere to JSON schemas and list formats more strictly. GPT-3.5 (especially the 2023 versions) was more prone to format errors and omissions when asked for long structured answers. So, if you’re using an older model and hitting these issues, switching to GPT-4 often helps – but as we’ve seen, it doesn’t solve the issue entirely. GPT-4 can still hallucinate IDs; it just might do so slightly less often or recover better when corrected.

GPT-4 (original) vs GPT-4o: In May 2024, OpenAI introduced GPT-4o (“omni”), a faster multimodal variant of GPT-4. Some developers reported that GPT-4o behaved a bit differently. In one report, a strict JSON summarization prompt that “worked perfectly” on gpt-4-turbo and even on gpt-3.5-turbo started producing fake data (hallucinated URLs) consistently with gpt-4o, despite using temperature 0 (GPT-4o - Hallucinating at temp:0 - Unusable in production - Feedback - OpenAI Developer Community). Another user in that thread confirmed GPT-4o would eventually ignore instructions and return malformed JSON, whereas the older GPT-4 did fine (GPT-4o - Hallucinating at temp:0 - Unusable in production - Feedback - OpenAI Developer Community). This suggests that early in GPT-4o’s life, there were some quirks – possibly due to a different fine-tuning or system message – making it more prone to certain hallucinations. OpenAI has likely improved this with subsequent updates, but it’s a reminder that “newer” isn’t always immediately better. If you experience issues on one model, testing the same prompt on a closely related model (e.g. gpt-4-0613 vs gpt-4-turbo vs gpt-4o) can be illuminating. Sometimes a temporary rollback or using the more established model is a viable short-term fix (as one user decided to stick with GPT-4 turbo until 4o’s issues were resolved (GPT-4o - Hallucinating at temp:0 - Unusable in production - Feedback - OpenAI Developer Community)).

On the flip side, GPT-4o and newer versions have the advantage of features like longer context windows (some variants allow more tokens) and better support for function calling/structured output. These can help with large lists if used properly. For instance, a longer context means you could include a very large list of allowed IDs in the prompt (though beyond a certain point that’s counterproductive), or you could handle bigger JSON outputs without truncation. But no matter the context size, the fundamental behavior (the model guessing or repeating IDs) still needs the strategies discussed above.

Claude (Anthropic) and others: The phenomenon of messing up long lists is not unique to OpenAI’s models. It’s a common limitation of current LLMs. A TechCrunch report succinctly stated that all generative models will hallucinate details – naming Google’s Gemini, Anthropic’s Claude, and OpenAI’s GPT-4o in the same breath (Study suggests that even the best AI models hallucinate a bunch | TechCrunch). Claude, especially in its earlier versions, could also lose track or start repeating entries in a long enumeration. However, anecdotal feedback suggests Claude 2 (and Claude 3) might hallucinate slightly less frequently than ChatGPT in certain factual scenarios (I'm finding Claude to hallucinate less than ChatGPT, and to be far ...), likely due to differences in training (Claude is tuned heavily for harmlessness and might be more inclined to say “I don’t know” rather than make something up). That said, when it comes to purely synthetic tasks like generating UUIDs or lists of IDs, Claude does not have a magic bullet either – it uses the same type of language modeling. One advantage Claude offers is a 100k token context window in the latest versions, which means it can handle very large prompts or databases. In theory, you could feed all 50-100 valid IDs into Claude’s prompt and strongly instruct it to only use those. Claude would likely comply better given the ample context (it can “see” all those IDs clearly). But even Claude could err if asked to produce a very long list of new random IDs – because just like GPT, it has no internal RNG or understanding of UUID algorithms. In practice, you’d still need to verify Claude’s outputs for correctness, though you might find it refuses to fabricate an answer more often (which can be preferable – a refusal or an apology is easier to handle than a subtly wrong ID).

Other models (Groq etc.): Groq is not a model in its own right but an inference platform: it runs open-weight models (such as Llama and Mixtral) on custom hardware built for very fast generation. There’s no specific documentation on how the models it serves handle long ID lists, but since they are ordinary large language models, the same caution applies. Speed (Groq’s main selling point) doesn’t equate to accuracy in this sense. Until we see a system that is architecturally different (e.g. one that can internally query a database or has a component specifically for combinatorial logic), we should assume the limitation is fundamental: LLMs are unreliable for long, exact, arbitrary lists.

In summary, GPT-4 (and variants) currently offers the best reliability from OpenAI for structured outputs, but it’s not perfect. GPT-3.5 is a bit more error-prone with JSON and long lists. Claude’s newer versions might hallucinate slightly less and can handle more context, but they will still require similar precautions. And every model – whether it’s OpenAI’s, Anthropic’s, or others like Groq – can and will make things up at times, especially when pressured to produce long, pattern-less outputs. As one article put it, even the best models in 2024 could only be guaranteed hallucination-free about 35% of the time in general (Study suggests that even the best AI models hallucinate a bunch | TechCrunch) – which is a stark reminder to never fully trust an LLM’s output without verification when correctness is critical.

Conclusion

Developers have documented that asking GPT models for long lists of UUIDs/IDs can lead to hallucinated, duplicated, or incorrect outputs, especially as the list length grows. This seems to occur because the model isn’t actually retrieving or calculating these IDs – it’s generating them based on learned patterns. The issue is noticeable in both pure numeric ID lists and structured JSON outputs (though providing richer context like names can slightly reduce errors). OpenAI’s documentation and community experts acknowledge that hallucinations are still possible in structured mode, and that the model will obey format constraints while possibly getting the content wrong (Diving Deeper with Structured Outputs | by Armin Catovic | TDS Archive | Medium). There are no hard limits set by OpenAI on list lengths, but absent explicit support for things like uniqueness or enum enforcement, the model may go off-track on long outputs.

Thankfully, the developer community (and OpenAI staff) have proposed several workarounds and best practices: lowering generation randomness (temp), chunking the task, using surrogate outputs (like names) and mapping to IDs in post-processing (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community), formatting input data in a model-friendly way (even if that means avoiding complex JSON in the prompt), and always validating the AI’s output against a trusted source. Some have also used function calling or tools to fetch correct IDs, rather than depending on the model’s memory.

Finally, this behavior is consistent with the nature of LLMs. Newer models like GPT-4 and GPT-4o have improved capabilities and larger contexts (and generally handle instructions better than GPT-3.5), but they are not immune to these quirks. Indeed, early adopters of GPT-4o noticed some regression in strict compliance, which highlights that prompt tuning might be needed when switching models (GPT-4o - Hallucinating at temp:0 - Unusable in production - Feedback - OpenAI Developer Community). Competing models like Claude show similar fundamental limitations – all current large language models will sometimes “make up” list items if not carefully guided (Study suggests that even the best AI models hallucinate a bunch | TechCrunch).

Reliable insight: You should treat an LLM’s long-list output as a helpful draft, not final truth. By combining smart prompting with application-level checks or transformations, many developers have achieved workable solutions. For example, one developer was able to stop hallucinations entirely by changing how data was given to the model (using a plain text table) (Using Assistant API Retrieval Hallucinates - API - OpenAI Developer Community). Others simply removed the need for the model to output IDs, which is arguably the most robust fix (Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community). In critical applications, a “human in the loop” or a deterministic post-process to catch errors is highly recommended. The consensus from experienced users is to never rely on raw model output for unique identifiers or database keys without verification – use the model for what it’s good at (language and logic) and let your system handle the exact IDs.

Sources:

  • Gpt-4o hallucinates inventing IDs from provided list - API - OpenAI Developer Community
  • Assistant going through a database sometimes hallucinates IDs - API - OpenAI Developer Community
  • Structured Outputs not reliable with GPT-4o-mini and GPT-4o - API - OpenAI Developer Community
  • GPT-4o - Hallucinating at temp:0 - Unusable in production - Feedback - OpenAI Developer Community
  • ChatGPT halluciating with very long plugin response - Plugins / Actions builders - OpenAI Developer Community
  • Using Assistant API Retrieval Hallucinates - API - OpenAI Developer Community
  • Diving Deeper with Structured Outputs, by Armin Catovic, TDS Archive (Medium)
  • Study suggests that even the best AI models hallucinate a bunch, TechCrunch
  • "You can pretty much duplicate any custom GPT by asking it ..." - Reddit
  • "I'm finding Claude to hallucinate less than ChatGPT, and to be far ..."
