@daturkel
Last active May 25, 2025 19:22

llm -T get_collections -T get_relevant_documents "how do i find similar documents based on embeddings in llm" --tools-debug

To find similar documents based on embeddings in LLM, I'll use the embedding database tools available. Let me show you how to do this.

First, let's see what collections are available in the embeddings database:

Tool call: get_collections({})
  ["documentation"]

I can see there's a collection called "documentation" in the embeddings database. Now I can show you how to find similar documents based on your query.

To find documents similar to a specific query, you can use the get_relevant_documents function. Here's how:

Tool call: get_relevant_documents({'query': 'finding similar documents with embeddings', 'collection_name': 'documentation', 'number': 3})
  [{"id": "docs/embeddings/index.md", "score": 0.7508922610353906, "content": "(embeddings)=\n# Embeddings\n\nEmbedding models allow you to take a piece of text - a word, sentence, paragraph or even a whole article, and convert that into an array of floating point numbers.\n\nThis floating point array is called an \"embedding vector\", and works as a numerical representation of the semantic meaning of the content in a many-multi-dimensional space.\n\nBy calculating the distance between embedding vectors, we can identify which content is semantically \"nearest\" to other content.\n\nThis can be used to build features like related article lookups. It can also be used to build semantic search, where a user can search for a phrase and get back results that are semantically similar to that phrase even if they do not share any exact keywords.\n\nSome embedding models like [CLIP](https://github.com/simonw/llm-clip) can even work against binary files such as images. These can be used to search for images that are similar to other images, or to search for images that are semantically similar to a piece of text.\n\nLLM supports multiple embedding models through {ref}`plugins <plugins>`. 
Once installed, an embedding model can be used on the command-line or via the Python API to calculate and store embeddings for content, and then to perform similarity searches against those embeddings.\n\nSee [LLM now provides tools for working with embeddings](https://simonwillison.net/2023/Sep/4/llm-embeddings/) for an extended explanation of embeddings, why they are useful and what you can do with them.\n\n```{toctree}\n---\nmaxdepth: 3\n---\ncli\npython-api\nwriting-plugins\nstorage\n```\n", "metadata": null}, {"id": "docs/embeddings/storage.md", "score": 0.5643716384039444, "content": "(embeddings-storage)=\n# Embedding storage format\n\nThe default output format of the `llm embed` command is a JSON array of floating point numbers.\n\nLLM stores embeddings in space-efficient format: a little-endian binary sequences of 32-bit floating point numbers, each represented using 4 bytes.\n\nThese are stored in a `BLOB` column in a SQLite database.\n\nThe following Python functions can be used to convert between this format and an array of floating point numbers:\n\n```python\nimport struct\n\ndef encode(values):\n    return struct.pack(\"<\" + \"f\" * len(values), *values)\n\ndef decode(binary):\n    return struct.unpack(\"<\" + \"f\" * (len(binary) // 4), binary)\n```\n\nThese functions are available as `llm.encode()` and `llm.decode()`.\n\nIf you are using [NumPy](https://numpy.org/) you can decode one of these binary values like this:\n\n```python\nimport numpy as np\n\nnumpy_array = np.frombuffer(value, \"<f4\")\n```\nThe `<f4` format string here ensures NumPy will treat the data as a little-endian sequence of 32-bit floats.", "metadata": null}, {"id": "docs/related-tools.md", "score": 0.5412134963208175, "content": "(related-tools)=\n# Related tools\n\nThe following tools are designed to be used with LLM:\n\n(related-tools-strip-tags)=\n## strip-tags\n\n[strip-tags](https://github.com/simonw/strip-tags) is a command for stripping tags from HTML. 
This is useful when working with LLMs because HTML tags can use up a lot of your token budget.\n\nHere's how to summarize the front page of the New York Times, by both stripping tags and filtering to just the elements with `class=\"story-wrapper\"`:\n\n```bash\ncurl -s https://www.nytimes.com/ \\\n  | strip-tags .story-wrapper \\\n  | llm -s 'summarize the news'\n```\n\n[llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/) describes ways to use `strip-tags` in more detail.\n\n(related-tools-ttok)=\n## ttok\n\n[ttok](https://github.com/simonw/ttok) is a command-line tool for counting OpenAI tokens. You can use it to check if input is likely to fit in the token limit for GPT 3.5 or GPT4:\n\n```bash\ncat my-file.txt | ttok\n```\n```\n125\n```\nIt can also truncate input down to a desired number of tokens:\n```bash\nttok This is too many tokens -t 3\n```\n```\nThis is too\n```\nThis is useful for truncating a large document down to a size where it can be processed by an LLM.\n\n(related-tools-symbex)=\n## Symbex\n\n[Symbex](https://github.com/simonw/symbex) is a tool for searching for symbols in Python codebases. 
It's useful for extracting just the code for a specific problem and then piping that into LLM for explanation, refactoring or other tasks.\n\nHere's how to use it to find all functions that match `test*csv*` and use those to guess what the software under test does:\n\n```bash\nsymbex 'test*csv*' | \\\n  llm --system 'based on these tests guess what this tool does'\n```\nIt can also be used to export symbols in a format that can be piped to {ref}`llm embed-multi <embeddings-cli-embed-multi>` in order to create embeddings:\n```bash\nsymbex '*' '*:*' --nl | \\\n  llm embed-multi symbols - \\\n  --format nl --database embeddings.db --store\n```\nFor more examples see [Symbex: search Python code for functions and classes, then pipe them into a LLM](https://simonwillison.net/2023/Jun/18/symbex/).\n", "metadata": null}]

Based on the results, here's how you can find similar documents using embeddings in LLM:

Finding Similar Documents with Embeddings in LLM

LLM provides tools to work with embeddings for semantic similarity search. Here's how to use it:

1. Understanding Embeddings

Embeddings convert text (words, sentences, paragraphs) into arrays of floating point numbers that represent the semantic meaning of the content. By calculating distances between these vectors, you can find semantically similar content.
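To illustrate that distance calculation, here is a minimal sketch in plain Python (not part of LLM itself; the tiny vectors and document names are made up for the example) that ranks candidate vectors by cosine similarity to a query vector:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
docs = {"doc_a": [0.9, 0.1, 0.8], "doc_b": [0.0, 1.0, 0.0]}

# Higher cosine similarity means semantically "nearer" content
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b'] -- doc_a points in nearly the same direction as the query
```

Real embedding vectors have hundreds or thousands of dimensions rather than three, but the ranking principle is the same.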

2. How to Find Similar Documents

To find similar documents based on embeddings:

  1. Use the get_relevant_documents function:
    • Provide a query (the text you want to find similar documents for)
    • Specify the collection name to search in
    • Optionally set the number of results you want

For example:

results = get_relevant_documents(
    query="your search text here",
    collection_name="your_collection_name",
    number=3  # Number of similar documents to retrieve
)

3. Additional Features

  • LLM stores embeddings efficiently in a SQLite database
  • You can embed multiple documents using llm embed-multi
  • The system supports various embedding models through plugins
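The storage point above is concrete: per the documentation retrieved by the tool call, LLM stores each embedding as a little-endian binary sequence of 32-bit floats in a SQLite `BLOB` column, and exposes the conversion helpers as `llm.encode()` and `llm.decode()`. Here is a quick round-trip using the same `struct`-based functions quoted in that documentation:

```python
import struct

def encode(values):
    # Pack floats as a little-endian ("<") sequence of 32-bit ("f") values
    return struct.pack("<" + "f" * len(values), *values)

def decode(binary):
    # Each float occupies 4 bytes, so the count is len(binary) // 4
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)

vector = [0.5, -1.25, 3.0]
blob = encode(vector)
print(len(blob))     # 12 bytes: 3 floats x 4 bytes each
print(decode(blob))  # (0.5, -1.25, 3.0) -- exact, since these values are representable in float32
```

Values that are not exactly representable in 32-bit floats will round-trip with small precision loss.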

If you want to perform this search with a different query or in a different collection, let me know and I can help you formulate the proper command.
