Fingerprinting Minified JavaScript Libraries / AST Fingerprinting / Source Code Similarity / Etc

Some notes and tools on fingerprinting minified JavaScript libraries, AST fingerprinting, source code similarity, etc.

Original Notes
ChatGPT Explorations
Musings
- On Twitter
  - Embedding Based Code Search Across the Open-Source Ecosystem
  - Is there anything like PyPi-Data but for the NPM / JavaScript ecosystem?
Code Embeddings
Code Search
npm Package Ranking, Bundle Size, etc
- npm Package Registry Data, Ranking, etc
- Package / Bundle Size, Bundle Analyzer / Visualizer, etc
  - Size Visualisation Data Structures
Link Dump 1
- Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity
Link Dump 2
- OpenAI Embeddings
- Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2)
Link Dump 3
Link Dump 4
Link Dump 5
Software Similarity and Classification (2012; Book; Silvio Cesare, Yang Xiang)
Unsorted
See Also
- My Other Related Deepdive Gist's and Projects

Original Notes

I created this gist because there was too much content on this topic to keep tacking onto my older gist on Deobfuscating / Unminifying Obfuscated Web App / JavaScript Code. Until I move all of the relevant content from there to this gist, here is a link to the main notes I was keeping there (largely copies of my comments on various relevant GitHub repos exploring this topic, plus related research / tools / etc):

fingerprinting-minified-javascript-libraries.md
- Fingerprinting Minified JavaScript Libraries

ChatGPT Explorations

https://chatgpt.com/c/d2713f5a-19ee-41fe-836d-0db4ba3daeac
- Public Share (created 2025-03-25): https://chatgpt.com/share/67e25fc8-f638-8008-a610-3edaa6614072
- Private ChatGPT conversation about various things related to AST fingerprinting/etc; or as it summarised itself:
  - This chat explored how to create a stable and efficient system for fingerprinting and identifying variables in minified JavaScript code using structural patterns from AST analysis. We examined how tools like eslint-scope can help extract scope and reference data, discussed structural fingerprinting techniques inspired by academic research, and considered which JavaScript elements typically survive minification (like strings, symbols, and function structures). Finally, we developed an enhanced AST traversal script that categorizes these preserved elements by context—scopes, functions, classes, and modules—to make them easier to understand and analyze.
- TODO: Summarise/pull out the relevant parts from this and include them here
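
The summary above notes that string literals tend to survive minification. A minimal sketch of that idea, assuming a crude regex is good enough to pull out quoted strings (a real implementation would walk the AST; `string_fingerprint` and `jaccard` are hypothetical names, not from the chat):

```python
import hashlib
import re

# Crude matchers for double- and single-quoted JS string literals.
# Assumption: template literals, regex literals and comments are ignored.
DOUBLE = r'"((?:[^"\\\n]|\\.)*)"'
SINGLE = r"'((?:[^'\\\n]|\\.)*)'"
STRING_LITERAL = re.compile(DOUBLE + "|" + SINGLE)


def string_fingerprint(source: str, min_len: int = 4) -> frozenset[str]:
    """Return short hashes of the string literals found in `source`."""
    hashes = set()
    for match in STRING_LITERAL.finditer(source):
        literal = match.group(1) if match.group(1) is not None else match.group(2)
        if len(literal) >= min_len:  # very short strings are too common to be distinctive
            digest = hashlib.sha256(literal.encode("utf-8")).hexdigest()
            hashes.add(digest[:16])
    return frozenset(hashes)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Set overlap between two fingerprints (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = 'function greet(name) { if (!name) throw new Error("name is required"); return "Hello, " + name; }'
minified = 'function a(b){if(!b)throw new Error("name is required");return"Hello, "+b}'
unrelated = 'function c(d){return d.split("--sep--")}'

print(jaccard(string_fingerprint(original), string_fingerprint(minified)))
print(jaccard(string_fingerprint(original), string_fingerprint(unrelated)))
```

Identifier names change between `original` and `minified`, but the two string literals do not, so the fingerprints overlap completely. This only illustrates the principle; strings shared by many libraries would need weighting or filtering.
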
https://chatgpt.com/c/67e25d5d-1aa4-8008-ac08-c971ac64090e
- Public Share (created 2025-03-25): https://chatgpt.com/share/67e25f3a-b604-8008-9d83-e12c738eb306
- Private ChatGPT conversation about various things related to identifying NPM imports in a bundled app's module import/export graph; or as it summarised itself:
  - This chat discusses techniques for analyzing a module dependency graph extracted from a bundled and minified JavaScript web app to identify subgraphs likely representing third-party library code. It covers methods such as graph clustering (e.g., Louvain, spectral clustering), centrality analysis, import tree depth, symbol naming heuristics, fingerprint/signature matching, entropy analysis, and dynamic profiling. These approaches help isolate self-contained, library-like clusters that can potentially be "sliced off" from the main application logic, supporting the goal of distinguishing app code from imported npm dependencies.
- TODO: Summarise/pull out the relevant parts from this and include them here
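
As a rough illustration of "slicing off" self-contained clusters from a module graph, here is a minimal sketch. It is a simple dominance-style heuristic of my own, not one of the clustering algorithms named above (Louvain, spectral clustering, etc.), and the module names and the `sliceable_clusters` helper are hypothetical:

```python
from collections import defaultdict


def reachable(graph, start):
    """All modules reachable from `start` via imports, including `start`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def sliceable_clusters(graph, min_size=2):
    """Map each candidate root to its dependency closure when every module
    in the closure (other than the root) is only imported from inside it."""
    importers = defaultdict(set)
    for module, deps in graph.items():
        for dep in deps:
            importers[dep].add(module)

    clusters = {}
    for root in graph:
        if not importers[root]:
            continue  # nothing imports it: an app entry point, trivially self-contained
        closure = reachable(graph, root)
        if len(closure) < min_size:
            continue
        if all(importers[m] <= closure for m in closure - {root}):
            clusters[root] = closure
    return clusters


# Toy graph: module -> modules it imports.
graph = {
    "app/main": ["app/ui", "lib/entry"],
    "app/ui": ["lib/entry", "app/util"],
    "app/util": [],
    "lib/entry": ["lib/a", "lib/b"],
    "lib/a": ["lib/b"],
    "lib/b": [],
}
print(sliceable_clusters(graph))
```

Here `lib/entry` is the only root reported: the app reaches `lib/a` and `lib/b` solely through it, which is the shape a bundled npm package often has. Real bundles are messier (shared helpers, hoisted modules), so this would be one signal among the others listed above rather than a classifier on its own.
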

Musings

On Twitter

Embedding Based Code Search Across the Open-Source Ecosystem

https://x.com/_devalias/status/1905905312093053215
- @_devalias (March 29 2025)
  
  I wonder who's going to give me robust embedding based search across the entire open source ecosystem first.. @github code search, or @Sourcegraph?
  
  Ideally not just at a file level, but at a function level.
  
  I don't think either do currently, but I may not have read deep enough
- https://x.com/_devalias/status/1905905692869042316
  - @_devalias (March 29 2025)
    
    Basically, given a random snippet of (potentially minified) code from a JS bundle; I want to be able to create an embedding for it, and then search for that across the whole NPM package ecosystem / open-source JS repos; and be able to identify which dependency it is.
- https://x.com/_devalias/status/1905906049917542735
  - @_devalias (March 29 2025)
    
    There are ways I could do this currently, by extracting various 'stable' / 'salient' parts from the module and then using the regex/etc search features for it.
    
    But it just kind of feels like being able to find the closest matches based on a code embedding would be even nicer.
- https://x.com/_devalias/status/1905906304100778169
  - @_devalias (March 29 2025)
    
    Bonus points would also be if by searching via that embedding, not only did it end up matching the library I wanted; but if it identified the specific version/commit/similar because it was a closer match.
- https://x.com/_devalias/status/1905908702215094752
  - @_devalias (March 29 2025)
    
    My ever-growing deep dive gist of thoughts/resources/research/etc tangentially related to this and similar:
    
    Fingerprinting Minified JavaScript Libraries / AST Fingerprinting / Source Code Similarity / Etc
    
    https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#fingerprinting-minified-javascript-libraries--ast-fingerprinting--source-code-similarity--etc
- https://x.com/_devalias/status/1906298255278739858
  - @_devalias (March 30 2025)
    
    Or as the wildcard entry.. just stumbled across @boyter 's http://searchcode.com + Blog full of a literal goldmine of interesting looking content related to it!
    
    (will be dumping a pile of interesting looking blog links into my aforementioned gist as soon as I can)

Is there anything like PyPi-Data but for the NPM / JavaScript ecosystem?

https://x.com/_devalias/status/1913111558810456574 / https://bsky.app/profile/devalias.net/post/3ln2xcu6wh225
- @_devalias (April 18 2025)
  
  Does anyone know if there is anything like https://github.com/pypi-data / https://py-code.org but for the @npmjs / JavaScript ecosystem?
- https://bsky.app/profile/devalias.net/post/3ln2xms5p7s24
  - @_devalias (April 18 2025)
    
    This was the deep dive rabbithole of things I found when I was last looking into this sort of thing: https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#npm-package-ranking-bundle-size-etc

Code Embeddings

Benchmarks / Leaderboards / etc

MMTEB: Massive Multilingual Text Embedding Benchmark (2025) / MTEB: Massive Text Embedding Benchmark (2022)

https://huggingface.co/spaces/mteb/leaderboard
- MTEB Leaderboard
- Embedding Leaderboard: This leaderboard compares 100+ text and image embedding models across 1000+ languages. We refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types.
https://github.com/embeddings-benchmark/mteb
- MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2502.13595
- MMTEB: Massive Multilingual Text Embedding Benchmark (2025)
- Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
https://arxiv.org/abs/2210.07316
- MTEB: Massive Text Embedding Benchmark (2022)
- Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.

CoIR: A Comprehensive Benchmark for Code Information Retrieval Models (2024-2025)

https://archersama.github.io/coir/
- CoIR: Code Information Retrieval Benchmark
- This leaderboard evaluates various models on different Code Retrieval tasks.
https://arxiv.org/abs/2407.02883
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models
- Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Addressing this gap, we present COIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. COIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We first discuss the construction of COIR and its diverse dataset composition. Further, we evaluate nine widely used retrieval models using COIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. To facilitate easy adoption and integration within existing research workflows, COIR has been developed as a user-friendly Python framework, readily installable via pip. It shares same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through COIR, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems.
- https://arxiv.org/html/2407.02883v3
- https://www.alphaxiv.org/overview/2407.02883
https://github.com/CoIR-team/coir
- COIR - Benchmarking Code IR
- (ACL 2025 Main) A Comprehensive Benchmark for Code Information Retrieval.
- CoIR (Code Information Retrieval) benchmark, is designed to evaluate code retrieval capabilities. CoIR includes 10 curated code datasets, covering 8 retrieval tasks across 7 domains. In total, it encompasses two million documents. It also provides a common and easy Python framework, installable via pip, and shares the same data schema as benchmarks like MTEB and BEIR for easy cross-benchmark evaluations.
- Why do the results on the MTEB and CoIR leaderboards differ? See this issue:
  - CoIR-team/coir#17
    - Why Do the Results on the MTEB and CoIR Leaderboards Differ?
    - Some users may have noticed discrepancies between evaluation results obtained using the MTEB and CoIR frameworks. The primary reason for this difference lies in the handling of the title field in certain datasets within the CoIR benchmark.
      
      In the MTEB evaluation framework, the title field is utilized by default, whereas CoIR does not incorporate this field in its standard evaluation. As a result, models evaluated under the MTEB framework may exhibit higher performance scores compared to CoIR. In the CoIR paper, the reported results exclude the title field.
    - This issue is specific to the current implementation. In future versions of the coir package, the title field will be included as a default component in the evaluation process.

Embedding Models

Qwen3 Embedding (2025-06)

https://qwenlm.github.io/blog/qwen3-embedding/
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models (2025-06-05)
- We release Qwen3 Embedding series, a new proprietary model of the Qwen model family. These models are specifically designed for text embedding, retrieval, and reranking tasks, built on the Qwen3 foundation model. Leveraging Qwen3’s robust multilingual text understanding capabilities, the series achieves state-of-the-art performance across multiple benchmarks for text embedding and reranking tasks. We have open-sourced this series of text embedding and reranking models under the Apache 2.0 license on Hugging Face and ModelScope, and published the technical report and related code on GitHub.
- The Qwen3 Embedding series offers a diverse range of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to various use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.
- The Qwen3 Embedding series support over 100 languages, including various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.

CodeXEmbed (SFR-Embedding-Code) (2025-01)

https://www.salesforce.com/blog/sfr-embedding-code/
- SFR-Embedding-Code: A Family of Embedding Models for Code Retrieval
- Code retrieval is a critical, yet under explored area in the field of artificial intelligence. While text retrieval systems have seen remarkable success in natural language processing (NLP) tasks, these approaches often fall short when applied to code. Developers face unique challenges when retrieving code snippets, such as understanding syntax, control flow, and variable dependencies. Enter SFR-Embedding-Code, a groundbreaking family of code embedding models that aims to address these challenges and revolutionize how we retrieve and generate code. Whether you’re a seasoned programmer or someone curious about the intersection of AI and coding, this blog will walk you through how SFR-Embedding-Code is revolutionizing the landscape.
- SFR-Embedding-Code introduces a family of large-scale, open-source embedding models with parameter sizes ranging from 400 million to 7 billion. These models redefine the state-of-the-art in code retrieval, outperforming the second best model by over 20% on the CoIR benchmark.
- https://www.reddit.com/r/machinelearningnews/comments/1i4rofm/salesforce_ai_research_introduced_codexembed/
  - Salesforce AI Research Introduced CodeXEmbed (SFR-Embedding-Code): A Code Retrieval Model Family Achieving #1 Rank on CoIR Benchmark and Supporting 12 Programming Languages
  - https://www.marktechpost.com/2025/01/18/salesforce-ai-research-introduced-codexembed-sfr-embedding-code-a-code-retrieval-model-family-achieving-1-rank-on-coir-benchmark-and-supporting-12-programming-languages/
https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R
- Salesforce/SFR-Embedding-Code-400M_R
https://huggingface.co/Salesforce/SFR-Embedding-Code-2B_R
- Salesforce/SFR-Embedding-Code-2B_R
https://arxiv.org/abs/2411.12644v2
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval (2024)
- Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving code. This gap leaves existing models unable to effectively capture the diversity of programming languages and tasks across different domains, highlighting the need for more focused research in code retrieval. To address this, we introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework, enhancing model generalizability and retrieval performance. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark. In addition to excelling in code retrieval, our models demonstrate competitive performance on the widely adopted BeIR text retrieval benchmark, offering versatility across domains. Experimental results demonstrate that improving retrieval performance significantly enhances end-to-end Retrieval-Augmented Generation (RAG) performance for code-related tasks.
- https://www.alphaxiv.org/overview/2411.12644

Voyage (2024-2025)

https://blog.voyageai.com/2025/01/07/voyage-3-large/
- voyage-3-large: the new state-of-the-art general-purpose embedding model
- TL;DR – Introducing voyage-3-large, a new state-of-the-art general-purpose and multilingual embedding model that ranks first across eight evaluated domains spanning 100 datasets, including law, finance, and code. It outperforms OpenAI-v3-large and Cohere-v3-English by an average of 9.74% and 20.71%, respectively. Enabled by Matryoshka learning and quantization-aware training, voyage-3-large supports smaller dimensions and int8 and binary quantization that dramatically reduce vectorDB costs with minimal impact on retrieval quality.
https://blog.voyageai.com/2024/12/04/voyage-code-3/
- voyage-code-3: more accurate code retrieval with lower dimensional, quantized embeddings
- TL;DR – Introducing voyage-code-3, our next-generation embedding model optimized for code retrieval. It outperforms OpenAI-v3-large and CodeSage-large by an average of 13.80% and 16.81% on a suite of 32 code retrieval datasets, respectively. By supporting smaller dimensions with Matryoshka learning and quantized formats like int8 and binary, voyage-code-3 can also dramatically reduce storage and search costs with minimal impact on retrieval quality.
https://blog.voyageai.com/2024/11/12/voyage-multimodal-3/
- voyage-multimodal-3: all-in-one embedding model for interleaved text, images, and screenshots
- TL;DR — We are excited to announce voyage-multimodal-3, a new state-of-the-art for multimodal embeddings and a big step forward towards seamless RAG and semantic search for documents rich with both visuals and text. Unlike existing multimodal embedding models, voyage-multimodal-3 is capable of vectorizing interleaved texts + images and capturing key visual features from screenshots of PDFs, slides, tables, figures, and more, thereby eliminating the need for complex document parsing. voyage-multimodal-3 improves retrieval accuracy by an average of 19.63% over the next best-performing multimodal embedding model when evaluated across 3 multimodal retrieval tasks (20 total datasets).
https://blog.voyageai.com/2024/01/23/voyage-code-2-elevate-your-code-retrieval/
- voyage-code-2: Elevate Your Code Retrieval
- TL;DR – We are thrilled to introduce voyage-code-2, our latest embedding model specifically tailored for semantic retrieval of codes and related text data from both natural language and code queries. Our comprehensive evaluation, covering 11 code retrieval tasks (derived from popular coding datasets like HumanEval and MBPP), demonstrated a remarkable 14.52% improvement in recall compared to competitors, including OpenAI and Cohere. Additionally, we noted consistent gains, averaging 3.03%, across diverse general-purpose text datasets.
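
The voyage-3-large and voyage-code-3 posts above mention Matryoshka learning (smaller dimensions) and binary quantization as ways to cut storage and search cost. A minimal sketch of what those two operations look like on a plain vector, assuming the model was trained so that a prefix of the vector is still meaningful (this is not Voyage's implementation, int8 quantization is not shown, and the function names are hypothetical):

```python
import math


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]


def binarize(vec):
    """Binary quantization: one bit per dimension (1 if positive), packed into an int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits; a cheap stand-in for distance between binarized vectors."""
    return bin(a ^ b).count("1")


full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2]
short = truncate_and_normalize(full, 4)
print(short)
print(binarize(short), hamming(binarize(short), binarize([0.3, 0.4, -0.1, -0.9])))
```

Truncating an embedding from a model that was not trained this way generally loses much more retrieval quality, so the prefix trick should only be applied where the model's documentation says it is supported.
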

Vector Embedding Databases

See Also:
- Vector Databases/Search, Similarity Search, Clustering, etc (0xdevalias' gist - subsection)

Faiss

https://github.com/facebookresearch/faiss
- Faiss
- A library for efficient similarity search and clustering of dense vectors.
- Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Meta's Fundamental AI Research group.
- https://faiss.ai/

Chroma

https://github.com/chroma-core/chroma
- Chroma - the open-source embedding database.
- The fastest way to build Python or JavaScript LLM apps with memory!

Unsorted

https://github.com/waingram/code-embeddings
- A Comparative Study of Various Code Embeddings in Software Semantic Matching
- The ability to search code repositories for functionally equivalent code would be a tremendous benefit to software engineering. Code reuse is fundamental to software engineering, and open source code repositories have become rich sources of reusable code. In this study, we examine how machine learning techniques used in Natural Language Processing (NLP) for representing words and documents as vectors can be applied to representing code fragments in vector space. To do so, we amass a large corpus of programming tasks implemented in multiple programming languages. We then apply existing document embedding techniques to our corpus of code so that we can map each code fragment to a point in vector space and study to what extent these document embeddings are useful in capturing the semantics of software code. Finally we design and implement a code-matching application for locating functionally equivalent code fragments based on vector embeddings and use this application for evaluating the different embeddings.
- https://github.com/waingram/code-embeddings#sample-search-engine
  - Sample Search Engine
  - Proof of concept code search engine
  - https://github.com/waingram/code-embeddings/tree/master/app
    - Code Similarity Search Engine
    - As a proof of concept, we developed a search engine that makes use of our doc2vec models for searching a corpus of source code. The search engine takes a text block as input. The user may enter a Java or Python code fragment of arbitrary length. When submitted, the application will use the corresponding model (Java or Python) to infer a vector for the given code fragment. Then, the inferred vector is used to find the top-n most-similar vectors known from the training by calculating the cosine-similarity. The results are displayed to the user.
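
The lookup step described above (infer a vector, then return the top-n most similar known vectors by cosine similarity) can be sketched as follows. The vectors here are made-up toy values standing in for model output; no doc2vec model is run, and `top_n` is a hypothetical helper:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query, corpus, n=3):
    """Return the `n` (name, score) pairs from `corpus` most similar to `query`."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in corpus.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:n]]


# Toy 3-dimensional "embeddings" keyed by a fragment name.
corpus = {
    "fragment_a": [1.0, 0.0, 0.0],
    "fragment_b": [0.9, 0.1, 0.0],
    "fragment_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]
for name, score in top_n(query, corpus, n=2):
    print(name, round(score, 4))
```

This brute-force scan is linear in the corpus size; for anything ecosystem-scale, an approximate nearest-neighbour index (such as the vector databases listed above) would replace the loop.
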
https://platform.openai.com/docs/guides/embeddings
- Vector embeddings
- Learn how to turn text into numbers, unlocking use cases like search.
https://python.langchain.com/docs/integrations/text_embedding/
- Embedding models | Langchain
https://community.openai.com/t/github-embeddings-for-entire-github-code-repository/155253
- [GitHub] Embeddings for Entire GitHub Code Repository
https://github.com/orgs/community/discussions/52651
- [OpenAI] Better Code Search - Embeddings of Entire GitHub Repository

Code Search

GitHub Code Search

Public Code Search

https://github.com/search?type=code

Docs

https://docs.github.com/en/search-github/github-code-search/about-github-code-search
- About GitHub Code Search: You can search, navigate and understand code across GitHub with code search.
- https://docs.github.com/en/search-github/github-code-search/about-github-code-search#limitations
  - Limitations
    
    We have indexed many public repositories for code search, and continue to index more. Additionally, the private repositories of GitHub users are indexed and searchable by those that already have access to those private repositories on GitHub. However, very large repositories may not be indexed at this time, and not all code is indexed.
    
    The current limitations on indexed code are:
    - Vendored and generated code is excluded
    - Empty files and files over 350 KiB are excluded
    - Lines over 1,024 characters long are truncated
    - Binary files (PDF, etc.) are excluded
    - Only UTF-8 encoded files are included
    - Very large repositories may not be indexed
    - Exhaustive search is not supported
    - Files with more than one line over 4096 bytes are excluded

    We currently only support searching for code on the default branch of a repository. The query length is limited to 1000 characters.
    
    Results for any search with code search are restricted to 100 results (5 pages). Sorting is not supported for code search results at this time. This limitation only applies to searching code with the new code search and does not apply to other types of searches.
    
    If you use the path: qualifier for a file that's in multiple repositories with similar content, GitHub will only show a few of those files. If this happens, you can choose to expand by clicking Show identical files at the bottom of the page.
    
    Code search supports searching for symbol definitions in code, such as function or class definitions, using the symbol: qualifier. However, note that the symbol: qualifier only searches for definitions and not references, and not all symbol types or languages are fully supported yet.
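
These limits matter for this topic because minified bundles are often one or two enormous lines. A minimal sketch that applies a literal reading of the size-related limits quoted above to a file's bytes; how GitHub actually measures them is not stated in the docs excerpt, vendored/generated/binary detection is not covered, and `code_search_issues` is a hypothetical helper:

```python
MAX_FILE_BYTES = 350 * 1024   # "files over 350 KiB are excluded"
MAX_LINE_BYTES = 4096         # more than one line over this excludes the file
TRUNCATE_LINE_CHARS = 1024    # lines longer than this are truncated


def code_search_issues(data: bytes) -> list[str]:
    """List the documented indexing limits that `data` appears to hit."""
    if len(data) == 0:
        return ["excluded: empty file"]
    issues = []
    if len(data) > MAX_FILE_BYTES:
        issues.append("excluded: file over 350 KiB")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        issues.append("excluded: not UTF-8")
        return issues
    lines = text.splitlines()
    long_lines = sum(1 for line in lines if len(line.encode("utf-8")) > MAX_LINE_BYTES)
    if long_lines > 1:
        issues.append("excluded: more than one line over 4096 bytes")
    if any(len(line) > TRUNCATE_LINE_CHARS for line in lines):
        issues.append("truncated: at least one line over 1,024 characters")
    return issues


two_long_lines = b"a" * 5000 + b"\n" + b"b" * 5000
print(code_search_issues(two_long_lines))
print(code_search_issues(b"const x = 1;\n"))
```

Under this reading, a typical minified bundle is either truncated or excluded outright, which is one reason searching for minified code directly tends to work poorly and searching for extracted 'stable' fragments against unminified sources works better.
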
https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax
- Understanding GitHub Code Search syntax: You can build search queries for the results you want with specialized code qualifiers, regular expressions, and boolean operations.
- The search syntax in this article only applies to searching code with GitHub code search. Note that the syntax and qualifiers for searching for non-code content, such as issues, users, and discussions, is not the same as the syntax for code search. For more information on non-code search, see About searching on GitHub and Searching on GitHub.
  - https://docs.github.com/en/search-github/getting-started-with-searching-on-github/understanding-the-search-syntax
  - Understanding the search syntax: When searching GitHub, you can construct queries that match specific numbers and words.
  - Note: The syntax below applies to non-code search. For more information on code search syntax, see Understanding GitHub Code Search syntax.
- Search queries consist of search terms, comprising text you want to search for, and qualifiers, which narrow down the search.
- A bare term with no qualifiers will match either the content of a file or the file's path.
- You can enter multiple terms separated by whitespace to search for documents that satisfy both terms.
- Searching for multiple terms separated by whitespace is the equivalent to the search hello AND world. Other boolean operations, such as hello OR world, are also supported.
  - https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-boolean-operations
    - Using boolean operations
    - Code search supports boolean expressions. You can use the operators AND, OR, and NOT to combine search terms.
    - By default, adjacent terms separated by whitespace are equivalent to using the AND operator.
    - You can use parentheses to express more complicated boolean expressions.
- Code search also supports searching for an exact string, including whitespace.
  - https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#query-for-an-exact-match
    - Query for an exact match
- You can narrow your code search with specialized qualifiers, such as repo:, language: and path:
  - https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-qualifiers
    - Using qualifiers
      
      You can use specialized keywords to qualify your search.
      - Repository qualifier (repo:)
      - Organization (org:) and user (user:) qualifiers
      - Language qualifier (language:)
      - Path qualifier (path:)
      - Symbol qualifier (symbol:)
        
        You can search for symbol definitions in code, such as function or class definitions, using the symbol: qualifier. Symbol search is based on parsing your code using the open source Tree-sitter parser ecosystem, so no extra setup or build tool integration is required.
        
        In some languages, you can search for symbols using a prefix (e.g. a prefix of their class name). For example, for a method deleteRows on a struct Maint, you could search symbol:Maint.deleteRows if you are using Go, or symbol:Maint::deleteRows in Rust.
        
        You can also use regular expressions with the symbol qualifier.
        
        Note that this qualifier only searches for definitions and not references, and not all symbol types or languages are fully supported yet. Symbol extraction is supported for the following languages:
        
        - Bash
        - C
        - C#
        - C++
        - CodeQL
        - Elixir
        - Go
        - JSX
        - Java
        - JavaScript
        - Lua
        - PHP
        - Protocol Buffers
        - Python
        - R
        - Ruby
        - Rust
        - Scala
        - Starlark
        - Swift
        - TypeScript
        
        We are working on adding support for more languages. If you would like to help contribute to this effort, you can add support for your language in the open source Tree-sitter parser ecosystem, upon which symbol search is based.
      - Content qualifier (content:)
      - Is qualifier (is:)
        
        To filter based on repository properties, you can use the is: qualifier. is: supports the following values:
        
        - archived: restricts the search to archived repositories.
        - fork: restricts the search to forked repositories.
        - vendored: restricts the search to content detected as vendored.
        - generated: restricts the search to content detected as generated.
- You can also use regular expressions in your searches by surrounding the expression in slashes.
  - https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-regular-expressions
    - Using regular expressions
      
      Code search supports regular expressions to search for patterns in your code. You can use regular expressions in bare search terms as well as within many qualifiers, by surrounding the regex in slashes.
    - Most common regular expressions features work in code search. However, "look-around" assertions are not supported.
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#searching-for-quotes-and-backslashes
  - Searching for quotes and backslashes
  - To search for code containing a quotation mark, you can escape the quotation mark using a backslash.
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#separating-search-terms
  - Separating search terms
    
    All parts of a search, such as search terms, exact strings, regular expressions, qualifiers, parentheses, and the boolean keywords AND, OR, and NOT, must be separated from one another with spaces. The one exception is that items inside parentheses, ( ), don't need to be separated from the parentheses.
    
    If your search contains multiple components that aren't separated by spaces, or other text that does not follow the rules listed above, code search will try to guess what you mean. It often falls back on treating that component of your query as the exact text to search for.
  - If code search guesses wrong, you can always get the search you wanted by using quotes and spaces to make the meaning clear.
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#case-sensitivity
  - Case sensitivity
    
    By default, code search is case-insensitive, and results will include both uppercase and lowercase results. You can do case-sensitive searches by using a regular expression with case insensitivity turned off. For example, to search for the string "True", you would use:
    /(?-i)True/
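The syntax rules above (regexes in slashes, escaped quotes and backslashes, space-separated parts) can be sketched as a small query builder. This is only an illustration: the helper names are made up for these notes, and escaping forward slashes inside a regex is taken from GitHub's regex documentation rather than from the snippets quoted here.

```python
def regex_term(pattern: str, case_sensitive: bool = False) -> str:
    """Wrap a regular expression in slashes for GitHub code search.

    Assumes the pattern's forward slashes are NOT already escaped.
    """
    escaped = pattern.replace("/", r"\/")
    prefix = "(?-i)" if case_sensitive else ""
    return f"/{prefix}{escaped}/"


def exact_term(text: str) -> str:
    """Quote an exact string, escaping backslashes first and then double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(*parts: str) -> str:
    """Join terms, qualifiers, and boolean keywords with single spaces."""
    return " ".join(part.strip() for part in parts if part.strip())


query = build_query(
    regex_term("True", case_sensitive=True),
    "language:python",
    "NOT",
    "is:fork",
)
print(query)  # /(?-i)True/ language:python NOT is:fork
```

Keeping every component as its own argument avoids the "code search will try to guess what you mean" fallback, since nothing ends up glued together without a space.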
https://docs.github.com/en/search-github/searching-on-github/searching-code
- Searching code (legacy)
  
  You only need to use the legacy code search syntax if you are using the code search API.
- https://docs.github.com/en/rest/search/search#search-code
  - Search code
    
    Searches for query terms inside of a file. This method returns up to 100 results per page.
  - GET /search/code
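A minimal sketch of building a request URL for this endpoint, without sending it. The `api.github.com` host and the `q`, `per_page`, and `page` parameter names come from GitHub's REST documentation rather than the snippets above, so treat them as assumptions to verify; the query uses the legacy syntax, and authentication (which this endpoint typically expects) is left out.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"  # assumed public REST API host


def search_code_url(query: str, per_page: int = 100, page: int = 1) -> str:
    """Build the GET /search/code URL for a legacy-syntax code search query."""
    if not 1 <= per_page <= 100:
        # The endpoint returns up to 100 results per page.
        raise ValueError("per_page must be between 1 and 100")
    params = urlencode({"q": query, "per_page": per_page, "page": page})
    return f"{API_ROOT}/search/code?{params}"


url = search_code_url("addClass in:file language:js repo:jquery/jquery", per_page=50)
print(url)
```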

Blogs, YouTube, etc

https://www.youtube.com/watch?v=QCs76SC1ZZ0
- YouTube: The technology behind GitHub's new code search - Universe 2022
https://github.blog/engineering/architecture-optimization/the-technology-behind-githubs-new-code-search/
- The technology behind GitHub’s new code search (February 6, 2023)
  
  A look at what went into building the world’s largest public code search index.
- TODO: read through this and include more relevant snippets here
- https://news.ycombinator.com/item?id=34681223
  - I've worked alongside the CEO/CTO of Sourcegraph for the past 8 years, everyone else is at our company offsite so I figured I'd chime in :) nobody asked me to write this (nor did I ask) :)
    
    The article is a top-notch technical write-up, the devs on GitHub code search should be proud of what they've achieved so far!
    
    Honestly, we're rooting for GitHub to improve their code search, viewing them as a close peer, not a competitor. We also maintain OSS projects like Zoekt, which IIRC GitLab is maybe looking at using for their own. The more devs that 'get' code search, the better off Sourcegraph is frankly!
    
    GitHub has a nice intuitive/simple UX, we could learn a thing or two there (though, easier to do with less features.)
    
    Still, Sourcegraph search tech is quite a bit more powerful:
    - Searching over commit messages, diffs, filename, etc. are super nice for tracking down regressions / finding 'that PR I swear my coworker made'
    - Expressiveness like "find this regexp in repositories, but only if the repo has had a commit in the last month AND has a file named package.json in its root"
    - Since Steve Yegge joined us, we've started thinking about ranking of search results, a notoriously difficult thing to do well in code search unless you have great factors to rank on (e.g. a semantic understanding of code): https://about.sourcegraph.com/blog/new-search-ranking
    - We stream results back, so you can get a comprehensive set of results - not just a few pages, from our API.
    - Works in GitHub Enterprise, not just GitHub.com. Plus on all your code hosts, think BitBucket, GitLab, Azure DevOps, Gerrit, Phabricator, etc. and even non-Git VCS like Perforce.
    - Respects permissions of all your code hosts (a very difficult problem, as there are no official APIs to query this info from code hosts in general)
    Having code search is one thing, but using it is another:
    - Code Insights (we use search as an API to gather statistics about code, track code quality, keywords, etc. both over time and retroactively and let you build dashboards)
    - Batch changes (find+replace, but over thousands of repositories. Run a Docker container per repo, run your custom linter script etc. and then draft or send PRs to thousands of repos, manage/track campaigns with thousands of PRs like that over time, etc.)
    - Precise code intel / semantic awareness of code, we use SCIP indexers for this (spiritual successor to Microsoft's LSIF format for indexing LSP servers.)
    I am super happy GitHub continues to push their code search effort, and genuinely believe it's a great thing for all developers and us over at Sourcegraph. Also excited to see when they do their public rollout of this :)
    
    Anyway, that's just my take as someone who works there; other Sourcegraphers will chime in later if anything I said above feels off to them I'm sure :)
    - https://sourcegraph.com/blog/new-search-ranking
      - Rethinking search results ranking on Sourcegraph.com
      - Announcing Search Ranking and Relevance
        
        I’m thrilled to announce that Sourcegraph has launched PageRank-driven Code Search result rankings that prioritize relevance and showing reusable code. This launched today for searches on popular OSS repos on https://sourcegraph.com/ , and we are working to bring ranking to private Sourcegraph deployments soon.
      - Sourcegraph’s new search ranking uses a rendition of the Google PageRank algorithm on source code, powered by the code symbol graph from our sophisticated code intelligence platform (CIP).
      - Why is using PageRank for Code Search so revolutionary and effective? Let’s dig in.
      - For web pages, Google’s PageRank tracks which pages are pointed at (referenced) most often by other web pages. PageRank is a measure of how “cool” they are: Who’s pointing at them?
        
        For source code, the pointing hands are code usages: function calls, imports, that sort of thing. If there’s only one arm pointing at a smiley, that’s a code use. But if more than one arm is pointing in… that’s reuse! The big yellow smiley is being reused by more code than any other smiley in the diagram. The PageRank algorithm uncovered this fact.
        
        The implication here is that PageRank is a measure of code reuse. Which makes it an incredibly powerful ranking signal. Because when you’re doing a code search, you are almost always looking for code you can reuse.
      - TODO: read through this and include more relevant snippets here
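To make the "PageRank as a measure of code reuse" idea concrete, here is a textbook power-iteration PageRank run over a toy usage graph. This is not Sourcegraph's implementation: the symbol names, damping factor, and iteration count are all illustrative assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Textbook PageRank. `graph` maps each symbol to the symbols it uses."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / count for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this symbol's rank between everything it uses.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A symbol that uses nothing spreads its rank evenly.
                share = damping * rank[node] / count
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical usage graph: caller -> symbols it calls or imports.
usages = {
    "cli.main": ["http.get", "util.retry"],
    "sync.run": ["http.get", "util.retry"],
    "report.build": ["util.retry"],
    "http.get": ["util.retry"],
}
ranks = pagerank(usages)
most_reused = max(ranks, key=ranks.get)
print(most_reused)
```

In this toy graph `util.retry` is used by every other symbol, so it ends up with the highest rank, which is the "reuse" signal the post describes.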
https://github.blog/engineering/a-brief-history-of-code-search-at-github/
- A brief history of code search at GitHub (December 15, 2021)
  
  This blog post tells the story of why we built a new search engine optimized for code.
- We want to share more about our work on code exploration, navigation, search, and developer productivity. Recently, we substantially improved the precision of our code navigation for Python, and open-sourced the tools we developed for this. The stack graph formalism we developed will form the basis for precise code navigation support for more languages, and will even allow us to empower language communities to build and improve support for their own languages, similarly to how we accept contributions to github/linguist to expand GitHub’s syntax highlighting capabilities.
- TODO: read through this and include more relevant snippets here
https://github.blog/open-source/introducing-stack-graphs/
- Introducing stack graphs (December 9, 2021 | Updated July 23, 2024)
  
  Precise code navigation is powered by stack graphs, a new open source framework that lets you define the name binding rules for a programming language.
- Today, we announced the general availability of precise code navigation for all public and private Python repositories on GitHub.com. Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job. In this post, I’ll dig into how stack graphs work, and how they achieve these results.
- TODO: read through this and include more relevant snippets here
- https://dcreager.net/talks/stack-graphs/
  - Incremental, zero-config Code Navigation using stack graphs.
    
    Exploring a large or unfamiliar codebase can be tricky. Code Navigation features like “jump to definition” and “find all references” let you discover how different pieces of code relate to each other. To power these features, we need to extract lists of symbols from the code, and describe the language-specific rules for how those symbols relate to each other.
    
    It’s difficult to add Code Nav to a large hosted service like GitHub, where we must support hundreds of programming languages, hundreds of millions of repositories, and petabytes of history. At this scale, we have a different set of design constraints than a local IDE. We need our data extraction to be incremental, so that we can reuse previous results for files that haven’t changed in a newly pushed commit, saving both compute and storage costs. And to support cross-repo lookups, it should require zero configuration — repo owners should not have to set up anything manually to activate the feature.
    
    In this talk I’ll describe stack graphs, which use a graphical notation to define the name binding rules for a programming language. They work equally well for dynamic languages like Python and JavaScript, and for static languages like Go and Java. Our solution is fast — processing most commits within seconds of us receiving your push. It does not require setting up a CI job, or tapping into a project-specific build process. And it is open-source, building on the tree-sitter project’s existing ecosystem of language tools.
  - Presentation: https://www.youtube.com/watch?v=l2R1PTGcwrE
    - YouTube: "Incremental, zero-config Code Nav using stack graphs" by Douglas Creager
  - Slides: https://media.dcreager.net/dcreager-strange-loop-2021-slides.pdf
- https://arxiv.org/abs/2211.01224
  - Stack graphs: Name resolution at scale (2022)
  - We present stack graphs, an extension of Visser et al.'s scope graphs framework. Stack graphs power Precise Code Navigation at GitHub, allowing users to navigate name binding references both within and across repositories. Like scope graphs, stack graphs encode the name binding information about a program in a graph structure, in which paths represent valid name bindings. Resolving a reference to its definition is then implemented with a simple path-finding search.
    
    GitHub hosts millions of repositories, containing petabytes of total code, implemented in hundreds of different programming languages, and receiving thousands of pushes per minute. To support this scale, we ensure that the graph construction and path-finding judgments are file-incremental: for each source file, we create an isolated subgraph without any knowledge of, or visibility into, any other file in the program. This lets us eliminate the storage and compute costs of reanalyzing file versions that we have already seen. Since most commits change a small fraction of the files in a repository, this greatly amortizes the operational costs of indexing large, frequently changed repositories over time. To handle type-directed name lookups (which require "pausing" the current lookup to resolve another name), our name resolution algorithm maintains a stack of the currently paused (but still pending) lookups. Stack graphs can be constructed via a purely syntactic analysis of the program's source code, using a new declarative graph construction language. This means that we can extract name binding information for every repository without any per-package configuration, and without having to invoke an arbitrary, untrusted, package-specific build process.
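The file-incremental property described in the abstract can be illustrated with a content-addressed cache: each file is analysed in isolation, so a result can be reused whenever the same file content shows up again in a later commit. This is only a sketch of that caching idea, not of stack graphs themselves; the class, the toy `extract_definitions` analysis, and the use of SHA-256 as the cache key are assumptions made for illustration.

```python
import hashlib


class IncrementalIndex:
    """Caches per-file analysis results, keyed by a hash of the file content."""

    def __init__(self, analyze):
        self._analyze = analyze
        self._by_hash = {}
        self.analyses_run = 0

    def index_commit(self, files):
        """`files` maps path -> source text; returns path -> analysis result."""
        result = {}
        for path, source in files.items():
            key = hashlib.sha256(source.encode("utf-8")).hexdigest()
            if key not in self._by_hash:
                # Only file versions never seen before are analysed.
                self._by_hash[key] = self._analyze(source)
                self.analyses_run += 1
            result[path] = self._by_hash[key]
        return result


def extract_definitions(source):
    """Toy stand-in for a per-file analysis: names introduced with `def`."""
    names = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("def "):
            names.append(stripped[4:].split("(")[0].strip())
    return tuple(names)


index = IncrementalIndex(extract_definitions)
commit_1 = {"a.py": "def foo():\n    pass\n", "b.py": "def bar():\n    pass\n"}
commit_2 = {"a.py": commit_1["a.py"], "b.py": commit_1["b.py"] + "\ndef baz():\n    pass\n"}

index.index_commit(commit_1)
second = index.index_commit(commit_2)
print(index.analyses_run, second["b.py"])
```

Only `b.py` changes between the two commits, so the second call runs a single new analysis and reuses the cached result for `a.py`.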
https://github.blog/news-insights/product-news/precise-code-navigation-python-code-navigation-pull-requests/
- Precise code navigation for Python, and code navigation in pull requests (December 9, 2021 | Updated July 23, 2024)
  
  Code navigation is now available in PRs, and code navigation results for Python are now more precise.
- Over the coming months, we will add stack graph support for additional languages, allowing us to show precise code navigation results for them as well. Our stack-graphs library is open source and builds on the Tree-sitter ecosystem of parsers. We will also be publishing information on how language communities can self-serve stack graph support for their languages, should they wish to.
- If you would like to learn more about how stack graphs enable precise code navigation with zero configuration, check out our deep dive post and Strange Loop presentation.
- TODO: read through this and include more relevant snippets here

Sourcegraph

https://sourcegraph.com/
- Sourcegraph accelerates how software gets built, helping developers search, understand, and write code in complex codebases with AI
- Code Search
  
  Find and navigate code, make large-scale changes, and track insights across codebases of any size.
  - https://sourcegraph.com/contexts
    - Search code you care about with search contexts
      - https://sourcegraph.com/docs/code-search/working/search_contexts
        
        Search Contexts
        
        Search Contexts help you search the code you care about on Sourcegraph. A search context represents a set of repositories at specific revisions on a Sourcegraph instance that will be targeted by search queries by default.
        
        Every search on Sourcegraph uses a search context. Search contexts can be defined with the contexts selector shown in the search input, or entered directly in a search query.
  - https://sourcegraph.com/code-search
    - Code Search makes it easy to find code, make large-scale changes, and track insights across codebases of any scale and with any number of code hosts.
    - Efficiently reuse existing code. Find code across thousands of repositories and multiple code hosts in seconds.
    - Understand your code and its dependencies
      
      Onboard to codebases faster with cross-repository code navigation features like “Go to definition” and “Find references”.
      
      Complete code reviews, get up to speed on unfamiliar code, and determine the impact of code changes with the confidence of compiler-accurate code navigation.
      
      Determine root causes quickly with code navigation that tracks dependencies and references across repositories.
- https://sourcegraph.com/pricing
  - Free
    - $0 per month
    - AI editor extension for hobbyists or light usage
  - Enterprise Starter
    - $19 per user/month
    - AI & search experience for growing organizations hosted on our cloud
    - This seems to be the first tier that adds specialised search features (beyond what's available publicly anyway)
      - Integrated search results
      - Code Search Features
        
        Code Search
        
        Symbol Search
  - Enterprise
    - $59 per user/month
    - AI & search with enterprise-level security, scalability, and flexibility
    - Extra search features
      - Everything in Enterprise Starter, plus:
      - Code Search Features
        
        Batch Changes
        
        Code Insights
        
        Code Navigation

Public Code Search

https://sourcegraph.com/search
- Public Code Search

Docs

https://sourcegraph.com/docs
- Documentation
  
  Sourcegraph allows developers to rapidly search, write, and understand code by bringing insights from their entire codebase right into the editor.
- https://sourcegraph.com/docs/code-search
  - Code Search
  - Code Search allows you to find, fix, and navigate code with any code host or language across multiple repositories with real-time updates. It deeply understands your code, prioritizing the most relevant results for an enhanced search experience.
  - Sourcegraph's Code Search empowers you to:
    - Utilize regular expressions, boolean operations, and keyboard shortcuts to help you unleash the full potential of your searches
    - With the symbol, commit, and diff search capabilities, it identifies code vulnerabilities in milliseconds and quickly helps you resolve issues and incidents
    - Offers innovative code view with seamless code navigation for a comprehensive coding experience
  - https://sourcegraph.com/docs/code-search/features
    - Code Search Capabilities
      
      Learn and understand more about Sourcegraph's Code Search features and core functionality.
    - https://sourcegraph.com/docs/code-search/features#powerful-flexible-queries
    - https://sourcegraph.com/docs/code-search/features#symbol-search
      - Searching for symbols makes it easier to find specific functions, variables, and more. Use the type:symbol filter to search for symbol results. Symbol results also appear in typeahead suggestions, so you can jump directly to symbols by name. When on an indexed commit, it uses Zoekt. Otherwise it uses the symbols service.
        
        https://sourcegraph.com/docs/code-search/types/symbol
        
        We use Ctags to index the symbols of a repository on demand. These symbols are used to implement symbol search, matching declarations instead of plain text.
    - https://sourcegraph.com/docs/code-search/features#saved-searches
    - https://sourcegraph.com/docs/code-search/features#search-contexts
    - https://sourcegraph.com/docs/code-search/features#re2-regular-expressions
      - RE2 Regular Expressions
        
        The Sourcegraph search language supports RE2 syntax. If you're used to tools like Perl which uses PCRE syntax, you may notice that there are some features that are missing from RE2 like backreferences and lookarounds. We choose to use RE2 for a few reasons:
        
        It makes it possible to build worst-case linear evaluation engines, which is very desirable for building a production-ready regex search engine.
        
        It's well-supported in Go, allowing us to take advantage of a rich ecosystem (notably including Zoekt)
        
        Our API and tooling makes it straightforward to use Sourcegraph with other tools that provide facilities not built in to the search language.
      - https://github.com/google/re2
        
        RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
        
        https://github.com/google/re2/wiki/Syntax
    - https://sourcegraph.com/docs/code-search/features#search-experience
    - etc
  - https://sourcegraph.com/docs/code-search/queries
    - Search Query Syntax
      
      This page describes the query syntax for Code Search.
  - https://sourcegraph.com/docs/code-search/code-navigation
    - Code Navigation
      
      Learn how to navigate your code and understand its dependencies with high precision.
      
      Code Navigation helps you quickly understand your code, its dependencies, and symbols within the Sourcegraph file view while making it easier to move through your codebase.
    - https://sourcegraph.com/docs/code-search/code-navigation#code-navigation-types
      - Code Navigation types
        
        There are two types of Code Navigation that Sourcegraph supports:
        
        Search-based Code Navigation: Works out of the box with most popular programming languages, powered by Sourcegraph's code search. It uses a mix of text search and syntax-level heuristics (no language-level semantic information) for fast, performant searches across large code bases.
        
        Precise Code Navigation: Uses compile-time information to provide users with accurate cross-repository navigation experience across the entire code base.
        
        https://sourcegraph.com/docs/code-search/code-navigation/precise_code_navigation
        
        Precise Code Navigation
        
        Precise Code Navigation is an opt-in feature that is enabled from your admin settings and requires you to upload indexes for each repository to your Sourcegraph instance. Once setup is complete on Sourcegraph, precise code navigation is available for use across popular development tools.
        
        Sourcegraph automatically uses Precise Code Navigation whenever available, and Search-based Code Navigation is used as a fallback when precise navigation is not available.
        
        Precise code navigation relies on the open source SCIP Code Intelligence Protocol, which is a language-agnostic protocol for indexing source code.
        
        https://sourcegraph.com/docs/code-search/code-navigation/auto_indexing
        
        Auto-indexing
        
        With Sourcegraph deployments supporting executors, your repository contents can be automatically analyzed to produce a code graph index file. Once auto-indexing is enabled and auto-indexing policies are configured, repositories will be periodically cloned into an executor sandbox, analyzed, and the resulting index file will be uploaded back to the Sourcegraph instance.
        
        Auto-indexing is currently available for Go, TypeScript, JavaScript, Python, Ruby and JVM repositories. See also dependency navigation for instructions on how to setup cross-dependency navigation depending on what language ecosystem you use.
        
        https://sourcegraph.com/docs/code-search/code-navigation/writing_an_indexer#writing-an-indexer
        
        Indexers
        
        This page describes the process of writing an indexer and details all the recommended indexers that Sourcegraph currently supports.
        
        The following documentation describes the SCIP Code Intelligence Protocol and explains steps to write an indexer to emit SCIP.
        
        https://sourcegraph.com/docs/code-search/code-navigation/writing_an_indexer#sourcegraph-recommended-indexers
        
        Sourcegraph recommended indexers
        
        Language support is an ever-evolving feature of Sourcegraph. Some languages may be better supported than others due to demand or developer bandwidth/expertise. The following clarifies the status of the indexers which the Sourcegraph team can both recommend to customers and provide support for.
        
        https://sourcegraph.com/docs/code-search/code-navigation/writing_an_indexer#cross-repository-emits-monikers-for-cross-repository-support
        
        Cross repository: Emits monikers for cross-repository support
        
        The next milestone provides support for cross-repository definitions and references.
        
        The indexer can emit a valid index including import monikers for each symbol defined non-locally, and export monikers for each symbol importable by another repository. This index should be consumed without error by the latest Sourcegraph instance and Go to Definition and Find References should work on cross-repository symbols given that both repositories are indexed at the exact commit imported.
    - https://sourcegraph.com/docs/code-search/code-navigation/rockskip
      - Rockskip: fast symbol sidebar and search-based code navigation on monorepos
      - Rockskip is an alternative symbol indexing and query engine for the symbol service intended to improve performance of the symbol sidebar and search-based code navigation on big monorepos. It was added in Sourcegraph 3.38.
  - https://sourcegraph.com/docs/code-search/types/fuzzy
    - Fuzzy Finder
      
      Learn and understand about Sourcegraph's Fuzzy Search and core functionality.
      
      Use the fuzzy finder to quickly navigate to a repository, symbol, or file.
  - https://sourcegraph.com/docs/code-search/types/structural
    - Structural Search
    - Changed in version 5.3. Structural search is disabled by default. To enable it, ask your site administrator to set experimentalFeatures.structuralSearch = "enabled" in site configuration. Structural search has performance limitations and is not actively developed. We recommend using regex search or a combination of Search Jobs and custom scripts instead.
    - With structural search, you can match richer syntax patterns specifically in code and structured data formats like JSON. It can be awkward or difficult to match code blocks or nested expressions with regular expressions. To meet this challenge we've introduced a new and easier way to search code that operates more closely on a program's parse tree. We use Comby syntax for structural matching. Below you'll find examples and notes for this language-aware search functionality.
      - https://comby.dev/
        
        Comby is a tool for searching and changing code structure
        
        https://comby.dev/docs/overview
        
        Comby provides a lightweight way of matching syntactic structures of a program’s parse tree, like expressions and function blocks. Comby is language-aware and understands basic syntax of code, strings, and comment syntax in many languages.
        
        https://comby.dev/docs/syntax-reference
        
        Syntax Reference
  - https://sourcegraph.com/docs/code-search/types/search-jobs
    - Search Jobs
    - Use Search Jobs to search code at scale for large-scale organizations.
      
      Search Jobs allows you to run search queries across your organization's codebase (all repositories, branches, and revisions) at scale. It enhances the existing Sourcegraph's search capabilities, enabling you to run searches without query timeouts or incomplete results.
      
      With Search Jobs, you can start a search, let it run in the background, and then download the results from the Search Jobs UI when it's done.
  - https://sourcegraph.com/docs/code-search/working/snippets
    - Search Snippets
    - Every project and team has a different set of repositories they commonly work with and queries they perform regularly. Custom search snippets enable users and organizations to quickly filter existing search results with search fragments matching those use cases.
      
      A search snippet is any valid query. For example, a search snippet that defines all repositories in the "example" organization would be repo:^github\.com/example/. After adding this snippet to your settings, it would appear in the search snippet panel in the search sidebar under a label of your choosing (as of v3.29).
  - https://sourcegraph.com/docs/code-search/working/search_subexpressions
    - Search Subexpressions
    - Search subexpressions combine groups of filters like repo: and operators like or. Compared to basic examples, search subexpressions allow more sophisticated queries.
- https://sourcegraph.com/docs/api/graphql
  - Sourcegraph GraphQL API
    
    The Sourcegraph GraphQL API is a rich API that exposes data related to the code available on a Sourcegraph instance.
    
    The Sourcegraph GraphQL API supports the following types of queries:
    - Full-text and regexp code search
    - Rich git-level metadata, including commits, branches, blame information, and file tree data
    - Repository and user metadata
  - https://sourcegraph.com/docs/api/graphql#documentation
    - Sourcegraph's GraphQL API documentation is available on the API Docs page, as well as directly in the API console itself.
    - https://sourcegraph.com/docs/api/graphql/api-docs
      - Sourcegraph API
  - https://sourcegraph.com/docs/api/graphql#search
    - Search
      
      See additional documentation about search GraphQL API: https://sourcegraph.com/docs/api/graphql/search
  - https://sourcegraph.com/docs/api/graphql#using-the-api-via-the-sourcegraph-cli
    - Using the API via the Sourcegraph CLI
      
      A command line interface to Sourcegraph's API is available. Today, it is roughly the same as using the API via curl (see below), but it offers a few nice things:
      
      Allows you to easily compose queries from scripts, e.g. without worrying about escaping JSON input to curl properly.
      
      Reads your access token and Sourcegraph server endpoint from a config file (or env var).
      
      Pipe multi-line GraphQL queries into it easily.
      
      Get any API query written using the CLI as a curl command using the src api -get-curl flag.
      
      To learn more, see sourcegraph/src-cli
  - https://sourcegraph.com/docs/api/graphql#using-the-api-via-curl
    - Using the API via curl
      
      The entire API can be used via curl (or any HTTP library), just the same as any other GraphQL API.
- https://sourcegraph.com/docs/api/stream_api
  - Sourcegraph Stream API
    
    With the Stream API you can consume search results and related metadata as a stream of events. The Sourcegraph UI calls the Stream API for all interactive searches. Compared to our GraphQL API, it offers shorter times to first results and supports running exhaustive searches returning a large volume of results without putting pressure on the backend.
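As a rough sketch of "using the API via any HTTP library", the snippet below builds (but does not send) a GraphQL search request. The `/.api/graphql` path, the `token` authorization scheme, and the `search { results { matchCount } }` fields are recalled from Sourcegraph's API docs and schema rather than quoted above, so treat them as assumptions and check them in the API console. The token and search query are placeholders.

```python
import json
import urllib.request

# Assumed GraphQL document; verify field names against the API console.
SEARCH_QUERY = """
query ($query: String!) {
  search(query: $query) {
    results {
      matchCount
    }
  }
}
"""


def build_search_request(endpoint, token, search_query):
    """Build a POST request carrying a GraphQL search query as JSON."""
    body = json.dumps(
        {"query": SEARCH_QUERY, "variables": {"query": search_query}}
    ).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request(
    "https://sourcegraph.com/.api/graphql",  # assumed endpoint path
    "example-token",  # placeholder access token
    r"repo:^github\.com/sourcegraph/zoekt$ trigram",
)
print(request.get_method(), request.full_url)
```

Passing the search string as a GraphQL variable sidesteps the JSON and quote escaping problems the CLI notes mention. Sending the request would be a matter of `urllib.request.urlopen(request)` with a real token.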

Sourcegraph GitHub

https://github.com/sourcegraph

Main

https://github.com/sourcegraph/sourcegraph-public-snapshot
- Sourcegraph Code AI platform with Code Search & Cody
- Note
  
  Sourcegraph transitioned to a private monorepo. This repository, sourcegraph/sourcegraph-public-snapshot is a publicly available copy of the sourcegraph/sourcegraph repository as it was just before the migration.
- Tip
  
  If you are interested in working with the code, this commit is the last one made under an Apache License.
  - This commit was made on Jun 14, 2023
- Note: The latest commits seem to be from August 2024
- https://news.ycombinator.com/item?id=36584656
  - Sourcegraph is no longer open source
  - sqs on July 4, 2023
    
    Sourcegraph CEO here. Sourcegraph is now 2 separate products: code search and Cody (our code AI). Cody remains open source (Apache 2) in the client/cody* directories in the repository, and we're extracting that to a separate 100% OSS repository soon.
    
    Our licensing principle remains to charge companies while making tools for individual devs open source. Very few individual devs (or companies) used the limited-feature open-source variant of code search, so we decided to remove it. Usage of Sourcegraph code search was even more skewed toward our official non-OSS build than in other similar situations like Google Chrome vs. Chromium or VS Code vs. VSCodium. Maintaining 2 variants was a burden on our engineering team that had very little benefit for anyone.
    
    You can see more explanation at sourcegraph/sourcegraph-public-snapshot#53528 (comment) . The change was announced in the changelog and in a PR (all of our development occurs in public), and we will have a blog post this week after we separate our big monorepo into 2 repos as planned: the 100% OSS repo for Cody and the non-OSS repo for code search.
    
    You can still use Sourcegraph code search for free on public code at https://sourcegraph.com and on our self-hosted free tier on private code (which means individual devs can still run Sourcegraph code search 100% for free). Customers are not affected at all.
- https://github.com/sourcegraph/src-cli
  - Sourcegraph CLI
  - src is a command line interface to Sourcegraph:
    - Search & get results in your terminal
    - Search & get JSON for programmatic consumption
    - Make GraphQL API requests with auth easily & get JSON back fast
    - Execute batch changes
    - Manage & administrate repositories, users, and more
    - Easily convert src-CLI commands to equivalent curl commands, just add --get-curl!

Zoekt - Fast Code Search

https://github.com/sourcegraph/zoekt
- Zoekt: fast code search
- Fast trigram based code search
- Zoekt is a text search engine intended for use with source code. (Pronunciation: roughly as you would pronounce "zooked" in English)
- Note: This has been the maintained source for Zoekt since 2017, when it was forked from the original repository github.com/google/zoekt.
- Zoekt supports fast substring and regexp matching on source code, with a rich query language that includes boolean operators (and, or, not). It can search individual repositories, and search across many repositories in a large codebase. Zoekt ranks search results using a combination of code-related signals like whether the match is on a symbol. Because of its general design based on trigram indexing and syntactic parsing, it works well for a variety of programming languages.
  
  The two main ways to use the project are
  - Through individual commands, to index repositories and perform searches through Zoekt's query language
  - Or, through the indexserver and webserver, which support syncing repositories from a code host and searching them through a web UI or API
  For more details on Zoekt's design, see the docs directory.
- Note: It is also recommended to install Universal ctags, as symbol information is a key signal in ranking search results. See ctags.md for more information.
  - https://github.com/sourcegraph/zoekt/blob/main/doc/ctags.md
    - CTAGS
      
      Ctags generates indices of symbol definitions in source files. It started its life as part of the BSD Unix, but there are several more modern flavors. Zoekt supports universal-ctags.
- https://github.com/sourcegraph/zoekt/blob/main/doc/query_syntax.md
  - Zoekt Query Language Guide
  - This guide explains the Zoekt query language, used for searching text within Git repositories. Zoekt queries allow combining multiple filters and expressions using logical operators, negations, and grouping. Here's how to craft queries effectively.
- https://github.com/sourcegraph/zoekt-archived
  - Note: This is a Sourcegraph fork of github.com/google/zoekt. It contains some changes that do not make sense to upstream and/or have not yet been upstreamed.
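
To make the "trigram indexing" idea above a bit more concrete, here is a minimal sketch of a trigram inverted index for substring search. This is not Zoekt's actual implementation (which is written in Go and also handles regexps, ranking, sharding, etc); the `TrigramIndex` class and its method names are made up for illustration.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index from trigram to the ids of documents containing it."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, content: str) -> None:
        self.docs[doc_id] = content
        for gram in trigrams(content):
            self.postings[gram].add(doc_id)

    def search(self, needle: str) -> list[str]:
        """Return sorted ids of documents containing needle as a substring."""
        grams = trigrams(needle)
        if not grams:
            # Needles shorter than 3 characters cannot use the index; scan everything.
            candidates = set(self.docs)
        else:
            # Every trigram of the needle must appear in a matching document.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Trigram presence is necessary but not sufficient, so verify each candidate.
        return sorted(d for d in candidates if needle in self.docs[d])


index = TrigramIndex()
index.add("a.js", "function debounce(fn, wait) {}")
index.add("b.js", "function throttle(fn, wait) {}")
print(index.search("debounce"))
```

The final verification step matters: a document can contain every trigram of the needle without containing the needle itself, so the index only narrows down the candidates.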

SCIP - SCIP Code Intelligence Protocol

https://github.com/sourcegraph/scip.dev
- Future home of scip.dev
https://github.com/sourcegraph/scip
- SCIP Code Intelligence Protocol
- SCIP (pronunciation: "skip") is a language-agnostic protocol for indexing source code, which can be used to power code navigation functionality such as Go to definition, Find references, and Find implementations.
  
  This repository includes:
  - A Protobuf schema for SCIP.
  - Rich Go and Rust bindings for SCIP: These include many utility functions to help build tooling on top of SCIP.
  - Auto-generated bindings for TypeScript and Haskell.
  - The scip CLI, which makes SCIP indexes a breeze to work with.
  If you're interested in better understanding the motivation behind SCIP, check out the announcement blog post and the design doc.
  
  If you're interested in writing a new indexer that emits SCIP, check out our documentation on how to write an indexer. Also, check out the Debugging section in the Development docs.
  
  If you're interested in consuming SCIP data, you can either use one of the provided language bindings, or generate code for the SCIP Protobuf schema using the Protobuf toolchain for your language ecosystem. Also, check out the Debugging section in the Development docs.
https://github.com/sourcegraph/scip-typescript
- SCIP indexer for TypeScript and JavaScript
https://github.com/sourcegraph/scip-semantic
- scip-semantic
- various semantic and syntax based tools related to SCIP

LSIF (Legacy)

https://lsif.dev/
- A community-driven source of knowledge for Language Server Index Format implementations
- What is LSIF?
  
  The Language Server Index Format (LSIF, pronounced “else if”) is a standard format for language servers or other programming tools to emit their knowledge about a code workspace. This persisted information can later be used to answer LSP requests for the same workspace without running a language server.
- https://code.visualstudio.com/blogs/2019/02/19/lsif
  - The Language Server Index Format (LSIF) (February 19, 2019)
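
A rough sketch of the "answer LSP requests without running a language server" idea: a precomputed graph of vertices and edges is loaded from line-delimited JSON, and a go-to-definition lookup becomes a graph walk. The sample dump below is hand-written and only loosely modeled on LSIF (real dumps carry many more vertex/edge types and properties), and the helper names are hypothetical.

```python
import json

# Hand-written sample, loosely modeled on LSIF's line-delimited vertex/edge JSON.
SAMPLE_DUMP = """
{"id": 1, "type": "vertex", "label": "document", "uri": "file:///src/util.ts"}
{"id": 2, "type": "vertex", "label": "range", "start": {"line": 0, "character": 16}, "end": {"line": 0, "character": 19}}
{"id": 3, "type": "vertex", "label": "range", "start": {"line": 4, "character": 2}, "end": {"line": 4, "character": 5}}
{"id": 4, "type": "vertex", "label": "resultSet"}
{"id": 5, "type": "vertex", "label": "definitionResult"}
{"id": 6, "type": "edge", "label": "contains", "outV": 1, "inVs": [2, 3]}
{"id": 7, "type": "edge", "label": "next", "outV": 2, "inV": 4}
{"id": 8, "type": "edge", "label": "next", "outV": 3, "inV": 4}
{"id": 9, "type": "edge", "label": "textDocument/definition", "outV": 4, "inV": 5}
{"id": 10, "type": "edge", "label": "item", "outV": 5, "inVs": [2]}
"""


def load(dump):
    """Split a line-delimited dump into a vertex map and an edge list."""
    vertices, edges = {}, []
    for line in dump.strip().splitlines():
        element = json.loads(line)
        if element["type"] == "vertex":
            vertices[element["id"]] = element
        else:
            edges.append(element)
    return vertices, edges


def targets(edge):
    """Edges point at either one vertex (inV) or several (inVs)."""
    return edge["inVs"] if "inVs" in edge else [edge["inV"]]


def follow(edges, label, out_v):
    """Ids reachable from out_v over edges with the given label."""
    return [t for e in edges if e["label"] == label and e["outV"] == out_v
            for t in targets(e)]


def owner_uri(vertices, edges, range_id):
    """URI of the document that contains the given range."""
    for e in edges:
        if e["label"] == "contains" and range_id in targets(e):
            return vertices[e["outV"]]["uri"]
    return None


def find_definition(vertices, edges, uri, line, character):
    """Return (uri, start position) pairs for the definition at a position."""
    doc_id = next((v["id"] for v in vertices.values() if v.get("uri") == uri), None)
    if doc_id is None:
        return []
    for range_id in follow(edges, "contains", doc_id):
        r = vertices[range_id]
        start = (r["start"]["line"], r["start"]["character"])
        end = (r["end"]["line"], r["end"]["character"])
        if not (start <= (line, character) < end):
            continue
        for result_set in follow(edges, "next", range_id):
            for def_result in follow(edges, "textDocument/definition", result_set):
                return [(owner_uri(vertices, edges, t), vertices[t]["start"])
                        for t in follow(edges, "item", def_result)]
    return []


vertices, edges = load(SAMPLE_DUMP)
print(find_definition(vertices, edges, "file:///src/util.ts", 4, 3))
```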
https://github.com/sourcegraph/lsif-protocol
- LSIF protocol utilities for Go
- This repository contains LSIF protocol struct definitions.
- This project has been merged into github.com/sourcegraph/sourcegraph-public-snapshot
https://github.com/sourcegraph/lsif-semanticdb
- Language Server Index Format (LSIF) converter
- This project is now part of lsif-java
  
  Visit https://sourcegraph.github.io/lsif-java/docs/getting-started.html to install the lsif-java command-line tool. Run the following command to generate LSIF from SemanticDB.
https://github.com/sourcegraph/lsif-node
- Language Server Indexing Format (LSIF) generator for JavaScript and TypeScript
- Deprecated: TypeScript LSIF indexer
- This project is no longer maintained. Please use scip-typescript instead.
- https://github.com/sourcegraph/lsif-node-action
  - Sourcegraph TypeScript LSIF Indexer GitHub Action
  - This action generates LSIF data from TypeScript source code. See the LSIF TypeScript indexer for more details.
https://github.com/sourcegraph/lsif-upload-action
- Sourcegraph LSIF Uploader GitHub Action
- This action uploads generated LSIF data to a Sourcegraph instance.
https://github.com/sourcegraph/coif-to-lsif
- Converts CoIF to LSIF
- CoIF is not actively developed; you probably want to look at SCIP instead.
- CoIF (Code Index Format) is similar to LSIF, but simpler. It's intended to be a format that is easier for indexers to emit than LSIF. The CoIF to LSIF converter only needs to be written once, so it can save the indexer from needing to be aware of all the nuances of LSIF.

LSP - Language Server Protocol (Legacy)

https://github.com/sourcegraph/sourcegraph-typescript
- Language server for TypeScript/JavaScript
- Provides code intelligence for TypeScript
- This repository has been superseded by scip-typescript.
https://github.com/sourcegraph/lsp-client
- @sourcegraph/lsp-client
- Connects Sourcegraph extensions to language servers
https://github.com/sourcegraph/lsp-adapter
- lsp-adapter provides a proxy which adapts Sourcegraph LSP requests to vanilla LSP requests
- Code Intelligence on Sourcegraph is powered by the Language Server Protocol.
  
  Previously, language servers that were used on sourcegraph.com were additionally required to support our custom LSP files extensions. These extensions allowed language servers to operate without sharing a physical file system with the client. While it's preferable for language servers to implement these extensions for performance reasons, implementing this functionality is a large undertaking.
  
  lsp-adapter eliminates the need for this requirement, which allows off-the-shelf language servers to be able to provide basic functionality (hovers, local definitions) to Sourcegraph.
https://github.com/sourcegraph/javascript-typescript-langserver
- JavaScript and TypeScript code intelligence through the Language Server Protocol
- This project is no longer maintained
  
  This language server is an implementation of LSP using TypeScript's APIs. This approach made it difficult to keep up with new features of TypeScript and implied that the server always uses a bundled TypeScript version, instead of the local TypeScript in node_modules like using the official (non-LSP) tsserver allows.
  
  On top of that, over time we simplified our architecture for running language servers in the cloud at Sourcegraph which removed the necessity for this level of tight integration and control. Theia's TypeScript language server is a thinner wrapper around tsserver, which avoids these problems to some extent. Our latest approach of running a TypeScript language server in the cloud uses Theia's language server (and transitively tsserver) under the hood.
  
  However, since then our code intelligence evolved even further and is nowadays powered primarily by LSIF, the Language Server Index Format. LSIF is developed together with LSP and uses the same structures, but in a pre-computed serialization instead of an RPC protocol. This allows us to provide near-instant code intelligence for our tricky on-demand cloud code intelligence scenarios and hence we are focusing all of our efforts on LSIF indexers. All of this work is also open source of course and if you're curious you can read more about how we use LSIF on our blog.
  
  LSP is still the obvious choice for editor scenarios and everyone is welcome to fork this repository and pick up maintenance, although from what we learned we would recommend to build on Theia's approach (wrapping tsserver). We would also love to see and are looking forward to native LSP support for the official tsserver, which would eliminate the need for any wrappers.
https://github.com/sourcegraph/typescript-language-server
- TypeScript & JavaScript Language Server
- Forked from https://github.com/typescript-language-server/typescript-language-server

ctags (Legacy)

https://github.com/sourcegraph/go-ctags
- go-ctags: universal-ctags wrapper for easy access in Go
  
  Note: This library is meant only for Sourcegraph use.
  
  To improve type:symbol results in Sourcegraph, for languages with high quality Tree-sitter grammars, prefer adding support in scip-ctags in the Sourcegraph monorepo over adding support in this repo.
https://github.com/sourcegraph/ctags
- Forked from https://github.com/universal-ctags/ctags
- https://ctags.io/
  - https://github.com/universal-ctags/ctags
    - Universal Ctags (abbreviated as u-ctags) is a maintained implementation of ctags. ctags generates an index (or tag) file of language objects found in source files for programming languages. This index makes it easy for text editors and other tools to locate the indexed items.
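
For reference, a small sketch of reading the kind of index (tags) file that ctags produces, assuming the common extended format: tag name, file path, and an address ending in `;"`, all tab-separated, followed by optional tab-separated fields. The sample entries and the `parse_tag_line` helper are made up for illustration, and patterns containing literal tabs or `;"` are not handled.

```python
SAMPLE_TAGS = (
    '!_TAG_FILE_FORMAT\t2\t/extended format/\n'
    'debounce\tsrc/util.js\t/^function debounce(fn, wait) {$/;"\tf\tline:3\n'
    'VERSION\tsrc/index.js\t12;"\tC\n'
)


def parse_tag_line(line):
    """Parse one tags-file line into a dict; return None for pseudo-tags."""
    line = line.rstrip("\n")
    if not line or line.startswith("!_TAG_"):
        return None
    parts = line.split("\t", 2)
    if len(parts) < 3:
        return None
    name, path, rest = parts
    # The address ends at ;" and is followed by optional extension fields.
    address, _, extension = rest.partition(';"')
    fields = {}
    for field in extension.split("\t"):
        if not field:
            continue
        if ":" in field:
            key, value = field.split(":", 1)
            fields[key] = value
        else:
            # A bare field is the short form of the tag's kind.
            fields["kind"] = field
    return {"name": name, "path": path, "address": address, "fields": fields}


tags = [t for t in map(parse_tag_line, SAMPLE_TAGS.splitlines()) if t]
for tag in tags:
    print(tag["name"], tag["path"], tag["fields"])
```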

`srclib` / `jsg` (Legacy)

https://github.com/sourcegraph/jsg
- jsg: JavaScript grapher
- JavaScript grapher -- part of GraphKit, a collection of source analyzers for popular programming languages
- Moved to srclib-javascript (this repository is no longer a standalone project; submit patches to srclib-javascript)
https://srclib.org/
- srclib is a hackable, multi-language code analysis library for building better software tools.
  
  srclib makes developer tools like code search and static analyzers better. It supports things like jump to definition, find usages, type inference, and documentation generation.
  
  srclib consists of language analysis toolchains (currently for Go, Python, JavaScript, and Ruby) with a common output format, and developer tools that consume this format.
  
  srclib originated inside Sourcegraph, where it powers intelligent code search over hundreds of thousands of projects.
- https://github.com/sourcegraph/srclib
  - srclib is a polyglot code analysis library, built for hackability. It consists of language analysis toolchains (currently for Go and Java, with Python, JavaScript, and Ruby in beta) with a common output format, and a CLI tool for running the analysis.
- https://github.com/sourcegraph/srclib-javascript
  - JavaScript (node.js) toolchain for srclib
  - srclib-javascript is a srclib toolchain that performs JavaScript (Node.js) code analysis: type inference, documentation generation, jump-to-definition, dependency resolution, etc.
    
    It enables this functionality in any client application whose code analysis is powered by srclib, including Sourcegraph.
- https://github.com/sourcegraph/srclib-typescript
  - Sourcegraph support for typescript toolchain
  - srclib-typescript is a srclib toolchain that performs TypeScript code analysis: type inference, documentation generation, jump-to-definition, dependency resolution, etc. It enables this functionality in any client application whose code analysis is powered by srclib, including Sourcegraph.

Tree-sitter (Forks)

https://github.com/sourcegraph/go-tree-sitter
- forked from smacker/go-tree-sitter
- https://github.com/smacker/go-tree-sitter
  - Golang bindings for tree-sitter https://github.com/tree-sitter/tree-sitter
https://github.com/sourcegraph/tree-sitter-wasms
- forked from Gregoor/tree-sitter-wasms
- https://github.com/Gregoor/tree-sitter-wasms
  - tree-sitter-wasms
  - Prebuilt WASM binaries for tree-sitter's language parsers. Forked from https://github.com/Menci/tree-sitter-wasm-prebuilt because I wanted to use GitHub Actions to automate publishing.
https://github.com/sourcegraph/tree-sitter-typescript
- forked from tree-sitter/tree-sitter-typescript
- https://github.com/tree-sitter/tree-sitter-typescript
  - TypeScript grammar for tree-sitter

Golang Libs

https://github.com/sourcegraph/go-diff
- go-diff Unified diff parser and printer for Go
- Diff parser and printer for Go.
- It doesn't actually compute a diff. It only reads in (and prints out, given a Go struct representation) unified diff output
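
To illustrate what "reading in unified diff output" involves (as opposed to computing a diff), here is a minimal sketch that parses hunk headers and bodies from unified diff text. It is unrelated to go-diff's actual Go API; the `Hunk` dataclass and `parse_hunks` function are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# Matches hunk headers such as "@@ -1,3 +1,4 @@"; an omitted count means 1 line.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class Hunk:
    old_start: int
    old_lines: int
    new_start: int
    new_lines: int
    body: list = field(default_factory=list)


def parse_hunks(diff_text):
    """Collect the hunks of a unified diff, using header counts to find hunk ends."""
    hunks = []
    old_left = new_left = 0
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            marker = line[:1]
            if marker in (" ", ""):
                # Context line: present in both the old and the new file.
                old_left -= 1
                new_left -= 1
            elif marker == "-":
                old_left -= 1
            elif marker == "+":
                new_left -= 1
            else:
                # e.g. "\ No newline at end of file": not counted, not stored.
                continue
            hunks[-1].body.append(line)
            continue
        match = HUNK_HEADER.match(line)
        if match:
            old_start, old_lines, new_start, new_lines = match.groups()
            hunk = Hunk(int(old_start), int(old_lines or 1),
                        int(new_start), int(new_lines or 1))
            hunks.append(hunk)
            old_left, new_left = hunk.old_lines, hunk.new_lines
    return hunks
```

Tracking the remaining line counts from the header (rather than just looking at the leading character) avoids mistaking the `--- a/file` header of the next file for a removed line.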
https://github.com/sourcegraph/go-dep-parser
- Forked from aquasecurity/go-dep-parser
- https://github.com/aquasecurity/go-dep-parser
  - go-dep-parser Dependency Parser for Multiple Programming Languages
  - Note: Moved to the dependency package in Trivy
    - https://github.com/aquasecurity/trivy/tree/main/pkg/dependency
      - https://github.com/aquasecurity/trivy
        
        Trivy is a comprehensive and versatile security scanner. Trivy has scanners that look for security issues, and targets where it can find those issues.
        
        Find vulnerabilities, misconfigurations, secrets, SBOM in containers, Kubernetes, code repositories, clouds and more
https://github.com/sourcegraph/tiktoken-go
- forked from pkoukk/tiktoken-go
- https://github.com/pkoukk/tiktoken-go
  - tiktoken-go
    
    OpenAI's tiktoken in Go.
    
    Tiktoken is a fast BPE tokeniser for use with OpenAI's models.
    
    This is a port of the original tiktoken.

Unsorted

https://github.com/sourcegraph/awesome-code-ai
- Awesome-Code-AI
- A list of AI coding tools (assistants, completion, refactoring, etc.).
https://github.com/sourcegraph/codesearch.ai
- codesearch.ai (Archived)
- codesearch.ai is a semantic code search engine. It allows searching GitHub functions and StackOverflow answers using natural language queries. It uses HuggingFace Transformers under the hood, and the training procedure is inspired by a paper called Text and Code Embeddings by Contrastive Pre-Training from OpenAI. The CodeSearchNet project served as a basis for data collection and cleaning.
https://github.com/sourcegraph/whouses
- Who Uses (Archived)
- Find out what awesome you've started with Sourcegraph.
- Find out what projects are using your npm package
- https://whouses.netlify.app/

Vercel Grep.app

https://grep.app/
- Code search made fast
- Effortlessly search for code, files, and paths across a million GitHub repositories.
- https://grep.app/api/search?q=
https://vercel.com/blog/vercel-acquires-grep
- Vercel acquires Grep to accelerate code search
- Grep allows developers to quickly search code across over 500,000 public git repositories. With the acquisition, founder Dan Fox will also be joining Vercel’s AI team to continue building Grep to enhance code search for developers.

searchcode

https://searchcode.com/
- SearchCode
- Artisanal, small batch, handcrafted code search!
- Simple, comprehensive code search
- Helping you find real world examples of functions, APIs and libraries in 378+ languages across 10+ public code sources
- Filter down to one or many sources such as Bitbucket, CodePlex, Fedora Project, GitLab, Github, Gitorious, Google Android, Google Code, Minix3, Seek Quarry, Sourceforge, Tizen, codeberg, repo.or.cz, sr.ht or by 378+ languages.
- https://searchcode.com/about/
  - Team / Contact
    
    searchcode is currently the work of a single developer standing on the shoulders of giants.
    
    Feel free to contact me at [email protected] or via twitter @boyter or follow developments at https://boyter.org/
- https://searchcode.com/api/
  - searchcode API
  - Code Index
    
    Queries the code index and returns at most 100 results. All filters supported by searchcode are available. These include src (sources), lan (languages) and loc (lines of code). These work in the same way that the main page works. See the examples for how to use these.
  - Code Result
    
    Returns the raw data from a code file given the code id which can be found as the id in a code search result.
  - Related Results
    
    Given a searchcode unique code id, returns an array of results which are considered to be duplicates. The matching is slightly fuzzy, so that small differences between files are ignored.
  - etc
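
A hedged sketch of building a Code Index query using the `src`, `lan` and `loc` filters mentioned above. The endpoint path, the `q`/`p`/`per_page` parameter names, and the numeric filter ids used in the example are assumptions that are not stated in these notes, so check the searchcode API page before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint path; the notes above only name the filters, not the URL.
CODE_INDEX_URL = "https://searchcode.com/api/codesearch_I/"


def code_index_url(query, sources=(), languages=(), min_loc=None, page=0, per_page=100):
    """Build a Code Index query URL; per_page is capped at the documented 100 results."""
    params = [("q", query), ("p", page), ("per_page", min(per_page, 100))]
    # Repeating a filter is assumed to select several sources or languages at once.
    params += [("src", source) for source in sources]
    params += [("lan", language) for language in languages]
    if min_loc is not None:
        params.append(("loc", min_loc))
    return CODE_INDEX_URL + "?" + urlencode(params)


# The source and language ids here are placeholders, not verified values.
print(code_index_url("debounce", sources=[2], languages=[22], min_loc=10))
```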
https://searchcodeserver.com/
- searchcode server
- The best code search solution. Guaranteed. The code search solution for companies that build or maintain software who want to improve productivity and shorten development time by getting value from their existing source code.
- How searchcode server works.
  
  By indexing your source code it allows you to search over this code quickly, filtering down by repositories, languages and file owners to find what you were looking for. Own your data, searchcode server is not a SAAS or cloud product, download and install it on your own servers.
- https://searchcodeserver.com/pricing.html
  - Pricing for searchcode server
  - Requirements: A GNU/Linux/Windows/BSD machine running the Java 8 runtime. Everything else is configured out of the box for you.
    
    The community edition is free to use for as many users as you wish but you must leave the searchcode branding visible.
    
    All paid plans include a full downloadable version of searchcode server with the ability to change the icon and modify other look and feel elements. The software comes with a lifetime licence to install and use searchcode server internally on as many instances as you like. You can use any paid-for version in any manner you see fit, including public-facing websites. Finally, you will get direct emails letting you know when updates are available, with links to the update, for the length of the support period.
- https://github.com/boyter/searchcode-server/tree/master
  - searchcode server
  - searchcode server is a powerful code search engine with a sleek web user interface.
    
    searchcode server works in tandem with your source control system, indexing thousands of repositories and files allowing you and your developers to quickly find and reuse code across teams.

Ben E. C. Boyter's Blog

https://boyter.org/
- Ben E. C. Boyter's Blog
- https://boyter.org/about/

Shortlist:

https://boyter.org/posts/searchcode-bigger-sqlite-than-you/
- searchcode.com’s SQLite database is probably 6 terabytes bigger than yours (2025-02-16)
https://boyter.org/posts/how-i-built-my-own-index-for-searchcode/
- Building a custom code search index in Go for searchcode.com (2022-11-22)

Additional/Unsorted:

https://boyter.org/posts/searchcode.com-vibe-coding/
- Vibe coding searchcode a new UI and saving myself 40+ hours of work (2025-03-12) (1108 words)
https://boyter.org/posts/bloom-filters-sqlite/
- Bloom Filters and SQLite (2024-11-20) (421 words)
- https://github.com/boyter/bloom-sqlite
  - bloom-sqlite
- https://github.com/boyter/indexer
  - indexer
    
    Code for GopherConSyd 2023
    
    So please clone this, and start interacting!
    
    It's a small portion of the caisson index that powers searchcode.com with no dependencies.
https://boyter.org/posts/one-hundred-million-little-queries/
- One hundred million little queries (2024-04-23) (605 words)
https://boyter.org/posts/brute-force-text-search-optimizations/
- Brute force text search optimizations (2024-03-27) (854 words)
https://boyter.org/posts/codespelunker-details/
- Code Spelunker how it works (2023-06-06) (858 words)
https://boyter.org/posts/code-spelunker-a-code-search-command-line-tool/
- Code Spelunker a Code Search Command Line Tool (2023-06-05) (1107 words)
- https://github.com/boyter/cs
  - codespelunker (cs)
    
    A command line search tool. Allows you to search over code or text files in the current directory either on the console, via a TUI or HTTP server, using some boolean queries or regular expressions.
    
    Consider it a similar approach to using ripgrep, silver searcher or grep coupled with fzf but in a single tool.
https://boyter.org/posts/profiling-ngram-trigram-tokenization-in-go/
- Real World CPU profiling of ngram/trigram tokenization in Go to reduce index time in searchcode.com (2023-04-12) (554 words)
https://boyter.org/posts/search-index-implementations/
- Search index implementations (2022-06-26) (539 words)
- Trie: for example, https://github.com/typesense/typesense uses an Adaptive Radix Tree; see https://stackoverflow.com/questions/50127290/data-structure-for-fast-full-text-search
  - https://github.com/typesense/typesense
    - Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences
- Bit Signatures
  
  This is something I remember reading about years ago, and I found this link to prove I had not lost my mind: https://www.stavros.io/posts/bloom-filter-search-engine/ At the time I thought it was neat but not very practical… However, it then turns out that Bing has been using this technique over its entire web corpus: http://bitfunnel.org/ https://www.youtube.com/watch?v=1-Xoy5w5ydM
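
A minimal sketch of the bit signature idea, assuming one fixed-width Bloom-filter-style signature per document and a query that matches whenever all of its bits are set in a document's signature. This is neither the code from the linked post nor BitFunnel's design; the class name, signature width and hash scheme are arbitrary choices for illustration.

```python
import hashlib


class BitSignatureIndex:
    """One Bloom-filter-style bit signature per document."""

    def __init__(self, bits=256, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.signatures = {}

    def _positions(self, term):
        # Derive several bit positions per term by salting the hash with a seed.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def _signature(self, terms):
        signature = 0
        for term in terms:
            for position in self._positions(term):
                signature |= 1 << position
        return signature

    def add(self, doc_id, text):
        self.signatures[doc_id] = self._signature(text.lower().split())

    def candidates(self, query):
        """Ids of documents whose signature has every query bit set.

        False positives are possible and false negatives are not, so a real
        system would verify the candidates against the actual documents.
        """
        wanted = self._signature(query.lower().split())
        return sorted(doc_id for doc_id, signature in self.signatures.items()
                      if signature & wanted == wanted)


index = BitSignatureIndex()
index.add("post-1", "bloom filters are a space efficient probabilistic structure")
index.add("post-2", "trigram indexes speed up substring search")
print(index.candidates("bloom filters"))
```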
https://boyter.org/posts/bloom-filter/
- Bloom Filters - Much, much more than a space efficient hashmap! (2020-12-10) (2447 words)
https://boyter.org/posts/building-an-api-rate-limiter-in-go-for-searchcode/
- Building a API rate limiter in Go for searchcode (2020-05-04) (1327 words)
https://boyter.org/posts/searchcode-rebuilt-with-go/
- searchcode Rebuilt with Go (2020-04-22) (984 words)
- https://github.com/boyter/searchcode-server-highlighter
  - searchcode-server-highlighter
  - A very simple Go HTTP based Syntax highlighter. Run it, then post some code to the default port and it will return CSS + HTML syntax highlighted code.
https://boyter.org/posts/an-informal-survey-of-10-million-github-bitbucket-gitlab-projects/
- Processing 40 TB of code from ~10 million projects with a dedicated server and Go for $100 (2019-09-30) (13129 words)
https://boyter.org/posts/file-read-challange/
- Processing Large Files – Java, Go and 'hitting the wall' (2019-05-08) (2480 words)
https://boyter.org/2018/03/collection-favorite-optimization-posts-articles/
- Collection of my favorite optimization posts and articles (2018-03-08) (597 words)
https://boyter.org/2017/12/searchcode-plexus/
- searchcode plexus (2017-12-05) (1224 words)
https://boyter.org/2017/06/design-searchcode-server/
- Design for searchcode server (2017-06-27) (107 words)
https://boyter.org/2017/03/golang-solution-faster-equivalent-java-solution/
- Why is this GoLang solution faster than the equivalent Java Solution? (2017-03-30) (2146 words)
https://boyter.org/2017/01/repository-overview-searchcode-server/
- Repository overview now in searchcode server (2017-01-30) (363 words)
https://boyter.org/2016/08/searchcode-server-fair-source/
- searchcode server under fair source (2016-08-24) (185 words)
https://boyter.org/2016/08/syncing-stashbitbucket-searchcode-server/
- Syncing Stash/BitBucket with searchcode server (2016-08-04) (295 words)
https://boyter.org/2016/07/searchcode-com-architecture-migration-3-0/
- searchcode.com: The Architecture – migration 3.0 (2016-07-28) (1619 words)
https://boyter.org/2016/03/searchcode-server-released/
- searchcode server released (2016-03-31) (167 words)
https://boyter.org/2015/12/searchcode-server/
- searchcode server (2015-12-29) (235 words)
https://boyter.org/2015/10/searchcode-local/
- searchcode local (2015-10-30) (301 words)
https://boyter.org/2015/09/search/
- Go Forth and Search (2015-09-02) (247 words)
https://boyter.org/2015/07/searchcode-path-profitability/
- searchcode the path to profitability (2015-07-17) (338 words)
https://boyter.org/2015/07/searchcode-com-unit-integration-tested/
- How searchcode.com is Unit and Integration Tested (2015-07-01) (1216 words)
https://boyter.org/2015/03/updates-searchcode-com/
- Updates to searchcode.com (2015-03-18) (286 words)
https://boyter.org/2014/10/searchcode-com-100-free-software/
- Why searchcode.com isn't 100% free software (2014-10-10) (765 words)
https://boyter.org/2014/06/sphinx-searchcode/
- Sphinx and searchcode (2014-06-20) (631 words)
- http://sphinxsearch.com/
- http://sphinxsearch.com/blog/2014/06/19/sphinx-searches-code-at-searchcode-com/
https://boyter.org/2014/06/estimating-sphinx-search-ram-requirements/
- Estimating Sphinx Search RAM Requirements (2014-06-19) (117 words)
https://boyter.org/2014/06/searchcode/
- searchcode next (2014-06-16) (542 words)
https://boyter.org/2014/03/searchcode-screenshot/
- searchcode screenshot (2014-03-26) (117 words)
https://boyter.org/2014/02/searchcode-logo/
- New searchcode Logo (2014-02-10) (157 words)
https://boyter.org/2014/02/storing-tracking-managing-billions-tiny-files-file-system-nightmare/
- Why is storing, tracking and managing billions of tiny files directly on a file system a nightmare? (2014-02-06) (202 words)
https://boyter.org/2013/02/why-code-search-is-difficult/
- Why Code Search is Difficult (2013-02-28) (475 words)
https://boyter.org/2013/01/want-to-write-a-search-engine-have-some-links/
- Want to write a search engine? Have some links (2013-01-30) (635 words)
- https://github.com/gigablast/open-source-search-engine
  - open-source-search-engine An open source web and enterprise search engine and spider/crawler. As can be seen on http://www.gigablast.com/
https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-5/
- Code a Search Engine in PHP Part 5 (2013-01-10) (1344 words)
https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-4/
- Code a Search Engine in PHP Part 4 (2013-01-10) (1348 words)
https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-3/
- Code a Search Engine in PHP Part 3 (2013-01-10) (1891 words)
https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-2/
- Code a Search Engine in PHP Part 2 (2013-01-10) (2074 words)
https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/
- Code a Search Engine in PHP Part 1 (2013-01-10) (5454 words)
https://boyter.org/2012/11/building-a-search-engine-the-most-important-feature-you-can-add/
- Building a search engine? The most important feature you can add. (2012-11-15) (434 words)
- https://duckduckgo.com/bangs
  - What are bangs?
    
    Bangs are shortcuts that quickly take you to search results on other sites. For example, when you know you want to search on another site like Wikipedia or Amazon, our bangs get you there fastest. A search for !w filter bubble will take you directly to Wikipedia.
https://boyter.org/2012/07/billions-of-lines-of-code/
- Billions of lines of code (2012-07-16) (267 words)
https://boyter.org/2012/06/codesearch-api/
- Codesearch API (2012-06-26) (309 words)
https://boyter.org/2012/04/growing-index/
- Growing Index (2012-04-13) (216 words)
https://boyter.org/2012/04/performance/
- Performance (2012-04-12) (96 words)
https://boyter.org/2012/02/improving-the-index/
- Improving the Index (2012-02-29) (503 words)
https://boyter.org/2011/12/searchcode-now-supports-regex-code-search/
- searchcode now supports regex code search (2011-12-17) (284 words)
https://boyter.org/2011/10/google-killing-off-code-search/
- Google Killing off Code Search (2011-10-15) (186 words)
https://boyter.org/2011/06/vector-space-search-model-explained/
- Vector Space Search Model Explained (2011-06-28) (700 words)
- http://la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
  - Unfortunately this link now returns a 404, but these seem similar:
    - https://ondoc.logand.com/d/2697/pdf
      - Basic Vector Space Search Engine Theory (January 2, 2004)
    - https://www.researchgate.net/publication/289611753_A_Vector_Space_Model_Approach_for_Searching_and_Matching_Product_E-Catalogues
      - A Vector Space Model Approach for Searching and Matching Product E-Catalogues
https://boyter.org/2010/08/build-vector-space-search-engine-python/
- Building a Vector Space Indexing Engine in Python (2010-08-23) (1437 words)
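
As a quick reminder of what a vector space model boils down to, here is a minimal sketch that represents each text as a vector of raw term counts and ranks documents by cosine similarity to the query. It is not taken from the posts above (which may weight terms differently); the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent text as a sparse vector of lowercase term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return ids of documents sharing terms with the query, best match first."""
    query_vector = term_vector(query)
    scored = [(cosine_similarity(query_vector, term_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


documents = {
    "a": "parse json string",
    "b": "parse xml",
    "c": "render html",
}
print(rank("parse json", documents))
```

The same cosine comparison works for any vector representation, which is why it shows up again with code embeddings elsewhere in these notes.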
https://boyter.org/2008/09/data-mining/
- Data Mining (2008-09-22) (680 words)

Google Code Search

Note: I think this might only be for Google projects/similar(?)

https://developers.google.com/code-search
- Code Search
- You can search for specific files or code snippets by using the search box located at the top of the Code Search UI
- Start using this public code search tool for exploring code without downloading the source.
https://developers.google.com/code-search/user/getting-started
- Getting started with Code Search
- To get started, open the Code Search UI for your project:
  - https://cs.android.com/
    - Android Code Search
  - https://source.chromium.org/
    - Chromium Code Search
  - https://cs.opensource.google/?authuser=1
    - Google Open Source
https://developers.google.com/code-search/reference?authuser=1
- Syntax reference
- This page provides detailed information on the supported filters, operators, syntax options, and keyboard shortcuts for Code Search.

Programmable Search Engine

https://programmablesearchengine.google.com/controlpanel/all
https://programmablesearchengine.googleblog.com/
- Programmable Search Engine Blog
- The latest news, updates and tips from the Programmable Search Engine team
- The Custom Search Site Restricted JSON API endpoints will cease serving traffic on January 8, 2025.
  
  Beginning on January 8, 2025, all Custom Search Site Restricted JSON API customers must begin their transition to Google Cloud's Vertex AI Search to maintain access to their site search functionality.
  - https://developers.google.com/custom-search/v1/site_restricted_api
    - Custom Search Site Restricted JSON API
    - If your Programmable Search Engine is restricted to only searching specific sites (10 or fewer), you can use the Custom Search Site Restricted JSON API. This API is similar to the Custom Search JSON API except this version has no daily query limit. To use this version, confirm that you see 10 or fewer sites to search in the “Sites to Search” section of your Programmable Search Engine control panel, there are no global top level domain patterns, and that “Search the entire web” is set to OFF.
  - https://cloud.google.com/enterprise-search
    - Vertex AI Search
    - Vertex AI Search helps developers build secure, Google-quality search experiences for websites, intranet and RAG systems for generative AI agents and apps.

Unsorted

https://github.com/livegrep/livegrep
- Livegrep
- Livegrep is a tool, partially inspired by Google Code Search, for interactive regex search of ~gigabyte-scale source repositories. You can see a running instance at http://livegrep.com/.
- To run livegrep, you need to invoke both the codesearch backend index/search process, and the livegrep web interface.
- https://livegrep.com/search/linux
  - This only has a few example repositories indexed
https://gist.github.com/phillipalexander/9244143
- Source Code Search Engines
- NOTE: This list is almost entirely copy/pasted from THIS awesome article. I've made my own personal edits (adding some additional content) which is why I keep it here.
- A lot of the search engines listed here seem to not be a good match for what I want, or no longer exist, etc.
https://openhub.net/
- Discover, Track and Compare Open Source
- https://openhub.net/tools
- https://github.com/blackducksoftware/ohloh_api#open-hub-api-documentation

`npm` Package Ranking, Bundle Size, etc

npm Package Registry Data, Ranking, etc

As of 2025-05-28, the following is the most recent/canonical information about replicating the npm registry database entries:
- https://github.blog/changelog/2025-02-26-changes-and-deprecation-notice-for-npm-replication-apis/
  - Changes and deprecation notice for npm replication APIs
  - We are making changes to npm replication APIs to optimize performance and availability. As part of this update, certain endpoints will be deprecated as of Thursday, May 29, 2025.
    
    To facilitate a seamless transition, the new endpoints will be available starting Tuesday, March 18, 2025, operating in parallel with the existing endpoints. The existing endpoints will be fully deprecated on Thursday, May 29, 2025.
    
    During the transition period, you may access the new endpoints by including the npm-replication-opt-in header with the value true in your requests. This option will be available from Tuesday, March 18, 2025 until the deprecation date, after which only the new endpoints will be available. Effective Thursday, May 29, 2025, the header will be ignored, and all requests will be directed to the new endpoints by default.
  - How to migrate?
    
    To assist with migration, we have detailed documentation in our replication API migration community discussion, outlining alternative approaches for deprecated endpoints when available. This is the go-to place for questions and discussions.
    - https://github.com/orgs/community/discussions/152515
      - npm replication API changes and migration guide
      - Overview
        
        As part of our ongoing improvements, we are making changes to our replication API services. This is to ensure availability and performance of our feeds. Moving forward, only a limited set of API endpoints will be supported. Additionally, requests to skimdb.npmjs.com will now be redirected to replicate.npmjs.com via a 301 Moved Permanently response. All other endpoints will be deprecated and return a 404 Not Found response.
        
        To facilitate a seamless transition, the new endpoints will be available starting Tuesday, March 18, 2025 (12:00 UTC), operating in parallel with the existing endpoints. The existing endpoints will be fully deprecated on Thursday, May 29, 2025 (12:00 UTC). During the transition period, you may access the new endpoints by including the npm-replication-opt-in header with the value true in your requests (wherein they will be ignored thereafter)
      - Supported Endpoints
        
        Going forward, the following API endpoints will be supported (with some limitations):
        
        GET https://replicate.npmjs.com/registry/_changes
        
        GET/POST https://replicate.npmjs.com/registry/_all_docs
        
        HEAD/GET https://replicate.npmjs.com/
      - Changes and Limitations
        
        Each supported endpoint will now have a limited set of supported parameters. Requests using unsupported parameters will not function as expected.
      - https://github.com/orgs/community/discussions/152515#discussioncomment-12647262
        
        When building and maintaining a replica, I would expect to first paginate through the _all_docs endpoint using the startkey and limit parameters to get a bulk download. Then once the last page is reached I would switch to paginating through changes using the _changes endpoint using the since parameter.
        
        This approach worked great with the old endpoints. The old _all_docs endpoint takes the update_seq=true parameter that causes the output to include the seq as of when that page was generated. I would store the seq returned from the very first page of docs and use that as the since parameter to my first call to _changes to start getting updates that happened since I started the bulk pagination. But with the new endpoints I don't know how to determine the correct value to use for since in the first call to _changes.
        
        What is the recommended approach to do a bulk download and then transition to fetching updates? Is there a supported endpoint to check the current sequence value?
        
        I had a word with my engineering team, and they gave me a response on the feasible options. Can you help if this works for you?
        
        The latest sequence number can be found from the root endpoint:
        
        $ curl -H "npm-replication-opt-in: true" "https://replicate.npmjs.com/registry/"

        {"db_name":"registry","engine":"npm-replicate","doc_count":3501535,"update_seq":61159494}
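
The bulk-download-then-follow approach described in the discussion above could be sketched roughly as follows. This is a minimal sketch, not npm's documented client: it only uses the parameters named in the discussion (`startkey`, `limit`, `since`), assumes CouchDB-style JSON quoting of `startkey`, and assumes `_all_docs` responses carry a `rows` array whose entries have an `id`. The helper names and the injected `fetchJson` function are hypothetical.

```javascript
const REPLICATE_BASE = "https://replicate.npmjs.com/registry";

// Build the URL for one page of _all_docs (startkey is JSON-quoted, CouchDB style).
function allDocsPageUrl(startKey, limit) {
  const url = new URL(`${REPLICATE_BASE}/_all_docs`);
  url.searchParams.set("limit", String(limit));
  if (startKey !== null) {
    url.searchParams.set("startkey", JSON.stringify(startKey));
  }
  return url.toString();
}

// Build the URL for reading changes that happened after a given sequence number.
function changesUrl(since) {
  const url = new URL(`${REPLICATE_BASE}/_changes`);
  url.searchParams.set("since", String(since));
  return url.toString();
}

// startkey is inclusive, so every page after the first repeats the last row
// of the previous page; drop it and work out where the next page starts.
function consumePage(rows, startKey, limit) {
  const fresh = startKey === null ? rows : rows.filter((row) => row.id !== startKey);
  const nextStartKey = rows.length > 0 ? rows[rows.length - 1].id : startKey;
  return { ids: fresh.map((row) => row.id), nextStartKey, done: rows.length < limit };
}

// fetchJson is a hypothetical async function: (url) => parsed JSON body.
async function bulkThenFollow(fetchJson, limit = 1000) {
  if (limit < 2) throw new Error("limit must be at least 2 to make progress");
  // 1. Record the current sequence BEFORE the bulk download starts.
  const { update_seq: since } = await fetchJson(`${REPLICATE_BASE}/`);
  // 2. Page through _all_docs until a short page signals the end.
  //    (Assumes the server does not cap pages below the requested limit.)
  const ids = [];
  let startKey = null;
  let done = false;
  while (!done) {
    const page = await fetchJson(allDocsPageUrl(startKey, limit));
    const result = consumePage(page.rows, startKey, limit);
    ids.push(...result.ids);
    startKey = result.nextStartKey;
    done = result.done;
  }
  // 3. Hand back the first _changes URL, which replays everything since step 1.
  return { ids, firstChangesUrl: changesUrl(since) };
}
```

Recording `update_seq` before the bulk download means some changes may be replayed twice, which is harmless as long as applying a change is idempotent; recording it afterwards could miss updates made while paging.
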
https://stackoverflow.com/questions/28526255/resource-for-npms-most-downloaded-this-week-month
https://stackoverflow.com/questions/27233104/dist-tarball-urls-from-npm-couchdb-mirror-dont-resolve
- I'm playing with the internals of NPM, and I wanted to see what the raw database looks like. Through a bit of poking, it seems to be documents like this: http://isaacs.iriscouch.com/registry/less/ (isaacs.iriscouch.com seems to be the official downstream mirror). It lists dist tarballs like this: https://aws-west-3.fullfatdb.internal.npmjs.com/registry/less/less-1.7.0.tgz, only name resolution for aws-west-3.fullfatdb.internal.npmjs.com fails.
  
  Why aren't the URLs for the dist tarballs working, and where can I find working ones?
  - https://blog.npmjs.org/post/75707294465/new-npm-registry-architecture
    - New npm Registry Architecture
      
      This blog post describes some recent changes to the way that The npm Registry works, and can be relevant to you if you’re replicating from the registry CouchDB today.
    - tl;dr
      
      npm, Inc., is now sponsoring the public npm registry. The isaacs.iriscouch.com CouchDB is a downstream mirror.
      
      If you change nothing, everything still works. Your replications might be a few seconds or minutes behind the official database of record.
      
      To shorten this delay, and also benefit from greater data consistency, you can replicate from https://fullfatdb.npmjs.com/registry instead. The AU and EU mirrors are already pulling out of FullfatDB. You probably should just create a new database that replicates from FullfatDB if you already have been pulling from Iris Couch in the past, since it has a lot less garbage.
      
      To replicate the data without the attachments, point your replicator at https://skimdb.npmjs.com/registry. If you do this, then tarballs will be fetched from the public URLs.
      - From a very quick/naive skim through this; these registries seem to no longer exist: isaacs.iriscouch.com, fullfatdb.npmjs.com (and its AU/EU mirrors); whereas this one still seems to respond: skimdb.npmjs.com
- When I searched again, this seems to be the recommended repository to use if you don't want packages as attachments: https://skimdb.npmjs.com/registry/
https://docs.npmjs.com/policies/crawlers
- Crawler policy
- npm's full public dataset is available via the public registry. Using CouchDB replication, you can get a full copy of all metadata, and it is acceptable within our terms of use to download copies of tarballs for inspection or experimentation.
  
  npm's website also has package metadata available. We allow this content to be indexed by commercial crawlers such as GoogleBot. At our discretion, we also allow experimental crawlers to access the site, as long as they keep their request velocity to 1 request per second or less. At that velocity, indexing all packages would take 3 days, so if you want a full copy of our metadata it is always going to be faster to access the data via replication, which takes only an hour or two to provide full data and will thereafter automatically stay in sync.
  
  If you do not wish to install CouchDB to manage replication, we provide open source software that makes it easy to sync to the registry's public feed.
  
  If you attempt to access package metadata by high-velocity crawling of the npm website, we reserve the right to rate-limit or ban your IP, user-agent or both.
  - https://github.com/npm/concurrent-couch-follower
    - a couch follower wrapper that you can use to be sure you don't miss any documents even if you process them asynchronously.

https://docs.npmjs.com/cli/v11/using-npm/registry

npm is configured to use the npm public registry at https://registry.npmjs.org by default.

The npm public registry is powered by a CouchDB database, of which there is a public mirror at https://skimdb.npmjs.com/registry

https://skimdb.npmjs.com/registry

{
  db_name: "registry",
  engine: "couch_bt_engine",
  doc_count: 3520565,
  doc_del_count: 1978731,
  update_seq: 41541438,
  purge_seq: 664829,
  compact_running: false,
  sizes: {
    active: 63674972313,
    external: 189235790386,
    file: 64038371590
  },
  disk_size: 64038371590,
  data_size: 63674972313,
  other: {
    data_size: 189235790386
  },
  instance_start_time: "1744921649119335",
  disk_format_version: 7,
  committed_update_seq: 41541438,
  compacted_seq: 41539692,
  uuid: "e03bbc377a13aa48a8fb748146cbe7e3"
}

Summarised by ChatGPT:

📊 Document Stats:

| Stat | Value | Meaning |
| --- | --- | --- |
| doc_count | 3,520,565 | Current number of active (non-deleted) documents. |
| doc_del_count | 1,978,731 | Documents that have been deleted, but whose tombstones are still retained. |
| update_seq | 41,541,438 | Number of changes made to the DB (adds/updates/deletes). |
| purge_seq | 664,829 | Number of permanently purged revisions; purges fully erase history (unlike deletes). |

💾 Size Metrics:

| Metric | Value | Description |
| --- | --- | --- |
| sizes.active | 63.67 GB | Actual active data in use. |
| sizes.external | 189.24 GB | Raw data size as stored externally (JSON size without overhead, compression, etc). |
| sizes.file | 64.04 GB | Size of the underlying DB file on disk. |
| disk_size | 64.04 GB | Total physical size on disk (same as sizes.file). |
| data_size | 63.67 GB | Compressed data currently in use (matches sizes.active). |

🔍 The external size is much larger than active or file — this suggests that the data being stored (e.g., package metadata?) has a lot of bloat or redundancy that's being compacted/compressed in CouchDB's storage engine.
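
As a rough check of the figures in that summary, the arithmetic can be reproduced from the raw byte counts in the database info document above. This is only an illustrative sketch: the helper names are made up here, and decimal gigabytes (1 GB = 1e9 bytes) are assumed because that is what matches the summarised values.

```javascript
// Byte counts copied from the `sizes` object in the database info shown above.
const sizes = { active: 63674972313, external: 189235790386, file: 64038371590 };

// Convert a byte count to decimal gigabytes with two decimal places.
function toGB(bytes) {
  return (bytes / 1e9).toFixed(2);
}

// How many bytes of uncompressed JSON each on-disk byte represents.
function compressionRatio({ external, file }) {
  return external / file;
}

// Fraction of the file that is not live data (roughly what compaction could reclaim).
function overheadFraction({ active, file }) {
  return (file - active) / file;
}

console.log(`active:   ${toGB(sizes.active)} GB`);
console.log(`external: ${toGB(sizes.external)} GB`);
console.log(`file:     ${toGB(sizes.file)} GB`);
console.log(`external/file ratio: ${compressionRatio(sizes).toFixed(2)}`);
console.log(`overhead: ${(overheadFraction(sizes) * 100).toFixed(2)}%`);
```

The external-to-file ratio comes out just under 3, which is the "much larger" gap the summary refers to, while the small overhead fraction suggests the file had been compacted recently.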

https://registry.npmjs.org/
https://github.com/npm/public-api
- npm's public APIs are haphazard, old-fashioned, and scattered. We can and will do better. An internal API was created to handle the needs of www, but needs some work before it can be publicly released.
- This repository is deprecated. If you are interested in filing an issue about npm's public registry API, please file over at the npm/registry repo. You can also find documentation over there!
  - https://github.com/npm/registry
    - npm registry documentation
    - A collection of archived documentation about registry endpoints/API.
    - https://github.com/npm/registry/tree/main/docs
      - https://github.com/npm/registry/blob/main/docs/REGISTRY-API.md
        
        Public Registry API
      - https://github.com/npm/registry/blob/main/docs/REPLICATE-API.md
        
        Replication API
        
        https://github.com/npm/registry/blob/main/docs/REPLICATE-API.md#the-follower-pattern
        
        The Follower Pattern
        
        The primary pattern of using these services is to build a follower. If you'd rather just jump into building something with the registry data, head on over to this tutorial to get started!
        
        https://github.com/npm/registry/blob/main/docs/follower.md / https://github.com/npm/registry-follower-tutorial
        
        This tutorial will teach you how to write a generic boilerplate NodeJS application that can manipulate, respond to, broadcast, analyze, and otherwise play with package metadata as it changes in the npm registry.
        
        Wait...what? Why?
        
        Here's the deal: do you want to have some fun with the package.json data from every version of every package in the npm registry? Some neat ideas:
        
        Find all the package READMEs that mention dogs
        
        Discover how many package authors are named "Kate"
        
        Calculate how many dependency changes occur on average in a major version bump
        
        And more! So stop waiting and write a follower!
      - https://github.com/npm/registry/blob/main/docs/COUCHDB.md
      - https://github.com/npm/registry/blob/main/docs/download-counts.md
        
        package download counts

        There is a public api that gives you download counts by package and time range.
        
        Our blog has an explanation of how npm download counts work, including "what counts as a download?"
        
        https://blog.npmjs.org/post/92574016600/numeric-precision-matters-how-npm-download-counts-work.html
        
        numeric precision matters: how npm download counts work
        
        npm's raw log data is continuously written to a series of buckets on AWS S3. Once per day, soon after UTC midnight, a map-reduce cluster is spun up that crunches the previous day's logs and pushes them into the database.
        
        https://github.com/npm/registry/blob/main/docs/download-counts.md#point-values
        
        Point values
        
        Gets the total downloads for a given period, for all packages or a specific package.
        
        GET https://api.npmjs.org/downloads/point/{period}[/{package}]
        
        etc
        
        https://github.com/npm/registry/blob/main/docs/download-counts.md#bulk-queries
        
        Bulk Queries

        To perform a bulk query, you can hit the range or point endpoints with a comma separated list of packages rather than a single package
        
        https://github.com/npm/registry/blob/main/docs/download-counts.md#limits
        
        Limits

        Bulk queries are limited to at most 128 packages at a time and at most 365 days of data.
        
        All other queries are limited to at most 18 months of data. The earliest date for which data will be returned is January 10, 2015.
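
To make the point-value endpoint and the bulk-query limit above concrete, here is a minimal sketch of building request URLs and splitting a long package list into batches of at most 128. The function names are hypothetical, and the `last-week` period keyword used in the usage note comes from npm's download-counts documentation rather than from the excerpt above.

```javascript
const DOWNLOADS_BASE = "https://api.npmjs.org/downloads";
const MAX_BULK_PACKAGES = 128; // bulk query limit quoted above

// Build a point-value URL; with no packages it asks for the registry-wide total.
function pointUrl(period, packages = []) {
  const suffix = packages.length > 0 ? `/${packages.join(",")}` : "";
  return `${DOWNLOADS_BASE}/point/${period}${suffix}`;
}

// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One URL per batch, so that no single bulk query exceeds the package limit.
function bulkPointUrls(period, packages) {
  return chunk(packages, MAX_BULK_PACKAGES).map((batch) => pointUrl(period, batch));
}
```

For example, `pointUrl("last-week", ["express", "lodash"])` yields `https://api.npmjs.org/downloads/point/last-week/express,lodash`, and a list of 300 names passed to `bulkPointUrls` produces three URLs.
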
https://www.jsdelivr.com/
- A free CDN for open source projects
- Optimized for JS and ESM delivery from npm and GitHub. Works with all web formats.
- https://www.jsdelivr.com/docs/data.jsdelivr.com
  - jsDelivr API
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/packages/npm/-package-@-version-
    - Get version metadata
    - Returns the default file and a list of all files in this version. An error is returned if the package size exceeds 100 MB.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/packages/npm/-package-@-version-/entrypoints
    - Get version entry points
    - Returns the recommended files to use from this package based on package metadata and additional heuristics. The response includes one file of each supported type (js, css), if available. The output may change over time as our algorithm improves.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#tag--Stats
    - Stats
    - Provides a wide range of usage statistics of jsDelivr. Most data are available with a two days delay, monthly and yearly summaries are available with a four days delay. Please note that different categories of data have different historical availability. The List stat periods endpoint provides information about which data are available for which time periods.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/stats/periods
    - List stats periods
    - Returns a list of all periods for which some stats are available in descending order.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/stats/packages
    - List top packages
    - Returns the most popular packages and their stats totals for the selected period. More detailed stats can be accessed via the provided links.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/stats/packages/npm/-package-
    - Get package stats
    - Returns daily usage stats for the package. Stats for specific versions can be accessed via the provided link.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/stats/packages/npm/-package-/versions
    - List top package versions
    - Returns daily usage stats for the most popular package versions. Stats for the individual version files can be accessed via the provided link.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/stats/packages/npm/-package-@-version-
    - Get package version stats
    - Returns daily usage stats for the specified package version. Stats for the individual version files can be accessed via the provided link.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/stats/packages/npm/-package-@-version-/files
    - List top package version files
    - Returns daily usage stats for the most popular package version files.
  - https://www.jsdelivr.com/docs/data.jsdelivr.com#get-/v1/lookup/hash/-hash-
    - Get file metadata from file hash
    - Allows a reverse lookup of a file at the CDN by its hash. Works only for files which were accessed at least once. If there are multiple files with the same hash, only the one which was accessed first via the CDN is returned.
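
The endpoint paths listed above can be turned into concrete request URLs with a few small helpers. This is a minimal sketch assuming the `https://data.jsdelivr.com/v1` base URL noted further down in these notes; the function names are made up for illustration, and scoped package names are passed through unencoded.

```javascript
const JSDELIVR_API = "https://data.jsdelivr.com/v1";

// GET /v1/packages/npm/{package}@{version}: default file plus the file list.
function versionMetadataUrl(name, version) {
  return `${JSDELIVR_API}/packages/npm/${name}@${version}`;
}

// GET /v1/packages/npm/{package}@{version}/entrypoints: recommended js/css files.
function entrypointsUrl(name, version) {
  return `${versionMetadataUrl(name, version)}/entrypoints`;
}

// GET /v1/stats/packages/npm/{package}: daily usage stats for the package.
function packageStatsUrl(name) {
  return `${JSDELIVR_API}/stats/packages/npm/${name}`;
}

// GET /v1/stats/packages/npm/{package}@{version}/files: top files of one version.
function versionFileStatsUrl(name, version) {
  return `${JSDELIVR_API}/stats/packages/npm/${name}@${version}/files`;
}

console.log(entrypointsUrl("jquery", "3.7.1"));
```
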
- https://github.com/jsdelivr
  - jsDelivr is a Global CDN for Javascript and ES Modules
  - Check our main projects:
    - 🌎 jsDelivr Global CDN - Learn more about how our CDN works and how to enable advanced functionality like minification, ES modules, version aliasing and more. Our CDN serves more than 160 billion requests per month!
    - 🔍 API for jsDelivr, NPM and GitHub - Use our unique API that will allow you to better interact with any NPM and GitHub project out there. Get entrypoints, versions, CDN URLs and detailed download stats per file and version.
    - https://github.com/jsdelivr/data.jsdelivr.com
      - The public jsDelivr API. Get npm packages, files, versions, entry points, as well as their CDN URLs and download stats.
      - https://data.jsdelivr.com/v1
    - jsdelivr/data.jsdelivr.com#6
      - Searching jsdelivr.com for a package
      - there's a separate API for that, powered by Algolia.
        
        https://github.com/algolia/npm-search
        
        npm-search
        
        npm ↔️ Algolia replication tool
        
        This is a failure resilient npm registry to Algolia index replication process. It will replicate all npm packages to an Algolia index and keep it up to date. The state of the replication is saved in Algolia index settings.
        
        https://github.com/algolia/npm-search#how-does-it-work
        
        How does it work?
        
        When the process starts with seq=0:
        
        save the current sequence of the npm registry in the state (Algolia settings)
        
        bootstrap the initial index content by using /_all_docs
        
        replicate registry changes since the current sequence
        
        watch for registry changes continuously and replicate them
        
        https://replicate.npmjs.com/
        
        http://docs.couchdb.org/en/2.0.0/api/database/bulk-api.html
    - jsdelivr/data.jsdelivr.com#68
      - Wrong or insufficient result on API get /v1/lookup/hash/{hash}
      - When calling the API-Endpoint /v1/lookup/hash/{hash} with the filehash of jquery 3.7.1 i get back a wrong result or at least not the npm result.
      - jsdelivr/data.jsdelivr.com#68 (comment)
        
        we'll consider possible improvements here, but the current behavior matches the documentation:
        
        Allows a reverse lookup of a file at the CDN by its hash. Works only for files which were accessed at least once. If there are multiple files with the same hash, only the one which was accessed first via the CDN is returned.
      - jsdelivr/data.jsdelivr.com#68 (comment)
        
        We could maybe add an option to list all packages that have the file instead of returning just one. But then you'd need to somehow select the "right" one, which might still be hard (there are 20+ matches in this case).
      - jsdelivr/data.jsdelivr.com#68 (comment)
        
        okay i think i found a solution for my usecase if you could provide an endpoint with the full list. I can filter for npm entries. After that i just have to lookup via npm the release timestamp of that explicit version and take the oldest one.
      - jsdelivr/data.jsdelivr.com#68 (comment)
        
        I'll take a look at this when I get some time.
      - jsdelivr/data.jsdelivr.com#68 (comment)
        
        We could maybe add an option to list all packages that have the file instead of returning just one.
        
        @MartinKolarik Personally I would find that useful/interesting; more so than just getting a single match for whatever the first source that happened to be accessed was.
        
        It would also be potentially useful to be able to filter those down with a type param (similar to what other API endpoints have that let me specify npm / gh / etc)
        
        I'm not sure how much this would complicate things, but maybe that could also be sorted by some kind of 'popularity' measure like download stats/etc.
        
        For the example use case, I could probably guess that the intended version was the main npm version; but for other libraries I might not be as easily able to identify what the 'main canonical source' might be for that file; which is where I might be able to use the 'popularity' to help narrow it down (eg. if most of those results have 100/1000's of downloads, and then the main one has 1,000,000's of downloads; I could make a solid guess)
        
        A more complicated idea (that I'm not even sure if it would be viable), but that maybe could help figure out the 'canonical' version better, might be to:
        
        lookup the hash and get a list of matching projects that have that file
        
        for each of those projects, check the package.json / similar to see if this file is included in the main exports for that package
    - jsdelivr/data.jsdelivr.com#69
      - Allow base64 encoded sha256 hash to be looked up on /v1/lookup/hash/{hash}
      - Currently the lookup API endpoint allows specifying a hex-encoded sha256 hash to be looked up
      - But looking at some of the other API endpoints, in their responses they provide base64 encoded sha256 hashes
      - While I could obviously write some glue code to convert this; it would be nice if we were able to provide the base64 encoded version directly; with the API either automatically detecting the hash type, or even if we had to specify an extra param to tell it which encoding we're providing.
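
The glue code mentioned in that issue is small. Here is a minimal sketch for Node.js (it relies on the global `Buffer`) that accepts a sha256 digest in either hex or base64 form and builds the hex-based lookup URL; the function names are hypothetical and the base URL is the `https://data.jsdelivr.com/v1` one noted above.

```javascript
// Convert a base64-encoded digest to lowercase hex.
function base64ToHex(base64) {
  return Buffer.from(base64, "base64").toString("hex");
}

// Convert a hex-encoded digest to base64.
function hexToBase64(hex) {
  return Buffer.from(hex, "hex").toString("base64");
}

// Accept a sha256 digest as hex (64 chars) or base64 (44 chars) and
// return the lookup URL, which expects the hex form.
function hashLookupUrl(hash) {
  const isHex = /^[0-9a-f]{64}$/i.test(hash);
  const hex = isHex ? hash.toLowerCase() : base64ToHex(hash);
  if (hex.length !== 64) {
    throw new Error("expected a sha256 digest in hex or base64 form");
  }
  return `https://data.jsdelivr.com/v1/lookup/hash/${hex}`;
}
```

Because a base64-encoded sha256 digest is 44 characters long and the hex form is 64, the two encodings cannot be confused, which is why simple detection by pattern is enough here.
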
https://github.com/nice-registry
- nice-registry
- simple tools and datasets for dissecting the npm registry
- https://github.com/nice-registry/welcome#why
  - Why?
    
    There is a wealth of useful information in the npm registry, but it's difficult to access.
    
    npm Incorporated does not have a public API for collecting or querying registry metadata. Back in early 2015, they created a private internal registry API that is accessible exclusively to the npm website and npm CLI, and after more than two years there are still no signs of that API becoming publicly available.
  - Datasets
    
    Some of the projects in this org are not really tools, but datasets collected by consuming the entire registry and filtering for various criteria.
- https://github.com/nice-registry/all-the-package-repos
  - all-the-package-repos
  - Normalized repository URLs for every package in the npm registry. Updated daily.
  - Maintained by jsDelivr
  - All the repository URLs in the npm registry as an object whose keys are package names and values are URLs.
    
    This package weighs in at about 100 MB.
  - https://github.com/nice-registry/all-the-package-repos/tree/master/data
    - https://github.com/nice-registry/all-the-package-repos/blob/master/data/metadata.json
    - https://github.com/nice-registry/all-the-package-repos/blob/master/data/packages.json
      - Note: As of 2025-05-28, this is a ~185 MB JSON file with 3,454,782 entries, each mapping a package name to the corresponding repository URL.
  - https://github.com/nice-registry/all-the-package-repos/tree/master/scripts
    - https://github.com/nice-registry/all-the-package-repos/blob/master/scripts/update.js
      - const replicateUrl = 'https://replicate.npmjs.com/registry'
      - const registryUrl = 'https://registry.npmjs.org'
      - Note: As of 2025-05-28, this repo/update script actually seems to be fairly modern/maintained, unlike some of the other packages under this org.
- https://github.com/nice-registry/all-the-package-names
  - all-the-package-names
  - A list of all the public package names on npm. Updated daily.
  - Maintained by jsDelivr.
  - A list of all the public package names on npm.
    - Includes scoped packages
    - Updated daily
- https://github.com/nice-registry/all-the-packages
  - all-the-packages
  - All the npm registry metadata as an offline event stream. [DEPRECATED]
  - When you install this package, a postinstall script downloads the npm registry metadata to a local JSON file, which is about 540 MB.
    - https://github.com/nice-registry/all-the-packages/blob/master/package.json#L7-L8
      - "postinstall": "npm run download"
      - "download": "curl https://skimdb.npmjs.com/registry/_design/scratch/_view/byField -o skimdb.json",
  - To get cleaner package data, use nice-package
- https://github.com/nice-registry/nice-package
  - nice-package
  - Clean up messy package metadata from the npm registry
  - The package data served by the npm registry is messy and confusing. The folks at npm, Inc maintain a tool called normalize-package-data which does a lot of work to clean this data up, but the resulting object is still a bit confusing.
    
    nice-package uses normalize-package-data as a starter, then does even more package cleanup:
    - uses the doc['dist-tags'].latest as the baseline for package metadata
    - derives starsCount from the users object
    - derives a versions array from the time object
    - renames _npmUser to lastPublisher, because it's a more intuitive name.
    - renames maintainers to owners, for consistency with the CLI commands.
    - normalizes GitHub repository URLs to https format
    - moves internal bookkeeping properties like _id and _from into an other object that can easily be omitted.
    - more...
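
To illustrate the kind of cleanup listed above, here is a simplified imitation of a few of those steps applied to a raw registry document. This is not nice-package's actual implementation: the function name, the shape of the `versions` entries, and the choice to drop the `created`/`modified` keys from `time` are assumptions made for this sketch.

```javascript
// Simplified imitation of a few nice-package cleanup steps (not its real code).
function simplifyPackument(doc) {
  // Use the version tagged `latest` as the baseline for package metadata.
  const latest = doc["dist-tags"]?.latest;
  const base = doc.versions?.[latest] ?? {};

  // `time` mixes bookkeeping keys with one publish timestamp per version.
  const { created, modified, ...versionTimes } = doc.time ?? {};

  return {
    name: doc.name,
    version: latest,
    description: base.description,
    // `users` maps usernames to `true` for everyone who starred the package.
    starsCount: Object.keys(doc.users ?? {}).length,
    versions: Object.entries(versionTimes).map(([number, date]) => ({ number, date })),
    // More intuitive names for the same data.
    lastPublisher: base._npmUser,
    owners: doc.maintainers ?? [],
    created,
    modified,
    // Internal bookkeeping moved aside so that it is easy to omit.
    other: { _id: doc._id, _from: base._from },
  };
}
```
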
- https://github.com/nice-registry/package-stream
  - package-stream
  - An endless stream of clean package data from the npm registry.
  - The stream is an event emitter that emits two events: package and up-to-date. The up-to-date event is emitted when the stream reaches the end of all existing packages, but unlike typical read streams, this stream has no end event. It remains open indefinitely, emitting package events as new package versions are published to the npm registry in real time.
  - Each object emitted by the package event is a nice-package instance. Nice packages have cleaner metadata than you'd get directly from the npm registry, and some handy convenience methods.
  - https://github.com/nice-registry/package-stream/blob/79995371c0436d0b787eaf9c5ba2309b07d4bf60/index.js#L12-L15
https://unpkg.com/
- https://unpkg.com/#metadata-api
  - Metadata API
  - UNPKG serves metadata about the files in a package when you append ?meta to any package root or subdirectory URL.
  - This will return a JSON object with information about the files in that directory, including path, size, type, and subresource integrity value.
    - https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity
      - Subresource Integrity
      - Subresource Integrity (SRI) is a security feature that enables browsers to verify that resources they fetch (for example, from a CDN) are delivered without unexpected manipulation. It works by allowing you to provide a cryptographic hash that a fetched resource must match.
      - An integrity value begins with at least one string, with each string including a prefix indicating a particular hash algorithm (currently the allowed prefixes are sha256, sha384, and sha512), followed by a dash, and ending with the actual base64-encoded hash.
      - Note: An integrity value may contain multiple hashes separated by whitespace. A resource will be loaded if it matches one of those hashes.
- https://github.com/unpkg/unpkg
  - UNPKG
  - UNPKG is a fast, global content delivery network for everything on npm. Use it to quickly and easily load any file from npm
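
Since the metadata API returns a subresource integrity value per file, it can be handy to split such a value into its parts. Below is a minimal sketch for Node.js based on the format described in the MDN excerpt above (a hash-algorithm prefix, a dash, then the base64 digest, with multiple hashes separated by whitespace). The function name is hypothetical, and the optional `?options` suffix allowed by the SRI specification is not handled.

```javascript
const ALLOWED_ALGORITHMS = new Set(["sha256", "sha384", "sha512"]);

// Split an integrity value into { algorithm, base64, hex } entries.
function parseIntegrity(integrity) {
  return integrity
    .trim()
    .split(/\s+/)
    .filter(Boolean) // an empty or blank string yields no entries
    .map((token) => {
      const dash = token.indexOf("-");
      const algorithm = token.slice(0, dash);
      const base64 = token.slice(dash + 1);
      if (dash === -1 || !ALLOWED_ALGORITHMS.has(algorithm)) {
        throw new Error(`unsupported integrity token: ${token}`);
      }
      // Hex form of the same digest, convenient for hex-based lookups.
      const hex = Buffer.from(base64, "base64").toString("hex");
      return { algorithm, base64, hex };
    });
}

// Example: the sha256 digest of an empty file.
console.log(parseIntegrity("sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU="));
```
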
https://www.skypack.dev/
- Skypack
- Load optimized npm packages with no install and no build tools.
- https://docs.skypack.dev/skypack-cdn/api-reference
  - API Reference
  - Skypack is a CDN at heart, but it can help to think of its interface as an API. Each request follows a specific format, and every response is valid JavaScript code. This lets you load anything from our CDN via a JavaScript import statement.
  - https://docs.skypack.dev/skypack-cdn/api-reference/package-metadata
    - Package Metadata
    - View metadata about a particular package.
    - GET https://cdn.skypack.dev/:packageSpecifier?meta
    - View metadata about any package. Replace :packageSpecifier with any package name and optionally a version, like so: https://cdn.skypack.dev/preact?meta or https://cdn.skypack.dev/[email protected]?meta.
https://jsr.io/
- JSR
- The open-source package registry for modern JavaScript and TypeScript
- https://jsr.io/docs/api
  - API
https://gist.github.com/anvaka/8e8fa57c7ee1350e3491
- npm rank
- This gist is updated daily via cron job and lists stats for npm packages:
  - Top 1,000 most depended-upon packages
  - Top 1,000 packages with largest number of dependencies
  - Top 1,000 packages with highest PageRank score
- This seems to have last been updated: Fri, 16 Aug 2019 07:31:10 GMT
  - https://gist.github.com/anvaka/8e8fa57c7ee1350e3491?permalink_comment_id=3581063#gistcomment-3581063
    - the data is generated by https://github.com/anvaka/npmrank
      
      A process of downloading the npm packages is a bit involved, since npm deprecated their public endpoints, but still possible. The https://github.com/anvaka/npmrank repository instructions on getting the data are up to date.
      - https://github.com/anvaka/npmrank
        
        npmrank
        
        npm dependencies graph metrics
        
        This repository computes various graph metrics for npm dependencies.
        
        Download the npm graph from npm. To do this, follow the instructions from https://github.com/anvaka/allnpm#downloading-npm-data
        
        https://github.com/anvaka/allnpm
        
        allnpm

        Graph generator of entire npm registry.
        
        https://github.com/anvaka/allnpm#downloading-npm-data
        
        Downloading npm data
        
        Unfortunately we can no longer access https://skimdb.npmjs.com/registry/_design/scratch/_view/byField directly. This CouchDB view used to return every single package from npm, that could be used to construct the graph.
        
        To get all npm packages we have to replicate the entire npm repository using standalone instance of CouchDB and following instructions from https://www.npmjs.com/package/npm-registry-couchapp.
        
        The process took me ~2 days and ~300GB of hard drive, until local instance of CouchDB compacted its views. After compaction the disk usage went down to ~100GB.
        
        Note: it is not enough to just replicate, need to wait until all indexes are generated.
        
        Once the replication is complete you can do:
        
        wget http://admin:[email protected]:5984/registry/_design/scratch/_view/byField
        
        In November 2020, this produced 3.3GB of npm packages and saved it into byField file.
        
        https://github.com/npm/npm-registry-couchapp
        
        deprecation notice: as npm has scaled, the registry architecture has gradually migrated towards a complex distributed architecture, of which npm-registry-couchapp is only a small part. FOSS is an important part of npm, and over time we plan on exposing more APIs, and better documenting the existing API.
        
        https://github.com/anvaka/npmrank#online
        
        Discover relevant and popular packages quickly: https://anvaka.github.io/npmrank/online/ Select a keyword and get packages sorted by their pagerank value.
  - https://gist.github.com/anvaka/8e8fa57c7ee1350e3491?permalink_comment_id=4435858#gistcomment-4435858
    - I've made an updated version here -- top 10k packages.
      - https://leodog896.github.io/npm-rank/index.html
        
        npm-rank
        
        Automated top 10000 npm packages collector, inspired by anvaka's npm rank gist.
        
        https://leodog896.github.io/npm-rank/PACKAGES.html
        
        Packages Ordered list of top 10000 NPM packages
        
        https://github.com/LeoDog896/npm-rank
        
        npm-rank Automated top 10000 npm packages collector using Deno & GitHub actions.
        
        The raw data is available in releases as json.
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-01-most-dependent-upon-md
  - Top 1000 most depended-upon packages
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-02-with-most-dependencies-md
  - Top 1000 packages with most dependencies
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-03-pagerank-md
  - Top 1000 packages with highest Pagerank
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-04-hits-rank-md
  - Top 1000 packages with highest authority in HITS rank
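The first three rankings above (most depended-upon, most dependencies, PageRank) can be illustrated on a toy graph. This is only a rough sketch, not npmrank's actual implementation: the `deps` graph, the function names, the damping factor and the iteration count are all assumptions made for illustration, and every dependency is assumed to also appear as a key in the graph.

```javascript
// Hypothetical toy graph: each package maps to the packages it depends on.
const deps = {
  app: ["react", "lodash"],
  cli: ["lodash"],
  react: ["loose-envify"],
  "loose-envify": ["js-tokens"],
  lodash: [],
  "js-tokens": [],
};

// "Most depended-upon": count incoming edges (direct dependents only).
function dependentsCount(graph) {
  const counts = Object.fromEntries(Object.keys(graph).map((name) => [name, 0]));
  for (const targets of Object.values(graph)) {
    for (const target of targets) counts[target] = (counts[target] ?? 0) + 1;
  }
  return counts;
}

// "Most dependencies": count outgoing edges.
function dependenciesCount(graph) {
  return Object.fromEntries(
    Object.entries(graph).map(([name, targets]) => [name, targets.length])
  );
}

// PageRank by power iteration; rank flows from a package to its dependencies.
// Packages without dependencies spread their rank evenly over all nodes.
function pageRank(graph, damping = 0.85, iterations = 50) {
  const nodes = Object.keys(graph);
  const n = nodes.length;
  let rank = Object.fromEntries(nodes.map((name) => [name, 1 / n]));
  for (let i = 0; i < iterations; i++) {
    const next = Object.fromEntries(nodes.map((name) => [name, (1 - damping) / n]));
    let dangling = 0;
    for (const node of nodes) {
      const targets = graph[node];
      if (targets.length === 0) {
        dangling += rank[node];
        continue;
      }
      for (const target of targets) {
        next[target] += (damping * rank[node]) / targets.length;
      }
    }
    for (const node of nodes) next[node] += (damping * dangling) / n;
    rank = next;
  }
  return rank;
}

// Example: sort the toy packages by PageRank, highest first.
const ranked = Object.entries(pageRank(deps)).sort((a, b) => b[1] - a[1]);
console.log(ranked.map(([name]) => name));
```

On the real registry the graph would come from the replicated npm data described above rather than a hand-written object.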
https://github.com/evanwashere/top-npm-packages
- npm packages ranked by monthly downloads
- This seems to have the top 10,000 entries in .json and .txt; last updated on Dec 21, 2021
https://npmgraph.js.org/
- https://github.com/npmgraph/npmgraph
  - npmgraph
  - A tool for exploring npm modules and dependencies.
  - Be sure to check out the new npmgraph CLI
    - https://github.com/npmgraph/npmgraph-cli
      - npmgraph-cli
      - Command-line interface for launching the npmgraph web site to show module dependency graphs.
        
        Please note that the npmgraph-cli does not render dependency graphs directly. It's simply a convenience (albeit a pretty powerful one) for opening npmgraph.js.org URLs in accordance with the npmgraph URL API.
  - https://github.com/npmgraph/npmgraph/blob/main/components/ModulePane/ModuleNpmsIOScores.tsx#L15-L18
    - https://api.npms.io/v2/package/${encodeURIComponent(module.name)}
  - https://github.com/npmgraph/npmgraph/blob/main/components/ModulePane/ModuleBundleSize.tsx#L14-L16
    - https://bundlephobia.com/result?p=${pn}
    - https://bundlephobia.com/api/size?package=${pn}
  - https://github.com/npmgraph/npmgraph/blob/main/lib/ModuleCache.ts#L57-L92
    - https://registry.npmjs.org
    - raw.githubusercontent.com
  - https://github.com/npmgraph/npmgraph/blob/main/components/ModulePane/ModulePane.tsx#L97
    - https://www.npmjs.com/package/${module.name}/v/${module.version}
    - https://cdn.jsdelivr.net/npm/${module.key}/package.json
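As a quick reference, the URL templates listed above can be collected into one helper. This is a sketch only: the function name is made up, and it assumes that npmgraph's `pn` and `module.key` are both the `name@version` string (that is a guess from the templates, not something checked against the npmgraph source).

```javascript
// Build the per-package URLs listed above for a given name and version.
// ASSUMPTION: `pn` and `module.key` in npmgraph are both "name@version".
function packageEndpoints(name, version) {
  const key = `${name}@${version}`;
  return {
    // npms.io score API (the name must be URL-encoded, e.g. for scoped packages)
    npmsScore: `https://api.npms.io/v2/package/${encodeURIComponent(name)}`,
    // Bundlephobia result page and size API
    bundlephobiaPage: `https://bundlephobia.com/result?p=${encodeURIComponent(key)}`,
    bundlephobiaSize: `https://bundlephobia.com/api/size?package=${encodeURIComponent(key)}`,
    // npm website page for a specific version
    npmPage: `https://www.npmjs.com/package/${name}/v/${version}`,
    // package.json as served by jsDelivr
    packageJson: `https://cdn.jsdelivr.net/npm/${key}/package.json`,
  };
}

console.log(packageEndpoints("@babel/core", "7.24.0").npmsScore);
```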
https://github.com/johnymontana/npm-graph
- npm-graph
- Load the npm registry into Neo4j for graph based module dependency analysis
https://www.npmcharts.com/
- https://github.com/cheapsteak/npmcharts.com
  - Compare npm package downloads over time
  - https://github.com/cheapsteak/npmcharts.com/blob/master/packages/utils/stats/fetchPackagesStats.js#L3-L14
    - Seems to fetch the package download stats directly from the npm API
https://npmtrends.com/
- npm trends Compare package download counts over time
https://taoalpha.github.io/npm-trending/
- Note: This seems to be broken currently as it looks for data for the latest date by default, and that doesn't seem to be getting filled in
- The latest date at time of writing appeared to be:
  - https://taoalpha.github.io/npm-trending/?date=2022-11-27
    - Npm Trending Report @ Sun, 27 Nov 2022 (total : 24651)
- https://github.com/taoalpha/npm-trending
  - npm-trending
  - will crawl npm packages download stats and generate trending pages every day!!!
  - Based on the update of the npm stats, I will generate report every day. The report will list three categories:
    - Top packages based on the increase of number of downloads from last day;
    - Top packages based on the increase percentage from last day;
    - Top packages with largest downloads today;
  - taoalpha/npm-trending#35
    - I don't see anything, normal?
    - Is this expected that I'm only seeing an empty template (a red bar at the top and two arrows on both sides of the page) not anything else?
    - Fetch for the report JSON is returning 404. Perhaps it isn't being generated.
    - ya I ran out of github storage for storing all the history records, so the CI builds has been failing for quite a while... trying to purge some very old records to bring it back
    - taoalpha/npm-trending#35 (comment)
      - I did a bit of a deep dive into how it defaults to trying to load data for the latest date, where it is reading that data from, what the actual latest data is and how to access it, the size of the repo (and the main different branches within it), etc.
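The three report categories that npm-trending describes (absolute increase, percentage increase, largest downloads today) are straightforward to compute from two days of download counts. The sketch below is not taken from the npm-trending code; the function name, the input shape (plain objects mapping package name to daily downloads) and the sample numbers are all hypothetical.

```javascript
// Compute the three "trending" rankings from two daily snapshots.
// Packages with no downloads yesterday are left out of the percentage
// ranking, since their relative increase is undefined.
function trendingReport(yesterday, today, topN = 3) {
  const rows = Object.keys(today).map((name) => {
    const previous = yesterday[name] ?? 0;
    const current = today[name];
    const increase = current - previous;
    const percentage = previous > 0 ? increase / previous : null;
    return { name, downloads: current, increase, percentage };
  });

  // Sort a copy in descending order by the given field and keep the top names.
  const top = (list, field) =>
    [...list]
      .sort((a, b) => b[field] - a[field])
      .slice(0, topN)
      .map((row) => row.name);

  return {
    byIncrease: top(rows, "increase"),
    byPercentage: top(rows.filter((row) => row.percentage !== null), "percentage"),
    byDownloads: top(rows, "downloads"),
  };
}

// Hypothetical sample data.
const report = trendingReport(
  { a: 1000, b: 10, c: 500 },
  { a: 1100, b: 40, c: 450, d: 5 }
);
console.log(report);
```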
https://npm.chart.dev/
- https://github.com/atinux/npm-chart
  - NPM Chart
  - Visualize your package npm downloads in a beautiful chart, ready to be shared with your community.
  - Visualize npm downloads in a beautiful chart, ready to be shared with your community.
  - Using npm-stat.com API
https://npm-stat.com/
- npm-stat npm-stat can generate download charts for any package on npm
- https://github.com/pvorb/npm-stat.com
  - npm-stat Download statistics for npm packages.
  - https://github.com/pvorb/npm-stat.com/blob/master/src/main/java/de/vorb/npmstat/clients/downloads/DownloadsClient.java#L25-L36
    - This seems to read the data from the official NPM downloads API, and I believe it also caches it in its own local database
  - pvorb/npm-stat.com#73
    - Feature: downloads leaderboard
    - pvorb/npm-stat.com#73 (comment)
      - There doesn't seem to be a good API on npm's end to get the names of every package. If there was some sort of daily dump it could be easily done.
        
        There's a complete list of all packages at replicate.npmjs.com/_all_docs. Maybe I can use that for downloading all package stats in a daily cron job...
        
        This is basically the most recent/relevant page for updates to that replicate API:
        
        https://github.com/orgs/community/discussions/152515
        
        There seem to be a few issues with it at the moment (recently migrated to a new system), I believe this is the most recent/relevant official update on that + a summary of some of the issues:
        
        https://github.com/orgs/community/discussions/152515#discussioncomment-13433941
        
        But at least in theory, when it's working correctly, you'll be able to download that full snapshot once, and then use the sequence number to just download the updates since that first 'snapshot' you downloaded from:
        
        https://replicate.npmjs.com/registry/_changes
        
        For many use cases, the revision included in those records is useful for matching up with the full package data from one of the other npm APIs; but for this use case I don't think that part even matters, as you mostly just want a list of package names; so you could probably:
        
        - fetch the latest changes since the last snapshot
        - process and de-duplicate the package entries in that snapshot into a 'reduced snapshot' (since you only really care if a package name exists, not how many times/etc)
        - process and de-duplicate that 'reduced snapshot' against the main database (I separated this from the above step since I figured de-duplicating the snapshot against itself is probably cheaper than hitting the DB for each entry; but it may be a negligible cost either way)
        
        I haven't looked deeply into the download stats API, but I think that would probably actually end up being the 'heavier' part of things; at least if you wanted it to be truly representative (and not just an approximation based on already fetched/cached data). The docs for the download count API are here:
        
        https://github.com/npm/registry/blob/main/docs/download-counts.md
        
        I won't go into that too deeply, since I believe it's what this project uses currently anyway, but probably the most important part of that would be the limitation of bulk queries:
        
        https://github.com/npm/registry/blob/main/docs/download-counts.md#bulk-queries
        
        Important: Scoped packages are not yet supported in bulk queries. So you cannot request /downloads/point/last-day/@slack/client,@iterables/map yet.
        
        Bulk queries are limited to at most 128 packages at a time and at most 365 days of data.
        
        All other queries are limited to at most 18 months of data. The earliest date for which data will be returned is January 10, 2015.
        
        Given we can only fetch the details of 128 packages at a time, we can do a bit of 'back of napkin' math to figure out what would be required. Looking at the total number of rows in the replicate API:
        
        https://replicate.npmjs.com/registry/_all_docs
        
        total_rows: 5537482
        
        I'm not 100% sure, but if we assume each of those corresponds with a single npm package, and there aren't duplicates/etc in that count, then we would need ~43,262 queries to fetch all of the download data for those packages for any given time period:
        
        5537482 / 128 = 43,261.578125 ~= 43,262
        
        The real number of queries would actually be higher than that, as according to those docs, the bulk query system doesn't work for scoped packages, so each of those would need to be a single query on its own.
        
        From a quick search, I didn't find a lot about API rate limits/etc to see how viable it is to make that many requests over a short(ish) period of time; this was the best I found:
        
        npm/feedback#658
        
        https://blog.npmjs.org/post/164799520460/api-rate-limiting-rolling-out.html
        
        But if it were me implementing it, that number feels like it might be a bit too high for fetching daily stats.. so maybe weekly or monthly would be more viable in that regard.
        
        That said, there are likely also a lot of packages in that list that aren't really used/downloaded very often; so it would be an interesting exercise to get the full list of package names, and then get their download stats for the last year or so, and then sort/filter based on that. I suspect that there is probably a large number of packages that would have 0 downloads, or at the very least less than a certain low threshold. You could potentially look at the download count on that higher time scale (eg. year, or month maybe) to filter down the full list of packages to a 'reasonably active' list (eg. above a certain download count threshold), and then only fetch the lower time-scale download counts (eg. daily) for that 'reasonably active' list of packages.
        
        Anyway, that's my /2c of thoughts on this sort of thing based on recent explorations and my own project ponderings/musings.
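The two mechanical parts of the comment above (folding a batch of `_changes` records into a de-duplicated list of package names, and splitting that list into download-count queries) can be sketched roughly as follows. This assumes the replicate API returns the standard CouchDB `_changes` shape (`results[].id`, optional `results[].deleted`, and `last_seq`), which may not hold exactly after npm's migration. The function names and the in-memory `Set` standing in for the 'main database' are hypothetical.

```javascript
// Fold one batch of _changes rows into the known set of package names.
// Duplicate ids within the batch collapse to the last entry seen.
function applyChanges(knownNames, changes) {
  const latest = new Map(); // id -> was the latest change a deletion?
  for (const row of changes.results) {
    if (row.id.startsWith("_design/")) continue; // not a package
    latest.set(row.id, row.deleted === true);
  }
  const names = new Set(knownNames);
  for (const [name, deleted] of latest) {
    if (deleted) names.delete(name);
    else names.add(name);
  }
  // `since` is the sequence to resume from on the next fetch.
  return { names, since: changes.last_seq };
}

// Lower bound on bulk queries if every package could be batched.
function minimumBulkQueries(totalPackages, batchSize = 128) {
  return Math.ceil(totalPackages / batchSize);
}

// Split names into request paths: unscoped packages in batches of up to
// 128, scoped packages one per request (bulk queries don't support them).
function planDownloadQueries(names, period = "last-month", batchSize = 128) {
  const scoped = [];
  const unscoped = [];
  for (const name of names) (name.startsWith("@") ? scoped : unscoped).push(name);

  const paths = [];
  for (let i = 0; i < unscoped.length; i += batchSize) {
    paths.push(`/downloads/point/${period}/${unscoped.slice(i, i + batchSize).join(",")}`);
  }
  for (const name of scoped) paths.push(`/downloads/point/${period}/${name}`);
  return paths;
}

console.log(minimumBulkQueries(5537482)); // the 'back of napkin' figure above
```

The paths are relative to the download counts API host (`https://api.npmjs.org`); actually issuing the requests, and pacing them to stay within whatever rate limits apply, is left out here.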
https://github.com/kkeeth/npm-stats-api
- npm-stats-api
- Node Package's Statistics API
- Node Package's Statistics API | Our functions will provide statistics of node package | This is a Node.js API wrapper for the NPM API and Registry. Based on the original npm-stat-api.
https://npms.io/
- A better and open source search for node packages
- https://github.com/npms-io
  - https://github.com/npms-io/npms
    - Meta repository for centralized issues
  - https://github.com/npms-io/npms-www
    - The https://npms.io website
  - https://github.com/npms-io/npms-api
    - The https://npms.io API
    - https://api-docs.npms.io/
      - npms-api
      - The https://npms.io API.
  - https://github.com/npms-io/npms-analyzer
    - The analyzer behind https://npms.io
https://libraries.io/
- What is Libraries.io? Libraries.io is a free service that collects publicly available open source package information scraped from the internet. With it you can search 9.96M packages by license, language, or explore new, trending, or popular packages.
- https://libraries.io/api
  - API Docs
  - Want to use Libraries.io for Data-API capabilities? API access is free to all registered users. All you need to do is create an account to get access to these capabilities! (It's free!)
  - Rate limit
    
    All requests are subject to a 60 request/minute rate limit based on your API key, any further requests within that timeframe will result in a 429 response.
    
    Larger scale access to data is available from Tidelift.
  - https://github.com/hackebrot/go-librariesio
    - go-librariesio
    - API client for libraries.io written in Go
    - go-librariesio is a Go client library for accessing the libraries.io API.
  - https://github.com/ffflorian/api-clients/tree/main/packages/libraries.io
    - libraries.io
    - A libraries.io API client.
  - https://github.com/millette/librarian-api
    - librarian-api (deprecated)
    - Client library for the libraries.io api.
- https://libraries.io/npm
  - npm
  - Total Packages: 5,143,582
  - https://libraries.io/search?order=desc&platforms=npm&sort=rank
    - Popular NPM Projects - By SourceRank
    - 1 - 30 of 5.15M packages
  - https://libraries.io/search?order=desc&platforms=npm&sort=dependents_count
    - Popular NPM Projects - By Dependents
    - 1 - 30 of 5.15M packages
  - https://libraries.io/search?order=desc&platforms=npm&sort=stars
    - Popular NPM Projects - By GitHub Stars
    - 1 - 30 of 5.15M packages
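To get a feel for what the 60 requests/minute limit quoted above means in practice, here is a rough estimate of how long it would take to page through every npm package on Libraries.io. It is only arithmetic, and it assumes the API returns 30 results per page like the search pages listed above (the real page size may differ or be configurable); the function name is hypothetical.

```javascript
// Estimate the cost of paging through a rate-limited listing.
function crawlEstimate(totalItems, perPage, requestsPerMinute) {
  const requests = Math.ceil(totalItems / perPage);
  const minutes = requests / requestsPerMinute;
  return {
    requests,
    hours: minutes / 60,
    // Minimum spacing between requests to stay under the limit.
    minDelayMs: 60000 / requestsPerMinute,
  };
}

// 5,143,582 npm packages, ASSUMED 30 per page, 60 requests per minute.
const estimate = crawlEstimate(5143582, 30, 60);
console.log(estimate.requests, estimate.hours.toFixed(1), estimate.minDelayMs);
```

Under those assumptions a full pass is on the order of two days of continuous requests, which is presumably why larger scale access is pointed at Tidelift.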

| Stat | Value | Meaning |
|------|-------|---------|
| `doc_count` | `3,520,565` | Current number of active (non-deleted) documents. |
| `doc_del_count` | `1,978,731` | Documents that have been deleted, but whose tombstones are still retained. |
| `update_seq` | `41,541,438` | Number of changes made to the DB (adds/updates/deletes). |
| `purge_seq` | `664,829` | Number of permanently purged revisions; purges fully erase history (unlike deletes). |

| Metric | Value | Description |
|--------|-------|-------------|
| `sizes.active` | `63.67 GB` | Actual active data in use. |
| `sizes.external` | `189.24 GB` | Raw data size as stored externally (JSON size without overhead, compression, etc). |
| `sizes.file` | `64.04 GB` | Size of the underlying DB file on disk. |
| `disk_size` | `64.04 GB` | Total physical size on disk (same as `sizes.file`). |
| `data_size` | `63.67 GB` | Compressed data currently in use (matches `sizes.active`). |

Package / Bundle Size, Bundle Analyzer / Visualizer, etc

https://medium.com/@glitch.txs/on-measuring-the-bundle-size-of-javascript-packages-5816e216e3d8
- On Measuring the Bundle Size of JavaScript Packages
- https://bundlephobia.com/
  - Bundlephobia
  - find the cost of adding a npm package to your bundle
  - https://github.com/pastelsky/bundlephobia
    - Find out the cost of adding a new frontend dependency to your project
  - https://github.com/AdrieanKhisbe/bundle-phobia-cli
    - Cli for the node BundlePhobia Service
  - https://github.com/pastelsky/package-build-stats
    - package-build-stats
    - This is the cloud function that powers the core of building, minifying and gzipping of packages in bundlephobia
- https://bundlejs.com/
  - bundlejs
  - a quick npm package size checker
  - https://github.com/okikio/bundlejs
    - An online tool to quickly bundle & minify your projects, while viewing the compressed gzip/brotli bundle size, all running locally on your browser.
    - I used monaco-editor for the code-editor, esbuild as bundler and treeshaker respectively, denoflate as a wasm port of gzip, deno_brotli as a wasm port of brotli, deno_lz4 as a wasm port of lz4, bytes to convert the compressed size to human readable values, esbuild-visualizer to visualize and analyze your esbuild bundle to see which modules are taking up space and, umami for private, publicly available analytics and general usage stats all without cookies.
    - bundlejs is a quick and easy way to bundle your projects, minify and see its gzip size. It's an online tool similar to bundlephobia, but bundle does all the bundling locally on your browser and can treeshake and bundle multiple packages (both commonjs and esm) together, all without having to install any npm packages and with typescript support.
- https://github.com/glitch-txs/vite-size
  - Vite Size
  - Check the bundle size of the output build of any package with Vite.
  - Measure the bundle size of any package with Vite
https://github.com/webpack-contrib/webpack-bundle-analyzer
- Webpack Bundle Analyzer
- Visualize size of webpack output files with an interactive zoomable treemap.
- Webpack plugin and CLI utility that represents bundle content as convenient interactive zoomable treemap
https://chrisbateman.github.io/webpack-visualizer/
- Webpack Visualizer
- https://github.com/chrisbateman/webpack-visualizer
  - Webpack Visualizer
  - Visualize and analyze your Webpack bundle to see which modules are taking up space and which might be duplicates.
https://github.com/btd/esbuild-visualizer
- EsBuild Visualizer
- Create chart of dependencies in your bundle
- Visualize and analyze your esbuild bundle to see which modules are taking up space.

Size Visualisation Data Structures

https://en.wikipedia.org/wiki/Treemapping
- In information visualization and computing, treemapping is a method for displaying hierarchical data using nested figures, usually rectangles.
  
  Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node's rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data.
  
  When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly prevalent. A second advantage of treemaps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
https://en.wikipedia.org/wiki/Tree_(abstract_data_type)
- In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be connected to exactly one parent, except for the root node, which has no parent (i.e., the root node as the top-most node in the tree hierarchy). These constraints mean there are no cycles or "loops" (no node can be its own ancestor), and also that each child can be treated like the root node of its own subtree, making recursion a useful technique for tree traversal.
- The abstract data type (ADT) can be represented in a number of ways, including a list of parents with pointers to children, a list of children with pointers to parents, or a list of nodes and a separate list of parent-child relations (a specific type of adjacency list). Representations might also be more complicated, for example using indexes or ancestor lists for performance.
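To connect the two ideas above, here is a minimal sketch of a size tree and a treemap layout over it. It uses the simple 'slice-and-dice' scheme (alternating horizontal and vertical splits per depth) rather than whatever algorithm the bundle analyzers above actually use, and the `bundle` tree, its sizes and the function names are invented for illustration.

```javascript
// Hypothetical bundle tree: leaves carry a size, branches carry children.
const bundle = {
  name: "bundle.js",
  children: [
    {
      name: "node_modules",
      children: [
        { name: "react-dom", size: 120 },
        { name: "lodash", size: 60 },
      ],
    },
    { name: "src", children: [{ name: "index.js", size: 20 }] },
  ],
};

// A branch's size is the sum of its leaves (recursive tree traversal).
function totalSize(node) {
  if (!node.children) return node.size;
  return node.children.reduce((sum, child) => sum + totalSize(child), 0);
}

// Slice-and-dice treemap: each node gets a rectangle, and its children tile
// that rectangle with areas proportional to their sizes. Even depths split
// left-to-right, odd depths split top-to-bottom.
function layout(node, x, y, width, height, depth = 0, out = []) {
  out.push({ name: node.name, x, y, width, height, depth });
  if (!node.children) return out;
  const total = totalSize(node);
  let offset = 0; // fraction of the parent already used
  for (const child of node.children) {
    const share = total === 0 ? 0 : totalSize(child) / total;
    if (depth % 2 === 0) {
      layout(child, x + offset * width, y, share * width, height, depth + 1, out);
    } else {
      layout(child, x, y + offset * height, width, share * height, depth + 1, out);
    }
    offset += share;
  }
  return out;
}

for (const rect of layout(bundle, 0, 0, 200, 100)) {
  console.log("  ".repeat(rect.depth) + rect.name, rect.width * rect.height);
}
```

Recomputing `totalSize` at every level keeps the sketch short; a real implementation would cache the sums on the nodes.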

Link Dump 1

The below content was originally posted in this comment (Dec 7, 2023: Ref), and then copied over as the basis for a new issue in this comment (Dec 13, 2023: Ref)

It has been further refined/enhanced since, including fixing up the titles, adding abstracts, and removing irrelevant links.

Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:

Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity

Program Dependence Graph, Control Flow Graph, Data Flow Graph, Data Flow Analysis, Program Analysis Tools, etc

https://en.wikipedia.org/wiki/Program_dependence_graph
- Program Dependence Graph - Wikipedia
- In computer science, a Program Dependence Graph (PDG) is a representation of a program's control and data dependencies. It's a directed graph where nodes represent program statements, and edges represent dependencies between these statements. PDGs are useful in various program analysis tasks, including optimizations, debugging, and understanding program behavior.
https://en.wikipedia.org/wiki/Control-flow_graph
- Control-Flow Graph - Wikipedia
- In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution.
- In a control-flow graph each node in the graph represents a basic block, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph, and the exit block, through which all control flow leaves.
- https://github.com/rudrOwO/control-flow-graph
  - Control-flow Graph Generate control-flow graph (CFG) from any code consisting of C-like syntax
  - https://control-flow.vercel.app/
- https://reverseengineering.stackexchange.com/questions/16557/building-a-control-flow-graph-from-machine-code
  - Building a control flow graph from machine code (2017)
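The basic-block definition quoted above (jump targets start a block, jumps end a block) maps onto the classic 'leaders' approach. The sketch below works on a made-up flat instruction list, where `jump` is unconditional, `branch` is conditional, and `target` is an instruction index; that format and the function name are assumptions for illustration, not the input of any tool linked here.

```javascript
// Split a flat instruction list into basic blocks and connect them.
function buildCfg(instructions) {
  if (instructions.length === 0) return [];

  // Leaders: the first instruction, every jump target, and every
  // instruction that directly follows a jump or branch.
  const leaders = new Set([0]);
  instructions.forEach((ins, i) => {
    if (ins.op === "jump" || ins.op === "branch") {
      leaders.add(ins.target);
      if (i + 1 < instructions.length) leaders.add(i + 1);
    }
  });

  const starts = [...leaders].sort((a, b) => a - b);
  const blocks = starts.map((start, k) => ({
    start,
    end: (starts[k + 1] ?? instructions.length) - 1,
    successors: [],
  }));
  const blockAt = new Map(blocks.map((block, k) => [block.start, k]));

  // Directed edges: the jump target, plus fall-through unless the block
  // ends in an unconditional jump or a return.
  blocks.forEach((block, k) => {
    const last = instructions[block.end];
    if (last.op === "jump" || last.op === "branch") {
      block.successors.push(blockAt.get(last.target));
    }
    if (last.op !== "jump" && last.op !== "return" && k + 1 < blocks.length) {
      block.successors.push(k + 1);
    }
  });
  return blocks;
}

// Hypothetical program: i = 0; while (i < 3) { work(); i++; } return;
const program = [
  { op: "assign" },            // 0: i = 0
  { op: "branch", target: 5 }, // 1: if !(i < 3) goto 5
  { op: "call" },              // 2: work()
  { op: "assign" },            // 3: i++
  { op: "jump", target: 1 },   // 4: goto 1
  { op: "return" },            // 5: return
];
console.log(buildCfg(program));
```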
https://stackoverflow.com/questions/15087195/data-flow-graph-construction
- Stack Overflow: Data Flow Graph Construction (2013)
https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree
- Code Review Stack Exchange: Call-flow graph from Python abstract syntax tree (2022)
https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/
- CodeQL Documentation: About data flow analysis
- Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.
- https://codeql.github.com/docs/codeql-language-guides/analyzing-data-flow-in-javascript-and-typescript/#analyzing-data-flow-in-javascript-and-typescript
  - Analyzing data flow in JavaScript and TypeScript This topic describes how data flow analysis is implemented in the CodeQL libraries for JavaScript/TypeScript and includes examples to help you write your own data flow queries.
https://clang.llvm.org/docs/DataFlowAnalysisIntro.html
- Clang Documentation: Data flow analysis: an informal introduction
- This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems.
- Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached.
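The 'propagate facts along CFG edges until a fixpoint is reached' idea can be shown with reaching definitions, one of the textbook data flow analyses. This is a sketch under simplifying assumptions: the CFG is given directly as blocks with precomputed `gen`/`kill` sets and predecessor lists, and the block and definition names are hypothetical.

```javascript
// Reaching definitions: which assignments may reach the start/end of
// each block. Iterates until no block's output set changes (fixpoint).
function reachingDefinitions(cfg) {
  const ids = Object.keys(cfg);
  const reachIn = Object.fromEntries(ids.map((id) => [id, new Set()]));
  const reachOut = Object.fromEntries(ids.map((id) => [id, new Set()]));

  let changed = true;
  while (changed) {
    changed = false;
    for (const id of ids) {
      const { gen, kill, preds } = cfg[id];
      // IN = union of the predecessors' OUT sets.
      const incoming = new Set(preds.flatMap((pred) => [...reachOut[pred]]));
      // OUT = gen ∪ (IN − kill)
      const outgoing = new Set([...incoming].filter((def) => !kill.includes(def)));
      for (const def of gen) outgoing.add(def);

      const grew =
        outgoing.size !== reachOut[id].size ||
        [...outgoing].some((def) => !reachOut[id].has(def));
      reachIn[id] = incoming;
      if (grew) {
        reachOut[id] = outgoing;
        changed = true;
      }
    }
  }
  return { reachIn, reachOut };
}

// Hypothetical CFG for: x = 0 (d1); while (...) { x = x + 1 (d2) }; use(x)
const cfg = {
  entry: { gen: ["d1"], kill: ["d2"], preds: [] },
  header: { gen: [], kill: [], preds: ["entry", "body"] },
  body: { gen: ["d2"], kill: ["d1"], preds: ["header"] },
  exit: { gen: [], kill: [], preds: ["header"] },
};
console.log(reachingDefinitions(cfg).reachIn.exit);
```

Both definitions of `x` reach the exit block here, because the loop body may run zero or more times.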
https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html
- Program Analysis Tools
- Contents:
  - 1 Representing Programs
    - 1.1 Abstract Syntax Trees (ASTs)
    - 1.2 Control Flow Graphs
  - 2 Style and Anomaly Checking
    - 2.1 Lint
    - 2.2 Static Analysis by Compilers
    - 2.3 CheckStyle
    - 2.4 SpotBugs
    - 2.5 PMD
  - 3 Reverse-Engineering Tools
    - 3.1 Reverse Compilers
    - 3.2 Java Obfuscators
    - 3.3 Obfuscation Example
  - 4 Dynamic Analysis Tools
    - 4.1 Pointer/Memory Errors
    - 4.2 Profilers
- https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html#control-flow-graphs
  - 1.2 Control Flow Graphs
  - Represent each executable statement in the code as a node, with edges connecting nodes that can be executed one after another. Nodes for conditional statements have two or more outgoing edges.
- https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html#data-flow-analysis
  - 1.2.2 Data Flow Analysis
https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf
- Slides: Basic Program Analysis - Suman Jana
- ChatGPT Summary / Abstract:
  - Title: Basic Program Analysis
    
    Author: Suman Jana
    
    Institution: Columbia University
    
    Abstract: This document delves into the foundational concepts and techniques involved in program analysis, particularly focusing on control flow and data flow analysis essential for identifying security bugs in source code. The objective is to equip readers with the understanding and tools needed to effectively analyze programs without building systems from scratch, utilizing existing frameworks such as LLVM for customization and enhancement of analysis processes.
    
    The core discussion includes an overview of compiler design with specific emphasis on the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Analysis. These elements are critical in understanding the structure of source code and its execution flow. The document highlights the conversion of source code into AST and subsequently into CFG, where data flow analysis can be applied to optimize code and identify potential security vulnerabilities.
    
    Additionally, the paper explores more complex topics like identifying basic blocks within CFG, constructing CFG from basic blocks, and advanced concepts such as loop identification and the concept of dominators in control flow. It also addresses the challenges and solutions related to handling irreducible Control Flow Graphs (CFGs), which are crucial for the analysis of less structured code.
    
    Keywords: Program Analysis, Compiler Design, Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Analysis, LLVM, Security Bugs.

Stack Overflow: Assembly-level function fingerprint (2011)

https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint
- Stack Overflow: Assembly-level function fingerprint (2011)

Systems and methods for detecting copied computer code using fingerprints (2016)

https://patents.google.com/patent/US9459861B1/en
- Systems and methods for detecting copied computer code using fingerprints (2016)
- Systems and methods of detecting copying of computer code or portions of computer code involve generating unique fingerprints from compiled computer binaries. The unique fingerprints are simplified representations of functions in the compiled computer binaries and are compared with each other to identify similarities between functions in the respective compiled computer binaries. Copying can be detected when there are sufficient similarities between fingerprints of two functions.

A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)

https://dl.acm.org/doi/10.1145/3486860
- A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)
- Binary code fingerprinting is crucial in many security applications. Examples include malware detection, software infringement, vulnerability analysis, and digital forensics. It is also useful for security researchers and reverse engineers since it enables high fidelity reasoning about the binary code such as revealing the functionality, authorship, libraries used, and vulnerabilities. Numerous studies have investigated binary code with the goal of extracting fingerprints that can illuminate the semantics of a target application. However, extracting fingerprints is a challenging task since a substantial amount of significant information will be lost during compilation, notably, variable and function naming, the original data and control flow structures, comments, semantic information, and the code layout. This article provides the first systematic review of existing binary code fingerprinting approaches and the contexts in which they are used. In addition, it discusses the applications that rely on binary code fingerprints, the information that can be captured during the fingerprinting process, and the approaches used and their implementations. It also addresses limitations and open questions related to the fingerprinting process and proposes future directions.

BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)

https://inria.hal.science/hal-01648996/document
- BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)
- Binary code fingerprinting is a challenging problem that requires an in-depth analysis of binary components for deriving identifiable signatures. Fingerprints are useful in automating reverse engineering tasks including clone detection, library identification, authorship attribution, cyber forensics, patch analysis, malware clustering, binary auditing, etc. In this paper, we present BinSign, a binary function fingerprinting framework. The main objective of BinSign is providing an accurate and scalable solution to binary code fingerprinting by computing and matching structural and syntactic code profiles for disassemblies. We describe our methodology and evaluate its performance in several use cases, including function reuse, malware analysis, and indexing scalability. Additionally, we emphasize the scalability aspect of BinSign. We perform experiments on a database of 6 million functions. The indexing process requires an average time of 0.0072 seconds per function. We find that BinSign achieves higher accuracy compared to existing tools.

Software Fingerprinting in LLVM (2021)

https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf
- Software Fingerprinting in LLVM (2021)
- Executable steganography, the hiding of software machine code inside of a larger program, is a potential approach to introduce new software protection constructs such as watermarks or fingerprints. Software fingerprinting is, therefore, a process similar to steganography, hiding data within other data. The goal of fingerprinting is to hide a unique secret message, such as a serial number, into copies of an executable program in order to provide proof of ownership of that program. Fingerprints are a special case of watermarks, with the difference being that each fingerprint is unique to each copy of a program. Traditionally, researchers describe four aims that a software fingerprint should achieve. These include the fingerprint should be difficult to remove, it should not be obvious, it should have a low false positive rate, and it should have negligible impact on performance. In this research, we propose to extend these objectives and introduce a fifth aim: that software fingerprints should be machine independent. As a result, the same fingerprinting method can be used regardless of the architecture used to execute the program. Hence, this paper presents an approach towards the realization of machine-independent fingerprinting of executable programs. We make use of Low-Level Virtual Machine (LLVM) intermediate representation during the software compilation process to demonstrate both a simple static fingerprinting method as well as a dynamic method, which displays our aim of hardware independent fingerprinting. The research contribution includes a realization of the approach using the LLVM infrastructure and provides a proof of concept for both simple static and dynamic watermarks that are architecture neutral.

Syntax tree fingerprinting for source code similarity detection (2009)

https://ieeexplore.ieee.org/document/5090050
- Syntax tree fingerprinting for source code similarity detection (2009)
- Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.
- https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf
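To make the idea of hashing syntax trees to find exact clones (with respect to an abstraction of the source) more concrete, here is a minimal sketch. It is not the paper's framework: it assumes Python's built-in `ast` module as a stand-in parser (so it fingerprints Python rather than JavaScript), uses SHA-1 as an arbitrary hash, and abstracts away identifier names and literal values by hashing node types only. The names `subtree_hash`, `find_exact_clones`, and `SAMPLE` are hypothetical.

```python
import ast
import hashlib
from collections import defaultdict


def subtree_hash(node: ast.AST, index: dict) -> str:
    """Hash a subtree from its node type and its children's hashes.

    Identifier names and literal values are not AST child nodes, so they
    are ignored: two subtrees that differ only by renaming hash the same.
    """
    child_hashes = [subtree_hash(child, index) for child in ast.iter_child_nodes(node)]
    payload = type(node).__name__ + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    # Index only function definitions to keep the example small.
    if isinstance(node, ast.FunctionDef):
        index[digest].append(node.name)
    return digest


def find_exact_clones(source: str) -> list[list[str]]:
    """Return groups of function names whose abstracted ASTs are identical."""
    index = defaultdict(list)
    subtree_hash(ast.parse(source), index)
    return [sorted(names) for names in index.values() if len(names) > 1]


SAMPLE = """
def add(a, b):
    total = a + b
    return total

def plus(x, y):
    result = x + y
    return result

def scale(a, b):
    return a * b
"""

print(find_exact_clones(SAMPLE))
```

In this sample, `add` and `plus` land in the same bucket because they differ only in naming, while `scale` has a different shape. A real index would store the hashes in a database and go on to examine neighbouring matches, as the abstract describes.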

Syntax tree fingerprinting: a foundation for source code similarity detection (2011)

https://hal.science/hal-00627811/document
- Syntax tree fingerprinting: a foundation for source code similarity detection (2011)
- Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)

https://ieeexplore.ieee.org/document/9960266
- Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)
- An Abstract Syntax Tree (AST) is an abstract logical structure of source code represented as a tree. This research utilizes information of fingerprinting with AST to locate the similarities between source codes. The proposed method can detect plagiarism in source codes using the number of duplicated logical structures. The structural information of program is stored in the fingerprints format. Then, the fingerprints of source codes are compared to identify number of similar nodes. The final output is calculated from number of similar nodes known as similarities scores. The result shows that the proposed method accurately captures the common modification techniques from basic to advanced.

Dynamic graph-based software fingerprinting (2007)

https://dl.acm.org/doi/abs/10.1145/1286821.1286826
- Dynamic graph-based software fingerprinting (2007)
- Fingerprinting embeds a secret message into a cover message. In media fingerprinting, the secret is usually a copyright notice and the cover a digital image. Fingerprinting an object discourages intellectual property theft, or when such theft has occurred, allows us to prove ownership.
  
  The Software Fingerprinting problem can be described as follows. Embed a structure W into a program P such that: W can be reliably located and extracted from P even after P has been subjected to code transformations such as translation, optimization and obfuscation; W is stealthy; W has a high data rate; embedding W into P does not adversely affect the performance of P; and W has a mathematical property that allows us to argue that its presence in P is the result of deliberate actions.
  
  In this article, we describe a software fingerprinting technique in which a dynamic graph fingerprint is stored in the execution state of a program. Because of the hardness of pointer alias analysis such fingerprints are difficult to attack automatically.
- https://dl.acm.org/doi/pdf/10.1145/1286821.1286826

Adaptive Structural Fingerprints for Graph Attention Networks (2019)

https://openreview.net/forum?id=BJxWx0NYPr
- Adaptive Structural Fingerprints for Graph Attention Networks (2019)
- Graph attention network (GAT) is a promising framework to perform convolution and message passing on graphs. Yet, how to fully exploit rich structural information in the attention mechanism remains a challenge. In the current version, GAT calculates attention scores mainly using node features and among one-hop neighbors, while increasing the attention range to higher-order neighbors can negatively affect its performance, reflecting the over-smoothing risk of GAT (or graph neural networks in general), and the ineffectiveness in exploiting graph structural details. In this paper, we propose an "adaptive structural fingerprint" (ADSF) model to fully exploit graph topological details in graph attention network. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus significantly improving subsequent attention layer as well as the convergence of learning. Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to 'cross-talk' with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Empirical results demonstrate the power of our approach in exploiting rich structural information in GAT and in alleviating the intrinsic oversmoothing problem in graph neural networks.

Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)

https://digitalcommons.calpoly.edu/theses/2040/
- Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)
- Code clones are pieces of code that have the same functionality. While some clones may structurally match one another, others may look drastically different. The inclusion of code clones clutters a code base, leading to increased costs through maintenance. Duplicate code is introduced through a variety of means, such as copy-pasting, code generated by tools, or developers unintentionally writing similar pieces of code. While manual clone identification may be more accurate than automated detection, it is infeasible due to the extensive size of many code bases. Software code clone detection methods have differing degree of success based on the analysis performed. This thesis outlines a method of detecting clones using a program dependence graph and subgraph isomorphism to identify similar subgraphs, ultimately illuminating clones. The project imposes few constraints when comparing code segments to potentially reveal more clones.
- https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses

Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)

https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W
- Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)
- The code clone detection issue has been researched using a number of explicit factors based on the tokens and contents and found effective results. However, exposing code contents may be an impractical option because of privacy and security factors. Moreover, the lack of scalability of past methods is an important challenge. The code flow states can be inferred by code structure and implicitly represented using empirical graphs. The assumption is that modelling of the code clone detection problem can be achieved without the content of the codes being revealed. Here, a Graph-of-Code concept for the code clone detection problem is introduced, which represents codes into graphs. While Graph-of-Code provides structural properties and quantification of its characteristics, it can exclude code contents or tokens to identify the clone type. The aim is to evaluate the impact of graph-of-code structural properties on the performance of code clone detection. This work employs a feature extraction-based approach for unlabelled graphs. The approach generates a “Graph Fingerprint” which represents different topological feature levels. The results of code clone detection indicate that code structure has a significant role in detecting clone types. We found different GoC-models outperform others. The models achieve between 96% to 99% in detecting code clones based on recall, precision, and F1-Score. The GoC approach is capable in detecting code clones with scalable dataset and with preserving codes privacy.

A graph-based code representation method to improve code readability classification (2023)

https://www.researchgate.net/publication/370980383_A_graph-based_code_representation_method_to_improve_code_readability_classification
- A graph-based code representation method to improve code readability classification (2023)
- Context: Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to the source code being classified as pre-defined certain levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. Objective: However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Method: Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to reserve the semantic structural information and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural networks model composing Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. Result: We evaluate our approach to the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. Conclusion: We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.

Link Dump 2

The below content was originally posted in the following comment (April 30, 2024: Ref)

It has been further refined/enhanced since.

OpenAI Embeddings

This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space:

https://openai.com/blog/introducing-text-and-code-embeddings
- https://platform.openai.com/docs/guides/embeddings
  - An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
- https://platform.openai.com/docs/api-reference/embeddings
https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
- Faiss: A library for efficient similarity search
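As a rough illustration of "the distance between two vectors measures their relatedness", the sketch below ranks a few vectors by cosine similarity against a query. The vectors are made-up three-dimensional toy values, not real model output, and the labels (`chunk_a` etc.) are hypothetical. In practice the vectors would come from an embedding model, and a library such as Faiss would replace the linear scan once the corpus gets large.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: list[float], corpus: dict[str, list[float]]) -> str:
    """Return the label of the corpus vector most similar to the query."""
    return max(corpus, key=lambda label: cosine_similarity(query, corpus[label]))


# Toy stand-ins for embeddings of three known code chunks (assumed values).
TOY_EMBEDDINGS = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.1, 0.8, 0.3],
    "chunk_c": [0.0, 0.2, 0.9],
}

# Toy stand-in for the embedding of an unknown minified chunk.
QUERY = [0.85, 0.15, 0.05]

print(nearest(QUERY, TOY_EMBEDDINGS))
```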

Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2)

Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here:

Wikipedia Articles, etc

https://en.wikipedia.org/wiki/Content_similarity_detection
- Content similarity detection

A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)

https://arxiv.org/abs/2306.16171
- A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)
- Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance.

A comparison of code similarity analysers (2017)

https://link.springer.com/article/10.1007/s10664-017-9564-7
- A comparison of code similarity analysers (2017)
- Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

Winnowing: Local Algorithms for Document Fingerprinting (2003)

https://www.researchgate.net/publication/2840981_Winnowing_Local_Algorithms_for_Document_Fingerprinting
- Winnowing: Local Algorithms for Document Fingerprinting (2003)
- Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with Moss, a widely-used plagiarism detection service.
- https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
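In short, winnowing hashes every k-gram of a document, slides a window over those hashes, and keeps the minimum hash of each window as a fingerprint. The sketch below is a simplified reading of that idea, not the paper's reference implementation. It assumes `zlib.crc32` in place of a rolling hash, uses arbitrary values for `k` and `w`, and skips any normalisation of the input text.

```python
import zlib


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every substring of length k (crc32 stands in for a rolling hash)."""
    return [zlib.crc32(text[i:i + k].encode("utf-8")) for i in range(len(text) - k + 1)]


def winnow(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """Select (hash, position) fingerprints: the minimum hash of each window.

    Ties are broken by taking the rightmost minimum, so that neighbouring
    windows tend to select the same k-gram and the result stays sparse.
    """
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    fingerprints = set()
    window_count = max(len(hashes) - w + 1, 1)
    for start in range(window_count):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def shared_hashes(a: str, b: str, k: int = 5, w: int = 4) -> set[int]:
    """Fingerprint hashes that two documents have in common."""
    return {h for h, _ in winnow(a, k, w)} & {h for h, _ in winnow(b, k, w)}


LIBRARY = "function add(a,b){return a+b}"
BUNDLE = "var z=1;function add(a,b){return a+b};z++"

print(len(winnow(BUNDLE)), "fingerprints;", len(shared_hashes(LIBRARY, BUNDLE)), "shared")
```

Because both documents see exactly the same window of hashes inside any shared run of at least `w + k - 1` characters, they are guaranteed to select at least one common fingerprint there. That guarantee is what makes partial copies detectable.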

Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023)

https://www.researchgate.net/publication/375651686_Source_Code_Plagiarism_Detection_with_Pre-Trained_Model_Embeddings_and_Automated_Machine_Learning
- Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023)
- https://aclanthology.org/2023.ranlp-1.34.pdf

A Source Code Similarity System for Plagiarism Detection (2013)

https://www.researchgate.net/publication/262322336_A_Source_Code_Similarity_System_for_Plagiarism_Detection
- A Source Code Similarity System for Plagiarism Detection (2013)
- Source code plagiarism is an easy to do task, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cut-off threshold values of 35–70%.

A Source Code Similarity Based on Siamese Neural Network (2020)

https://www.mdpi.com/2076-3417/10/21/7519
- A Source Code Similarity Based on Siamese Neural Network (2020)
- Finding similar code snippets is a fundamental task in the field of software engineering. Several approaches have been proposed for this task by using statistical language model which focuses on syntax and structure of codes rather than deep semantic information underlying codes. In this paper, a Siamese Neural Network is proposed that maps codes into continuous space vectors and try to capture their semantic meaning. Firstly, an unsupervised pre-trained method that models code snippets as a weighted series of word vectors. The weights of the series are fitted by the Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network trained model is constructed to learn semantic vector representation of code snippets. Finally, the cosine similarity is provided to measure the similarity score between pairs of code snippets. Moreover, we have implemented our approach on a dataset of functionally similar code. The experimental results show that our method improves some performance over single word embedding method.

Detecting Source Code Similarity Using Compression (2019)

https://www.researchgate.net/publication/337196468_Detecting_Source_Code_Similarity_Using_Compression
- Detecting Source Code Similarity Using Compression (2019)
- Different forms of plagiarism make a fair assessment of student assignments more difficult. Source code plagiarisms pose a significant challenge especially for automated assessment systems aimed for students' programming solutions. Different automated assessment systems employ different text or source code similarity detection tools, and all of these tools have their advantages and disadvantages. In this paper, we revitalize the idea of similarity detection based on string complexity and compression. We slightly adapt an existing, third-party, approach, implement it and evaluate its potential on synthetically generated cases and on a small set of real student solutions. On synthetic cases, we showed that average deviation (in absolute values) from the expected similarity is less than 1% (0.94%). On the real-life examples of student programming solutions we compare our results with those of two established tools. The average difference is around 18.1% and 11.6%, while the average difference between those two tools is 10.8%. However, the results of all three tools follow the same trend. Finally, a deviation to some extent is expected as observed tools apply different approaches that are sensitive to other factors of similarities. Gained results additionally demonstrate open challenges in the field.
- https://ceur-ws.org/Vol-2508/paper-pri.pdf
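One well-known way to turn a compressor into a similarity measure is the Normalized Compression Distance (NCD). The sketch below uses it as a generic illustration of compression-based similarity; it is not necessarily the measure adapted in the paper above. It assumes `zlib` as the compressor, and the three JavaScript snippets are made-up examples. Scores on inputs this short are noisy because of fixed compressor overhead.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input at maximum compression."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance: lower means more similar."""
    raw_a, raw_b = a.encode("utf-8"), b.encode("utf-8")
    size_a, size_b = compressed_size(raw_a), compressed_size(raw_b)
    size_ab = compressed_size(raw_a + raw_b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


ORIGINAL = (
    "function debounce(fn, wait) { let timer = null; return function (...args) "
    "{ clearTimeout(timer); timer = setTimeout(() => fn.apply(this, args), wait); }; }"
)
RENAMED = (
    "function d(f, w) { let t = null; return function (...a) "
    "{ clearTimeout(t); t = setTimeout(() => f.apply(this, a), w); }; }"
)
UNRELATED = (
    "export class EventEmitter { constructor() { this.listeners = new Map(); } "
    "on(name, cb) { const list = this.listeners.get(name) || []; list.push(cb); "
    "this.listeners.set(name, list); } }"
)

print(round(ncd(ORIGINAL, RENAMED), 2), round(ncd(ORIGINAL, UNRELATED), 2))
```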

Binary code similarity analysis based on naming function and common vector space (2023)

https://www.nature.com/articles/s41598-023-42769-9
- Binary code similarity analysis based on naming function and common vector space (2023)
- Binary code similarity analysis is widely used in the field of vulnerability search where source code may not be available to detect whether two binary functions are similar or not. Based on deep learning and natural processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from the shortcomings of large differences in instruction syntaxes across different target platforms, inability to align control flow graph nodes, and less introduction of high-level semantics of stability, which pose challenges for identifying similar computations between binary functions of different platforms generated from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and a cross-platform binary function similarity comparison model N_Match is proposed. The model elevates different platform instructions to the same semantic space to shield their underlying platform instruction differences, uses graph embedding technology to learn the stability semantics of neighbors, extracts high-level knowledge of naming function to alleviate the differences brought about by cross-platform and cross-optimization levels, and combines the stable graph structure as well as the stable, platform-independent API knowledge of naming function to represent the final semantics of functions. The experimental results show that the model accuracy of N_Match outperforms the baseline model in terms of cross-platform, cross-optimization level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, the mAP exceeds the current graph embedding model by 66%. In addition, we also give several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match

REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023)

https://arxiv.org/abs/2305.03843
- REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023)
- This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training. We present the first-ever code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements demonstrating both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine tuning, even when enhancing the largest available LLMs highlighting the importance for open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called REINFOREST.

Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021)

https://www.usenix.org/conference/usenixsecurity21/presentation/ahmadi
- Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021)
- Probabilistic classification has shown success in detecting known types of software bugs. However, the works following this approach tend to require a large amount of specimens to train their models. We present a new machine learning-based bug detection technique that does not require any external code or samples for training. Instead, our technique learns from the very codebase on which the bug detection is performed, and therefore, obviates the need for the cumbersome task of gathering and cleansing training samples (e.g., buggy code of certain kinds). The key idea behind our technique is a novel two-step clustering process applied on a given codebase. This clustering process identifies code snippets in a project that are functionally-similar yet appear in inconsistent forms. Such inconsistencies are found to cause a wide range of bugs, anything from missing checks to unsafe type conversions. Unlike previous works, our technique is generic and not specific to one type of inconsistency or bug. We prototyped our technique and evaluated it using 5 popular open source software, including QEMU and OpenSSL. With a minimal amount of manual analysis on the inconsistencies detected by our tool, we discovered 22 new unique bugs, despite the fact that many of these programs are constantly undergoing bug scans and new bugs in them are believed to be rare.
- https://www.usenix.org/system/files/sec21summer_ahmadi.pdf

MOSS: A System for Detecting Software Similarity (1997?)

https://theory.stanford.edu/~aiken/moss/
- MOSS: A System for Detecting Software Similarity

antiplag - similarity checking software for program codes, documents, and pictures (2019)

https://github.com/fanghon/antiplag
- antiplag - similarity checking software for program codes, documents, and pictures (2019)
- The software mainly checks and compares the similarities between electronic assignments submitted by students. It can analyze the content of multiple programming languages (such as java, c/c++, python, etc.), text in multiple formats (txt, doc, docx, pdf, etc.) in English and in simplified and traditional Chinese, and images in multiple formats (png, jpg, gif, bmp, etc.), and it outputs the code, text, and images with high similarity, thereby helping to detect plagiarism between students.

SCOSS - A Source Code Similarity System (2021)

https://github.com/BK-SCOSS/scoss
- scoss A Source Code Similarity System - SCOSS

Dolos (2019-2024+)

https://github.com/dodona-edu/dolos
- Dolos (2019-2024+)
- Dolos is a source code plagiarism detection tool for programming exercises. Dolos helps teachers in discovering students sharing solutions, even if they are modified. By providing interactive visualizations, Dolos can also be used to sensitize students to prevent plagiarism.
- https://dolos.ugent.be/
- https://dolos.ugent.be/about/algorithm.html
  - How Dolos works: Conceptually, the plagiarism detection pipeline of Dolos can be split into four successive steps:
    - Tokenization
    - Fingerprinting
    - Indexing
    - Reporting
  - Tokenization: To be immune against masking plagiarism by techniques such as renaming variables and functions, Dolos doesn't directly process the source code under investigation. It starts by performing a tokenization step using Tree-sitter. Tree-sitter can generate syntax trees for many programming languages, converts source code to a more structured form, and masks specific naming of variables and functions.
  - Fingerprinting: To measure similarities between (converted) files, Dolos tries to find common sequences of tokens. More specifically, it uses subsequences of fixed length called k-grams. To efficiently make these comparisons and reduce the memory usage, all k-grams are hashed using a rolling hash function (the one used by Rabin-Karp in their string matching algorithm). The length k of k-grams can be set with the -k option.
    
    To further reduce the memory usage, only a subset of all hashes are stored. The selection of hashes is done by the Winnowing algorithm as described by (Schleimer, Wilkerson and Aiken). In short: only the hash with the smallest numerical value is kept for each window. The window length (in k-grams) can be altered with the -w option.
    
    The remaining hashes are the fingerprints of the analyzed files. Internally, these are stored as simple integers.
  - Indexing: Because Dolos needs to compare all files with each other, it is more efficient to first create an index containing the fingerprints of all files. For each of the fingerprints encountered in any of the files, we store the file and the corresponding line number where we encountered that fingerprint.
    
    As soon as a fingerprint is stored in the index twice, this is recorded as a match between the two files because they share at least one k-gram.
  - Reporting: Dolos finally collects all fingerprints that occur in more than one file and aggregates the results into a report.
    
    This report contains all file pairs that have at least one common fingerprint, together with some metrics:
    - similarity: the fraction of shared fingerprints between the two files
    - total overlap: the absolute value of shared fingerprints, useful for larger projects
    - longest fragment: the length (in fingerprints) of the longest subsequence of fingerprints matching between the two files, useful when not the whole source code is copied
- https://dolos.ugent.be/about/languages.html
- https://dolos.ugent.be/about/publications.html
  - Publications: Dolos is developed by Team Dodona at Ghent University in Belgium. Our research is published in the following journals and conferences.
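The indexing and reporting steps described above can be sketched as an inverted index from fingerprint to files, followed by a per-pair summary. This is a simplified illustration rather than Dolos's actual code. It assumes the fingerprints are already computed (the integers are made up), leaves out line numbers and the longest-fragment metric, and takes "similarity" to mean shared fingerprints divided by the union of both files' fingerprints, which is only one possible reading of "fraction of shared fingerprints".

```python
from collections import defaultdict
from itertools import combinations


def build_index(fingerprints_by_file: dict[str, set[int]]) -> dict[int, set[str]]:
    """Map each fingerprint to the set of files it occurs in."""
    index = defaultdict(set)
    for name, fingerprints in fingerprints_by_file.items():
        for fingerprint in fingerprints:
            index[fingerprint].add(name)
    return index


def report(fingerprints_by_file: dict[str, set[int]]) -> list[dict]:
    """List every file pair sharing at least one fingerprint, most similar first."""
    shared = defaultdict(set)
    for fingerprint, files in build_index(fingerprints_by_file).items():
        for pair in combinations(sorted(files), 2):
            shared[pair].add(fingerprint)
    rows = []
    for (first, second), common in shared.items():
        union = fingerprints_by_file[first] | fingerprints_by_file[second]
        rows.append({
            "files": (first, second),
            "total_overlap": len(common),
            "similarity": len(common) / len(union),  # assumed definition
        })
    return sorted(rows, key=lambda row: row["similarity"], reverse=True)


# Hypothetical, precomputed fingerprints for three files.
FINGERPRINTS = {
    "a.js": {1, 2, 3, 4},
    "b.js": {2, 3, 4, 5},
    "c.js": {9},
}

for row in report(FINGERPRINTS):
    print(row)
```

Building the index once avoids comparing every pair of files directly; only pairs that actually collide on a fingerprint are ever looked at.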

MinHash-based Code Relationship & Investigation Toolkit (MCRIT) (2021-2025+)

https://github.com/danielplohmann/mcrit
- MinHash-based Code Relationship & Investigation Toolkit (MCRIT) (2021-2025+) MCRIT is a framework created to simplify the application of the MinHash algorithm in the context of code similarity. It can be used to rapidly implement "shinglers", i.e. methods which encode properties of disassembled functions, to then be used for similarity estimation via the MinHash algorithm. It is tailored to work with disassembly reports emitted by SMDA.
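
As a rough illustration of the MinHash idea that MCRIT builds on, the sketch below estimates the Jaccard similarity of two shingle sets from short signatures. It does not use MCRIT's or SMDA's API: the `shingles` helper, the instruction mnemonics and the choice of salted BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib


def shingles(tokens, n=3):
    """Hypothetical 'shingler': every run of n consecutive tokens, as a string."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum hash value.

    `items` must be non-empty.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical mnemonic sequences for two disassembled functions.
func_a = "push mov sub mov call test jz mov add pop ret".split()
func_b = "push mov sub mov call test jnz mov add pop ret".split()

sig_a = minhash_signature(shingles(func_a))
sig_b = minhash_signature(shingles(func_b))
estimate = estimate_jaccard(sig_a, sig_b)
```

The signatures have a fixed length regardless of function size, so comparing two functions costs the same no matter how large they are.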

1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis (2021)

https://arxiv.org/abs/2112.12928
- 1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis (2021)
- Binary similarity analysis is critical to many code-reuse-related issues and "1-to-1" mechanism is widely applied, where one function in a binary file is matched against one function in a source file or binary file. However, we discover that function mapping is a more complex problem of "1-to-n" or even "n-to-n" due to the existence of function inlining.
  
  In this paper, we investigate the effect of function inlining on binary similarity analysis. We first construct 4 inlining-oriented datasets for four similarity analysis tasks, including code search, OSS reuse detection, vulnerability detection, and patch presence test. Then, we further study the extent of function inlining, the performance of existing works under function inlining, and the effectiveness of existing inlining-simulation strategies. Results show that the proportion of function inlining can reach nearly 70%, while most existing works neglect it and use "1-to-1" mechanism. The mismatches cause a 30% loss in performance during code search and a 40% loss during vulnerability detection. Moreover, two existing inlining-simulation strategies can only recover 60% of the inlined functions. We discover that inlining is usually cumulative when optimization increases. Conditional inlining and incremental inlining are suggested to design low-cost and high-coverage inlining-simulation strategies.
- https://arxiv.org/pdf/2112.12928
- https://github.com/island255/TOSEM2022
  - Repository for the paper "1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis"

One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)

https://deepai.org/publication/one-to-one-or-one-to-many-what-function-inlining-brings-to-binary2source-similarity-analysis
- One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)
https://arxiv.org/abs/2112.12928v1
- One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)
- Binary2source code matching is critical to many code-reuse-related tasks, including code clone detection, software license violation detection, and reverse engineering assistance. Existing binary2source works always apply a "1-to-1" (one-to-one) mechanism, i.e., one function in a binary file is matched against one function in a source file. However, we assume that such mapping is usually a more complex problem of "1-to-n" (one-to-many) due to the existence of function inlining. To the best of our knowledge, few existing works have systematically studied the effect of function inlining on binary2source matching tasks. This paper will address this issue. To support our study, we first construct two datasets containing 61,179 binaries and 19,976,067 functions. We also propose an automated approach to label the dataset with line-level and function-level mapping. Based on our labeled dataset, we then investigate the extent of function inlining, the factors affecting function inlining, and the impact of function inlining on existing binary2source similarity methods. Finally, we discuss the interesting findings and give suggestions for designing more effective methodologies.
- https://arxiv.org/pdf/2112.12928v1
- https://github.com/island255/source2binary_dataset_construction
  - Source2binary Dataset Construction This is the repository for the paper "One to One or One to many? What function inline brings to binary similarity analysis".
https://www.researchgate.net/publication/357365866_One-to-One_or_One-to-many_What_function_inlining_brings_to_binary2source_similarity_analysis
- One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis

Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining (2022)

https://arxiv.org/abs/2210.15159
- Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining (2022)
- Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis (SCA). The "1-to-1" mechanism has been applied in existing binary2source matching works, in which one binary function is matched against one source function. However, we discovered that such mapping could be "1-to-n" (one query binary function maps multiple source functions), due to the existence of function inlining.
  
  To help conduct binary2source function matching under function inlining, we propose a method named O2NMatcher to generate Source Function Sets (SFSs) as the matching target for binary functions with inlining. We first propose a model named ECOCCJ48 for inlined call site prediction. To train this model, we leverage the compilable OSS to generate a dataset with labeled call sites (inlined or not), extract several features from the call sites, and design a compiler-opt-based multi-label classifier by inspecting the inlining correlations between different compilations. Then, we use this model to predict the labels of call sites in the uncompilable OSS projects without compilation and obtain the labeled function call graphs of these projects. Next, we regard the construction of SFSs as a sub-tree generation problem and design root node selection and edge extension rules to construct SFSs automatically. Finally, these SFSs will be added to the corpus of source functions and compared with binary functions with inlining. We conduct several experiments to evaluate the effectiveness of O2NMatcher and results show our method increases the performance of existing works by 6% and exceeds all the state-of-the-art works.
- https://arxiv.org/pdf/2210.15159
https://github.com/island255/binary2source-matching-under-function-inlining
- binary2source-matching-under-function-inlining This is the repository illustrating how we label the inlined call sites, train the classifier for ICS prediction, and generate SFSs for binary2source matching.
- Repository for the paper "Binary2Source Function Similarity Detection Under Function Inlining"
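
To make the "1-to-n" idea concrete, here is a heavily simplified sketch of building a Source Function Set from a call graph whose call sites have already been labelled as inlined or not. It is not O2NMatcher's algorithm: the paper's root node selection and edge extension rules are more involved, and the call graph, function names and the "follow every inlined edge" rule below are assumptions made purely for illustration.

```python
def source_function_set(call_graph, root):
    """Collect `root` plus every callee reachable through inlined call sites.

    `call_graph` maps a caller name to a list of (callee, inlined) pairs,
    where `inlined` is the (predicted) label of that call site.
    """
    members, seen, stack = [root], {root}, [root]
    while stack:
        caller = stack.pop()
        for callee, inlined in call_graph.get(caller, []):
            if inlined and callee not in seen:
                seen.add(callee)
                members.append(callee)
                stack.append(callee)
    return members


# Hypothetical labelled call graph.
call_graph = {
    "parse": [("read_token", True), ("report_error", False)],
    "read_token": [("peek", True)],
}
sfs = source_function_set(call_graph, "parse")
```

Under these assumptions, the binary function compiled from `parse` would be compared against the combined source of `parse`, `read_token` and `peek`, rather than against `parse` alone.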

Cross-Inlining Binary Function Similarity Detection (2024)

https://arxiv.org/abs/2401.05739v1
- Cross-Inlining Binary Function Similarity Detection (2024)
- Binary function similarity detection plays an important role in a wide range of security applications. Existing works usually assume that the query function and target function share equal semantics and compare their full semantics to obtain the similarity. However, we find that the function mapping is more complex, especially when function inlining happens.
  
  In this paper, we will systematically investigate cross-inlining binary function similarity detection. We first construct a cross-inlining dataset by compiling 51 projects using 9 compilers, with 4 optimizations, to 6 architectures, with 2 inlining flags, which results in two datasets both with 216 combinations. Then we construct the cross-inlining function mappings by linking the common source functions in these two datasets. Through analysis of this dataset, we find that three cross-inlining patterns widely exist while existing work suffers when detecting cross-inlining binary function similarity. Next, we propose a pattern-based model named CI-Detector for cross-inlining matching. CI-Detector uses the attributed CFG to represent the semantics of binary functions and GNN to embed binary functions into vectors. CI-Detector respectively trains a model for these three cross-inlining patterns. Finally, the testing pairs are input to these three models and all the produced similarities are aggregated to produce the final similarity. We conduct several experiments to evaluate CI-Detector. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
- https://arxiv.org/pdf/2401.05739v1
- https://github.com/island255/cross-inlining_binary_function_similarity
  - The repository of the paper "Cross-Inlining Binary Function Similarity Detection"

Pcode-Similarity (2021)

https://github.com/JackHCC/Pcode-Similarity
- Pcode-Similarity (2021) Algorithm for calculating similarity between function and library function.

Awesome Binary code similarity detection (2021)

https://github.com/JackHCC/Awesome-Binary-Code-Similarity-Detection-2021
- Awesome Binary code similarity detection (2021) Awesome list for Binary Code Similarity Detection in 2021

SCALE: Semantic Code Analysis via Learned Embeddings (2023)

https://github.com/Jaso1024/Semantic-Code-Embeddings
- SCALE: Semantic Code Analysis via Learned Embeddings (2023) 3rd best paper on Artificial Intelligence track | presented at the 2023 International Conference on AI, Blockchain, Cloud Computing and Data Analytics This repository holds the code and supplementary materials for SCALE: Semantic Code Analysis via Learned Embeddings. This research explores the efficacy of contrastive learning alongside large language models as a paradigm for developing a model capable of creating code embeddings indicative of code on a functional level. Existing pre-trained models in NLP have demonstrated impressive success, surpassing previous benchmarks in various language-related tasks. However, when it comes to the field of code understanding, these models still face notable limitations. Code isomorphism, which deals with determining functional similarity between pieces of code, presents a challenging problem for NLP models. In this paper, we explore two approaches to code isomorphism. Our first approach, dubbed SCALE-FT, formulates the problem as a binary classification task, where we feed pairs of code snippets to a Large Language Model (LLM), using the embeddings to predict whether the given code segments are equivalent. The second approach, SCALE-CLR, adopts the SimCLR framework to generate embeddings for individual code snippets. By processing code samples with an LLM and observing the corresponding embeddings, we assess the similarity of two code snippets. These approaches enable us to leverage function-based code embeddings for various downstream tasks, such as code-optimization, code-comment alignment, and code classification. Our experiments on the CodeNet Python800 benchmark demonstrate promising results for both approaches. Notably, our SCALE-FT using Babbage-001 (GPT-3) achieves state-of-the-art performance, surpassing various benchmark models such as GPT-3.5 Turbo and GPT-4. 
  Additionally, Salesforce's 350-million parameter CodeGen, when trained with the SCALE-FT framework, surpasses GPT-3.5 and GPT-4.

binary-sim - binary similarity using Deep learning (2023)

https://github.com/Aida-yy/binary-sim
- binary-sim - binary similarity using Deep learning (2023)
- Features: Function semantic information + control flow graph
  
  Semantic feature extraction: extract the byte data, assembly instruction data, and integer data of the function respectively, use independent encoders (DPCNN, TextCNN) to encode the text representation, and obtain its Embedding representation.
  
  Structural feature extraction: based on the CFG and the assembly instructions in each block, generates an ACFG, uses a graph neural network to encode the ACFG, and obtains its Embedding representation; in addition, considering that the node order of the control flow graph of similar functions is also similar, the CFG's adjacency matrix is taken as input and a CNN is used to obtain its Embedding representation.
  
  Contrastive learning model structure: InfoNCE loss + In-batch negatives
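
For reference, the "InfoNCE loss + in-batch negatives" setup can be sketched in a few lines. This is a plain-Python illustration rather than the repository's code: real training would use a tensor library, and the cosine similarity, the `temperature` value and the toy 2-dimensional embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the matching sample for anchors[i]; every other entry of
    `positives` in the same batch serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the true pair
    return total / len(anchors)


# Toy embeddings: two functions, each compiled twice (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(anchors, positives)
shuffled_loss = info_nce(anchors, positives[::-1])
```

The loss is small when each anchor is closest to its own positive and large when the pairs are mismatched, which is what pushes embeddings of similar functions together.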

Source Code Clone Detection Using Unsupervised Similarity Measures (2024)

https://arxiv.org/abs/2401.09885
- Source Code Clone Detection Using Unsupervised Similarity Measures (2024)
- Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for identifying source code clone detection. The goal is to overview the current state-of-the-art techniques, their strengths, and weaknesses. To do that, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim
- https://github.com/jorge-martinez-gil/codesim
  - Source Code Clone Detection Using Unsupervised Similarity Measures
  - This repository contains the source code for reproducing the paper Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-031-56281-5_2.

Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024)

https://github.com/jorge-martinez-gil/crosslingual-clone-detection
- Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024) Systematic study to determine the best methods to assess the similarity between code snippets in different programming languages

Link Dump 3

Improved Code Summarization via a Graph Neural Network (2020)

https://arxiv.org/abs/2004.02843
- Improved Code Summarization via a Graph Neural Network (2020)
- Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and outputs a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe the using a graph neural network together with source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from machine learning literature.

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2020)

https://arxiv.org/abs/2002.08653
- Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2020)
- Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still can not fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we have concerned, we are the first to apply graph neural networks on the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
- https://github.com/jacobwwh/graphmatch_clone
  - Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
  - Code and data for paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".
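
The idea of augmenting an AST with explicit flow edges can be illustrated with a small sketch. The paper works on Java and defines its own edge types; the version below uses Python's built-in `ast` module as a stand-in, and the three edge names (`child`, `next_stmt`, `next_use`) and the node labelling scheme are assumptions of this example, not the paper's definitions.

```python
import ast


def flow_augmented_edges(source):
    """Return (edge_type, from_label, to_label) triples for a Python snippet."""

    def label(node):
        # Variable occurrences get a position so that repeated uses stay distinct.
        if isinstance(node, ast.Name):
            return f"{node.id}@{node.lineno}:{node.col_offset}"
        return type(node).__name__

    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        # Ordinary AST structure.
        for child in ast.iter_child_nodes(parent):
            if isinstance(child, ast.expr_context):
                continue  # skip Load/Store markers to keep the graph small
            edges.append(("child", label(parent), label(child)))
        # Control flow (sequential execution) between consecutive statements.
        body = getattr(parent, "body", None)
        if isinstance(body, list):
            for first, second in zip(body, body[1:]):
                edges.append(("next_stmt", label(first), label(second)))

    # Data flow: link each variable occurrence to its next occurrence.
    names = sorted(
        (n for n in ast.walk(tree) if isinstance(n, ast.Name)),
        key=lambda n: (n.lineno, n.col_offset),
    )
    last_seen = {}
    for node in names:
        if node.id in last_seen:
            edges.append(("next_use", label(last_seen[node.id]), label(node)))
        last_seen[node.id] = node
    return edges


edges = flow_augmented_edges("x = 1\ny = x + 2\n")
```

A graph neural network would then consume these typed edges instead of the bare tree, so that control and data flow are visible to the model.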

Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)

https://proceedings-of-deim.github.io/DEIM2023/1b-9-4.pdf
- Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)
- While giving comprehensible names to identifiers is essential in software development, it is sometimes difficult since it requires development experience and knowledge of the application domain. Among work to support the developer’s identifier naming, a GNN-based class name estimation approach learns a graph of relationships between program elements, i.e., classes, methods, and fields, but it ignores information within the methods. This study proposes an approach that exploits information from method bodies, which can help estimate correct class names. The proposed approach extends the existing GNN-based approach to use embeddings of the corresponding ASTs for method nodes. An evaluation experiment measures how correctly the proposed approach can estimate class names in large datasets of open-source Java projects. The experimental result shows that the proposed approach improves the estimation correctness compared to the existing approach.

Code Similarity Using Graph Neural Networks (2023)

https://medium.com/stanford-cs224w/code-similarity-using-graph-neural-networks-1e58aa21bd92
- Code Similarity Using Graph Neural Networks (2023)
- Abstract/Summary by ChatGPT 4.5:
  - Code similarity detection is crucial for various software engineering tasks, including plagiarism detection, code search, refactoring, and automated code completion. Traditional approaches rely heavily on syntactic similarity, which fails to capture deeper semantic relationships between code segments. Inspired by recent advances in natural language processing and code intelligence using transformer-based models (e.g., BERT, GPT, and CodeBERT), our work explores the use of Graph Neural Networks (GNNs) to address code similarity through the semantic understanding provided by graph structures.
    
    We evaluate several GNN architectures—including GraphSAGE, Graph Attention Networks (GAT), and a novel OrderGNN leveraging permutation-aware aggregations—on the widely-used POJ-104 dataset, consisting of 32,000 C++ code segments spanning 64 distinct programming problems. Our pipeline involves parsing source code into Abstract Syntax Trees (ASTs) using the CLANG library, transforming these ASTs into NetworkX graphs, and subsequently into PyTorch Geometric (PyG) data objects for input into our GNN models.
    
    Our results demonstrate that permutation-invariant methods such as GraphSAGE and GAT struggle to capture critical ordered structures inherent in programming languages, resulting in limited performance (MAP@R). In contrast, the OrderGNN model, employing LSTM-based aggregation to preserve node ordering information, achieves significantly better semantic similarity identification, highlighting the necessity of permutation-awareness for effective code analysis. Nevertheless, the OrderGNN model presents substantial computational and memory overhead, limiting scalability.
    
    We conclude by suggesting future directions, including the exploration of more memory-efficient permutation-aware aggregation functions and alternative graph representations beyond the standard AST structure to further improve the efficacy and applicability of GNN-based code similarity detection methods.
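
The point about permutation-invariant aggregation can be shown with a toy example. The two aggregators below are illustrative stand-ins, not the article's models: `mean_aggregate` mimics the order-blind neighbourhood pooling used by GraphSAGE-style layers, while `ordered_aggregate` is a deliberately simple recurrent update (an assumption of this sketch) standing in for the LSTM-based aggregation.

```python
def mean_aggregate(child_vectors):
    """Permutation-invariant: the result ignores the order of the children."""
    count = len(child_vectors)
    return [sum(column) / count for column in zip(*child_vectors)]


def ordered_aggregate(child_vectors, decay=0.5):
    """Order-sensitive: a running state updated one child at a time."""
    state = [0.0] * len(child_vectors[0])
    for vector in child_vectors:
        state = [decay * s + v for s, v in zip(state, vector)]
    return state


# Hypothetical embeddings for the operands of `a - b` and `b - a`.
a, b = [1.0, 0.0], [0.0, 1.0]

mean_ab, mean_ba = mean_aggregate([a, b]), mean_aggregate([b, a])
ordered_ab, ordered_ba = ordered_aggregate([a, b]), ordered_aggregate([b, a])
```

With mean pooling the subtraction node sees the same input for `a - b` and `b - a`, whereas the ordered variant yields `[0.5, 1.0]` and `[1.0, 0.5]`, so the two expressions remain distinguishable.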

Link Dump 4

JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games (2020)

https://taoxiease.github.io/publications/icse20seip-jsidentify.pdf
- JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games
- Online mini games are lightweight game apps, typically implemented in JavaScript (JS), that run inside another host mobile app (such as WeChat, Baidu, and Alipay). These mini games do not need to be downloaded or upgraded through an app store, making it possible for one host mobile app to perform the aggregated services of many apps. Hundreds of millions of users play tens of thousands of mini games, which make a great profit, and consequently are popular targets of plagiarism. In cases of plagiarism, deeply obfuscated code cloned from the original code often embodies malicious code segments and copyright infringements, posing great challenges for existing plagiarism detection tools. To address these challenges, in this paper, we design and implement JSidentify, a hybrid framework to detect plagiarism among online mini games. JSidentify includes three techniques based on different levels of code abstraction. JSidentify applies the included techniques in the constructed priority list one by one to reduce overall detection time. Our evaluation results show that JSidentify outperforms other existing related state-of-the-art approaches and achieves the best precision and recall with affordable detection time when detecting plagiarism among online mini games and clones among general JS programs. Our deployment experience of JSidentify also shows that JSidentify is indispensable in the daily operations of online mini games in WeChat.
https://ieeexplore.ieee.org/document/9276581
- JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games
https://www.researchgate.net/publication/344433961_JSidentify_a_hybrid_framework_for_detecting_plagiarism_among_JavaScript_code_in_online_mini_games
- JSidentify: a hybrid framework for detecting plagiarism among JavaScript code in online mini games (June 2020)

Relationship-aware code search for JavaScript frameworks (2016)

https://taoxiease.github.io/publications/fse16-racs.pdf
- Relationship-aware code search for JavaScript frameworks
- JavaScript frameworks, such as jQuery, are widely used for developing web applications. To facilitate using these JavaScript frameworks to implement a feature (e.g., functionality), a large number of programmers often search for code snippets that implement the same or similar feature. However, existing code search approaches tend to be ineffective, without taking into account the fact that JavaScript code snippets often implement a feature based on various relationships (e.g., sequencing, condition, and callback relationships) among the invoked framework API methods. To address this issue, we present a novel Relationship-Aware Code Search (RACS) approach for finding code snippets that use JavaScript frameworks to implement a specific feature. In advance, RACS collects a large number of code snippets that use some JavaScript frameworks, mines API usage patterns from the collected code snippets, and represents the mined patterns with method call relationship (MCR) graphs, which capture framework API methods’ signatures and their relationships. Given a natural language (NL) search query issued by a programmer, RACS conducts NL processing to automatically extract an action relationship (AR) graph, which consists of actions and their relationships inferred from the query. In this way, RACS reduces code search to the problem of graph search: finding similar MCR graphs for a given AR graph. We conduct evaluations against representative real-world jQuery questions posted on Stack Overflow, based on 308,294 code snippets collected from over 81,540 files on the Internet. The evaluation results show the effectiveness of RACS: the top 1 snippet produced by RACS matches the target code snippet for 46% questions, compared to only 4% achieved by a relationship-oblivious approach.
https://dl.acm.org/doi/10.1145/2950290.2950341
- Relationship-aware code search for JavaScript frameworks

Code Search: A Survey of Techniques for Finding Code (2022)

https://arxiv.org/abs/2204.02765
- Code Search: A Survey of Techniques for Finding Code
- The immense amounts of source code provide ample challenges and opportunities during software development. To handle the size of code bases, developers commonly search for code, e.g., when trying to find where a particular feature is implemented or when looking for code examples to reuse. To support developers in finding relevant code, various code search engines have been proposed. This article surveys 30 years of research on code search, giving a comprehensive overview of challenges and techniques that address them. We discuss the kinds of queries that code search engines support, how to preprocess and expand queries, different techniques for indexing and retrieving code, and ways to rank and prune search results. Moreover, we describe empirical studies of code search in practice. Based on the discussion of prior work, we conclude the article with an outline of challenges and opportunities to be addressed in the future.
- https://arxiv.org/pdf/2204.02765
  - Code Search: A Survey of Techniques for Finding Code
https://www.researchgate.net/publication/359786256_Code_Search_A_Survey_of_Techniques_for_Finding_Code
- Code Search: A Survey of Techniques for Finding Code

graph2vec: Learning Distributed Representations of Graphs (2017)

https://arxiv.org/abs/1707.05005
- graph2vec: Learning Distributed Representations of Graphs (2017)
- Recent works on representation learning for graph structured data predominantly focus on learning distributed representations of graph substructures such as nodes and subgraphs. However, many graph analytics tasks such as graph classification and clustering require representing entire graphs as fixed length feature vectors. While the aforementioned approaches are naturally unequipped to learn such representations, graph kernels remain as the most effective way of obtaining them. However, these graph kernels use handcrafted features (e.g., shortest paths, graphlets, etc.) and hence are hampered by problems such as poor generalization. To address this limitation, in this work, we propose a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs. graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic. Hence, they could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Our experiments on several benchmark and large real-world datasets show that graph2vec achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and are competitive with state-of-the-art graph kernels.
- https://arxiv.org/pdf/1707.05005
- https://github.com/benedekrozemberczki/graph2vec
  - Graph2Vec
  - A parallel implementation of "graph2vec: Learning Distributed Representations of Graphs" (MLGWorkshop 2017).
  - The model is now also available in the Karate Club package.
- https://github.com/annamalai-nr/graph2vec_tf
  - This repository contains the "tensorflow" implementation of our paper "graph2vec: Learning distributed representations of graphs".
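
graph2vec treats a whole graph as a "document" whose "words" are rooted subgraphs obtained by Weisfeiler-Lehman relabelling, and then learns an embedding per document. The sketch below covers only the feature-extraction half, using the standard library; the embedding step is omitted, and the function name, the MD5-based label compression and the tiny AST-like graphs are assumptions made for illustration.

```python
import hashlib


def wl_subgraph_words(adjacency, labels, iterations=2):
    """Weisfeiler-Lehman relabelling: one 'word' per node per iteration.

    `adjacency` maps each node to its neighbours, `labels` maps each node to
    an initial label (for code graphs, e.g. the AST node type).
    """
    current = dict(labels)
    words = list(current.values())
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            signature = current[node] + "|" + ",".join(
                sorted(current[n] for n in neighbours)
            )
            updated[node] = hashlib.md5(signature.encode()).hexdigest()[:8]
        current = updated
        words.extend(current.values())
    return sorted(words)


# Two structurally identical graphs with different node ids (hypothetical).
graph_a = {0: [1], 1: [0, 2], 2: [1]}
labels_a = {0: "Identifier", 1: "BinaryExpression", 2: "Literal"}
graph_b = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
labels_b = {"x": "Identifier", "y": "BinaryExpression", "z": "Literal"}

words_a = wl_subgraph_words(graph_a, labels_a)
words_b = wl_subgraph_words(graph_b, labels_b)
```

Because the words depend only on labels and structure, not on node ids, structurally identical graphs produce the same bag of words.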

SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (2018; revised 2020)

https://arxiv.org/abs/1808.05689
- SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (2018; revised 2020)
- Graph similarity search is among the most important graph-based applications, e.g. finding the chemical compounds that are most similar to a query compound. Graph similarity computation, such as Graph Edit Distance (GED) and Maximum Common Subgraph (MCS), is the core operation of graph similarity search and many other applications, but very costly to compute in practice. Inspired by the recent success of neural network approaches to several graph applications, such as node or graph classification, we propose a novel neural network based approach to address this classic yet challenging graph problem, aiming to alleviate the computational burden while preserving a good performance.

  The proposed approach, called SimGNN, combines two strategies. First, we design a learnable embedding function that maps every graph into a vector, which provides a global summary of a graph. A novel attention mechanism is proposed to emphasize the important nodes with respect to a specific similarity metric. Second, we design a pairwise node comparison method to supplement the graph-level embeddings with fine-grained node-level information. Our model achieves better generalization on unseen graphs, and in the worst case runs in quadratic time with respect to the number of nodes in two graphs. Taking GED computation as an example, experimental results on three real graph datasets demonstrate the effectiveness and efficiency of our approach. Specifically, our model achieves smaller error rate and great time reduction compared against a series of baselines, including several approximation algorithms on GED computation, and many existing graph neural network based models. To the best of our knowledge, we are among the first to adopt neural networks to explicitly model the similarity between two graphs, and provide a new direction for future research on graph similarity computation and graph similarity search.
- https://arxiv.org/pdf/1808.05689
https://github.com/benedekrozemberczki/SimGNN
- SimGNN
- A PyTorch implementation of "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation" (WSDM 2019).

awesome-network-embedding

https://github.com/chihming/awesome-network-embedding
- awesome-network-embedding
- A curated list of network embedding techniques.
- Also called network representation learning, graph embedding, knowledge embedding, etc.
  
  The task is to learn the representations of the vertices from a given network.

Karate Club

https://karateclub.readthedocs.io/en/latest/
- Karate Club is an unsupervised machine learning extension library for NetworkX. It builds on other open source linear algebra, machine learning, and graph signal processing libraries such as Numpy, Scipy, Gensim, PyGSP, and Scikit-Learn. Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and pieces from prominent journals.

NetworkX

https://networkx.org/
- NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
- Software for complex networks
  - Data structures for graphs, digraphs, and multigraphs
  - Many standard graph algorithms
  - Network structure and analysis measures
  - Generators for classic graphs, random graphs, and synthetic networks
  - Nodes can be "anything" (e.g., text, images, XML records)
  - Edges can hold arbitrary data (e.g., weights, time-series)
  - Open source 3-clause BSD license
  - Well tested with over 90% code coverage
  - Additional benefits from Python include fast prototyping, easy to teach, and multi-platform

Link Dump 5

ResearchGate

https://www.researchgate.net/
- ResearchGate
- Discover research
- Access over 160 million publication pages and stay up to date with what's happening in your field.

ScienceDirect

https://www.sciencedirect.com/
- ScienceDirect
- Search for peer-reviewed journal articles and book chapters (including open access content)

Kaggle - Dataset Search

https://www.kaggle.com/datasets
- Kaggle - Datasets
- Explore, analyze, and share quality data.
- Learn more about data types, creating, and collaborating.
  - https://www.kaggle.com/docs/datasets

Google Research - Dataset Search

https://datasetsearch.research.google.com/
- Dataset Search

js150 - 150k Javascript Dataset

https://www.sri.inf.ethz.ch/js150
- js150
- 150k Javascript Dataset
- We provide a dataset consisting of 150'000 JavaScript files and their corresponding parsed ASTs that were used to train and evaluate the DeepSyn tool. The JavaScript programs are collected from GitHub repositories by removing duplicate files, removing project forks (copy of another existing repository), keeping only programs that parse and we aim to remove obfuscated files. For parsing we used the error-tolerant Acorn parser (using the parse_dammit interface). The dataset is split into two parts -- 100'000 files used for training and 50'000 files used for evaluation.

Top 5200+ NPM Packages Dataset (2024)

https://huggingface.co/datasets/deepklarity/top-npm-packages
- Top NPM Packages Dataset
  
  This dataset contains a snapshot of Top 5200+ popular node packages hosted on Node Package Manager
  
  The dataset was scraped in October-2024.
  
  We aim to use this dataset to perform analysis and identify trends and get a bird's eye view of nodejs ecosystem.

npm-follower: A Complete Dataset Tracking the NPM Ecosystem (2023)

https://arxiv.org/abs/2308.12545
- npm-follower: A Complete Dataset Tracking the NPM Ecosystem (2023)
- Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that provide researchers with easy access to metadata and code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM can not be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This data is critical for researchers as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture which archives metadata and code of all packages and versions as they are published, and is thus able to retain data which is later deleted. The dataset currently includes over 35 million versions of packages, and grows at a rate of about 1 million versions per month. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at this https URL.
  - https://web.archive.org/web/20230404134909/https://dependencies.science/
    - NPM Datasets for Open Science
    - Datasets collected from the NPM package registry of complete dependency and code data. Designed to support easy data analysis of the NPM ecosystem
    - https://web.archive.org/web/20240614230027/https://dependencies.science/docs/
      - Documentation
      - Documentation on the structure of dependencies.science data
    - https://web.archive.org/web/20240614220607/https://dependencies.science/downloads/
      - Downloads
      - NPM Package Metadata (PostgreSQL, ~30 GB)
      - NPM Package Code (~20 TB)
        
        The full source code of all scraped packages is hosted on HuggingFace.
        
        https://huggingface.co/datasets/nuprl/npm-follower-data
https://www.jonbell.net/preprint/fse23-demo-npmfollower.pdf
- npm-follower: A Complete Dataset Tracking the NPM Ecosystem (2023)

Awesome-Info-Inferring-Binary

https://github.com/yasong/Awesome-Info-Inferring-Binary
- Awesome-Info-Inferring-Binary
- A collection of papers, tools about type inferring, variable renaming, and function name inferring on stripped binary executables.

Awesome Binary Similarity

https://github.com/SystemSecurityStorm/Awesome-Binary-Similarity
- Awesome Binary Similarity
- An awesome & curated list of binary code similarity papers

JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation (2025)

See Also:
- https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#jsdeobsbench
https://arxiv.org/abs/2506.20170v1
- JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation (2025)
- Deobfuscating JavaScript (JS) code poses a significant challenge in web security, particularly as obfuscation techniques are frequently used to conceal malicious activities within scripts. While Large Language Models (LLMs) have recently shown promise in automating the deobfuscation process, transforming detection and mitigation strategies against these obfuscated threats, a systematic benchmark to quantify their effectiveness and limitations has been notably absent. To address this gap, we present JsDeObsBench, a dedicated benchmark designed to rigorously evaluate the effectiveness of LLMs in the context of JS deobfuscation. We detail our benchmarking methodology, which includes a wide range of obfuscation techniques ranging from basic variable renaming to sophisticated structure transformations, providing a robust framework for assessing LLM performance in real-world scenarios. Our extensive experimental analysis investigates the proficiency of cutting-edge LLMs, e.g., GPT-4o, Mixtral, Llama, and DeepSeek-Coder, revealing superior performance in code simplification despite challenges in maintaining syntax accuracy and execution reliability compared to baseline methods. We further evaluate the deobfuscation of JS malware to exhibit the potential of LLMs in security scenarios. The findings highlight the utility of LLMs in deobfuscation applications and pinpoint crucial areas for further improvement.

DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios (2025)

https://arxiv.org/abs/2505.11340v1
- DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios (2025)
- Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering workflows through three key components: real-world function extraction (comprising 23,400 functions from 130 real-world programs), runtime-aware validation, and automated human-centric assessment using LLM-as-Judge to quantify the effectiveness of decompilers in reverse engineering workflows. Through a systematic comparison between six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functionality correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open source DecompileBench to provide a framework to advance research on decompilers and assist security experts in making informed tool selections based on their specific requirements.
- https://github.com/vul337/DecompileBench
  - DecompileBench
  - This repository provides scripts and tools for evaluating the performance of decompilation processes using both traditional decompilers and large language models (LLMs). It is used in the paper "DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios".

A Survey on Large Language Models for Software Engineering (2024)

https://arxiv.org/abs/2312.15223
- A Survey on Large Language Models for Software Engineering (2024)
- Software Engineering (SE) is the systematic design, development, maintenance, and management of software applications underpinning the digital infrastructure of our modern world. Very recently, the SE community has seen a rapidly increasing number of techniques employing Large Language Models (LLMs) to automate a broad range of SE tasks. Nevertheless, existing information of the applications, effects, and possible limitations of LLMs within SE is still not well-studied.
  
  In this paper, we provide a systematic survey to summarize the current state-of-the-art research in the LLM-based SE community. We summarize 62 representative LLMs of Code across three model architectures, 15 pre-training objectives across four categories, and 16 downstream tasks across five categories. We then present a detailed summarization of the recent SE studies for which LLMs are commonly utilized, including 947 studies for 112 specific code-related tasks across five crucial phases within the SE workflow. We also discuss several critical aspects during the integration of LLMs into SE, such as empirical evaluation, benchmarking, security and reliability, domain tuning, compressing and distillation. Finally, we highlight several challenges and potential opportunities on applying LLMs for future SE studies, such as exploring domain LLMs and constructing clean evaluation datasets. Overall, our work can help researchers gain a comprehensive understanding about the achievements of the existing LLM-based SE studies and promote the practical application of these techniques. Our artifacts are publicly available and will be continuously updated at the living repository: this https URL.
- https://github.com/iSEngLab/AwesomeLLM4SE
  - Large Language Models for Software Engineering
  - A collection of academic publications and methodologies on the classification of Code Large Language Models' pre-training tasks, downstream tasks, and the application of Large Language Models in the field of Software Engineering (LLM4SE).
  - We welcome all researchers to contribute to this repository and further contribute to the knowledge of the Large Language Models with Software Engineering field.

How Far Have We Gone in Binary Code Understanding Using Large Language Models (2024)

https://arxiv.org/abs/2404.09836v3
- How Far Have We Gone in Binary Code Understanding Using Large Language Models (2024)
- Binary code analysis plays a pivotal role in various software security applications, such as software maintenance, malware detection, software vulnerability discovery, patch analysis, etc. However, unlike source code, understanding binary code is challenging for reverse engineers due to the absence of semantic information. Therefore, automated tools are needed to assist human players in interpreting binary code. In recent years, two groups of technologies have shown promising prospects: (1) Deep learning-based technologies have demonstrated competitive results in tasks related to binary code understanding, furthermore, (2) Large Language Models (LLMs) have been extensively pre-trained at the source-code level for tasks such as code understanding and generation. This makes participants wonder about the ability of LLMs in binary code understanding.
  
  In this work, we propose a benchmark to evaluate the effectiveness of LLMs in real-world reverse engineering scenarios. The benchmark covers two key binary code understanding tasks, including function name recovery and binary code summarization. We gain valuable insights into their capabilities and limitations through extensive evaluations of popular LLMs using our benchmark. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis. Our results highlight the great potential of the LLMs in advancing the field of binary code understanding.

Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance (2024)

https://arxiv.org/abs/2404.08817
- Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance (2024)
- This paper revisits recent code similarity evaluation metrics, particularly focusing on the application of Abstract Syntax Tree (AST) editing distance in diverse programming languages. In particular, we explore the usefulness of these metrics and compare them to traditional sequence similarity metrics. Our experiments showcase the effectiveness of AST editing distance in capturing intricate code structures, revealing a high correlation with established metrics. Furthermore, we explore the strengths and weaknesses of AST editing distance and prompt-based GPT similarity scores in comparison to BLEU score, execution match, and Jaccard Similarity. We propose, optimize, and publish an adaptable metric that demonstrates effectiveness across all tested languages, representing an enhanced version of Tree Similarity of Edit Distance (TSED).
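
To make the AST edit distance idea concrete, here is a minimal sketch in Python (standard library only). It is not the paper's implementation: the normalisation `max(1 - distance / max_nodes, 0)` is my reading of how TSED is described, the edit distance is a naive memoised recursion with unit costs that is only practical for small snippets, and the helper names (`to_tree`, `forest_distance`, `tsed`) are hypothetical. It parses Python rather than JavaScript purely so that it runs without third-party parsers.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert an ast node into a hashable (label, children) tuple.

    Only the node *type* is kept as the label, so identifier names and
    literal values are ignored (an assumption made for this sketch).
    """
    return (type(node).__name__,
            tuple(to_tree(child) for child in ast.iter_child_nodes(node)))


def size(forest):
    """Count the nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Ordered tree/forest edit distance with unit costs (naive recursion)."""
    if not f1:
        return size(f2)
    if not f2:
        return size(f1)
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # Delete the rightmost root of f1; its children are promoted.
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # Insert the rightmost root of f2.
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # Match the two rightmost roots (cost 1 if their labels differ).
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2),
    )


def tsed(src1, src2):
    """Normalised similarity in [0, 1]; 1.0 means structurally identical."""
    t1, t2 = to_tree(ast.parse(src1)), to_tree(ast.parse(src2))
    delta = forest_distance((t1,), (t2,))
    return max(1 - delta / max(size((t1,)), size((t2,))), 0.0)


original = "def f(a):\n    return a + 1"
renamed = "def g(x):\n    return x + 1"
changed = "def f(a):\n    return a * 2 + 1"
```

Because the labels ignore identifier names, `tsed(original, renamed)` treats the two functions as structurally identical, which is the property that makes tree-based metrics interesting for matching minified code against its original source; `tsed(original, changed)` scores lower because the expression tree differs.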

A Web Application Fingerprint Recognition Method Based on Machine Learning (2024)

https://www.researchgate.net/publication/379841224_A_Web_Application_Fingerprint_Recognition_Method_Based_on_Machine_Learning
- A Web Application Fingerprint Recognition Method Based on Machine Learning (2024)
- Web application fingerprint recognition is an effective security technology designed to identify and classify web applications, thereby enhancing the detection of potential threats and attacks. Traditional fingerprint recognition methods, which rely on preannotated feature matching, face inherent limitations due to the ever-evolving nature and diverse landscape of web applications. In response to these challenges, this work proposes an innovative web application fingerprint recognition method founded on clustering techniques. The method involves extensive data collection from the Tranco List, employing adjusted feature selection built upon Wappalyzer and noise reduction through truncated SVD dimensionality reduction. The core of the methodology lies in the application of the unsupervised OPTICS clustering algorithm, eliminating the need for preannotated labels. By transforming web applications into feature vectors and leveraging clustering algorithms, our approach accurately categorizes diverse web applications, providing comprehensive and precise fingerprint recognition. The experimental results, which are obtained on a dataset featuring various web application types, affirm the efficacy of the method, demonstrating its ability to achieve high accuracy and broad coverage. This novel approach not only distinguishes between different web application types effectively but also demonstrates superiority in terms of classification accuracy and coverage, offering a robust solution to the challenges of web application fingerprint recognition.
https://www.sciencedirect.com/org/science/article/pii/S1526149224001541
- A Web Application Fingerprint Recognition Method Based on Machine Learning (2024)
https://www.techscience.com/CMES/v140n1/56177
- A Web Application Fingerprint Recognition Method Based on Machine Learning (2024)
- https://www.techscience.com/CMES/v140n1/56177/pdf
  - https://cdn.techscience.cn/files/CMES/2024/TSP_CMES-140-1/TSP_CMES_46140/TSP_CMES_46140.pdf

ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped Binaries (2024)

https://dl.acm.org/doi/10.1145/3658644.3670340
- ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped Binaries (2024)
- Decompilation aims to recover a binary executable to the source code form and hence has a wide range of applications in cyber security, such as malware analysis and legacy code hardening. A prominent challenge is to recover variable symbols, including both primitive and complex types such as user-defined data structures, along with their symbol information such as names and types. Existing efforts focus on solving parts of the problem, e.g., recovering only types (without names) or only local variables (without user-defined structures). In this paper, we propose ReSym, a novel hybrid technique that combines Large Language Models (LLMs) and program analysis to recover both names and types for local variables and user-defined data structures. Our method encompasses fine-tuning two LLMs to handle local variables and structures, respectively. To overcome the token limitations inherent in current LLMs, we devise a novel Prolog-based algorithm to aggregate and cross-check results from multiple LLM queries, suppressing uncertainty and hallucinations. Our experiments show that ReSym is effective in recovering variable information and user-defined data structures, substantially outperforming the state-of-the-art methods.
- https://www.cs.purdue.edu/homes/lintan/publications/resym-ccs24.pdf
  - ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped Binaries (2024)

Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary (2024)

https://arxiv.org/abs/2306.02546v4
- Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary (2024)
- Decompilation aims to recover the source code form of a binary executable. It has many security applications, such as malware analysis, vulnerability detection, and code hardening. A prominent challenge in decompilation is to recover variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases. We build a prototype, GenNm, from pre-trained generative models CodeGemma-2B, CodeLlama-7B, and CodeLlama-34B. We finetune GenNm on decompiled functions and teach models to leverage contextual information. GenNm includes names from callers and callees while querying a function, providing rich contextual information within the model's input token limitation. We mitigate model biases by aligning the output distribution of models with symbol preferences of developers. Our results show that GenNm improves the state-of-the-art name recovery precision by 5.6-11.4 percentage points on two commonly used datasets and improves the state-of-the-art by 32% (from 17.3% to 22.8%) in the most challenging setup where ground-truth variable names are not seen in the training dataset.

DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class (2023)

https://dl.acm.org/doi/full/10.1145/3546946
- DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class (2023)
- The decompiler is one of the most common tools for examining executable binaries without the corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Unfortunately, decompiler output is far from readable because the decompilation process is often incomplete. State-of-the-art techniques use machine learning to predict missing information like variable names. While these approaches are often able to suggest good variable names in context, no existing work examines how the selection of training data influences these machine learning models. We investigate how data provenance and the quality of training data affect performance, and how well, if at all, trained models generalize across software domains. We focus on the variable renaming problem using one such machine learning model, DIRE. We first describe DIRE in detail and the accompanying technique used to generate training data from raw code. We also evaluate DIRE’s overall performance without respect to data quality. Next, we show how training on more popular, possibly higher quality code (measured using GitHub stars) leads to a more generalizable model because popular code tends to have more diverse variable names. Finally, we evaluate how well DIRE predicts domain-specific identifiers, propose a modification to incorporate domain information, and show that it can predict identifiers in domain-specific scenarios 23% more frequently than the original DIRE model.
- https://cmustrudel.github.io/papers/dramko2022tosem.pdf
  - DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class (2023)

A lightweight framework for function name reassignment based on large-scale stripped binaries (2021)

https://dl.acm.org/doi/10.1145/3460319.3464804
- A lightweight framework for function name reassignment based on large-scale stripped binaries (2021)
- Software in the wild is usually released as stripped binaries that contain no debug information (e.g., function names). This paper studies the issue of reassigning descriptive names for functions to help facilitate reverse engineering. Since the essence of this issue is a data-driven prediction task, persuasive research should be based on sufficiently large-scale and diverse data. However, prior studies can only be based on small-scale datasets because their techniques suffer from heavyweight binary analysis, making them powerless in the face of big-size and large-scale binaries.
  
  This paper presents the Neural Function Rename Engine (NFRE), a lightweight framework for function name reassignment that utilizes both sequential and structural information of assembly code. NFRE uses fine-grained and easily acquired features to model assembly code, making it more effective and efficient than existing techniques. In addition, we construct a large-scale dataset and present two data-preprocessing approaches to help improve its usability. Benefiting from the lightweight design, NFRE can be efficiently trained on the large-scale dataset, thereby having better generalization capability for unknown functions. The comparative experiments show that NFRE outperforms two existing techniques by a relative improvement of 32% and 16%, respectively, while the time cost for binary analysis is much less.
- http://staff.ustc.edu.cn/~zhangwm/Paper/2021_17.pdf
  - A lightweight framework for function name reassignment based on large-scale stripped binaries (2021)

DIRE: A Neural Approach to Decompiled Identifier Naming (2019)

https://arxiv.org/abs/1909.09029
- DIRE: A Neural Approach to Decompiled Identifier Naming (2019)
- The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Decompilers can reconstruct much of the information that is lost during the compilation process (e.g., structure and type information). Unfortunately, they do not reconstruct semantically meaningful variable names, which are known to increase code understandability. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information recovered by the decompiler. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.
- https://cmustrudel.github.io/papers/ase19dire.pdf
  - DIRE: A Neural Approach to Decompiled Identifier Naming (2019)

Recovering Variable Names for Minified Code with Usage Contexts (2019) (AKA: JSNeat)

https://arxiv.org/abs/1906.03488
- Recovering Variable Names for Minified Code with Usage Contexts (2019)
- In modern Web technology, JavaScript (JS) code plays an important role. To avoid the exposure of original source code, the variable names in JS code deployed in the wild are often replaced by short, meaningless names, thus making the code extremely difficult to understand and analyze manually. This paper presents JSNeat, an information retrieval (IR)-based approach to recover the variable names in minified JS code. JSNeat follows a data-driven approach to recover names by searching for them in a large corpus of open-source JS code. We use three types of contexts to match a variable in given minified code against the corpus including the context of properties and roles of the variable, the context of that variable and relations with other variables under recovery, and the context of the task of the function to which the variable contributes. We performed several empirical experiments to evaluate JSNeat on the dataset of more than 322K JS files with 1M functions, and 3.5M variables with 176K unique variable names. We found that JSNeat achieves a high accuracy of 69.1%, with relative improvements of 66.1% and 43% over two state-of-the-art approaches JSNice and JSNaughty, respectively. The time to recover for a file or for a variable with JSNeat is twice as fast as with JSNice and 4x as fast as with JSNaughty, respectively.
https://github.com/mrstarrynight/JSNeat
- JSNeat official website
- Link: https://mrstarrynight.github.io/JSNeat/
- https://mrstarrynight.github.io/JSNeat/
  - JSNeat
  - My mission is to make Javascript files available to everyone
  - Corpus
    - Total corpus
      
      We collected a corpus of 12,000 open-source JavaScript projects from GitHub with highest ratings.
      - https://raw.githubusercontent.com/mrstarrynight/JSNeat/master/JS-stars-5-ranked-by-stars.csv
    - Relation graph data
      
      To build the relation graphs, we used Rhino to parse the JS files and extract the context information. The table below shows the statistics of our dataset G of relation graphs.
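
As a rough illustration of the information-retrieval idea (looking up a name in a corpus by the context a variable is used in), here is a toy Python sketch. It only models one kind of context, the set of properties accessed on a variable, and scores candidates with Jaccard similarity. The corpus contents, the threshold, and the function names are all hypothetical; this is not JSNeat's actual algorithm or data format.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical corpus: original variable name -> properties observed on it.
CORPUS = {
    "request": {"url", "method", "headers", "body"},
    "response": {"status", "headers", "json", "send"},
    "element": {"appendChild", "classList", "style", "id"},
}


def recover_name(observed_props, corpus=CORPUS, threshold=0.3):
    """Return the best-matching corpus name for a minified variable, or None.

    observed_props: properties accessed on the minified variable.
    threshold: minimum similarity required to suggest a name (an assumption).
    """
    best_name, best_score = None, 0.0
    for name, props in corpus.items():
        score = jaccard(observed_props, props)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

For example, a minified variable `t` used as `t.status`, `t.json()` and `t.send()` would be looked up with `recover_name({"status", "json", "send"})`, while a variable whose accessed properties match nothing in the corpus yields `None` rather than a guess.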

Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts (2018)

https://arxiv.org/abs/1809.05193
- Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts (2018)
- Most of the JavaScript code deployed in the wild has been minified, a process in which identifier names are replaced with short, arbitrary and meaningless names. Minified code occupies less space, but also makes the code extremely difficult to manually inspect and understand. This paper presents Context2Name, a deep learning-based technique that partially reverses the effect of minification by predicting natural identifier names for minified names. The core idea is to predict from the usage context of a variable a name that captures the meaning of the variable. The approach combines a lightweight, token-based static analysis with an auto-encoder neural network that summarizes usage contexts and a recurrent neural network that predicts natural names for a given usage context. We evaluate Context2Name with a large corpus of real-world JavaScript code and show that it successfully predicts 47.5% of all minified identifiers while taking only 2.9 milliseconds on average to predict a name. A comparison with the state-of-the-art tools JSNice and JSNaughty shows that our approach performs comparably in terms of accuracy while improving in terms of efficiency. Moreover, Context2Name complements the state-of-the-art by predicting 5.3% additional identifiers that are missed by both existing tools.
https://github.com/rbavishi/Context2Name
- Context2Name
- The training and testing dataset is a derivative of the js150 dataset. Duplicates and common entries between the training and testing set have been removed.
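
The "lightweight, token-based static analysis" part can be pictured as collecting the tokens that surround each occurrence of a minified identifier. The Python sketch below is only an approximation under stated assumptions: the regex tokenizer is deliberately crude (it does not handle comments, template literals, or regex literals), the window size is arbitrary, and `usage_contexts` is a hypothetical helper rather than anything from the Context2Name code base. The neural networks that consume these contexts are out of scope here.

```python
import re
from collections import defaultdict

# Very rough JavaScript tokenizer: identifiers, numbers, simple string
# literals, then any single punctuation character.
TOKEN_RE = re.compile(
    r'''[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|"[^"]*"|'[^']*'|[^\s\w]'''
)


def usage_contexts(source, minified_names, window=2):
    """Map each minified name to a list of (tokens_before, tokens_after).

    One entry is recorded per occurrence of the name in the token stream.
    """
    tokens = TOKEN_RE.findall(source)
    contexts = defaultdict(list)
    for i, token in enumerate(tokens):
        if token in minified_names:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            contexts[token].append((before, after))
    return contexts


example = "function f(a,b){return a.length+b}"
contexts = usage_contexts(example, {"a", "b"})
```

Here `contexts["a"]` holds two entries, one for the parameter declaration and one for the `a.length` usage; the second is the kind of signal (a `.length` access right after `return`) that a model could use to suggest a name such as `array` or `str`.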

A Survey of Machine Learning for Big Code and Naturalness (2018)

https://arxiv.org/abs/1709.06182
- A Survey of Machine Learning for Big Code and Naturalness (2018)
- Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.

Machine Learning for Big Code and Naturalness (living literature review)

https://ml4code.github.io/
- Machine Learning for Big Code and Naturalness
- Research on machine learning for source code.
- https://ml4code.github.io/papers.html
  - Search across all paper titles, abstracts, authors by using the search field. Please consider contributing by updating the information of existing papers or adding new work.
- https://github.com/ml4code/ml4code.github.io
  - Machine Learning for Big Code and Naturalness
  - This is the repository for the Survey of Machine Learning for Big Code and Naturalness. Please look at the website for more information about the survey and for information about contributing your work or taxonomy to the website.
  - This research area is evolving so fast that a static review cannot keep up. But a website can! We hope to make this site a living document. Anyone can add a paper to this web site, essentially by creating one Markdown file.

Suggesting meaningful variable names for decompiled code: a machine translation approach (2017)

Note: This seems to be the same method, released at basically the same time as JSNaughty; just applied to decompiled code rather than JavaScript.
https://dl.acm.org/doi/10.1145/3106237.3121274
- Suggesting meaningful variable names for decompiled code: a machine translation approach (2017)
- Decompiled code lacks meaningful variable names. We used statistical machine translation to suggest variable names that are natural given the context. This technique has previously been successfully applied to obfuscated JavaScript code, but decompiled C code poses unique challenges in constructing an aligned corpus and selecting the best translation from among several candidates.
https://cmustrudel.github.io/papers/fse17src.pdf
- Suggesting Meaningful Variable Names for Decompiled Code: A Machine Translation Approach (2017)

Recovering clear, natural identifiers from obfuscated JS names (2017) (AKA: JSNaughty, Autonym)

https://www.researchgate.net/publication/318868712_Recovering_clear_natural_identifiers_from_obfuscated_JS_names
- Well-chosen variable names are critical to source code readability, reusability, and maintainability. Unfortunately, in deployed JavaScript code (which is ubiquitous on the web) the identifier names are frequently minified and overloaded. This is done both for efficiency and also to protect potentially proprietary intellectual property. In this paper, we describe an approach based on statistical machine translation (SMT) that recovers some of the original names from the JavaScript programs minified by the very popular UglifyJS. This simple tool, Autonym, performs comparably to the best currently available deobfuscator for JavaScript, JSNice, which uses sophisticated static analysis. In fact, Autonym is quite complementary to JSNice, performing well when it does not, and vice versa. We also introduce a new tool, JSNaughty, which blends Autonym and JSNice, and significantly outperforms both at identifier name recovery, while remaining just as easy to use as JSNice. JSNaughty is available online at http://jsnaughty.org.
- http://jsnaughty.org
  - https://web.archive.org/web/20180810004744/http://jsnaughty.org/
    - Got an HTTP 302 response at crawl time. Redirecting to... https://goo.gl/riyTFp
    - https://web.archive.org/web/20180810004741/https://goo.gl/riyTFp
      - Got an HTTP 301 response at crawl time. Redirecting to... http://tardigrade.andrew.cmu.edu:8000/
      - https://web.archive.org/web/20180810004746/http://tardigrade.andrew.cmu.edu:8000/
        
        JSNaughty
        
        http://tardigrade.andrew.cmu.edu:8000/
      - https://web.archive.org/web/20190901071846/http://tardigrade.andrew.cmu.edu:8000/about/
        
        About JSNaughty
        
        JSNaughty is a tool for recovering names from obfuscated Javascript files. It is based on framing the deobfuscation problem as a language translation problem - we translate the obfuscated names to meaningful names using the context in which variables are defined.
        
        To do this, the tool makes use of the Moses statistical machine translation framework to perform the translation, along with some pre- and post-processing to handle differences between code and natural language - such as code's requirement for consistency within scope.
        
        The source code for the tool can be found on our GitHub page here. In addition to the website, a more flexible environment to run the tool can be found on DockerHub, with additional instructions on running the Docker container and the tool inside it in the README file here.
        
        Using this website:
        
        The Home page allows the recovery of names obfuscated by the UglifyJS obfuscation tool. You can paste the Javascript obfuscated with this tool in the left text box (replacing the example), and the tool will produce renamed (and indented if needed) Javascript. If you include the JSNice mixing option then the tool combines the names suggested by Moses with names suggested by JSNice. Our experiments show this is the most effective option. You can disable including the JSNice names by turning it off, or follow the link to try the renaming with just the JSNice tools.
https://cmustrudel.github.io/projects/jsnaughty/
- JSNaughty
- https://cmustrudel.github.io/papers/fse17jsnaughty.pdf
  - Recovering Clear, Natural Identifiers from Obfuscated JS Names (2017)
  - https://cmustrudel.github.io/slides/fse17talk.pdf
    - Slides
https://www.cs.ucdavis.edu/~devanbu/jsnaughty.pdf
- Recovering Clear, Natural Identifiers from Obfuscated JS Names (2017)
https://ml4code.github.io/publications/vasilescu2017recovering/
- Recovering Clear, Natural Identifiers from Obfuscated JS Names (2017)
https://github.com/bvasiles/jsNaughty
- jsNaughty
- JS reverse minifier based on statistical machine translation
- jsNaughty is a tool for recovering names from obfuscated Javascript files. It is based on framing the deobfuscation problem as a language translation problem - we translate the obfuscated names to meaningful names using the context in which variables are defined.
  
  To do this, the tool makes use of the Moses statistical machine translation framework to perform the translation, along with some pre- and post-processing to handle code-specific considerations.
  - https://www2.statmt.org/moses/
    - Welcome to Moses!
    - Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices.
    - https://github.com/moses-smt/mosesdecoder
      - Moses, the machine translation system
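
The JSNaughty description above mentions post-processing to handle "code's requirement for consistency within scope". A minimal sketch of what such a step could look like follows, assuming the translation step yields one suggested name per occurrence of a minified identifier. The function name `pick_consistent_names`, the tuple layout, and the majority-vote rule are all hypothetical illustrations, not JSNaughty's actual implementation.

```python
from collections import Counter, defaultdict


def pick_consistent_names(suggestions):
    """Choose one name per (scope, minified identifier) by majority vote.

    suggestions: iterable of (scope_id, minified_name, suggested_name),
    one entry per occurrence of the identifier (hypothetical input format).
    """
    votes = defaultdict(Counter)
    for scope_id, minified, suggested in suggestions:
        votes[(scope_id, minified)][suggested] += 1

    chosen = {}
    taken = defaultdict(set)  # names already assigned within each scope
    # Resolve the identifiers with the most occurrences first.
    for key in sorted(votes, key=lambda k: -sum(votes[k].values())):
        scope_id, minified = key
        for candidate, _count in votes[key].most_common():
            if candidate not in taken[scope_id]:
                chosen[key] = candidate
                taken[scope_id].add(candidate)
                break
        else:
            # Every candidate collides with a name already used in this
            # scope, so keep the minified name rather than invent one.
            chosen[key] = minified
    return chosen


example = [
    (0, "a", "index"), (0, "a", "i"), (0, "a", "index"),
    (0, "b", "index"), (0, "b", "count"),
    (1, "a", "el"),
]
names = pick_consistent_names(example)
```

Here every occurrence of `a` in scope 0 receives the same name, and `b` cannot reuse a name already taken in that scope, which is the kind of constraint natural-language translation does not have.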

Predicting Program Properties from "Big Code" (2015) (AKA: JSNice)

http://jsnice.org/
- JSNice: Statistical renaming, Type inference and Deobfuscation
- https://www.sri.inf.ethz.ch/jsnice
  - Programming tools with big data and conditional random fields
  - A new research paper showing how to build programming tools based on probabilistic models learned from massive codebases recently appeared at the ACM Principles of Programming Languages Conference, 2015 (ACM POPL’15). The paper presents a machine learning framework for predicting facts about programs based on probabilistic graphical models.
  - The full PDF of the paper can be found here:
    - https://files.sri.inf.ethz.ch/website/papers/jsnice15.pdf
      - Predicting Program Properties from “Big Code” (2015)
  - JSNice: As an example of this framework, the paper also presents the design and implementation of JSNice (http://jsnice.org), a popular system among JavaScript developers that predicts variable names and type annotations for JavaScript code.
  - How does it work?
    
    The general approach (as well as JSNice) is based on state-of-the-art machine learning:
    - Conditional Random Fields (CRFs) as a general framework for learning from code. CRFs are graphical models which are tremendously popular in image processing and natural language processing. This work pioneers CRFs in the domain of programs.
    - Fast prediction algorithms that take into account the existing names and types in order to predict new names and types. Such algorithms are also known as MAP inference.
    - Maximum-margin training based on state-of-the-art efficient learning techniques from Support Vector Machines.
    - An efficient, scalable and parallel implementation that learns from massive amounts of code quickly.
https://dl.acm.org/doi/10.1145/2676726.2677009
- Predicting Program Properties from "Big Code" (2015)
- We present a new approach for predicting program properties from massive codebases (aka "Big Code"). Our approach first learns a probabilistic model from existing data and then uses this model to predict properties of new, unseen programs.
  
  The key idea of our work is to transform the input program into a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic graphical models such as conditional random fields (CRFs) in order to perform joint prediction of program properties.
  
  As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of problems in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNice predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of the cases. In the first week since its release, JSNice was used by more than 30,000 developers and in only a few months has become a popular tool in the JavaScript developer community.
  
  By formulating the problem of inferring program properties as structured prediction and showing how to perform both learning and inference in this context, our work opens up new possibilities for attacking a wide range of difficult problems in the context of "Big Code" including invariant generation, decompilation, synthesis and others.
https://ml4code.github.io/publications/raychev2015predicting/
- Predicting Program Properties from “Big Code” (2015)
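
To make the "joint prediction" / MAP inference idea from the JSNice material above more concrete, here is a toy sketch. It scores every joint assignment of candidate names against pairwise features and keeps the best one. The graph, candidate names, and weights are invented for illustration, and exhaustive enumeration stands in for the learned weights and fast inference the paper describes; this is not JSNice's implementation.

```python
from itertools import product


def map_assignment(candidates, edges, weights, known):
    """Brute-force MAP inference over a tiny pairwise model.

    candidates: unknown node -> list of candidate names
    edges: list of (node_a, relation, node_b) linking unknown or known nodes
    weights: (label_a, relation, label_b) -> score (hypothetical values)
    known: node -> fixed label (properties, literals, and so on)
    """
    unknowns = sorted(candidates)
    best, best_score = None, float("-inf")
    for labels in product(*(candidates[u] for u in unknowns)):
        assignment = dict(zip(unknowns, labels))
        label_of = {**known, **assignment}
        # Total score is the sum of the weights of all features that fire.
        score = sum(
            weights.get((label_of[a], rel, label_of[b]), 0.0)
            for a, rel, b in edges
        )
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Toy model of: for (var b = 0; b < a.length; b++) { ... a[b] ... }
candidates = {"a": ["array", "str"], "b": ["i", "count"]}
known = {"length": "length", "0": "0"}
edges = [("a", "member", "length"), ("b", "init", "0"), ("b", "index_into", "a")]
weights = {
    ("array", "member", "length"): 1.0,
    ("str", "member", "length"): 0.6,
    ("i", "init", "0"): 1.0,
    ("count", "init", "0"): 0.8,
    ("i", "index_into", "array"): 1.0,
    ("count", "index_into", "array"): 0.1,
    ("i", "index_into", "str"): 0.5,
}
best, score = map_assignment(candidates, edges, weights, known)
```

The point of the joint formulation is that the name chosen for `b` depends on the name chosen for `a` through the shared `index_into` feature, rather than each name being predicted in isolation.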

Software Similarity and Classification (2012; Book; Silvio Cesare, Yang Xiang)

https://books.google.com.au/books/about/Software_Similarity_and_Classification.html?id=Fy_mNhg2lK4C
https://link.springer.com/book/10.1007/978-1-4471-2909-7
- Software Similarity and Classification
- Authors: Silvio Cesare, Yang Xiang
- Number of Pages: XIV, 88
- The first book to construct a theory to describe the problems in software similarity and classification
- Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing those features into suitable representations, and constructing distance metrics to define similarity and dissimilarity are the key methods to identify software variants, clones, derivatives, and classes of software. Software Similarity and Classification reviews the literature of those core concepts, in addition to relevant literature in each application and demonstrates that considering these applied problems as a similarity and classification problem enables techniques to be shared between areas. Additionally, the authors present in-depth case studies using the software similarity and classification techniques developed throughout the book.
- Includes supplementary material: https://extras.springer.com/?query=978-1-4471-2908-0
  - 1 zip file containing 3 PDFs:
    - Table of Contents (6 pages)
      - 1 Introduction (6 pages)
        - 1.1 Background
        - 1.2 Applications of Software Similarity and Classification
        - 1.3 Motivation
        - 1.4 Problem Formulization
        - 1.5 Problem Overview
        - 1.6 Aims and Scope
        - 1.7 Book Organization
        - References
      - 2 Taxonomy of Program Features (10 pages)
        - 2.1 Syntactic Features
          - 2.1.1 Raw Code
          - 2.1.2 Abstract Syntax Trees
          - 2.1.3 Variables
          - 2.1.4 Pointers
          - 2.1.5 Instructions
          - 2.1.6 Basic Blocks
          - 2.1.7 Procedures
          - 2.1.8 Control Flow Graphs
          - 2.1.9 Call Graphs
          - 2.1.10 Object Inheritances and Dependencies
        - 2.2 Semantic Features
          - 2.2.1 API Calls
          - 2.2.2 Data Flow
          - 2.2.3 Procedure Dependence Graphs
          - 2.2.4 System Dependence Graph
        - 2.3 Taxonomy of Features in Program Binaries
          - 2.3.1 Object File Formats
          - 2.3.2 Headers
          - 2.3.3 Object Code
          - 2.3.4 Symbols
          - 2.3.5 Debugging Information
          - 2.3.6 Relocations
          - 2.3.7 Dynamic Linking Information
        - 2.4 Case Studies
          - 2.4.1 Portable Executable
          - 2.4.2 Executable and Linking Format
          - 2.4.3 Java Class File
        - References
      - 3 Program Transformations and Obfuscations (10 pages)
        - 3.1 Compiler Optimization and Recompilation
          - 3.1.1 Instruction Reordering
          - 3.1.2 Loop Invariant Code Motion
          - 3.1.3 Code Fusion
          - 3.1.4 Function Inlining
          - 3.1.5 Loop Unrolling
          - 3.1.6 Branch/Loop Inversion
          - 3.1.7 Strength Reduction
          - 3.1.8 Algebraic Identities
          - 3.1.9 Register Reassignment
        - 3.2 Program Obfuscation
        - 3.3 Plagiarism, Software Theft, and Derivative Works
          - 3.3.1 Semantic Changes
          - 3.3.2 Code Insertion
          - 3.3.3 Code Deletion
          - 3.3.4 Code Substitution
          - 3.3.5 Code Transposition
        - 3.4 Malware Packing, Polymorphism, and Metamorphism
          - 3.4.1 Dead Code Insertion
          - 3.4.2 Instruction Substitution
          - 3.4.3 Variable Renaming
          - 3.4.4 Code Reordering
          - 3.4.5 Branch Obfuscation
          - 3.4.6 Branch Inversion and Flipping
          - 3.4.7 Opaque Predicate Insertion
          - 3.4.8 Malware Obfuscation Using Code Packing
          - 3.4.9 Traditional Code Packing
          - 3.4.10 Shifting Decode Frame
          - 3.4.11 Instruction Virtualization and Malware Emulators
        - 3.5 Features under Program Transformations
        - References
      - 4 Formal Methods of Program Analysis (12 pages)
        - 4.1 Static Feature Extraction
        - 4.2 Formal Syntax and Lexical Analysis
        - 4.3 Parsing
        - 4.4 Intermediate Representations
          - 4.4.1 Intermediate Code Generation
          - 4.4.2 Abstract Machines
          - 4.4.3 Basic Blocks
          - 4.4.4 Control Flow Graph
          - 4.4.5 Call Graph
        - 4.5 Formal Semantics of Programming Languages
          - 4.5.1 Operational Semantics
          - 4.5.2 Denotational Semantics
          - 4.5.3 Axiomatic Semantics
        - 4.6 Theorem Proving
          - 4.6.1 Hoare Logic
          - 4.6.2 Predicate Transformer Semantics
          - 4.6.3 Symbolic Execution
        - 4.7 Model Checking
        - 4.8 Data Flow Analysis
          - 4.8.1 Partially Ordered Sets
          - 4.8.2 Lattices
          - 4.8.3 Monotone Functions and Fixed Points
          - 4.8.4 Fixed Point Solutions to Monotone Functions
          - 4.8.5 Dataflow Equations
          - 4.8.6 Dataflow Analysis Examples
          - 4.8.7 Reaching Definitions
          - 4.8.8 Live Variables
          - 4.8.9 Available Expressions
          - 4.8.10 Very Busy Expressions
          - 4.8.11 Classification of Dataflow Analyses
        - 4.9 Abstract Interpretation
          - 4.9.1 Widening and Narrowing
        - 4.10 Intermediate Code Optimisation
        - 4.11 Research Opportunities
        - References
      - 5 Static Analysis of Binaries (8 pages)
        - 5.1 Disassembly
        - 5.2 Intermediate Code Generation
        - 5.3 Procedure Identification
        - 5.4 Procedure Disassembly
        - 5.5 Control Flow Analysis, Deobfuscation and Reconstruction
        - 5.6 Pointer Analysis
        - 5.7 Decompilation of Binaries
          - 5.7.1 Condition Code Elimination
          - 5.7.2 Stack Variable Reconstruction
          - 5.7.3 Preserved Register Detection
          - 5.7.4 Procedure Parameter Reconstruction
          - 5.7.5 Reconstruction of Structured Control Flow
          - 5.7.6 Type Reconstruction
        - 5.8 Obfuscation and Limits to Static Analysis
        - 5.9 Research Opportunities
        - References
      - 6 Dynamic Analysis (6 pages)
        - 6.1 Relationship to Static Analysis
        - 6.2 Environments
        - 6.3 Debugging
        - 6.4 Hooking
        - 6.5 Dynamic Binary Instrumentation
        - 6.6 Virtualization
        - 6.7 Application Level Emulation
        - 6.8 Whole System Emulation
        - References
      - 7 Feature Extraction (4 pages)
        - 7.1 Processing Program Features
        - 7.2 Strings
        - 7.3 Vectors
        - 7.4 Sets
        - 7.5 Sets of Vectors
        - 7.6 Trees
        - 7.7 Graphs
        - 7.8 Embeddings
        - 7.9 Kernels
        - 7.10 Research Opportunities
        - References
      - 8 Software Birthmark Similarity (8 pages)
        - 8.1 Distance Metrics
        - 8.2 String Similarity
          - 8.2.1 Levenshtein Distance
          - 8.2.2 Smith-Waterman Algorithm
          - 8.2.3 Longest Common Subsequence (LCS)
          - 8.2.4 Normalized Compression Distance
        - 8.3 Vector Similarity
          - 8.3.1 Euclidean Distance
          - 8.3.2 Manhattan Distance
          - 8.3.3 Cosine Similarity
        - 8.4 Set Similarity
          - 8.4.1 Dice Coefficient
          - 8.4.2 Jaccard Index
          - 8.4.3 Jaccard Distance
          - 8.4.4 Containment
          - 8.4.5 Overlap Coefficient
          - 8.4.6 Tversky Index
        - 8.5 Set of Vectors Similarity
        - 8.6 Tree Similarity
        - 8.7 Graph Similarity
          - 8.7.1 Graph Isomorphism
          - 8.7.2 Graph Edit Distance
          - 8.7.3 Maximum Common Subgraph
        - References
      - 9 Software Similarity Searching and Classification (6 pages)
        - 9.1 Instance-Based Learning and Nearest Neighbour
          - 9.1.1 k Nearest Neighbours Query
          - 9.1.2 Range Query
          - 9.1.3 Metric Trees
          - 9.1.4 Locality Sensitive Hashing
          - 9.1.5 Distributed Similarity Search
        - 9.2 Statistical Machine Learning
          - 9.2.1 Vector Space Models
          - 9.2.2 Kernel Methods
        - 9.3 Research Opportunities
        - References
      - 10 Applications (6 pages)
        - 10.1 Malware Classification
          - 10.1.1 Raw Code
          - 10.1.2 Instructions
          - 10.1.3 Basic Blocks
          - 10.1.4 API Calls
          - 10.1.5 Control Flow and Data Flow
          - 10.1.6 Data Flow
          - 10.1.7 Call Graph
          - 10.1.8 Control Flow Graphs
        - 10.2 Software Theft Detection (Static Approaches)
          - 10.2.1 Instructions
          - 10.2.2 Control Flow
          - 10.2.3 API Calls
          - 10.2.4 Object Dependencies
        - 10.3 Software Theft Detection (Dynamic Approaches)
          - 10.3.1 Instructions
          - 10.3.2 Control Flow
          - 10.3.3 API Calls
          - 10.3.4 Dependence Graphs
        - 10.4 Plagiarism Detection
          - 10.4.1 Raw Code and Tokens
          - 10.4.2 Parse Trees
          - 10.4.3 Program Dependency Graph
        - 10.5 Code Clone Detection
          - 10.5.1 Raw Code and Tokens
          - 10.5.2 Abstract Syntax Tree
          - 10.5.3 Program Dependency Graph
        - 10.6 Critical Analysis
        - References
      - 11 Future Trends and Conclusion
        - 11.1 Future Trends
        - 11.2 Conclusion
    - Preface (1 page)
    - Chapter 2: Taxonomy of Program Features (10 pages)
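
As a quick reference for the set similarity measures listed under section 8.4 of the table of contents above, here is a small sketch using their standard textbook definitions (not code from the book). The feature strings in the example are hypothetical stand-ins for elements that tend to survive minification, such as string literals and property names.

```python
def _ratio(numerator, denominator):
    # Convention assumed here: a zero denominator (e.g. two empty feature
    # sets) is treated as a perfect match rather than an error.
    return 1.0 if denominator == 0 else numerator / denominator


def jaccard_index(a, b):
    return _ratio(len(a & b), len(a | b))


def jaccard_distance(a, b):
    return 1.0 - jaccard_index(a, b)


def dice_coefficient(a, b):
    return _ratio(2 * len(a & b), len(a) + len(b))


def containment(a, b):
    """Fraction of a's features that also appear in b (asymmetric)."""
    return _ratio(len(a & b), len(a))


def overlap_coefficient(a, b):
    return _ratio(len(a & b), min(len(a), len(b)))


def tversky_index(a, b, alpha=1.0, beta=1.0):
    # alpha = beta = 1 gives the Jaccard index; alpha = beta = 0.5 gives Dice.
    common = len(a & b)
    return _ratio(common, common + alpha * len(a - b) + beta * len(b - a))


# Hypothetical features extracted from a known library and a bundle chunk.
known_library = {"str:Invalid argument", "str:__esModule",
                 "prop:displayName", "prop:forwardRef"}
bundle_chunk = {"str:Invalid argument", "str:__esModule", "prop:displayName",
                "prop:render", "str:div", "prop:children"}

print(jaccard_index(known_library, bundle_chunk))
print(containment(known_library, bundle_chunk))
```

For the "is this library somewhere inside this bundle chunk?" question, the asymmetric containment measure is often a better fit than Jaccard, since unrelated code in the chunk lowers Jaccard but does not affect containment.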

Unsorted

https://binary.ninja/2022/06/20/introducing-tanto.html#potential-uses-and-some-speculation
- What I’ve found most interesting, and have been speculating about, is using variable slices like these (though not directly through the UI) in the function fingerprinting space. I’ve long suspected that a dataflow-based approach to fingerprinting might prove to be robust against compiler optimizations and versions, as well as source code changes that don’t completely redefine the implementation of a function. Treating each variable slice as a record of what happens to data within a function, a similarity score for two slices could be generated from the count of matching operations, matching constant interactions (2 + var_a), and matching variable interactions (var_f + var_a). Considering all slices, a confidence metric could be derived for whether two functions match. Significant research would be required to answer these questions concretely… and, if you could solve subgraph isomorphism at the same time, that’d be great!
- https://bsky.app/profile/elykdeer.bsky.social
- https://bsky.app/profile/1ns0mn1h4ck.bsky.social/post/3liypgatkt22x
  - We’re thrilled to announce Kyle Martin’s session at Insomni’hack 2025: "'A Slice of' Modern Program Analysis".
  - 🔍 Discover the lineup and book your spot: insomnihack.ch/talks/a-slic...
    - https://insomnihack.ch/talks/a-slice-of-modern-program-analysis/
      - Talk: "A Slice of" Modern Program Analysis March 14, 13:30 (CAMPUS)
      - This talk introduces Tanto 2.0: an open-source, binary analysis, slicing framework and plugin for Binary Ninja designed to help discover and verify bugs and vulnerabilities faster than ever before. As government-funded programs and private-sector research continue to encounter increasingly complex problems that require more data and context to solve, slicing aims to cut those problems back down to size.
- https://bsky.app/profile/binary.ninja/post/3lma74a4aem2n
  - Kyle's talk at Insomni'Hack is live! youtu.be/I0PoE0IdtmE?... Check it out if you're interested in a slice of modern program analysis and try the latest version of Tanto as well, in the plugin manager or at github.com/Vector35/tanto
    - https://www.youtube.com/watch?v=I0PoE0IdtmE
      - YouTube: "A Slice Of" Modern Program Analysis - Kyle Martin (50:02)
    - https://github.com/Vector35/tanto
      - Tantō slices functions into more consumable chunks
      - https://github.com/Vector35/tanto#insomnihack-2025-slides
        
        Insomni'Hack 2025 Slides
        
        Here are the slides for the talk @ElykDeer gave at Insomni'Hack 2025: "A Slice of" Modern Program Analysis
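
The variable-slice similarity idea quoted from the Tanto blog post above (counting matching operations, matching constant interactions, and matching variable interactions) could be sketched roughly as follows. The slice representation, the feature encoding, and the weighted-Jaccard scoring are all assumptions made for illustration; none of this is Tanto's API or an implementation from the post.

```python
from collections import Counter


def slice_features(slice_ops):
    """Turn a slice into a multiset of name-independent features.

    slice_ops: list of (operation, operands); operands are str for
    variables and int for constants (hypothetical representation).
    """
    feats = Counter()
    for op, operands in slice_ops:
        feats[("op", op)] += 1
        n_vars = sum(1 for o in operands if isinstance(o, str))
        for const in (o for o in operands if isinstance(o, int)):
            if n_vars:
                feats[("const", op, const)] += 1  # e.g. 2 + var_a
        if n_vars >= 2:
            feats[("varvar", op)] += 1  # e.g. var_f + var_a
    return feats


def slice_similarity(slice_a, slice_b):
    """Weighted Jaccard over the two feature multisets."""
    fa, fb = slice_features(slice_a), slice_features(slice_b)
    matched = sum((fa & fb).values())  # per-feature minimum counts
    total = sum((fa | fb).values())    # per-feature maximum counts
    return matched / total if total else 1.0


def function_match_confidence(slices_f, slices_g):
    """Average, over f's slices, of the best-matching slice in g."""
    if not slices_f or not slices_g:
        return 0.0
    best = [max(slice_similarity(s, t) for t in slices_g) for s in slices_f]
    return sum(best) / len(best)


s1 = [("add", ("var_a", 2)), ("mul", ("var_f", "var_a"))]
s2 = [("add", ("x", 2)), ("mul", ("y", "x"))]   # same shape, renamed
s3 = [("add", ("x", 3)), ("sub", ("y", "x"))]   # different constant and op
print(slice_similarity(s1, s2), slice_similarity(s1, s3))
```

Because variable names are discarded when building features, renaming alone does not change the score, which is the property that makes this style of fingerprint interesting for minified or recompiled code.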
