
Fingerprinting Minified JavaScript Libraries / AST Fingerprinting / Source Code Similarity / Etc

Some notes and tools on fingerprinting minified JavaScript libraries, AST fingerprinting, source code similarity, etc.

Table of Contents

Original Notes

This gist was created because there was too much content on this topic to keep tacking onto my older gist on Deobfuscating / Unminifying Obfuscated Web App / JavaScript Code. Until I move all of the relevant content from there to this gist, here is a link to the main notes I was keeping there (largely copies of my comments on various relevant GitHub repos exploring this topic, plus related research / tools / etc):

ChatGPT Explorations

  • https://chatgpt.com/c/d2713f5a-19ee-41fe-836d-0db4ba3daeac
    • Public Share (created 2025-03-25): https://chatgpt.com/share/67e25fc8-f638-8008-a610-3edaa6614072
    • Private ChatGPT conversation about various things related to AST fingerprinting/etc; or as it summarised itself:
      • This chat explored how to create a stable and efficient system for fingerprinting and identifying variables in minified JavaScript code using structural patterns from AST analysis. We examined how tools like eslint-scope can help extract scope and reference data, discussed structural fingerprinting techniques inspired by academic research, and considered which JavaScript elements typically survive minification (like strings, symbols, and function structures). Finally, we developed an enhanced AST traversal script that categorizes these preserved elements by context—scopes, functions, classes, and modules—to make them easier to understand and analyze.

    • TODO: Summarise/pull out the relevant parts from this and include them here
  • https://chatgpt.com/c/67e25d5d-1aa4-8008-ac08-c971ac64090e
    • Public Share (created 2025-03-25): https://chatgpt.com/share/67e25f3a-b604-8008-9d83-e12c738eb306
    • Private ChatGPT conversation about various things related to identifying NPM imports in a bundled apps module import/export graph; or as it summarised itself:
      • This chat discusses techniques for analyzing a module dependency graph extracted from a bundled and minified JavaScript web app to identify subgraphs likely representing third-party library code. It covers methods such as graph clustering (e.g., Louvain, spectral clustering), centrality analysis, import tree depth, symbol naming heuristics, fingerprint/signature matching, entropy analysis, and dynamic profiling. These approaches help isolate self-contained, library-like clusters that can potentially be "sliced off" from the main application logic, supporting the goal of distinguishing app code from imported npm dependencies.

    • TODO: Summarise/pull out the relevant parts from this and include them here
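The idea of structural fingerprints that survive minification can be illustrated with a minimal sketch. Python's stdlib `ast` module stands in for a JavaScript parser here (a real tool would use something like Babel, acorn, or eslint-scope as discussed above); the fingerprint hashes node types and literal values while discarding identifier names, so renaming variables and parameters does not change it:

```python
import ast
import hashlib

def structural_fingerprint(source: str) -> str:
    """Hash the shape of a code snippet's AST, ignoring identifier names.

    Node types, literal values, and child order are kept; variable,
    parameter, and function names (which minifiers rewrite) are discarded.
    """
    tree = ast.parse(source)
    shape = []
    for node in ast.walk(tree):
        shape.append(type(node).__name__)
        # Keep literals: string/number constants usually survive minification.
        if isinstance(node, ast.Constant):
            shape.append(repr(node.value))
    return hashlib.sha256("|".join(shape).encode()).hexdigest()

# Two versions of the same function: one readable, one with "minified" names.
original = "def add_totals(values, offset):\n    return sum(values) + offset"
minified = "def a(b, c):\n    return sum(b) + c"

assert structural_fingerprint(original) == structural_fingerprint(minified)
```

A real fingerprinter for minified bundles would additionally normalise minifier-specific transforms (e.g. comma expressions, inlined constants) before hashing, which this sketch does not attempt.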
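The "slice off library subgraphs" idea can be sketched with a toy module import graph (all module names below are made up for illustration). Removing known app modules and taking the connected components of what remains yields self-contained clusters that are candidates for bundled npm dependencies; a real pipeline would combine this with the clustering and heuristic signals listed above:

```python
from collections import defaultdict

# Hypothetical module import graph from a bundled app: edges point from
# importer to imported module.
imports = {
    "app/main": ["app/routes", "lib/react/index"],
    "app/routes": ["app/views", "lib/lodash/index"],
    "app/views": ["lib/react/index"],
    "lib/react/index": ["lib/react/dom", "lib/react/hooks"],
    "lib/react/dom": ["lib/react/hooks"],
    "lib/react/hooks": [],
    "lib/lodash/index": ["lib/lodash/map"],
    "lib/lodash/map": [],
}

def library_candidates(graph, app_modules):
    """Drop app modules, then return the connected components of what is
    left: self-contained clusters that look like vendored libraries."""
    undirected = defaultdict(set)
    for src, dsts in graph.items():
        for dst in dsts:
            if src not in app_modules and dst not in app_modules:
                undirected[src].add(dst)
                undirected[dst].add(src)
    seen, components = set(), []
    for node in graph:
        if node in app_modules or node in seen:
            continue
        stack, component = [node], set()
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            component.add(cur)
            stack.extend(undirected[cur])
        components.append(component)
    return components

clusters = library_candidates(imports, {"app/main", "app/routes", "app/views"})
# Expect two clusters: the react-like subgraph and the lodash-like subgraph.
assert sorted(len(c) for c in clusters) == [2, 3]
```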

Musings

On Twitter

Embedding Based Code Search Across the Open-Source Ecosystem

Code Search

GitHub Code Search

Public Code Search

Docs

  • https://docs.github.com/en/search-github/github-code-search/about-github-code-search
    • About GitHub Code Search You can search, navigate and understand code across GitHub with code search.

    • https://docs.github.com/en/search-github/github-code-search/about-github-code-search#limitations
      • Limitations

        We have indexed many public repositories for code search, and continue to index more. Additionally, the private repositories of GitHub users are indexed and searchable by those that already have access to those private repositories on GitHub. However, very large repositories may not be indexed at this time, and not all code is indexed.

        The current limitations on indexed code are:

        • Vendored and generated code is excluded
        • Empty files and files over 350 KiB are excluded
        • Lines over 1,024 characters long are truncated
        • Binary files (PDF, etc.) are excluded
        • Only UTF-8 encoded files are included
        • Very large repositories may not be indexed
        • Exhaustive search is not supported
        • Files with more than one line over 4096 bytes are excluded

        We currently only support searching for code on the default branch of a repository. The query length is limited to 1000 characters.

        Results for any search with code search are restricted to 100 results (5 pages). Sorting is not supported for code search results at this time. This limitation only applies to searching code with the new code search and does not apply to other types of searches.

        If you use the path: qualifier for a file that's in multiple repositories with similar content, GitHub will only show a few of those files. If this happens, you can choose to expand by clicking Show identical files at the bottom of the page.

        Code search supports searching for symbol definitions in code, such as function or class definitions, using the symbol: qualifier. However, note that the symbol: qualifier only searches for definitions and not references, and not all symbol types or languages are fully supported yet.

  • https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax
  • https://docs.github.com/en/search-github/searching-on-github/searching-code
    • Searching code (legacy) You only need to use the legacy code search syntax if you are using the code search API.

    • https://docs.github.com/en/rest/search/search#search-code
      • Search code Searches for query terms inside of a file. This method returns up to 100 results per page.

      • GET /search/code

Blogs, YouTube, etc

  • https://www.youtube.com/watch?v=QCs76SC1ZZ0
    • YouTube: The technology behind GitHub's new code search - Universe 2022

  • https://github.blog/engineering/architecture-optimization/the-technology-behind-githubs-new-code-search/
    • The technology behind GitHub’s new code search (February 6, 2023) A look at what went into building the world’s largest public code search index.

    • TODO: read through this and include more relevant snippets here
    • https://news.ycombinator.com/item?id=34681223
      • I've worked alongside the CEO/CTO of Sourcegraph for the past 8 years, everyone else is at our company offsite so I figured I'd chime in :) nobody asked me to write this (nor did I ask) :)

        The article is a top-notch technical write-up, the devs on GitHub code search should be proud of what they've achieved so far!

        Honestly, we're rooting for GitHub to improve their code search, viewing them as a close peer, not a competitor. We also maintain OSS projects like Zoekt, which IIRC GitLab is maybe looking at using for their own. The more devs that 'get' code search, the better off Sourcegraph is frankly!

        GitHub has a nice intuitive/simple UX, we could learn a thing or two there (though, easier to do with less features.)

        Still, Sourcegraph search tech is quite a bit more powerful:

        • Searching over commit messages, diffs, filename, etc. are super nice for tracking down regressions / finding 'that PR I swear my coworker made'
        • Expressiveness like "find this regexp in repositories, but only if the repo has had a commit in the last month AND has a file named package.json in its root"
        • Since Steve Yegge joined us, we've started thinking about ranking of search results, a notoriously difficult thing to do well in code search unless you have great factors to rank on (e.g. a semantic understanding of code): https://about.sourcegraph.com/blog/new-search-ranking
        • We stream results back, so you can get a comprehensive set of results, not just a few pages, from our API.
        • Works in GitHub Enterprise, not just GitHub.com. Plus on all your code hosts, think BitBucket, GitLab, Azure DevOps, Gerrit, Phabricator, etc. and even non-Git VCS like Perforce.
        • Respects permissions of all your code hosts (a very difficult problem, as there are no official APIs to query this info from code hosts in general)

        Having code search is one thing, but using it is another:

        • Code Insights (we use search as an API to gather statistics about code, track code quality, keywords, etc. both over time and retroactively and let you build dashboards)
        • Batch changes (find+replace, but over thousands of repositories. Run a Docker container per repo, run your custom linter script etc. and then draft or send PRs to thousands of repos, manage/track campaigns with thousands of PRs like that over time, etc.)
        • Precise code intel / semantic awareness of code, we use SCIP indexers for this (spiritual successor to Microsoft's LSIF format for indexing LSP servers.)

        I am super happy GitHub continues to push their code search effort, and genuinely believe it's a great thing for all developers and us over at Sourcegraph. Also excited to see when they do their public rollout of this :)

        Anyway, that's just my take as someone who works there; other Sourcegraphers will chime in later if anything I said above feels off to them, I'm sure :)

        • https://sourcegraph.com/blog/new-search-ranking
          • Rethinking search results ranking on Sourcegraph.com

          • Announcing Search Ranking and Relevance

            I’m thrilled to announce that Sourcegraph has launched PageRank-driven Code Search result rankings that prioritize relevance and showing reusable code. This launched today for searches on popular OSS repos on https://sourcegraph.com/ , and we are working to bring ranking to private Sourcegraph deployments soon.

          • Sourcegraph’s new search ranking uses a rendition of the Google PageRank algorithm on source code, powered by the code symbol graph from our sophisticated code intelligence platform (CIP).

          • Why is using PageRank for Code Search so revolutionary and effective? Let’s dig in.

          • For web pages, Google’s PageRank tracks which pages are pointed at (referenced) most often by other web pages. PageRank is a measure of how “cool” they are: Who’s pointing at them?

            For source code, the pointing hands are code usages: function calls, imports, that sort of thing. If there’s only one arm pointing at a smiley, that’s a code use. But if more than one arm is pointing in… that’s reuse! The big yellow smiley is being reused by more code than any other smiley in the diagram. The PageRank algorithm uncovered this fact.

            The implication here is that PageRank is a measure of code reuse. Which makes it an incredibly powerful ranking signal. Because when you’re doing a code search, you are almost always looking for code you can reuse.

          • TODO: read through this and include more relevant snippets here
  • https://github.blog/engineering/a-brief-history-of-code-search-at-github/
    • A brief history of code search at GitHub (December 15, 2021)

      This blog post tells the story of why we built a new search engine optimized for code.

    • We want to share more about our work on code exploration, navigation, search, and developer productivity. Recently, we substantially improved the precision of our code navigation for Python, and open-sourced the tools we developed for this. The stack graph formalism we developed will form the basis for precise code navigation support for more languages, and will even allow us to empower language communities to build and improve support for their own languages, similarly to how we accept contributions to github/linguist to expand GitHub’s syntax highlighting capabilities.

    • TODO: read through this and include more relevant snippets here
  • https://github.blog/open-source/introducing-stack-graphs/
    • Introducing stack graphs (December 9, 2021 | Updated July 23, 2024)

      Precise code navigation is powered by stack graphs, a new open source framework that lets you define the name binding rules for a programming language.

    • Today, we announced the general availability of precise code navigation for all public and private Python repositories on GitHub.com. Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job. In this post, I’ll dig into how stack graphs work, and how they achieve these results.

    • TODO: read through this and include more relevant snippets here
    • https://dcreager.net/talks/stack-graphs/
      • Incremental, zero-config Code Navigation using stack graphs.

        Exploring a large or unfamiliar codebase can be tricky. Code Navigation features like “jump to definition” and “find all references” let you discover how different pieces of code relate to each other. To power these features, we need to extract lists of symbols from the code, and describe the language-specific rules for how those symbols relate to each other.

        It’s difficult to add Code Nav to a large hosted service like GitHub, where we must support hundreds of programming languages, hundreds of millions of repositories, and petabytes of history. At this scale, we have a different set of design constraints than a local IDE. We need our data extraction to be incremental, so that we can reuse previous results for files that haven’t changed in a newly pushed commit, saving both compute and storage costs. And to support cross-repo lookups, it should require zero configuration — repo owners should not have to set up anything manually to activate the feature.

        In this talk I’ll describe stack graphs, which use a graphical notation to define the name binding rules for a programming language. They work equally well for dynamic languages like Python and JavaScript, and for static languages like Go and Java. Our solution is fast — processing most commits within seconds of us receiving your push. It does not require setting up a CI job, or tapping into a project-specific build process. And it is open-source, building on the tree-sitter project’s existing ecosystem of language tools.

      • Presentation: https://www.youtube.com/watch?v=l2R1PTGcwrE
        • YouTube: "Incremental, zero-config Code Nav using stack graphs" by Douglas Creager

      • Slides: https://media.dcreager.net/dcreager-strange-loop-2021-slides.pdf
    • https://arxiv.org/abs/2211.01224
      • Stack graphs: Name resolution at scale (2022)

      • We present stack graphs, an extension of Visser et al.'s scope graphs framework. Stack graphs power Precise Code Navigation at GitHub, allowing users to navigate name binding references both within and across repositories. Like scope graphs, stack graphs encode the name binding information about a program in a graph structure, in which paths represent valid name bindings. Resolving a reference to its definition is then implemented with a simple path-finding search.

        GitHub hosts millions of repositories, containing petabytes of total code, implemented in hundreds of different programming languages, and receiving thousands of pushes per minute. To support this scale, we ensure that the graph construction and path-finding judgments are file-incremental: for each source file, we create an isolated subgraph without any knowledge of, or visibility into, any other file in the program. This lets us eliminate the storage and compute costs of reanalyzing file versions that we have already seen. Since most commits change a small fraction of the files in a repository, this greatly amortizes the operational costs of indexing large, frequently changed repositories over time. To handle type-directed name lookups (which require "pausing" the current lookup to resolve another name), our name resolution algorithm maintains a stack of the currently paused (but still pending) lookups. Stack graphs can be constructed via a purely syntactic analysis of the program's source code, using a new declarative graph construction language. This means that we can extract name binding information for every repository without any per-package configuration, and without having to invoke an arbitrary, untrusted, package-specific build process.

  • https://github.blog/news-insights/product-news/precise-code-navigation-python-code-navigation-pull-requests/
    • Precise code navigation for Python, and code navigation in pull requests (December 9, 2021 | Updated July 23, 2024)

      Code navigation is now available in PRs, and code navigation results for Python are now more precise.

    • Over the coming months, we will add stack graph support for additional languages, allowing us to show precise code navigation results for them as well. Our stack-graphs library is open source and builds on the Tree-sitter ecosystem of parsers. We will also be publishing information on how language communities can self-serve stack graph support for their languages, should they wish to.

    • If you would like to learn more about how stack graphs enable precise code navigation with zero configuration, check out our deep dive post and Strange Loop presentation.

    • TODO: read through this and include more relevant snippets here
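The PageRank-for-code idea quoted in the Sourcegraph ranking post above can be sketched with a few lines of power iteration over a made-up symbol reference graph (edges point from a caller to the symbol it references; heavily reused symbols accumulate rank):

```python
# Minimal PageRank power iteration over a toy symbol reference graph,
# illustrating the "reuse as ranking signal" idea. All names are made up.
refs = {
    "app.main": ["util.parse", "util.log"],
    "server.start": ["util.parse", "util.log"],
    "cli.run": ["util.parse"],
    "util.parse": ["util.log"],
    "util.log": [],
}

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new = {node: (1 - damping) / n for node in graph}
        for node, outgoing in graph.items():
            if not outgoing:  # dangling node: spread its rank to everyone
                for other in graph:
                    new[other] += damping * ranks[node] / n
            else:
                for target in outgoing:
                    new[target] += damping * ranks[node] / len(outgoing)
        ranks = new
    return ranks

ranks = pagerank(refs)
# util.log is reused both directly and via the high-rank util.parse,
# so it should rank highest.
assert max(ranks, key=ranks.get) == "util.log"
```

Sourcegraph's production ranking runs over the symbol graph produced by its code intelligence platform; this sketch only shows the core iteration.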

SourceGraph

  • https://sourcegraph.com/
    • Sourcegraph accelerates how software gets built, helping developers search, understand, and write code in complex codebases with AI

    • Code Search Find and navigate code, make large-scale changes, and track insights across codebases of any size.

      • https://sourcegraph.com/contexts
        • Search code you care about with search contexts

          • https://sourcegraph.com/docs/code-search/working/search_contexts
            • Search Contexts

            • Search Contexts help you search the code you care about on Sourcegraph. A search context represents a set of repositories at specific revisions on a Sourcegraph instance that will be targeted by search queries by default.

            • Every search on Sourcegraph uses a search context. Search contexts can be defined with the contexts selector shown in the search input, or entered directly in a search query.

      • https://sourcegraph.com/code-search
        • Code Search makes it easy to find code, make large-scale changes, and track insights across codebases of any scale and with any number of code hosts.

        • Efficiently reuse existing code. Find code across thousands of repositories and multiple code hosts in seconds.

        • Understand your code and its dependencies

          • Onboard to codebases faster with cross-repository code navigation features like “Go to definition” and “Find references”.
          • Complete code reviews, get up to speed on unfamiliar code, and determine the impact of code changes with the confidence of compiler-accurate code navigation.
          • Determine root causes quickly with code navigation that tracks dependencies and references across repositories.
    • https://sourcegraph.com/pricing
      • Free

        • $0 per month
        • AI editor extension for hobbyists or light usage
      • Enterprise Starter

        • $19 per user/month
        • AI & search experience for growing organizations hosted on our cloud
        • This seems to be the first tier that adds specialised search features (beyond what's available publicly, anyway)
          • Integrated search results

          • Code Search Features

            • Code Search
            • Symbol Search
      • Enterprise

        • $59 per user/month
        • AI & search with enterprise-level security, scalability, and flexibility
        • Extra search features
          • Everything in Enterprise Starter, plus:

          • Code Search Features

            • Batch Changes
            • Code Insights
            • Code Navigation

Public Code Search

Docs

SourceGraph GitHub

Main
  • https://github.com/sourcegraph/sourcegraph-public-snapshot
    • Sourcegraph Code AI platform with Code Search & Cody

    • Note

      Sourcegraph transitioned to a private monorepo. This repository, sourcegraph/sourcegraph-public-snapshot is a publicly available copy of the sourcegraph/sourcegraph repository as it was just before the migration.

    • Tip

      If you are interested in working with the code, this commit is the last one made under an Apache License.

      • This commit was made on Jun 14, 2023
    • Note: The latest commits seem to be from August 2024
    • https://news.ycombinator.com/item?id=36584656
      • Sourcegraph is no longer open source

      • sqs on July 4, 2023

        Sourcegraph CEO here. Sourcegraph is now 2 separate products: code search and Cody (our code AI). Cody remains open source (Apache 2) in the client/cody* directories in the repository, and we're extracting that to a separate 100% OSS repository soon.

        Our licensing principle remains to charge companies while making tools for individual devs open source. Very few individual devs (or companies) used the limited-feature open-source variant of code search, so we decided to remove it. Usage of Sourcegraph code search was even more skewed toward our official non-OSS build than in other similar situations like Google Chrome vs. Chromium or VS Code vs. VSCodium. Maintaining 2 variants was a burden on our engineering team that had very little benefit for anyone.

        You can see more explanation at sourcegraph/sourcegraph-public-snapshot#53528 (comment) . The change was announced in the changelog and in a PR (all of our development occurs in public), and we will have a blog post this week after we separate our big monorepo into 2 repos as planned: the 100% OSS repo for Cody and the non-OSS repo for code search.

        You can still use Sourcegraph code search for free on public code at https://sourcegraph.com and on our self-hosted free tier on private code (which means individual devs can still run Sourcegraph code search 100% for free). Customers are not affected at all.

    • https://github.com/sourcegraph/src-cli
      • Sourcegraph CLI

      • src is a command line interface to Sourcegraph:

        • Search & get results in your terminal
        • Search & get JSON for programmatic consumption
        • Make GraphQL API requests with auth easily & get JSON back fast
        • Execute batch changes
        • Manage & administrate repositories, users, and more
        • Easily convert src-CLI commands to equivalent curl commands, just add --get-curl!
Zoekt - Fast Code Search
  • https://github.com/sourcegraph/zoekt
    • Zoekt: fast code search

    • Fast trigram based code search

    • Zoekt is a text search engine intended for use with source code. (Pronunciation: roughly as you would pronounce "zooked" in English)

    • Note: This has been the maintained source for Zoekt since 2017, when it was forked from the original repository github.com/google/zoekt.

    • Zoekt supports fast substring and regexp matching on source code, with a rich query language that includes boolean operators (and, or, not). It can search individual repositories, and search across many repositories in a large codebase. Zoekt ranks search results using a combination of code-related signals like whether the match is on a symbol. Because of its general design based on trigram indexing and syntactic parsing, it works well for a variety of programming languages.

      The two main ways to use the project are

      • Through individual commands, to index repositories and perform searches through Zoekt's query language
      • Or, through the indexserver and webserver, which support syncing repositories from a code host and searching them through a web UI or API

      For more details on Zoekt's design, see the docs directory.

    • Note: It is also recommended to install Universal ctags, as symbol information is a key signal in ranking search results. See ctags.md for more information.

    • https://github.com/sourcegraph/zoekt/blob/main/doc/query_syntax.md
      • Zoekt Query Language Guide This guide explains the Zoekt query language, used for searching text within Git repositories. Zoekt queries allow combining multiple filters and expressions using logical operators, negations, and grouping. Here's how to craft queries effectively.

    • https://github.com/sourcegraph/zoekt-archived
      • Note: This is a Sourcegraph fork of github.com/google/zoekt. It contains some changes that do not make sense to upstream and or have not yet been upstreamed.
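Zoekt's trigram-based approach can be sketched in miniature: index every trigram of every document into posting lists, intersect the posting lists for a query's trigrams to get candidates, then verify each candidate with a real substring check (the real engine also stores trigram positions and ranks results on signals like symbol matches):

```python
from collections import defaultdict

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Toy corpus: trigram -> set-of-documents posting lists.
docs = {
    "a.go": "func ParseQuery(q string) {}",
    "b.go": "func main() { run() }",
    "c.go": "query := ParseQuery(input)",
}
index = defaultdict(set)
for name, text in docs.items():
    for tri in trigrams(text):
        index[tri].add(name)

def search(substring):
    """Intersect posting lists for the query's trigrams, then verify each
    candidate with a substring check to weed out false positives."""
    candidates = None
    for tri in trigrams(substring):
        candidates = index[tri] if candidates is None else candidates & index[tri]
    return sorted(d for d in (candidates or set()) if substring in docs[d])

assert search("ParseQuery") == ["a.go", "c.go"]
assert search("run(") == ["b.go"]
```

The intersection step is why trigram indexes handle substring and many regexp queries efficiently: only documents containing every trigram of the query need to be scanned.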

SCIP - SCIP Code Intelligence Protocol
LSIF (Legacy)
LSP - Language Server Protocol (Legacy)
  • https://github.com/sourcegraph/sourcegraph-typescript
    • Language server for TypeScript/JavaScript

    • Provides code intelligence for TypeScript

    • This repository has been superseded by scip-typescript.

  • https://github.com/sourcegraph/lsp-client
    • @sourcegraph/lsp-client

    • Connects Sourcegraph extensions to language servers

  • https://github.com/sourcegraph/lsp-adapter
    • lsp-adapter provides a proxy which adapts Sourcegraph LSP requests to vanilla LSP requests

    • Code Intelligence on Sourcegraph is powered by the Language Server Protocol.

      Previously, language servers that were used on sourcegraph.com were additionally required to support our custom LSP files extensions. These extensions allowed language servers to operate without sharing a physical file system with the client. While it's preferable for language servers to implement these extensions for performance reasons, implementing this functionality is a large undertaking.

      lsp-adapter eliminates the need for this requirement, which allows off-the-shelf language servers to be able to provide basic functionality (hovers, local definitions) to Sourcegraph.

  • https://github.com/sourcegraph/javascript-typescript-langserver
    • JavaScript and TypeScript code intelligence through the Language Server Protocol

    • This project is no longer maintained

      This language server is an implementation of LSP using TypeScript's APIs. This approach made it difficult to keep up with new features of TypeScript and implied that the server always uses a bundled TypeScript version, instead of the local TypeScript in node_modules like using the official (non-LSP) tsserver allows.

      On top of that, over time we simplified our architecture for running language servers in the cloud at Sourcegraph which removed the necessity for this level of tight integration and control. Theia's TypeScript language server is a thinner wrapper around tsserver, which avoids these problems to some extent. Our latest approach of running a TypeScript language server in the cloud uses Theia's language server (and transitively tsserver) under the hood.

      However, since then our code intelligence evolved even further and is nowadays powered primarily by LSIF, the Language Server Index Format. LSIF is developed together with LSP and uses the same structures, but in a pre-computed serialization instead of an RPC protocol. This allows us to provide near-instant code intelligence for our tricky on-demand cloud code intelligence scenarios and hence we are focusing all of our efforts on LSIF indexers. All of this work is also open source of course and if you're curious you can read more about how we use LSIF on our blog.

      LSP is still the obvious choice for editor scenarios and everyone is welcome to fork this repository and pick up maintenance, although from what we learned we would recommend to build on Theia's approach (wrapping tsserver). We would also love to see and are looking forward to native LSP support for the official tsserver, which would eliminate the need for any wrappers.

  • https://github.com/sourcegraph/typescript-language-server
ctags (Legacy)
srclib / jsg (Legacy)
  • https://github.com/sourcegraph/jsg
    • jsg: JavaScript grapher

    • JavaScript grapher -- part of GraphKit, a collection of source analyzers for popular programming languages

    • Moved to srclib-javascript (this repository is no longer a standalone project; submit patches to srclib-javascript)

  • https://srclib.org/
    • srclib is a hackable, multi-language code analysis library for building better software tools.

      srclib makes developer tools like code search and static analyzers better. It supports things like jump to definition, find usages, type inference, and documentation generation.

      srclib consists of language analysis toolchains (currently for Go, Python, JavaScript, and Ruby) with a common output format, and developer tools that consume this format.

      srclib originated inside Sourcegraph, where it powers intelligent code search over hundreds of thousands of projects.

    • https://github.com/sourcegraph/srclib
      • srclib is a polyglot code analysis library, built for hackability. It consists of language analysis toolchains (currently for Go and Java, with Python, JavaScript, and Ruby in beta) with a common output format, and a CLI tool for running the analysis.

    • https://github.com/sourcegraph/srclib-javascript
      • JavaScript (node.js) toolchain for srclib

      • srclib-javascript is a srclib toolchain that performs JavaScript (Node.js) code analysis: type inference, documentation generation, jump-to-definition, dependency resolution, etc.

        It enables this functionality in any client application whose code analysis is powered by srclib, including Sourcegraph.

    • https://github.com/sourcegraph/srclib-typescript
      • Sourcegraph support for typescript toolchain

      • srclib-typescript is a srclib toolchain that performs TypeScript code analysis: type inference, documentation generation, jump-to-definition, dependency resolution, etc. It enables this functionality in any client application whose code analysis is powered by srclib, including Sourcegraph.

Treesitter (Forks)
Golang Libs
Unsorted

Vercel Grep.app

  • https://grep.app/
  • https://vercel.com/blog/vercel-acquires-grep
    • Vercel acquires Grep to accelerate code search

    • Grep allows developers to quickly search code across over 500,000 public git repositories. With the acquisition, founder Dan Fox will also be joining Vercel’s AI team to continue building Grep to enhance code search for developers.

searchcode

  • https://searchcode.com/
    • SearchCode

    • Artisanal, small batch, handcrafted code search!

    • Simple, comprehensive code search

    • Helping you find real-world examples of functions, APIs, and libraries in 378+ languages across 10+ public code sources

    • Filter down to one or many sources such as Bitbucket, CodePlex, Fedora Project, GitLab, Github, Gitorious, Google Android, Google Code, Minix3, Seek Quarry, Sourceforge, Tizen, codeberg, repo.or.cz, sr.ht or by 378+ languages.

    • https://searchcode.com/about/
      • Team / Contact

        searchcode is currently the work of a single developer standing on the shoulders of giants.

        Feel free to contact me at [email protected] or via twitter @boyter or follow developments at https://boyter.org/

    • https://searchcode.com/api/
      • searchcode API

      • Code Index

        Queries the code index and returns at most 100 results. All filters supported by searchcode are available. These include src (sources), lan (languages) and loc (lines of code). These work in the same way that the main page works. See the examples for how to use these.

      • Code Result

        Returns the raw data from a code file given the code id which can be found as the id in a code search result.

      • Related Results

        Returns an array of results given a searchcode unique code id which are considered to be duplicates. The matching is slightly fuzzy, so that small differences between files are ignored.

      • etc
  • https://searchcodeserver.com/
    • searchcode server

    • The best code search solution. Guaranteed. The code search solution for companies that build or maintain software who want to improve productivity and shorten development time by getting value from their existing source code.

    • How searchcode server works.

      By indexing your source code it allows you to search over this code quickly, filtering down by repositories, languages and file owners to find what you were looking for. Own your data, searchcode server is not a SAAS or cloud product, download and install it on your own servers.

    • https://searchcodeserver.com/pricing.html
      • Pricing for searchcode server

      • Requirements: A GNU/Linux/Windows/BSD machine running the Java 8 runtime. Everything else is configured out of the box for you.

        The community edition is free to use for as many users as you wish but you must leave the searchcode branding visible.

        All paid plans include a full downloadable version of searchcode server with the ability to change the icon and modify other look and feel elements. The software comes with a lifetime licence to install and use searchcode server internally on as many instances as you like. You can use any paid version in any manner you see fit, including on public-facing websites. Finally, you will get direct emails letting you know when updates are available, with links to the updates, for the length of the support period.

    • https://github.com/boyter/searchcode-server/tree/master
      • searchcode server

      • searchcode server is a powerful code search engine with a sleek web user interface.

        searchcode server works in tandem with your source control system, indexing thousands of repositories and files allowing you and your developers to quickly find and reuse code across teams.
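
As a rough sketch of how the Code Index API described above might be queried: the `codesearch_I` endpoint path is my recollection of the searchcode API docs and may have changed, and the numeric `src`/`lan` filter IDs used in the usage example are placeholders, so check the docs for current values. The helper only builds the query URL; the actual HTTP request is left out.

```python
from urllib.parse import urlencode

def searchcode_url(query, sources=(), languages=(), page=0):
    """Build a searchcode Code Index query URL with optional src (source)
    and lan (language) filters, mirroring how the main page's filters work."""
    params = [("q", query), ("p", page)]
    params += [("src", s) for s in sources]
    params += [("lan", l) for l in languages]
    return "https://searchcode.com/api/codesearch_I/?" + urlencode(params)
```

For example, `searchcode_url("fingerprint ast", languages=(22,))` would produce a URL filtered to a single (placeholder) language ID.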

Ben E. C. Boyter's Blog

Shortlist:

Additional/Unsorted:

Google Code Search

Note: I think this might only be for Google projects/similar(?)

Programmable Search Engine

Unsorted

npm Package Ranking, Bundle Size, etc

Link Dump 1

The below content was originally posted in this comment (Dec 7, 2023: Ref), and then copied over as the basis for a new issue in this comment (Dec 13, 2023: Ref)

It has been further refined/enhanced since, including fixing up the titles, adding abstracts, and removing irrelevant links.


Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:

Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity

Program Dependence Graph, Control Flow Graph, Data Flow Graph, Data Flow Analysis, Program Analysis Tools, etc

  • https://en.wikipedia.org/wiki/Program_dependence_graph
    • Program Dependence Graph - Wikipedia

    • In computer science, a Program Dependence Graph (PDG) is a representation of a program's control and data dependencies. It's a directed graph where nodes represent program statements, and edges represent dependencies between these statements. PDGs are useful in various program analysis tasks, including optimizations, debugging, and understanding program behavior.

  • https://en.wikipedia.org/wiki/Control-flow_graph
    • Control-Flow Graph - Wikipedia

    • In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution.

    • In a control-flow graph each node in the graph represents a basic block, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph, and the exit block, through which all control flow leaves.

    • https://github.com/rudrOwO/control-flow-graph
    • https://reverseengineering.stackexchange.com/questions/16557/building-a-control-flow-graph-from-machine-code
      • Building a control flow graph from machine code (2017)

  • https://stackoverflow.com/questions/15087195/data-flow-graph-construction
    • Stack Overflow: Data Flow Graph Construction (2013)

  • https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree
    • Code Review Stack Exchange: Call-flow graph from Python abstract syntax tree (2022)

  • https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/
  • https://clang.llvm.org/docs/DataFlowAnalysisIntro.html
    • Clang Documentation: Data flow analysis: an informal introduction

    • This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems.

    • Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached.

  • https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html
  • https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf
    • Slides: Basic Program Analysis - Suman Jana

    • ChatGPT Summary / Abstract:
      • Title: Basic Program Analysis

        Author: Suman Jana

        Institution: Columbia University

        Abstract: This document delves into the foundational concepts and techniques involved in program analysis, particularly focusing on control flow and data flow analysis essential for identifying security bugs in source code. The objective is to equip readers with the understanding and tools needed to effectively analyze programs without building systems from scratch, utilizing existing frameworks such as LLVM for customization and enhancement of analysis processes.

        The core discussion includes an overview of compiler design with specific emphasis on the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Analysis. These elements are critical in understanding the structure of source code and its execution flow. The document highlights the conversion of source code into AST and subsequently into CFG, where data flow analysis can be applied to optimize code and identify potential security vulnerabilities.

        Additionally, the paper explores more complex topics like identifying basic blocks within CFG, constructing CFG from basic blocks, and advanced concepts such as loop identification and the concept of dominators in control flow. It also addresses the challenges and solutions related to handling irreducible Control Flow Graphs (CFGs), which are crucial for the analysis of less structured code.

        Keywords: Program Analysis, Compiler Design, Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Analysis, LLVM, Security Bugs.
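
The two core ideas above, partitioning code into basic blocks to form a CFG and then propagating facts over its edges until a fixpoint, can be sketched minimally. This is a hand-rolled toy, not any particular framework's API: the instruction format is invented, and the fixpoint is instantiated as classic backward live-variable analysis.

```python
# Toy CFG construction + data-flow fixpoint. Instructions are (op, target)
# tuples where op is "op" (plain), "jmp", or "br"; target is a jump index.

def basic_blocks(instrs):
    """Leaders: instruction 0, every jump target, and every instruction
    immediately following a jump/branch. Blocks run leader-to-leader."""
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("jmp", "br"):
            leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(i + 1)
    starts = sorted(leaders)
    return [list(range(s, starts[j + 1] if j + 1 < len(starts) else len(instrs)))
            for j, s in enumerate(starts)]

def live_variables(blocks, succ):
    """Backward worklist fixpoint: live-variable analysis.
    blocks: {name: {"gen": vars read before written, "kill": vars written}}
    succ:   {name: [successor block names]}"""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    worklist = list(blocks)
    while worklist:
        b = worklist.pop()
        live_out[b] = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
        new_in = blocks[b]["gen"] | (live_out[b] - blocks[b]["kill"])
        if new_in != live_in[b]:
            live_in[b] = new_in
            # a changed in-set can change the out-sets of predecessors
            worklist.extend(p for p in blocks if b in succ[p])
    return live_in, live_out
```

On a three-block `entry -> loop -> {loop, exit}` CFG where `loop` reads `x` and `exit` reads `y`, the analysis correctly reports `x` live into the loop, which is exactly the kind of fact a fingerprinting or security analysis would build on.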

Stack Overflow: Assembly-level function fingerprint (2011)

Systems and methods for detecting copied computer code using fingerprints (2016)

  • https://patents.google.com/patent/US9459861B1/en
    • Systems and methods for detecting copied computer code using fingerprints (2016)

    • Systems and methods of detecting copying of computer code or portions of computer code involve generating unique fingerprints from compiled computer binaries. The unique fingerprints are simplified representations of functions in the compiled computer binaries and are compared with each other to identify similarities between functions in the respective compiled computer binaries. Copying can be detected when there are sufficient similarities between fingerprints of two functions.

A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)

  • https://dl.acm.org/doi/10.1145/3486860
    • A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)

    • Binary code fingerprinting is crucial in many security applications. Examples include malware detection, software infringement, vulnerability analysis, and digital forensics. It is also useful for security researchers and reverse engineers since it enables high fidelity reasoning about the binary code such as revealing the functionality, authorship, libraries used, and vulnerabilities. Numerous studies have investigated binary code with the goal of extracting fingerprints that can illuminate the semantics of a target application. However, extracting fingerprints is a challenging task since a substantial amount of significant information will be lost during compilation, notably, variable and function naming, the original data and control flow structures, comments, semantic information, and the code layout. This article provides the first systematic review of existing binary code fingerprinting approaches and the contexts in which they are used. In addition, it discusses the applications that rely on binary code fingerprints, the information that can be captured during the fingerprinting process, and the approaches used and their implementations. It also addresses limitations and open questions related to the fingerprinting process and proposes future directions.

BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)

  • https://inria.hal.science/hal-01648996/document
    • BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)

    • Binary code fingerprinting is a challenging problem that requires an in-depth analysis of binary components for deriving identifiable signatures. Fingerprints are useful in automating reverse engineering tasks including clone detection, library identification, authorship attribution, cyber forensics, patch analysis, malware clustering, binary auditing, etc. In this paper, we present BinSign, a binary function fingerprinting framework. The main objective of BinSign is providing an accurate and scalable solution to binary code fingerprinting by computing and matching structural and syntactic code profiles for disassemblies. We describe our methodology and evaluate its performance in several use cases, including function reuse, malware analysis, and indexing scalability. Additionally, we emphasize the scalability aspect of BinSign. We perform experiments on a database of 6 million functions. The indexing process requires an average time of 0.0072 seconds per function. We find that BinSign achieves higher accuracy compared to existing tools.

Software Fingerprinting in LLVM (2021)

  • https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf
    • Software Fingerprinting in LLVM (2021)

    • Executable steganography, the hiding of software machine code inside of a larger program, is a potential approach to introduce new software protection constructs such as watermarks or fingerprints. Software fingerprinting is, therefore, a process similar to steganography, hiding data within other data. The goal of fingerprinting is to hide a unique secret message, such as a serial number, into copies of an executable program in order to provide proof of ownership of that program. Fingerprints are a special case of watermarks, with the difference being that each fingerprint is unique to each copy of a program. Traditionally, researchers describe four aims that a software fingerprint should achieve. These include the fingerprint should be difficult to remove, it should not be obvious, it should have a low false positive rate, and it should have negligible impact on performance. In this research, we propose to extend these objectives and introduce a fifth aim: that software fingerprints should be machine independent. As a result, the same fingerprinting method can be used regardless of the architecture used to execute the program. Hence, this paper presents an approach towards the realization of machine-independent fingerprinting of executable programs. We make use of Low-Level Virtual Machine (LLVM) intermediate representation during the software compilation process to demonstrate both a simple static fingerprinting method as well as a dynamic method, which displays our aim of hardware independent fingerprinting. The research contribution includes a realization of the approach using the LLVM infrastructure and provides a proof of concept for both simple static and dynamic watermarks that are architecture neutral.

Syntax tree fingerprinting for source code similarity detection (2009)

  • https://ieeexplore.ieee.org/document/5090050
    • Syntax tree fingerprinting for source code similarity detection (2009)

    • Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

    • https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf
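
Not the papers' exact hashing scheme, but a minimal sketch of the core idea using Python's `ast` module: hash each subtree's node-type "shape" while ignoring identifier names and literal values, so that structurally identical (e.g. renamed) clones produce colliding fingerprints that can be indexed and matched.

```python
import ast
import hashlib

def subtree_fingerprints(source: str, min_nodes: int = 3):
    """Map each sufficiently large AST subtree to a short structural hash.
    Identifier names and literal values live in node attributes, not child
    nodes, so serializing only node types abstracts them away."""
    tree = ast.parse(source)
    fingerprints = {}

    def serialize(node):
        parts = [type(node).__name__]
        count = 1
        for child in ast.iter_child_nodes(node):
            shape, c = serialize(child)
            parts.append(shape)
            count += c
        shape = "(" + " ".join(parts) + ")"
        if count >= min_nodes:  # skip trivial subtrees to cut noise
            digest = hashlib.sha1(shape.encode()).hexdigest()[:12]
            fingerprints.setdefault(digest, []).append(type(node).__name__)
        return shape, count

    serialize(tree)
    return fingerprints
```

With this abstraction, `def f(x): return x + 1` and `def g(y): return y + 1` yield identical fingerprint sets, while a structurally different function does not.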

Syntax tree fingerprinting: a foundation for source code similarity detection (2011)

  • https://hal.science/hal-00627811/document
    • Syntax tree fingerprinting: a foundation for source code similarity detection (2011)

    • Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)

  • https://ieeexplore.ieee.org/document/9960266
    • Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)

    • Abstract Syntax Tree (AST) is an abstract logical structure of source code represented as a tree. This research utilizes information of fingerprinting with AST to locate the similarities between source codes. The proposed method can detect plagiarism in source codes using the number of duplicated logical structures. The structural information of program is stored in the fingerprints format. Then, the fingerprints of source codes are compared to identify number of similar nodes. The final output is calculated from number of similar nodes known as similarities scores. The result shows that the proposed method accurately captures the common modification techniques from basic to advanced.

Dynamic graph-based software fingerprinting (2007)

  • https://dl.acm.org/doi/abs/10.1145/1286821.1286826
    • Dynamic graph-based software fingerprinting (2007)

    • Fingerprinting embeds a secret message into a cover message. In media fingerprinting, the secret is usually a copyright notice and the cover a digital image. Fingerprinting an object discourages intellectual property theft, or when such theft has occurred, allows us to prove ownership.

      The Software Fingerprinting problem can be described as follows. Embed a structure W into a program P such that: W can be reliably located and extracted from P even after P has been subjected to code transformations such as translation, optimization and obfuscation; W is stealthy; W has a high data rate; embedding W into P does not adversely affect the performance of P; and W has a mathematical property that allows us to argue that its presence in P is the result of deliberate actions.

      In this article, we describe a software fingerprinting technique in which a dynamic graph fingerprint is stored in the execution state of a program. Because of the hardness of pointer alias analysis such fingerprints are difficult to attack automatically.

    • https://dl.acm.org/doi/pdf/10.1145/1286821.1286826

Adaptive Structural Fingerprints for Graph Attention Networks (2019)

  • https://openreview.net/forum?id=BJxWx0NYPr
    • Adaptive Structural Fingerprints for Graph Attention Networks (2019)

    • Graph attention network (GAT) is a promising framework to perform convolution and message passing on graphs. Yet, how to fully exploit rich structural information in the attention mechanism remains a challenge. In the current version, GAT calculates attention scores mainly using node features and among one-hop neighbors, while increasing the attention range to higher-order neighbors can negatively affect its performance, reflecting the over-smoothing risk of GAT (or graph neural networks in general), and the ineffectiveness in exploiting graph structural details. In this paper, we propose an "adaptive structural fingerprint" (ADSF) model to fully exploit graph topological details in graph attention network. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus significantly improving subsequent attention layer as well as the convergence of learning. Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to 'cross-talk' with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Empirical results demonstrate the power of our approach in exploiting rich structural information in GAT and in alleviating the intrinsic oversmoothing problem in graph neural networks.

Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)

  • https://digitalcommons.calpoly.edu/theses/2040/
    • Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)

    • Code clones are pieces of code that have the same functionality. While some clones may structurally match one another, others may look drastically different. The inclusion of code clones clutters a code base, leading to increased costs through maintenance. Duplicate code is introduced through a variety of means, such as copy-pasting, code generated by tools, or developers unintentionally writing similar pieces of code. While manual clone identification may be more accurate than automated detection, it is infeasible due to the extensive size of many code bases. Software code clone detection methods have differing degree of success based on the analysis performed. This thesis outlines a method of detecting clones using a program dependence graph and subgraph isomorphism to identify similar subgraphs, ultimately illuminating clones. The project imposes few constraints when comparing code segments to potentially reveal more clones.

    • https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses

Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)

  • https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W
    • Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)

    • The code clone detection issue has been researched using a number of explicit factors based on the tokens and contents and found effective results. However, exposing code contents may be an impractical option because of privacy and security factors. Moreover, the lack of scalability of past methods is an important challenge. The code flow states can be inferred by code structure and implicitly represented using empirical graphs. The assumption is that modelling of the code clone detection problem can be achieved without the content of the codes being revealed. Here, a Graph-of-Code concept for the code clone detection problem is introduced, which represents codes into graphs. While Graph-of-Code provides structural properties and quantification of its characteristics, it can exclude code contents or tokens to identify the clone type. The aim is to evaluate the impact of graph-of-code structural properties on the performance of code clone detection. This work employs a feature extraction-based approach for unlabelled graphs. The approach generates a “Graph Fingerprint” which represents different topological feature levels. The results of code clone detection indicate that code structure has a significant role in detecting clone types. We found different GoC-models outperform others. The models achieve between 96% to 99% in detecting code clones based on recall, precision, and F1-Score. The GoC approach is capable in detecting code clones with scalable dataset and with preserving codes privacy.

A graph-based code representation method to improve code readability classification (2023)

  • https://www.researchgate.net/publication/370980383_A_graph-based_code_representation_method_to_improve_code_readability_classification
    • A graph-based code representation method to improve code readability classification (2023)

    • Context Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to the source code being classified as pre-defined certain levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. Objective However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Method Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to reserve the semantic structural information and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural networks model composing Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. Result We evaluate our approach to the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. Conclusion We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.

Link Dump 2

The below content was originally posted in the following comment (April 30, 2024: Ref)

It has been further refined/enhanced since.

OpenAI Embeddings

This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space:
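
A naive sketch of the retrieval side of that approach: once each snippet has an embedding vector (from some embeddings API or code-tuned model; the vectors in the usage example below are toy stand-ins), similarity search reduces to cosine similarity over those vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(query_vec, corpus):
    """Return the id of the corpus snippet whose embedding is closest
    to the query embedding. corpus: {snippet_id: vector}."""
    return max(corpus, key=lambda k: cosine(query_vec, corpus[k]))
```

For large corpora this brute-force scan would be replaced by an approximate nearest-neighbour index, but the ranking criterion is the same.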

Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2)

Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here:

Wikipedia Articles, etc

A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)

  • https://arxiv.org/abs/2306.16171
    • A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)

    • Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance.

A comparison of code similarity analysers (2017)

  • https://link.springer.com/article/10.1007/s10664-017-9564-7
    • A comparison of code similarity analysers (2017)

    • Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

Winnowing: Local Algorithms for Document Fingerprinting (2003)

  • https://www.researchgate.net/publication/2840981_Winnowing_Local_Algorithms_for_Document_Fingerprinting
    • Winnowing: Local Algorithms for Document Fingerprinting (2003)

    • Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with Moss, a widely-used plagiarism detection service.

    • https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
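
A minimal implementation of the winnowing scheme the paper describes: hash every k-gram of the text, then slide a window of w consecutive hashes and keep the minimum of each window (rightmost on ties), recording its position. The hash function and the k/w values here are arbitrary choices for the sketch; the guarantee is that any match of length at least w + k - 1 shares a fingerprint.

```python
import hashlib

def kgram_hashes(text: str, k: int):
    """Hash every overlapping k-character substring to a 32-bit value."""
    return [int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16) % (1 << 32)
            for i in range(len(text) - k + 1)]

def winnow(text: str, k: int = 5, w: int = 4):
    """Return a set of (hash, position) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        # pick the rightmost occurrence of the minimum in the window
        pos = i + max(j for j, h in enumerate(window) if h == m)
        fingerprints.add((m, pos))
    return fingerprints
```

Comparing the hash values (ignoring positions) of two documents' fingerprint sets then detects shared chunks even when they sit at different offsets, which is essentially how Moss-style plagiarism detection indexes documents.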

Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023)

A Source Code Similarity System for Plagiarism Detection (2013)

  • https://www.researchgate.net/publication/262322336_A_Source_Code_Similarity_System_for_Plagiarism_Detection
    • A Source Code Similarity System for Plagiarism Detection (2013)

    • Source code plagiarism is an easy to do task, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cut-off threshold values of 35–70%.

A Source Code Similarity Based on Siamese Neural Network (2020)

  • https://www.mdpi.com/2076-3417/10/21/7519
    • A Source Code Similarity Based on Siamese Neural Network (2020)

    • Finding similar code snippets is a fundamental task in the field of software engineering. Several approaches have been proposed for this task by using statistical language model which focuses on syntax and structure of codes rather than deep semantic information underlying codes. In this paper, a Siamese Neural Network is proposed that maps codes into continuous space vectors and try to capture their semantic meaning. Firstly, an unsupervised pre-trained method that models code snippets as a weighted series of word vectors. The weights of the series are fitted by the Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network trained model is constructed to learn semantic vector representation of code snippets. Finally, the cosine similarity is provided to measure the similarity score between pairs of code snippets. Moreover, we have implemented our approach on a dataset of functionally similar code. The experimental results show that our method improves some performance over single word embedding method.
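
The final scoring step of the pipeline above, cosine similarity between snippet vectors, is easy to sketch. Here plain bag-of-tokens counts stand in for the paper's TF-IDF-weighted Siamese embeddings; the snippets and tokenization are illustrative:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

snippet_a = Counter("def add ( a , b ) : return a + b".split())
snippet_b = Counter("def plus ( x , y ) : return x + y".split())
score = cosine(snippet_a, snippet_b)  # shared structure, different identifiers
```

In the paper, the vectors compared this way are the learned semantic embeddings, so renamed identifiers like `add` vs `plus` would pull the score up rather than down.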

Detecting Source Code Similarity Using Compression (2019)

  • https://www.researchgate.net/publication/337196468_Detecting_Source_Code_Similarity_Using_Compression
    • Detecting Source Code Similarity Using Compression (2019)

    • Different forms of plagiarism make a fair assessment of student assignments more difficult. Source code plagiarisms pose a significant challenge especially for automated assessment systems aimed for students' programming solutions. Different automated assessment systems employ different text or source code similarity detection tools, and all of these tools have their advantages and disadvantages. In this paper, we revitalize the idea of similarity detection based on string complexity and compression. We slightly adapt an existing, third-party, approach, implement it and evaluate its potential on synthetically generated cases and on a small set of real student solutions. On synthetic cases, we showed that average deviation (in absolute values) from the expected similarity is less than 1% (0.94%). On the real-life examples of student programming solutions we compare our results with those of two established tools. The average difference is around 18.1% and 11.6%, while the average difference between those two tools is 10.8%. However, the results of all three tools follow the same trend. Finally, a deviation to some extent is expected as observed tools apply different approaches that are sensitive to other factors of similarities. Gained results additionally demonstrate open challenges in the field.

    • https://ceur-ws.org/Vol-2508/paper-pri.pdf
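
The string-complexity idea the paper revitalizes is usually formalized as the Normalized Compression Distance (NCD). A minimal sketch using zlib as the compressor (the paper adapts an existing third-party approach, so its exact formula and compressor may differ):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance: near 0 for near-identical inputs,
    # approaching 1 for unrelated inputs.
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

solution_a = b"for (int i = 0; i < n; i++) { sum += a[i]; }"
solution_b = b"for (int j = 0; j < n; j++) { total += arr[j]; }"
unrelated  = b"SELECT name, email FROM users ORDER BY created_at;"

copied_score = ncd(solution_a, solution_b)   # renamed variables only
distinct_score = ncd(solution_a, unrelated)  # structurally different content
```

The intuition: if y largely repeats material from x, compressing the concatenation x + y costs little more than compressing x alone, so the numerator stays small.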

Binary code similarity analysis based on naming function and common vector space (2023)

  • https://www.nature.com/articles/s41598-023-42769-9
    • Binary code similarity analysis based on naming function and common vector space (2023)

    • Binary code similarity analysis is widely used in the field of vulnerability search where source code may not be available to detect whether two binary functions are similar or not. Based on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from the shortcomings of large differences in instruction syntaxes across different target platforms, inability to align control flow graph nodes, and less introduction of high-level semantics of stability, which pose challenges for identifying similar computations between binary functions of different platforms generated from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and a cross-platform binary function similarity comparison model N_Match is proposed. The model elevates different platform instructions to the same semantic space to shield their underlying platform instruction differences, uses graph embedding technology to learn the stability semantics of neighbors, extracts high-level knowledge of naming function to alleviate the differences brought about by cross-platform and cross-optimization levels, and combines the stable graph structure as well as the stable, platform-independent API knowledge of naming function to represent the final semantics of functions. The experimental results show that the model accuracy of N_Match outperforms the baseline model in terms of cross-platform, cross-optimization level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, the mAP exceeds the current graph embedding model by 66%. In addition, we also give several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match

REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023)

  • https://arxiv.org/abs/2305.03843
    • REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023)

    • This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training. We present the first-ever code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements, demonstrating that both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine tuning, even when enhancing the largest available LLMs, highlighting the importance of open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called REINFOREST.

Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021)

  • https://www.usenix.org/conference/usenixsecurity21/presentation/ahmadi
    • Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021)

    • Probabilistic classification has shown success in detecting known types of software bugs. However, the works following this approach tend to require a large amount of specimens to train their models. We present a new machine learning-based bug detection technique that does not require any external code or samples for training. Instead, our technique learns from the very codebase on which the bug detection is performed, and therefore, obviates the need for the cumbersome task of gathering and cleansing training samples (e.g., buggy code of certain kinds). The key idea behind our technique is a novel two-step clustering process applied on a given codebase. This clustering process identifies code snippets in a project that are functionally-similar yet appear in inconsistent forms. Such inconsistencies are found to cause a wide range of bugs, anything from missing checks to unsafe type conversions. Unlike previous works, our technique is generic and not specific to one type of inconsistency or bug. We prototyped our technique and evaluated it using 5 popular open source software, including QEMU and OpenSSL. With a minimal amount of manual analysis on the inconsistencies detected by our tool, we discovered 22 new unique bugs, despite the fact that many of these programs are constantly undergoing bug scans and new bugs in them are believed to be rare.

    • https://www.usenix.org/system/files/sec21summer_ahmadi.pdf

MOSS: A System for Detecting Software Similarity (1997?)

antiplag - similarity checking software for program codes, documents, and pictures (2019)

  • https://github.com/fanghon/antiplag
    • antiplag - similarity checking software for program codes, documents, and pictures (2019) The software mainly checks and compares the similarities between electronic assignments submitted by students. It can analyze source code in multiple programming languages (such as Java, C/C++, Python, etc.) and text in multiple document formats (txt, doc, docx, pdf, etc.), in English as well as simplified and traditional Chinese, and can also compare image similarity across multiple formats (png, jpg, gif, bmp, etc.). It outputs the code, text, and images with high similarity, thereby helping to detect plagiarism between students.

SCOSS - A Source Code Similarity System (2021)

Dolos (2019-2024+)

  • https://github.com/dodona-edu/dolos
    • Dolos (2019-2024+) Dolos is a source code plagiarism detection tool for programming exercises. Dolos helps teachers in discovering students sharing solutions, even if they are modified. By providing interactive visualizations, Dolos can also be used to sensitize students to prevent plagiarism.

    • https://dolos.ugent.be/
    • https://dolos.ugent.be/about/algorithm.html
      • How Dolos works Conceptually, the plagiarism detection pipeline of Dolos can be split into four successive steps:

        • Tokenization
        • Fingerprinting
        • Indexing
        • Reporting
      • Tokenization To be immune against masking plagiarism by techniques such as renaming variables and functions, Dolos doesn't directly process the source code under investigation. It starts by performing a tokenization step using Tree-sitter. Tree-sitter can generate syntax trees for many programming languages, converts source code to a more structured form, and masks specific naming of variables and functions.

      • Fingerprinting To measure similarities between (converted) files, Dolos tries to find common sequences of tokens. More specifically, it uses subsequences of fixed length called k-grams. To efficiently make these comparisons and reduce the memory usage, all k-grams are hashed using a rolling hash function (the one used by the Rabin-Karp string matching algorithm). The length k of the k-grams can be altered with the -k option.

        To further reduce the memory usage, only a subset of all hashes is stored. The selection of hashes is done by the Winnowing algorithm, as described by Schleimer, Wilkerson and Aiken. In short: only the hash with the smallest numerical value is kept for each window. The window length (in k-grams) can be altered with the -w option.

        The remaining hashes are the fingerprints of the analyzed files. Internally, these are stored as simple integers.

      • Indexing Because Dolos needs to compare all files with each other, it is more efficient to first create an index containing the fingerprints of all files. For each of the fingerprints encountered in any of the files, we store the file and the corresponding line number where we encountered that fingerprint.

        As soon as a fingerprint is stored in the index twice, this is recorded as a match between the two files because they share at least one k-gram.

      • Reporting Dolos finally collects all fingerprints that occur in more than one file and aggregates the results into a report.

        This report contains all file pairs that have at least one common fingerprint, together with some metrics:

        • similarity: the fraction of shared fingerprints between the two files
        • total overlap: the absolute value of shared fingerprints, useful for larger projects
        • longest fragment: the length (in fingerprints) of the longest subsequence of fingerprints matching between the two files, useful when not the whole source code is copied
    • https://dolos.ugent.be/about/languages.html
    • https://dolos.ugent.be/about/publications.html
      • Publications Dolos is developed by Team Dodona at Ghent University in Belgium. Our research is published in the following journals and conferences.
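
The indexing and reporting steps described above can be sketched as an inverted index from fingerprint to file. The fingerprint sets and file names below are made up, and normalising the shared-fingerprint count by the union of both sets is one possible reading of Dolos's "fraction of shared fingerprints" metric, not necessarily its exact formula:

```python
from collections import defaultdict
from itertools import combinations

def build_report(fingerprints: dict[str, set[int]]) -> dict[tuple[str, str], float]:
    # Indexing: map each fingerprint to the set of files containing it.
    index: dict[int, set[str]] = defaultdict(set)
    for name, fps in fingerprints.items():
        for fp in fps:
            index[fp].add(name)
    # A fingerprint stored for two files records a match between them.
    shared: dict[tuple[str, str], int] = defaultdict(int)
    for files in index.values():
        for pair in combinations(sorted(files), 2):
            shared[pair] += 1
    # Reporting: similarity as the fraction of shared fingerprints per pair.
    return {
        (a, b): count / len(fingerprints[a] | fingerprints[b])
        for (a, b), count in shared.items()
    }

report = build_report({
    "alice.js": {11, 22, 33, 44},
    "bob.js":   {33, 44, 55},
    "carol.js": {99},
})
```

Only pairs with at least one common fingerprint appear in the report, so a file like carol.js with no shared fingerprints never costs a pairwise comparison.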

MinHash-based Code Relationship & Investigation Toolkit (MCRIT) (2021-2025+)

  • https://github.com/danielplohmann/mcrit
    • MinHash-based Code Relationship & Investigation Toolkit (MCRIT) (2021-2025+) MCRIT is a framework created to simplify the application of the MinHash algorithm in the context of code similarity. It can be used to rapidly implement "shinglers", i.e. methods which encode properties of disassembled functions, to then be used for similarity estimation via the MinHash algorithm. It is tailored to work with disassembly reports emitted by SMDA.
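
The MinHash estimation that MCRIT builds on can be sketched generically. The instruction strings below are illustrative stand-ins for whatever properties a "shingler" would actually encode from SMDA's disassembly, and the seeded-SHA-256 hash family is an arbitrary choice:

```python
import hashlib

def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    # One signature slot per seeded hash function: the minimum hash over all items.
    def h(seed: int, item: str) -> int:
        return int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big")
    return [min(h(seed, item) for item in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of agreeing slots is an unbiased estimate of the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

func_a = {"push rbp", "mov rbp, rsp", "call strlen", "test eax, eax", "ret"}
func_b = {"push rbp", "mov rbp, rsp", "call strcpy", "test eax, eax", "ret"}
estimate = estimated_jaccard(minhash_signature(func_a), minhash_signature(func_b))
```

The point of the compact fixed-size signatures is that they, rather than the full shingle sets, are what gets stored and compared at scale.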

1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis (2021)

  • https://arxiv.org/abs/2112.12928
    • 1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis (2021)

    • Binary similarity analysis is critical to many code-reuse-related issues and "1-to-1" mechanism is widely applied, where one function in a binary file is matched against one function in a source file or binary file. However, we discover that function mapping is a more complex problem of "1-to-n" or even "n-to-n" due to the existence of function inlining.

      In this paper, we investigate the effect of function inlining on binary similarity analysis. We first construct 4 inlining-oriented datasets for four similarity analysis tasks, including code search, OSS reuse detection, vulnerability detection, and patch presence test. Then, we further study the extent of function inlining, the performance of existing works under function inlining, and the effectiveness of existing inlining-simulation strategies. Results show that the proportion of function inlining can reach nearly 70%, while most existing works neglect it and use "1-to-1" mechanism. The mismatches cause a 30% loss in performance during code search and a 40% loss during vulnerability detection. Moreover, two existing inlining-simulation strategies can only recover 60% of the inlined functions. We discover that inlining is usually cumulative when optimization increases. Conditional inlining and incremental inlining are suggested to design low-cost and high-coverage inlining-simulation strategies.

    • https://arxiv.org/pdf/2112.12928
    • https://github.com/island255/TOSEM2022
      • Repository for the paper "1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis"

One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)

  • https://deepai.org/publication/one-to-one-or-one-to-many-what-function-inlining-brings-to-binary2source-similarity-analysis
    • One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)

  • https://arxiv.org/abs/2112.12928v1
    • One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)

    • Binary2source code matching is critical to many code-reuse-related tasks, including code clone detection, software license violation detection, and reverse engineering assistance. Existing binary2source works always apply a "1-to-1" (one-to-one) mechanism, i.e., one function in a binary file is matched against one function in a source file. However, we assume that such mapping is usually a more complex problem of "1-to-n" (one-to-many) due to the existence of function inlining. To the best of our knowledge, few existing works have systematically studied the effect of function inlining on binary2source matching tasks. This paper will address this issue. To support our study, we first construct two datasets containing 61,179 binaries and 19,976,067 functions. We also propose an automated approach to label the dataset with line-level and function-level mapping. Based on our labeled dataset, we then investigate the extent of function inlining, the factors affecting function inlining, and the impact of function inlining on existing binary2source similarity methods. Finally, we discuss the interesting findings and give suggestions for designing more effective methodologies.

    • https://arxiv.org/pdf/2112.12928v1
    • https://github.com/island255/source2binary_dataset_construction
      • Source2binary Dataset Construction This is the repository for the paper "One to One or One to many? What function inline brings to binary similarity analysis".

  • https://www.researchgate.net/publication/357365866_One-to-One_or_One-to-many_What_function_inlining_brings_to_binary2source_similarity_analysis
    • One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis

Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining (2022)

  • https://arxiv.org/abs/2210.15159
    • Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining (2022)

    • Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis (SCA). The "1-to-1" mechanism has been applied in existing binary2source matching works, in which one binary function is matched against one source function. However, we discovered that such mapping could be "1-to-n" (one query binary function maps multiple source functions), due to the existence of function inlining.

      To help conduct binary2source function matching under function inlining, we propose a method named O2NMatcher to generate Source Function Sets (SFSs) as the matching target for binary functions with inlining. We first propose a model named ECOCCJ48 for inlined call site prediction. To train this model, we leverage the compilable OSS to generate a dataset with labeled call sites (inlined or not), extract several features from the call sites, and design a compiler-opt-based multi-label classifier by inspecting the inlining correlations between different compilations. Then, we use this model to predict the labels of call sites in the uncompilable OSS projects without compilation and obtain the labeled function call graphs of these projects. Next, we regard the construction of SFSs as a sub-tree generation problem and design root node selection and edge extension rules to construct SFSs automatically. Finally, these SFSs will be added to the corpus of source functions and compared with binary functions with inlining. We conduct several experiments to evaluate the effectiveness of O2NMatcher and results show our method increases the performance of existing works by 6% and exceeds all the state-of-the-art works.

    • https://arxiv.org/pdf/2210.15159
  • https://github.com/island255/binary2source-matching-under-function-inlining
    • binary2source-matching-under-function-inlining This is the repository illustrating how we label the inlined call sites, train the classifier for ICS prediction, and generate SFSs for binary2source matching.

    • Repository for the paper "Binary2Source Function Similarity Detection Under Function Inlining"

Cross-Inlining Binary Function Similarity Detection (2024)

  • https://arxiv.org/abs/2401.05739v1
    • Cross-Inlining Binary Function Similarity Detection (2024)

    • Binary function similarity detection plays an important role in a wide range of security applications. Existing works usually assume that the query function and target function share equal semantics and compare their full semantics to obtain the similarity. However, we find that the function mapping is more complex, especially when function inlining happens.

      In this paper, we will systematically investigate cross-inlining binary function similarity detection. We first construct a cross-inlining dataset by compiling 51 projects using 9 compilers, with 4 optimizations, to 6 architectures, with 2 inlining flags, which results in two datasets both with 216 combinations. Then we construct the cross-inlining function mappings by linking the common source functions in these two datasets. Through analysis of this dataset, we find that three cross-inlining patterns widely exist while existing work suffers when detecting cross-inlining binary function similarity. Next, we propose a pattern-based model named CI-Detector for cross-inlining matching. CI-Detector uses the attributed CFG to represent the semantics of binary functions and GNN to embed binary functions into vectors. CI-Detector respectively trains a model for these three cross-inlining patterns. Finally, the testing pairs are input to these three models and all the produced similarities are aggregated to produce the final similarity. We conduct several experiments to evaluate CI-Detector. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.

    • https://arxiv.org/pdf/2401.05739v1
    • https://github.com/island255/cross-inlining_binary_function_similarity
      • The repository of the paper "Cross-Inlining Binary Function Similarity Detection"

Pcode-Similarity (2021)

Awesome Binary code similarity detection (2021)

SCALE: Semantic Code Analysis via Learned Embeddings (2023)

  • https://github.com/Jaso1024/Semantic-Code-Embeddings
    • SCALE: Semantic Code Analysis via Learned Embeddings (2023) 3rd best paper on Artificial Intelligence track | presented at the 2023 International Conference on AI, Blockchain, Cloud Computing and Data Analytics This repository holds the code and supplementary materials for SCALE: Semantic Code Analysis via Learned Embeddings. This research explores the efficacy of contrastive learning alongside large language models as a paradigm for developing a model capable of creating code embeddings indicative of code on a functional level. Existing pre-trained models in NLP have demonstrated impressive success, surpassing previous benchmarks in various language-related tasks. However, when it comes to the field of code understanding, these models still face notable limitations. Code isomorphism, which deals with determining functional similarity between pieces of code, presents a challenging problem for NLP models. In this paper, we explore two approaches to code isomorphism. Our first approach, dubbed SCALE-FT, formulates the problem as a binary classification task, where we feed pairs of code snippets to a Large Language Model (LLM), using the embeddings to predict whether the given code segments are equivalent. The second approach, SCALE-CLR, adopts the SimCLR framework to generate embeddings for individual code snippets. By processing code samples with an LLM and observing the corresponding embeddings, we assess the similarity of two code snippets. These approaches enable us to leverage function-based code embeddings for various downstream tasks, such as code-optimization, code-comment alignment, and code classification. Our experiments on the CodeNet Python800 benchmark demonstrate promising results for both approaches. Notably, our SCALE-FT using Babbage-001 (GPT-3) achieves state-of-the-art performance, surpassing various benchmark models such as GPT-3.5 Turbo and GPT-4. Additionally, Salesforce's 350-million parameter CodeGen, when trained with the SCALE-FT framework, surpasses GPT-3.5 and GPT-4.

binary-sim - binary similarity using Deep learning (2023)

Source Code Clone Detection Using Unsupervised Similarity Measures (2024)

  • https://arxiv.org/abs/2401.09885
    • Source Code Clone Detection Using Unsupervised Similarity Measures (2024)

    • Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for identifying source code clone detection. The goal is to overview the current state-of-the-art techniques, their strengths, and weaknesses. To do that, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at this https URL

    • https://github.com/jorge-martinez-gil/codesim
      • Source Code Clone Detection Using Unsupervised Similarity Measures

      • This repository contains the source code for reproducing the paper Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-031-56281-5_2.

Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024)

Link Dump 3

Improved Code Summarization via a Graph Neural Network (2020)

  • https://arxiv.org/abs/2004.02843
    • Improved Code Summarization via a Graph Neural Network (2020)

    • Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and output a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe using a graph neural network together with source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from machine learning literature.

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2020)

  • https://arxiv.org/abs/2002.08653
    • Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2020)

    • Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still can not fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we have concerned, we are the first to apply graph neural networks on the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.

    • https://github.com/jacobwwh/graphmatch_clone
      • Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree

      • Code and data for paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".

Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)

  • https://proceedings-of-deim.github.io/DEIM2023/1b-9-4.pdf
    • Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)

    • While giving comprehensible names to identifiers is essential in software development, it is sometimes difficult since it requires development experience and knowledge of the application domain. Among work to support the developer’s identifier naming, a GNN-based class name estimation approach learns a graph of relationships between program elements, i.e., classes, methods, and fields, but it ignores information within the methods. This study proposes an approach that exploits information from method bodies, which can help estimate correct class names. The proposed approach extends the existing GNN-based approach to use embeddings of the corresponding ASTs for method nodes. An evaluation experiment measures how correctly the proposed approach can estimate class names in large datasets of open-source Java projects. The experimental result shows that the proposed approach improves the estimation correctness compared to the existing approach.

Code Similarity Using Graph Neural Networks (2023)

  • https://medium.com/stanford-cs224w/code-similarity-using-graph-neural-networks-1e58aa21bd92
    • Code Similarity Using Graph Neural Networks (2023)

    • Abstract/Summary by ChatGPT 4.5:
      • Code similarity detection is crucial for various software engineering tasks, including plagiarism detection, code search, refactoring, and automated code completion. Traditional approaches rely heavily on syntactic similarity, which fails to capture deeper semantic relationships between code segments. Inspired by recent advances in natural language processing and code intelligence using transformer-based models (e.g., BERT, GPT, and CodeBERT), our work explores the use of Graph Neural Networks (GNNs) to address code similarity through the semantic understanding provided by graph structures.

        We evaluate several GNN architectures—including GraphSAGE, Graph Attention Networks (GAT), and a novel OrderGNN leveraging permutation-aware aggregations—on the widely-used POJ-104 dataset, consisting of 32,000 C++ code segments spanning 64 distinct programming problems. Our pipeline involves parsing source code into Abstract Syntax Trees (ASTs) using the CLANG library, transforming these ASTs into NetworkX graphs, and subsequently into PyTorch Geometric (PyG) data objects for input into our GNN models.

        Our results demonstrate that permutation-invariant methods such as GraphSAGE and GAT struggle to capture critical ordered structures inherent in programming languages, resulting in limited performance (MAP@R). In contrast, the OrderGNN model, employing LSTM-based aggregation to preserve node ordering information, achieves significantly better semantic similarity identification, highlighting the necessity of permutation-awareness for effective code analysis. Nevertheless, the OrderGNN model presents substantial computational and memory overhead, limiting scalability.

        We conclude by suggesting future directions, including the exploration of more memory-efficient permutation-aware aggregation functions and alternative graph representations beyond the standard AST structure to further improve the efficacy and applicability of GNN-based code similarity detection methods.
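The pipeline described above parses C++ into ASTs with CLANG before converting them into NetworkX graphs and PyG data objects. As a minimal stdlib-only analogue of that first step (using Python's `ast` module instead of CLANG, purely for illustration), source code can be turned into node labels plus parent-child edges ready for any graph tooling:

```python
import ast

def ast_to_graph(source: str):
    """Parse source into an AST and emit (labels, edges):
    labels maps node id -> AST node type name; edges are parent->child pairs."""
    tree = ast.parse(source)
    labels, edges = {}, []

    def walk(node, parent_id=None):
        nid = len(labels)
        labels[nid] = type(node).__name__
        if parent_id is not None:
            edges.append((parent_id, nid))
        for child in ast.iter_child_nodes(node):
            walk(child, nid)
        return nid

    walk(tree)
    return labels, edges

labels, edges = ast_to_graph("def f(x):\n    return x + 1\n")
print(labels[0], len(labels), len(edges))
```

From here, the `(labels, edges)` pair maps directly onto a NetworkX `DiGraph` (and then a PyG `Data` object) in the way the post describes.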

Link Dump 4

JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games (2020)

  • https://taoxiease.github.io/publications/icse20seip-jsidentify.pdf
    • JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games

    • Online mini games are lightweight game apps, typically implemented in JavaScript (JS), that run inside another host mobile app (such as WeChat, Baidu, and Alipay). These mini games do not need to be downloaded or upgraded through an app store, making it possible for one host mobile app to perform the aggregated services of many apps. Hundreds of millions of users play tens of thousands of mini games, which make a great profit, and consequently are popular targets of plagiarism. In cases of plagiarism, deeply obfuscated code cloned from the original code often embodies malicious code segments and copyright infringements, posing great challenges for existing plagiarism detection tools. To address these challenges, in this paper, we design and implement JSidentify, a hybrid framework to detect plagiarism among online mini games. JSidentify includes three techniques based on different levels of code abstraction. JSidentify applies the included techniques in the constructed priority list one by one to reduce overall detection time. Our evaluation results show that JSidentify outperforms other existing related state-of-the-art approaches and achieves the best precision and recall with affordable detection time when detecting plagiarism among online mini games and clones among general JS programs. Our deployment experience of JSidentify also shows that JSidentify is indispensable in the daily operations of online mini games in WeChat.

  • https://ieeexplore.ieee.org/document/9276581
    • JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games

  • https://www.researchgate.net/publication/344433961_JSidentify_a_hybrid_framework_for_detecting_plagiarism_among_JavaScript_code_in_online_mini_games
    • JSidentify: a hybrid framework for detecting plagiarism among JavaScript code in online mini games (June 2020)

Relationship-aware code search for JavaScript frameworks (2016)

  • https://taoxiease.github.io/publications/fse16-racs.pdf
    • Relationship-aware code search for JavaScript frameworks

    • JavaScript frameworks, such as jQuery, are widely used for developing web applications. To facilitate using these JavaScript frameworks to implement a feature (e.g., functionality), a large number of programmers often search for code snippets that implement the same or similar feature. However, existing code search approaches tend to be ineffective, without taking into account the fact that JavaScript code snippets often implement a feature based on various relationships (e.g., sequencing, condition, and callback relationships) among the invoked framework API methods. To address this issue, we present a novel Relationship-Aware Code Search (RACS) approach for finding code snippets that use JavaScript frameworks to implement a specific feature. In advance, RACS collects a large number of code snippets that use some JavaScript frameworks, mines API usage patterns from the collected code snippets, and represents the mined patterns with method call relationship (MCR) graphs, which capture framework API methods’ signatures and their relationships. Given a natural language (NL) search query issued by a programmer, RACS conducts NL processing to automatically extract an action relationship (AR) graph, which consists of actions and their relationships inferred from the query. In this way, RACS reduces code search to the problem of graph search: finding similar MCR graphs for a given AR graph. We conduct evaluations against representative real-world jQuery questions posted on Stack Overflow, based on 308,294 code snippets collected from over 81,540 files on the Internet. The evaluation results show the effectiveness of RACS: the top 1 snippet produced by RACS matches the target code snippet for 46% questions, compared to only 4% achieved by a relationship-oblivious approach.

  • https://dl.acm.org/doi/10.1145/2950290.2950341
    • Relationship-aware code search for JavaScript frameworks

Code Search: A Survey of Techniques for Finding Code (2022)

  • https://arxiv.org/abs/2204.02765
    • Code Search: A Survey of Techniques for Finding Code

    • The immense amounts of source code provide ample challenges and opportunities during software development. To handle the size of code bases, developers commonly search for code, e.g., when trying to find where a particular feature is implemented or when looking for code examples to reuse. To support developers in finding relevant code, various code search engines have been proposed. This article surveys 30 years of research on code search, giving a comprehensive overview of challenges and techniques that address them. We discuss the kinds of queries that code search engines support, how to preprocess and expand queries, different techniques for indexing and retrieving code, and ways to rank and prune search results. Moreover, we describe empirical studies of code search in practice. Based on the discussion of prior work, we conclude the article with an outline of challenges and opportunities to be addressed in the future.

    • https://arxiv.org/pdf/2204.02765
      • Code Search: A Survey of Techniques for Finding Code

  • https://www.researchgate.net/publication/359786256_Code_Search_A_Survey_of_Techniques_for_Finding_Code
    • Code Search: A Survey of Techniques for Finding Code

graph2vec: Learning Distributed Representations of Graphs (2017)

  • https://arxiv.org/abs/1707.05005
    • graph2vec: Learning Distributed Representations of Graphs (2017)

    • Recent works on representation learning for graph structured data predominantly focus on learning distributed representations of graph substructures such as nodes and subgraphs. However, many graph analytics tasks such as graph classification and clustering require representing entire graphs as fixed length feature vectors. While the aforementioned approaches are naturally unequipped to learn such representations, graph kernels remain as the most effective way of obtaining them. However, these graph kernels use handcrafted features (e.g., shortest paths, graphlets, etc.) and hence are hampered by problems such as poor generalization. To address this limitation, in this work, we propose a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs. graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic. Hence, they could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Our experiments on several benchmark and large real-world datasets show that graph2vec achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and are competitive with state-of-the-art graph kernels.

    • https://arxiv.org/pdf/1707.05005
    • https://github.com/benedekrozemberczki/graph2vec
      • Graph2Vec

      • A parallel implementation of "graph2vec: Learning Distributed Representations of Graphs" (MLGWorkshop 2017).

      • The model is now also available in the Karate Club package.

    • https://github.com/annamalai-nr/graph2vec_tf
      • This repository contains the "tensorflow" implementation of our paper "graph2vec: Learning distributed representations of graphs".
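graph2vec's features are built from Weisfeiler-Lehman (WL) style rooted-subgraph relabeling, which is then fed into a doc2vec-style skip-gram model. The relabeling step itself is simple enough to sketch in stdlib Python (a toy illustration of the idea, not the paper's implementation; graphs with overlapping WL label sets are structurally similar):

```python
def wl_relabel(adj, labels, iterations=2):
    """One-hop Weisfeiler-Lehman relabeling: each node's new label hashes
    its own label together with the sorted labels of its neighbours.
    Returns the set of all labels observed across iterations."""
    features = set(labels.values())
    for _ in range(iterations):
        new = {}
        for node, neigh in adj.items():
            signature = (labels[node], tuple(sorted(labels[n] for n in neigh)))
            new[node] = str(hash(signature))
        labels = new
        features |= set(labels.values())
    return features

# A triangle and a 3-node path: same node count, different structure.
tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
f_tri = wl_relabel(tri, {n: "v" for n in tri})
f_path = wl_relabel(path, {n: "v" for n in path})
print(f_tri == f_path)
```

In graph2vec proper, these per-graph label sets play the role of "words" in a "document" (the graph), and the skip-gram training produces the fixed-length graph embedding.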

SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (2018; revised 2020)

  • https://arxiv.org/abs/1808.05689
    • SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (2018; revised 2020)

    • Graph similarity search is among the most important graph-based applications, e.g. finding the chemical compounds that are most similar to a query compound. Graph similarity computation, such as Graph Edit Distance (GED) and Maximum Common Subgraph (MCS), is the core operation of graph similarity search and many other applications, but very costly to compute in practice. Inspired by the recent success of neural network approaches to several graph applications, such as node or graph classification, we propose a novel neural network based approach to address this classic yet challenging graph problem, aiming to alleviate the computational burden while preserving a good performance.

        The proposed approach, called SimGNN, combines two strategies. First, we design a learnable embedding function that maps every graph into a vector, which provides a global summary of a graph. A novel attention mechanism is proposed to emphasize the important nodes with respect to a specific similarity metric. Second, we design a pairwise node comparison method to supplement the graph-level embeddings with fine-grained node-level information. Our model achieves better generalization on unseen graphs, and in the worst case runs in quadratic time with respect to the number of nodes in two graphs. Taking GED computation as an example, experimental results on three real graph datasets demonstrate the effectiveness and efficiency of our approach. Specifically, our model achieves smaller error rate and great time reduction compared against a series of baselines, including several approximation algorithms on GED computation, and many existing graph neural network based models. To the best of our knowledge, we are among the first to adopt neural networks to explicitly model the similarity between two graphs, and provide a new direction for future research on graph similarity computation and graph similarity search.
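SimGNN's first strategy, mapping each graph to a global summary vector and comparing vectors instead of computing exact GED, can be illustrated with a deliberately crude stand-in: a degree histogram as the "embedding" and cosine similarity as the comparison (toy sketch only; SimGNN learns its embedding with attention rather than using handcrafted histograms):

```python
import math
from collections import Counter

def degree_histogram(adj, max_degree=8):
    """Crude graph-level summary vector: a capped histogram of node degrees.
    A hand-built stand-in for a learned graph embedding."""
    counts = Counter(min(len(neigh), max_degree) for neigh in adj.values())
    return [counts.get(d, 0) for d in range(max_degree + 1)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(cosine(degree_histogram(tri), degree_histogram(path)))
```

The point of the learned version is exactly what this toy lacks: a histogram cannot distinguish many structurally different graphs, whereas a trained embedding plus the pairwise node comparison recovers much of the fidelity of GED at a fraction of the cost.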

awesome-network-embedding

  • https://github.com/chihming/awesome-network-embedding
    • awesome-network-embedding

    • A curated list of network embedding techniques.

    • Also called network representation learning, graph embedding, knowledge embedding, etc.

      The task is to learn the representations of the vertices from a given network.

Karate Club

  • https://karateclub.readthedocs.io/en/latest/
    • Karate Club is an unsupervised machine learning extension library for NetworkX. It builds on other open source linear algebra, machine learning, and graph signal processing libraries such as Numpy, Scipy, Gensim, PyGSP, and Scikit-Learn. Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and pieces from prominent journals.

NetworkX

  • https://networkx.org/
    • NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

    • Software for complex networks

      • Data structures for graphs, digraphs, and multigraphs
      • Many standard graph algorithms
      • Network structure and analysis measures
      • Generators for classic graphs, random graphs, and synthetic networks
      • Nodes can be "anything" (e.g., text, images, XML records)
      • Edges can hold arbitrary data (e.g., weights, time-series)
      • Open source 3-clause BSD license
      • Well tested with over 90% code coverage
      • Additional benefits from Python include fast prototyping, easy to teach, and multi-platform
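The "nodes can be anything, edges hold arbitrary data" points above are the reason NetworkX keeps showing up as the intermediate representation in these pipelines: AST nodes, file names, or function signatures can be used as node keys directly. A small example (module names and edge attributes here are hypothetical, just to show the shape of the API):

```python
import networkx as nx

# Nodes can be any hashable object; both nodes and edges take arbitrary attributes.
G = nx.Graph()
G.add_node("login.js", kind="module")
G.add_node("auth.js", kind="module")
G.add_edge("login.js", "auth.js", weight=3, relation="imports")

print(G.number_of_nodes(), G.number_of_edges())
print(nx.shortest_path(G, "login.js", "auth.js"))
```

From a graph like this, the standard algorithms (shortest paths, isomorphism checks, centrality measures) come for free, and conversion helpers exist for handing the graph on to embedding libraries such as Karate Club.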

Software Similarity and Classification (2012; Book; Silvio Cesare, Yang Xiang)

  • https://books.google.com.au/books/about/Software_Similarity_and_Classification.html?id=Fy_mNhg2lK4C
  • https://link.springer.com/book/10.1007/978-1-4471-2909-7
    • Software Similarity and Classification

    • Authors: Silvio Cesare, Yang Xiang

    • Number of Pages: XIV, 88

    • The first book to construct a theory to describe the problems in software similarity and classification

    • Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing those features into suitable representations, and constructing distance metrics to define similarity and dissimilarity are the key methods to identify software variants, clones, derivatives, and classes of software. Software Similarity and Classification reviews the literature of those core concepts, in addition to relevant literature in each application and demonstrates that considering these applied problems as a similarity and classification problem enables techniques to be shared between areas. Additionally, the authors present in-depth case studies using the software similarity and classification techniques developed throughout the book.

    • Includes supplementary material: https://extras.springer.com/?query=978-1-4471-2908-0

      • 1 zip file containing 3 PDFs:
        • Table of Contents (6 pages)
            • 1 Introduction (6 pages)
              • 1.1 Background
              • 1.2 Applications of Software Similarity and Classification
              • 1.3 Motivation
              • 1.4 Problem Formulization
              • 1.5 Problem Overview
              • 1.6 Aims and Scope
              • 1.7 Book Organization
              • References
            • 2 Taxonomy of Program Features (10 pages)
              • 2.1 Syntactic Features
                • 2.1.1 Raw Code
                • 2.1.2 Abstract Syntax Trees
                • 2.1.3 Variables
                • 2.1.4 Pointers
                • 2.1.5 Instructions
                • 2.1.6 Basic Blocks
                • 2.1.7 Procedures
                • 2.1.8 Control Flow Graphs
                • 2.1.9 Call Graphs
                • 2.1.10 Object Inheritances and Dependencies
              • 2.2 Semantic Features
                • 2.2.1 API Calls
                • 2.2.2 Data Flow
                • 2.2.3 Procedure Dependence Graphs
                • 2.2.4 System Dependence Graph
              • 2.3 Taxonomy of Features in Program Binaries
                • 2.3.1 Object File Formats
                • 2.3.2 Headers
                • 2.3.3 Object Code
                • 2.3.4 Symbols
                • 2.3.5 Debugging Information
                • 2.3.6 Relocations
                • 2.3.7 Dynamic Linking Information
              • 2.4 Case Studies
                • 2.4.1 Portable Executable
                • 2.4.2 Executable and Linking Format
                • 2.4.3 Java Class File
              • References
            • 3 Program Transformations and Obfuscations (10 pages)
              • 3.1 Compiler Optimization and Recompilation
                • 3.1.1 Instruction Reordering
                • 3.1.2 Loop Invariant Code Motion
                • 3.1.3 Code Fusion
                • 3.1.4 Function Inlining
                • 3.1.5 Loop Unrolling
                • 3.1.6 Branch/Loop Inversion
                • 3.1.7 Strength Reduction
                • 3.1.8 Algebraic Identities
                • 3.1.9 Register Reassignment
              • 3.2 Program Obfuscation
              • 3.3 Plagiarism, Software Theft, and Derivative Works
                • 3.3.1 Semantic Changes
                • 3.3.2 Code Insertion
                • 3.3.3 Code Deletion
                • 3.3.4 Code Substitution
                • 3.3.5 Code Transposition
              • 3.4 Malware Packing, Polymorphism, and Metamorphism
                • 3.4.1 Dead Code Insertion
                • 3.4.2 Instruction Substitution
                • 3.4.3 Variable Renaming
                • 3.4.4 Code Reordering
                • 3.4.5 Branch Obfuscation
                • 3.4.6 Branch Inversion and Flipping
                • 3.4.7 Opaque Predicate Insertion
                • 3.4.8 Malware Obfuscation Using Code Packing
                • 3.4.9 Traditional Code Packing
                • 3.4.10 Shifting Decode Frame
                • 3.4.11 Instruction Virtualization and Malware Emulators
              • 3.5 Features under Program Transformations
              • References
            • 4 Formal Methods of Program Analysis (12 pages)
              • 4.1 Static Feature Extraction
              • 4.2 Formal Syntax and Lexical Analysis
              • 4.3 Parsing
              • 4.4 Intermediate Representations
                • 4.4.1 Intermediate Code Generation
                • 4.4.2 Abstract Machines
                • 4.4.3 Basic Blocks
                • 4.4.4 Control Flow Graph
                • 4.4.5 Call Graph
              • 4.5 Formal Semantics of Programming Languages
                • 4.5.1 Operational Semantics
                • 4.5.2 Denotational Semantics
                • 4.5.3 Axiomatic Semantics
              • 4.6 Theorem Proving
                • 4.6.1 Hoare Logic
                • 4.6.2 Predicate Transformer Semantics
                • 4.6.3 Symbolic Execution
              • 4.7 Model Checking
              • 4.8 Data Flow Analysis
                • 4.8.1 Partially Ordered Sets
                • 4.8.2 Lattices
                • 4.8.3 Monotone Functions and Fixed Points
                • 4.8.4 Fixed Point Solutions to Monotone Functions
                • 4.8.5 Dataflow Equations
                • 4.8.6 Dataflow Analysis Examples
                • 4.8.7 Reaching Definitions
                • 4.8.8 Live Variables
                • 4.8.9 Available Expressions
                • 4.8.10 Very Busy Expressions
                • 4.8.11 Classification of Dataflow Analyses
              • 4.9 Abstract Interpretation
                • 4.9.1 Widening and Narrowing
              • 4.10 Intermediate Code Optimisation
              • 4.11 Research Opportunities
              • References
            • 5 Static Analysis of Binaries (8 pages)
              • 5.1 Disassembly
              • 5.2 Intermediate Code Generation
              • 5.3 Procedure Identification
              • 5.4 Procedure Disassembly
              • 5.5 Control Flow Analysis, Deobfuscation and Reconstruction
              • 5.6 Pointer Analysis
              • 5.7 Decompilation of Binaries
                • 5.7.1 Condition Code Elimination
                • 5.7.2 Stack Variable Reconstruction
                • 5.7.3 Preserved Register Detection
                • 5.7.4 Procedure Parameter Reconstruction
                • 5.7.5 Reconstruction of Structured Control Flow
                • 5.7.6 Type Reconstruction
              • 5.8 Obfuscation and Limits to Static Analysis
              • 5.9 Research Opportunities
              • References
            • 6 Dynamic Analysis (6 pages)
              • 6.1 Relationship to Static Analysis
              • 6.2 Environments
              • 6.3 Debugging
              • 6.4 Hooking
              • 6.5 Dynamic Binary Instrumentation
              • 6.6 Virtualization
              • 6.7 Application Level Emulation
              • 6.8 Whole System Emulation
              • References
            • 7 Feature Extraction (4 pages)
              • 7.1 Processing Program Features
              • 7.2 Strings
              • 7.3 Vectors
              • 7.4 Sets
              • 7.5 Sets of Vectors
              • 7.6 Trees
              • 7.7 Graphs
              • 7.8 Embeddings
              • 7.9 Kernels
              • 7.10 Research Opportunities
              • References
            • 8 Software Birthmark Similarity (8 pages)
              • 8.1 Distance Metrics
              • 8.2 String Similarity
                • 8.2.1 Levenshtein Distance
                • 8.2.2 Smith-Waterman Algorithm
                • 8.2.3 Longest Common Subsequence (LCS)
                • 8.2.4 Normalized Compression Distance
              • 8.3 Vector Similarity
                • 8.3.1 Euclidean Distance
                • 8.3.2 Manhattan Distance
                • 8.3.3 Cosine Similarity
              • 8.4 Set Similarity
                • 8.4.1 Dice Coefficient
                • 8.4.2 Jaccard Index
                • 8.4.3 Jaccard Distance
                • 8.4.4 Containment
                • 8.4.5 Overlap Coefficient
                • 8.4.6 Tversky Index
              • 8.5 Set of Vectors Similarity
              • 8.6 Tree Similarity
              • 8.7 Graph Similarity
                • 8.7.1 Graph Isomorphism
                • 8.7.2 Graph Edit Distance
                • 8.7.3 Maximum Common Subgraph
              • References
            • 9 Software Similarity Searching and Classification (6 pages)
              • 9.1 Instance-Based Learning and Nearest Neighbour
                • 9.1.1 k Nearest Neighbours Query
                • 9.1.2 Range Query
                • 9.1.3 Metric Trees
                • 9.1.4 Locality Sensitive Hashing
                • 9.1.5 Distributed Similarity Search
              • 9.2 Statistical Machine Learning
                • 9.2.1 Vector Space Models
                • 9.2.2 Kernel Methods
              • 9.3 Research Opportunities
              • References
            • 10 Applications (6 pages)
              • 10.1 Malware Classification
                • 10.1.1 Raw Code
                • 10.1.2 Instructions
                • 10.1.3 Basic Blocks
                • 10.1.4 API Calls
                • 10.1.5 Control Flow and Data Flow
                • 10.1.6 Data Flow
                • 10.1.7 Call Graph
                • 10.1.8 Control Flow Graphs
              • 10.2 Software Theft Detection (Static Approaches)
                • 10.2.1 Instructions
                • 10.2.2 Control Flow
                • 10.2.3 API Calls
                • 10.2.4 Object Dependencies
              • 10.3 Software Theft Detection (Dynamic Approaches)
                • 10.3.1 Instructions
                • 10.3.2 Control Flow
                • 10.3.3 API Calls
                • 10.3.4 Dependence Graphs
              • 10.4 Plagiarism Detection
                • 10.4.1 Raw Code and Tokens
                • 10.4.2 Parse Trees
                • 10.4.3 Program Dependency Graph
              • 10.5 Code Clone Detection
                • 10.5.1 Raw Code and Tokens
                • 10.5.2 Abstract Syntax Tree
                • 10.5.3 Program Dependency Graph
              • 10.6 Critical Analysis
              • References
            • 11 Future Trends and Conclusion
              • 11.1 Future Trends
              • 11.2 Conclusion
        • Preface (1 page)
        • Chapter 2: Taxonomy of Program Features (10 pages)

Unsorted

  • https://binary.ninja/2022/06/20/introducing-tanto.html#potential-uses-and-some-speculation
    • What I’ve found most interesting, and have been speculating about, is using variable slices like these (though not directly through the UI) in the function fingerprinting space. I’ve long suspected that a dataflow-based approach to fingerprinting might prove to be robust against compiler optimizations and versions, as well as source code changes that don’t completely redefine the implementation of a function. Treating each variable slice as a record of what happens to data within a function, a similarity score for two slices could be generated from the count of matching operations, matching constant interactions (2 + var_a), and matching variable interactions (var_f + var_a). Considering all slices, a confidence metric could be derived for whether two functions match. Significant research would be required to answer these questions concretely… and, if you could solve subgraph isomorphism at the same time, that’d be great!
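The scoring idea speculated about in that quote (matching operations and matching constant/variable interactions across two variable slices) can be sketched very simply if each slice is treated as a multiset of normalized operation tokens and scored by multiset-Jaccard overlap. All names and tokens below are hypothetical, chosen only to make the shape of the comparison concrete:

```python
from collections import Counter

def slice_similarity(slice_a, slice_b):
    """Toy version of the speculated metric: each variable slice is a
    multiset of normalized operations (e.g. 'add const:2', 'mul var'),
    scored by multiset-Jaccard overlap (intersection over union of counts)."""
    a, b = Counter(slice_a), Counter(slice_b)
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 0.0

# Hypothetical slices for "the same" variable in two builds of a function.
s1 = ["load var", "add const:2", "mul var", "store var"]
s2 = ["load var", "add const:2", "store var"]
print(slice_similarity(s1, s2))
```

A function-level confidence score, as the quote suggests, would then aggregate these pairwise slice scores (e.g. best-match assignment across all slices of the two functions), with robustness to optimization hinging on how the operation tokens are normalized.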

See Also

My Other Related Deepdive Gists and Projects
