Some notes and tools on fingerprinting minified JavaScript libraries, AST fingerprinting, source code similarity, etc.
- Original Notes
- ChatGPT Explorations
- Musings
- Code Search
- npm Package Ranking, Bundle Size, etc
- Link Dump 1
- Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity
- Program Dependence Graph, Control Flow Graph, Data Flow Graph, Data Flow Analysis, Program Analysis Tools, etc
- Stack Overflow: Assembly-level function fingerprint (2011)
- Systems and methods for detecting copied computer code using fingerprints (2016)
- A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)
- BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)
- Software Fingerprinting in LLVM (2021)
- Syntax tree fingerprinting for source code similarity detection (2009)
- Syntax tree fingerprinting: a foundation for source code similarity detection (2011)
- Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)
- Dynamic graph-based software fingerprinting (2007)
- Adaptive Structural Fingerprints for Graph Attention Networks (2019)
- Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)
- Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)
- A graph-based code representation method to improve code readability classification (2023)
- Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity
- Link Dump 2
- OpenAI Embeddings
- Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2)
- Wikipedia Articles, etc
- A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)
- A comparison of code similarity analysers (2017)
- Winnowing: Local Algorithms for Document Fingerprinting (2003)
- Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023)
- A Source Code Similarity System for Plagiarism Detection (2013)
- A Source Code Similarity Based on Siamese Neural Network (2020)
- Detecting Source Code Similarity Using Compression (2019)
- Binary code similarity analysis based on naming function and common vector space (2023)
- REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023)
- Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021)
- MOSS: A System for Detecting Software Similarity (1997?)
- antiplag - similarity checking software for program codes, documents, and pictures (2019)
- SCOSS - A Source Code Similarity System (2021)
- Dolos (2019-2024+)
- MinHash-based Code Relationship & Investigation Toolkit (MCRIT) (2021-2025+)
- 1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis (2021)
- One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)
- Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining (2022)
- Cross-Inlining Binary Function Similarity Detection (2024)
- Pcode-Similarity (2021)
- Awesome Binary code similarity detection (2021)
- SCALE: Semantic Code Analysis via Learned Embeddings (2023)
- binary-sim - binary similarity using Deep learning (2023)
- Source Code Clone Detection Using Unsupervised Similarity Measures (2024)
- Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024)
- Link Dump 3
- Improved Code Summarization via a Graph Neural Network (2020)
- Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2020)
- Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)
- Code Similarity Using Graph Neural Networks (2023)
- Link Dump 4
- JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games (2020)
- Relationship-aware code search for JavaScript frameworks (2016)
- Code Search: A Survey of Techniques for Finding Code (2022)
- graph2vec: Learning Distributed Representations of Graphs (2017)
- SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (2018; revised 2020)
- awesome-network-embedding
- Karate Club
- NetworkX
- Software Similarity and Classification (2012; Book; Silvio Cesare, Yang Xiang)
- Unsorted
- See Also
This gist was created because there was too much content related to this topic to keep tacking it onto my older gist on Deobfuscating / Unminifying Obfuscated Web App / JavaScript Code. Until I move all of the relevant content from there to this gist, here is a link to the main notes I was keeping track of there (largely copies of my comments on various relevant GitHub repos exploring this topic, plus related research / tools / etc):
fingerprinting-minified-javascript-libraries.md
-
Fingerprinting Minified JavaScript Libraries
-
- https://chatgpt.com/c/d2713f5a-19ee-41fe-836d-0db4ba3daeac
- Public Share (created 2025-03-25): https://chatgpt.com/share/67e25fc8-f638-8008-a610-3edaa6614072
- Private ChatGPT conversation about various things related to AST fingerprinting/etc; or as it summarised itself:
-
This chat explored how to create a stable and efficient system for fingerprinting and identifying variables in minified JavaScript code using structural patterns from AST analysis. We examined how tools like
eslint-scope
can help extract scope and reference data, discussed structural fingerprinting techniques inspired by academic research, and considered which JavaScript elements typically survive minification (like strings, symbols, and function structures). Finally, we developed an enhanced AST traversal script that categorizes these preserved elements by context—scopes, functions, classes, and modules—to make them easier to understand and analyze.
-
- TODO: Summarise/pull out the relevant parts from this and include them here
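As a toy illustration of the point above (strings and structural tokens tend to survive minification while identifier names do not), here's a minimal sketch. It uses a deliberately crude regex "tokenizer"; a real implementation would parse to an AST and use scope data (e.g. via `eslint-scope`) as discussed in the chat. Everything here (the regex, the sample functions) is invented for illustration:

```python
import hashlib
import re

# Crude tokenizer: keep string literals and structural tokens (which
# survive minification), and normalize identifiers (which don't).
TOKEN_RE = re.compile(
    r"""("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')     # string literals
      | ([{}()\[\];,.:?]|=>|function|class|return) # structural tokens/keywords
      | ([A-Za-z_$][\w$]*)                         # identifiers (normalized)
    """,
    re.VERBOSE,
)

def fingerprint(source: str) -> str:
    """Hash a normalized token stream of a JS snippet."""
    parts = []
    for string_lit, struct, ident in TOKEN_RE.findall(source):
        if string_lit:
            parts.append(string_lit)
        elif struct:
            parts.append(struct)
        elif ident:
            parts.append("ID")  # all identifier names collapse to one token
    return hashlib.sha256(" ".join(parts).encode()).hexdigest()[:16]

# The same function before and after minification collides; a
# structurally different function does not.
a = fingerprint("function add(first, second) { return first + second; }")
b = fingerprint("function a(b,c){return b+c;}")
c = fingerprint("function a(b){return b['key'].length;}")
```

The same idea extends naturally to richer structural features (AST node-type sequences, scope shapes, etc.) rather than raw tokens.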
- https://chatgpt.com/c/67e25d5d-1aa4-8008-ac08-c971ac64090e
- Public Share (created 2025-03-25): https://chatgpt.com/share/67e25f3a-b604-8008-9d83-e12c738eb306
- Private ChatGPT conversation about various things related to identifying NPM imports in a bundled apps module import/export graph; or as it summarised itself:
-
This chat discusses techniques for analyzing a module dependency graph extracted from a bundled and minified JavaScript web app to identify subgraphs likely representing third-party library code. It covers methods such as graph clustering (e.g., Louvain, spectral clustering), centrality analysis, import tree depth, symbol naming heuristics, fingerprint/signature matching, entropy analysis, and dynamic profiling. These approaches help isolate self-contained, library-like clusters that can potentially be "sliced off" from the main application logic, supporting the goal of distinguishing app code from imported npm dependencies.
-
- TODO: Summarise/pull out the relevant parts from this and include them here
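To make the "self-contained, library-like cluster" intuition above concrete: one simple score for a candidate subgraph is its conductance, the fraction of its incident edges that cross the cluster boundary. This is only a sketch (module names and edges are invented; a real pipeline would use proper community detection such as Louvain, as mentioned in the summary):

```python
def conductance(cluster, edges):
    """Fraction of edges touching `cluster` that leave it.
    Low values suggest a self-contained, library-like subgraph
    that could be "sliced off" from the app's module graph."""
    cluster = set(cluster)
    internal = external = 0
    for a, b in edges:
        if a in cluster and b in cluster:
            internal += 1
        elif a in cluster or b in cluster:
            external += 1
    return external / (internal + external)

# Hypothetical bundle module graph: app code on one side, a vendored
# library on the other, joined by a single import of the entry point.
edges = [
    ("app", "util"), ("app", "view"), ("util", "view"),
    ("app", "lib/index"),
    ("lib/index", "lib/a"), ("lib/index", "lib/b"), ("lib/a", "lib/b"),
]

lib_score = conductance(["lib/index", "lib/a", "lib/b"], edges)   # low
mixed_score = conductance(["app", "lib/a"], edges)                # high
```

A cluster with low conductance and no outgoing imports into app code is a strong candidate for being an npm dependency.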
- https://x.com/_devalias/status/1905905312093053215
-
@_devalias (March 29 2025)
I wonder who's going to give me robust embedding based search across the entire open source ecosystem first.. @github code search, or @Sourcegraph?
Ideally not just at a file level, but at a function level.
I don't think either do currently, but I may not have read deep enough
- https://x.com/_devalias/status/1905905692869042316
-
@_devalias (March 29 2025)
Basically, given a random snippet of (potentially minified) code from a JS bundle; I want to be able to create an embedding for it, and then search for that across the whole NPM package ecosystem / open-source JS repos; and be able to identify which dependency it is.
-
- https://x.com/_devalias/status/1905906049917542735
-
@_devalias (March 29 2025)
There are ways I could do this currently, by extracting various 'stable' / 'salient' parts from the module and then using the regex/etc search features for it.
But it just kind of feels like being able to find the closest matches based on a code embedding would be even nicer.
-
- https://x.com/_devalias/status/1905906304100778169
-
@_devalias (March 29 2025)
Bonus points would also be if by searching via that embedding, not only did it end up matching the library I wanted; but if it identified the specific version/commit/similar because it was a closer match.
-
- https://x.com/_devalias/status/1905908702215094752
-
@_devalias (March 29 2025)
My ever-growing deep dive gist of thoughts/resources/research/etc tangentially related to this and similar:
Fingerprinting Minified JavaScript Libraries / AST Fingerprinting / Source Code Similarity / Etc
-
- https://x.com/_devalias/status/1906298255278739858
-
@_devalias (March 30 2025)
Or as the wildcard entry.. just stumbled across @boyter 's http://searchcode.com + Blog full of a literal goldmine of interesting looking content related to it!
(will be dumping a pile of interesting looking blog links into my aforementioned gist as soon as I can)
-
-
- https://docs.github.com/en/search-github/github-code-search/about-github-code-search
-
About GitHub Code Search
You can search, navigate and understand code across GitHub with code search.
- https://docs.github.com/en/search-github/github-code-search/about-github-code-search#limitations
-
Limitations
We have indexed many public repositories for code search, and continue to index more. Additionally, the private repositories of GitHub users are indexed and searchable by those that already have access to those private repositories on GitHub. However, very large repositories may not be indexed at this time, and not all code is indexed.
The current limitations on indexed code are:
- Vendored and generated code is excluded
- Empty files and files over 350 KiB are excluded
- Lines over 1,024 characters long are truncated
- Binary files (PDF, etc.) are excluded
- Only UTF-8 encoded files are included
- Very large repositories may not be indexed
- Exhaustive search is not supported
- Files with more than one line over 4096 bytes are excluded
We currently only support searching for code on the default branch of a repository. The query length is limited to 1000 characters.
Results for any search with code search are restricted to 100 results (5 pages). Sorting is not supported for code search results at this time. This limitation only applies to searching code with the new code search and does not apply to other types of searches.
If you use the `path:` qualifier for a file that's in multiple repositories with similar content, GitHub will only show a few of those files. If this happens, you can choose to expand by clicking Show identical files at the bottom of the page.
Code search supports searching for symbol definitions in code, such as function or class definitions, using the `symbol:` qualifier. However, note that the `symbol:` qualifier only searches for definitions and not references, and not all symbol types or languages are fully supported yet.
-
-
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax
-
Understanding GitHub Code Search syntax
You can build search queries for the results you want with specialized code qualifiers, regular expressions, and boolean operations.
-
The search syntax in this article only applies to searching code with GitHub code search. Note that the syntax and qualifiers for searching for non-code content, such as issues, users, and discussions, is not the same as the syntax for code search. For more information on non-code search, see About searching on GitHub and Searching on GitHub.
- https://docs.github.com/en/search-github/getting-started-with-searching-on-github/understanding-the-search-syntax
-
Understanding the search syntax
When searching GitHub, you can construct queries that match specific numbers and words.
-
Note: The syntax below applies to non-code search. For more information on code search syntax, see Understanding GitHub Code Search syntax.
-
Search queries consist of search terms, comprising text you want to search for, and qualifiers, which narrow down the search.
-
A bare term with no qualifiers will match either the content of a file or the file's path.
-
You can enter multiple terms separated by whitespace to search for documents that satisfy both terms.
-
Searching for multiple terms separated by whitespace is equivalent to the search `hello AND world`. Other boolean operations, such as `hello OR world`, are also supported.
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-boolean-operations
-
Using boolean operations
-
Code search supports boolean expressions. You can use the operators `AND`, `OR`, and `NOT` to combine search terms.
-
By default, adjacent terms separated by whitespace are equivalent to using the `AND` operator.
-
You can use parentheses to express more complicated boolean expressions.
-
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-boolean-operations
-
Code search also supports searching for an exact string, including whitespace.
-
You can narrow your code search with specialized qualifiers, such as `repo:`, `language:`, and `path:`
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-qualifiers
-
Using qualifiers
You can use specialized keywords to qualify your search.
- Repository qualifier (`repo:`)
- Organization (`org:`) and user (`user:`) qualifiers
- Language qualifier (`language:`)
- Path qualifier (`path:`)
- Symbol qualifier (`symbol:`)
-
You can search for symbol definitions in code, such as function or class definitions, using the `symbol:` qualifier. Symbol search is based on parsing your code using the open source Tree-sitter parser ecosystem, so no extra setup or build tool integration is required.
-
In some languages, you can search for symbols using a prefix (e.g. a prefix of their class name). For example, for a method `deleteRows` on a struct `Maint`, you could search `symbol:Maint.deleteRows` if you are using Go, or `symbol:Maint::deleteRows` in Rust.
You can also use regular expressions with the symbol qualifier.
-
Note that this qualifier only searches for definitions and not references, and not all symbol types or languages are fully supported yet. Symbol extraction is supported for the following languages:
- Bash
- C
- C#
- C++
- CodeQL
- Elixir
- Go
- JSX
- Java
- JavaScript
- Lua
- PHP
- Protocol Buffers
- Python
- R
- Ruby
- Rust
- Scala
- Starlark
- Swift
- Typescript
We are working on adding support for more languages. If you would like to help contribute to this effort, you can add support for your language in the open source Tree-sitter parser ecosystem, upon which symbol search is based.
-
- Content qualifier (`content:`)
- Is qualifier (`is:`)
-
To filter based on repository properties, you can use the `is:` qualifier. `is:` supports the following values:
- `archived`: restricts the search to archived repositories.
- `fork`: restricts the search to forked repositories.
- `vendored`: restricts the search to content detected as vendored.
- `generated`: restricts the search to content detected as generated.
-
-
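The syntax rules quoted above (space-separated terms and qualifiers, quotes for exact strings, slashes for regexes) compose mechanically, so a tiny helper can build valid queries. The qualifier names are the documented ones; the function name and defaults are invented for illustration:

```python
import re

def build_query(term, *, repo=None, language=None, path=None,
                symbol=None, exact=False, regex=False):
    """Compose a GitHub code search query string.
    Terms and qualifiers must be separated by spaces; exact strings
    are quoted; regexes are wrapped in slashes."""
    if regex:
        term = f"/{term}/"
    elif exact or re.search(r"\s", term):
        term = f'"{term}"'  # whitespace forces an exact-string match
    parts = [term]
    if repo:
        parts.append(f"repo:{repo}")
    if language:
        parts.append(f"language:{language}")
    if path:
        parts.append(f"path:{path}")
    if symbol:
        parts.append(f"symbol:{symbol}")
    return " ".join(parts)

q = build_query("deleteRows", repo="octocat/hello-world",
                language="go", symbol="Maint.deleteRows")
```

For the case-sensitivity trick described later in the same doc, `build_query("(?-i)True", regex=True)` yields `/(?-i)True/`.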
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-qualifiers
-
You can also use regular expressions in your searches by surrounding the expression in slashes.
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-regular-expressions
-
Using regular expressions
Code search supports regular expressions to search for patterns in your code. You can use regular expressions in bare search terms as well as within many qualifiers, by surrounding the regex in slashes.
-
Most common regular expressions features work in code search. However, "look-around" assertions are not supported.
-
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#using-regular-expressions
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#searching-for-quotes-and-backslashes
-
Searching for quotes and backslashes
-
To search for code containing a quotation mark, you can escape the quotation mark using a backslash.
-
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#separating-search-terms
-
Separating search terms
All parts of a search, such as search terms, exact strings, regular expressions, qualifiers, parentheses, and the boolean keywords `AND`, `OR`, and `NOT`, must be separated from one another with spaces. The one exception is that items inside parentheses, `(` `)`, don't need to be separated from the parentheses.
If your search contains multiple components that aren't separated by spaces, or other text that does not follow the rules listed above, code search will try to guess what you mean. It often falls back on treating that component of your query as the exact text to search for.
-
If code search guesses wrong, you can always get the search you wanted by using quotes and spaces to make the meaning clear.
-
- https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax#case-sensitivity
-
Case sensitivity
By default, code search is case-insensitive, and results will include both uppercase and lowercase results. You can do case-sensitive searches by using a regular expression with case insensitivity turned off. For example, to search for the string "True", you would use:
`/(?-i)True/`
-
-
- https://docs.github.com/en/search-github/searching-on-github/searching-code
-
Searching code (legacy)
You only need to use the legacy code search syntax if you are using the code search API.
- https://docs.github.com/en/rest/search/search#search-code
-
Search code
Searches for query terms inside of a file. This method returns up to 100 results per page.
-
GET /search/code
-
-
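The legacy endpoint above can be driven from a script with only the standard library. This sketch just constructs the request without sending it (the endpoint, `q`/`per_page`/`page` parameters, Accept header, and Bearer auth scheme are the documented ones; the function name is invented):

```python
from urllib.parse import urlencode
from urllib.request import Request

def search_code_request(query, token=None, per_page=30, page=1):
    """Build (but don't send) a GET /search/code request against the
    GitHub REST API. Pass a token for authenticated rate limits."""
    params = urlencode({"q": query, "per_page": per_page, "page": page})
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return Request(f"https://api.github.com/search/code?{params}",
                   headers=headers)

req = search_code_request("winnowing language:python")
```

Sending it is then just `urllib.request.urlopen(req)` (subject to the 100-result cap and rate limits noted above).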
- https://www.youtube.com/watch?v=QCs76SC1ZZ0
-
YouTube: The technology behind GitHub's new code search - Universe 2022
-
- https://github.blog/engineering/architecture-optimization/the-technology-behind-githubs-new-code-search/
-
The technology behind GitHub’s new code search (February 6, 2023)
A look at what went into building the world’s largest public code search index.
- TODO: read through this and include more relevant snippets here
- https://news.ycombinator.com/item?id=34681223
-
I've worked alongside the CEO/CTO of Sourcegraph for the past 8 years, everyone else is at our company offsite so I figured I'd chime in :) nobody asked me to write this (nor did I ask) :)
The article is a top-notch technical write-up, the devs on GitHub code search should be proud of what they've achieved so far!
Honestly, we're rooting for GitHub to improve their code search, viewing them as a close peer, not a competitor. We also maintain OSS projects like Zoekt, which IIRC GitLab is maybe looking at using for their own. The more devs that 'get' code search, the better off Sourcegraph is frankly!
GitHub has a nice intuitive/simple UX, we could learn a thing or two there (though, easier to do with less features.)
Still, Sourcegraph search tech is quite a bit more powerful:
- Searching over commit messages, diffs, filename, etc. are super nice for tracking down regressions / finding 'that PR I swear my coworker made'
- Expressiveness like "find this regexp in repositories, but only if the repo has had a commit in the last month AND has a file named package.json in its root"
- Since Steve Yegge joined us, we've started thinking about ranking of search results, a notoriously difficult thing to do well in code search unless you have great factors to rank on (e.g. a semantic understanding of code): https://about.sourcegraph.com/blog/new-search-ranking
- We stream results back, so you can get a comprehensive set of results - not just a few pages, from our API.
- Works in GitHub Enterprise, not just GitHub.com. Plus on all your code hosts, think BitBucket, GitLab, Azure DevOps, Gerrit, Phabricator, etc. and even non-Git VCS like Perforce.
- Respects permissions of all your code hosts (a very difficult problem, as there are no official APIs to query this info from code hosts in general)
Having code search is one thing, but using it is another:
- Code Insights (we use search as an API to gather statistics about code, track code quality, keywords, etc. both over time and retroactively and let you build dashboards)
- Batch changes (find+replace, but over thousands of repositories. Run a Docker container per repo, run your custom linter script etc. and then draft or send PRs to thousands of repos, manage/track campaigns with thousands of PRs like that over time, etc.)
- Precise code intel / semantic awareness of code, we use SCIP indexers for this (spiritual successor to Microsoft's LSIF format for indexing LSP servers.)
I am super happy GitHub continues to push their code search effort, and genuinely believe it's a great thing for all developers and us over at Sourcegraph. Also excited to see when they do their public rollout of this :)
Anyway, that's just my take as someone who works there; other Sourcegraphers will chime in later if anything I said above feels off to them I'm sure :)
- https://sourcegraph.com/blog/new-search-ranking
-
Rethinking search results ranking on Sourcegraph.com
-
Announcing Search Ranking and Relevance
I’m thrilled to announce that Sourcegraph has launched PageRank-driven Code Search result rankings that prioritize relevance and showing reusable code. This launched today for searches on popular OSS repos on https://sourcegraph.com/ , and we are working to bring ranking to private Sourcegraph deployments soon.
-
Sourcegraph’s new search ranking uses a rendition of the Google PageRank algorithm on source code, powered by the code symbol graph from our sophisticated code intelligence platform (CIP).
-
Why is using PageRank for Code Search so revolutionary and effective? Let’s dig in.
-
For web pages, Google’s PageRank tracks which pages are pointed at (referenced) most often by other web pages. PageRank is a measure of how “cool” they are: Who’s pointing at them?
For source code, the pointing hands are code usages: function calls, imports, that sort of thing. If there’s only one arm pointing at a smiley, that’s a code use. But if more than one arm is pointing in… that’s reuse! The big yellow smiley is being reused by more code than any other smiley in the diagram. The PageRank algorithm uncovered this fact.
The implication here is that PageRank is a measure of code reuse. Which makes it an incredibly powerful ranking signal. Because when you’re doing a code search, you are almost always looking for code you can reuse.
- TODO: read through this and include more relevant snippets here
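The core of the idea ("rank symbols by how much other code points at them") is just PageRank over a directed reference graph. A self-contained toy version, with an invented call graph (Sourcegraph's actual ranking pipeline is of course far more involved, and mixes in many other signals):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank over a directed graph {node: [referenced nodes]}.
    Heavily-referenced (i.e. heavily-reused) symbols score highest."""
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        # Redistribute rank from dangling nodes (no outgoing references).
        dangling = sum(rank[n] for n in nodes if not graph.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

# Toy symbol graph: three call sites reuse `parse`, only one uses `helper`.
calls = {
    "main": ["parse", "helper"],
    "cli": ["parse"],
    "server": ["parse"],
}
ranks = pagerank(calls)
```

With this graph, `parse` outranks `helper`, matching the "more arms pointing in means more reuse" intuition from the post.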
-
-
-
- https://github.blog/engineering/a-brief-history-of-code-search-at-github/
-
A brief history of code search at GitHub (December 15, 2021)
This blog post tells the story of why we built a new search engine optimized for code.
-
We want to share more about our work on code exploration, navigation, search, and developer productivity. Recently, we substantially improved the precision of our code navigation for Python, and open-sourced the tools we developed for this. The stack graph formalism we developed will form the basis for precise code navigation support for more languages, and will even allow us to empower language communities to build and improve support for their own languages, similarly to how we accept contributions to github/linguist to expand GitHub’s syntax highlighting capabilities.
- TODO: read through this and include more relevant snippets here
-
- https://github.blog/open-source/introducing-stack-graphs/
-
Introducing stack graphs (December 9, 2021 | Updated July 23, 2024)
Precise code navigation is powered by stack graphs, a new open source framework that lets you define the name binding rules for a programming language.
-
Today, we announced the general availability of precise code navigation for all public and private Python repositories on GitHub.com. Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job. In this post, I’ll dig into how stack graphs work, and how they achieve these results.
- TODO: read through this and include more relevant snippets here
- https://dcreager.net/talks/stack-graphs/
-
Incremental, zero-config Code Navigation using stack graphs.
Exploring a large or unfamiliar codebase can be tricky. Code Navigation features like “jump to definition” and “find all references” let you discover how different pieces of code relate to each other. To power these features, we need to extract lists of symbols from the code, and describe the language-specific rules for how those symbols relate to each other.
It’s difficult to add Code Nav to a large hosted service like GitHub, where we must support hundreds of programming languages, hundreds of millions of repositories, and petabytes of history. At this scale, we have a different set of design constraints than a local IDE. We need our data extraction to be incremental, so that we can reuse previous results for files that haven’t changed in a newly pushed commit, saving both compute and storage costs. And to support cross-repo lookups, it should require zero configuration — repo owners should not have to set up anything manually to activate the feature.
In this talk I’ll describe stack graphs, which use a graphical notation to define the name binding rules for a programming language. They work equally well for dynamic languages like Python and JavaScript, and for static languages like Go and Java. Our solution is fast — processing most commits within seconds of us receiving your push. It does not require setting up a CI job, or tapping into a project-specific build process. And it is open-source, building on the tree-sitter project’s existing ecosystem of language tools.
- Presentation: https://www.youtube.com/watch?v=l2R1PTGcwrE
-
YouTube: "Incremental, zero-config Code Nav using stack graphs" by Douglas Creager
-
- Slides: https://media.dcreager.net/dcreager-strange-loop-2021-slides.pdf
-
- https://arxiv.org/abs/2211.01224
-
Stack graphs: Name resolution at scale (2022)
-
We present stack graphs, an extension of Visser et al.'s scope graphs framework. Stack graphs power Precise Code Navigation at GitHub, allowing users to navigate name binding references both within and across repositories. Like scope graphs, stack graphs encode the name binding information about a program in a graph structure, in which paths represent valid name bindings. Resolving a reference to its definition is then implemented with a simple path-finding search.
GitHub hosts millions of repositories, containing petabytes of total code, implemented in hundreds of different programming languages, and receiving thousands of pushes per minute. To support this scale, we ensure that the graph construction and path-finding judgments are file-incremental: for each source file, we create an isolated subgraph without any knowledge of, or visibility into, any other file in the program. This lets us eliminate the storage and compute costs of reanalyzing file versions that we have already seen. Since most commits change a small fraction of the files in a repository, this greatly amortizes the operational costs of indexing large, frequently changed repositories over time. To handle type-directed name lookups (which require "pausing" the current lookup to resolve another name), our name resolution algorithm maintains a stack of the currently paused (but still pending) lookups. Stack graphs can be constructed via a purely syntactic analysis of the program's source code, using a new declarative graph construction language. This means that we can extract name binding information for every repository without any per-package configuration, and without having to invoke an arbitrary, untrusted, package-specific build process.
-
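Stripped of the stack machinery that gives stack graphs their name, the underlying move (name resolution as path-finding over a graph of scopes, references, and definitions) can be sketched in a few lines. The node names and graph here are invented for illustration; real stack graphs additionally model push/pop symbol stacks for type-directed lookups and are constructed per-file for incrementality:

```python
from collections import deque

def resolve(graph, reference, definitions):
    """Resolve a reference via breadth-first path search: edges point
    from a reference to its enclosing scope, and from each scope to
    the definitions visible there and to its parent scope."""
    queue = deque([reference])
    seen = {reference}
    while queue:
        node = queue.popleft()
        if node in definitions:
            return node  # first definition reached = closest binding
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None

# Toy graph: a reference to `x` in a function body resolves to the
# parameter `x`, shadowing the module-level `x` (which is farther away).
graph = {
    "ref:x@body": ["scope:fn"],
    "scope:fn": ["def:x@param", "scope:module"],
    "scope:module": ["def:x@module"],
}
defs = {"def:x@param", "def:x@module"}
found = resolve(graph, "ref:x@body", defs)
```

Because resolution is a pure graph search, each file's subgraph can be built in isolation and cached, which is exactly the file-incremental property the paper emphasizes.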
-
- https://github.blog/news-insights/product-news/precise-code-navigation-python-code-navigation-pull-requests/
-
Precise code navigation for Python, and code navigation in pull requests (December 9, 2021 | Updated July 23, 2024)
Code navigation is now available in PRs, and code navigation results for Python are now more precise.
-
Over the coming months, we will add stack graph support for additional languages, allowing us to show precise code navigation results for them as well. Our `stack-graphs` library is open source and builds on the Tree-sitter ecosystem of parsers. We will also be publishing information on how language communities can self-serve stack graph support for their languages, should they wish to.
-
If you would like to learn more about how stack graphs enable precise code navigation with zero configuration, check out our deep dive post and Strange Loop presentation.
- TODO: read through this and include more relevant snippets here
-
- https://sourcegraph.com/
-
Sourcegraph accelerates how software gets built, helping developers search, understand, and write code in complex codebases with AI
-
Code Search Find and navigate code, make large-scale changes, and track insights across codebases of any size.
- https://sourcegraph.com/contexts
-
Search code you care about with search contexts
- https://sourcegraph.com/docs/code-search/working/search_contexts
-
Search Contexts
-
Search Contexts help you search the code you care about on Sourcegraph. A search context represents a set of repositories at specific revisions on a Sourcegraph instance that will be targeted by search queries by default.
-
Every search on Sourcegraph uses a search context. Search contexts can be defined with the contexts selector shown in the search input, or entered directly in a search query.
-
- https://sourcegraph.com/docs/code-search/working/search_contexts
-
- https://sourcegraph.com/code-search
-
Code Search makes it easy to find code, make large-scale changes, and track insights across codebases of any scale and with any number of code hosts.
-
Efficiently reuse existing code. Find code across thousands of repositories and multiple code hosts in seconds.
-
Understand your code and its dependencies
- Onboard to codebases faster with cross-repository code navigation features like “Go to definition” and “Find references”.
- Complete code reviews, get up to speed on unfamiliar code, and determine the impact of code changes with the confidence of compiler-accurate code navigation.
- Determine root causes quickly with code navigation that tracks dependencies and references across repositories.
-
- https://sourcegraph.com/contexts
- https://sourcegraph.com/pricing
-
Free
- $0 per month
- AI editor extension for hobbyists or light usage
-
Enterprise Starter
- $19 per user/month
- AI & search experience for growing organizations hosted on our cloud
- This seems to be the first tier that adds specialised search features (beyond what's available publicly anyway)
-
Integrated search results
-
Code Search Features
- Code Search
- Symbol Search
-
-
Enterprise
- $59 per user/month
- AI & search with enterprise-level security, scalability, and flexibility
- Extra search features
-
Everything in Enterprise Starter, plus:
-
Code Search Features
- Batch Changes
- Code Insights
- Code Navigation
-
-
-
- https://sourcegraph.com/search
-
Public Code Search
-
- https://sourcegraph.com/docs
-
Documentation
Sourcegraph allows developers to rapidly search, write, and understand code by bringing insights from their entire codebase right into the editor.
- https://sourcegraph.com/docs/code-search
-
Code Search
-
Code Search allows you to find, fix, and navigate code with any code host or language across multiple repositories with real-time updates. It deeply understands your code, prioritizing the most relevant results for an enhanced search experience.
-
Sourcegraph's Code Search empowers you to:
- Utilize regular expressions, boolean operations, and keyboard shortcuts to unleash the full potential of your searches
- Identify code vulnerabilities in milliseconds with symbol, commit, and diff search, and quickly resolve issues and incidents
- Navigate code with an innovative, seamless code view for a comprehensive coding experience
- https://sourcegraph.com/docs/code-search/features
-
Code Search Capabilities
Learn and understand more about Sourcegraph's Code Search features and core functionality.
- https://sourcegraph.com/docs/code-search/features#powerful-flexible-queries
- https://sourcegraph.com/docs/code-search/features#symbol-search
-
Searching for symbols makes it easier to find specific functions, variables, and more. Use the `type:symbol` filter to search for symbol results. Symbol results also appear in typeahead suggestions, so you can jump directly to symbols by name. When on an indexed commit, it uses Zoekt; otherwise it uses the symbols service.
- https://sourcegraph.com/docs/code-search/types/symbol
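As a concrete illustration of the filter (the repo and symbol name here are just hypothetical examples, not from the docs):

```
type:symbol lang:javascript repo:^github\.com/example/app$ debounce
```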
-
We use Ctags to index the symbols of a repository on demand. These symbols are used to implement symbol search, matching declarations instead of plain text.
-
- https://sourcegraph.com/docs/code-search/types/symbol
-
- https://sourcegraph.com/docs/code-search/features#saved-searches
- https://sourcegraph.com/docs/code-search/features#search-contexts
- https://sourcegraph.com/docs/code-search/features#re2-regular-expressions
-
RE2 Regular Expressions
The Sourcegraph search language supports RE2 syntax. If you're used to tools like Perl, which use PCRE syntax, you may notice that some features are missing from RE2, like backreferences and lookarounds. We chose to use RE2 for a few reasons:
- It makes it possible to build worst-case linear evaluation engines, which is very desirable for building a production-ready regex search engine.
- It's well-supported in Go, allowing us to take advantage of a rich ecosystem (notably including Zoekt)
- Our API and tooling make it straightforward to use Sourcegraph with other tools that provide facilities not built into the search language.
- https://github.com/google/re2
-
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
- https://github.com/google/re2/wiki/Syntax
-
-
- https://sourcegraph.com/docs/code-search/features#search-experience
- etc
-
- https://sourcegraph.com/docs/code-search/queries
-
Search Query Syntax
This page describes the query syntax for Code Search.
-
- https://sourcegraph.com/docs/code-search/code-navigation
-
Code Navigation
Learn how to navigate your code and understand its dependencies with high precision.
Code Navigation helps you quickly understand your code, its dependencies, and symbols within the Sourcegraph file view while making it easier to move through your codebase
- https://sourcegraph.com/docs/code-search/code-navigation#code-navigation-types
-
Code Navigation types
There are two types of Code Navigation that Sourcegraph supports:
- Search-based Code Navigation: Works out of the box with most popular programming languages, powered by Sourcegraph's code search. It uses a mix of text search and syntax-level heuristics (no language-level semantic information) for fast, performant searches across large code bases.
- Precise Code Navigation: Uses compile-time information to provide users with accurate cross-repository navigation experience across the entire code base.
- https://sourcegraph.com/docs/code-search/code-navigation/precise_code_navigation
-
Precise Code Navigation
-
Precise Code Navigation is an opt-in feature that is enabled from your admin settings and requires you to upload indexes for each repository to your Sourcegraph instance. Once setup is complete on Sourcegraph, precise code navigation is available for use across popular development tools
-
Sourcegraph automatically uses Precise Code Navigation whenever available, and Search-based Code Navigation is used as a fallback when precise navigation is not available.
-
Precise code navigation relies on the open source SCIP Code Intelligence Protocol, which is a language-agnostic protocol for indexing source code.
-
- https://sourcegraph.com/docs/code-search/code-navigation/auto_indexing
-
Auto-indexing
-
With Sourcegraph deployments supporting executors, your repository contents can be automatically analyzed to produce a code graph index file. Once auto-indexing is enabled and auto-indexing policies are configured, repositories will be periodically cloned into an executor sandbox, analyzed, and the resulting index file will be uploaded back to the Sourcegraph instance.
Auto-indexing is currently available for Go, TypeScript, JavaScript, Python, Ruby and JVM repositories. See also dependency navigation for instructions on how to setup cross-dependency navigation depending on what language ecosystem you use.
-
- https://sourcegraph.com/docs/code-search/code-navigation/writing_an_indexer#writing-an-indexer
-
Indexers
This page describes the process of writing an indexer and details all the recommended indexers that Sourcegraph currently supports.
-
The following documentation describes the SCIP Code Intelligence Protocol and explains steps to write an indexer to emit SCIP.
-
- https://sourcegraph.com/docs/code-search/code-navigation/writing_an_indexer#sourcegraph-recommended-indexers
-
Sourcegraph recommended indexers
Language support is an ever-evolving feature of Sourcegraph. Some languages may be better supported than others due to demand or developer bandwidth/expertise. The following clarifies the status of the indexers which the Sourcegraph team can both recommend to customers and provide support for.
-
- https://sourcegraph.com/docs/code-search/code-navigation/writing_an_indexer#cross-repository-emits-monikers-for-cross-repository-support
-
Cross repository: Emits monikers for cross-repository support
The next milestone provides support for cross-repository definitions and references.
The indexer can emit a valid index including import monikers for each symbol defined non-locally, and export monikers for each symbol importable by another repository. This index should be consumed without error by the latest Sourcegraph instance and Go to Definition and Find References should work on cross-repository symbols given that both repositories are indexed at the exact commit imported.
-
-
- https://sourcegraph.com/docs/code-search/code-navigation/rockskip
-
Rockskip: fast symbol sidebar and search-based code navigation on monorepos
-
Rockskip is an alternative symbol indexing and query engine for the symbol service intended to improve performance of the symbol sidebar and search-based code navigation on big monorepos. It was added in Sourcegraph 3.38.
-
-
- https://sourcegraph.com/docs/code-search/types/fuzzy
-
Fuzzy Finder
Learn about Sourcegraph's Fuzzy Finder and its core functionality.
Use the fuzzy finder to quickly navigate to a repository, symbol, or file.
-
- https://sourcegraph.com/docs/code-search/types/structural
-
Structural Search
-
Changed in version 5.3. Structural search is disabled by default. To enable it, ask your site administrator to set experimentalFeatures.structuralSearch = "enabled" in site configuration. Structural search has performance limitations and is not actively developed. We recommend using regex search or a combination of Search Jobs and custom scripts instead.
-
With structural search, you can match richer syntax patterns specifically in code and structured data formats like JSON. It can be awkward or difficult to match code blocks or nested expressions with regular expressions. To meet this challenge we've introduced a new and easier way to search code that operates more closely on a program's parse tree. We use Comby syntax for structural matching. Below you'll find examples and notes for this language-aware search functionality.
- https://comby.dev/
-
Comby is a tool for searching and changing code structure
- https://comby.dev/docs/overview
-
Comby provides a lightweight way of matching syntactic structures of a program’s parse tree, like expressions and function blocks. Comby is language-aware and understands basic syntax of code, strings, and comment syntax in many languages.
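As a flavor of the syntax (an illustrative pattern, not taken from the docs): the Comby template below matches JavaScript function declarations regardless of internal formatting, with `:[[name]]` binding a single identifier and `:[args]` / `:[body]` binding balanced delimiter contents:

```
function :[[name]](:[args]) { :[body] }
```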
-
- https://comby.dev/docs/syntax-reference
-
Syntax Reference
-
-
- https://comby.dev/
-
- https://sourcegraph.com/docs/code-search/types/search-jobs
-
Search Jobs
-
Use Search Jobs to search code at scale for large-scale organizations.
Search Jobs allows you to run search queries across your organization's codebase (all repositories, branches, and revisions) at scale. It enhances the existing Sourcegraph's search capabilities, enabling you to run searches without query timeouts or incomplete results.
With Search Jobs, you can start a search, let it run in the background, and then download the results from the Search Jobs UI when it's done.
-
- https://sourcegraph.com/docs/code-search/working/snippets
-
Search Snippets
-
Every project and team has a different set of repositories they commonly work with and queries they perform regularly. Custom search snippets enable users and organizations to quickly filter existing search results with search fragments matching those use cases.
A search snippet is any valid query. For example, a search snippet that defines all repositories in the "example" organization would be `repo:^github\.com/example/`. After adding this snippet to your settings, it would appear in the search snippet panel in the search sidebar under a label of your choosing (as of v3.29).
-
- https://sourcegraph.com/docs/code-search/working/search_subexpressions
-
Search Subexpressions
-
Search subexpressions combine groups of filters like `repo:` and operators like `or`. Compared to basic examples, search subexpressions allow more sophisticated queries.
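For instance, a subexpression can group several `repo:` filters under an `or` and apply the rest of the query to all of them (the repositories and search term here are hypothetical):

```
(repo:^github\.com/example/app$ or repo:^github\.com/example/lib$) lang:javascript fingerprint
```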
-
-
- https://sourcegraph.com/docs/api/graphql
-
Sourcegraph GraphQL API
The Sourcegraph GraphQL API is a rich API that exposes data related to the code available on a Sourcegraph instance.
The Sourcegraph GraphQL API supports the following types of queries:
- Full-text and regexp code search
- Rich git-level metadata, including commits, branches, blame information, and file tree data
- Repository and user metadata
- https://sourcegraph.com/docs/api/graphql#documentation
-
Sourcegraph's GraphQL API documentation is available on the API Docs page, as well as directly in the API console itself.
- https://sourcegraph.com/docs/api/graphql/api-docs
-
Sourcegraph API
-
-
- https://sourcegraph.com/docs/api/graphql#search
-
Search
See additional documentation about search GraphQL API: https://sourcegraph.com/docs/api/graphql/search
-
- https://sourcegraph.com/docs/api/graphql#using-the-api-via-the-sourcegraph-cli
-
Using the API via the Sourcegraph CLI
A command line interface to Sourcegraph's API is available. Today, it is roughly the same as using the API via `curl` (see below), but it offers a few nice things:
- Allows you to easily compose queries from scripts, e.g. without worrying about escaping JSON input to `curl` properly.
- Reads your access token and Sourcegraph server endpoint from a config file (or env var).
- Pipe multi-line GraphQL queries into it easily.
- Get any API query written using the CLI as a `curl` command using the `src api -get-curl` flag.
To learn more, see sourcegraph/src-cli
-
- https://sourcegraph.com/docs/api/graphql#using-the-api-via-curl
-
Using the API via curl
The entire API can be used via `curl` (or any HTTP library), just the same as any other GraphQL API.
-
-
- https://sourcegraph.com/docs/api/stream_api
-
Sourcegraph Stream API
With the Stream API you can consume search results and related metadata as a stream of events. The Sourcegraph UI calls the Stream API for all interactive searches. Compared to our GraphQL API, it offers shorter times to first results and supports running exhaustive searches returning a large volume of results without putting pressure on the backend.
-
-
- https://github.com/sourcegraph/sourcegraph-public-snapshot
-
Sourcegraph Code AI platform with Code Search & Cody
-
Note
Sourcegraph transitioned to a private monorepo. This repository, `sourcegraph/sourcegraph-public-snapshot`, is a publicly available copy of the `sourcegraph/sourcegraph` repository as it was just before the migration.
-
Tip
If you are interested in working with the code, this commit is the last one made under an Apache License.
- This commit was made on Jun 14, 2023
- Note: The latest commits seem to be from August 2024
- https://news.ycombinator.com/item?id=36584656
-
Sourcegraph is no longer open source
-
sqs on July 4, 2023
Sourcegraph CEO here. Sourcegraph is now 2 separate products: code search and Cody (our code AI). Cody remains open source (Apache 2) in the client/cody* directories in the repository, and we're extracting that to a separate 100% OSS repository soon.
Our licensing principle remains to charge companies while making tools for individual devs open source. Very few individual devs (or companies) used the limited-feature open-source variant of code search, so we decided to remove it. Usage of Sourcegraph code search was even more skewed toward our official non-OSS build than in other similar situations like Google Chrome vs. Chromium or VS Code vs. VSCodium. Maintaining 2 variants was a burden on our engineering team that had very little benefit for anyone.
You can see more explanation at sourcegraph/sourcegraph-public-snapshot#53528 (comment) . The change was announced in the changelog and in a PR (all of our development occurs in public), and we will have a blog post this week after we separate our big monorepo into 2 repos as planned: the 100% OSS repo for Cody and the non-OSS repo for code search.
You can still use Sourcegraph code search for free on public code at https://sourcegraph.com and on our self-hosted free tier on private code (which means individual devs can still run Sourcegraph code search 100% for free). Customers are not affected at all.
-
- https://github.com/sourcegraph/src-cli
-
Sourcegraph CLI
-
`src` is a command line interface to Sourcegraph:
- Search & get results in your terminal
- Search & get JSON for programmatic consumption
- Make GraphQL API requests with auth easily & get JSON back fast
- Execute batch changes
- Manage & administrate repositories, users, and more
- Easily convert src-CLI commands to equivalent curl commands, just add `--get-curl`!
-
-
- https://github.com/sourcegraph/zoekt
-
Zoekt: fast code search
-
Fast trigram based code search
-
Zoekt is a text search engine intended for use with source code. (Pronunciation: roughly as you would pronounce "zooked" in English)
-
Note: This has been the maintained source for Zoekt since 2017, when it was forked from the original repository github.com/google/zoekt.
-
Zoekt supports fast substring and regexp matching on source code, with a rich query language that includes boolean operators (and, or, not). It can search individual repositories, and search across many repositories in a large codebase. Zoekt ranks search results using a combination of code-related signals like whether the match is on a symbol. Because of its general design based on trigram indexing and syntactic parsing, it works well for a variety of programming languages.
The two main ways to use the project are
- Through individual commands, to index repositories and perform searches through Zoekt's query language
- Or, through the indexserver and webserver, which support syncing repositories from a code host and searching them through a web UI or API
For more details on Zoekt's design, see the docs directory.
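The trigram-index idea behind Zoekt can be sketched in a few lines of Go. This is a toy illustration of the general technique, not Zoekt's actual implementation (Zoekt also stores positions, shards indexes, ranks results, etc.): each document is decomposed into 3-byte substrings, a query's trigrams are intersected over the posting lists to get a small candidate set, and candidates are then verified with an exact substring check.

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams returns the distinct 3-byte substrings of s.
func trigrams(s string) map[string]bool {
	t := map[string]bool{}
	for i := 0; i+3 <= len(s); i++ {
		t[s[i:i+3]] = true
	}
	return t
}

// buildIndex maps each trigram to the IDs of the documents containing it.
func buildIndex(docs []string) map[string][]int {
	idx := map[string][]int{}
	for id, doc := range docs {
		for t := range trigrams(doc) {
			idx[t] = append(idx[t], id)
		}
	}
	return idx
}

// search finds docs containing query: a candidate doc must contain every
// query trigram (a cheap posting-list intersection); candidates are then
// verified with an exact substring check, since trigram coverage alone
// can yield false positives.
func search(docs []string, idx map[string][]int, query string) []int {
	hits := map[int]int{}
	need := len(trigrams(query))
	for t := range trigrams(query) {
		for _, id := range idx[t] {
			hits[id]++
		}
	}
	var out []int
	for id := 0; id < len(docs); id++ { // deterministic order
		if hits[id] == need && strings.Contains(docs[id], query) {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	docs := []string{
		"results for this query string", // real match
		"unrelated document text",       // shares no query trigrams
		"ery que uer",                   // has all the trigrams, but not the substring
	}
	idx := buildIndex(docs)
	fmt.Println(search(docs, idx, "query")) // [0]
}
```

The point of the final `strings.Contains` pass is exactly the "candidate filtering" role trigram indexes play in practice: the index rules out most documents cheaply, and only the survivors pay for an exact match.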
-
Note: It is also recommended to install Universal ctags, as symbol information is a key signal in ranking search results. See ctags.md for more information.
- https://github.com/sourcegraph/zoekt/blob/main/doc/ctags.md
-
CTAGS
Ctags generates indices of symbol definitions in source files. It started its life as part of the BSD Unix, but there are several more modern flavors. Zoekt supports universal-ctags.
-
- https://github.com/sourcegraph/zoekt/blob/main/doc/ctags.md
- https://github.com/sourcegraph/zoekt/blob/main/doc/query_syntax.md
-
Zoekt Query Language Guide
This guide explains the Zoekt query language, used for searching text within Git repositories. Zoekt queries allow combining multiple filters and expressions using logical operators, negations, and grouping. Here's how to craft queries effectively.
-
- https://github.com/sourcegraph/zoekt-archived
-
Note: This is a Sourcegraph fork of github.com/google/zoekt. It contains some changes that do not make sense to upstream and or have not yet been upstreamed.
-
-
- https://github.com/sourcegraph/scip.dev
-
Future home of scip.dev
-
- https://github.com/sourcegraph/scip
-
SCIP Code Intelligence Protocol
-
SCIP (pronunciation: "skip") is a language-agnostic protocol for indexing source code, which can be used to power code navigation functionality such as Go to definition, Find references, and Find implementations.
This repository includes:
- A Protobuf schema for SCIP.
- Rich Go and Rust bindings for SCIP: These include many utility functions to help build tooling on top of SCIP.
- Auto-generated bindings for TypeScript and Haskell.
- The `scip` CLI, which makes SCIP indexes a breeze to work with.
If you're interested in better understanding the motivation behind SCIP, check out the announcement blog post and the design doc.
If you're interested in writing a new indexer that emits SCIP, check out our documentation on how to write an indexer. Also, check out the Debugging section in the Development docs.
If you're interested in consuming SCIP data, you can either use one of the provided language bindings, or generate code for the SCIP Protobuf schema using the Protobuf toolchain for your language ecosystem. Also, check out the Debugging section in the Development docs.
-
- https://github.com/sourcegraph/scip-typescript
-
SCIP indexer for TypeScript and JavaScript
-
- https://github.com/sourcegraph/scip-semantic
-
scip-semantic
-
various semantic and syntax based tools related to SCIP
-
-
-
A community-driven source of knowledge for Language Server Index Format implementations
-
What is LSIF?
The Language Server Index Format (LSIF, pronounced “else if”) is a standard format for language servers or other programming tools to emit their knowledge about a code workspace. This persisted information can later be used to answer LSP requests for the same workspace without running a language server.
- https://code.visualstudio.com/blogs/2019/02/19/lsif
-
The Language Server Index Format (LSIF) (February 19, 2019)
-
-
-
https://github.com/sourcegraph/lsif-protocol
-
LSIF protocol utilities for Go
This repository contains LSIF protocol struct definitions.
-
This project has been merged into github.com/sourcegraph/sourcegraph-public-snapshot
-
-
https://github.com/sourcegraph/lsif-semanticdb
-
Language Server Index Format (LSIF) converter
-
This project is now part of `lsif-java`.
Visit https://sourcegraph.github.io/lsif-java/docs/getting-started.html to install the `lsif-java` command-line tool. Run the following command to generate LSIF from SemanticDB.
-
-
https://github.com/sourcegraph/lsif-node
-
Language Server Indexing Format (LSIF) generator for JavaScript and TypeScript
-
Deprecated: TypeScript LSIF indexer This project is no longer maintained. Please use scip-typescript instead.
- https://github.com/sourcegraph/lsif-node-action
-
Sourcegraph TypeScript LSIF Indexer GitHub Action
-
This action generates LSIF data from TypeScript source code. See the LSIF TypeScript indexer for more details.
-
-
-
https://github.com/sourcegraph/lsif-upload-action
-
Sourcegraph LSIF Uploader GitHub Action
-
This action uploads generated LSIF data to a Sourcegraph instance.
-
-
https://github.com/sourcegraph/coif-to-lsif
-
Converts CoIF to LSIF
-
CoIF is not actively developed; you probably want to look at SCIP instead.
-
CoIF (Code Index Format) is similar to LSIF, but simpler. It's intended to be a format that is easier for indexers to emit than LSIF. The CoIF to LSIF converter only needs to be written once, so it can save the indexer from needing to be aware of all the nuances of LSIF.
-
- https://github.com/sourcegraph/sourcegraph-typescript
-
Language server for TypeScript/JavaScript
-
Provides code intelligence for TypeScript
-
This repository has been superseded by scip-typescript.
-
- https://github.com/sourcegraph/lsp-client
-
@sourcegraph/lsp-client
-
Connects Sourcegraph extensions to language servers
-
- https://github.com/sourcegraph/lsp-adapter
-
lsp-adapter provides a proxy which adapts Sourcegraph LSP requests to vanilla LSP requests
-
Code Intelligence on Sourcegraph is powered by the Language Server Protocol.
Previously, language servers that were used on sourcegraph.com were additionally required to support our custom LSP files extensions. These extensions allowed language servers to operate without sharing a physical file system with the client. While it's preferable for language servers to implement these extensions for performance reasons, implementing this functionality is a large undertaking.
`lsp-adapter` eliminates the need for this requirement, which allows off-the-shelf language servers to provide basic functionality (hovers, local definitions) to Sourcegraph.
-
- https://github.com/sourcegraph/javascript-typescript-langserver
-
JavaScript and TypeScript code intelligence through the Language Server Protocol
-
This project is no longer maintained
This language server is an implementation of LSP using TypeScript's APIs. This approach made it difficult to keep up with new features of TypeScript and implied that the server always uses a bundled TypeScript version, instead of the local TypeScript in `node_modules` like using the official (non-LSP) `tsserver` allows.
On top of that, over time we simplified our architecture for running language servers in the cloud at Sourcegraph, which removed the necessity for this level of tight integration and control. Theia's TypeScript language server is a thinner wrapper around `tsserver`, which avoids these problems to some extent. Our latest approach of running a TypeScript language server in the cloud uses Theia's language server (and transitively `tsserver`) under the hood.
However, since then our code intelligence evolved even further and is nowadays powered primarily by LSIF, the Language Server Index Format. LSIF is developed together with LSP and uses the same structures, but in a pre-computed serialization instead of an RPC protocol. This allows us to provide near-instant code intelligence for our tricky on-demand cloud code intelligence scenarios, and hence we are focusing all of our efforts on LSIF indexers. All of this work is also open source, of course, and if you're curious you can read more about how we use LSIF on our blog.
LSP is still the obvious choice for editor scenarios and everyone is welcome to fork this repository and pick up maintenance, although from what we learned we would recommend building on Theia's approach (wrapping `tsserver`). We would also love to see, and are looking forward to, native LSP support for the official `tsserver`, which would eliminate the need for any wrappers.
-
- https://github.com/sourcegraph/typescript-language-server
-
TypeScript & JavaScript Language Server
-
Forked from https://github.com/typescript-language-server/typescript-language-server
-
- https://github.com/sourcegraph/go-ctags
-
go-ctags: universal-ctags wrapper for easy access in Go
Note: This library is meant only for Sourcegraph use.
To improve `type:symbol` results in Sourcegraph, for languages with high-quality Tree-sitter grammars, prefer adding support in `scip-ctags` in the Sourcegraph monorepo over adding support in this repo.
-
- https://github.com/sourcegraph/ctags
-
Forked from https://github.com/universal-ctags/ctags
- https://ctags.io/
- https://github.com/universal-ctags/ctags
-
Universal Ctags (abbreviated as u-ctags) is a maintained implementation of `ctags`. `ctags` generates an index (or tag) file of language objects found in source files for programming languages. This index makes it easy for text editors and other tools to locate the indexed items.
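For reference, a hand-written, illustrative line of the tags-file format: tab-separated tag name, source file, an ex search pattern locating the definition, and a kind field (`f` for function):

```
parseQuery	query/parser.go	/^func parseQuery(s string) {$/;"	f
```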
-
- https://github.com/universal-ctags/ctags
-
- https://github.com/sourcegraph/jsg
-
jsg: JavaScript grapher
-
JavaScript grapher -- part of GraphKit, a collection of source analyzers for popular programming languages
-
Moved to srclib-javascript (this repository is no longer a standalone project; submit patches to `srclib-javascript`)
-
- https://srclib.org/
-
srclib is a hackable, multi-language code analysis library for building better software tools.
srclib makes developer tools like code search and static analyzers better. It supports things like jump to definition, find usages, type inference, and documentation generation.
srclib consists of language analysis toolchains (currently for Go, Python, JavaScript, and Ruby) with a common output format, and developer tools that consume this format.
srclib originated inside Sourcegraph, where it powers intelligent code search over hundreds of thousands of projects.
- https://github.com/sourcegraph/srclib
-
srclib is a polyglot code analysis library, built for hackability. It consists of language analysis toolchains (currently for Go and Java, with Python, JavaScript, and Ruby in beta) with a common output format, and a CLI tool for running the analysis.
-
- https://github.com/sourcegraph/srclib-javascript
-
JavaScript (node.js) toolchain for srclib
-
srclib-javascript is a srclib toolchain that performs JavaScript (Node.js) code analysis: type inference, documentation generation, jump-to-definition, dependency resolution, etc.
It enables this functionality in any client application whose code analysis is powered by srclib, including Sourcegraph.
-
- https://github.com/sourcegraph/srclib-typescript
-
Sourcegraph support for typescript toolchain
-
srclib-typescript is a srclib toolchain that performs TypeScript code analysis: type inference, documentation generation, jump-to-definition, dependency resolution, etc. It enables this functionality in any client application whose code analysis is powered by srclib, including Sourcegraph.
-
-
- https://github.com/sourcegraph/go-tree-sitter
-
forked from smacker/go-tree-sitter
- https://github.com/smacker/go-tree-sitter
-
Golang bindings for `tree-sitter`: https://github.com/tree-sitter/tree-sitter
-
-
- https://github.com/sourcegraph/tree-sitter-wasms
-
forked from Gregoor/tree-sitter-wasms
- https://github.com/Gregoor/tree-sitter-wasms
-
tree-sitter-wasms Prebuilt WASM binaries for tree-sitter's language parsers. Forked from https://github.com/Menci/tree-sitter-wasm-prebuilt because I wanted to use GitHub Actions to automate publishing.
-
-
- https://github.com/sourcegraph/tree-sitter-typescript
-
forked from tree-sitter/tree-sitter-typescript
- https://github.com/tree-sitter/tree-sitter-typescript
-
TypeScript grammar for `tree-sitter`
-
-
- https://github.com/sourcegraph/go-diff
-
go-diff
Unified diff parser and printer for Go
-
Diff parser and printer for Go.
-
It doesn't actually compute a diff. It only reads in (and prints out, given a Go struct representation) unified diff output
-
- https://github.com/sourcegraph/go-dep-parser
-
Forked from aquasecurity/go-dep-parser
- https://github.com/aquasecurity/go-dep-parser
-
go-dep-parser
Dependency Parser for Multiple Programming Languages
-
Note: Moved to the dependency package in Trivy
- https://github.com/aquasecurity/trivy/tree/main/pkg/dependency
- https://github.com/aquasecurity/trivy
-
Trivy (pronunciation) is a comprehensive and versatile security scanner. Trivy has scanners that look for security issues, and targets where it can find those issues.
-
Find vulnerabilities, misconfigurations, secrets, SBOM in containers, Kubernetes, code repositories, clouds and more
-
- https://github.com/aquasecurity/trivy
- https://github.com/aquasecurity/trivy/tree/main/pkg/dependency
-
-
- https://github.com/sourcegraph/tiktoken-go
-
forked from pkoukk/tiktoken-go
- https://github.com/pkoukk/tiktoken-go
-
tiktoken-go
OpenAI's tiktoken in Go.
Tiktoken is a fast BPE tokeniser for use with OpenAI's models.
This is a port of the original tiktoken.
-
-
- https://github.com/sourcegraph/awesome-code-ai
-
Awesome-Code-AI
-
A list of AI coding tools (assistants, completion, refactoring, etc.).
-
- https://github.com/sourcegraph/codesearch.ai
-
codesearch.ai (Archived)
-
codesearch.ai is a semantic code search engine. It allows searching GitHub functions and StackOverflow answers using natural language queries. It uses HuggingFace Transformers under the hood, and the training procedure is inspired by a paper called Text and Code Embeddings by Contrastive Pre-Training from OpenAI. The CodeSearchNet project served as a basis for data collection and cleaning.
-
- https://github.com/sourcegraph/whouses
-
Who Uses (Archived)
Find out what awesome you've started with Sourcegraph.
-
Find out what projects are using your npm package
- https://whouses.netlify.app/
-
- https://grep.app/
-
Code search made fast
-
Effortlessly search for code, files, and paths across a million GitHub repositories.
- https://grep.app/api/search?q=
-
- https://vercel.com/blog/vercel-acquires-grep
-
Vercel acquires Grep to accelerate code search
-
Grep allows developers to quickly search code across over 500,000 public git repositories. With the acquisition, founder Dan Fox will also be joining Vercel’s AI team to continue building Grep to enhance code search for developers.
-
- https://searchcode.com/
-
SearchCode
-
Artisanal, small batch, handcrafted code search!
-
Simple, comprehensive code search
-
Helping you find real world examples of functions, APIs, and libraries in 378+ languages across 10+ public code sources
-
Filter down to one or many sources such as Bitbucket, CodePlex, Fedora Project, GitLab, Github, Gitorious, Google Android, Google Code, Minix3, Seek Quarry, Sourceforge, Tizen, codeberg, repo.or.cz, sr.ht or by 378+ languages.
- https://searchcode.com/about/
-
Team / Contact
searchcode is currently the work of a single developer standing on the shoulders of giants.
Feel free to contact me at [email protected] or via twitter @boyter or follow developments at https://boyter.org/
-
- https://searchcode.com/api/
-
searchcode API
-
Code Index
Queries the code index and returns at most 100 results. All filters supported by searchcode are available. These include src (sources), lan (languages) and loc (lines of code). These work in the same way that the main page works. See the examples for how to use these.
-
Code Result
Returns the raw data from a code file given the code id which can be found as the id in a code search result.
-
Related Results
Given a searchcode unique code id, returns an array of results that are considered duplicates. The matching is slightly fuzzy, so small differences between files are ignored.
- etc
-
-
- https://searchcodeserver.com/
-
searchcode server
-
The best code search solution. Guaranteed. The code search solution for companies that build or maintain software who want to improve productivity and shorten development time by getting value from their existing source code.
-
How searchcode server works.
By indexing your source code it allows you to search over this code quickly, filtering down by repositories, languages and file owners to find what you were looking for. Own your data, searchcode server is not a SAAS or cloud product, download and install it on your own servers.
- https://searchcodeserver.com/pricing.html
-
Pricing for searchcode server
-
Requirements: A GNU/Linux/Windows/BSD machine running the Java 8 runtime. Everything else is configured out of the box for you.
The community edition is free to use for as many users as you wish but you must leave the searchcode branding visible.
All paid plans include a full downloadable version of searchcode server with the ability to change the icon and modify other look and feel elements. The software comes with a lifetime licence to install and use searchcode server internally on as many instances as you like. You can use any paid-for version in any manner you see fit, including public-facing websites. Finally, you will get direct emails letting you know when updates are available, with links to the update, for the length of the support period.
-
- https://github.com/boyter/searchcode-server/tree/master
-
searchcode server
-
searchcode server is a powerful code search engine with a sleek web user interface.
searchcode server works in tandem with your source control system, indexing thousands of repositories and files allowing you and your developers to quickly find and reuse code across teams.
-
-
- https://boyter.org/
-
Ben E. C. Boyter's Blog
- https://boyter.org/about/
-
Shortlist:
- https://boyter.org/posts/searchcode-bigger-sqlite-than-you/
-
searchcode.com’s SQLite database is probably 6 terabytes bigger than yours (2025-02-16)
-
- https://boyter.org/posts/how-i-built-my-own-index-for-searchcode/
-
Building a custom code search index in Go for searchcode.com (2022-11-22)
-
Additional/Unsorted:
- https://boyter.org/posts/searchcode.com-vibe-coding/
-
Vibe coding searchcode a new UI and saving myself 40+ hours of work (2025-03-12) (1108 words)
-
- https://boyter.org/posts/bloom-filters-sqlite/
-
Bloom Filters and SQLite (2024-11-20) (421 words)
- https://github.com/boyter/bloom-sqlite
-
bloom-sqlite
-
- https://github.com/boyter/indexer
-
indexer
Code for GopherConSyd 2023
So please clone this, and start interacting!
It's a small portion of the caisson index that powers searchcode.com with no dependencies.
-
-
- https://boyter.org/posts/one-hundred-million-little-queries/
-
One hundred million little queries (2024-04-23) (605 words)
-
- https://boyter.org/posts/brute-force-text-search-optimizations/
-
Brute force text search optimizations (2024-03-27) (854 words)
-
- https://boyter.org/posts/codespelunker-details/
-
Code Spelunker how it works (2023-06-06) (858 words)
-
- https://boyter.org/posts/code-spelunker-a-code-search-command-line-tool/
-
Code Spelunker a Code Search Command Line Tool (2023-06-05) (1107 words)
- https://github.com/boyter/cs
-
codespelunker (cs)
A command line search tool. Allows you to search over code or text files in the current directory either on the console, via a TUI or HTTP server, using some boolean queries or regular expressions.
Consider it a similar approach to using ripgrep, silver searcher or grep coupled with fzf but in a single tool.
-
-
- https://boyter.org/posts/profiling-ngram-trigram-tokenization-in-go/
-
Real World CPU profiling of ngram/trigram tokenization in Go to reduce index time in searchcode.com (2023-04-12) (554 words)
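For context on what trigram tokenization means for a code index: content is split into overlapping 3-character grams, and the index maps each gram to the documents containing it, so arbitrary substring queries can be answered by intersecting posting lists. A minimal sketch:

```python
# Split text into overlapping 3-character grams for indexing.
def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

# A substring query is tokenized the same way; a document can only
# match if it contains every trigram of the query.
```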
-
- https://boyter.org/posts/search-index-implementations/
-
Search index implementations (2022-06-26) (539 words)
-
Trie for example https://github.com/typesense/typesense which uses Adaptive Radix Tree https://stackoverflow.com/questions/50127290/data-structure-for-fast-full-text-search
- https://github.com/typesense/typesense
-
Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences
-
- https://github.com/typesense/typesense
-
Bit Signatures
This is something I remember reading about years ago, and found this link to prove I had not lost my mind https://www.stavros.io/posts/bloom-filter-search-engine/ At the time I thought it was neat but not very practical… However then it turns out that Bing has been using this technique over its entire web corpus http://bitfunnel.org/ https://www.youtube.com/watch?v=1-Xoy5w5ydM
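The bit-signature scheme described above can be sketched in a few lines: each document gets a fixed-width Bloom-style signature, and a query matches a document when all of the query's bits are set in the document's signature. Candidates may include false positives but never false negatives; real systems like BitFunnel layer ranking and multiple bit rows on top. Illustration only:

```python
import hashlib

BITS = 256  # signature width in bits

def signature(words, k=3):
    """Bloom-style bit signature: k hashed bit positions per word."""
    sig = 0
    for w in words:
        for i in range(k):
            h = int(hashlib.sha256(f"{i}:{w}".encode()).hexdigest(), 16)
            sig |= 1 << (h % BITS)
    return sig

docs = {
    "a.py": signature(["import", "os", "walk"]),
    "b.py": signature(["import", "sys", "argv"]),
}

def search(query_words):
    q = signature(query_words)
    # Candidate iff every query bit is set in the doc's signature.
    return [name for name, sig in docs.items() if sig & q == q]
```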
-
- https://boyter.org/posts/bloom-filter/
-
Bloom Filters - Much, much more than a space efficient hashmap! (2020-12-10) (2447 words)
-
- https://boyter.org/posts/building-an-api-rate-limiter-in-go-for-searchcode/
-
Building a API rate limiter in Go for searchcode (2020-05-04) (1327 words)
-
- https://boyter.org/posts/searchcode-rebuilt-with-go/
-
searchcode Rebuilt with Go (2020-04-22) (984 words)
- https://github.com/boyter/searchcode-server-highlighter
-
searchcode-server-highlighter
-
A very simple Go HTTP based Syntax highlighter. Run it, then post some code to the default port and it will return CSS + HTML syntax highlighted code.
-
-
- https://boyter.org/posts/an-informal-survey-of-10-million-github-bitbucket-gitlab-projects/
-
Processing 40 TB of code from ~10 million projects with a dedicated server and Go for $100 (2019-09-30) (13129 words)
-
- https://boyter.org/posts/file-read-challange/
-
Processing Large Files – Java, Go and 'hitting the wall' (2019-05-08) (2480 words)
-
- https://boyter.org/2018/03/collection-favorite-optimization-posts-articles/
-
Collection of my favorite optimization posts and articles (2018-03-08) (597 words)
-
- https://boyter.org/2017/12/searchcode-plexus/
-
searchcode plexus (2017-12-05) (1224 words)
-
- https://boyter.org/2017/06/design-searchcode-server/
-
Design for searchcode server (2017-06-27) (107 words)
-
- https://boyter.org/2017/03/golang-solution-faster-equivalent-java-solution/
-
Why is this GoLang solution faster than the equivalent Java Solution? (2017-03-30) (2146 words)
-
- https://boyter.org/2017/01/repository-overview-searchcode-server/
-
Repository overview now in searchcode server (2017-01-30) (363 words)
-
- https://boyter.org/2016/08/searchcode-server-fair-source/
-
searchcode server under fair source (2016-08-24) (185 words)
-
- https://boyter.org/2016/08/syncing-stashbitbucket-searchcode-server/
-
Syncing Stash/BitBucket with searchcode server (2016-08-04) (295 words)
-
- https://boyter.org/2016/07/searchcode-com-architecture-migration-3-0/
-
searchcode.com: The Architecture – migration 3.0 (2016-07-28) (1619 words)
-
- https://boyter.org/2016/03/searchcode-server-released/
-
searchcode server released (2016-03-31) (167 words)
-
- https://boyter.org/2015/12/searchcode-server/
-
searchcode server (2015-12-29) (235 words)
-
- https://boyter.org/2015/10/searchcode-local/
-
searchcode local (2015-10-30) (301 words)
-
- https://boyter.org/2015/09/search/
-
Go Forth and Search (2015-09-02) (247 words)
-
- https://boyter.org/2015/07/searchcode-path-profitability/
-
searchcode the path to profitability (2015-07-17) (338 words)
-
- https://boyter.org/2015/07/searchcode-com-unit-integration-tested/
-
How searchcode.com is Unit and Integration Tested (2015-07-01) (1216 words)
-
- https://boyter.org/2015/03/updates-searchcode-com/
-
Updates to searchcode.com (2015-03-18) (286 words)
-
- https://boyter.org/2014/10/searchcode-com-100-free-software/
-
Why searchcode.com isn't 100% free software (2014-10-10) (765 words)
-
- https://boyter.org/2014/06/sphinx-searchcode/
-
Sphinx and searchcode (2014-06-20) (631 words)
- http://sphinxsearch.com/
- http://sphinxsearch.com/blog/2014/06/19/sphinx-searches-code-at-searchcode-com/
-
- https://boyter.org/2014/06/estimating-sphinx-search-ram-requirements/
-
Estimating Sphinx Search RAM Requirements (2014-06-19) (117 words)
-
- https://boyter.org/2014/06/searchcode/
-
searchcode next (2014-06-16) (542 words)
-
- https://boyter.org/2014/03/searchcode-screenshot/
-
searchcode screenshot (2014-03-26) (117 words)
-
- https://boyter.org/2014/02/searchcode-logo/
-
New searchcode Logo (2014-02-10) (157 words)
-
- https://boyter.org/2014/02/storing-tracking-managing-billions-tiny-files-file-system-nightmare/
-
Why is storing, tracking and managing billions of tiny files directly on a file system a nightmare? (2014-02-06) (202 words)
-
- https://boyter.org/2013/02/why-code-search-is-difficult/
-
Why Code Search is Difficult (2013-02-28) (475 words)
-
- https://boyter.org/2013/01/want-to-write-a-search-engine-have-some-links/
-
Want to write a search engine? Have some links (2013-01-30) (635 words)
- https://github.com/gigablast/open-source-search-engine
-
open-source-search-engine
An open source web and enterprise search engine and spider/crawler. As can be seen on http://www.gigablast.com/
-
-
- https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-5/
-
Code a Search Engine in PHP Part 5 (2013-01-10) (1344 words)
-
- https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-4/
-
Code a Search Engine in PHP Part 4 (2013-01-10) (1348 words)
-
- https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-3/
-
Code a Search Engine in PHP Part 3 (2013-01-10) (1891 words)
-
- https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-2/
-
Code a Search Engine in PHP Part 2 (2013-01-10) (2074 words)
-
- https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/
-
Code a Search Engine in PHP Part 1 (2013-01-10) (5454 words)
-
- https://boyter.org/2012/11/building-a-search-engine-the-most-important-feature-you-can-add/
-
Building a search engine? The most important feature you can add. (2012-11-15) (434 words)
- https://duckduckgo.com/bangs
-
What are bangs?
Bangs are shortcuts that quickly take you to search results on other sites. For example, when you know you want to search on another site like Wikipedia or Amazon, our bangs get you there fastest. A search for !w filter bubble will take you directly to Wikipedia.
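The bang behaviour described above amounts to a prefix lookup plus a URL template. A toy resolver (the bang table and URL templates here are hypothetical examples, not DuckDuckGo's actual ones):

```python
from urllib.parse import quote_plus

# Hypothetical bang table: shortcut -> search URL template.
BANGS = {
    "!w": "https://en.wikipedia.org/w/index.php?search={}",
    "!a": "https://www.amazon.com/s?k={}",
}

def resolve(query):
    parts = query.split(None, 1)
    if len(parts) == 2 and parts[0] in BANGS:
        return BANGS[parts[0]].format(quote_plus(parts[1]))
    return None  # no bang: fall through to a normal search
```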
-
-
- https://boyter.org/2012/07/billions-of-lines-of-code/
-
Billions of lines of code (2012-07-16) (267 words)
-
- https://boyter.org/2012/06/codesearch-api/
-
Codesearch API (2012-06-26) (309 words)
-
- https://boyter.org/2012/04/growing-index/
-
Growing Index (2012-04-13) (216 words)
-
- https://boyter.org/2012/04/performance/
-
Performance (2012-04-12) (96 words)
-
- https://boyter.org/2012/02/improving-the-index/
-
Improving the Index (2012-02-29) (503 words)
-
- https://boyter.org/2011/12/searchcode-now-supports-regex-code-search/
-
searchcode now supports regex code search (2011-12-17) (284 words)
-
- https://boyter.org/2011/10/google-killing-off-code-search/
-
Google Killing off Code Search (2011-10-15) (186 words)
-
- https://boyter.org/2011/06/vector-space-search-model-explained/
-
Vector Space Search Model Explained (2011-06-28) (700 words)
- http://la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
- This link 404's now unfortunately, but these seem similar:
- https://ondoc.logand.com/d/2697/pdf
-
Basic Vector Space Search Engine Theory (January 2, 2004)
-
- https://www.researchgate.net/publication/289611753_A_Vector_Space_Model_Approach_for_Searching_and_Matching_Product_E-Catalogues
-
A Vector Space Model Approach for Searching and Matching Product E-Catalogues
-
- https://ondoc.logand.com/d/2697/pdf
-
- https://boyter.org/2010/08/build-vector-space-search-engine-python/
-
Building a Vector Space Indexing Engine in Python (2010-08-23) (1437 words)
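The vector space model these posts describe boils down to representing documents and queries as term vectors and ranking by cosine similarity. A bare-bones sketch using raw term frequencies (a real engine would add tf-idf weighting and an inverted index):

```python
import math
from collections import Counter

def vec(text):
    # Raw term-frequency vector over whitespace-split tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the quick brown fox", "pack my box with jugs", "quick brown dogs"]
vectors = [vec(d) for d in docs]

def rank(query):
    q = vec(query)
    scored = [(cosine(q, v), d) for v, d in zip(vectors, docs)]
    return [d for s, d in sorted(scored, reverse=True) if s > 0]
```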
-
- https://boyter.org/2008/09/data-mining/
-
Data Mining (2008-09-22) (680 words)
-
Note: I think this might only be for Google projects/similar(?)
- https://developers.google.com/code-search
-
Code Search
-
You can search for specific files or code snippets by using the search box located at the top of the Code Search UI
-
Start using this public code search tool for exploring code without downloading the source.
-
- https://developers.google.com/code-search/user/getting-started
-
Getting started with Code Search
-
To get started, open the Code Search UI for your project:
- https://cs.android.com/
-
Android Code Search
-
- https://source.chromium.org/
-
Chromium Code Search
-
- https://cs.opensource.google/?authuser=1
-
Google Open Source
-
- https://cs.android.com/
-
- https://developers.google.com/code-search/reference?authuser=1
-
Syntax reference
-
This page provides detailed information on the supported filters, operators, syntax options, and keyboard shortcuts for Code Search.
-
- https://programmablesearchengine.google.com/controlpanel/all
- https://programmablesearchengine.googleblog.com/
-
Programmable Search Engine Blog
The latest news, updates and tips from the Programmable Search Engine team
-
The Custom Search Site Restricted JSON API endpoints will cease serving traffic on January 8, 2025.
Beginning on January 8, 2025, all Custom Search Site Restricted JSON API customers must begin their transition to Google Cloud's Vertex AI Search to maintain access to their site search functionality.
- https://developers.google.com/custom-search/v1/site_restricted_api
-
Custom Search Site Restricted JSON API
-
If your Programmable Search Engine is restricted to only searching specific sites (10 or fewer), you can use the Custom Search Site Restricted JSON API. This API is similar to the Custom Search JSON API except this version has no daily query limit. To use this version, confirm that you see 10 or fewer sites to search in the “Sites to Search” section of your Programmable Search Engine control panel, there are no global top level domain patterns, and that “Search the entire web” is set to OFF.
-
- https://cloud.google.com/enterprise-search
-
Vertex AI Search
Vertex AI Search helps developers build secure, Google-quality search experiences for websites, intranet and RAG systems for generative AI agents and apps.
-
- https://developers.google.com/custom-search/v1/site_restricted_api
-
- https://github.com/livegrep/livegrep
-
Livegrep
-
Livegrep is a tool, partially inspired by Google Code Search, for interactive regex search of ~gigabyte-scale source repositories. You can see a running instance at http://livegrep.com/.
-
To run livegrep, you need to invoke both the codesearch backend index/search process, and the livegrep web interface.
- https://livegrep.com/search/linux
- This only has a few example repositories indexed
-
- https://gist.github.com/phillipalexander/9244143
-
Source Code Search Engines
-
NOTE: This list is almost entirely copy/pasted from THIS awesome article. I've made my own personal edits (adding some additional content) which is why I keep it here.
- A lot of the search engines listed here seem to not be a good match for what I want, or no longer exist, etc.
-
- https://openhub.net/
-
Discover, Track and Compare Open Source
- https://openhub.net/tools
- https://github.com/blackducksoftware/ohloh_api#open-hub-api-documentation
-
- https://stackoverflow.com/questions/28526255/resource-for-npms-most-downloaded-this-week-month
- https://registry.npmjs.org/
- https://github.com/npm/public-api
-
npm's public APIs are haphazard, old-fashioned, and scattered. We can and will do better. An internal API was created to handle the needs of www, but needs some work before it can be publicly released.
-
This repository is deprecated. If you are interested in filing an issue about npm's public registry API, please file over at the npm/registry repo. You can also find documentation over there!
- https://github.com/npm/registry
-
npm registry documentation
-
A collection of archived documentation about registry endpoints/API.
- https://github.com/npm/registry/tree/main/docs
- https://github.com/npm/registry/blob/main/docs/REGISTRY-API.md
-
Public Registry API
-
- https://github.com/npm/registry/blob/main/docs/REPLICATE-API.md
-
Replication API
- https://github.com/npm/registry/blob/main/docs/REPLICATE-API.md#the-follower-pattern
-
The Follower Pattern
The primary pattern of using these services is to build a follower. If you'd rather just jump into building something with the registry data, head on over to this tutorial to get started!
- https://github.com/npm/registry/blob/main/docs/follower.md / https://github.com/npm/registry-follower-tutorial
-
This tutorial will teach you how to write a generic boilerplate NodeJS application that can manipulate, respond to, broadcast, analyze, and otherwise play with package metadata as it changes in the npm registry.
Wait...what? Why?
Here's the deal: do you want to have some fun with the package.json data from every version of every package in the npm registry? Some neat ideas:
- Find all the package READMEs that mention dogs
- Discover how many package authors are named "Kate"
- Calculate how many dependency changes occur on average in a major version bump
And more! So stop waiting and write a follower!
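As a sketch of the follower idea above: a follower tails the registry's CouchDB-style `_changes` feed, checkpoints the last sequence number so it can resume, and processes each change. The feed entries below are fabricated examples of the CouchDB changes format; the actual feed URL and document shape should be taken from the tutorial linked above.

```python
import json

def handle_change(line, state):
    """Process one entry from a CouchDB-style _changes feed."""
    change = json.loads(line)
    state["last_seq"] = change["seq"]           # checkpoint for resuming
    if not change.get("deleted"):
        state["packages"].append(change["id"])  # package name
    return state

# Fabricated example entries in the CouchDB changes format.
feed = [
    '{"seq": 1, "id": "left-pad", "changes": [{"rev": "1-a"}]}',
    '{"seq": 2, "id": "old-pkg", "deleted": true, "changes": [{"rev": "2-b"}]}',
]
state = {"last_seq": 0, "packages": []}
for line in feed:
    state = handle_change(line, state)
```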
-
- https://github.com/npm/registry/blob/main/docs/follower.md / https://github.com/npm/registry-follower-tutorial
-
-
- https://github.com/npm/registry/blob/main/docs/COUCHDB.md
- https://github.com/npm/registry/blob/main/docs/download-counts.md
-
package download counts
There is a public api that gives you download counts by package and time range.
Our blog has an explanation of how npm download counts work, including "what counts as a download?"
- https://blog.npmjs.org/post/92574016600/numeric-precision-matters-how-npm-download-counts-work.html
-
numeric precision matters: how npm download counts work
-
- https://blog.npmjs.org/post/92574016600/numeric-precision-matters-how-npm-download-counts-work.html
-
npm's raw log data is continuously written to a series of buckets on AWS S3. Once per day, soon after UTC midnight, a map-reduce cluster is spun up that crunches the previous day's logs and pushes them into the database.
- https://github.com/npm/registry/blob/main/docs/download-counts.md#point-values
-
Point values
Gets the total downloads for a given period, for all packages or a specific package.
GET https://api.npmjs.org/downloads/point/{period}[/{package}]
- etc
-
- https://github.com/npm/registry/blob/main/docs/download-counts.md#bulk-queries
-
Bulk Queries
To perform a bulk query, you can hit the range or point endpoints with a comma-separated list of packages rather than a single package
-
- https://github.com/npm/registry/blob/main/docs/download-counts.md#limits
-
Limits
Bulk queries are limited to at most 128 packages at a time and at most 365 days of data.
All other queries are limited to at most 18 months of data. The earliest date for which data will be returned is January 10, 2015.
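The 128-package cap above means a client has to chunk larger lists across multiple bulk requests. A minimal sketch (the helper name is made up; the URL shape follows the GET pattern quoted above):

```python
# Split a large package list into bulk-query URLs of at most 128
# packages each, per the documented limit.
def bulk_urls(packages, period="last-month", limit=128):
    base = "https://api.npmjs.org/downloads/point/{}/{}"
    for i in range(0, len(packages), limit):
        yield base.format(period, ",".join(packages[i:i + limit]))

urls = list(bulk_urls([f"pkg{i}" for i in range(300)]))
```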
-
-
- https://github.com/npm/registry/blob/main/docs/REGISTRY-API.md
-
- https://github.com/npm/registry
-
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491
-
npm rank
This gist is updated daily via cron job and lists stats for npm packages:
- Top 1,000 most depended-upon packages
- Top 1,000 packages with largest number of dependencies
- Top 1,000 packages with highest PageRank score
- This seems to have last been updated: Fri, 16 Aug 2019 07:31:10 GMT
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491?permalink_comment_id=3581063#gistcomment-3581063
-
the data is generated by https://github.com/anvaka/npmrank
The process of downloading the npm packages is a bit involved, since npm deprecated their public endpoints, but it is still possible. The https://github.com/anvaka/npmrank repository's instructions on getting the data are up to date.
- https://github.com/anvaka/npmrank
-
npmrank
-
npm dependencies graph metrics
-
This repository computes various graph metrics for npm dependencies.
-
Download the npm graph from npm. To do this, follow the instructions from https://github.com/anvaka/allnpm#downloading-npm-data
- https://github.com/anvaka/allnpm
-
allnpm
Graph generator of entire npm registry.
- https://github.com/anvaka/allnpm#downloading-npm-data
-
Downloading npm data
Unfortunately we can no longer access
https://skimdb.npmjs.com/registry/_design/scratch/_view/byField
directly. This CouchDB view used to return every single package from npm, which could be used to construct the graph. To get all npm packages we have to replicate the entire npm repository using a standalone instance of CouchDB, following the instructions from https://www.npmjs.com/package/npm-registry-couchapp.
The process took me ~2 days and ~300GB of hard drive, until the local instance of CouchDB compacted its views. After compaction the disk usage went down to ~100GB.
Note: it is not enough to just replicate; you need to wait until all indexes are generated.
Once the replication is complete you can do:
wget http://admin:[email protected]:5984/registry/_design/scratch/_view/byField
In November 2020, this produced 3.3GB of npm packages and saved it into a `byField` file.
- https://github.com/npm/npm-registry-couchapp
-
deprecation notice: as npm has scaled, the registry architecture has gradually migrated towards a complex distributed architecture, of which npm-registry-couchapp is only a small part. FOSS is an important part of npm, and over time we plan on exposing more APIs, and better documenting the existing API.
-
-
-
- https://github.com/anvaka/allnpm
- https://github.com/anvaka/npmrank#online
-
Discover relevant and popular packages quickly: https://anvaka.github.io/npmrank/online/ Select a keyword and get packages sorted by their pagerank value.
-
-
- https://github.com/anvaka/npmrank
-
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491?permalink_comment_id=4435858#gistcomment-4435858
-
I've made an updated version here -- top 10k packages.
- https://leodog896.github.io/npm-rank/index.html
-
npm-rank
Automated top 10000 npm packages collector, inspired by anvaka's npm rank gist.
- https://leodog896.github.io/npm-rank/PACKAGES.html
-
Packages Ordered list of top 10000 NPM packages
-
- https://github.com/LeoDog896/npm-rank
-
- https://leodog896.github.io/npm-rank/index.html
-
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491?permalink_comment_id=3581063#gistcomment-3581063
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-01-most-dependent-upon-md
-
Top 1000 most depended-upon packages
-
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-02-with-most-dependencies-md
-
Top 1000 packages with most dependencies
-
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-03-pagerank-md
-
Top 1000 packages with highest Pagerank
-
- https://gist.github.com/anvaka/8e8fa57c7ee1350e3491#file-04-hits-rank-md
-
Top 1000 packages with highest authority in HITS rank
-
-
- https://github.com/evanwashere/top-npm-packages
-
npm packages ranked by monthly downloads
- This seems to have the top 10,000 entries in `.json` and `.txt`; last updated on Dec 21, 2021
-
- https://npmgraph.js.org/
- https://github.com/npmgraph/npmgraph
-
npmgraph
-
A tool for exploring npm modules and dependencies.
-
- https://github.com/npmgraph/npmgraph
- https://www.npmcharts.com/
- https://github.com/cheapsteak/npmcharts.com
-
Compare npm package downloads over time
-
- https://github.com/cheapsteak/npmcharts.com
- https://npmtrends.com/
-
npm trends
Compare package download counts over time
-
- https://npm.chart.dev/
- https://github.com/atinux/npm-chart
-
NPM Chart
-
Visualize your package npm downloads in a beautiful chart, ready to be shared with your community.
- Visualize npm downloads in a beautiful chart, ready to be shared with your community.
-
- https://github.com/atinux/npm-chart
- https://npm-stat.com/
-
npm-stat
npm-stat can generate download charts for any package on npm
- https://github.com/pvorb/npm-stat.com
-
npm-stat
Download statistics for npm packages.
-
-
- https://medium.com/@glitch.txs/on-measuring-the-bundle-size-of-javascript-packages-5816e216e3d8
-
On Measuring the Bundle Size of JavaScript Packages
- https://bundlephobia.com/
-
Bundlephobia
-
find the cost of adding a npm package to your bundle
- https://github.com/pastelsky/bundlephobia
-
Find out the cost of adding a new frontend dependency to your project
-
- https://github.com/AdrieanKhisbe/bundle-phobia-cli
-
Cli for the node BundlePhobia Service
-
-
- https://bundlejs.com/
-
bundlejs
-
a quick npm package size checker
- https://github.com/okikio/bundlejs
-
An online tool to quickly bundle & minify your projects, while viewing the compressed gzip/brotli bundle size, all running locally on your browser.
-
I used monaco-editor for the code-editor, esbuild as bundler and treeshaker respectively, denoflate as a wasm port of gzip, deno_brotli as a wasm port of brotli, deno_lz4 as a wasm port of lz4, bytes to convert the compressed size to human readable values, esbuild-visualizer to visualize and analyze your esbuild bundle to see which modules are taking up space and, umami for private, publicly available analytics and general usage stats all without cookies.
-
bundlejs is a quick and easy way to bundle your projects, minify them and see their gzip size. It's an online tool similar to bundlephobia, but bundlejs does all the bundling locally in your browser and can treeshake and bundle multiple packages (both commonjs and esm) together, all without having to install any npm packages and with typescript support.
-
-
- https://github.com/glitch-txs/vite-size
-
Vite Size
-
Check the bundle size of the output build of any package with Vite.
-
Measure the bundle size of any package with Vite
-
-
- https://github.com/webpack-contrib/webpack-bundle-analyzer
-
Webpack Bundle Analyzer
-
Visualize size of webpack output files with an interactive zoomable treemap.
-
Webpack plugin and CLI utility that represents bundle content as convenient interactive zoomable treemap
-
- https://chrisbateman.github.io/webpack-visualizer/
-
Webpack Visualizer
- https://github.com/chrisbateman/webpack-visualizer
-
Webpack Visualizer
-
Visualize and analyze your Webpack bundle to see which modules are taking up space and which might be duplicates.
-
-
- https://github.com/btd/esbuild-visualizer
-
EsBuild Visualizer
-
Create chart of dependencies in your bundle
-
Visualize and analyze your esbuild bundle to see which modules are taking up space.
-
The below content was originally posted in this comment (Dec 7, 2023: Ref), and then copied over as the basis for a new issue in this comment (Dec 13, 2023: Ref)
It has been further refined/enhanced since, including fixing up the titles, adding abstracts, and removing irrelevant links.
Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:
Program Dependence Graph, Control Flow Graph, Data Flow Graph, Data Flow Analysis, Program Analysis Tools, etc
- https://en.wikipedia.org/wiki/Program_dependence_graph
-
Program Dependence Graph - Wikipedia
-
In computer science, a Program Dependence Graph (PDG) is a representation of a program's control and data dependencies. It's a directed graph where nodes represent program statements, and edges represent dependencies between these statements. PDGs are useful in various program analysis tasks, including optimizations, debugging, and understanding program behavior.
-
- https://en.wikipedia.org/wiki/Control-flow_graph
-
Control-Flow Graph - Wikipedia
-
In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution.
-
In a control-flow graph each node in the graph represents a basic block, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph, and the exit block, through which all control flow leaves.
- https://github.com/rudrOwO/control-flow-graph
-
Control-flow Graph
Generate control-flow graph (CFG) from any code consisting of C-like syntax
- https://control-flow.vercel.app/
-
- https://reverseengineering.stackexchange.com/questions/16557/building-a-control-flow-graph-from-machine-code
-
Building a control flow graph from machine code (2017)
-
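The CFG definition above can be made concrete with a toy graph for `if cond: A else: B; C`, using a plain adjacency dict (illustrative only: nodes are basic blocks, directed edges are possible jumps, with designated entry and exit blocks):

```python
# Toy control-flow graph for an if/else followed by a join.
cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],  # conditional block: two outgoing edges
    "then": ["join"],
    "else": ["join"],
    "join": [],                # exit block
}

def reachable(graph, start):
    """Depth-first traversal: every block reachable from `start`."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph[n])
    return seen
```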
-
- https://stackoverflow.com/questions/15087195/data-flow-graph-construction
-
Stack Overflow: Data Flow Graph Construction (2013)
-
- https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree
-
Code Review Stack Exchange: Call-flow graph from Python abstract syntax tree (2022)
-
- https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/
-
CodeQL Documentation: About data flow analysis
-
Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.
- https://codeql.github.com/docs/codeql-language-guides/analyzing-data-flow-in-javascript-and-typescript/#analyzing-data-flow-in-javascript-and-typescript
-
Analyzing data flow in JavaScript and TypeScript This topic describes how data flow analysis is implemented in the CodeQL libraries for JavaScript/TypeScript and includes examples to help you write your own data flow queries.
-
-
- https://clang.llvm.org/docs/DataFlowAnalysisIntro.html
-
Clang Documentation: Data flow analysis: an informal introduction
-
This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems.
-
Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached.
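A minimal illustration of that fixpoint idea, on an assumed diamond-shaped CFG: the "fact" propagated here is just the set of variables assigned on some path into each block (gen sets only, union at joins), iterated until nothing changes.

```python
# Toy forward data flow analysis over a diamond CFG: b1 -> {b2, b3} -> b4.
preds = {"b1": [], "b2": ["b1"], "b3": ["b1"], "b4": ["b2", "b3"]}
gen = {"b1": {"x"}, "b2": {"y"}, "b3": {"z"}, "b4": set()}

out = {b: set() for b in preds}
changed = True
while changed:  # propagate facts along CFG edges until a fixpoint
    changed = False
    for b in preds:
        in_b = set().union(*[out[p] for p in preds[b]]) if preds[b] else set()
        new = in_b | gen[b]
        if new != out[b]:
            out[b], changed = new, True
```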
-
- https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html
-
Program Analysis Tools
-
Contents:
- 1 Representing Programs
- 1.1 Abstract Syntax Trees (ASTs)
- 1.2 Control Flow Graphs
- 2 Style and Anomaly Checking
- 2.1 Lint
- 2.2 Static Analysis by Compilers
- 2.3 CheckStyle
- 2.4 SpotBugs
- 2.5 PMD
- 3 Reverse-Engineering Tools
- 3.1 Reverse Compilers
- 3.2 Java Obfuscators
- 3.3 Obfuscation Example
- 4 Dynamic Analysis Tools
- 4.1 Pointer/Memory Errors
- 4.2 Profilers
- 1 Representing Programs
- https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html#control-flow-graphs
-
1.2 Control Flow Graphs
-
Represent each executable statement in the code as a node, with edges connecting nodes that can be executed one after another. Nodes for conditional statements have two or more outgoing edges.
-
- https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html#data-flow-analysis
-
1.2.2 Data Flow Analysis
-
-
- https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf
-
Slides: Basic Program Analysis - Suman Jana
- ChatGPT Summary / Abstract:
-
Title: Basic Program Analysis
Author: Suman Jana
Institution: Columbia University
Abstract: This document delves into the foundational concepts and techniques involved in program analysis, particularly focusing on control flow and data flow analysis essential for identifying security bugs in source code. The objective is to equip readers with the understanding and tools needed to effectively analyze programs without building systems from scratch, utilizing existing frameworks such as LLVM for customization and enhancement of analysis processes.
The core discussion includes an overview of compiler design with specific emphasis on the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Analysis. These elements are critical in understanding the structure of source code and its execution flow. The document highlights the conversion of source code into AST and subsequently into CFG, where data flow analysis can be applied to optimize code and identify potential security vulnerabilities.
Additionally, the paper explores more complex topics like identifying basic blocks within CFG, constructing CFG from basic blocks, and advanced concepts such as loop identification and the concept of dominators in control flow. It also addresses the challenges and solutions related to handling irreducible Control Flow Graphs (CFGs), which are crucial for the analysis of less structured code.
Keywords: Program Analysis, Compiler Design, Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Analysis, LLVM, Security Bugs.
-
-
- https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint
-
Stack Overflow: Assembly-level function fingerprint (2011)
-
- https://patents.google.com/patent/US9459861B1/en
-
Systems and methods for detecting copied computer code using fingerprints (2016)
-
Systems and methods of detecting copying of computer code or portions of computer code involve generating unique fingerprints from compiled computer binaries. The unique fingerprints are simplified representations of functions in the compiled computer binaries and are compared with each other to identify similarities between functions in the respective compiled computer binaries. Copying can be detected when there are sufficient similarities between fingerprints of two functions.
-
- https://dl.acm.org/doi/10.1145/3486860
-
A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)
-
Binary code fingerprinting is crucial in many security applications. Examples include malware detection, software infringement, vulnerability analysis, and digital forensics. It is also useful for security researchers and reverse engineers since it enables high fidelity reasoning about the binary code such as revealing the functionality, authorship, libraries used, and vulnerabilities. Numerous studies have investigated binary code with the goal of extracting fingerprints that can illuminate the semantics of a target application. However, extracting fingerprints is a challenging task since a substantial amount of significant information will be lost during compilation, notably, variable and function naming, the original data and control flow structures, comments, semantic information, and the code layout. This article provides the first systematic review of existing binary code fingerprinting approaches and the contexts in which they are used. In addition, it discusses the applications that rely on binary code fingerprints, the information that can be captured during the fingerprinting process, and the approaches used and their implementations. It also addresses limitations and open questions related to the fingerprinting process and proposes future directions.
-
- https://inria.hal.science/hal-01648996/document
-
BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)
-
Binary code fingerprinting is a challenging problem that requires an in-depth analysis of binary components for deriving identifiable signatures. Fingerprints are useful in automating reverse engineering tasks including clone detection, library identification, authorship attribution, cyber forensics, patch analysis, malware clustering, binary auditing, etc. In this paper, we present BinSign, a binary function fingerprinting framework. The main objective of BinSign is providing an accurate and scalable solution to binary code fingerprinting by computing and matching structural and syntactic code profiles for disassemblies. We describe our methodology and evaluate its performance in several use cases, including function reuse, malware analysis, and indexing scalability. Additionally, we emphasize the scalability aspect of BinSign. We perform experiments on a database of 6 million functions. The indexing process requires an average time of 0.0072 seconds per function. We find that BinSign achieves higher accuracy compared to existing tools.
-
- https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf
-
Software Fingerprinting in LLVM (2021)
-
Executable steganography, the hiding of software machine code inside of a larger program, is a potential approach to introduce new software protection constructs such as watermarks or fingerprints. Software fingerprinting is, therefore, a process similar to steganography, hiding data within other data. The goal of fingerprinting is to hide a unique secret message, such as a serial number, into copies of an executable program in order to provide proof of ownership of that program. Fingerprints are a special case of watermarks, with the difference being that each fingerprint is unique to each copy of a program. Traditionally, researchers describe four aims that a software fingerprint should achieve. These include the fingerprint should be difficult to remove, it should not be obvious, it should have a low false positive rate, and it should have negligible impact on performance. In this research, we propose to extend these objectives and introduce a fifth aim: that software fingerprints should be machine independent. As a result, the same fingerprinting method can be used regardless of the architecture used to execute the program. Hence, this paper presents an approach towards the realization of machine-independent fingerprinting of executable programs. We make use of Low-Level Virtual Machine (LLVM) intermediate representation during the software compilation process to demonstrate both a simple static fingerprinting method as well as a dynamic method, which displays our aim of hardware independent fingerprinting. The research contribution includes a realization of the approach using the LLVM infrastructure and provides a proof of concept for both simple static and dynamic watermarks that are architecture neutral.
-
- https://ieeexplore.ieee.org/document/5090050
-
Syntax tree fingerprinting for source code similarity detection (2009)
-
Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.
- https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf
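A hedged sketch of the general idea (not the paper's exact hashing strategy): hash every subtree over node types only, abstracting away identifiers and literals, so renamed clones collide on the same fingerprint. Python's `ast` module stands in for a JS parser here, and `min_nodes` is an invented threshold to skip trivial subtrees:

```python
import ast
import hashlib

def subtree_fingerprints(source, min_nodes=3):
    """Map fingerprint -> AST nodes whose subtree has that shape."""
    fps = {}
    tree = ast.parse(source)

    def visit(node):
        # Serialize the node type plus child fingerprints; identifier
        # names and constant values are deliberately left out, so
        # renamed clones still hash identically.
        child_fps, size = [], 1
        for child in ast.iter_child_nodes(node):
            fp, n = visit(child)
            child_fps.append(fp)
            size += n
        label = type(node).__name__ + "(" + ",".join(child_fps) + ")"
        fp = hashlib.sha1(label.encode()).hexdigest()[:12]
        if size >= min_nodes:  # ignore trivial subtrees
            fps.setdefault(fp, []).append(node)
        return fp, size

    visit(tree)
    return fps

a = subtree_fingerprints("def f(a, b):\n    return a + b\n")
b = subtree_fingerprints("def g(x, y):\n    return x + y\n")
shared = set(a) & set(b)  # renamed clone: all fingerprints coincide
```

In a real system the fingerprints would be indexed in a database, as the paper describes, so that clone clusters can be retrieved by exact hash lookup.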
-
- https://hal.science/hal-00627811/document
-
Syntax tree fingerprinting: a foundation for source code similarity detection (2011)
-
Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.
-
- https://ieeexplore.ieee.org/document/9960266
-
Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)
-
An Abstract Syntax Tree (AST) is an abstract logical structure of source code represented as a tree. This research utilizes fingerprinting information together with the AST to locate the similarities between source codes. The proposed method can detect plagiarism in source codes using the number of duplicated logical structures. The structural information of a program is stored in the fingerprint format. Then, the fingerprints of source codes are compared to identify the number of similar nodes. The final output is calculated from the number of similar nodes, known as the similarity score. The result shows that the proposed method accurately captures the common modification techniques from basic to advanced.
-
- https://dl.acm.org/doi/abs/10.1145/1286821.1286826
-
Dynamic graph-based software fingerprinting (2007)
-
Fingerprinting embeds a secret message into a cover message. In media fingerprinting, the secret is usually a copyright notice and the cover a digital image. Fingerprinting an object discourages intellectual property theft, or when such theft has occurred, allows us to prove ownership.
The Software Fingerprinting problem can be described as follows. Embed a structure W into a program P such that: W can be reliably located and extracted from P even after P has been subjected to code transformations such as translation, optimization and obfuscation; W is stealthy; W has a high data rate; embedding W into P does not adversely affect the performance of P; and W has a mathematical property that allows us to argue that its presence in P is the result of deliberate actions.
In this article, we describe a software fingerprinting technique in which a dynamic graph fingerprint is stored in the execution state of a program. Because of the hardness of pointer alias analysis such fingerprints are difficult to attack automatically.
- https://dl.acm.org/doi/pdf/10.1145/1286821.1286826
-
- https://openreview.net/forum?id=BJxWx0NYPr
-
Adaptive Structural Fingerprints for Graph Attention Networks (2019)
-
Graph attention network (GAT) is a promising framework to perform convolution and message passing on graphs. Yet, how to fully exploit rich structural information in the attention mechanism remains a challenge. In the current version, GAT calculates attention scores mainly using node features and among one-hop neighbors, while increasing the attention range to higher-order neighbors can negatively affect its performance, reflecting the over-smoothing risk of GAT (or graph neural networks in general), and the ineffectiveness in exploiting graph structural details. In this paper, we propose an "adaptive structural fingerprint" (ADSF) model to fully exploit graph topological details in graph attention network. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus significantly improving subsequent attention layer as well as the convergence of learning. Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to 'cross-talk' with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Empirical results demonstrate the power of our approach in exploiting rich structural information in GAT and in alleviating the intrinsic oversmoothing problem in graph neural networks.
-
- https://digitalcommons.calpoly.edu/theses/2040/
-
Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)
-
Code clones are pieces of code that have the same functionality. While some clones may structurally match one another, others may look drastically different. The inclusion of code clones clutters a code base, leading to increased costs through maintenance. Duplicate code is introduced through a variety of means, such as copy-pasting, code generated by tools, or developers unintentionally writing similar pieces of code. While manual clone identification may be more accurate than automated detection, it is infeasible due to the extensive size of many code bases. Software code clone detection methods have differing degree of success based on the analysis performed. This thesis outlines a method of detecting clones using a program dependence graph and subgraph isomorphism to identify similar subgraphs, ultimately illuminating clones. The project imposes few constraints when comparing code segments to potentially reveal more clones.
- https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses
-
- https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W
-
Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)
-
The code clone detection issue has been researched using a number of explicit factors based on the tokens and contents and has produced effective results. However, exposing code contents may be an impractical option because of privacy and security factors. Moreover, the lack of scalability of past methods is an important challenge. The code flow states can be inferred from code structure and implicitly represented using empirical graphs. The assumption is that modelling of the code clone detection problem can be achieved without the content of the codes being revealed. Here, a Graph-of-Code concept for the code clone detection problem is introduced, which represents codes into graphs. While Graph-of-Code provides structural properties and quantification of its characteristics, it can exclude code contents or tokens to identify the clone type. The aim is to evaluate the impact of graph-of-code structural properties on the performance of code clone detection. This work employs a feature extraction-based approach for unlabelled graphs. The approach generates a "Graph Fingerprint" which represents different topological feature levels. The results of code clone detection indicate that code structure has a significant role in detecting clone types. We found that different GoC models outperform others. The models achieve between 96% and 99% in detecting code clones based on recall, precision, and F1-Score. The GoC approach is capable of detecting code clones on scalable datasets while preserving code privacy.
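In the Graph-of-Code spirit, a toy version of a content-free "graph fingerprint" might summarize a code graph purely by topological features (node count, edge count, sorted degree histogram), so that structurally identical graphs match without any tokens being revealed. The feature set here is an illustrative assumption, not the paper's actual fingerprint:

```python
from collections import Counter

def graph_fingerprint(edges):
    """Content-free fingerprint of an undirected graph given as edge pairs."""
    nodes = {n for edge in edges for n in edge}
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # (node count, edge count, sorted degree sequence) — no labels leak.
    return (len(nodes), len(edges), tuple(sorted(degree.values())))

# Two structurally identical code graphs with completely different labels:
g1 = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
g2 = [("w", "x"), ("x", "y"), ("y", "w"), ("y", "z")]
same = graph_fingerprint(g1) == graph_fingerprint(g2)  # True
```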
-
- https://www.researchgate.net/publication/370980383_A_graph-based_code_representation_method_to_improve_code_readability_classification
-
A graph-based code representation method to improve code readability classification (2023)
-
Context: Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to the source code being classified as pre-defined certain levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. Objective: However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Method: Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to reserve the semantic structural information and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural networks model composing Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. Result: We evaluate our approach to the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. Conclusion: We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.
-
The below content was originally posted in the following comment (April 30, 2024: Ref)
It has been further refined/enhanced since.
This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space:
- https://openai.com/blog/introducing-text-and-code-embeddings
- https://platform.openai.com/docs/guides/embeddings
-
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
-
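As a toy illustration of that distance idea (the three short vectors below stand in for real embeddings, which have thousands of dimensions; the library names in the comments are invented), cosine similarity is a common way to score relatedness:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two builds of the same library should land
# close together; an unrelated snippet should land far away.
lib_chunk    = [0.90, 0.10, 0.30]
lib_minified = [0.88, 0.14, 0.31]
unrelated    = [0.10, 0.95, 0.20]

sim_same = cosine_similarity(lib_chunk, lib_minified)  # close to 1.0
sim_diff = cosine_similarity(lib_chunk, unrelated)     # much lower
```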
- https://platform.openai.com/docs/api-reference/embeddings
- https://platform.openai.com/docs/guides/embeddings
- https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
-
Faiss: A library for efficient similarity search
-
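To keep the sketch dependency-free, Faiss itself is not used below; this is the exact brute-force top-k nearest-neighbour search that Faiss-style indexes accelerate (assuming Euclidean/L2 distance, the default metric of Faiss's flat index), with invented toy vectors:

```python
import math

def top_k(query, corpus, k=2):
    """Indices of the k vectors in corpus closest to query (exact scan)."""
    def dist(a, b):  # Euclidean (L2) distance
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(range(len(corpus)), key=lambda i: dist(query, corpus[i]))
    return ranked[:k]

corpus = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
nearest = top_k([0.0, 0.1], corpus, k=2)  # indices of the two closest vectors
```

Faiss replaces this O(n) scan with optimized exact and approximate index structures so the same query scales to millions of embedding vectors.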
Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here:
- https://en.wikipedia.org/wiki/Content_similarity_detection
-
Content similarity detection
-
A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)
- https://arxiv.org/abs/2306.16171
-
A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)
-
Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance.
-
- https://link.springer.com/article/10.1007/s10664-017-9564-7
-
A comparison of code similarity analysers (2017)
-
Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. 
The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.
-
- https://www.researchgate.net/publication/2840981_Winnowing_Local_Algorithms_for_Document_Fingerprinting
-
Winnowing: Local Algorithms for Document Fingerprinting (2003)
-
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with Moss, a widely-used plagiarism detection service.
- https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
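A hedged, minimal sketch of the winnowing scheme the paper describes: hash all k-grams, slide a window of size w over the hash sequence, and keep the minimum hash in each window (the rightmost one on ties). The parameter values, and Python's built-in `hash()` standing in for the paper's rolling hash, are illustrative choices; the guarantee is that any shared substring of length at least w + k − 1 yields at least one shared fingerprint:

```python
def winnow(text, k=5, w=4):
    """Return a set of (hash, position) fingerprints for text."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    hashes = [hash(g) & 0xFFFFFFFF for g in grams]  # stand-in rolling hash
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        # pick the rightmost occurrence of the minimum in the window
        j = i + (w - 1 - window[::-1].index(m))
        fingerprints.add((m, j))
    return fingerprints

doc1 = winnow("the quick brown fox jumps over the lazy dog")
doc2 = winnow("a quick brown fox ran past the barn")
# The shared substring "quick brown fox " (16 chars ≥ w + k − 1 = 8)
# guarantees at least one common fingerprint hash.
shared = {h for h, _ in doc1} & {h for h, _ in doc2}
```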
-
Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023)
- https://www.researchgate.net/publication/375651686_Source_Code_Plagiarism_Detection_with_Pre-Trained_Model_Embeddings_and_Automated_Machine_Learning
-
Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023)
- https://aclanthology.org/2023.ranlp-1.34.pdf
-
- https://www.researchgate.net/publication/262322336_A_Source_Code_Similarity_System_for_Plagiarism_Detection
-
A Source Code Similarity System for Plagiarism Detection (2013)
-
Source code plagiarism is an easy to do task, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cut-off threshold values of 35–70%.
-
- https://www.mdpi.com/2076-3417/10/21/7519
-
A Source Code Similarity Based on Siamese Neural Network (2020)
-
Finding similar code snippets is a fundamental task in the field of software engineering. Several approaches have been proposed for this task using statistical language models, which focus on the syntax and structure of code rather than the deep semantic information underlying it. In this paper, a Siamese Neural Network is proposed that maps code snippets into continuous space vectors and tries to capture their semantic meaning. First, an unsupervised pre-training method models code snippets as a weighted series of word vectors, with the weights fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network model is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to measure the similarity score between pairs of code snippets. Moreover, we have implemented our approach on a dataset of functionally similar code. The experimental results show that our method improves performance over the single word-embedding method.
-
- https://www.researchgate.net/publication/337196468_Detecting_Source_Code_Similarity_Using_Compression
-
Detecting Source Code Similarity Using Compression (2019)
-
Different forms of plagiarism make a fair assessment of student assignments more difficult. Source code plagiarisms pose a significant challenge especially for automated assessment systems aimed for students' programming solutions. Different automated assessment systems employ different text or source code similarity detection tools, and all of these tools have their advantages and disadvantages. In this paper, we revitalize the idea of similarity detection based on string complexity and compression. We slightly adapt an existing, third-party, approach, implement it and evaluate its potential on synthetically generated cases and on a small set of real student solutions. On synthetic cases, we showed that average deviation (in absolute values) from the expected similarity is less than 1% (0.94%). On the real-life examples of student programming solutions we compare our results with those of two established tools. The average difference is around 18.1% and 11.6%, while the average difference between those two tools is 10.8%. However, the results of all three tools follow the same trend. Finally, a deviation to some extent is expected as observed tools apply different approaches that are sensitive to other factors of similarities. Gained results additionally demonstrate open challenges in the field.
- https://ceur-ws.org/Vol-2508/paper-pri.pdf
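A minimal sketch in the spirit of compression-based similarity, using the textbook normalized compression distance (NCD) with zlib; this is not the paper's exact adapted method, and the code snippets being compared are invented examples:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: ~0 for near-duplicates, ~1 for unrelated data."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    # If x and y share structure, compressing the concatenation adds
    # little beyond compressing the larger input alone.
    return (cxy - min(cx, cy)) / max(cx, cy)

clone_a = b"for (let i = 0; i < n; i++) { total += items[i].price; }"
clone_b = b"for (let j = 0; j < n; j++) { total += items[j].price; }"
other   = b"SELECT name, COUNT(*) FROM users GROUP BY name ORDER BY 2 DESC"

near = ncd(clone_a, clone_b)  # renamed-variable clone: low distance
far = ncd(clone_a, other)     # unrelated code: higher distance
```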
-
- https://www.nature.com/articles/s41598-023-42769-9
-
Binary code similarity analysis based on naming function and common vector space (2023)
-
Binary code similarity analysis is widely used in the field of vulnerability search where source code may not be available to detect whether two binary functions are similar or not. Based on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from the shortcomings of large differences in instruction syntaxes across different target platforms, inability to align control flow graph nodes, and less introduction of high-level semantics of stability, which pose challenges for identifying similar computations between binary functions of different platforms generated from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and a cross-platform binary function similarity comparison model N_Match is proposed. The model elevates different platform instructions to the same semantic space to shield their underlying platform instruction differences, uses graph embedding technology to learn the stability semantics of neighbors, extracts high-level knowledge of naming function to alleviate the differences brought about by cross-platform and cross-optimization levels, and combines the stable graph structure as well as the stable, platform-independent API knowledge of naming function to represent the final semantics of functions. The experimental results show that the model accuracy of N_Match outperforms the baseline model in terms of cross-platform, cross-optimization level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, the mAP exceeds the current graph embedding model by 66%. In addition, we also give several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match
-
- https://arxiv.org/abs/2305.03843
-
REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023)
-
This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training. We present the first-ever code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements demonstrating both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine tuning, even when enhancing the largest available LLMs highlighting the importance for open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called REINFOREST.
-
- https://www.usenix.org/conference/usenixsecurity21/presentation/ahmadi
-
Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021)
-
Probabilistic classification has shown success in detecting known types of software bugs. However, the works following this approach tend to require a large amount of specimens to train their models. We present a new machine learning-based bug detection technique that does not require any external code or samples for training. Instead, our technique learns from the very codebase on which the bug detection is performed, and therefore, obviates the need for the cumbersome task of gathering and cleansing training samples (e.g., buggy code of certain kinds). The key idea behind our technique is a novel two-step clustering process applied on a given codebase. This clustering process identifies code snippets in a project that are functionally-similar yet appear in inconsistent forms. Such inconsistencies are found to cause a wide range of bugs, anything from missing checks to unsafe type conversions. Unlike previous works, our technique is generic and not specific to one type of inconsistency or bug. We prototyped our technique and evaluated it using 5 popular open source software, including QEMU and OpenSSL. With a minimal amount of manual analysis on the inconsistencies detected by our tool, we discovered 22 new unique bugs, despite the fact that many of these programs are constantly undergoing bug scans and new bugs in them are believed to be rare.
- https://www.usenix.org/system/files/sec21summer_ahmadi.pdf
-
- https://theory.stanford.edu/~aiken/moss/
-
MOSS: A System for Detecting Software Similarity
-
- https://github.com/fanghon/antiplag
-
antiplag - similarity checking software for program code, documents, and pictures (2019). The software checks and compares electronic assignments submitted by students for similarity. It can analyze source code in multiple programming languages (java, c/c++, python, etc.), text documents in multiple formats (txt, doc, docx, pdf, etc.) in English and in both simplified and traditional Chinese, and images in multiple formats (png, jpg, gif, bmp, etc.). It reports the code, text, and images with high similarity, thereby helping to detect plagiarism between students.
-
- https://github.com/BK-SCOSS/scoss
-
scoss A Source Code Similarity System - SCOSS
-
- https://github.com/dodona-edu/dolos
-
Dolos (2019-2024+) Dolos is a source code plagiarism detection tool for programming exercises. Dolos helps teachers in discovering students sharing solutions, even if they are modified. By providing interactive visualizations, Dolos can also be used to sensitize students to prevent plagiarism.
- https://dolos.ugent.be/
- https://dolos.ugent.be/about/algorithm.html
-
How Dolos works Conceptually, the plagiarism detection pipeline of Dolos can be split into four successive steps:
- Tokenization
- Fingerprinting
- Indexing
- Reporting
-
Tokenization To be immune against masking plagiarism through techniques such as renaming variables and functions, Dolos doesn't process the source code under investigation directly. It starts with a tokenization step using Tree-sitter. Tree-sitter can generate syntax trees for many programming languages, converting source code into a more structured form and masking the specific naming of variables and functions.
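As an illustration of this masking idea: Dolos itself uses Tree-sitter, but Python's stdlib `tokenize` module can stand in for a sketch. The generic `"ID"` marker and the choice of which layout tokens to drop are assumptions of this sketch, not Dolos's actual token format.

```python
import io
import keyword
import tokenize

def masked_tokens(source: str) -> list[str]:
    """Python token stream with every identifier replaced by a generic 'ID'."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # mask variable/function names, keep keywords
        elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # drop layout tokens and comments for this small demo
        else:
            out.append(tok.string)
    return out

# Renaming identifiers leaves the masked token stream unchanged:
original = masked_tokens("def add(x, y):\n    return x + y\n")
renamed = masked_tokens("def plus(a, b):\n    return a + b\n")
```

Both calls produce the identical stream `['def', 'ID', '(', 'ID', ',', 'ID', ')', ':', 'return', 'ID', '+', 'ID']`, which is why renaming-based masking does not fool the later fingerprinting stages.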
-
Fingerprinting To measure similarities between (converted) files, Dolos tries to find common sequences of tokens. More specifically, it uses subsequences of fixed length called k-grams. To make these comparisons efficient and reduce memory usage, all k-grams are hashed using a rolling hash function (the one used by Rabin and Karp in their string matching algorithm). The length k of the k-grams can be set with the -k option.
To further reduce memory usage, only a subset of all hashes is stored. The selection of hashes is done with the Winnowing algorithm described by Schleimer, Wilkerson and Aiken. In short: only the hash with the smallest numerical value is kept for each window. The window length (in k-grams) can be altered with the -w option.
The remaining hashes are the fingerprints of the analyzed files. Internally, these are stored as simple integers.
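The k-gram hashing and winnowing steps can be sketched as follows. The `BASE`/`MOD` parameters and the CRC32 token-to-integer encoding are choices of this sketch, not Dolos's actual implementation:

```python
import zlib

BASE = 1_000_003          # sketch parameters, not Dolos's actual choices
MOD = (1 << 61) - 1

def kgram_hashes(tokens: list[str], k: int) -> list[int]:
    """Rabin-Karp rolling hashes of every k-gram in the token stream."""
    vals = [zlib.crc32(t.encode()) for t in tokens]  # deterministic token -> int
    if len(vals) < k:
        return []
    pow_k = pow(BASE, k - 1, MOD)
    h = 0
    for v in vals[:k]:
        h = (h * BASE + v) % MOD
    hashes = [h]
    for i in range(k, len(vals)):  # slide the window by one token at a time
        h = ((h - vals[i - k] * pow_k) * BASE + vals[i]) % MOD
        hashes.append(h)
    return hashes

def winnow(hashes: list[int], w: int) -> list[tuple[int, int]]:
    """Winnowing: keep the rightmost minimal hash of each window of w hashes."""
    picked = set()
    for i in range(max(len(hashes) - w + 1, 0)):
        best = 0
        for j in range(1, w):
            if hashes[i + j] <= hashes[i + best]:
                best = j  # <= keeps the rightmost minimum, as in the paper
        picked.add((i + best, hashes[i + best]))
    return sorted(picked)

tokens = "def ID ( ID , ID ) : return ID + ID".split()
fingerprints = winnow(kgram_hashes(tokens, k=5), w=4)
```

Each remaining `(position, hash)` pair is one fingerprint; the positions let the later reporting step map matches back to lines.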
-
Indexing Because Dolos needs to compare all files with each other, it is more efficient to first create an index containing the fingerprints of all files. For each of the fingerprints encountered in any of the files, we store the file and the corresponding line number where we encountered that fingerprint.
As soon as a fingerprint is stored in the index twice, this is recorded as a match between the two files because they share at least one k-gram.
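A minimal sketch of such an inverted index (the file names and fingerprint values below are hypothetical):

```python
from collections import defaultdict

def build_index(fingerprints_by_file: dict) -> dict:
    """Inverted index: fingerprint hash -> list of (file, position) occurrences."""
    index = defaultdict(list)
    for fname, fps in fingerprints_by_file.items():
        for pos, h in fps:
            index[h].append((fname, pos))
    return index

def matching_pairs(index: dict) -> set:
    """File pairs sharing at least one fingerprint (hence at least one k-gram)."""
    pairs = set()
    for occurrences in index.values():
        files = sorted({f for f, _ in occurrences})
        for i in range(len(files)):
            for j in range(i + 1, len(files)):
                pairs.add((files[i], files[j]))
    return pairs

# Hypothetical (position, hash) fingerprints for three submissions:
idx = build_index({
    "alice.py": [(0, 11), (3, 42)],
    "bob.py":   [(1, 42)],
    "carol.py": [(0, 7)],
})
```

Here only hash 42 occurs in two files, so `matching_pairs` reports the single pair `("alice.py", "bob.py")`.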
-
Reporting Dolos finally collects all fingerprints that occur in more than one file and aggregates the results into a report.
This report contains all file pairs that have at least one common fingerprint, together with some metrics:
- similarity: the fraction of shared fingerprints between the two files
- total overlap: the absolute value of shared fingerprints, useful for larger projects
- longest fragment: the length (in fingerprints) of the longest subsequence of fingerprints matching between the two files, useful when not the whole source code is copied
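One plausible reading of these three metrics, computed from two files' fingerprints (the exact definitions Dolos uses may differ in detail; the Jaccard-style similarity is an assumption of this sketch):

```python
def similarity(fps_a: set, fps_b: set) -> float:
    """Fraction of shared fingerprints (a Jaccard-style reading of the metric)."""
    return len(fps_a & fps_b) / len(fps_a | fps_b)

def total_overlap(fps_a: set, fps_b: set) -> int:
    """Absolute number of shared fingerprints."""
    return len(fps_a & fps_b)

def longest_fragment(seq_a: list, seq_b: list) -> int:
    """Longest contiguous run of fingerprints common to both ordered sequences
    (classic O(n*m) longest-common-substring dynamic programming)."""
    best = 0
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        cur = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, 1):
            if a == b:
                cur[j] = prev[j - 1] + 1  # extend the diagonal run
                best = max(best, cur[j])
        prev = cur
    return best
```

For example, fingerprint sequences `[1, 2, 3, 4, 9]` and `[7, 2, 3, 4, 5]` share three fingerprints, all in one contiguous run.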
-
- https://dolos.ugent.be/about/languages.html
- https://dolos.ugent.be/about/publications.html
-
Publications Dolos is developed by Team Dodona at Ghent University in Belgium. Our research is published in the following journals and conferences.
-
-
- https://github.com/danielplohmann/mcrit
-
MinHash-based Code Relationship & Investigation Toolkit (MCRIT) (2021-2025+) MCRIT is a framework created to simplify the application of the MinHash algorithm in the context of code similarity. It can be used to rapidly implement "shinglers", i.e. methods which encode properties of disassembled functions, to then be used for similarity estimation via the MinHash algorithm. It is tailored to work with disassembly reports emitted by SMDA.
-
1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis (2021)
- https://arxiv.org/abs/2112.12928
-
1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis (2021)
-
Binary similarity analysis is critical to many code-reuse-related issues and "1-to-1" mechanism is widely applied, where one function in a binary file is matched against one function in a source file or binary file. However, we discover that function mapping is a more complex problem of "1-to-n" or even "n-to-n" due to the existence of function inlining.
In this paper, we investigate the effect of function inlining on binary similarity analysis. We first construct 4 inlining-oriented datasets for four similarity analysis tasks, including code search, OSS reuse detection, vulnerability detection, and patch presence test. Then, we further study the extent of function inlining, the performance of existing works under function inlining, and the effectiveness of existing inlining-simulation strategies. Results show that the proportion of function inlining can reach nearly 70%, while most existing works neglect it and use "1-to-1" mechanism. The mismatches cause a 30% loss in performance during code search and a 40% loss during vulnerability detection. Moreover, two existing inlining-simulation strategies can only recover 60% of the inlined functions. We discover that inlining is usually cumulative when optimization increases. Conditional inlining and incremental inlining are suggested to design low-cost and high-coverage inlining-simulation strategies.
- https://arxiv.org/pdf/2112.12928
- https://github.com/island255/TOSEM2022
-
Repository for the paper "1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis"
-
-
One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)
- https://deepai.org/publication/one-to-one-or-one-to-many-what-function-inlining-brings-to-binary2source-similarity-analysis
-
One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)
-
- https://arxiv.org/abs/2112.12928v1
-
One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis (2021)
-
Binary2source code matching is critical to many code-reuse-related tasks, including code clone detection, software license violation detection, and reverse engineering assistance. Existing binary2source works always apply a "1-to-1" (one-to-one) mechanism, i.e., one function in a binary file is matched against one function in a source file. However, we assume that such mapping is usually a more complex problem of "1-to-n" (one-to-many) due to the existence of function inlining. To the best of our knowledge, few existing works have systematically studied the effect of function inlining on binary2source matching tasks. This paper will address this issue. To support our study, we first construct two datasets containing 61,179 binaries and 19,976,067 functions. We also propose an automated approach to label the dataset with line-level and function-level mapping. Based on our labeled dataset, we then investigate the extent of function inlining, the factors affecting function inlining, and the impact of function inlining on existing binary2source similarity methods. Finally, we discuss the interesting findings and give suggestions for designing more effective methodologies.
- https://arxiv.org/pdf/2112.12928v1
- https://github.com/island255/source2binary_dataset_construction
-
Source2binary Dataset Construction This is the repository for the paper "One to One or One to many? What function inline brings to binary similarity analysis".
-
-
- https://www.researchgate.net/publication/357365866_One-to-One_or_One-to-many_What_function_inlining_brings_to_binary2source_similarity_analysis
-
One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis
-
- https://arxiv.org/abs/2210.15159
-
Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining (2022)
-
Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis (SCA). The "1-to-1" mechanism has been applied in existing binary2source matching works, in which one binary function is matched against one source function. However, we discovered that such mapping could be "1-to-n" (one query binary function maps multiple source functions), due to the existence of function inlining.
To help conduct binary2source function matching under function inlining, we propose a method named O2NMatcher to generate Source Function Sets (SFSs) as the matching target for binary functions with inlining. We first propose a model named ECOCCJ48 for inlined call site prediction. To train this model, we leverage the compilable OSS to generate a dataset with labeled call sites (inlined or not), extract several features from the call sites, and design a compiler-opt-based multi-label classifier by inspecting the inlining correlations between different compilations. Then, we use this model to predict the labels of call sites in the uncompilable OSS projects without compilation and obtain the labeled function call graphs of these projects. Next, we regard the construction of SFSs as a sub-tree generation problem and design root node selection and edge extension rules to construct SFSs automatically. Finally, these SFSs will be added to the corpus of source functions and compared with binary functions with inlining. We conduct several experiments to evaluate the effectiveness of O2NMatcher and results show our method increases the performance of existing works by 6% and exceeds all the state-of-the-art works.
- https://arxiv.org/pdf/2210.15159
-
- https://github.com/island255/binary2source-matching-under-function-inlining
-
binary2source-matching-under-function-inlining This is the repository illustrating how we label the inlined call sites, train the classifier for ICS prediction, and generate SFSs for binary2source matching.
-
Repository for the paper "Binary2Source Function Similarity Detection Under Function Inlining"
-
- https://arxiv.org/abs/2401.05739v1
-
Cross-Inlining Binary Function Similarity Detection (2024)
-
Binary function similarity detection plays an important role in a wide range of security applications. Existing works usually assume that the query function and target function share equal semantics and compare their full semantics to obtain the similarity. However, we find that the function mapping is more complex, especially when function inlining happens.
In this paper, we will systematically investigate cross-inlining binary function similarity detection. We first construct a cross-inlining dataset by compiling 51 projects using 9 compilers, with 4 optimizations, to 6 architectures, with 2 inlining flags, which results in two datasets both with 216 combinations. Then we construct the cross-inlining function mappings by linking the common source functions in these two datasets. Through analysis of this dataset, we find that three cross-inlining patterns widely exist while existing work suffers when detecting cross-inlining binary function similarity. Next, we propose a pattern-based model named CI-Detector for cross-inlining matching. CI-Detector uses the attributed CFG to represent the semantics of binary functions and GNN to embed binary functions into vectors. CI-Detector respectively trains a model for these three cross-inlining patterns. Finally, the testing pairs are input to these three models and all the produced similarities are aggregated to produce the final similarity. We conduct several experiments to evaluate CI-Detector. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
- https://arxiv.org/pdf/2401.05739v1
- https://github.com/island255/cross-inlining_binary_function_similarity
-
The repository of the paper "Cross-Inlining Binary Function Similarity Detection"
-
-
- https://github.com/JackHCC/Pcode-Similarity
-
Pcode-Similarity (2021) Algorithm for calculating similarity between function and library function.
-
- https://github.com/JackHCC/Awesome-Binary-Code-Similarity-Detection-2021
-
Awesome Binary code similarity detection (2021) Awesome list for Binary Code Similarity Detection in 2021
-
- https://github.com/Jaso1024/Semantic-Code-Embeddings
-
SCALE: Semantic Code Analysis via Learned Embeddings (2023) 3rd best paper on Artificial Intelligence track | presented at the 2023 International Conference on AI, Blockchain, Cloud Computing and Data Analytics This repository holds the code and supplementary materials for SCALE: Semantic Code Analysis via Learned Embeddings. This research explores the efficacy of contrastive learning alongside large language models as a paradigm for developing a model capable of creating code embeddings indicative of code on a functional level. Existing pre-trained models in NLP have demonstrated impressive success, surpassing previous benchmarks in various language-related tasks. However, when it comes to the field of code understanding, these models still face notable limitations. Code isomorphism, which deals with determining functional similarity between pieces of code, presents a challenging problem for NLP models. In this paper, we explore two approaches to code isomorphism. Our first approach, dubbed SCALE-FT, formulates the problem as a binary classification task, where we feed pairs of code snippets to a Large Language Model (LLM), using the embeddings to predict whether the given code segments are equivalent. The second approach, SCALE-CLR, adopts the SimCLR framework to generate embeddings for individual code snippets. By processing code samples with an LLM and observing the corresponding embeddings, we assess the similarity of two code snippets. These approaches enable us to leverage function-based code embeddings for various downstream tasks, such as code-optimization, code-comment alignment, and code classification. Our experiments on the CodeNet Python800 benchmark demonstrate promising results for both approaches. Notably, our SCALE-FT using Babbage-001 (GPT-3) achieves state-of-the-art performance, surpassing various benchmark models such as GPT-3.5 Turbo and GPT-4. 
Additionally, Salesforce's 350-million parameter CodeGen, when trained with the SCALE-FT framework, surpasses GPT-3.5 and GPT-4.
-
- https://github.com/Aida-yy/binary-sim
-
binary-sim - binary similarity using Deep learning (2023)
-
Features: Function semantic information + control flow graph
Semantic feature extraction: extract the function's byte data, assembly instruction data, and integer data, encode each with an independent text encoder (DPCNN, TextCNN), and obtain its embedding representation.
Structural feature extraction: based on the CFG and the assembly instructions in each block, generate an ACFG and encode it with a graph neural network to obtain an embedding; in addition, since similar functions tend to have similar node orderings in their control flow graphs, the CFG's adjacency matrix is taken as input to a CNN to obtain a further embedding.
Contrastive learning model structure: InfoNCE loss + In-batch negatives
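The InfoNCE objective with in-batch negatives mentioned above can be sketched in plain Python: each anchor embedding is pulled toward its own positive while the other positives in the batch act as negatives. The temperature value and toy embeddings are illustrative; real implementations use a tensor library.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss; positives[i] is the positive for anchors[i],
    and all other positives in the batch serve as in-batch negatives."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the true pair
    return sum(losses) / len(losses)

# Toy batch: loss is near zero when each anchor matches its own positive,
# and large when the pairing is scrambled.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, anchors)
scrambled_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```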
-
- https://arxiv.org/abs/2401.09885
-
Source Code Clone Detection Using Unsupervised Similarity Measures (2024)
-
Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for identifying source code clone detection. The goal is to overview the current state-of-the-art techniques, their strengths, and weaknesses. To do that, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at this https URL
- https://github.com/jorge-martinez-gil/codesim
-
Source Code Clone Detection Using Unsupervised Similarity Measures
-
This repository contains the source code for reproducing the paper Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-031-56281-5_2.
-
-
Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024)
- https://github.com/jorge-martinez-gil/crosslingual-clone-detection
-
Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024) Systematic study to determine the best methods to assess the similarity between code snippets in different programming languages
-
- https://arxiv.org/abs/2004.02843
-
Improved Code Summarization via a Graph Neural Network (2020)
-
Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and output a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe using a graph neural network together with the source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from machine learning literature.
-
- https://arxiv.org/abs/2002.08653
-
Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2020)
-
Code clones are semantically similar code fragment pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. To the best of our knowledge, we are the first to apply graph neural networks to the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
- https://github.com/jacobwwh/graphmatch_clone
-
Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
-
Code and data for paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".
-
-
Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)
- https://proceedings-of-deim.github.io/DEIM2023/1b-9-4.pdf
-
Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)
-
While giving comprehensible names to identifiers is essential in software development, it is sometimes difficult since it requires development experience and knowledge of the application domain. Among work to support the developer’s identifier naming, a GNN-based class name estimation approach learns a graph of relationships between program elements, i.e., classes, methods, and fields, but it ignores information within the methods. This study proposes an approach that exploits information from method bodies, which can help estimate correct class names. The proposed approach extends the existing GNN-based approach to use embeddings of the corresponding ASTs for method nodes. An evaluation experiment measures how correctly the proposed approach can estimate class names in large datasets of open-source Java projects. The experimental result shows that the proposed approach improves the estimation correctness compared to the existing approach.
-
- https://medium.com/stanford-cs224w/code-similarity-using-graph-neural-networks-1e58aa21bd92
-
Code Similarity Using Graph Neural Networks (2023)
- Abstract/Summary by ChatGPT 4.5:
-
Code similarity detection is crucial for various software engineering tasks, including plagiarism detection, code search, refactoring, and automated code completion. Traditional approaches rely heavily on syntactic similarity, which fails to capture deeper semantic relationships between code segments. Inspired by recent advances in natural language processing and code intelligence using transformer-based models (e.g., BERT, GPT, and CodeBERT), our work explores the use of Graph Neural Networks (GNNs) to address code similarity through the semantic understanding provided by graph structures.
We evaluate several GNN architectures—including GraphSAGE, Graph Attention Networks (GAT), and a novel OrderGNN leveraging permutation-aware aggregations—on the widely-used POJ-104 dataset, consisting of 32,000 C++ code segments spanning 64 distinct programming problems. Our pipeline involves parsing source code into Abstract Syntax Trees (ASTs) using the CLANG library, transforming these ASTs into NetworkX graphs, and subsequently into PyTorch Geometric (PyG) data objects for input into our GNN models.
Our results demonstrate that permutation-invariant methods such as GraphSAGE and GAT struggle to capture critical ordered structures inherent in programming languages, resulting in limited performance (MAP@R). In contrast, the OrderGNN model, employing LSTM-based aggregation to preserve node ordering information, achieves significantly better semantic similarity identification, highlighting the necessity of permutation-awareness for effective code analysis. Nevertheless, the OrderGNN model presents substantial computational and memory overhead, limiting scalability.
We conclude by suggesting future directions, including the exploration of more memory-efficient permutation-aware aggregation functions and alternative graph representations beyond the standard AST structure to further improve the efficacy and applicability of GNN-based code similarity detection methods.
-
-
JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games (2020)
- https://taoxiease.github.io/publications/icse20seip-jsidentify.pdf
-
JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games
-
Online mini games are lightweight game apps, typically implemented in JavaScript (JS), that run inside another host mobile app (such as WeChat, Baidu, and Alipay). These mini games do not need to be downloaded or upgraded through an app store, making it possible for one host mobile app to perform the aggregated services of many apps. Hundreds of millions of users play tens of thousands of mini games, which make a great profit, and consequently are popular targets of plagiarism. In cases of plagiarism, deeply obfuscated code cloned from the original code often embodies malicious code segments and copyright infringements, posing great challenges for existing plagiarism detection tools. To address these challenges, in this paper, we design and implement JSidentify, a hybrid framework to detect plagiarism among online mini games. JSidentify includes three techniques based on different levels of code abstraction. JSidentify applies the included techniques in the constructed priority list one by one to reduce overall detection time. Our evaluation results show that JSidentify outperforms other existing related state-of-the-art approaches and achieves the best precision and recall with affordable detection time when detecting plagiarism among online mini games and clones among general JS programs. Our deployment experience of JSidentify also shows that JSidentify is indispensable in the daily operations of online mini games in WeChat.
-
- https://ieeexplore.ieee.org/document/9276581
-
JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games
-
- https://www.researchgate.net/publication/344433961_JSidentify_a_hybrid_framework_for_detecting_plagiarism_among_JavaScript_code_in_online_mini_games
-
JSidentify: a hybrid framework for detecting plagiarism among JavaScript code in online mini games (June 2020)
-
- https://taoxiease.github.io/publications/fse16-racs.pdf
-
Relationship-aware code search for JavaScript frameworks
-
JavaScript frameworks, such as jQuery, are widely used for developing web applications. To facilitate using these JavaScript frameworks to implement a feature (e.g., functionality), a large number of programmers often search for code snippets that implement the same or similar feature. However, existing code search approaches tend to be ineffective, without taking into account the fact that JavaScript code snippets often implement a feature based on various relationships (e.g., sequencing, condition, and callback relationships) among the invoked framework API methods. To address this issue, we present a novel Relationship-Aware Code Search (RACS) approach for finding code snippets that use JavaScript frameworks to implement a specific feature. In advance, RACS collects a large number of code snippets that use some JavaScript frameworks, mines API usage patterns from the collected code snippets, and represents the mined patterns with method call relationship (MCR) graphs, which capture framework API methods' signatures and their relationships. Given a natural language (NL) search query issued by a programmer, RACS conducts NL processing to automatically extract an action relationship (AR) graph, which consists of actions and their relationships inferred from the query. In this way, RACS reduces code search to the problem of graph search: finding similar MCR graphs for a given AR graph. We conduct evaluations against representative real-world jQuery questions posted on Stack Overflow, based on 308,294 code snippets collected from over 81,540 files on the Internet. The evaluation results show the effectiveness of RACS: the top 1 snippet produced by RACS matches the target code snippet for 46% questions, compared to only 4% achieved by a relationship-oblivious approach.
-
- https://dl.acm.org/doi/10.1145/2950290.2950341
-
Relationship-aware code search for JavaScript frameworks
-
- https://arxiv.org/abs/2204.02765
-
Code Search: A Survey of Techniques for Finding Code
-
The immense amounts of source code provide ample challenges and opportunities during software development. To handle the size of code bases, developers commonly search for code, e.g., when trying to find where a particular feature is implemented or when looking for code examples to reuse. To support developers in finding relevant code, various code search engines have been proposed. This article surveys 30 years of research on code search, giving a comprehensive overview of challenges and techniques that address them. We discuss the kinds of queries that code search engines support, how to preprocess and expand queries, different techniques for indexing and retrieving code, and ways to rank and prune search results. Moreover, we describe empirical studies of code search in practice. Based on the discussion of prior work, we conclude the article with an outline of challenges and opportunities to be addressed in the future.
- https://arxiv.org/pdf/2204.02765
-
Code Search: A Survey of Techniques for Finding Code
-
-
- https://www.researchgate.net/publication/359786256_Code_Search_A_Survey_of_Techniques_for_Finding_Code
-
Code Search: A Survey of Techniques for Finding Code
-
- https://arxiv.org/abs/1707.05005
-
graph2vec: Learning Distributed Representations of Graphs (2017)
-
Recent works on representation learning for graph structured data predominantly focus on learning distributed representations of graph substructures such as nodes and subgraphs. However, many graph analytics tasks such as graph classification and clustering require representing entire graphs as fixed length feature vectors. While the aforementioned approaches are naturally unequipped to learn such representations, graph kernels remain as the most effective way of obtaining them. However, these graph kernels use handcrafted features (e.g., shortest paths, graphlets, etc.) and hence are hampered by problems such as poor generalization. To address this limitation, in this work, we propose a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs. graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic. Hence, they could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Our experiments on several benchmark and large real-world datasets show that graph2vec achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and are competitive with state-of-the-art graph kernels.
- https://arxiv.org/pdf/1707.05005
- https://github.com/benedekrozemberczki/graph2vec
-
Graph2Vec
-
A parallel implementation of "graph2vec: Learning Distributed Representations of Graphs" (MLGWorkshop 2017).
-
The model is now also available in the Karate Club package.
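graph2vec's "words" are rooted Weisfeiler-Lehman (WL) subtrees, which it feeds into a doc2vec-style model. As a minimal sketch of that underlying WL idea (using only `networkx`'s built-in WL graph hash, not graph2vec itself), isomorphic graphs fingerprint identically while structurally different graphs almost certainly do not:

```python
import networkx as nx

def wl_fingerprint(graph: nx.Graph, iterations: int = 3) -> str:
    """Hash a graph via Weisfeiler-Lehman label refinement -- the same
    subtree features that graph2vec builds its embeddings from."""
    return nx.weisfeiler_lehman_graph_hash(graph, iterations=iterations)

# Two isomorphic graphs (same structure, relabelled nodes) hash identically...
g1 = nx.cycle_graph(6)
g2 = nx.relabel_nodes(nx.cycle_graph(6), {i: i + 10 for i in range(6)})
assert wl_fingerprint(g1) == wl_fingerprint(g2)

# ...while a structurally different graph gets a different hash
# (here guaranteed, since the path's endpoints have degree 1).
g3 = nx.path_graph(6)
assert wl_fingerprint(g1) != wl_fingerprint(g3)
```

Unlike graph2vec's learned embeddings, a WL hash only answers "same or different"; graph2vec's contribution is turning those same WL features into vectors that support nearest-neighbour similarity.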
-
- https://github.com/annamalai-nr/graph2vec_tf
-
This repository contains the "tensorflow" implementation of our paper "graph2vec: Learning distributed representations of graphs".
-
-
- https://arxiv.org/abs/1808.05689
-
SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (2018; revised 2020)
-
Graph similarity search is among the most important graph-based applications, e.g. finding the chemical compounds that are most similar to a query compound. Graph similarity computation, such as Graph Edit Distance (GED) and Maximum Common Subgraph (MCS), is the core operation of graph similarity search and many other applications, but very costly to compute in practice. Inspired by the recent success of neural network approaches to several graph applications, such as node or graph classification, we propose a novel neural network based approach to address this classic yet challenging graph problem, aiming to alleviate the computational burden while preserving a good performance.
-
The proposed approach, called SimGNN, combines two strategies. First, we design a learnable embedding function that maps every graph into a vector, which provides a global summary of a graph. A novel attention mechanism is proposed to emphasize the important nodes with respect to a specific similarity metric. Second, we design a pairwise node comparison method to supplement the graph-level embeddings with fine-grained node-level information. Our model achieves better generalization on unseen graphs, and in the worst case runs in quadratic time with respect to the number of nodes in two graphs. Taking GED computation as an example, experimental results on three real graph datasets demonstrate the effectiveness and efficiency of our approach. Specifically, our model achieves smaller error rate and great time reduction compared against a series of baselines, including several approximation algorithms on GED computation, and many existing graph neural network based models. To the best of our knowledge, we are among the first to adopt neural networks to explicitly model the similarity between two graphs, and provide a new direction for future research on graph similarity computation and graph similarity search.
- https://arxiv.org/pdf/1808.05689
- https://github.com/benedekrozemberczki/SimGNN
-
SimGNN
-
A PyTorch implementation of "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation" (WSDM 2019).
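For a sense of why SimGNN's approximation matters: exact graph edit distance is exponential in the worst case, and even NetworkX's built-in solver is only practical for tiny graphs. A minimal illustration of the exact computation SimGNN learns to approximate (not SimGNN itself):

```python
import networkx as nx

# Exact graph edit distance (GED) -- the costly core operation of graph
# similarity search. Feasible here only because the graphs are tiny.
g1 = nx.cycle_graph(4)   # 4 nodes, 4 edges
g2 = nx.path_graph(4)    # 4 nodes, 3 edges

ged = nx.graph_edit_distance(g1, g2)
print(ged)  # 1.0 -- deleting a single edge turns the cycle into the path
```

SimGNN's pitch is replacing this combinatorial search with a learned, quadratic-time estimate over graph and node embeddings.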
-
- https://github.com/chihming/awesome-network-embedding
-
awesome-network-embedding
-
A curated list of network embedding techniques.
-
Also called network representation learning, graph embedding, knowledge embedding, etc.
The task is to learn the representations of the vertices from a given network.
-
- https://karateclub.readthedocs.io/en/latest/
-
Karate Club is an unsupervised machine learning extension library for NetworkX. It builds on other open source linear algebra, machine learning, and graph signal processing libraries such as Numpy, Scipy, Gensim, PyGSP, and Scikit-Learn. Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and pieces from prominent journals.
-
- https://networkx.org/
-
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
-
Software for complex networks
- Data structures for graphs, digraphs, and multigraphs
- Many standard graph algorithms
- Network structure and analysis measures
- Generators for classic graphs, random graphs, and synthetic networks
- Nodes can be "anything" (e.g., text, images, XML records)
- Edges can hold arbitrary data (e.g., weights, time-series)
- Open source 3-clause BSD license
- Well tested with over 90% code coverage
- Additional benefits from Python include fast prototyping, easy to teach, and multi-platform
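A small example of the kind of analysis NetworkX makes cheap, using a toy call graph (callers pointing to callees), since call graphs are one of the program features used for fingerprinting elsewhere in these notes:

```python
import networkx as nx

# A toy call graph: edges point from caller to callee.
calls = nx.DiGraph()
calls.add_edges_from([
    ("main", "parse"),
    ("main", "render"),
    ("parse", "tokenize"),
    ("render", "tokenize"),
])

# Standard graph algorithms come for free, e.g. topological order...
order = list(nx.topological_sort(calls))
assert order.index("main") < order.index("tokenize")

# ...or reachability: everything transitively called from `main`.
assert nx.descendants(calls, "main") == {"parse", "render", "tokenize"}
```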
-
- https://books.google.com.au/books/about/Software_Similarity_and_Classification.html?id=Fy_mNhg2lK4C
- https://link.springer.com/book/10.1007/978-1-4471-2909-7
-
Software Similarity and Classification
-
Authors: Silvio Cesare, Yang Xiang
-
Number of Pages: XIV, 88
-
The first book to construct a theory to describe the problems in software similarity and classification
-
Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing those features into suitable representations, and constructing distance metrics to define similarity and dissimilarity are the key methods to identify software variants, clones, derivatives, and classes of software. Software Similarity and Classification reviews the literature of those core concepts, in addition to relevant literature in each application and demonstrates that considering these applied problems as a similarity and classification problem enables techniques to be shared between areas. Additionally, the authors present in-depth case studies using the software similarity and classification techniques developed throughout the book.
-
Includes supplementary material: https://extras.springer.com/?query=978-1-4471-2908-0
- 1 zip file containing 3 PDFs:
- Table of Contents (6 pages)
-
- 1 Introduction (6 pages)
- 1.1 Background
- 1.2 Applications of Software Similarity and Classification
- 1.3 Motivation
- 1.4 Problem Formulization
- 1.5 Problem Overview
- 1.6 Aims and Scope
- 1.7 Book Organization
- References
- 2 Taxonomy of Program Features (10 pages)
- 2.1 Syntactic Features
- 2.1.1 Raw Code
- 2.1.2 Abstract Syntax Trees
- 2.1.3 Variables
- 2.1.4 Pointers
- 2.1.5 Instructions
- 2.1.6 Basic Blocks
- 2.1.7 Procedures
- 2.1.8 Control Flow Graphs
- 2.1.9 Call Graphs
- 2.1.10 Object Inheritances and Dependencies
- 2.2 Semantic Features
- 2.2.1 API Calls
- 2.2.2 Data Flow
- 2.2.3 Procedure Dependence Graphs
- 2.2.4 System Dependence Graph
- 2.3 Taxonomy of Features in Program Binaries
- 2.3.1 Object File Formats
- 2.3.2 Headers
- 2.3.3 Object Code
- 2.3.4 Symbols
- 2.3.5 Debugging Information
- 2.3.6 Relocations
- 2.3.7 Dynamic Linking Information
- 2.4 Case Studies
- 2.4.1 Portable Executable
- 2.4.2 Executable and Linking Format
- 2.4.3 Java Class File
- References
- 3 Program Transformations and Obfuscations (10 pages)
- 3.1 Compiler Optimization and Recompilation
- 3.1.1 Instruction Reordering
- 3.1.2 Loop Invariant Code Motion
- 3.1.3 Code Fusion
- 3.1.4 Function Inlining
- 3.1.5 Loop Unrolling
- 3.1.6 Branch/Loop Inversion
- 3.1.7 Strength Reduction
- 3.1.8 Algebraic Identities
- 3.1.9 Register Reassignment
- 3.2 Program Obfuscation
- 3.3 Plagiarism, Software Theft, and Derivative Works
- 3.3.1 Semantic Changes
- 3.3.2 Code Insertion
- 3.3.3 Code Deletion
- 3.3.4 Code Substitution
- 3.3.5 Code Transposition
- 3.4 Malware Packing, Polymorphism, and Metamorphism
- 3.4.1 Dead Code Insertion
- 3.4.2 Instruction Substitution
- 3.4.3 Variable Renaming
- 3.4.4 Code Reordering
- 3.4.5 Branch Obfuscation
- 3.4.6 Branch Inversion and Flipping
- 3.4.7 Opaque Predicate Insertion
- 3.4.8 Malware Obfuscation Using Code Packing
- 3.4.9 Traditional Code Packing
- 3.4.10 Shifting Decode Frame
- 3.4.11 Instruction Virtualization and Malware Emulators
- 3.5 Features under Program Transformations
- References
- 4 Formal Methods of Program Analysis (12 pages)
- 4.1 Static Feature Extraction
- 4.2 Formal Syntax and Lexical Analysis
- 4.3 Parsing
- 4.4 Intermediate Representations
- 4.4.1 Intermediate Code Generation
- 4.4.2 Abstract Machines
- 4.4.3 Basic Blocks
- 4.4.4 Control Flow Graph
- 4.4.5 Call Graph
- 4.5 Formal Semantics of Programming Languages
- 4.5.1 Operational Semantics
- 4.5.2 Denotational Semantics
- 4.5.3 Axiomatic Semantics
- 4.6 Theorem Proving
- 4.6.1 Hoare Logic
- 4.6.2 Predicate Transformer Semantics
- 4.6.3 Symbolic Execution
- 4.7 Model Checking
- 4.8 Data Flow Analysis
- 4.8.1 Partially Ordered Sets
- 4.8.2 Lattices
- 4.8.3 Monotone Functions and Fixed Points
- 4.8.4 Fixed Point Solutions to Monotone Functions
- 4.8.5 Dataflow Equations
- 4.8.6 Dataflow Analysis Examples
- 4.8.7 Reaching Definitions
- 4.8.8 Live Variables
- 4.8.9 Available Expressions
- 4.8.10 Very Busy Expressions
- 4.8.11 Classification of Dataflow Analyses
- 4.9 Abstract Interpretation
- 4.9.1 Widening and Narrowing
- 4.10 Intermediate Code Optimisation
- 4.11 Research Opportunities
- References
- 5 Static Analysis of Binaries (8 pages)
- 5.1 Disassembly
- 5.2 Intermediate Code Generation
- 5.3 Procedure Identification
- 5.4 Procedure Disassembly
- 5.5 Control Flow Analysis, Deobfuscation and Reconstruction
- 5.6 Pointer Analysis
- 5.7 Decompilation of Binaries
- 5.7.1 Condition Code Elimination
- 5.7.2 Stack Variable Reconstruction
- 5.7.3 Preserved Register Detection
- 5.7.4 Procedure Parameter Reconstruction
- 5.7.5 Reconstruction of Structured Control Flow
- 5.7.6 Type Reconstruction
- 5.8 Obfuscation and Limits to Static Analysis
- 5.9 Research Opportunities
- References
- 6 Dynamic Analysis (6 pages)
- 6.1 Relationship to Static Analysis
- 6.2 Environments
- 6.3 Debugging
- 6.4 Hooking
- 6.5 Dynamic Binary Instrumentation
- 6.6 Virtualization
- 6.7 Application Level Emulation
- 6.8 Whole System Emulation
- References
- 7 Feature Extraction (4 pages)
- 7.1 Processing Program Features
- 7.2 Strings
- 7.3 Vectors
- 7.4 Sets
- 7.5 Sets of Vectors
- 7.6 Trees
- 7.7 Graphs
- 7.8 Embeddings
- 7.9 Kernels
- 7.10 Research Opportunities
- References
- 8 Software Birthmark Similarity (8 pages)
- 8.1 Distance Metrics
- 8.2 String Similarity
- 8.2.1 Levenshtein Distance
- 8.2.2 Smith-Waterman Algorithm
- 8.2.3 Longest Common Subsequence (LCS)
- 8.2.4 Normalized Compression Distance
- 8.3 Vector Similarity
- 8.3.1 Euclidean Distance
- 8.3.2 Manhattan Distance
- 8.3.3 Cosine Similarity
- 8.4 Set Similarity
- 8.4.1 Dice Coefficient
- 8.4.2 Jaccard Index
- 8.4.3 Jaccard Distance
- 8.4.4 Containment
- 8.4.5 Overlap Coefficient
- 8.4.6 Tversky Index
- 8.5 Set of Vectors Similarity
- 8.6 Tree Similarity
- 8.7 Graph Similarity
- 8.7.1 Graph Isomorphism
- 8.7.2 Graph Edit Distance
- 8.7.3 Maximum Common Subgraph
- References
- 9 Software Similarity Searching and Classification (6 pages)
- 9.1 Instance-Based Learning and Nearest Neighbour
- 9.1.1 k Nearest Neighbours Query
- 9.1.2 Range Query
- 9.1.3 Metric Trees
- 9.1.4 Locality Sensitive Hashing
- 9.1.5 Distributed Similarity Search
- 9.2 Statistical Machine Learning
- 9.2.1 Vector Space Models
- 9.2.2 Kernel Methods
- 9.3 Research Opportunities
- References
- 10 Applications (6 pages)
- 10.1 Malware Classification
- 10.1.1 Raw Code
- 10.1.2 Instructions
- 10.1.3 Basic Blocks
- 10.1.4 API Calls
- 10.1.5 Control Flow and Data Flow
- 10.1.6 Data Flow
- 10.1.7 Call Graph
- 10.1.8 Control Flow Graphs
- 10.2 Software Theft Detection (Static Approaches)
- 10.2.1 Instructions
- 10.2.2 Control Flow
- 10.2.3 API Calls
- 10.2.4 Object Dependencies
- 10.3 Software Theft Detection (Dynamic Approaches)
- 10.3.1 Instructions
- 10.3.2 Control Flow
- 10.3.3 API Calls
- 10.3.4 Dependence Graphs
- 10.4 Plagiarism Detection
- 10.4.1 Raw Code and Tokens
- 10.4.2 Parse Trees
- 10.4.3 Program Dependency Graph
- 10.5 Code Clone Detection
- 10.5.1 Raw Code and Tokens
- 10.5.2 Abstract Syntax Tree
- 10.5.3 Program Dependency Graph
- 10.6 Critical Analysis
- References
- 11 Future Trends and Conclusion
- 11.1 Future Trends
- 11.2 Conclusion
- Preface (1 page)
- Chapter 2: Taxonomy of Program Features (10 pages)
-
- https://binary.ninja/2022/06/20/introducing-tanto.html#potential-uses-and-some-speculation
-
What I’ve found most interesting, and have been speculating about, is using variable slices like these (though not directly through the UI) in the function fingerprinting space. I’ve long suspected that a dataflow-based approach to fingerprinting might prove to be robust against compiler optimizations and versions, as well as source code changes that don’t completely redefine the implementation of a function. Treating each variable slice as a record of what happens to data within a function, a similarity score for two slices could be generated from the count of matching operations, matching constant interactions (`2 + var_a`), and matching variable interactions (`var_f + var_a`). Considering all slices, a confidence metric could be derived for whether two functions match. Significant research would be required to answer these questions concretely… and, if you could solve subgraph isomorphism at the same time, that’d be great!
-
- https://gist.github.com/0xdevalias
- https://github.com/0xdevalias/chatgpt-source-watch : Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts.
- Deobfuscating / Unminifying Obfuscated Web App Code (0xdevalias gist)
- Reverse Engineering Webpack Apps (0xdevalias gist)
- React Server Components, Next.js v13+, and Webpack: Notes on Streaming Wire Format (`__next_f`, etc) (0xdevalias' gist)
- JavaScript Web App Reverse Engineering - Module Identification (0xdevalias' gist)
- Reverse Engineered Webpack Tailwind-Styled-Component (0xdevalias' gist)
- Bypassing Cloudflare, Akamai, etc (0xdevalias gist)
- Debugging Electron Apps (and related memory issues) (0xdevalias gist)
- devalias' Beeper CSS Hacks (0xdevalias gist)
- Reverse Engineering Golang (0xdevalias' gist)
- Reverse Engineering on macOS (0xdevalias' gist)