@vladak
Last active May 29, 2019 14:48

What is OpenGrok

  • blindingly fast source code search and cross-reference engine
    • can be used for searching and source code browsing
    • “grok” = profoundly understand
      • in our case, source code
  • the IP/project is still owned by Oracle
    • developed as open source (CDDL license)
  • searches programs written in languages with different semantics
  • written in Java/JSP
  • originally developed as a tool to perform security vulnerability research quickly
    • used as primary search engine for OpenSolaris (1st Mercurial commit from 2006)

OpenGrok capabilities

  • supports all major Source Code Management systems
    • Mercurial, CVS, Teamware, SVN, Git, ...
    • is able to view/search history and provide diffs of changes
  • and languages and formats
    • C, C++, Java, Perl, Shell, SQL, ...
    • can even look into archive (tar/jar) and compressed files
    • provides syntax highlighting
  • returns search results almost instantaneously
  • RESTful API
  • suggester

How it works

  • Indexing
    • ctags provides semantic analysis of source code
    • analyzers for given file type provide tokens and generate cross-reference data
    • Lucene* creates inverted index from the tokens
      • i.e. word → set of documents
      • there are indexers for given file type
    • indexes are created for history too
  • Searching
    • web application accesses Lucene database and returns results as HTML page

Internal data flow for a source code file

  • indexer calls analyzer guru
    • analyzer guru calls analyzer specific for given file type
  • there is a hierarchy of analyzers
    • e.g. C/C++ analyzer → plain text analyzer* → File analyzer
  • analyzer guru calls history guru
    • who calls a history reader for specific SCM
  • there is a list of history readers
  • everything returns back to the indexer

UI

  • main search form
    • concepts of project groups/projects/repositories
  • directory listing
  • file view: history/annotate/Navigate
  • full search field can be used to enter field identifiers (defs/refs/path)

Tokenization

  • search terms are tokenized
  • in practice this could mean unexpected results
  • e.g. to find all files whose name has the ".c" suffix
    • one would enter *.c in the File Path field
  • however, this finds only files with the '..c' suffix
    • or use just 'c'
  • but this also finds files with e.g. the '.c.txt' suffix
    • the reason is tokenization
    • solution: use ". c"~1
  • this means: find '.' and 'c' tokens that are within the distance of 1 word (proximity search)
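
Why a search for just 'c' also returns '.c.txt' files can be seen with a toy tokenizer in Python. The splitting rule below (alphanumeric runs plus '.' as its own token) is an assumption chosen to match the description above, not the actual OpenGrok path analyzer, and the paths are made up.

```python
import re

def path_tokens(path):
    """Split a path into lowercase alphanumeric runs, keeping '.' as its own token."""
    return re.findall(r"[a-z0-9_]+|\.", path.lower())

def matches_term(path, term):
    """A single-term query matches when the term equals any one token of the path."""
    return term.lower() in path_tokens(path)

# Hypothetical paths for illustration.
paths = ["usr/src/cmd/ls/ls.c", "docs/ls.c.txt", "usr/src/cmd/cat/cat.h"]

print(path_tokens("docs/ls.c.txt"))
hits = [p for p in paths if matches_term(p, "c")]
print(hits)  # the '.c.txt' file is included because 'c' is one of its tokens
```

The query never sees the file name as one string, only the individual tokens, so a term matches wherever that token occurs in the path.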

Search

  • file search: searches path components
  • All search fields allow the use of boolean operators
    • e.g. search for directories/files named 'aes' outside of the uts tree
    • use 'aes && !uts' (or 'aes AND -uts') as the search term in the File Path field
  • Search terms can be grouped in phrases
    • e.g. performing full search for 'type cast' will return files containing the two words anywhere in the file
    • using '"type cast"' will search for the phrase, i.e. the two words appearing next to each other
  • but: this is not an exact match! This search will match both "type cast" and "function(..., type, cast)"
  • Wildcards
    • Full search for 'foo' will find things like "/foo/bar/baz/" but not “index("foofoofoo", ...)”
  • this is because of tokenization
  • use 'foo*' to get both
    • Full search for 'foo???' will get “foobar”, “footer”, “fooled”, ...
  • Escaping special characters
    • needed to search for characters which overlap with special characters for constructing search terms
    • e.g. to search for 'mp && bp' in the code use '"mp && bp"'
  • some characters are not indexed at all
    • e.g. '“' - depends on analyzer
  • Symbol reference search does not find single-character symbols
    • i.e. does not find references to variables such as 'i'
    • this is done deliberately to keep the performance high and limit ambiguity of the results
  • Case (in)sensitivity
    • Full/File/History search is case insensitive
      • e.g. search for 'license' will find 'license', 'License', 'LICENSE'
    • Definition/Symbol search is case sensitive
      • e.g. search for 'MAIN' will find only definitions of MAIN, not main or Main and the like
  • multi-project search

Search tricks

  • Proximity search can reveal how APIs are used
    • refs:"chroot chdir"~50
  • function calls to chroot() and chdir() within 50 tokens
  • full:"void strcpy"~1
  • explicitly ignored return values from strcpy()
    • full:strcpy AND NOT "void strcpy"~1 AND NOT "= strcpy"~1 path:". c"~1
  • query for unchecked return value
    • catches things like: (void) strcat(strcat(strcpy(shadow_dest, ...
  • Binaries (ELF) are indexed
    • Allows for searching symbols (functions) being defined/referenced
      • e.g. +full:rsa_public_encrypt +path:". o"~1
    • Handy for examination of the contents of install ISO image
      • Flatten out (unzip/untar/...), index, search
  • Searching for potentially insecure patterns
    • e.g. /tmp/*.$$ in shell scripts
      • +full:"tmp $ $"~20 +(path:". sh"~1 path:". ksh"~1)

Operations

  • sizing (RAM, zpool), setup (ZFS, Apache w/ mod_proxy, authorization, periodic mirror+reindex)
  • from scratch reindex
  • Python tools, mirroring

Dependencies

  • compile time
    • Lucene
    • Maven (+ant)
    • JFlex
    • Apache bcel (Java class analysis)
    • Jersey/Jackson
    • Chronicle map
    • jrcs, cron utils, ...
  • run time
    • SCMs (optional)
    • Universal ctags
    • Python (optional)
  • testing: JUnit, mockito, Sonar, ...

Builds

  • CI engines: Travis (Linux, macOS), Wercker (Linux), AppVeyor (Windows)
  • triggered on commit / pull request
  • checkers: Sonar, Jacoco/Coveralls, style check

Release process

  • flexible release schedule (continuous development)
  • Docker image creation

Contributing

  • everything done on GitHub: https://github.com/oracle/opengrok/pulse
  • external contributors contribute code under the Oracle Contributor Agreement
    • issues open
    • wikis closed
  • contacts/discussions: Yahoo MLs, Slack
  • pull requests merged on a case-by-case basis
    • squash/rebase