What is OpenGrok
- blindingly fast source code search and cross-reference engine
- can be used for searching and browsing source code
- “grok” = profoundly understand
- in our case, source code
- the IP/project is still owned by Oracle
- developed as open source (CDDL license)
- searches computer programs written in languages with different semantics
- written in Java/JSP
- originally developed as a tool to perform security vulnerability research quickly
- used as primary search engine for OpenSolaris (1st Mercurial commit from 2006)
OpenGrok capabilities
- supports all major Source Code Management systems
- Mercurial, CVS, Teamware, SVN, Git, ...
- is able to view/search history and provide diffs of changes
- supports many programming languages and file formats
- C, C++, Java, Perl, Shell, SQL, ...
- can even look into archive (tar/jar) and compressed files
- provides syntax highlighting
- returns search results almost instantaneously
- RESTful API
- suggester
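The RESTful API can be driven from any HTTP client. A minimal Python sketch follows; the /api/v1/search endpoint and its full/maxresults parameters match recent OpenGrok releases, but the base URL is an assumption — adjust for your deployment:

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import json

def search_url(base, **params):
    # build a query URL for OpenGrok's REST search endpoint
    # (/api/v1/search in recent releases; check your server's version)
    return f"{base}/api/v1/search?{urlencode(params)}"

# hypothetical local deployment under the /source context path
url = search_url("http://localhost:8080/source", full="strcpy", maxresults=10)
print(url)

# fetching requires a running OpenGrok instance:
# results = json.loads(urlopen(url).read())
```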
How it works
- Indexing
- ctags provides semantic analysis of source code
- analyzers for given file type provide tokens and generate cross-reference data
- Lucene creates an inverted index from the tokens
- i.e. word → set of documents
- there are indexers for each file type
- indexes are created for history too
- Searching
- web application accesses Lucene database and returns results as HTML page
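The core inverted-index idea (word → set of documents) fits in a few lines of Python; this is a toy model, not Lucene's actual data structure:

```python
from collections import defaultdict

def build_index(docs):
    # inverted index: token -> set of document ids containing that token
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index

docs = {
    "main.c": "int main void return",
    "util.c": "void helper return",
}
index = build_index(docs)
print(sorted(index["void"]))   # ['main.c', 'util.c']
print(sorted(index["main"]))   # ['main.c']
```

A query term then resolves to a document set in one lookup, which is why results come back almost instantaneously.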
Internal data flow for a source code file
- indexer calls analyzer guru
- analyzer guru calls analyzer specific for given file type
- there is a hierarchy of analyzers
- e.g. C/C++ analyzer → plain text analyzer → file analyzer
- analyzer guru calls history guru
- which calls a history reader for the specific SCM
- there is a list of history readers
- everything returns back to the indexer
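The dispatch and hierarchy above can be sketched as follows; the class and function names are illustrative, not OpenGrok's actual API:

```python
# Illustrative sketch of the analyzer hierarchy: each level inherits
# from a more generic one, as in C/C++ -> plain text -> file analyzer.
class FileAnalyzer:                     # most generic: raw tokens only
    def analyze(self, content):
        return {"tokens": content.split()}

class PlainTextAnalyzer(FileAnalyzer):  # adds cross-reference output
    def analyze(self, content):
        data = super().analyze(content)
        data["xref"] = True
        return data

class CAnalyzer(PlainTextAnalyzer):     # adds language-specific handling
    def analyze(self, content):
        data = super().analyze(content)
        data["language"] = "C"
        return data

ANALYZERS = {".c": CAnalyzer, ".h": CAnalyzer, ".txt": PlainTextAnalyzer}

def analyzer_for(filename):
    # the "analyzer guru": pick the most specific analyzer for the
    # file type, falling back to the generic FileAnalyzer
    for suffix, cls in ANALYZERS.items():
        if filename.endswith(suffix):
            return cls()
    return FileAnalyzer()

print(type(analyzer_for("main.c")).__name__)   # CAnalyzer
print(type(analyzer_for("a.out")).__name__)    # FileAnalyzer
```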
UI
- main search form
- concepts of project groups/projects/repositories
- directory listing
- file view: History/Annotate/Navigate
- the full search field can be used to enter field identifiers (defs/refs/path)
Tokenization
- search terms are tokenized
- in practice this could mean unexpected results
- e.g. to find all files with file name having “.c” suffix
- one would enter *.c to the File Path field
- however, this does not return the expected set of files
- or use just 'c'
- but this will also produce files with e.g. a '.c.txt' suffix
- the reason is tokenization
- solution: use ". c"~1
- this means: find '.' and 'c' tokens that are within the distance of 1 word (proximity search)
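The effect of tokenization can be modeled in a few lines of Python; this is a deliberate simplification (OpenGrok's real tokenizers are generated by JFlex), but it shows why the proximity query works:

```python
import re

def tokenize(path):
    # simplified model of path tokenization: words and '.' become
    # separate tokens, other separators are dropped
    return re.findall(r"[A-Za-z0-9_]+|\.", path)

def phrase_within(tokens, first, second, slop):
    # naive model of a Lucene proximity query "first second"~slop:
    # the second token must appear within `slop` positions of the first
    for i, tok in enumerate(tokens):
        if tok == first and second in tokens[i + 1:i + 2 + slop]:
            return True
    return False

toks = tokenize("usr/src/foo.c")                        # ['usr', 'src', 'foo', '.', 'c']
print(phrase_within(toks, ".", "c", 1))                 # True
print(phrase_within(tokenize("README"), ".", "c", 1))   # False
```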
Search
- file search: searches path components
- All search fields allow the use of boolean operators
- e.g. search for directories/files named 'aes' outside of the uts tree
- use 'aes && !uts' (or 'aes AND -uts') in the File Path field
- Search terms can be grouped in phrases
- e.g. performing full search for 'type cast' will return files containing the two words anywhere in the file
- using '"type cast"' will search for the phrase, i.e. the two words appearing next to each other
- note: this is not an exact match! It will match both "type cast" and "function(..., type, cast)"
- Wildcards
- Full search for 'foo' will find things like '/foo/bar/baz/' but not 'index("foofoofoo", ...)'
- this is because of tokenization
- use 'foo*' to get both
- Full search for 'foo???' will get “foobar”, “footer”, “fooled”, ...
- Escaping special characters
- needed to search for characters which overlap with special characters for constructing search terms
- e.g. to search for 'mp && bp' in the code, use '"mp && bp"'
- some characters are not indexed at all
- e.g. '"' - depends on the analyzer
- Symbol reference search does not find 1 character symbols
- i.e. does not find references to variables such as 'i'
- this is done deliberately to keep the performance high and limit ambiguity of the results
- Case (in)sensitivity
- Full/File/History search is case insensitive
- e.g. search for 'license' will find 'license', 'License', 'LICENSE'
- Definition/Symbol search is case sensitive
- e.g. search for 'MAIN' will find only definitions of MAIN, not main or Main and the like
- multi-project search
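The phrase behavior noted above ('type cast' matching both the literal phrase and 'function(..., type, cast)') follows from punctuation being dropped at index time. A small model of this, not OpenGrok's actual analyzer:

```python
import re

def tokens(text):
    # punctuation is not indexed; only word tokens survive
    return re.findall(r"[A-Za-z0-9_]+", text)

def phrase_match(text, phrase):
    # a phrase query matches when its tokens appear consecutively
    toks, p = tokens(text), phrase.split()
    return any(toks[i:i + len(p)] == p for i in range(len(toks)))

print(phrase_match("an explicit type cast", "type cast"))     # True
print(phrase_match("function(x, type, cast)", "type cast"))   # True
print(phrase_match("cast to a type", "type cast"))            # False
```

Once the parentheses and commas are stripped, 'type' and 'cast' are adjacent in both inputs, so the phrase query cannot tell them apart.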
Search tricks
- Proximity search can reveal how APIs are used
- refs:"chroot chdir"~50
- function calls to chroot() and chdir() within 50 tokens
- full:"void strcpy"~1
- explicitly ignored return values from strcpy()
- full:strcpy AND NOT "void strcpy"~1 AND NOT "= strcpy"~1 path:". c"~1
- query for unchecked return value
- catches things like:
(void) strcat(strcat(strcpy(shadow_dest, ...
- Binaries (ELF) are indexed
- Allows for searching symbols (functions) being defined/referenced
- e.g. +full:rsa_public_encrypt +path:". O"~1
- Handy for examining the contents of an install ISO image
- Flatten out (unzip/untar/...), index, search
- Searching for potentially insecure patterns
- e.g. /tmp/*.$$ in shell scripts
- +full:"tmp $ $"~20 +(path:". sh"~1 path:". ksh"~1)
Operations
- sizing (RAM, zpool), setup (ZFS, Apache w/ mod_proxy, authorization, periodic mirror+reindex)
- reindex from scratch
- Python tools, mirroring
Dependencies
- compile time
- Lucene
- Maven (+ Ant)
- JFlex
- Apache BCEL (Java class analysis)
- Jersey/Jackson
- Chronicle Map
- JRCS, cron-utils, ...
- run time
- SCMs (optional)
- Universal ctags
- Python (optional)
- testing: JUnit, Mockito, Sonar, ...
Builds
- CI engines: Travis (Linux, macOS), Wercker (Linux), AppVeyor (Windows)
- triggered on commit / pull request
- checkers: Sonar, Jacoco/Coveralls, style check
Release process
- flexible release schedule (continuous development)
- Docker image creation
Contributing
- everything done on GitHub: https://github.com/oracle/opengrok
- external contributors must sign the Oracle Contributor Agreement
- issues open
- wikis closed
- contacts/discussions: Yahoo mailing lists, Slack
- pull requests merged on case by case basis
- squash/rebase