What is OpenGrok
- blindingly fast source code search and cross-reference engine
- can be used for searching and browsing source code
- “grok” = profoundly understand
- in our case, source code
- the IP/project is still owned by Oracle
- developed as open source (CDDL license)
- searches computer programs written in languages with different semantics
- written in Java/JSP
- originally developed as a tool to perform security vulnerability research quickly
- used as primary search engine for OpenSolaris (1st Mercurial commit from 2006)
OpenGrok capabilities
- supports all major Source Code Management systems
- Mercurial, CVS, Teamware, SVN, Git, ...
- is able to view/search history and provide diffs of changes
- supports many programming languages and file formats
- C, C++, Java, Perl, Shell, SQL, ...
- can even look into archive (tar/jar) and compressed files
- provides syntax highlighting
- returns search results almost instantaneously
- RESTful API
- suggester
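The RESTful API can be driven from any HTTP client. A minimal Python sketch follows; the /api/v1/search endpoint and its full/maxresults parameters match recent OpenGrok releases, but the base URL is an assumption — adjust for your deployment:

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import json

def search_url(base, **params):
    # build a query URL for OpenGrok's REST search endpoint
    # (/api/v1/search in recent releases; check your server's version)
    return f"{base}/api/v1/search?{urlencode(params)}"

# hypothetical local deployment under the /source context path
url = search_url("http://localhost:8080/source", full="strcpy", maxresults=10)
print(url)

# fetching requires a running OpenGrok instance:
# results = json.loads(urlopen(url).read())
```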
How it works
- Indexing
- ctags provides semantic analysis of source code
- analyzers for given file type provide tokens and generate cross-reference data
- Lucene creates an inverted index from the tokens
- i.e. word → set of documents
- there are indexers for each file type
- indexes are created for history too
- Searching
- web application accesses Lucene database and returns results as HTML page
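The core inverted-index idea (word → set of documents) fits in a few lines of Python; this is a toy model, not Lucene's actual data structure:

```python
from collections import defaultdict

def build_index(docs):
    # inverted index: token -> set of document ids containing that token
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index

docs = {
    "main.c": "int main void return",
    "util.c": "void helper return",
}
index = build_index(docs)
print(sorted(index["void"]))   # ['main.c', 'util.c']
print(sorted(index["main"]))   # ['main.c']
```

A query term then resolves to a document set in one lookup, which is why results come back almost instantaneously.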
Internal data flow for a source code file
- indexer calls analyzer guru
- analyzer guru calls analyzer specific for given file type
- there is a hierarchy of analyzers
- e.g. C/C++ analyzer → plain text analyzer → file analyzer
- analyzer guru calls history guru
- which calls a history reader for the specific SCM
- there is a list of history readers
- everything returns back to the indexer
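The dispatch and hierarchy above can be sketched as follows; the class and function names are illustrative, not OpenGrok's actual API:

```python
# Illustrative sketch of the analyzer hierarchy: each level inherits
# from a more generic one, as in C/C++ -> plain text -> file analyzer.
class FileAnalyzer:                     # most generic: raw tokens only
    def analyze(self, content):
        return {"tokens": content.split()}

class PlainTextAnalyzer(FileAnalyzer):  # adds cross-reference output
    def analyze(self, content):
        data = super().analyze(content)
        data["xref"] = True
        return data

class CAnalyzer(PlainTextAnalyzer):     # adds language-specific handling
    def analyze(self, content):
        data = super().analyze(content)
        data["language"] = "C"
        return data

ANALYZERS = {".c": CAnalyzer, ".h": CAnalyzer, ".txt": PlainTextAnalyzer}

def analyzer_for(filename):
    # the "analyzer guru": pick the most specific analyzer for the
    # file type, falling back to the generic FileAnalyzer
    for suffix, cls in ANALYZERS.items():
        if filename.endswith(suffix):
            return cls()
    return FileAnalyzer()

print(type(analyzer_for("main.c")).__name__)   # CAnalyzer
print(type(analyzer_for("a.out")).__name__)    # FileAnalyzer
```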
UI
- main search form
- concepts of project groups/projects/repositories
- directory listing
- file view: History/Annotate/Navigate
- the full search field can be used to enter field identifiers (defs/refs/path)
Tokenization
- search terms are tokenized
- in practice this could mean unexpected results
- e.g. to find all files with file name having “.c” suffix
- one would enter *.c to the File Path field
- however, this does not return the expected set of files
- or use just 'c'
- but this will also produce files with e.g. a '.c.txt' suffix
- the reason is tokenization
- solution: use ". c"~1
- this means: find '.' and 'c' tokens that are within the distance of 1 word (proximity search)
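The effect of tokenization can be modeled in a few lines of Python; this is a deliberate simplification (OpenGrok's real tokenizers are generated by JFlex), but it shows why the proximity query works:

```python
import re

def tokenize(path):
    # simplified model of path tokenization: words and '.' become
    # separate tokens, other separators are dropped
    return re.findall(r"[A-Za-z0-9_]+|\.", path)

def phrase_within(tokens, first, second, slop):
    # naive model of a Lucene proximity query "first second"~slop:
    # the second token must appear within `slop` positions of the first
    for i, tok in enumerate(tokens):
        if tok == first and second in tokens[i + 1:i + 2 + slop]:
            return True
    return False

toks = tokenize("usr/src/foo.c")                        # ['usr', 'src', 'foo', '.', 'c']
print(phrase_within(toks, ".", "c", 1))                 # True
print(phrase_within(tokenize("README"), ".", "c", 1))   # False
```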
Search
- file search: searches path components
- All search fields allow the use of boolean operators
- e.g. search for directories/files named 'aes' outside of the uts tree
- use 'aes && !uts' (or 'aes AND -uts') in the File Path field
- Search terms can be grouped in phrases
- e.g. performing full search for 'type cast' will return files containing the two words anywhere in the file
- using '"type cast"' will search for the phrase, i.e. the two words appearing next to each other
- note: this is not an exact match! It will match both "type cast" and "function(..., type, cast)"
- Wildcards
- Full search for 'foo' will find things like '/foo/bar/baz/' but not 'index("foofoofoo", ...)'
- this is because of tokenization
- use 'foo*' to get both
- Full search for 'foo???' will get “foobar”, “footer”, “fooled”, ...
- Escaping special characters
- needed to search for characters which overlap with special characters for constructing search terms
- e.g. to search for 'mp && bp' in the code, use '"mp && bp"'
- some characters are not indexed at all
- e.g. '"' - depends on the analyzer
- Symbol reference search does not find 1 character symbols
- i.e. does not find references to variables such as 'i'
- this is done deliberately to keep the performance high and limit ambiguity of the results
- Case (in)sensitivity
- Full/File/History search is case insensitive
- e.g. search for 'license' will find 'license', 'License', 'LICENSE'
- Definition/Symbol search is case sensitive
- e.g. search for 'MAIN' will find only definitions of MAIN, not main or Main and the like
- multi-project search
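The phrase behavior noted above ('type cast' matching both the literal phrase and 'function(..., type, cast)') follows from punctuation being dropped at index time. A small model of this, not OpenGrok's actual analyzer:

```python
import re

def tokens(text):
    # punctuation is not indexed; only word tokens survive
    return re.findall(r"[A-Za-z0-9_]+", text)

def phrase_match(text, phrase):
    # a phrase query matches when its tokens appear consecutively
    toks, p = tokens(text), phrase.split()
    return any(toks[i:i + len(p)] == p for i in range(len(toks)))

print(phrase_match("an explicit type cast", "type cast"))     # True
print(phrase_match("function(x, type, cast)", "type cast"))   # True
print(phrase_match("cast to a type", "type cast"))            # False
```

Once the parentheses and commas are stripped, 'type' and 'cast' are adjacent in both inputs, so the phrase query cannot tell them apart.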
Search tricks
- Proximity search can reveal how APIs are used
- refs:"chroot chdir"~50
- function calls to chroot() and chdir() within 50 tokens
- full:"void strcpy"~1
- explicitly ignored return values from strcpy()
- full:strcpy AND NOT "void strcpy"~1 AND NOT "= strcpy"~1 path:". c"~1
- query for unchecked return value
- catches things like:
(void) strcat(strcat(strcpy(shadow_dest, ...
- Binaries (ELF) are indexed
- Allows for searching symbols (functions) being defined/referenced
- e.g. +full:rsa_public_encrypt +path:". O"~1
- Handy for examining the contents of an install ISO image
- Flatten out (unzip/untar/...), index, search
- Searching for potentially insecure patterns
- e.g. /tmp/*.$$ in shell scripts
- +full:"tmp $ $"~20 +(path:". sh"~1 path:". ksh"~1)
Operations
- sizing (RAM, zpool), setup (ZFS, Apache w/ mod_proxy, authorization, periodic mirror+reindex)
- reindex from scratch
- Python tools, mirroring
Dependencies
- compile time
- Lucene
- Maven (+ Ant)
- JFlex
- Apache BCEL (Java class analysis)
- Jersey/Jackson
- Chronicle Map
- JRCS, cron-utils, ...
- run time
- SCMs (optional)
- Universal ctags
- Python (optional)
- testing: JUnit, Mockito, Sonar, ...
Builds
- CI engines: Travis (Linux, macOS), Wercker (Linux), AppVeyor (Windows)
- triggered on commit / pull request
- checkers: Sonar, Jacoco/Coveralls, style check
Release process
- flexible release schedule (continuous development)
- Docker image creation
Contributing
- everything done on GitHub: https://github.com/oracle/opengrok
- external contributors must sign the Oracle Contributor Agreement
- issues open
- wikis closed
- contacts/discussions: Yahoo mailing lists, Slack
- pull requests merged on case by case basis
- squash/rebase