## Breaking Down the Lucene Analysis Process
The Lucene analysis process is powerful, but most of us only know enough of the basics to put together a simple analyzer chain. Search isn't always plug-and-play, and the ability to manipulate and compose tokenizers and token filters can be the differentiator in developing your search product.
Using visualizations of the analysis chain, I will break the Lucene analysis process down into its most basic parts: char filters, tokenizers, and token filters. I'll show how differences in the composition of the token filters affect the final output. We'll also see that tokens are more than just a stream: with synonyms and generated word parts, they can become a token graph.
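As a rough sketch of the kind of chain the talk breaks down, the snippet below composes a tokenizer with a lowercase filter and a SynonymGraphFilter, then prints the resulting tokens. It is illustrative only and not taken from the talk: the field name, the ct/"computed tomography" synonym pair, and the sample text are made up, and the imports assume a recent Lucene release where these classes live in lucene-core and lucene-analysis-common.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;

public class SynonymGraphDemo {

    // A minimal chain: StandardTokenizer -> LowerCaseFilter -> SynonymGraphFilter.
    static Analyzer buildAnalyzer(SynonymMap synonyms) {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(tokenizer);
                stream = new SynonymGraphFilter(stream, synonyms, true); // true = ignore case
                return new TokenStreamComponents(tokenizer, stream);
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // Map the single token "ct" to the multi-token synonym "computed tomography";
        // the multi-word expansion is what turns the flat stream into a graph.
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("ct"),
                SynonymMap.Builder.join(new String[] {"computed", "tomography"}, new CharsRefBuilder()),
                true); // keep the original token alongside the synonym

        Analyzer analyzer = buildAnalyzer(builder.build());

        try (TokenStream stream = analyzer.tokenStream("body", "chest CT without contrast")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            PositionLengthAttribute posLen = stream.addAttribute(PositionLengthAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // A position length greater than 1 marks a token that spans
                // multiple positions, i.e. the stream has become a graph.
                System.out.println(term + " (posLength=" + posLen.getPositionLength() + ")");
            }
            stream.end();
        }
    }
}
```

Running this on text containing "CT" emits both the original token and the multi-token synonym, with PositionLengthAttribute showing where the stream is a graph rather than a flat sequence.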
## Reviewer Comments
I've been working directly with Lucene for the past year, implementing Softek's proprietary ranking algorithm for searching radiology documents. In the process, I've submitted patches to and extended core Lucene and Solr code. I've implemented our own query parser extension and token filters, with a focus on payload support. I recently gave a two-hour presentation on advanced Lucene and Solr concepts at KCDC. In that talk, I focused on the indexing and analysis process, as well as the querying process. This proposal is based largely on the analysis portion of the KCDC talk, reduced to fit into the 40-minute time window.
What are the most basic parts of the analysis process? Is it just tokenizing and token filters? Maybe you should list them.
-- good idea. I've done that.