Uses the JSON output of a Solr field analysis to visualize the Lucene indexing/analysis pipeline.
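For example, the JSON can be fetched from Solr's field analysis request handler. A minimal sketch, assuming a local Solr with the stock /analysis/field handler; the core name collection1 and field type text_general are placeholders:
#fetch the analysis JSON that drives the visualization
curl "http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=The+Quick+Brown+Fox&wt=json&indent=true"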
# Keep 4 most recent files/sub-directories
# source: http://superuser.com/a/260332
# List newest first by modification time, skip the first four, remove the rest.
# Assumes names without spaces or newlines, since xargs splits on whitespace.
ls -1t | tail -n +5 | xargs rm -rf
#setting up a machine to build deb packages
#http://lenni.info/blog/2012/05/installing-ruby-1-9-3-on-ubuntu-12-04-precise-pengolin/
sudo apt-get update
sudo apt-get install ruby1.9.1 ruby1.9.1-dev \
    rubygems1.9.1 irb1.9.1 ri1.9.1 rdoc1.9.1 \
    build-essential libopenssl-ruby1.9.1 libssl-dev zlib1g-dev
#point the ruby alternative (and its man page) at the 1.9.1 binaries
sudo update-alternatives --install /usr/bin/ruby ruby /usr/bin/ruby1.9.1 400 \
    --slave /usr/share/man/man1/ruby.1.gz ruby.1.gz \
            /usr/share/man/man1/ruby1.9.1.1.gz
##Breaking Down the Lucene Analysis Process
The Lucene analysis process is very powerful, but most of us know only enough of the basics to put together a simple analyzer chain. Search isn't always plug-and-play, and the ability to manipulate and compose tokenizers and token filters will be the differentiator in developing your search product.
Using visualizations of the analysis chain, I will break the Lucene analysis process down into its most basic parts: char filters, tokenizers, and token filters. I'll show how differences in the composition of the token filters affect the final output. We'll see that tokens are more than just a stream: with synonyms and generated word parts, they can become a token graph.
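To give a taste of how composition changes the output, here is a minimal sketch; it assumes the example field types text_general and text_en_splitting from a stock Solr schema, and runs the same input through both chains so the emitted tokens can be compared (text_en_splitting includes a WordDelimiterFilter that generates word parts):
#compare two analyzer chains on the same input
for ft in text_general text_en_splitting; do
  curl -s "http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=$ft&analysis.fieldvalue=Wi-Fi+routers&wt=json&indent=true"
done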
##Reviewer Comments
I've been working directly with Lucene for the past year, implementing Softek's proprietary ranking algorithm for searching radiology documents. In the process, I've submitted patches or extended core Lucene and Solr code. I've implemented our own query parser extension and
####DevOps: Automating Your Infrastructure with Puppet
Puppet is an open source project built by PuppetLabs (http://puppetlabs.com) to automate the management of your IT infrastructure. Whether you manage a hosted environment or you run your own servers in-house, Puppet can help alleviate management headaches. Puppet lets you declaratively describe what a machine should look like, and then makes it happen (and makes sure it stays that way). This talk will go over the basics of Puppet, including how to get started, the essentials of Puppet modules, using existing modules from the Puppet Forge, and running Puppet on Windows. It will also touch on how to write a basic module.
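As a minimal sketch of that declarative style (the resource shown is illustrative, not taken from the talk), a one-line manifest can be applied directly:
#declare a file resource and let Puppet converge the machine to it
puppet apply -e 'file { "/etc/motd": ensure => file, content => "Managed by Puppet\n" }'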
####Beyond the Basics: Lucene and Solr
If you are already using Lucene and/or Solr (or even ElasticSearch), then this is the talk for you. We will go beyond the basics of these brilliant open source search platforms. There are many ways to customize Solr through the standard configuration files, and much more beyond them. Payloads open up many possibilities for customization, including the ability to tag words with part-of-speech information. There are also many ways to extend Lucene and Solr by creating your own filters, query parsers, tokenizers, token filters, and even highlighters with some simple Java code. If search is a core feature of your application, then you need to be using these advanced features to set yourself apart.
#each line in answers.txt represents a single answer.
sh pkg.sh < answers.txt
#source: http://projects.puppetlabs.com/projects/1/wiki/debian_preseed_patterns
file { "/tmp/file.preseed":
  source => 'puppet:///modules/modulename/file.preseed',
  mode   => '0600',
  backup => false,
}
package { 'packagename':
  ensure       => installed,
  responsefile => '/tmp/file.preseed',
  # make sure the preseed answers are in place before installing
  require      => File['/tmp/file.preseed'],
}
#source: http://johnleach.co.uk/words/771/puppet-dependencies-and-run-stages
exec { "apt-update":
  command => "/usr/bin/apt-get update",
}
#run apt-update before every Package resource in the catalog
Exec['apt-update'] -> Package <| |>
<!DOCTYPE html>
<meta charset="utf-8">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
<script src="http://d3js.org/d3.v2.js?2.9.1"></script>
<script src="https://raw.github.com/timrwood/moment/1.6.2/min/moment.min.js"></script>
<style>
html {
  font-family: Arial, Helvetica, sans-serif;
  font-size: 10pt;
}