@yashh
Created October 25, 2010 22:31
Using Solr to perform substring match right from start of a word
## http://www.mail-archive.com/[email protected]/msg31768.html
Hi Ahmet,
Well, after some more testing I am now convinced that you rock :)
I like the solution because it's obviously way less hacky and, more importantly, I
expect it to be a lot faster and less memory intensive, since instead of a
facet prefix or terms search, I am doing an "equality" comparison on tokens
(albeit a fair number of them, but each much smaller). I also have more
control over the ordering of the results, and I can make full use of the
stopword filter, which again should improve the sort order (for example, if I have a
stopword "ag" and a word starts with "ag", it will not be "overpowered" by tons
of strings containing "ag" as a single word). Obviously there is one limitation:
search terms longer than 20 characters will not match, but I think I can safely
ignore this case. Even with long German words, 15 letters should be enough to find what
the user is looking for. And if a word needs more characters, then it's probably
a meaningless suffix like "versicherungsgesellschaft", which just means
"insurance agency", and the user is just being stupid.
I do lose the nice numbers telling the user how often a given term matched,
which has some merit for street/city names, less so for the names of people, and
close to none for company names. There is also a minor niggle with how the data
is returned, which I discuss at the end of this email.
I am using the following in my schema.xml:
<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>
<field name="name" type="prefix_token" indexed="true" stored="true" />
<field name="firstname" type="prefix_token" indexed="true" stored="true" />
<field name="email" type="prefix_token" indexed="true" stored="true" />
<field name="city" type="prefix_token" indexed="true" stored="true" />
<field name="street" type="prefix_token" indexed="true" stored="true" />
<field name="telefon" type="prefix_token" indexed="true" stored="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
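To make the index/query asymmetry in this field type concrete, here is a rough PHP sketch of what the EdgeNGramFilterFactory settings (minGramSize="1", maxGramSize="20") amount to. The function `edgeNgrams` is a hypothetical stand-in written for illustration, not Solr code, and it works on bytes, so it is only adequate for ASCII input.

```php
<?php
// Hypothetical helper, not part of Solr: emulates the front-edge n-grams
// that an edge n-gram filter with the given bounds would emit for one token.
// Byte-based (strlen/substr), so only suitable for ASCII tokens.
function edgeNgrams(string $token, int $min = 1, int $max = 20): array
{
    $grams = array();
    $limit = min($max, strlen($token));
    for ($len = $min; $len <= $limit; $len++) {
        $grams[] = substr($token, 0, $len);
    }
    return $grams;
}

// Index side: each lowercased word is expanded into its prefixes.
$indexed = edgeNgrams('versicherungsgesellschaft');

// Query side: no n-grams, so the typed term is compared as one whole token.
var_dump(in_array('vers', $indexed, true));                      // bool(true)
var_dump(in_array('sich', $indexed, true));                      // bool(false), not a prefix
var_dump(in_array('versicherungsgesellschaft', $indexed, true)); // bool(false), longer than 20
```

The last check shows the limitation mentioned earlier: a query term longer than the largest indexed gram has nothing to be equal to.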
And finally the following in my solrconfig.xml:
<requestHandler name="auto" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="qf">name firstname email^0.5 telefon^0.5 city^0.6 street^0.6</str>
    <str name="fl">name,firstname,telefon,email,city,street</str>
  </lst>
</requestHandler>
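Because the handler supplies defType, qf, rows and fl as defaults, a request only needs to carry the typed text. A minimal sketch of building such a request URL in PHP follows; the host, port and `/solr/select` path are assumptions about a typical local installation, and `wt=json` is added only to ask for a JSON response.

```php
<?php
// Assumed base URL: adjust host, port and path to the actual Solr installation.
$base = 'http://localhost:8983/solr/select';

// Only the user input is sent; dismax, qf, rows and fl come from the
// handler defaults. Any of them could still be overridden per request.
$params = array('q' => 'mei', 'wt' => 'json');

$url = $base . '?' . http_build_query($params);
echo $url, "\n";
```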
This all works well. There is just one minor ugliness, which might still be
solvable inside Solr, but which I fixed in the PHP frontend logic. The issue is
that I obviously get all the fields for each document returned, and I need to
figure out which of them actually matched so that I can present those in the
autosuggest. Is there some Solr magic that will do this work for me?
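// Run the user's input against the "auto" request handler.
$query = new SolrQuery($searchstring);
$response = $this->solrClientAuto->query($query);
$numFound = empty($response->response->numFound) ? 0 : $response->response->numFound;
$data = array('results' => array(), 'numFound' => $numFound);
if (!empty($response->response->docs)) {
    // Use everything after the first space as the prefix (the whole string if
    // there is no space), without quotes and surrounding whitespace.
    $pos = strpos($searchstring, ' ');
    $p = trim(str_replace('"', '', substr($searchstring, $pos === false ? 0 : $pos)));
    foreach ($response->response->docs as $doc) {
        foreach ((array)$doc as $value) {
            // Keep values that start with the prefix or contain a word starting
            // with it. stripos() returns false on no match, so compare strictly.
            if (stripos($value, $p) === 0 || stripos($value, ' '.$p) !== false) {
                $data['results'][$value] = 1;
            }
        }
    }
}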
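For trying the field-picking step without a running Solr, the same idea can be sketched as a standalone function. This variant swaps the two stripos() checks for a regular expression with a word boundary (`\b`), which is a different technique from the snippet above; `matchingValues` and the sample documents are made up for illustration, and the pattern is ASCII-oriented.

```php
<?php
// Hypothetical helper: collects the field values in which some word
// starts with $prefix (case-insensitive, ASCII word boundaries).
function matchingValues(array $docs, string $prefix): array
{
    $pattern = '/\b' . preg_quote($prefix, '/') . '/i';
    $hits = array();
    foreach ($docs as $doc) {
        foreach ((array) $doc as $value) {
            if (is_string($value) && preg_match($pattern, $value) === 1) {
                $hits[$value] = 1; // keyed by value, so duplicates collapse
            }
        }
    }
    return array_keys($hits);
}

// Made-up documents shaped like the stored fields in the schema.
$docs = array(
    array('name' => 'Meier', 'city' => 'Obermeilen', 'street' => 'Im Meierhof 3'),
    array('name' => 'Huber', 'city' => 'Meilen', 'street' => 'Seestrasse 1'),
);

print_r(matchingValues($docs, 'mei'));
```

Here "Obermeilen" is skipped because "mei" occurs in the middle of a word, which mirrors how the edge n-grams only cover word starts.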
Then again, I have to review with the UI guys whether we will always just show the
name anyway and replace the entire user-entered term with the name, which
should be sufficiently unique in most cases to get a small enough result set.
regards,
Lukas