yashh · October 25, 2010 22:31
diff --git a/Solr EdgeNGramFilterFactory b/Solr EdgeNGramFilterFactory
 ## http://www.mail-archive.com/[email protected]/msg31768.html

 Hi Ahmet,

 Well, after some more testing I am now convinced that you rock :)
 I like the solution because it's obviously way less hacky and, more importantly,
 I expect it to be a lot faster and less memory intensive: instead of a facet
 prefix or terms search, I am doing an "equality" comparison on tokens (albeit a
 fair number of them, but each much smaller). I also have more control over the
 ordering of the results, and I can make full use of the stopword filter, which
 again should improve the sort order (for example, if I have a stopword "ag" and
 a word starts with "ag", it will not be "overpowered" by tons of strings
 containing "ag" as a single word). Obviously there is one limitation if people
 enter search terms longer than 20 characters (the maxGramSize below), but I
 think I can safely ignore this case. Even with long German words, 15 letters
 should be enough to find what the user is looking for. And if a word needs more
 characters, then it's probably a meaningless suffix like
 "versicherungsgesellschaft", which just means "insurance agency", and the user
 is just being stupid.
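
 To make the "equality" point concrete, here is a minimal Python sketch of what
 an edge n-gram filter emits for a single token, assuming minGramSize=1 and
 maxGramSize=20. The function name edge_ngrams is made up for illustration, and
 this is not Solr's implementation, only a model of its output.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading-edge n-grams of a token, shortest first."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Index time: every prefix of the token (up to max_gram) is stored as a term.
indexed_terms = set(edge_ngrams("versicherungsgesellschaft"))

# Query time: the user's input is not n-grammed, so a prefix search
# becomes an exact term lookup.
print("versich" in indexed_terms)
# The full word has 25 characters, so it is longer than any indexed gram.
print("versicherungsgesellschaft" in indexed_terms)
```

 The second lookup fails, which is the limitation mentioned above: input longer
 than the largest gram has no term to be equal to.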

 I do lose the nice numbers telling the user how often a given term matched, 
 which has some merit for street/city names, less so for the names of people and 
 close to none for company names. There is also a minor niggle with how the data 
 is returned which I discuss at the end of the email.

 I am using the following in my schema.xml

    <fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
      </analyzer>
    </fieldType>

    <field name="name" type="prefix_token" indexed="true" stored="true" />
    <field name="firstname" type="prefix_token" indexed="true" stored="true" />
    <field name="email" type="prefix_token" indexed="true" stored="true" />
    <field name="city" type="prefix_token" indexed="true" stored="true" />
    <field name="street" type="prefix_token" indexed="true" stored="true" />
    <field name="telefon" type="prefix_token" indexed="true" stored="true" />
    <field name="id" type="string" indexed="true" stored="true" required="true" />
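
 The important detail in this field type is that the two analyzers are not
 symmetric: only the index side ends with the edge n-gram filter. The Python
 sketch below models that asymmetry under simplifying assumptions: the
 WordDelimiterFilter step is left out, stopwords.txt is assumed to contain just
 "ag", and a hit means that any query term equals an indexed term (dismax
 scoring and minimum-match rules are ignored). All function names are
 hypothetical.

```python
STOPWORDS = {"ag"}  # assumed content of stopwords.txt, for illustration only


def base_tokens(text):
    """Whitespace tokenizer + lowercase filter + stop filter."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def index_terms(text, min_gram=1, max_gram=20):
    """Index-side chain: base tokens expanded into edge n-grams."""
    terms = set()
    for token in base_tokens(text):
        for n in range(min_gram, min(len(token), max_gram) + 1):
            terms.add(token[:n])
    return terms


def query_terms(text):
    """Query-side chain: same filters, but no n-gram expansion."""
    return base_tokens(text)


def matches(field_value, user_input):
    """True if any query term equals one of the indexed terms."""
    indexed = index_terms(field_value)
    return any(term in indexed for term in query_terms(user_input))


print(matches("Agentur Muster AG", "age"))  # prefix of "agentur"
print(matches("Muster AG", "a"))            # "ag" was dropped before n-gramming
```

 Because the stop filter runs before the n-gram filter, the standalone word
 "ag" contributes no grams at all, so typing "a" or "ag" only finds values in
 which a longer word starts with those letters.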

 and finally the following in my solrconfig.xml

  <requestHandler name="auto" class="solr.SearchHandler" default="true">
    <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <int name="rows">10</int>
     <str name="qf">name firstname email^0.5 telefon^0.5 city^0.6 street^0.6</str>
     <str name="fl">name,firstname,telefon,email,city,street</str>
    </lst>
  </requestHandler>
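
 With these defaults in place the client only has to send the q parameter. As a
 rough Python sketch of what the handler saves you from sending, the snippet
 below spells the same parameters out explicitly and lets individual requests
 override them. The URL is a hypothetical local installation and build_url is
 an illustrative helper, not part of any Solr client library.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port and path for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# The same values the "auto" handler supplies as defaults.
HANDLER_DEFAULTS = {
    "defType": "dismax",
    "echoParams": "explicit",
    "rows": 10,
    "qf": "name firstname email^0.5 telefon^0.5 city^0.6 street^0.6",
    "fl": "name,firstname,telefon,email,city,street",
}


def build_url(user_input, **overrides):
    """Build a select URL; per-request parameters win over the defaults."""
    params = {**HANDLER_DEFAULTS, **overrides, "q": user_input}
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_url("mei", rows=5))
```

 The boosts in qf rank a hit in name or firstname above the same hit in city or
 street, and those above a hit in email or telefon.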

 This all works well. There is just one minor ugliness, which might still be 
 solvable inside Solr, but I fixed it in the PHP frontend logic. The issue is 
 that I obviously get all the fields returned for each document, and I need to 
 figure out which of them actually matched so that I can present it in the 
 autosuggest. Is there some Solr magic that will do this work for me?

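        $query = new SolrQuery($searchstring);
        $response = $this->solrClientAuto->query($query);
        $numFound = empty($response->response->numFound) ? 0 : $response->response->numFound;
        $data = array('results' => array(), 'numFound' => $numFound);

        // Prefix to look for: everything from the first space on (or the whole
        // input if there is none), without quotes or surrounding whitespace.
        $p = trim(str_replace('"', '', substr($searchstring, (int) strpos($searchstring, ' '))));

        if ($p !== '' && !empty($response->response->docs)) {
            foreach ($response->response->docs as $doc) {
                foreach ((array)$doc as $value) {
                    // Keep values that start with the prefix or contain a word starting with it.
                    if (stripos($value, $p) === 0 || stripos($value, ' '.$p) !== false) {
                        // Keyed by value, so duplicates collapse.
                        $data['results'][$value] = 1;
                    }
                }
            }
        }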
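
 For comparison, here is the same post-filtering idea as a small Python sketch
 that uses a regular expression for the "word starts with the input" test
 instead of two stripos calls. It assumes each document is a plain dict of
 field name to string value; matching_values is a made-up name, and the sample
 documents are invented.

```python
import re


def matching_values(docs, user_input):
    """Collect the distinct field values in which a word starts with the input."""
    prefix = user_input.replace('"', "").strip()
    if not prefix:
        return []
    # (?<!\w) requires that the prefix is not preceded by a word character,
    # i.e. it sits at the start of a word; re.escape neutralises any regex
    # metacharacters typed by the user.
    pattern = re.compile(r"(?<!\w)" + re.escape(prefix), re.IGNORECASE)
    seen = {}
    for doc in docs:
        for value in doc.values():
            if isinstance(value, str) and pattern.search(value):
                seen[value] = True  # dict keys dedupe and keep insertion order
    return list(seen)


docs = [
    {"name": "Meier", "firstname": "Anna", "city": "Bad Meinberg"},
    {"name": "Schmidt", "firstname": "Heimeier", "city": "Meiningen"},
]
print(matching_values(docs, "mei"))
```

 "Heimeier" is skipped because "mei" only occurs in the middle of the word,
 which mirrors how the edge n-grams were indexed.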

 Then again, I have to review with the UI guys whether we will always just show 
 the name anyway and replace the entire user-entered term with the name, which 
 should be sufficiently unique in most cases to get a small enough result set.

 regards,
 Lukas