Zend Lucene Search for Multi Language


Summary:

Search implementation for multiple languages using Zend Lucene

I have a problem searching Russian strings with Zend Search Lucene. Here is my actual code:

// Before

// Create index
$index = Zend_Search_Lucene::create('data/index');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('samplefield', 'русский текст; english text'));
$index->addDocument($doc);
$index->commit();

The problem here is that the default analyzer works only with ASCII text.
That is because the mbstring PHP extension is not included in PHP installations by default, and iconv() doesn't have the necessary functionality.

You should use the special UTF-8 analyzers to work with non-ASCII text that can't be transliterated by iconv().
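To make the difference concrete, here is a minimal sketch of ASCII-only versus Unicode-aware tokenization. The two helper functions and their regular expressions are assumptions made up for this illustration; they are not the actual Zend_Search_Lucene analyzer code, only a simplified picture of why Cyrillic words vanish under an ASCII-only rule.

```php
<?php
// Hypothetical helpers for illustration only; not part of Zend_Search_Lucene.

// ASCII-only rule: a token is a run of the letters a-z / A-Z.
function asciiTokens($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

// Unicode-aware rule: a token is a run of Unicode letters.
// The /u modifier makes PCRE treat the subject as UTF-8.
function utf8Tokens($text)
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

$text = 'русский текст; english text';

$ascii = asciiTokens($text); // only the English words survive
$utf8  = utf8Tokens($text);  // Russian and English words are both kept

print_r($ascii);
print_r($utf8);
```

With the ASCII-only rule the Russian words never become terms, so no query can ever match them.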

// Add this extra line to replace the default analyzer with Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8. The default analyzer destroys the non-ASCII characters.

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

// After
// Create index
$index = Zend_Search_Lucene::create('data/index');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('samplefield',
    'русский текст; english text',
    'utf-8'));
$index->addDocument($doc);
$index->commit();

The analyzer has to be set before the index files are created.

When indexing database columns, we need to run a MySQL query. Before executing that query, we need to call mysql_query("SET NAMES 'utf8'").

mysql_query("SET NAMES 'utf8'");

$contents = mysql_query($query);

This tells MySQL that all incoming data is UTF-8, and MySQL converts it into the table/column encoding. The same happens when MySQL sends data back to you: it is converted into UTF-8. You also have to make sure you set the Content-Type response header to indicate the UTF-8 encoding of the pages.
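The mysql_* functions used above were removed in PHP 7.0, so on current PHP the same step has to go through another extension. The following is a rough sketch using mysqli, under the assumption that the caller already holds a connected mysqli object; fetchUtf8Rows() is a hypothetical helper name invented for this example.

```php
<?php
// Sketch only: fetchUtf8Rows() is a hypothetical helper, and $db is assumed
// to be an already connected mysqli object supplied by the caller.
function fetchUtf8Rows(mysqli $db, $sql)
{
    // Switch the connection charset; this replaces the "SET NAMES 'utf8'" query.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not switch the connection to utf8');
    }

    // $sql is expected to be a SELECT statement that returns a result set.
    $result = $db->query($sql);
    if ($result === false) {
        throw new RuntimeException('Query failed: ' . $db->error);
    }

    // Collect every row as an associative array of UTF-8 strings.
    $rows = array();
    while ($row = $result->fetch_assoc()) {
        $rows[] = $row;
    }
    $result->free();

    return $rows;
}
```

The returned rows can then be passed to the indexing code with 'utf-8' as the field encoding. For the response header, a call such as header('Content-Type: text/html; charset=utf-8') before any page output is the usual way to declare the encoding.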

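// Searching

We need to set the same analyzer as the default before searching too, so that queries are tokenized the same way as the indexed text. The example below searches with the _CaseInsensitive variant; in that case the index should be built with that variant as well.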

// Before

// Query the index:
$queryStr = 'english';
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'utf-8');
$hits = $index->find($query);
foreach ($hits as $hit) {
    /* @var $hit Zend_Search_Lucene_Search_QueryHit */
    $doc = $hit->getDocument();
    echo $doc->getField('samplefield')->value, PHP_EOL;
}

// After
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive());

// Open index
$index = Zend_Search_Lucene::open('data/index');

// Parse query strings as UTF-8
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
$queryStr = 'english';
foreach ($index->find($queryStr) as $hit) {
    echo $hit->samplefield, PHP_EOL;
}

UTF-8 compatible text analyzers

Zend_Search_Lucene also contains a set of UTF-8 compatible analyzers: Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive.

Any of these analyzers can be enabled with code like this:

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

Warning

UTF-8 compatible analyzers were improved in Zend Framework 1.5. Early versions of the analyzers assumed that all non-ASCII characters are letters. The new analyzer implementation behaves more accurately.

This may require you to rebuild the index so that data and search queries are tokenized in the same way; otherwise the search engine may return wrong result sets.
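The mismatch can be sketched in a few lines. Both tokenizers below are hypothetical and their patterns are assumptions chosen to mimic the two behaviors described in the warning; they are not taken from the Zend Framework source.

```php
<?php
// Hypothetical tokenizers for illustration; the patterns are assumptions,
// not the actual Zend Framework implementation.

// "Old" rule: every byte outside ASCII is treated as part of a letter.
function oldStyleTokens($text)
{
    preg_match_all('/[a-zA-Z\x80-\xFF]+/', $text, $matches);
    return $matches[0];
}

// "New" rule: only real Unicode letters form a token.
function newStyleTokens($text)
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

// Terms as an old index would have stored them: the quotation marks
// are non-ASCII, so they stay glued to the word.
$indexedTerms = oldStyleTokens('«текст»');

// Terms produced from the user's query at search time.
$queryTerms = newStyleTokens('текст');

// The query term has no exact counterpart among the stored terms.
$found = count(array_intersect($queryTerms, $indexedTerms)) > 0;
var_dump($found);
```

Here the old rule stores «текст» as a single term including the quotation marks, while the query produces the bare word, so the lookup misses until the index is rebuilt with the new rule.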
