Mar 1 / Richard Knop

A Simple Search Engine Implementing Zend_Search_Lucene

Zend_Search_Lucene is a PHP port of Apache Lucene, a popular Java search engine, and an important part of Zend Framework. Some say it is too slow for demanding web applications and recommend faster alternatives such as Sphinx, but that is not today’s topic. In this post I will show you a basic implementation of Zend_Search_Lucene that has worked well so far on the medium-sized websites I have worked on. There are two main tasks you will have to take care of:

  1. Creating an index and updating it regularly.
  2. Searching the index with a powerful query language.

First, let’s create a fresh search index. I know the blog example is getting tiresome, but I will use a simple blog application for my example implementation. To simplify it even further, only blog posts will be searchable. The posts schema looks like this:

CREATE TABLE posts (
    id INT NOT NULL AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    body TEXT NOT NULL,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    user_id INT NOT NULL,
    INDEX (created_at),
    INDEX (user_id),
    FOREIGN KEY (user_id)
        REFERENCES users(id)
        ON UPDATE CASCADE
        ON DELETE CASCADE,
    PRIMARY KEY (id)
) ENGINE = INNODB;

Creating the search index is easy:

Zend_Search_Lucene::setDefaultSearchField('contents');

// create blog posts index located in data/posts_index
// make sure the folder is writable
$index = Zend_Search_Lucene::create('data/posts_index');

// $this->_getTable() is a method that returns a model
// get() method of the model returns all posts from the database
$posts = $this->_getTable('Posts')->get();
// iterate through posts and build the index
foreach ($posts as $p) {
    $doc = new Zend_Search_Lucene_Document();
    // stored but not indexed: returned with hits, not searchable
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('entry_id', $p->id));
    // indexed, stored and tokenized: searchable word by word
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $p->title));
    // indexed and tokenized but not stored: searchable, not returned
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $p->body));
    $index->addDocument($doc);
}
// commit the index
$index->commit();

Pretty straightforward. You can see I have used three different static methods for adding fields to the document:

  • UnIndexed: not indexed (and therefore not searchable) but stored, so the values are returned with search results. UnIndexed fields usually hold primary keys, timestamps or file paths.
  • Text: indexed, stored and tokenized. Text fields are searchable and are returned with search hits. Titles, first and last names, cities and states, post codes and street names are all good candidates for Text fields.
  • UnStored: indexed and tokenized but not stored, so the text is searchable but is not returned with search hits. Ideal for large texts.

There are more types of fields you can use (Keyword, Binary), and you can read about them in the documentation.

The next thing you need to do is update the index every once in a while so that search hits return up-to-date information. There are two ways to approach this. The most obvious is to update the index every time a new post is published or an existing post is edited. The other is to set up a cron job that runs every now and then and rebuilds the index. Which way you choose depends on many variables, such as the expected index size (a very large index can be a few GB in size).
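The first approach could look roughly like the sketch below. Zend_Search_Lucene has no in-place update, so a changed post is deleted from the index and added again. Note the assumptions: the entry_id field above is UnIndexed and cannot be searched, so this sketch assumes every document also carries an extra Keyword field, here called post_key, holding the post ID. Both post_key and the reindexPost() helper are hypothetical names, not part of the indexing code above.

```php
<?php
// Sketch only. Assumes every indexed document has a Keyword field named
// 'post_key' (hypothetical) that holds the post ID, and that the Zend
// Framework classes are loadable (e.g. through the autoloader).

/**
 * Replace the indexed document of one post.
 *
 * @param object $index  an opened index, e.g. Zend_Search_Lucene::open(...)
 * @param int    $postId ID of the post that was created or edited
 * @param object $newDoc a Zend_Search_Lucene_Document built for that post
 * @return int number of old documents removed
 */
function reindexPost($index, $postId, $newDoc)
{
    // Terms built in PHP are not run through the analyzer,
    // so the ID is matched literally against the Keyword field.
    $term  = new Zend_Search_Lucene_Index_Term((string) $postId, 'post_key');
    $query = new Zend_Search_Lucene_Search_Query_Term($term);

    $deleted = 0;
    foreach ($index->find($query) as $hit) {
        // $hit->id is the internal document number, not the post ID
        $index->delete($hit->id);
        $deleted++;
    }

    $index->addDocument($newDoc);
    $index->commit();

    return $deleted;
}

// Usage (after saving post $p to the database):
//   $doc = new Zend_Search_Lucene_Document();
//   $doc->addField(Zend_Search_Lucene_Field::Keyword('post_key', (string) $p->id));
//   ... add the entry_id, title and contents fields as before ...
//   reindexPost(Zend_Search_Lucene::open('data/posts_index'), $p->id, $doc);
```

For a brand new post nothing is found, nothing is deleted, and the document is simply added.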

Second, now that the index is taken care of, let’s search it:

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

Zend_Search_Lucene::setResultSetLimit(10);

// split the search query into individual words, skipping empty strings
$words = preg_split('/\s+/', urldecode($request->getParam('search_for')), -1, PREG_SPLIT_NO_EMPTY);
// start a search query and add a required term for each word to it
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
foreach ($words as $w) {
    // terms added this way are not analyzed, so lowercase them to match
    // the lowercased terms written by the default analyzer at indexing time
    $query->addTerm(new Zend_Search_Lucene_Index_Term(strtolower($w), 'contents'), true);
}

// open and query the index
$index = Zend_Search_Lucene::open('data/posts_index');
$results = $index->find($query); // the search results
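Each hit exposes the stored fields of its document as properties, plus a relevance score. Here is a minimal sketch of turning the hits into plain arrays for a view, assuming the field names from the indexing code (entry_id and title). The hitsToRows() helper is a hypothetical name. The contents field is UnStored, so the post body is not available on the hit and has to be loaded from the database using the post ID.

```php
<?php
// Sketch only: hitsToRows() is a hypothetical helper, not part of Zend Framework.

/**
 * Convert search hits into simple arrays.
 *
 * @param array $hits the result of $index->find(...)
 * @return array rows with post_id, title and score
 */
function hitsToRows($hits)
{
    $rows = array();
    foreach ($hits as $hit) {
        $rows[] = array(
            'post_id' => (int) $hit->entry_id, // stored field with the primary key
            'title'   => $hit->title,          // stored field
            'score'   => $hit->score,          // relevance of this hit
        );
    }
    return $rows;
}

// Usage:
//   $rows = hitsToRows($results);
//   then load the full posts from the database by post_id
```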

That was about the simplest possible example of a Lucene search query. You can, however, create very complex queries with the powerful Lucene query language. You can either write queries as strings and let Zend_Search_Lucene parse them, or build them in PHP with the query classes, as in the code above. It’s so easy a baby could do it.

To search for posts that contain both ‘hello’ and ‘world’ in the contents field you would write this query:

+hello +world

To search for a post that must contain ‘hello’ and may contain ‘world’:

+hello world

To search for a post that must contain ‘hello’ in the contents field and may contain ‘world’ in the title field:

+hello title:"world"

And those are just the basics. You can use boolean operators, wildcards, ranges and even perform a fuzzy search.
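If the query string comes from a search box, it is safer to assemble it yourself than to pass raw user input to the parser. The sketch below builds a string in the same style as the examples above: required words get a + prefix and optional words do not. The buildQueryString() and cleanWord() helpers are hypothetical names. Stripping everything except letters and digits, and quoting each word, are assumptions of this sketch, made so that user input cannot inject operators.

```php
<?php
// Sketch only: both helpers are hypothetical, not part of Zend Framework.

/**
 * Keep only letters and digits so the word cannot contain query operators.
 */
function cleanWord($word)
{
    return preg_replace('/[^\p{L}\p{N}]+/u', '', $word);
}

/**
 * Build a query string from required and optional words.
 */
function buildQueryString(array $required, array $optional = array())
{
    $parts = array();
    foreach ($required as $word) {
        $clean = cleanWord($word);
        if ($clean !== '') {
            $parts[] = '+"' . $clean . '"'; // must be present
        }
    }
    foreach ($optional as $word) {
        $clean = cleanWord($word);
        if ($clean !== '') {
            $parts[] = '"' . $clean . '"'; // may be present
        }
    }
    return implode(' ', $parts);
}

// Usage:
//   $queryString = buildQueryString(array('hello'), array('world'));
//   $results = Zend_Search_Lucene::open('data/posts_index')->find($queryString);
```

The find() method accepts a query string as well as a query object, so the result can be passed to it directly.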
