Recommendation: Don’t use Zend PHP Lucene!

This blog post has a single purpose: to warn against using the Zend Framework PHP implementation of Lucene.

I was forced to use it in a project. My customer wanted a sophisticated search engine for his Typo3-based website, especially for his business objects. I don’t do Typo3, so my first thought was to build the search engine as an external service on Java Lucene, let Lucene index both the website and the business objects from the database, and let Typo3 query Lucene through some HTTP-based service: SOAP, XML-RPC, HTTP + JSON, whatever.

Next came some objections from my customer: we only had some “managed root servers”. These were preconfigured by the hosting company, and the contract didn’t allow any major configuration changes because of service, support and warranty issues. In particular, we were not allowed to install a Java web container such as Tomcat to run Lucene as a webapp, and a monitoring service would look for unwanted threads on each server and kill them instantly. No root access, and only PHP (+ Apache + MySQL) was allowed to run.

Therefore we decided to give the PHP implementation of Lucene inside the Zend Framework a try. But honestly, this turned out to be a nightmare. Here are some reasons:

  1. Memory. We had a memory limit of 64 MB per PHP thread on that server. It was not possible to add 10,000 documents to the index within one php-cli run. We requested 128 MB from the hosting company, but it didn’t help. I had to shut down the indexing process, restart it and re-open the index repeatedly, no matter what I tried. Even closing the index and unsetting all references didn’t help.
  2. Memory. In addition, my observation is that Java has far superior class loading, memory management and garbage collection compared with PHP. If you don’t pay close attention to this, it makes any serious, larger project really difficult. PHP Lucene seems to fall short in this respect.
  3. Speed. If you want to build a big index: forget it. A Lucene index can contain several million documents, and I wouldn’t know how to build such an index with PHP Lucene. My index currently contains 60,000 documents, and it takes PHP about 2 hours to build it from scratch. Updating or optimizing an index is also much faster in the Java version.
  4. Speed. This is where PHP Lucene really went bad: querying the index. My queries were not really complex: maybe 10 terms on different fields, including a few range query terms. It sometimes took more than 60 seconds to get a result, when I got a result at all instead of a fatal out-of-memory error or a max execution timeout. The same query in Java Lucene, tested with Luke: a few milliseconds! Unbelievable!
  5. Features. Java Lucene is a well-known, proven project. There are many add-ons and related projects such as Solr that make your life with Lucene a joy. There is no comparable ecosystem in the PHP world. For example, my customer came up with the idea of a location-based search. In Java I could use the Local Lucene add-on to do geographical search. I wouldn’t know how to do this in PHP without reinventing such an algorithm myself.
  6. Features. Another request was: sort results by frequency or count of a term in a field. This sounds simple, but it’s close to impossible – at least in PHP. To the best of my knowledge, you can’t add a field with a TermVector or equivalent for that.
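
The restart-and-reopen workaround from item 1 amounts to indexing in small batches, one short-lived process per batch. A minimal sketch of the batch planning, written in Java for illustration only; the class `BatchPlan`, the batch size of 2,000 and the `index --offset/--limit` command line are assumptions, not the code used in this project:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a document count into (offset, limit) batches. */
public class BatchPlan {

    /** One batch: start at 'offset' and index at most 'limit' documents. */
    public static final class Batch {
        public final int offset;
        public final int limit;

        Batch(int offset, int limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    /** Covers documents 0..totalDocs-1 with consecutive batches of at most batchSize. */
    public static List<Batch> plan(int totalDocs, int batchSize) {
        if (totalDocs < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and batchSize > 0");
        }
        List<Batch> batches = new ArrayList<>();
        for (int offset = 0; offset < totalDocs; offset += batchSize) {
            // The last batch may be shorter than batchSize.
            batches.add(new Batch(offset, Math.min(batchSize, totalDocs - offset)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // 60,000 documents as in the post; 2,000 per run is an assumed batch size.
        for (Batch b : plan(60_000, 2_000)) {
            // Each printed line stands for one separate indexer run (hypothetical command).
            System.out.println("index --offset=" + b.offset + " --limit=" + b.limit);
        }
    }
}
```

A wrapper script would then start one fresh indexing process per line, so that whatever memory a run fails to release is returned when the process exits.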

So what happened to my project? We ended up leaving the indexing part in PHP, at least for now. I can see the end coming, though: the amount of data will keep growing and will eventually exceed the limits of PHP Lucene. Some postprocessing and optimizing of the index, as well as the querying part, have been ported to Java, after I finally managed to run a small-footprint Jetty “illegally” and unsupported on that server.
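
Where the index itself cannot sort by term count (item 6 above), one fallback is to count the term in the stored field text of the hits after the query has run. A rough sketch in plain Java, without any Lucene API; the class and method names are hypothetical, the tokenizer is deliberately naive, and this is not the postprocessing code that was actually ported:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical post-processing step: order field texts by how often a term occurs. */
public class TermCountSort {

    /** Counts how often 'term' occurs as a whole token in 'fieldText' (case-insensitive). */
    public static int countTerm(String fieldText, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        // Naive tokenizer: split on anything that is not a letter or a digit.
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    /** Returns a copy of 'fieldTexts' sorted by descending count of 'term'. */
    public static List<String> sortByTermCount(List<String> fieldTexts, String term) {
        List<String> sorted = new ArrayList<>(fieldTexts);
        // List.sort is stable, so texts with equal counts keep their original order.
        sorted.sort(Comparator.comparingInt((String text) -> countTerm(text, term)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("php search", "java java search", "java search");
        System.out.println(sortByTermCount(hits, "java"));
    }
}
```

This only works on the hits already fetched from the index, so it suits small result sets; it does not replace proper term-frequency support in the index.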

Conclusion: Don’t. consider. using. PHP. Lucene. Ever. (Unless your project and amount of data are quite small and the limitations don’t matter.)

Responses


  1. Christian

    Hi, a tip: since the Lucene implementation in Ruby is not free of problems either, I had to look for an alternative here. That is how I came across Sphinx:

    Sphinx indexes the contents of SQL databases and offers bindings for quite a few programming languages, including PHP of course. The index server runs as a separate process and indexes very fast, in my tests at 6,000 to 10,000 documents per second.

    http://sphinxsearch.com/

    Apparently quite a few people are combining this search with the Typo3 extension indexed_search. Google helps.

    Best regards,

    Christian

    22.07.2009 @ 09:04


  2. David Goodwin

    I can’t say I’ve been particularly impressed with Zend_Search_Lucene either – it seems slow. If/when I need to do another such system, I think I’ll use a standalone indexer (e.g. sphinx).

    16.10.2009 @ 06:19


  3. Daniel latter

    You can’t compare PHP to Java, that’s why this whole article is flawed.

    Of course Java will be faster, it’s a compiled language whereas PHP is interpreted. Also, why doesn’t this article mention anything about caching, opcode caches etc.? To me, all this article does is show the author’s lack of knowledge and experience with PHP.

    16.10.2009 @ 08:31


  4. Hari K T

    You may be right, but as Daniel latter says, this can only be read as a comparison between two languages.

    Anyway, thanks for pointing this out, so anyone looking for a comparison between the languages will know what each side (Java and PHP people) thinks.

    11.11.2009 @ 18:00



