Why we now prefer the Sphinx XML input format

by Daniel Alexander Smith on November 7th, 2009

I’ve recently altered the mSpace automated data import system (for which we are still finalising a full open-source release) from using the MySQL index source in Sphinx to the xmlpipe2 input source. The difference is that with the MySQL index source, the system generates one mega SQL query that JOINs all of your tables and returns a row for each “document” to be indexed, with each column corresponding to a searchable full-text field. This means that where you have a one-to-many relationship in the data, you have to concatenate the results into a single field; we were using MySQL’s GROUP_CONCAT function to do this. That worked fine until recently.
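To give a feel for the shape of such a query, here is a minimal sketch (the table and column names are invented for illustration, not mSpace’s actual schema):

```sql
-- One row per indexed "document"; the one-to-many documents->pages
-- relationship is flattened into a single column via GROUP_CONCAT,
-- which is exactly where the truncation problem lives.
SELECT d.id,
       d.title,
       GROUP_CONCAT(p.text SEPARATOR ' ') AS full_text
FROM documents d
JOIN pages p ON p.document_id = d.id
GROUP BY d.id, d.title;
```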

A recent project required the interface to allow searching of the full text of large PDF files, and users noticed that searches were not returning documents that contained the terms they searched for. After some debugging I found that GROUP_CONCAT was truncating its results, and some reading of the MySQL manual led me to the group_concat_max_len variable. I raised it to the maximum allowed, tested on some documents, and everything was fine. For a couple of weeks, anyway.
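For the record, the fix at this stage was a one-liner; the default threshold is a mere 1024 bytes (the value below is illustrative, not the hard maximum, which depends on your build):

```sql
-- Raise the GROUP_CONCAT truncation threshold for this session
-- (default is only 1024 bytes; can also be set globally or in my.cnf).
SET SESSION group_concat_max_len = 4294967295;
```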

The problem recurred, and this time no amount of debugging or manual reading could explain why the results were truncating. In the course of my research I also realised that the length of a MySQL result row is limited anyway (to something like 32K; I forget the exact figure). While that wasn’t the problem in this case, it would become one in future, and possibly for this specific project, as PDFs are reloaded from a repository automatically and their extracted text could well be very large.

So I chose to suspend learning about the seemingly endless number of MySQL environment variables that impose individual, combined and row-level length limits, and see if I could bypass this nightmare altogether. Specifically, I looked into Sphinx’s XML input source type, xmlpipe2. The fundamental difference with this source type is that instead of telling Sphinx what my data looks like in the database that backs the mSpace itself, I pull all of the data out into a single XML file and hand that to the Sphinx indexer. There is no reason Sphinx needs to know about the database at all: provided the document IDs match the primary key of my First Order Facet table, everything matches up perfectly.
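An xmlpipe2 stream is just a single XML document: a schema declaring the full-text fields, followed by one sphinx:document element per row, whose id attribute is the thing that must match the primary key. A minimal sketch (field names here are illustrative, not our actual schema):

```xml
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="title"/>
    <sphinx:field name="content"/>
  </sphinx:schema>

  <sphinx:document id="1">
    <title>First document</title>
    <content>Arbitrarily long full text, with no GROUP_CONCAT in sight.</content>
  </sphinx:document>
</sphinx:docset>
```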

To do this I created a Sphinx XML export plugin for the mSpace server. Naturally, this first required me to implement a plugin architecture for the mSpace server backend, but that was in the “medium priority” section of my to-do list anyway. The mSpace import system was reordered so that the mSpace server is up and running slightly before the Sphinx index is built (so that the import system can use the server’s new export function to get the XML file), and the new index is now built from the xmlpipe2 input type.
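The core of such an export plugin is straightforward serialisation. Here is a hedged sketch in Python of the general idea, not our actual plugin code; the function name, the documents-as-a-dict shape, and the field names are all assumptions for illustration:

```python
from xml.sax.saxutils import escape


def build_sphinx_docset(documents):
    """Serialise documents into Sphinx xmlpipe2 format.

    `documents` maps a document id (the primary key the index must
    match) to a dict of field name -> full text. Field names are taken
    from the first document and declared once in the schema.
    """
    field_names = sorted(next(iter(documents.values())).keys())
    parts = ['<?xml version="1.0" encoding="utf-8"?>',
             '<sphinx:docset>',
             '<sphinx:schema>']
    # Declare each searchable full-text field once, up front.
    for name in field_names:
        parts.append('<sphinx:field name="%s"/>' % name)
    parts.append('</sphinx:schema>')
    # One sphinx:document per row; text is XML-escaped, never truncated.
    for doc_id, fields in sorted(documents.items()):
        parts.append('<sphinx:document id="%d">' % doc_id)
        for name in field_names:
            parts.append('<%s>%s</%s>' % (name, escape(fields.get(name, '')), name))
        parts.append('</sphinx:document>')
    parts.append('</sphinx:docset>')
    return '\n'.join(parts)
```

Because the text is simply escaped and streamed out, a field can be as long as the PDF it came from.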

I am happy to confirm that exporting the data this way does not suffer from the truncation problems of the MySQL input type, and appears to work perfectly. Having a plugin system with a decent example export plugin ready should also pay some dividends in future.


