Build yourself a Xapian index of package info

Run the Debian indexer on your distro

The Debian Xapian indexer is called update-apt-xapian-index and normally it reads data from the Apt database. Luckily it also has an option (--pkgfile=file) for reading data from a plain file, which is used to build server-side indices and to build a test environment for its test suite. If you can generate a suitable input file, update-apt-xapian-index will build an index for you.

The input file has the same format as the Debian Packages file, which is similar to email or HTTP headers:

Package: 2vcard
Priority: optional
Section: utils
Installed-Size: 108
Maintainer: Martin Albisetti <argentina@gmail.com>
Architecture: all
Version: 0.5-3
Filename: pool/main/2/2vcard/2vcard_0.5-3_all.deb
Size: 14300
MD5sum: d831fd82a8605e9258b2314a7d703abe
SHA1: e903a05f168a825ff84c87326898a182635f8175
SHA256: 2be9a86f0ec99b1299880c6bf0f4da8257c74a61341c14c103b70c9ec04b10ec
Description: perl script to convert an addressbook to VCARD file format
 2vcard is a little perl script that you can use to convert the
 popular vcard file format. Currently 2vcard can only convert addressbooks
 and alias files from the following formats: abook,eudora,juno,ldif,mutt,
 mh and pine.
 .
 The VCARD format is used by gnomecard, for example, which is used by the
 balsa email client.
Tag: implemented-in::perl, role::program, use::converting

Package: 3dchess
[...]

Records are separated by an empty line, and long fields like 'Description' use continuation lines that start with a space. The first line of the description is the short description, the rest is the long description; an empty line inside the long description is represented by a continuation line containing only a dot.
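To make the format concrete, here is a minimal Python sketch of a reader for it. It is not the parser that update-apt-xapian-index itself uses; the function names parse_packages and split_description, and the packages foo and bar in the sample, are made up for illustration.

```python
def parse_packages(text):
    """Split Packages-style text into a list of {field: value} dicts."""
    records = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if not line.strip():
            # An empty line ends the current record.
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[0] in " \t":
            # Continuation line: append it to the previous field.
            if last_field is not None:
                current[last_field] += "\n" + line[1:]
        else:
            name, _, value = line.partition(":")
            last_field = name.strip()
            current[last_field] = value.strip()
    if current:
        records.append(current)
    return records


def split_description(description):
    """Return (short, long), turning '.' lines back into empty lines."""
    short, _, rest = description.partition("\n")
    if not rest:
        return short, ""
    long_lines = ["" if l.strip() == "." else l for l in rest.split("\n")]
    return short, "\n".join(long_lines)


# Hypothetical sample data, not taken from any real archive.
SAMPLE = """\
Package: foo
Version: 1.0
Description: example short description
 First paragraph.
 .
 Second paragraph.

Package: bar
Version: 2.0
Description: another example
"""

if __name__ == "__main__":
    for record in parse_packages(SAMPLE):
        short, long_text = split_description(record["Description"])
        print(record["Package"], record["Version"], "-", short)
```

The sketch ignores corner cases such as comments or duplicate fields, so treat it as a reading aid and not as a validator.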

For update-apt-xapian-index you only need the fields Package, Version, Description, Tag, Section, Installed-Size and Size. Tag, Section, Installed-Size and Size are all optional, although you probably want Tag for Debtags categories.
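If your package data lives somewhere else, generating the input file can be as simple as the following Python sketch. It assumes each package is available as a dict keyed by field name, with the short and long description joined by a newline in 'Description'; it treats Package, Version and Description as required, which is how the paragraph above reads. The helper names format_record and format_pkgfile and the sample package foo are hypothetical.

```python
import sys

REQUIRED = ("Package", "Version", "Description")
OPTIONAL = ("Section", "Installed-Size", "Size", "Tag")


def format_record(pkg):
    """Render one package dict as a Packages-style record."""
    missing = [f for f in REQUIRED if not pkg.get(f)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    lines = [f"Package: {pkg['Package']}", f"Version: {pkg['Version']}"]
    for field in OPTIONAL:
        if pkg.get(field):
            lines.append(f"{field}: {pkg[field]}")
    # First line is the short description, the rest become continuation lines.
    short, _, rest = pkg["Description"].partition("\n")
    lines.append(f"Description: {short}")
    if rest:
        for line in rest.split("\n"):
            # Empty lines in the long description are written as " ."
            lines.append(" " + line if line.strip() else " .")
    return "\n".join(lines) + "\n"


def format_pkgfile(packages):
    """Join records with an empty line between them."""
    return "\n".join(format_record(p) for p in packages)


if __name__ == "__main__":
    # Hypothetical example package, for illustration only.
    packages = [
        {
            "Package": "foo",
            "Version": "1.0",
            "Tag": "role::program",
            "Description": "short text\nFirst paragraph.\n\nSecond paragraph.",
        },
    ]
    sys.stdout.write(format_pkgfile(packages))
```

Redirecting the output to a file gives you something to pass to --pkgfile; whether the indexer accepts it unchanged is something to verify against your own data.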

If you want to start playing with the indexer without building your own input file, you can run apt-cache dumpavail on any Debian or Ubuntu system to extract the whole system dataset. Alternatively, you can use any Packages file from a Debian mirror.

Dependencies:

  • python-xapian (Python bindings for Xapian)
  • python-debian (used to read some Debian-style files, source is straightforward to build)
  • python-chardet (dependency of python-debian, available in Fedora, Mandriva/Mageia and Suse with the same name)

Building the index:

git clone git://git.debian.org/git/collab-maint/apt-xapian-index.git
cd apt-xapian-index

# Testrun is just a simple wrapper that exports the variables needed
# to run the indexer in the current directory
./testrun --pkgfile=inputfile --force --verbose # Creates an index in testdb/

# Try querying it with Xapian's low-level "delve" tool, to see if it worked:
delve -1 -d -t edit testdb/index

The Xapian index itself is in testdb/index; testdb/ will contain other information about the index, including an autogenerated README file documenting its contents, especially the term prefixes used by the index.

Congratulations: you can now try querying the index. The Xapian website has documentation and examples for C++ and for the Python, Perl, PHP, Ruby, C#, Java and other bindings.

Patches are welcome for alternative input file formats and for plugins that index any extra information you need. Please update this page with your experience if you try it.

Possible things to try:

  • Change DEBTAGSDB in plugins/debtags.py to make it read Debtags information from one of the distromatch exports, so you don't need to add them as Tag: fields.
  • Get pkgshelf to work (it should only need export AXI_DB_PATH=testdb and editing pkgshelf/__init__.py to replace /var/lib/debtags/package-tags with the location of your distromatch Debtags export).
  • If you need to index some extra information, take a look at plugins/template.py for a plugin template: you only need to redefine the method indexDeb822.
  • Build an indexer that reads the native package database for your own distribution, then get in touch with Enrico to see if it can all fit in the same codebase.