Saturday, October 12, 2013

Java Thread Dumps for Daemons With jstack

jstack is a really helpful utility that comes standard with most linux JDK versions. It allows you to generate java thread dumps in situations where kill -3 won't work. kill -3 (aka kill -QUIT) will dump a java process's threads to stderr — but this, of course, works only when you still have access to stderr. If you're running a java process as a daemon (like a jetty or tomcat or jboss etc server), stderr usually is inaccessible.
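
For example, if the process were running in the foreground (with a hypothetical process ID of 12345), a plain kill would be enough:

# both send SIGQUIT; the dump goes to the java process's own stderr
kill -3 12345
kill -QUIT 12345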

Fortunately, jstack allows you to generate thread dumps without needing to read stderr from the original java process. jstack will dump the threads to jstack's own stdout, which you can pipe to a file, or through a pager, or just send directly to your terminal. There are a few tricks to using it, however, which don't seem to be documented anywhere:

1. Use the right jstack executable

jstack usually will work only if it comes from the exact same JDK version that the target JVM process is running. Since a lot of servers end up having several different JVM versions installed on them, it's important to make sure that the version of jstack you're trying to use is the right one — the jstack executable at /usr/bin/jstack won't necessarily be correct. And since jstack doesn't accept a -version flag, it's pretty hard to tell which version /usr/bin/jstack actually is.
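
One way to check which JDK the target JVM is actually running (on linux, assuming a target process ID of 12345) is to look at the process's executable or command line:

# shows the path of the java binary that PID 12345 is running
sudo readlink /proc/12345/exe

# or look for the java path in the full command line
ps -fp 12345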

So the most reliable way to run jstack is from the bin directory of the JDK which you're using to run the target JVM process. On ubuntu, this usually will be a subdirectory of one of the JDKs in the /usr/lib/jvm directory (like /usr/lib/jvm/java-6-openjdk-amd64 for the 64-bit version of the java 6 JDK). In that case, you might run jstack like this (when the target JVM's process ID is 12345):

/usr/lib/jvm/java-6-openjdk-amd64/bin/jstack 12345
2. Run jstack as the same user as the target JVM

You need to run jstack as the same user under which the target JVM is running. For example, if you're running jetty as a user named jetty (and the jetty process ID is 12345), use sudo to execute jstack as the jetty user:

sudo -u jetty jstack 12345
(I learned this trick from Michael Moser's jstack - the missing manual blog post — apparently jstack uses a named pipe to communicate with the target JVM process, and that pipe's permissions allow only the user who created the target JVM process to read or write the pipe.)
3. Try, try again

Sometimes, even if you do those first two things, jstack will still tell you to go get bent (or some other inscrutable error message of similar intent). I've found that if I just try running it again a couple of times, jstack magically will work on the second or third try.
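
A hypothetical retry loop (again assuming a jetty process ID of 12345) might look like this:

# try up to 3 times to capture a thread dump to /tmp/threads.txt
for I in 1 2 3; do
    sudo -u jetty /usr/lib/jvm/java-6-openjdk-amd64/bin/jstack 12345 > /tmp/threads.txt && break
    sleep 1
done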

4. Don't use -F

Even though jstack itself will sometimes suggest that you try -F (particularly if you've got a version mismatch between jstack and the target JVM), resist the temptation to "force" it. When you use jstack with the -F option, jstack will actually stop the target process (i.e. kill -STOP). Only use the -F option if your app is already good and hung (because it certainly will be once you use -F).

Thursday, September 26, 2013

Overriding toString() in Groovy Using Grails' ExtendedProxy

In Groovy, most of the time you can override the behavior of an object instance's method using the object's metaClass property; for example, to make the following code print "Yes" instead of "true":

def x = true
x.metaClass.toString { -> delegate ? 'Yes' : 'No' }
println x.toString()

But particularly with toString(), there are some cases (documented in GROOVY-2599) where this doesn't work; for example, the following code will still print "true":

def x = true
x.metaClass.toString { -> delegate ? 'Yes' : 'No' }
println "${x}"

To get around this issue for a project on which I was working recently, I used Grails' ExtendedProxy class to wrap other object instances for which I wanted to override the toString() method. The ExtendedProxy class delegates calls to get and set properties on the wrapped object, as well as method invocations. (It extends Groovy's Proxy class, which delegates method invocations only.)

This allowed me to apply some pretty formatting to a few standard Java objects (such as formatting Date objects with a US-style date format) without choosing between proxying every property/method explicitly or losing the other aspects of the wrapped objects' functionality. To maintain the functionality I wanted, the only other method besides toString() that I found I needed to proxy explicitly was asBoolean() (allowing wrapped collections to behave as falsey when empty).

This was the wrapper class I ended up creating:

class PrettyToStringWrapper extends grails.util.ExtendedProxy {

    /** Wraps only if it makes a difference for the specified object. */
    static Object wrapMaybe(Object o) {
        (
            o instanceof Collection ||
            o instanceof Date ||
            o instanceof Boolean
        ) ? new PrettyToStringWrapper().wrap(o) : o
    }

    /** Proxies truthy and falsey. */
    boolean asBoolean() {
        getAdaptee().asBoolean()
    }

    /** Overrides toString() with pretty implementation. */
    String toString() {
        def wrapped = getAdaptee()

        if (wrapped instanceof Collection)
            return wrapped.toString().replaceAll(/^\[|\]$/, '')

        if (wrapped instanceof Date)
            return wrapped.format('MM/dd/yyyy')

        if (wrapped instanceof Boolean)
            return wrapped ? 'Yes' : 'No'

        return wrapped.toString()
    }

}

And used it like this:

def emptyList = PrettyToStringWrapper.wrapMaybe([])
println "${emptyList ? 'full' : 'empty'} list contains ${emptyList}"
// prints 'empty list contains '

def fullList = PrettyToStringWrapper.wrapMaybe([1, 2, 3])
println "${fullList ? 'full' : 'empty'} list contains ${fullList}"
// prints 'full list contains 1, 2, 3'

def date = PrettyToStringWrapper.wrapMaybe(new Date(0))
println "epoch begins on ${date}"
// prints 'epoch begins on 12/31/1969' (in US timezones)

def yup = PrettyToStringWrapper.wrapMaybe(true)
println "${yup} this is true"
// prints 'Yes this is true'

I added the static wrapMaybe() method to avoid wrapping objects needlessly — one caveat I found to using ExtendedProxy was that it doesn't proxy the dynamic properties of fancier classes which implement Groovy's special propertyMissing() method (propertyMissing() allows those classes to provide properties without declaring them anywhere).

And one other thing to watch out for when using the ExtendedProxy class is that you must reference the wrapped object via the getAdaptee() method instead of simply accessing the adaptee property (the adaptee property is defined by the Proxy class). Accessing the adaptee property results in a call to the wrapper's getProperty() method, which ExtendedProxy delegates to the wrapped object (raising an IllegalArgumentException, since the wrapped object won't have an adaptee property); getAdaptee(), by contrast, reads the wrapper's adaptee property directly, without a call to getProperty().
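
For example, with the PrettyToStringWrapper above:

def wrapped = PrettyToStringWrapper.wrapMaybe(new Date(0))

// fine: getAdaptee() reads the wrapper's own adaptee field directly
assert wrapped.getAdaptee() instanceof Date

// blows up: this property access is delegated to the wrapped Date,
// which has no adaptee property
println wrapped.adaptee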

Saturday, August 17, 2013

Archiva Repository Manager for Grails

I recently deployed Apache Archiva as a Maven repository manager for use by our Grails projects (primarily as a local cache for remote artifacts). The default configuration for Archiva does this almost completely out-of-the-box — it just needed a little extra configuration for Grails plugins. Here are the steps I took to install Archiva on Ubuntu 12.04 and configure our Grails projects to use it:

Install Archiva

There isn't yet an Ubuntu apt package for Archiva, so you have to download and install it manually. It's pretty straightforward, though:

# download archiva 1.3.6
wget http://download.nextag.com/apache/archiva/1.3.6/binaries/apache-archiva-1.3.6-bin.tar.gz

# extract and move to /opt/archiva
tar xf apache-archiva-1.3.6-bin.tar.gz
sudo mv apache-archiva-1.3.6 /opt/.
sudo ln -s /opt/apache-archiva-1.3.6 /opt/archiva

# delete wrapper-linux-x86-32 files (if you're using 64-bit linux -- otherwise keep them!)
sudo rm /opt/archiva/bin/wrapper-linux-x86-32
sudo rm /opt/archiva/lib/libwrapper-linux-x86-32.so

# create archiva working dir with the default conf files
sudo mkdir /srv/archiva
sudo cp -r /opt/archiva/conf /srv/archiva/.
sudo mkdir /srv/archiva/data
sudo mkdir /srv/archiva/logs

# add daemon user
sudo useradd -r archiva
sudo chown -R archiva:archiva /srv/archiva

# create daemon script
echo '#!/bin/sh -e
#
# /etc/init.d/archiva daemon control script
#
### BEGIN INIT INFO
# Provides:          archiva
# Required-Start:    $local_fs $remote_fs $network
# Required-Stop:     $local_fs $remote_fs $network
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start Archiva
# Description:       Start/Stop Apache Archiva at /opt/archiva.
### END INIT INFO

export ARCHIVA_BASE=/srv/archiva
export RUN_AS_USER=archiva

/opt/archiva/bin/archiva $@
' | sudo tee /etc/init.d/archiva
sudo chmod +x /etc/init.d/archiva
sudo update-rc.d archiva defaults 80 20

The above script will install Archiva 1.3.6 at /opt/archiva, create a working dir for it at /srv/archiva, create a new, unprivileged archiva user, and create an /etc/init.d/archiva script to run Archiva as a daemon. You can now start Archiva with the command sudo service archiva start, and it automatically will start whenever the machine boots.
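
Since the init script just passes its arguments through to Archiva's own wrapper script, the usual service commands should all work:

sudo service archiva start
sudo service archiva status
sudo service archiva stop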

Before I started it, however, I also configured Archiva to use a MySQL DB as its data store, since MySQL was already running on the same box (Archiva uses Apache Derby by default). To do so, I first created a database for Archiva and a database for Apache Redback (which Archiva uses for its user store):

echo "create archiva mysql db as root..."
mysql -uroot -p -e '
    CREATE DATABASE archiva DEFAULT CHARACTER SET ascii;
    GRANT ALL ON archiva.* TO "archiva" IDENTIFIED BY "archiva-secret-password";

    CREATE DATABASE redback DEFAULT CHARACTER SET ascii;
    GRANT ALL ON redback.* TO "redback" IDENTIFIED BY "redback-secret-password";
'

And configured Archiva to use MySQL by altering its /srv/archiva/conf/jetty.xml configuration file to use MySQL settings in place of Derby:

  <!-- Archiva Database -->

  <New id="archiva" class="org.mortbay.jetty.plus.naming.Resource">
    <Arg>jdbc/archiva</Arg>
    <Arg>
      <New class="com.mysql.jdbc.jdbc2.optional.MysqlDataSource">
        <Set name="serverName">localhost</Set>
        <Set name="databaseName">archiva</Set>
        <Set name="user">archiva</Set>
        <Set name="password">archiva-secret-password</Set>
      </New>
    </Arg>
  </New>

  <New id="archivaShutdown" class="org.mortbay.jetty.plus.naming.Resource">
    <Arg>jdbc/archivaShutdown</Arg>
    <Arg>
      <New class="com.mysql.jdbc.jdbc2.optional.MysqlDataSource">
        <Set name="serverName">localhost</Set>
        <Set name="databaseName">archiva</Set>
        <Set name="user">archiva</Set>
        <Set name="password">archiva-secret-password</Set>
      </New>
    </Arg>
  </New>

  <!-- Users / Security Database -->

  <New id="users" class="org.mortbay.jetty.plus.naming.Resource">
    <Arg>jdbc/users</Arg>
    <Arg>
      <New class="com.mysql.jdbc.jdbc2.optional.MysqlDataSource">
        <Set name="serverName">localhost</Set>
        <Set name="databaseName">redback</Set>
        <Set name="user">redback</Set>
        <Set name="password">redback-secret-password</Set>
      </New>
    </Arg>
  </New>

  <New id="usersShutdown" class="org.mortbay.jetty.plus.naming.Resource">
    <Arg>jdbc/usersShutdown</Arg>
    <Arg>
      <New class="com.mysql.jdbc.jdbc2.optional.MysqlDataSource">
        <Set name="serverName">localhost</Set>
        <Set name="databaseName">redback</Set>
        <Set name="user">redback</Set>
        <Set name="password">redback-secret-password</Set>
      </New>
    </Arg>
  </New>

And finally, linked the MySQL java driver into Archiva's lib directory:

sudo ln -s /usr/share/java/mysql.jar /opt/archiva/lib/.
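
On Ubuntu, that /usr/share/java/mysql.jar file comes from the libmysql-java package, so install it first if it isn't already present:

sudo apt-get install -y libmysql-java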

Proxy Archiva Thru Apache

Archiva runs on port 8080 by default. To avoid conflicts with other services, I changed it to port 6161:

sudo perl -pli -e 's/(name="jetty.port" default=")\d+/${1}6161/' /srv/archiva/conf/jetty.xml

(Restart Archiva after making this change, if you've already started it.) Then I added an Apache (web server) virtual host for it at /etc/apache2/sites-available/archiva, to proxy it from port 80 (running on a server with a DNS entry of archiva.example.com):

echo '
<VirtualHost *:80>
    ServerName archiva.example.com

    ProxyPreserveHost On
    RewriteEngine On

    # redirect / to /archiva
    RewriteRule ^/$ /archiva [L,R=301]

    # forward all archiva requests to archiva servlet
    RewriteRule (.*) http://localhost:6161$1 [P]
</VirtualHost>
' | sudo tee /etc/apache2/sites-available/archiva

sudo a2enmod proxy proxy_http rewrite
sudo a2ensite archiva
sudo service apache2 restart

Now you should be able to access Archiva simply by navigating to http://archiva.example.com/ (which will redirect to http://archiva.example.com/archiva). The first time you access it, you'll be prompted to create a new admin user. Do that so you can configure a few more things.

Add Proxy Connector for Grails Plugins

Once Archiva is up and running, and you've logged in as admin, navigate to the "Administration" > "Repositories" section of Archiva by using the leftnav. Click the "Add" link on the right side of the "Remote Repositories" section of the page, and enter the following settings:

Identifier: grails-plugins
Name: Grails Plugins
URL: http://repo.grails.org/grails/plugins/
Username:
Password:
Timeout in seconds: 60
Type: Maven 2.x Repository

Click the "Add Repository" button to save the new remote repo. Then navigate to the "Administration" > "Proxy Connectors" section using the leftnav. Click the "Add" link at the top-right of the page, and enter the following settings:

Network Proxy: (direct connection)
Managed Repository: internal
Remote Repository: grails-plugins
Return error when: always
Cache failures: yes
Releases: once
On remote error: stop
Checksum: fix
Snapshots: hourly

Click the "Save Proxy Connector" button to save the new proxy connector. The Archiva server should now be acting as a proxy for the Grails Plugins repo. It already comes configured as a proxy for the Maven Central repo, so you should be ready to use it with Grails.

Update BuildConfig.groovy

You can now comment out all the other default repos in the repositories section of the conf/BuildConfig.groovy files of your various Grails projects, and add a repo entry for your new Archiva server:

    repositories {
        mavenRepo 'http://archiva.example.com/archiva/repository/internal/'

        //grailsPlugins()
        //grailsHome()
        //grailsCentral()
        //mavenCentral()
        //mavenLocal()
        //...
    }

After updating your BuildConfig.groovy file, test out your changes by deleting your ivy2 cache folder (~/.ivy2/cache), and running a clean grails build (which will re-download all the dependencies for the Grails project through your new Archiva server).
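
For example, with Grails 2.x, something like this should force everything to be re-resolved through Archiva:

# wipe the local ivy cache, then re-resolve all dependencies through archiva
rm -rf ~/.ivy2/cache
grails clean
grails refresh-dependencies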

Friday, August 16, 2013

Jenkins EC2 Slave Plugin

Just finished configuring our Jenkins build machine to use the Jenkins EC2 Plugin (currently at version 1.18), which allows the Jenkins server to spin up AWS EC2 instances on demand to use as slaves for selected build jobs. It's very cool, and requires only a little bit of configuration to get it running your existing jobs automatically on EC2 slaves. These were the steps I had to take:

Create an IAM user for Jenkins

The Jenkins server needs access to your AWS account in order to run and kill EC2 instances; the best way to enable this is to create a separate IAM user that is used only by Jenkins, and has only the minimum permissions required. I used the IAM section of the AWS Console for this (although you can also do it via command line). When you create the new IAM user, create an access key for the user, and make sure you save the secret key part of it, since AWS does not store the secret key (you have to generate a new access key if you lose the secret part of it). You will use this access key when you configure the Jenkins EC2 plugin.

Once you create the user, you need to attach a policy to it that will allow the user to run and kill EC2 instances. Via trial-and-error, I found that these were the minimum permissions currently required by the Jenkins EC2 plugin (I limited it to a single region, us-west-2; if you want to allow Jenkins to manage instances in all regions, remove the ec2:Region condition):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:DescribeRegions"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "ec2:CreateTags",
                "ec2:DescribeInstances",
                "ec2:DescribeKeyPairs",
                "ec2:GetConsoleOutput",
                "ec2:RunInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:TerminateInstances"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "ec2:Region": "us-west-2"
                }
            }
        }
    ]
}
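
If you'd rather script this than click through the console, something like the following should work with the AWS CLI (the jenkins user name and jenkins-ec2-slaves policy name are just examples, and the policy JSON above is assumed to be saved as jenkins-ec2-policy.json):

aws iam create-user --user-name jenkins

# prints the AccessKeyId and SecretAccessKey -- save them both
aws iam create-access-key --user-name jenkins

# attach the policy above as an inline policy
aws iam put-user-policy --user-name jenkins \
    --policy-name jenkins-ec2-slaves \
    --policy-document file://jenkins-ec2-policy.json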

Create an SSH key pair for Jenkins

Once the Jenkins server starts a new EC2 slave, it will connect to it via SSH, using public-key authentication. At minimum, you need to provide Jenkins with the private key of a user account that has password-less sudo privileges on the slave; the easiest way to do this with a stock AMI image is to just let Jenkins use the image's default account via the EC2-registered key-pair used to boot it.

I just used the stock Ubuntu AMI (12.04, 64-bit instance-store variety), where the default user (ubuntu) has password-less sudo privileges. And I created an SSH key-pair for Jenkins to use via the EC2 section of the AWS Console (although you can also do it via the command line). As with the secret part of the IAM user's access key, make sure you save the private key for the key pair, since AWS does not store it (you'll have to generate a new key pair if you lose the private key).
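
For the command-line route, something like this should create the key pair and save its private key locally (jenkins-slave is just an example name):

aws ec2 create-key-pair --key-name jenkins-slave --region us-west-2 \
    --query KeyMaterial --output text > jenkins-slave.pem
chmod 600 jenkins-slave.pem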

Create an EC2 security-group for the slave

Whenever you launch an EC2 instance, you must specify the "security group" to which the instance belongs. This is basically just a set of firewall rules for the built-in EC2 firewall that restrict inbound access to a limited set of ports (optionally from a limited set of hosts). You can change the firewall rules for a group even while that group has running instances, but you can't switch an instance to use a different group once the instance has been launched. So I recommend creating a separate group just for your Jenkins slaves, even if the group has the same rules as you use for other security groups (so that you can change the rules for the different groups independently, should the need ever arise).

The only port you'll need inbound access to on the slaves is 22 (SSH). Optionally, you can set the source for that rule to the IP address of the Jenkins master server (if you want to disallow any inbound access to the slave other than from the master).
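
Via the command line, something like this should create the group and open port 22 to the master (jenkins-slaves is an example group name; substitute the master's actual address for 203.0.113.10):

aws ec2 create-security-group --region us-west-2 \
    --group-name jenkins-slaves --description "Jenkins EC2 slaves"

# allow SSH only from the Jenkins master
aws ec2 authorize-security-group-ingress --region us-west-2 \
    --group-name jenkins-slaves --protocol tcp --port 22 --cidr 203.0.113.10/32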

Create an init script for the slave

When Jenkins starts a new EC2 slave (and once the slave has booted and Jenkins connects to it), Jenkins will run a script that you specify to prepare the slave to run builds. Jenkins automatically will bootstrap the Jenkins-specific environment on the slave (once your init script has run), so you really only need to set up a few things, like java and git (or svn, hg, etc).

Following is the init script I used with the stock Ubuntu AMI. It installs a few things via Apt; installs a few particular versions of Grails that we use; and then installs and runs the latest version of PhantomJS (which we need to have running with some specific arguments for our functional tests). It also sets up a working directory for Jenkins to use as the ubuntu user at /srv/jenkins; creates a bigger swap file than comes with the Ubuntu AMIs; and moves the /srv, /var, and /tmp directories to the faster "ephemeral" drive on the EC2 instance (mounted on the smaller EC2 instances at /mnt):

#!/bin/sh

# keep trying to install until it works
for I in 1 2 3; do
    sudo DEBIAN_FRONTEND=noninteractive apt-get -y -q update
    sudo DEBIAN_FRONTEND=noninteractive apt-get -y -q upgrade
    sudo DEBIAN_FRONTEND=noninteractive apt-get -y -q install \
        git-core \
        groovy \
        openjdk-6-jdk \
        vim \
        zip
    sleep 1
done

# fix missing java links
if [ ! -e /usr/lib/jvm/default-java ]; then
    sudo ln -s /usr/lib/jvm/java-6-openjdk-amd64 /usr/lib/jvm/default-java
fi
if [ ! -e /usr/lib/jvm/java-6-openjdk ]; then
    sudo ln -s /usr/lib/jvm/java-6-openjdk-amd64 /usr/lib/jvm/java-6-openjdk
fi

# download grails if necessary
if [ ! -e /opt/grails-2.1.0 ]; then
    # download grails zip
    if [ ! -e "/tmp/grails.zip" ]; then
        wget -nd -O /tmp/grails.zip http://dist.springframework.org.s3.amazonaws.com/release/GRAILS/grails-1.3.7.zip
    fi
    # unzip grails
    unzip /tmp/grails.zip
    # move to /opt
    sudo mv grails-1.3.7 /opt/.

    # download grails zip
    if [ ! -e "/tmp/grails2.zip" ]; then
        wget -nd -O /tmp/grails2.zip http://dist.springframework.org.s3.amazonaws.com/release/GRAILS/grails-2.1.0.zip
    fi
    # unzip grails
    unzip /tmp/grails2.zip
    # move to /opt
    sudo mv grails-2.1.0 /opt/.

    # expand max-memory size for grails
    sudo perl -pli -e 's/Xmx\d+/Xmx2048/; s/MaxPermSize=\d+/MaxPermSize=1024/' /opt/grails*/bin/startGrails
fi

# download phantomjs if necessary
HAS_PHANTOMJS=`whereis phantomjs | grep bin`
if [ -z "$HAS_PHANTOMJS" ]; then
    # download phantomjs binary
    if [ ! -e "/tmp/phantomjs.tar.bz2" ]; then
        wget -nd -O /tmp/phantomjs.tar.bz2 https://phantomjs.googlecode.com/files/phantomjs-1.9.1-linux-x86_64.tar.bz2
    fi
    # unzip phantomjs
    tar xf /tmp/phantomjs.tar.bz2
    # move to /opt
    sudo mv phantomjs-1.9.1-linux-x86_64 /opt/.
    sudo ln -s /opt/phantomjs-1.9.1-linux-x86_64 /opt/phantomjs
    # run in background
    nohup /opt/phantomjs/bin/phantomjs \
        --webdriver=7483 \
        --webdriver-logfile=webdriver.log \
        --ignore-ssl-errors=true \
        > phantomjs.log 2>&1 &
fi

# creating jenkins working directory
sudo mkdir /srv/jenkins && sudo chown ubuntu:ubuntu /srv/jenkins

# add more swap
SWAPFILE=/mnt/swap1
if [ ! -f $SWAPFILE ]; then
    # creates 2G (1M * 2K) /mnt/swap1 file
    sudo dd if=/dev/zero of=$SWAPFILE bs=1M count=2K
    sudo chmod 600 $SWAPFILE
    sudo mkswap $SWAPFILE

    # add new swap to config and start using it
    echo "$SWAPFILE none swap defaults 0 0" | sudo tee -a /etc/fstab
    sudo swapon -a
fi

# move /srv, /tmp, and /var to the larger/faster/transient /mnt volume
if [ ! -e /mnt/tmp ]; then
    # move each directory, then create link from old location to new
    for DIR in /srv /tmp /var; do
        echo $DIR
        sudo mv $DIR /mnt$DIR
        sudo mkdir $DIR
        echo "/mnt$DIR  $DIR    none    bind    0   0" | sudo tee -a /etc/fstab
        sudo mount $DIR
    done
fi

One of the few quirky things the script does is retry apt-get three times — I found that at least half the time, the first run of apt-get would fail to update/install any packages without any comprehensible error messages (I think maybe because the system was still lazy-initializing some components?). Re-running it after a second seems to solve the issue, however.

Also, while installing only the openjdk-6-jdk package installs enough of java 6 for our build purposes, it doesn't install some of the links in the /usr/lib/jvm directory through which we reference the jdk or java home in our build scripts (these links usually are created by some other unknown java packages); so the script manually creates these java links.

And finally, the default JVM memory settings used by grails aren't sufficient for running or testing our apps; the quick fix for this is just to overwrite the Xmx and MaxPermSize settings in the default GRAILS_OPTS of the startGrails script that comes with each version of grails.

Configure the Jenkins EC2 plugin

Now, finally, you're ready to configure the EC2 plugin itself. Once you've installed the plugin, navigate to the main "Manage Jenkins" > "Configure System" page, and scroll down near the bottom to the "Cloud" section. There, click the "Add a new cloud" button, and select the "Amazon EC2" option. This will display the UI for configuring the EC2 plugin. The first thing to configure is the access key that you created for the Jenkins IAM user (via the "Access Key ID" and "Secret Access Key" fields). If you've configured the permissions for the Jenkins IAM user correctly, this will populate the "Region" dropdown, and allow you to select the AWS region to use. Next, paste in the text from the secret-key pem file for the SSH key pair that you created for Jenkins in the "EC2 Key Pair's Private Key" field (this text will start with the line "-----BEGIN RSA PRIVATE KEY-----").

In the "AMIs" section of the configuration UI, you'll configure the per-slave settings. If you want, you can generate multiple slave profiles (from different AMI images, or different instance sizes, or different initialization parameters, etc). But you can start with just one, which you can add by clicking the "Add" button at the bottom of the "AMIs" section.

For each slave profile, you configure the profile with a description ("Standard EC2 Slave"), as well as the ID of the AMI to use (I used the ami-5168f861 AMI, the current official Ubuntu 12.04, 64-bit instance-store AMI). Next, select the instance type; a "micro" instance is probably too small for just about anything other than a one-line shell job; a "small" instance may be fine for some jobs; but with most of our builds (which all include at least one grails build step), we get a 3-4x improvement over "small" instance times with a "medium" instance (which is 2x the price of a "small").

You can optionally select a specific availability-zone within your selected region in which to launch the slave; this matters a lot if you have an existing EBS volume that you will attach to the slave, and it matters a little if you have other EC2 instances which the slave is going to access (like if the master Jenkins server is in EC2, or if you have a version-control repo in EC2 from which the slave is going to download source code, etc); otherwise you can just leave the "Availability Zone" field blank, and AWS automatically will launch the slave in whatever zone of your configured region is least active when the slave is launched.

Enter the name of the EC2 security-group that you created for the slave in the "Security group names" field. Enter the path to the directory that Jenkins should use as its working directory (similar to the /var/lib/jenkins directory on the master); you probably created this in your slave init script (it should be writable by the user Jenkins uses on the slave; the directory that my init script created for this was /srv/jenkins). Specify the user Jenkins should run as on the slave in the "Remote user" field; on a stock Ubuntu AMI, this is ubuntu. And unless you use root for this user, enter sudo in the "Root command prefix" field.

Enter one or more labels (separated by spaces) in the "Labels" field. You can assign jobs to specific slave profiles via labels, so if you have multiple slave profiles, you may want to include a label for each distinguishing feature of the slave. For example, if you have one profile for a small instance with no DB, you might label it just "small"; and for another profile using a medium instance with a MySQL DB, you might label it "medium mysql". Then for a job that only needs a small instance, you can set the job to use the slave labeled "small"; for a job that requires a medium instance, you can set it to use the slave labeled "medium"; and for a job that requires MySQL, you can set the job to use the slave with the "mysql" label. If you have only one slave profile, you can just use a simple label like "slave" or "ec2" (and then configure any job that you want to run on the slave with the "slave" or "ec2" label).

The default "Usage" setting for both the master Jenkins server and each slave server is to "Utilize this slave as much as possible". This means that Jenkins will not boot a slave for a job unless the job specifically has been configured (via label) to use a slave that currently is not running, or unless all the build-executors on the master currently are in use. If you instead change the slave's "Usage" setting to "Leave this machine for tied jobs only", Jenkins will use the slave only if it can't run the job on the master or any other slaves. See the "Usage" settings matrix table below for a clearer description of the interaction of the "Usage" setting between slaves and the master.

Set "Idle termination time" to the number of minutes a slave must sit at idle before Jenkins shuts it down (default is 30 minutes — if you use the slaves only for scheduled jobs, you might want to cut this down to 0). Paste your init script into the "Init script" field. Jenkins executes the init script by writing your script to a file on the slave, setting execute permissions on it, and then running the file as the user you specified in the "Remote user" field — so make sure that you include #!/bin/sh at the top of the script (if it's a shell script, or the appropriate "sha-bang" if you use a different scripting language).

Click the "Advanced..." button at the bottom of the AMI's configuration section to access a few more options. A couple of settings that you may want to customize are "Number of executors" (you may want to set this to 1 — unless you intend to run multiple jobs on the same slave at the same time); and "Instance Cap", which is the maximum number of slaves from this profile that Jenkins can have running at the same time.

Click the "Save" button at the bottom of the page to save and apply your changes.

Configure the job-selecting behavior of the master

Once you've saved your first slave profile, go back to the same "Manage Jenkins" > "Configure System" page, and at the top of the page you'll find a new "Labels" and a new "Usage" field (just below the "# of executors" field). The "Usage" field determines how Jenkins utilizes the master for jobs. If you leave the "Utilize this slave as much as possible" option selected, whenever a job is triggered that either doesn't have a label, or is labeled so that it could be executed either by the master or by another slave, Jenkins will run the job on an already-running slave only if the slave has a free executor; but otherwise it will try to run the job on the master, and only boot a new slave if the master has no free executors. If you change the "Usage" field to "Leave this machine for tied jobs only", whenever a job is triggered that either doesn't have a label, or is labeled so that it could be executed either by the master or by another slave, Jenkins will first try a running slave, and then try booting a slave; and only try using the master if it can't boot any more slaves.

Here's a description of the interaction between the master and slave "Usage" settings in tabular form:

"Usage" setting matrix

Slave: "Utilize this slave as much as possible" / Master: "Utilize this slave as much as possible"
  1. use executor on running slave if free
  2. use executor on master if free
  3. boot new slave if below instance cap
  4. wait

Slave: "Utilize this slave as much as possible" / Master: "Leave this machine for tied jobs only"
  1. use executor on running slave if free
  2. boot new slave if below instance cap
  3. use executor on master if free
  4. wait

Slave: "Leave this machine for tied jobs only" / Master: "Utilize this slave as much as possible"
  1. use executor on master if free
  2. use executor on running slave if free
  3. boot new slave if below instance cap
  4. wait

Slave: "Leave this machine for tied jobs only" / Master: "Leave this machine for tied jobs only"
  1. use executor on running slave if free
  2. use executor on master if free
  3. boot new slave if below instance cap
  4. wait

Note that if a job is labeled with a label that the master doesn't have, Jenkins will not run it on the master regardless of the "Usage" setting — it will wait until it can boot or otherwise free up an executor on a slave that does have that label. So if there are jobs that you do want to run on the master, do give the master a label or two via the "Labels" field. For example, if you want to allow only small jobs and one or two master-specific jobs to run on the master, you might label the master "small master".

Earmark jobs for slaves with labels

The last step is to re-configure individual jobs to run on your new slaves. You can skip this step entirely if any slave can take any job — just configure the "Usage" setting of the master and slaves to indicate how Jenkins should utilize the slaves (if it should boot slaves to run jobs on them even if the master is free, or if it should max out the master before booting slaves).

Otherwise, navigate to the configuration page of each job for which you want to specify the type of slave to run the job, and check the "Restrict where this project can be run" checkbox at the bottom of the first section of the page. This will reveal the "Label expression" field; enter the label (or space-separated labels) that defines what kind of slave the job requires. For example, if the job requires a MySQL DB, you might enter "mysql" as the label (requiring a slave with the "mysql" label); or if the job can run on any small instance, you might enter "small" as the label (requiring a slave with the "small" label).

The next time you trigger a job that is labeled for a slave, Jenkins automatically will boot the slave (if no slaves of that type are currently running and have free executors), and run the job. The display of Jenkins' leftnav will also change, to include the list of executors on the slave in the "Build Executor" box (and when the slave is terminated, the slave's executors will be removed from this box).

Sunday, July 28, 2013

Moving the MySQL Tmpdir on Ubuntu

One thing I frequently forget when changing the default directories for various services in Ubuntu is that the AppArmor config for those services also needs to be updated. Case in point, the other day I needed to change MySQL's tmpdir to a different disk, and struggled for a while with "Can't create/write to file 'xyz' (Errcode: 13)" errors until I remembered AppArmor. These were the steps I ended up taking:

Create the new tmpdir
sudo mkdir /mnt/foo/tmp && sudo chown mysql:mysql /mnt/foo/tmp
Change /etc/mysql/my.cnf to use the new tmpdir
tmpdir = /mnt/foo/tmp
Add new tmpdir entries to /etc/apparmor.d/local/usr.sbin.mysqld
/mnt/foo/tmp/ r,
/mnt/foo/tmp/** rw,
Reload AppArmor
sudo service apparmor reload
Restart MySQL
sudo service mysql restart
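Verify MySQL picked up the new tmpdir
mysql -uroot -p -e 'SELECT @@tmpdir'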

While I was troubleshooting, I found a nice, in-depth blog entry by Jeremy Smyth explaining how to debug issues with AppArmor and MySQL.

Sunday, July 21, 2013

Tuning Lucene to Get the Most Relevant Results

Just spent the last week tuning our search engine using the latest version of Lucene (4.3.1). While Lucene works amazingly well right out of the box, to get "Google-like" relevancy for your results, you usually need to devise a custom strategy for indexing and querying the particular content your application has. Here are a few tricks we used for our content (which is English-only, jargon-heavy, and contains many terms used only by a few documents), plus some more basic techniques that just took us a while to figure out:

Use a custom analyzer for English text

Lucene's StandardAnalyzer does a good job generally of tokenizing text into individual words (aka "terms"), and it skips English "stopwords" (like the, a, etc) by default — but if you have only English text, you can get better results by using the EnglishAnalyzer. Beyond the tokenizing filters that the StandardAnalyzer includes, the EnglishAnalyzer also includes the EnglishPossessiveFilter (for stripping 's from words) and the PorterStemFilter (for chopping off common word suffixes, like removing ming from stemming, etc).

Because some of our text includes non-English names with letters not in the English alphabet (like é in liberté), and we know our users are going to want to search for those names using just English-alphabet letters, we implemented our own analyzer that included the ASCIIFoldingFilter on top of the filters in the regular EnglishAnalyzer. This filter converts characters outside the 7-bit ASCII range to the ASCII characters they most closely resemble; for example, it converts é to e (and © to (c), etc).

A custom analyzer is easy to implement; this is what ours looks like in java (the matchVersion and stopwords variables are fields from its Analyzer and StopwordAnalyzerBase superclasses, and the TokenStreamComponents is an inner class of Analyzer):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public class CustomEnglishAnalyzer extends StopwordAnalyzerBase {

    /** Tokens longer than this length are discarded. Defaults to 50 chars. */
    public int maxTokenLength = 50;

    public CustomEnglishAnalyzer() {
        super(Version.LUCENE_43, StandardAnalyzer.STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final Tokenizer source = new StandardTokenizer(matchVersion, reader);
        source.setMaxTokenLength(maxTokenLength);

        TokenStream pipeline = source;
        pipeline = new StandardFilter(matchVersion, pipeline);
        pipeline = new EnglishPossessiveFilter(matchVersion, pipeline);
        pipeline = new ASCIIFoldingFilter(pipeline);
        pipeline = new LowerCaseFilter(matchVersion, pipeline);
        pipeline = new StopFilter(matchVersion, pipeline, stopwords);
        pipeline = new PorterStemFilter(pipeline);
        return new TokenStreamComponents(source, pipeline);
    }
}

Note that when you use a custom analyzer for indexing, it's important to use the same (or at least a similar) analyzer for querying (and vice versa). For example, the EnglishAnalyzer will tokenize the phrase it's easily processed into two terms: easili (sic) and process. If you index this text with the EnglishAnalyzer, searching for the terms it's, easily, or processed will find no results — you have to create the query using the same analyzer to make sure the terms for which you're querying are actually easili and process.

You can use Lucene's StandardQueryParser to build an appropriate query for you out of a phrase, using Lucene's fancy querying syntax; or you can simply tokenize the phrase yourself with the following code, and build the query out of it yourself:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

...

    List<String> tokenizePhrase(String phrase) throws IOException {
        List<String> tokens = new ArrayList<String>();
        TokenStream stream = new EnglishAnalyzer(Version.LUCENE_43).tokenStream(
            "someField", new StringReader(phrase));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        stream.reset();
        while (stream.incrementToken())
            tokens.add(term.toString());
        stream.end();
        stream.close();

        return tokens;
    }
Use a custom scorer

The results you get back from a query and their order are heavily influenced by a number of factors: the text you have in your index, how you've tokenized and stored the text in the different fields of your index, and how you structure the query itself. You can also influence the ordering of results to a lesser degree by using a custom Similarity class when you build your index.

Lucene's default similarity class uses some fancy math to score the terms in its index (see this Lucene Scoring tutorial for a simpler explanation of the scoring algorithm), and you'll probably want to tweak only one or two of those factors. We implemented our own custom Similarity class that completely ignores document length, and provides a bigger boost for infrequently-appearing terms:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class CustomSimilarity extends DefaultSimilarity {

    @Override
    public float lengthNorm(FieldInvertState state) {
        // simply return the field's configured boost value
        // instead of also factoring in the field's length
        return state.getBoost();
    }

    @Override
    public float idf(long docFreq, long numDocs) {
        // more-heavily weight terms that appear infrequently
        return (float) (Math.sqrt(numDocs/(double)(docFreq+1)) + 1.0);
    }
}

Once implemented, you can use this CustomSimilarity class when indexing by setting it on the IndexWriterConfig that you use for writing to the index, like this:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

...

    void indexSomething() throws IOException {
        EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_43);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
        config.setSimilarity(new CustomSimilarity());

        FSDirectory directory = FSDirectory.open(new File("my-index"));
        IndexWriter writer = new IndexWriter(directory, config);
        // ... index something ...
        writer.close();
    }
Build your own query

Probably the single biggest way we improved our "result relevancy" in the eyes of our users was to build our queries programmatically from a user's query input, rather than asking them to use Lucene's standard query syntax. Our algorithm for generating queries first expands any abbreviations in the query (not using Lucene, just using an in-memory hashtable of our own custom list of abbreviations); then it builds a big query consisting of:

  1. the exact query phrase (with a little slop), boosted heavily
  2. varying combinations of the terms in the query phrase, boosted according to the number of matching terms
  3. individual terms in individual fields (using the boost associated with those fields)
  4. individual terms with no boost

This querying strategy complements our indexing strategy, which is to index a few important fields of each document separately (like "name", "keywords", etc) with boost added to those fields at index time; and then to index all the text related to each document in one big fat field (the "all" field) with no boost associated with it. The parts of the query that check for different terms appearing in the same document (#1 and #2 from the list above) rely on the "all" field; whereas the parts of the query that check in which fields the terms appear (#3 and #4) make use of the other, specially-boosted fields.

Doing it this way allows us to instruct Lucene to weight results that contain more matches of different terms (or the exact phrase) more heavily than results that simply match the same term many times; but also to weight matches in important fields (like "name" and "keywords") above matches from the general text of the document.

The actual query-building part of our code looks like this (I removed the abbreviation-expanding bits for simplicity, though). The fields argument is the list of custom fields to query; the defaultField argument is the name of the "all" field; and it uses the tokenizePhrase() method from above to split the phrase into individual words:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

...

    Query buildQuery(String phrase, List<String> fields, String defaultField) throws IOException {
        List<String> words = tokenizePhrase(phrase);
        BooleanQuery q = new BooleanQuery();

        // create term combinations if there are multiple words in the query
        if (words.size() > 1) {
            // exact-phrase query
            PhraseQuery phraseQ = new PhraseQuery();
            for (int w = 0; w < words.size(); w++)
                phraseQ.add(new Term(defaultField, words.get(w)));
            phraseQ.setBoost(words.size() * 5);
            phraseQ.setSlop(2);
            q.add(phraseQ, BooleanClause.Occur.SHOULD);

            // 2 out of 4, 3 out of 4, 4 out of 4 (any order), etc
            // stop at 7 in case user enters a pathologically long query
            int maxRequired = Math.min(words.size(), 7);
            for (int minRequired = 2; minRequired <= maxRequired; minRequired++) {
                BooleanQuery comboQ = new BooleanQuery();
                for (int w = 0; w < words.size(); w++)
                    comboQ.add(new TermQuery(new Term(defaultField, words.get(w))), BooleanClause.Occur.SHOULD);
                comboQ.setBoost(minRequired * 3);
                comboQ.setMinimumNumberShouldMatch(minRequired);
                q.add(comboQ, BooleanClause.Occur.SHOULD);
            }
        }

        // create an individual term query for each word for each field
        for (int w = 0; w < words.size(); w++)
            for (int f = 0; f < fields.size(); f++)
                q.add(new TermQuery(new Term(fields.get(f), words.get(w))), BooleanClause.Occur.SHOULD);

        return q;
    }
Boost important fields when indexing

When we do the document indexing, we set the boost of some of the important fields (like "name" and "keywords", etc), as described above, while dumping all the document's text (including name and keywords) into the "all" field. Following is an example (in which we use our own customized FieldType so that we can configure the field with the IndexOptions that the result highlighter needs, discussed later). The toDocument() method translates some particular type of domain object to a Lucene Document, with appropriate "keywords", "name", "all", etc fields; it would be called by our indexing process (from the indexSomething() method above) for each instance of that domain type that we have in our system, in order to create a separate document with which to index each domain object:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;

...

    protected static final FieldType TEXT_FIELD_TYPE = getTextFieldType();

    static FieldType getTextFieldType() {
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setTokenized(true);
        return type;
    }

    Document toDocument(MyDomainObject domain) {
        Document doc = new Document();

        Field keywordsField = new Field("keywords", domain.keywords, TEXT_FIELD_TYPE);
        keywordsField.setBoost(3f);
        doc.add(keywordsField);

        Field nameField = new Field("name", domain.name, TEXT_FIELD_TYPE);
        nameField.setBoost(2f);
        doc.add(nameField);

        // ... other fields ...

        String all = new StringBuilder().
            append(domain.keywords).append("\n").
            append(domain.name).append("\n").
            append(domain.text).append("\n").
            append(domain.moreText).append("\n").
            toString();
        Field allField = new Field("all", all, TEXT_FIELD_TYPE);
        doc.add(allField);

        return doc;
    }
Filter by date with a NumericRangeQuery

Many of our individual documents are relevant only during a short time period, with the exact start and end dates defined by the document. When we query for anything, we query against a specific day chosen by the user. In our Lucene searches, we implement this with a filter that wraps a pair of NumericRangeQuerys, querying the "startDate" and "endDate" fields (although a more common scenario in other applications might be to have a single "publishedDate" for each document, and allow users to choose separate start and end dates against which to filter -- in that case, you'd use a single NumericRangeQuery). We index the "startDate" and "endDate" fields like this, using an integer field of the form 20010203 to represent a date like 2001-02-03 (Feb 3, 2001):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.IntField;
import org.apache.lucene.index.FieldInfo;

...

    protected static final FieldType DATE_FIELD_TYPE = new FieldType();
    static {
        DATE_FIELD_TYPE.setIndexed(true);
        DATE_FIELD_TYPE.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
        DATE_FIELD_TYPE.setNumericType(FieldType.NumericType.INT);
        DATE_FIELD_TYPE.setOmitNorms(true);
        DATE_FIELD_TYPE.setStored(true);
    }

    Document toDocument(MyDomainObject domain) {
        Document doc = new Document();

        Field startField = new IntField("startDate", domain.startDate, DATE_FIELD_TYPE);
        doc.add(startField);

        Field endField = new IntField("endDate", domain.endDate, DATE_FIELD_TYPE);
        doc.add(endField);

        // ... other fields ...

        return doc;
    }

Then we build a filter like this, caching a separate filter instance per date (dates again represented in integer form like 20010203 to stand for 2001-02-03):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.QueryWrapperFilter;

...

    protected Map<Integer, Filter> cachedFilters = new HashMap<Integer, Filter>();

    synchronized Filter getDateFilter(int date) {
        Filter filter = cachedFilters.get(date);

        if (filter == null) {
            BooleanQuery q = new BooleanQuery();

            // startDate must be on or before the specified date
            q.add(NumericRangeQuery.newIntRange(
                "startDate", 0, date, true, true
            ), BooleanClause.Occur.MUST);

            // endDate must be on or after the specified date
            // 30000000 represents the distant future (just prior to the year 3000)
            q.add(NumericRangeQuery.newIntRange(
                "endDate", date, 30000000, true, true
            ), BooleanClause.Occur.MUST);

            filter = new QueryWrapperFilter(q);
            cachedFilters.put(date, filter);
        }

        return filter;
    }
Use a SearcherManager for multi-threaded searching

To manage the access of multiple threads searching the index, Lucene provides a simple SearcherManager class. Once the index has been created, you can instantiate it and call its acquire() method to check out an IndexSearcher instance.

We needed to initialize our IndexSearcher instances with our custom Similarity class (discussed above), so we initialized the manager with a custom SearcherFactory, which then allowed us to customize the IndexSearcher initialization process:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;

public class CustomSearcherFactory extends SearcherFactory {

    @Override
    public IndexSearcher newSearcher(IndexReader r) throws IOException {
        IndexSearcher searcher = new IndexSearcher(r);
        searcher.setSimilarity(new CustomSimilarity());
        return searcher;
    }
}

To use it, we create a SearcherManager instance when initializing the index (in the init() method) — note that the index must already exist before creating the SearcherManager; and then acquire and release the IndexSearcher it provides whenever we actually need to run a search on the index (in the search() method):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

...

    protected SearcherManager searchManager;

    protected void init() throws IOException {
        FSDirectory directory = FSDirectory.open(new File("my-index"));
        searchManager = new SearcherManager(directory, new CustomSearcherFactory());
    }

    public TopDocs search(Query query, Filter filter, int maxResults) throws IOException {
        IndexSearcher searcher = searchManager.acquire();
        try {
            return searcher.search(query, filter, maxResults);
        } finally {
            searchManager.release(searcher);
        }
    }

After re-indexing, make sure to call maybeRefresh() on the SearcherManager to refresh the managed IndexSearchers with the latest copy of the index. In other words, the indexSomething() method from above would be finished like this:

    void indexSomething() {
        // ... index something ...
        writer.close();

        searchManager.maybeRefresh();
    }
Highlight results with a PostingsHighlighter

The PostingsHighlighter class is the newest implementation of a results highlighter for Lucene (the component that comes up with the fragments of text to display for each result in the search-results UI). It's only been part of Lucene since the 4.1 release, but our experience has been that it selects more-clearly relevant sections of the text than the previous highlighter implementation, the FastVectorHighlighter.

The first step to using a results highlighter is to make sure that you include at index time the data that the highlighter will need at search time. With the FastVectorHighlighter, we used this configuration for a regular indexed field:

import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;

...

    static FieldType getTextFieldType() {
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        type.setStored(true);
        type.setStoredTermVectorOffsets(true);
        type.setStoredTermVectorPayloads(true);
        type.setStoredTermVectorPositions(true);
        type.setStoredTermVectors(true);
        type.setTokenized(true);
        return type;
    }

But with the PostingsHighlighter, we found we didn't need to store the term vectors anymore — but we did need to index the term offsets:

import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;

...

    static FieldType getTextFieldType() {
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setTokenized(true);
        return type;
    }

The PostingsHighlighter, by default, selects complete sentences to show. We have a lot of text that isn't in the form of proper sentences, however (much of our text isn't in the form of sentences begun with a capital letter and completed with a period and whitespace), so we subclassed the PostingsHighlighter with a class that uses a custom BreakIterator implementation that selects just a few words around each term to display.
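
Our actual BreakIterator is specific to our content, but the hook for plugging one in is just an override of PostingsHighlighter's getBreakIterator() method; a minimal sketch (substituting a stock word-based BreakIterator for our custom one) looks something like this:

import java.text.BreakIterator;
import java.util.Locale;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

public class CustomPostingsHighlighter extends PostingsHighlighter {

    @Override
    protected BreakIterator getBreakIterator(String field) {
        // the default implementation breaks on sentence boundaries;
        // break on word boundaries instead
        return BreakIterator.getWordInstance(Locale.ROOT);
    }
}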

With or without a custom BreakIterator, it's easy to use the PostingsHighlighter. You do need to have the IndexSearcher and TopDocs instance from the initial search results to use the PostingsHighlighter, so you might as well do both the search and the highlighting in the same method, returning the combined results in some intermediate data structure. For example, we can use a custom inner class called Result for each individual result, and combine one Lucene document object from the search results with the corresponding highlights text string from the highlighter in each returned Result:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

...

    public class Result {
        public Document document;
        public String highlights;

        public Result(Document document, String highlights) {
            this.document = document;
            this.highlights = highlights;
        }
    }

    protected PostingsHighlighter highlighter = new PostingsHighlighter();

    public List<Result> search(Query query, Filter filter, int maxResults) {
        IndexSearcher searcher = searchManager.acquire();
        try {
            TopDocs topDocs = searcher.search(query, filter, maxResults);
            // select up to the three best highlights from the "all" field
            // of each result, concatenated with ellipses
            String[] highlights = highlighter.highlight("all", query, searcher, topDocs, 3);

            int length = topDocs.scoreDocs.length;
            List<Result> results = new ArrayList<Result>(length);
            for (int i = 0; i < length; i++) {
                int docId = topDocs.scoreDocs[i].doc;
                results.add(new Result(searcher.doc(docId), highlights[i]));
            }
            return results;

        } finally {
            searchManager.release(searcher);
        }
    }
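
Calling code can then render each result from the returned list. For example (where query is whatever Query was parsed from the user's input, and the "title" field name is just a placeholder for whatever stored fields your documents have):

    List<Result> results = search(query, null, 20);
    for (Result result : results) {
        // print a stored field plus the highlighted fragments for each hit
        System.out.println(result.document.get("title"));
        System.out.println("    " + result.highlights);
    }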
With a tree, index leaves only

Some of our data is in hierarchical form, and we display the search results for that data in tree form. Rather than indexing all the nodes in the tree, however, we just index the leaves, and make sure that each leaf also includes the relevant text from its ancestors.

We also include the necessary info to render the leaf's branch as a separate, non-indexed "hierarchy" field in each leaf. When the leaf is returned as a search result, we build the branch out of that "hierarchy" field, and then merge the branches together to show each leaf in the context of the full tree.

This is the field configuration we use for the non-indexed "hierarchy" field:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;

...

    protected static final FieldType NON_INDEXED_FIELD_TYPE = getNonIndexedFieldType();

    static FieldType getNonIndexedFieldType() {
        FieldType type = new FieldType();
        type.setIndexed(false);
        type.setOmitNorms(true);
        type.setStored(true);
        return type;
    }

    Document toDocument(MyDomainObject domain) {
        Document doc = new Document();

        // ... other fields ...

        String hierarchy = domain.getHierarchyText();
        Field hierarchyField = new Field("hierarchy", hierarchy, NON_INDEXED_FIELD_TYPE);
        doc.add(hierarchyField);

        return doc;
    }
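
At search time, rebuilding the leaf's branch is then just a matter of reading that stored "hierarchy" field back out of each result document. The exact format is up to you; as a sketch, if the field held a simple delimited path of ancestor names (the " > " delimiter here is just an illustration), it might look like this:

    String[] toBranch(Document doc) {
        // e.g. "Manuals > Chapter 2 > Installation", root first
        String hierarchy = doc.get("hierarchy");
        return hierarchy == null ? new String[0] : hierarchy.split(" > ");
    }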
Use a SpellChecker for auto-complete suggestions

For auto-complete suggestions in our application's search box, we created a custom search index of common words in our application domain that were at least six letters long, and used Lucene's SpellChecker class to index and search this word list. We skipped words less than six letters long to avoid suggesting simple words when the user has typed in only the first few letters of a word. To build the index, we created a plain text file with one word on each line, and indexed it with the following indexDictionary() method:

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Suggestor {

    File directory = new File("suggestion-index");
    SpellChecker spellChecker = new SpellChecker(FSDirectory.open(directory));

    public void indexDictionary(File dictionaryFile) {
        PlainTextDictionary dictionary = new PlainTextDictionary(dictionaryFile);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43,
            new StandardAnalyzer(Version.LUCENE_43));
        spellChecker.indexDictionary(dictionary, config, true);
    }
}

To then search it, we used a simple PrefixQuery for (partial) words less than 5 letters long; for longer words we used the SpellChecker's built-in fuzzy-suggestion algorithm (with a 0.2f accuracy factor to make it even fuzzier than the default). The suggestSimilar() method of our Suggestor class will return a list of up to 10 words appropriate as auto-completions for the partial word specified in the argument to suggestSimilar(). It delegates to helper prefixSearch() and fuzzySuggest() methods to actually run the search based on the length of the specified partial word:

    protected SearcherManager manager;

    protected SearcherManager getSearcherManager() {
        synchronized (directory) {
            if (manager == null)
                manager = new SearcherManager(
                    FSDirectory.open(directory), new SearcherFactory());
            return manager;
        }
    }

    public List<String> suggestSimilar(String s) {
        // search with prefix query if less than 5 chars
        // otherwise use spellChecker's built-in fuzzy suggestions
        return s.length() < 5 ? prefixSearch(s) : fuzzySuggest(s);
    }

    protected List<String> prefixSearch(String s) {
        SearcherManager manager = getSearcherManager();
        IndexSearcher searcher = manager.acquire();
        try {
            // search for the top 10 words starting with s
            Term term = new Term("word", s.toLowerCase());
            TopDocs topDocs = searcher.search(new PrefixQuery(term), 10);

            int length = topDocs.scoreDocs.length;
            List<String> results = new ArrayList<String>(length);
            for (int i = 0; i < length; i++) {
                int docId = topDocs.scoreDocs[i].doc;
                results.add(searcher.doc(docId).get("word"));
            }
            return results;
        } finally {
            manager.release(searcher);
        }
    }

    protected List<String> fuzzySuggest(String s) {
        String term = s.toLowerCase();

        // search for the 10 most popular words not exactly matching the term
        String[] similar = spellChecker.suggestSimilar(
            term, 10, null, null,
            SuggestMode.SUGGEST_MORE_POPULAR, 0.2f);
        List<String> results = new ArrayList<String>(Arrays.asList(similar));

        // include the queried term itself if it's a recognized word
        if (spellChecker.exist(term)) {
            if (results.isEmpty())
                results.add(term);
            else
                results.set(results.size() - 1, term);
        }

        return results;
    }
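
Putting it together, the word list gets indexed once up front, and suggestSimilar() then runs each time the user types in the search box. A quick usage sketch (the file name and partial word here are just examples):

    Suggestor suggestor = new Suggestor();

    // one-time setup (or whenever the word list changes)
    suggestor.indexDictionary(new File("common-words.txt"));

    // as the user types into the search box
    List<String> completions = suggestor.suggestSimilar("highl");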