Introduction

This post describes how to debug JavaScript in Alfresco/Share.

There are two types of js files used in Alfresco/Share:

  • client side – they are placed in the Share root directory
  • server side – they are placed within the WEB-INF/alfresco directory in Share and Alfresco and are used, for example, by web scripts

Client side

Share Debugger

To debug JavaScript on the client side, the client-debug and client-debug-autologging flags in the Share configuration file share/WEB-INF/classes/alfresco/share-config.xml can be set to true, as presented below. This allows using the JavaScript debugger after pressing (Ctrl, Ctrl, Shift, Shift). Setting client-debug to true causes the original *.js files to be used instead of their minimised *-min.js versions. Setting client-debug-autologging to true enables the JavaScript debugger console on page load.

  <flags>
     <!--
        Developer debugging setting to turn on DEBUG mode for client scripts in the browser
     -->
     <client-debug>true</client-debug>
     <!--
        LOGGING can always be toggled at runtime when in DEBUG mode (Ctrl, Ctrl, Shift, Shift).
        This flag automatically activates logging on page load.
     -->
     <client-debug-autologging>true</client-debug-autologging>
  </flags>

Web Browser Debugger

Apart from that, the standard developer tools provided by web browsers can be used. They are really great and include:

  • Web Console (Tools -> Web Developer) in Firefox
  • Developer Tools (Tools) in Chrome

Server side

Log file

It is not so straightforward to debug server-side scripts in Alfresco. Therefore there is a logging class that saves logging messages from JavaScript to the standard log files/output. To see those messages, change the logging level for the org.alfresco.repo.jscript.ScriptLogger class to DEBUG. The corresponding line of the WEB-INF/classes/log4j.properties file is presented below:

log4j.logger.org.alfresco.repo.jscript.ScriptLogger=DEBUG

Then you can use the following command in your JavaScript to log the messages:

 logger.log("Log me");

Alfresco/Share Debugger

You can also activate the server-side JavaScript debugger to assist your development. To do so, open the following links and enable the debugger there:

  • Share: share/service/api/javascript/debugger
  • Alfresco: alfresco/service/api/javascript/debugger

Make sure that the following lines are set to “on” in WEB-INF/classes/log4j.properties:

log4j.logger.org.springframework.extensions.webscripts.ScriptDebugger=ON
log4j.logger.org.alfresco.repo.web.scripts.AlfrescoRhinoScriptDebugger=ON

Normally authentication is handled by Symfony nearly automatically – you just need to define and configure your firewalls. Sometimes, however, you may want to perform authentication manually from a controller.
Imagine implementing automated login for a user upon visiting a URL like /autologin/{secret}. I am not considering the security of such a solution here – you are discouraged from doing it this way, unless the information available behind this kind of “login” is not confidential.

Here is a fragment from my security.yml:

security:
    firewalls:
        secured_area:
            pattern:    ^/
            form_login:
                check_path: /login_check
                login_path: /login

The actual authentication is very straightforward. Since I’m redirecting at the end of the request, I don’t even need the user to be authenticated in this action. All that is needed is to persist the information about the authenticated user to the session. This means storing a serialized class that implements TokenInterface. Normally this is done by the Symfony framework in ContextListener. In my scenario I’m using form login, which uses UsernamePasswordToken, so in short here is what I need to do:

  • Find user
  • Create the Token
  • Store Token in the session

Pay attention to the “secured_area” string – it matches the firewall name from security.yml and is used both when creating the token and when creating the session key.

/**
 * @Route("/autologin/{secret}")
 */
public function autologinAction($secret) {
    // needs at the top of the file:
    // use Symfony\Component\Security\Core\Authentication\Token\UsernamePasswordToken;
    $em = $this->getDoctrine()->getEntityManager();
    $repository = $em->getRepository('MiedzywodzieClientBundle:Reservation');
    $result = $repository->matchLoginKey($secret);
    if (!$result) {
        return $this->render('MiedzywodzieClientBundle:Default:autologin_incorrect.html.twig');
    }
    $result = $result[0];

    // create the token for the "secured_area" firewall
    $token = new UsernamePasswordToken($result, $result->getPassword(), 'secured_area', $result->getRoles());

    // persist it where ContextListener will look for it on the next request
    $request = $this->getRequest();
    $session = $request->getSession();
    $session->set('_security_secured_area', serialize($token));

    $router = $this->get('router');
    $url = $router->generate('miedzywodzie_client_default_dashboard');

    return $this->redirect($url);
}

Introduction

The purpose of this post is to present the creation of a new workflow that copies the attached files to a selected location depending on whether the document was approved or rejected. In addition, I explain the workflow console in more detail and show how to gather more information regarding workflows from it.

Creation of workflow and gathering information from workflow console

Let’s create a simple workflow ‘Review and Approve’. The workflow has one document attached. The screen shot with the initial workflow settings is presented below.

Start Workflow

Open the workflow console using the URL presented below. In this post all the URLs start with ‘http://localhost:8080/alfresco’, which is the path to your Alfresco deployment.

http://localhost:8080/alfresco/faces/jsp/admin/workflow-console.jsp

In the workflow console run the following command to show all the workflows.

show workflows all

You get the following information:

id: activiti$4265 , desc: Please review , start date: Tue May 15 20:18:07 IST 2012 , def: activiti$activitiReview v1

Let’s see more details about the workflow we have just started. As we can see in the previous listing, the id of the workflow is ‘activiti$4265’.

desc workflow activiti$4265

The outcome of the command is presented below. Note that under the information about the package we have a node reference.

definition: activiti$activitiReview
id: activiti$4265
description: Please review
active: true
start date: Tue May 15 20:18:07 IST 2012
end date: null
initiator: workspace://SpacesStore/08b80f86-1db3-44ed-b71a-02ebe4e932aa
context: null
package: workspace://SpacesStore/8d33211a-9f65-42f8-836e-54e2e445d140

Let’s run the node browser and check the node reference from package (workspace://SpacesStore/8d33211a-9f65-42f8-836e-54e2e445d140).

http://localhost:8080/alfresco/faces/jsp/admin/node-browser.jsp

The relevant information about the node is presented below. As we can see, the referenced node is a container for all the documents attached to the workflow. In our case it contains the file ‘mikolajek.jpg’ attached on workflow creation. This information is going to be useful when we have to find the nodes to be copied.

Children

Child Name	        Child Node	                                                Primary	Association Type	                                Index
mikolajek.jpg	        workspace://SpacesStore/5351a554-3913-433f-8919-022d6dead7ce	false	{http://www.alfresco.org/model/bpm/1.0}packageContains	-1

Creation of new workflow

This section describes how to create a new workflow that, depending on whether the task was approved or rejected, adds an appropriate aspect to all the files attached to the workflow. Let’s call the aspect ‘workflowOutcomeAspect’ and allow it to have two values: ‘approved’ or ‘rejected’. The constraint defining the allowed values is presented below.

 <constraint name="wf:allowedOutcome" type="LIST">
     <parameter name="allowedValues">
         <list>
             <value></value>
             <value>approved</value>
             <value>rejected</value>
         </list>
     </parameter>
 </constraint>
 

Following that, let’s modify the initial workflow (‘Review and Approve’) to add ‘workflowOutcomeAspect’ to all the child nodes of the package node and set the property ‘workflowOutcome’ of that aspect to ‘approved’ or ‘rejected’ depending on the user action. Note that ‘Review and Approve’ is one of the standard workflows available with the Alfresco deployment. The package is available in JavaScript under the ‘bpm_package’ variable and its children can be obtained by invoking ‘bpm_package.children’; a sketch of such a script follows the aspect definition below. More information about creation and management of workflows can be found in my post Creation of workflow in Alfresco using Activiti step by step.

<aspect name="wf:workflowOutcomeAspect">
    <title>Workflow Outcome</title>

    <properties>
        <property name="wf:workflowOutcome">
            <title>Workflow Outcome</title>
            <type>d:text</type>
            <mandatory>false</mandatory>
            <default></default>
            <constraints>
                <constraint ref="wf:allowedOutcome" />
            </constraints>
        </property>
    </properties>
</aspect>
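
For illustration, the script attached to the user’s ‘Approve’ action could look roughly like this – a minimal sketch using the standard JavaScript API (the ‘Reject’ branch would set ‘rejected’ instead):

for (var i = 0; i < bpm_package.children.length; i++)
{
    var doc = bpm_package.children[i];
    // add the aspect defined above and record the outcome
    doc.addAspect("wf:workflowOutcomeAspect");
    doc.properties["wf:workflowOutcome"] = "approved";
    doc.save();
}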

Creation of rule to copy the documents

On workflow approval or rejection the aspect property ‘workflowOutcome’ will be set to the appropriate value. In Alfresco Explorer or Share, let’s create a rule that checks whether documents in a particular folder have ‘workflowOutcome’ set and, depending on its value, copies them to a selected folder. Select the ‘copy’ action for the rule. The rule summary is presented below. In fact, I have created two rules – one to copy approved documents and one to copy rejected ones.

Rule summary

Rule Type:	update
Name:	Approved documents
Description:	
Apply rule to sub spaces:	No
Run rule in background:	Yes
Disable rule:	No
Conditions:	Text Property 'wf:workflowOutcome' Equals To 'approved'
Actions:	Move to 'approved'

Rule Type:	update
Name:	Rejected documents
Description:	
Apply rule to sub spaces:	No
Run rule in background:	Yes
Disable rule:	No
Conditions:	Text Property 'wf:workflowOutcome' Equals To 'rejected'
Actions:	Move to 'rejected'

I hope that you have enjoyed the post and found it useful.


Introduction

This article describes a few useful bits and pieces about running Apache Tomcat.

Setup of Tomcat environment variables – setenv.sh

As stated in the CATALINA_BASE/bin/catalina.sh file, the following environment variables can be set in CATALINA_BASE/bin/setenv.sh. The setenv.sh script is run on Tomcat startup. It is not present in the standard Tomcat distribution, so it has to be created.

  • CATALINA_HOME May point at your Catalina “build” directory.
  • CATALINA_BASE (Optional) Base directory for resolving dynamic portions of a Catalina installation. If not present, resolves to the same directory that CATALINA_HOME points to.
  • CATALINA_OUT (Optional) Full path to a file where stdout and stderr will be redirected. Default is $CATALINA_BASE/logs/catalina.out
  • CATALINA_OPTS (Optional) Java runtime options used when the “start”, “run” or “debug” command is executed. Include here and not in JAVA_OPTS all options, that should only be used by Tomcat itself, not by the stop process, the version command etc. Examples are heap size, GC logging, JMX ports etc.
  • CATALINA_TMPDIR (Optional) Directory path location of temporary directory the JVM should use (java.io.tmpdir). Defaults to $CATALINA_BASE/temp.
  • JAVA_HOME Must point at your Java Development Kit installation. Required to run with the “debug” argument.
  • JRE_HOME Must point at your Java Runtime installation. Defaults to JAVA_HOME if empty. If JRE_HOME and JAVA_HOME are both set, JRE_HOME is used.
  • JAVA_OPTS (Optional) Java runtime options used when any command is executed. Include here, and not in CATALINA_OPTS, all options that should be used by Tomcat and also by the stop process, the version command etc. Most options should go into CATALINA_OPTS.
  • JAVA_ENDORSED_DIRS (Optional) List of colon-separated directories containing some jars in order to allow replacement of APIs created outside of the JCP (i.e. DOM and SAX from W3C). It can also be used to update the XML parser implementation. Defaults to $CATALINA_HOME/endorsed.
  • JPDA_TRANSPORT (Optional) JPDA transport used when the “jpda start” command is executed. The default is “dt_socket”.
  • JPDA_ADDRESS (Optional) Java runtime options used when the “jpda start” command is executed. The default is 8000.
  • JPDA_SUSPEND (Optional) Java runtime options used when the “jpda start” command is executed. Specifies whether JVM should suspend execution immediately after startup. Default is “n”.
  • JPDA_OPTS (Optional) Java runtime options used when the “jpda start” command is executed. If used, JPDA_TRANSPORT, JPDA_ADDRESS, and JPDA_SUSPEND are ignored. Thus, all required jpda options MUST be specified. The default is:
    -agentlib:jdwp=transport=$JPDA_TRANSPORT,address=$JPDA_ADDRESS,server=y,suspend=$JPDA_SUSPEND
  • CATALINA_PID (Optional) Path of the file which should contain the pid of the Catalina startup Java process, when start (fork) is used.
  • LOGGING_CONFIG (Optional) Override Tomcat’s logging config file. Example (all one line): LOGGING_CONFIG="-Djava.util.logging.config.file=$CATALINA_BASE/conf/logging.properties"
  • LOGGING_MANAGER (Optional) Override Tomcat’s logging manager. Example (all one line): LOGGING_MANAGER="-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"

In case you need more memory to run your Tomcat instance, just put the following line in the setenv.sh file.

export JAVA_OPTS="-XX:MaxPermSize=1024m -Xms512m -Xmx4096m"
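
Note that, per the CATALINA_OPTS description above, heap settings apply only to Tomcat itself, so they could equally well go there instead:

export CATALINA_OPTS="-XX:MaxPermSize=1024m -Xms512m -Xmx4096m"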

Running Tomcat – catalina.sh start|run|debug|jpda start

To run Tomcat you can use catalina.sh script with different options:

  • start: Tomcat is started in its own shell/session. Instead of this command you can run: startup.sh
  • run: Tomcat is started in the current shell/session; the startup output is printed on the console and execution stops on session close or on Ctrl+C.
  • debug: Tomcat starts under jdb – the Java Debugger.
  • jpda start: Tomcat runs with remote debugging support.

For more details see JPDA_* variables above and Remote debugging of web application deployed on Tomcat server or using Jetty Maven plugin with Eclipse.
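
For example, a minimal setenv.sh that makes the JPDA settings explicit could look like this (the values shown are just the documented defaults):

# CATALINA_BASE/bin/setenv.sh – remote debugging on the default port 8000
export JPDA_TRANSPORT=dt_socket
export JPDA_ADDRESS=8000
export JPDA_SUSPEND=n

Then start Tomcat with: catalina.sh jpda start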


To test the performance of multiple parallel file downloads, I had to make sure that a download takes a significant amount of time. I could use huge files, but that’s not very helpful if you work on a local, 1 Gb LAN. So I’ve decided to limit download speeds from my Apache server to my PC. Here we go.

1. Mark packets to be throttled, in my case those originating from port 80:

$ iptables -A OUTPUT -p tcp --sport 80 -j MARK --set-mark 100

2. Use the tc utility to limit traffic for the packets marked as above (handle 100). Note that in tc units ‘kbps’ means kilobytes per second (‘kbit’ would be kilobits):

$ tc qdisc add dev eth0 root handle 1:0 htb default 10
$ tc class add dev eth0 parent 1:0 classid 1:10 htb rate 1024kbps ceil 2048kbps prio 0
$ tc filter add dev eth0 parent 1:0 prio 0 protocol ip handle 100 fw flowid 1:10

3. That’s it. You can monitor/check your rules with:

$ tc filter show dev eth0
$ tc -s -d class show dev eth0

and finally remove the throttling with:

$ tc qdisc del dev eth0 root
$ iptables -D OUTPUT -p tcp --sport 80 -j MARK --set-mark 100
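
To verify that the throttling works, download something big from the server and watch the reported rate, e.g. (the file name here is just a placeholder):

$ wget -O /dev/null http://yourserver/bigfile.iso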

Some time ago I had to process a lot of images in a simple way – remove the top and bottom parts. It was not a task I could automate – the amount of the image I had to cut from the top & bottom varied for each photo. To make the mundane work a bit easier, I’ve created a script – a Python plugin.

The script assumes you have put two guide lines onto the image. It finds them, cuts out the part of the image between them and saves it as a new file.

To create such a simple script in Python you need to:

  • import gimpfu
  • call the register method that tells GIMP (among other things) the name of the function that implements the script (special_crop) and where to put a link to the script in the GIMP menu (<Image>/Filters)
  • implement your function
  • copy the script to your custom scripts folder (e.g. /home/…/.gimp-2.6/plug-ins)

The other locations you can use when choosing where in the menu system a script should appear are:
“<Toolbox>”, “<Image>”, “<Layers>”, “<Channels>”, “<Vectors>”, “<Colormap>”, “<Load>”, “<Save>”, “<Brushes>”, “<Gradients>”, “<Palettes>”, “<Patterns>” or “<Buffers>”

And finally, the script itself. It’s fairly self-explanatory – enjoy and happy gimping!

#!/usr/bin/env python
from gimpfu import *
 
def special_crop(image):
        print "Start"
        pdb = gimp.pdb
        # find the two guides: passing 0 returns the first guide,
        # passing a guide id returns the one after it
        top = pdb.gimp_image_find_next_guide(image, 0)
        top_y = pdb.gimp_image_get_guide_position(image, top)
        bottom = pdb.gimp_image_find_next_guide(image, top)
        bottom_y = pdb.gimp_image_get_guide_position(image, bottom)
        # make sure top_y really is the upper guide
        if top_y > bottom_y:
                top_y, bottom_y = bottom_y, top_y
        print "Cutting from", top_y, "to", bottom_y
        # select the full-width region between the guides and copy it
        pdb.gimp_rect_select(image, 0, top_y, image.width, bottom_y - top_y, CHANNEL_OP_REPLACE, FALSE, 0)
        pdb.gimp_edit_copy(image.active_drawable)
        # paste as a new image and save it next to the original file
        image2 = pdb.gimp_edit_paste_as_new()
        new_filename = image.filename[0:-4] + "_cut.jpg"
        pdb.file_jpeg_save(image2, image2.active_drawable, new_filename, "raw_filename", 0.9, 0.5, 0, 0, "New file", 0, 0, 0, 0)
        pdb.gimp_image_delete(image2)
 
register(
    "python-fu-special-crop",
    "Crop an image",
    "Crops the image.",
    "Tomasz Muras",
    "Tomasz Muras",
    "2011",
    "Special crop",
    "*",
    [
        (PF_IMAGE, "image","Input image", None),
    ],
    [],
    special_crop,
    menu="<Image>/Filters",
    )
 
main()

Introduction

Sometimes it can be useful to monitor the performance of a Java Virtual Machine (JVM) on a remote host. To do so, a very nice tool – VisualVM – can be used. It can be run on the local host and get information from jstatd running on a remote host. In addition, VisualVM comes with a number of useful plugins. This blog describes how to run VisualVM with VisualGC (the Visual Garbage Collection Monitoring Tool) to monitor Tomcat on a remote machine. However, the solution can also be applied to other applications running on the Java VM.

Remote machine

Run Tomcat

Add the following options to the CATALINA_OPTS variable to enable JMX support in Apache Tomcat.

CATALINA_OPTS="
    -Dcom.sun.management.jmxremote=true
    -Dcom.sun.management.jmxremote.port=<portNumber>
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.authenticate=false
    -Djava.rmi.server.hostname=<hostIP>"

in my case:

CATALINA_OPTS="
    -Dcom.sun.management.jmxremote=true
    -Dcom.sun.management.jmxremote.port=8084
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.authenticate=false
    -Djava.rmi.server.hostname=192.168.1.20"

You can find more information about Monitoring and Managing Tomcat here: Monitoring and Managing Tomcat

Run Tomcat:

startup.sh

We want to monitor the Tomcat instance running on the remote machine. To check whether it is running, use:

ps aux | grep tomcat

The above command run on my remote machine returns the following:

tomcat6  28743  181  3.4 5136644 210932 ?      Sl   13:15   0:20 /opt/java-1.6.0_22/bin/java -Djava.util.logging.config.file=/var/lib/tomcat6/conf/logging.properties -Djava.awt.headless=true -Xms512m -Xmx3048m -XX:+UseConcMarkSweepGC -XX:MaxPermSize=512m -XX:-DisableExplicitGC -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8084 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=192.168.1.20 -Djava.endorsed.dirs=/usr/share/tomcat6/endorsed -classpath /usr/share/tomcat6/bin/bootstrap.jar -Dcatalina.base=/var/lib/tomcat6 -Dcatalina.home=/usr/share/tomcat6 -Djava.io.tmpdir=/tmp/tomcat6-tomcat6-tmp org.apache.catalina.startup.Bootstrap start

Notice that the pid is ‘28743’.

To get the Java VM process status you can run the jps command – the Java Virtual Machine Process Status Tool. jps is located in your Java JDK HOME/bin directory. A description of the jps command can be found here. Note that jps returns only the Java processes run by the user who runs jps. To get the list of all Java processes, run sudo jps. See the examples below.

Outcome of jps run on my remote machine:

28144 Jps

Outcome of sudo jps run on my remote machine:

28743 Bootstrap
1673 CPService
28159 Jps
28897 Jstatd

Notice that the lvmid for Tomcat (in this case – Bootstrap) is ‘28743’, which is the same as the pid.

Hostname

Run

hostname

to check the host name, e.g., agile003.

Make sure that in the /etc/hosts file this hostname is mapped to the IP by which it is visible to the machine that will be running VisualVM (the local machine), e.g., 192.168.1.20 agile003.

Run the jstat Daemon

jstatd, the jstat daemon, can be found in Java JDK HOME/bin. As described in the jstatd documentation, create a file called jstatd.policy in any directory you choose, e.g., /home/joanna. The file should contain the following text:

grant codebase "file:${java.home}/../lib/tools.jar" {
    permission java.security.AllPermission;
};

Run jstatd using the following command. Make sure you run it with root permissions.

sudo jstatd -J-Djava.security.policy=/home/joanna/jstatd.policy

You can add the following flag to log the calls and see what is going on:

-J-Djava.rmi.server.logCalls=true

Local machine

Check access to jstatd

Run:

jps <hostname>

in my case

jps agile003

You should see the same output as for sudo jps running on remote machine.

1673 CPService
28743 Bootstrap
28897 Jstatd

Run VisualVM

Run:

jvisualvm

or

visualvm

jvisualvm is located in your Java JDK HOME/bin directory; it can also be downloaded from here: JVM download

Go to

Tools -> Plugins

and install the plugins you require, including the VisualGC plugin. My selection is presented in the screen shot below.

VisualVM plugins

Restart VisualVM.

Add the remote host – the one that jstatd is running on.

File -> Add Remote Host...

give the IP of the host, e.g., 192.168.1.20. The same IP should be set for the host name in /etc/hosts on the remote server.

Now you should be able to access the Java processes running on the remote host, including Tomcat, as presented below.

Java Processes on Remote Host

All tabs, including VisualGC on the right-hand side, should now show appropriate graphs. See the sample screen shot below:

Visual GC

Enjoy!


Introduction

This guide describes how to serve a git repository over HTTP using Apache. This should work on any recent Ubuntu or Debian release; I’ve tested it on Ubuntu Server 11.10. I’m setting it up on my local server 192.168.1.20 under git/agilesparkle, so my repository will be available at http://192.168.1.20/git/agilesparkle. I want it to be password protected, with only a single user with the following credentials: myusername/mypassword.

Server side

I assume you have Apache installed already. Switch to the root account, so we won’t need to add sudo all the time, and install git:

$ sudo su
$ apt-get install git-core

Create directory for your git repository:

$ mkdir -p /var/www/git/agilesparkle

Create a bare git repository inside and set the rights so Apache has write access:

$ cd /var/www/git/agilesparkle
$ git --bare init
$ git update-server-info
$ chown -R www-data.www-data .

Enable the dav_fs module. This will automatically enable the dav module as well:

$ a2enmod dav_fs

Configure Apache to serve the git repository using DAV:

$ vim /etc/apache2/conf.d/git.conf

Copy the following inside the newly created file:

<Location /git/agilesparkle>
	DAV on
	AuthType Basic
	AuthName "Git"
	AuthUserFile /etc/apache2/passwd.git
	Require valid-user
</Location>

Create a new file with the user and their password:

$ htpasswd -c /etc/apache2/passwd.git myusername
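
If the htpasswd command is not available, on Ubuntu/Debian it ships in the apache2-utils package:

$ apt-get install apache2-utils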

Restart the Apache server:

$ service apache2 restart

Client side

Clone the repository:

% git clone http://192.168.1.20/git/agilesparkle
Cloning into agilesparkle...
Username: 
Password: 
warning: You appear to have cloned an empty repository.

Create a sample file and push it into the empty repository:

% cd agilesparkle
% echo test file > readme.txt
% git add readme.txt
% git commit
% git push origin master

For the following pushes you can simply use:

% git push

If you don’t want to be prompted for the password each time and you don’t mind storing it in plain text, edit the following file:

% vim ~/.netrc

and add the following:

machine 192.168.1.20
login myusername
password mypassword
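
Since ~/.netrc now contains your password in plain text, it’s a good idea to make it readable only by you:

% chmod 600 ~/.netrc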

Unless provided explicitly, the Java VM will set up several performance-related options depending on the current environment. This mechanism is called ergonomics. You can see what defaults would be used on a machine by invoking:

$ java -XX:+PrintCommandLineFlags -version

The decision on the settings is made based on the number of processors and the total memory installed in the system. On my 32-bit EeePC with 2 processors (as visible to the OS) and 2 GB memory, the output is:

$ java -XX:+PrintCommandLineFlags -version
-XX:InitialHeapSize=32847872 -XX:MaxHeapSize=536870912 -XX:ParallelGCThreads=2 -XX:+PrintCommandLineFlags -XX:+UseParallelGC 
java version "1.6.0_23"
OpenJDK Runtime Environment (IcedTea6 1.11pre) (6b23~pre11-0ubuntu1.11.10)
OpenJDK Server VM (build 20.0-b11, mixed mode)

And just for comparison, the output from Oracle Java 7:

$ java -XX:+PrintCommandLineFlags -version
-XX:InitialHeapSize=32847872 -XX:MaxHeapSize=525565952 -XX:ParallelGCThreads=2 -XX:+PrintCommandLineFlags -XX:+UseParallelGC 
java version "1.7.0_03"
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) Server VM (build 22.1-b02, mixed mode)

On a 64-bit system with 8 CPUs and 16 GB memory, the output is:

$ java -XX:+PrintCommandLineFlags -version
-XX:InitialHeapSize=263071232 -XX:MaxHeapSize=4209139712 -XX:ParallelGCThreads=8 -XX:+PrintCommandLineFlags -XX:+UseCompressedOops -XX:+UseParallelGC 
java version "1.6.0_23"
OpenJDK Runtime Environment (IcedTea6 1.11pre) (6b23~pre11-0ubuntu1.11.10.2)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

Oracle Java 7 again gives exactly the same ergonomics defaults. Note that the MaxHeapSize chosen on the larger machine (4209139712 bytes, about 3.9 GB) is roughly a quarter of the 16 GB of physical memory, which matches the documented server-class ergonomics rule of one quarter of physical memory for the maximum heap.


Introduction

I’ve been working for some time on rewriting the Global Search feature for Moodle. This is basically a search functionality that spans different regions of Moodle. Ideally it should allow searching everywhere within Moodle: forums, physical documents attached as resources, etc. The implementation should work in PHP, so as a search engine I’ve decided to use Zend’s implementation of Lucene. The library unfortunately doesn’t seem to be actively maintained – there were very few changes in the SVN log – practically no development of Search Lucene since November 2010 (the few entries in 2011 just fix typos or update the copyright date). The bug tracker is also full of Lucene issues and shows very little activity.
Having said that, I didn’t find any other search engine library implemented natively in PHP, so Zend_Search_Lucene it is! (please, please let me know if you know any alternatives)

Zend Lucene indexing performance-related settings

There are only two settings that can be changed to affect the performance of indexing:

  • $maxBufferedDocs
  • $mergeFactor

maxBufferedDocs

From the documentation:

 Number of documents required before the buffered in-memory
 documents are written into a new Segment
 Default value is 10

This simply means that after every $maxBufferedDocs calls to the addDocument() function, the index will be committed. Committing requires obtaining a write lock on the Lucene index.
So it should be straightforward: the larger the value, the less often the index is flushed – therefore the overall performance (e.g. the number of documents indexed per second) is higher, but the memory footprint is bigger.

mergeFactor

The documentation says:

 mergeFactor determines how often segment indices are merged by addDocument().
 With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster,
 but indexing speed is slower.
 With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower,
 indexing is faster.
 Thus larger values (> 10) are best for batch index creation,
 and smaller values (< 10) for indices that are interactively maintained.

So it seems pretty simple – for initial indexing we should set mergeFactor as high as possible and then lower it when more content is added to the index later on. With maxBufferedDocs we should simply find a balance between speed & memory consumption. A sketch of how these settings are applied is shown below.
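
For reference, this is roughly how the two settings are applied in code – a sketch assuming the stock ZF1 Zend_Search_Lucene API (the index path and field names are made up):

$index = Zend_Search_Lucene::create('/path/to/index');
$index->setMaxBufferedDocs(100); // flush buffered docs to a new segment every 100 adds
$index->setMergeFactor(10);      // how aggressively segments are merged (default is 10)

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('name', $name));
$doc->addField(Zend_Search_Lucene_Field::UnStored('intro', $intro));
$index->addDocument($doc);       // commits implicitly every maxBufferedDocs documents
$index->commit();                // final explicit commit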

Testing indexing speed

I’ve tested various settings with my initial code for Global Search. As test data I’ve created a Moodle site with 1000 courses (really 999 courses, as I didn’t use course id=1 – the frontpage course in Moodle). Each course has 10 sections and there is 1 label inside each section, that is: 10 labels per course (note: the number of courses and sections is not really relevant for testing indexing speed).

Each label is a simple HTML text about 10k characters long, randomly generated based on the words from “The Hitchhiker’s Guide to the Galaxy”. Here is a fragment of a sample label text (DB column intro):

<h2>whine the world, so far an inch wide, and</h2>
<h2>humanoid, but really knows all she said. - That</h2>
<span>stellar neighbour, Alpha Centauri for half an interstellar distances between different planet. Armed intruders in then turned it as it was take us in a run through finger the about is important. - shouted the style and decided of programmers with distaste at the ship a new breakthrough in mid-air and we drop your white mice, -  of it's wise Zaphod Beeblebrox. Something pretty improbable no longer be a preparation for you. - Come off for century or so, - The two suns! It is. (Sass:</span>
[...9693 characters more...]

The intro and the name of each label are indexed. The total amount of data to index is about 100 MB – exactly 104,899,975 bytes (SELECT SUM( CHAR_LENGTH( `name` ) ) + SUM( CHAR_LENGTH( `intro` ) ) FROM `mdl_label`) – in 9990 labels. (Note for the picky ones: no, there are no multi-byte characters there.)
I’ve tested it on my local machine running 64-bit Ubuntu 11.10, apache2-mpm-prefork (2.2.20-1ubuntu1.2), mysql-server-5.1 (5.1.61-0ubuntu0.11.10.1), php5 (5.3.6-13ubuntu3.6) with php5-xcache (1.3.2-1). Hardware: Intel Core i7-2600K @ 3.40GHz, 16GB RAM.
The results:

Time (s) maxBufferedDocs mergeFactor
1430.1 100 10
1464.7 300 400
1471.1 200 10
1540.9 200 100
1543.3 300 100
1549.7 200 200
1557.5 100 5
1559.3 300 200
1560.4 300 300
1577.0 200 300
1578.9 50 10
1581.5 200 5
1584.6 300 50
1586.6 300 10
1589.3 200 50
1591.2 200 400
1616.7 100 50
1742.2 50 5
1746.4 400 5
1770.7 400 10
1776.1 300 5
1802.3 400 50
1803.9 400 200
1815.7 50 50
1830.7 400 100
1839.4 400 400
1854.9 100 300
1870.1 400 300
1894.1 100 100
1897.2 100 200
1909.7 100 400
1924.4 10 10
1955.1 10 50
2133.4 5 10
2189.0 10 5
2257.6 10 100
2269.8 50 100
2282.7 5 50
2393.5 5 5
2466.8 5 100
2979.4 10 200
3146.8 5 200
3395.9 50 400
3427.9 50 200
3471.9 50 300
3747.0 10 300
3998.1 5 300
4449.8 10 400
5070.0 5 400

The results are not what I would expect – and definitely not what the documentation suggests: increasing both values should decrease the total indexing time. In fact, I was so surprised that the first thing I suspected was that my tests were invalid because something on the server was affecting the performance. So I’ve repeated a few tests:

First test (s) Second test (s) maxBufferedDocs mergeFactor
1430.1 1444.9 100 10
1464.7 1490.6 300 400
1471.1 1491.1 200 10
1540.9 1593.5 200 100
1894.1 1867.7 100 100
1924.4 1931.2 10 10
1909.7 1920.4 100 400
5070.0 5133.3 5 400

The tests look OK! Here is a 3D graph of the results (lower values are better):

result1

Explaining the results would require more analysis of the library implementation, but for end users like myself it makes the decision very simple: maxBufferedDocs should be set to 100 and mergeFactor to 10 (the default value). As you can see on the graph, once you set maxBufferedDocs to 100, the two settings don’t really make much of a difference (the surface is flat). Setting both higher will only increase the memory usage.
With those settings, on commodity hardware, the indexing speed was 71 kB of text per second (7 big labels per second). The indexing process is clearly CPU bound; further optimization would require optimizing the Zend_Search_Lucene code.

Testing performance degradation

The next thing to check is whether the indexing speed degrades over time. The speed of 71 kB/sec may be OK, but if it degrades much over time, then it may slow down to unacceptable values. To test it, I’ve created ~100k labels with a total size of 1,049,020,746 bytes (1 GB) and ran the indexer again. The graph below shows the times it took to add each 1000 documents.

result2

The time to add a single document is initially 0.05 sec and it keeps growing, up to 0.15 sec at the end (100k documents). There is a spike every 100 documents, related to the value of maxBufferedDocs. But there are also bigger spikes every 1,000 documents, and even bigger ones every 10,000. I think this is caused by Zend_Lucene merging documents into a single segment, but I didn’t study the code deeply enough to be 100% sure.
In total it took 5.5 h to index 1 GB of data. The average throughput dropped from 73,356 bytes/sec (when indexing 100 MB) to 53,903 bytes/sec (when indexing 1 GB of text).

The bottom line is that the speed of indexing keeps decreasing as the index grows, but not significantly.

The last thing to check is the memory consumption. I checked the memory consumption after every document indexed; then, for each group of 1000 documents, I graphed the maximum memory used (the current memory usage keeps jumping around).

memory1

The maximum peak memory usage does increase, but very slowly (1 MB after indexing 100k documents).