Saturday, April 24, 2010

Text extraction from Office 2007 and Office 2010 documents using Tika

We used to rely on several different libraries to extract text from Word, PDF, PowerPoint and Excel files, and they were tricky to maintain. Our CTO found the Apache Tika project, which made our life easy: extracting text from all of these document types is now a piece of cake. The beauty of the Tika library is that it detects the MIME type and other metadata automatically. Here is sample code to extract text using Tika:

    @Override
    public String getText(InputStream stream, int maxSize) {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxSize);
        try {
            return tika.parseToString(stream);
        } catch (Throwable t) {
            logger.error("Error extracting text from document of type " + logIdentifier, t);
            return " ";
        }
    }

Friday, April 23, 2010

IE browsing slowness and mod_ssl

Our pages were loading very slowly in IE compared to FF, and we assumed that IE was inherently slow at parsing the page. That assumption was wrong. Running Fiddler with IE showed that IE was making far more SSL handshakes with the server than FF or Safari did under Fiddler. This was traced to an Apache configuration.

Apache ships with this default mod_ssl configuration:

SetEnvIf User-Agent ".*MSIE.*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0



This tells Apache not to use keep-alive for any IE browser and to downgrade the HTTP response to 1.0. Without keep-alive every request needs a new connection and therefore a new SSL handshake, so the same request took significantly longer in IE than in FF. These workarounds were only required for really old browsers.


Changing the configuration as shown below solved the issue, and we now get good initial load performance in IE:


SetEnvIf User-Agent ".*MSIE [1-4].*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0

SetEnvIf User-Agent ".*MSIE [5-9].*" \
ssl-unclean-shutdown
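
To check which browsers each of the two patterns above catches, here is a small Python sketch. It only mimics the regular-expression matching that SetEnvIf performs on the User-Agent header; the function name flags_for and the sample User-Agent strings are made up for illustration.

```python
import re

# The two User-Agent patterns from the configuration above.
LEGACY_IE = re.compile(r".*MSIE [1-4].*")
MODERN_IE = re.compile(r".*MSIE [5-9].*")

def flags_for(user_agent):
    """Return the environment flags the two SetEnvIf rules would set."""
    flags = []
    if LEGACY_IE.search(user_agent):
        flags += ["nokeepalive", "ssl-unclean-shutdown",
                  "downgrade-1.0", "force-response-1.0"]
    if MODERN_IE.search(user_agent):
        flags += ["ssl-unclean-shutdown"]
    return flags

if __name__ == "__main__":
    for ua in ("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
               "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)",
               "Mozilla/5.0 (Windows; U; Windows NT 6.1) Firefox/3.6"):
        print(ua, "->", flags_for(ua))
```

One thing to keep in mind: the pattern "MSIE [1-4]" looks only at the first digit of the version, so a User-Agent containing "MSIE 10.0" would match the legacy rule and lose keep-alive again.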

Applet JRE 1.6.0_19 security popup issue

We recently ran into an issue where customers using our Java applet for multi-file upload suddenly started seeing a security warning. The worrying thing about the dialog was that "Block" was the first choice, so customers kept clicking Block.

The dialog appeared because our applet made an HTTP call to download some properties files. The Java plug-in treated this as a security risk because the applet jars were signed but the properties files were not, and they can't be signed.

The fix for this issue was to bundle the properties files in the jar file.

Temporarily you can also ask users to enable this setting

Java Applet CACHE_VERSION Mac v/s Windows

Java applets use a property called CACHE_VERSION with a format like 4.0.5.452d: four hexadecimal values separated by ".". The applet plug-in uses it to determine whether to download new jars. The Sun documentation says the plug-in downloads the new jars only if the CACHE_VERSION is higher than the previous one. My findings on this:

The Windows plug-in in IE, FF, Chrome and Safari on Windows downloads new jars regardless of whether the jar cache version is greater or not. Our version used to be 4.0.5.XXX; I tried changing it to 4.0.4.XXX and even 1.2.3.XXX, and Windows would happily download the jars.

We recently ran into an issue where the applet would not work for some Mac users, seemingly at random. The culprit was that during one deployment our operations team had set the jar version to 4.0.6.XXX. We use the svn changelist number of the jar file as XXX and never change the "4.0.5" portion, so when the next build was deployed the jar version in the JSP page went back to "4.0.5.XXX".

The Mac Java plug-in enforces the rule described by Sun and downloads jars only if the CACHE_VERSION is greater than the previous one, so Macs that had cached 4.0.6.XXX never picked up the 4.0.5.XXX jars. Changing the version to 5.2.1.XXX solved the issue, and we changed the build to use 5.2.1 as well.
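
The "greater than" rule can be sketched in a few lines of Python. This is only my reading of the behaviour described above, assuming the four hexadecimal fields are compared numerically from left to right; the function names and the sample version numbers are made up for illustration, and this is not the plug-in's actual code.

```python
def parse_cache_version(version):
    """Turn '4.0.5.452d' into a tuple of four integers (fields are hex)."""
    parts = version.split(".")
    if len(parts) != 4:
        raise ValueError("expected 4 dot-separated fields: %r" % version)
    return tuple(int(part, 16) for part in parts)

def mac_would_download(cached, offered):
    """Strict rule: download only if the offered version is greater."""
    return parse_cache_version(offered) > parse_cache_version(cached)

if __name__ == "__main__":
    # A client that cached 4.0.6.x ignores a later 4.0.5.x build ...
    print(mac_would_download("4.0.6.452d", "4.0.5.4600"))
    # ... but picks up 5.2.1.x.
    print(mac_would_download("4.0.6.452d", "5.2.1.4600"))
```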

Tuesday, April 20, 2010

Unix timing a command

To time any command in Unix, just prefix the command with "time ". For example:
  1. "time ps"
  2. "time ls"
  3. "time python scripts/dev_scripts/test_backup_upload_locally.py ../vmshare"
The output of "time ps" would be something like:

  PID TTY          TIME CMD
14648 pts/0    00:00:00 bash
14778 pts/0    00:00:00 ps

real    0m0.032s
user    0m0.004s
sys    0m0.024s
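
If you need the same three numbers inside a script rather than on the shell, a rough Python equivalent is sketched below. It assumes a Unix system (the child CPU times from os.times() are always zero on Windows), and the function name time_command is made up for illustration.

```python
import os
import subprocess
import sys
import time

def time_command(argv):
    """Run argv and return real/user/sys seconds, similar to `time`."""
    before = os.times()
    start = time.perf_counter()
    returncode = subprocess.call(argv)  # waits for the child to finish
    real = time.perf_counter() - start
    after = os.times()
    return {
        "real": real,
        # CPU time spent by finished child processes since `before`
        "user": after.children_user - before.children_user,
        "sys": after.children_system - before.children_system,
        "returncode": returncode,
    }

if __name__ == "__main__":
    timings = time_command(sys.argv[1:] or ["ls"])
    for key in ("real", "user", "sys"):
        print("%-4s %.3fs" % (key, timings[key]))
```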

Monday, April 12, 2010

Tiff image thumbnails that are visible in all browsers

It seems that not all browsers display TIFF images properly. A TIFF embedded directly in a page showed up properly only in Safari (way to go, Apple) and did not display in Firefox, IE or Chrome. I wanted this in order to generate thumbnail images for files added to our cloud server; PIL was able to generate the TIFF thumbnail, but it was visible only in Safari. The solution was simple: I used ImageMagick to generate the thumbnails for TIFF images in JPG format ;).



nice -n 10 convert Sample.tiff -thumbnail 100x100 -bordercolor white -border 50 -background white -gravity center -crop 100x100+0+0 +repage -limit memory 32 -limit map 32 -limit disk 500 Sample.jpg

First page thumbnail for a multi page tiff

I discovered that a TIFF file can contain multiple pages. I found out when ImageMagick generated 7 images for a single input file, which broke our code. The fix, to generate an image of the first page only, is to append [0] to the input file name in the ImageMagick command. The beauty of this solution is that it works fine even if the image has only one page.

nice -n 10 convert kp.tiff[0] -thumbnail 100x100 -bordercolor white -border 50 -background white -gravity center -crop 100x100+0+0 +repage -limit memory 32 -limit map 32 -limit disk 500 kp.jpg
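
When the command is launched from Python, passing it as an argument list avoids any shell quoting trouble with the [0] suffix. The sketch below just rebuilds the command line shown above; the function name and the size parameter are made up for illustration.

```python
import subprocess

def first_page_thumbnail_command(src, dst, size=100):
    """Build the convert command above for the first page of src."""
    geometry = "%dx%d" % (size, size)
    return [
        "nice", "-n", "10",
        "convert", src + "[0]",          # [0] selects the first page only
        "-thumbnail", geometry,
        "-bordercolor", "white", "-border", "50",
        "-background", "white", "-gravity", "center",
        "-crop", geometry + "+0+0", "+repage",
        "-limit", "memory", "32",
        "-limit", "map", "32",
        "-limit", "disk", "500",
        dst,
    ]

if __name__ == "__main__":
    # No shell is involved, so the brackets are passed through untouched.
    subprocess.check_call(first_page_thumbnail_command("kp.tiff", "kp.jpg"))
```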

Thursday, April 8, 2010

Tika0.7 OutOfMemory compile issue

Not sure why they don't publish binaries on the site. I was trying to compile Tika 0.7 and the build failed because the tests were running out of memory (OutOfMemory).

Setting the environment variable below before running mvn install solved the issue:
export MAVEN_OPTS="-Xmx1024m"

Wednesday, April 7, 2010

Python CMYK images

We recently ran into an issue when implementing thumbnail generation for our website. Some CMYK JPEG images were getting thumbnails with a blue background, causing customer complaints. Using ImageMagick solved the issue, but ImageMagick runs out of process and was very slow. We use PIL for image generation and were on PIL 1.1.6. You can check your PIL version with:
import Image
Image.VERSION

For those of you facing a similar issue, upgrading to PIL 1.1.7 fixed it.

At first installing PIL 1.1.7 did not work because Python was somehow still picking up 1.1.6. I had to remove all the old references by running "apt-get remove python-imaging", and that solved the issue.

Applet Jar download without browser restart

We use an applet on our website for uploading multiple files or a folder tree to the server. Recently our code signing certificate expired, and we had to sign the jars again and publish new ones. We use CACHE_VERSION to give each jar a version, so that the applet doesn't go to the server on each browser restart to check whether a new version is available. Refer to http://java.sun.com/products/plugin/1.3/docs/appletcaching.html for more details on CACHE_VERSION.

Even after we uploaded the new jars to the server and gave each of them a different CACHE_VERSION, customers were still complaining about the expired certificate dialog. Some googling showed that this is a common problem with the Java plug-in in most browsers and that restarting the browser fixes it. The plug-in checks the cache version only once per browser session; even if you render the applet tag again, it won't check the cache version. Wow, so many people hadn't restarted their browser in 2-3 days. Since this can happen again whenever we push a server change that's incompatible with the old jars, we needed a real solution.

The solution was simple: every time the jar on the server changes, generate a unique name for it. We do this by appending the svn changelist number of the file to the jar name in the JSP, so the jar is referenced as, for example, upload.V13456.jar. On the server an Apache rewrite rule strips the .V13456 part and serves the jar file. We were already doing this for our images: we use a _@version_@ tag in JSP files that a Python script replaces at build time with the svn changelist number of that file. Using the same logic here solved the issue.
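
The naming scheme is easy to sketch in Python. This is not our build script, only an illustration: the function names are made up, and strip_version mimics in Python what the Apache rewrite rule does on the server.

```python
import re

# Matches the ".V<changelist>" part that sits right before ".jar".
_VERSION_TAG = re.compile(r"\.V\d+(?=\.jar$)")

def versioned_jar_name(jar_name, changelist):
    """upload.jar + 13456 -> upload.V13456.jar (done at build time)."""
    stem, extension = jar_name.rsplit(".", 1)
    return "%s.V%d.%s" % (stem, changelist, extension)

def strip_version(requested_name):
    """upload.V13456.jar -> upload.jar (what the rewrite rule does)."""
    return _VERSION_TAG.sub("", requested_name)

if __name__ == "__main__":
    name = versioned_jar_name("upload.jar", 13456)
    print(name)
    print(strip_version(name))
```

Because the browser sees a brand-new file name after every change, it has no cached copy to reuse and has to fetch the jar again.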

Update expired certificate in a signed jar

Recently our website's code signing certificate expired and we had to update a jar that we got from a third party a long time ago. There was no way to get the unsigned jar back, so I had to find a trick to replace the expired certificate. The solution was elegant and simple:
  1. Unzip the jar
  2. Remove the META-INF folder
  3. Use jar command to create the jar again
  4. Use jarsigner to sign the jar with the updated keystore
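
Steps 1 to 3 can also be scripted. The sketch below is an alternative to the unzip and jar commands, not what I actually ran: it copies every entry except the META-INF folder into a new jar and writes a minimal manifest, roughly what the jar command generates. The function name, the file names and the keystore and alias placeholders are made up for illustration.

```python
import zipfile

# Assumption: a bare manifest is enough for this jar.
MINIMAL_MANIFEST = "Manifest-Version: 1.0\r\n\r\n"

def strip_signatures(signed_jar, unsigned_jar):
    """Copy signed_jar to unsigned_jar without the old META-INF folder."""
    with zipfile.ZipFile(signed_jar) as src, \
         zipfile.ZipFile(unsigned_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        dst.writestr("META-INF/MANIFEST.MF", MINIMAL_MANIFEST)
        for info in src.infolist():
            # Skip the old manifest and signature files (*.SF, *.RSA, ...)
            if info.filename.upper().startswith("META-INF/"):
                continue
            dst.writestr(info.filename, src.read(info.filename))

if __name__ == "__main__":
    strip_signatures("thirdparty-signed.jar", "thirdparty.jar")
    # Step 4 is still done with jarsigner, for example:
    #   jarsigner -keystore <keystore> thirdparty.jar <alias>
```

Note that dropping META-INF also drops any custom manifest attributes, so check the old MANIFEST.MF first if the jar depends on entries such as Main-Class.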

Tuesday, April 6, 2010

Sending CTRL + BREAK to a java linux process

Use "kill -QUIT pid" to send the equivalent of CTRL + BREAK to a running Java process on Linux. The JVM then prints a thread dump to its standard output.

Ant append to a file using echo task

Learnt something new: the echo task can write its output to a file, and with append="true" it appends to an existing file instead of overwriting it.

        <propertyfile
            file="${deploy.path}/svninfo.txt"
            comment="File containing build version,Build Date and svn info">
          <entry  key="Version" value="${revisionProperty}"/>
          <entry  key="Build Date" type="date" value="now"/>
        </propertyfile>

        <echo file="${deploy.path}/svninfo.txt" append="true">
==============Svn Info==============
${svnInfoOut} 
====================================
        </echo>

RabbitMQ purge a queue

Such a simple operation is not available in rabbitmqctl: you can list the queues but not clear them, so I wrote a Python client for it. A better solution is to install the BQL plugin, but for now this will suffice.

import sys
from amqplib import client_0_8 as amqp

if __name__ == '__main__':
    if len(sys.argv) < 6:
        print("Usage: python purge_queue.py mq_url mq_user mq_pass mq_vhost queue_name")
        sys.exit(1)

    mq_url = sys.argv[1]
    mq_user = sys.argv[2]
    mq_pass = sys.argv[3]
    mq_vhost = sys.argv[4]
    mq_queue_name = sys.argv[5]

    conn = amqp.Connection(host=mq_url,
                           userid=mq_user,
                           password=mq_pass,
                           virtual_host=mq_vhost,
                           insist=False)
    chan = conn.channel()
    # queue_purge returns the number of messages that were removed,
    # so a non-zero value means success, not leftover messages.
    n = chan.queue_purge(mq_queue_name)
    print("purged %s messages from %s" % (n, mq_queue_name))
    chan.close()
    conn.close()