Thursday, July 22, 2010

PMC Open Access JAXB Bindings

If you're trying to generate Java JAXB classes for the NLM Journal Publishing DTD (http://dtd.nlm.nih.gov/publishing/w3c-schema.html), you're going to run into some binding problems. Some of these are related to MathML and some to the NLM schema itself. In particular, mine look like:
Trying to override old definition of datatype resources
OpenAccesJaxb:
     [exec] parsing a schema...
     [exec] [ERROR] Property "Title" is already defined. Use <jaxb:property> to resolve this conflict.
     [exec]   line 1185 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/journalpublishing3.xsd
     [exec] [ERROR] The following location is relevant to the above error
     [exec]   line 1227 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/journalpublishing3.xsd
     [exec] [ERROR] Element "{http://www.w3.org/1998/Math/MathML}ms" shows up in more than one properties.
     [exec]   line 132 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/presentation/scripts.xsd
     [exec] [ERROR] The following location is relevant to the above error
     [exec]   line 113 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/presentation/tokens.xsd
     [exec] [ERROR] Property "MiOrMoOrMn" is already defined. Use <jaxb:property> to resolve this conflict.
     [exec]   line 132 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/presentation/scripts.xsd
     [exec] [ERROR] The following location is relevant to the above error
     [exec]   line 138 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/presentation/scripts.xsd
     [exec] [ERROR] Element "{http://www.w3.org/1998/Math/MathML}ms" shows up in more than one properties.
     [exec]   line 138 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/presentation/scripts.xsd
     [exec] [ERROR] The following location is relevant to the above error
     [exec]   line 113 of http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/presentation/tokens.xsd
     [exec] Failed to parse a schema.
     [exec] Result: -1

The easiest solution is to add a custom bindings file. Thanks to this useful post, the MathML errors are easily resolved; then you just need one more binding to fix the "Title" property problem. My final XJB file ends up looking like:
<?xml version="1.0" encoding="UTF-8"?>
<bindings xmlns="http://java.sun.com/xml/ns/jaxb" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">

  <bindings schemaLocation="http://dtd.nlm.nih.gov/publishing/3.0/xsd/journalpublishing3.xsd"
    node="/xsd:schema/xsd:element[@name='bio']//xsd:element[@ref='title']">
    <property name="biotitle" />
  </bindings>

  <bindings schemaLocation="http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/common/common-attribs.xsd"
    node="/xsd:schema/xsd:attributeGroup[@name='Common.attrib']/xsd:attribute[@name='class']">
    <property name="clazz" />
  </bindings>

  <bindings schemaLocation="http://dtd.nlm.nih.gov/publishing/3.0/xsd/ncbi-mathml2/presentation/scripts.xsd"
    node="/xsd:schema/xsd:group[@name='mmultiscripts.content']">
    <property name="content" />
  </bindings>

</bindings>
and my ant task to generate the bindings looks like:
<target name="OpenAccesJaxb" description="Generate the Open Access JAXB bindings">
  <exec executable="xjc">
    <arg value="-b"/>
    <arg value="pmcOa.xjb"/>
    <arg value="-d"/>
    <arg value="src"/>
    <arg value="-p"/>
    <arg value="${pmc.oa.package.jaxb}"/>
    <arg value="${pmc.oa.schema}"/>
  </exec>
</target>

Where the last two properties point to the package the code should be generated in and the PubMed Central schema location, respectively...
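Once xjc runs cleanly, using the generated classes is plain JAXB. Here's a minimal sketch; the context package and file name are placeholders (use whatever package you passed with -p and a real article file), and depending on how the root element is declared you may get back a JAXBElement that needs unwrapping:
import java.io.File;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;

public class UnmarshalArticle {
 public static void main(String[] args) throws Exception {
  // Use the same package you passed to xjc with -p (${pmc.oa.package.jaxb});
  // xjc generates the ObjectFactory that JAXBContext needs in that package
  JAXBContext context = JAXBContext.newInstance("org.example.pmc.oa.jaxb");
  Unmarshaller unmarshaller = context.createUnmarshaller();
  // sample-article.xml is a placeholder for any article in the PMC OA set
  Object article = unmarshaller.unmarshal(new File("sample-article.xml"));
  System.out.println("Unmarshalled: " + article.getClass().getName());
 }
}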

Friday, May 7, 2010

Remote Katta client example

I couldn't find much information on running external Katta clients, so here's an example of how I ended up getting it to work:
import java.util.List;
import java.util.logging.Logger;

import net.sf.katta.lib.lucene.Hit;
import net.sf.katta.lib.lucene.Hits;
import net.sf.katta.lib.lucene.ILuceneClient;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

import com.google.inject.Inject;
import com.google.inject.internal.Lists;

public class SomeSearchKattaImpl implements SomeSearch {

 private final Logger logger;
 private final Analyzer analyzer;
 private final ILuceneClient client;
 private final List<String> indexedFields;
 // Name of the deployed Katta index to search against
 private static final String[] indexName = {"field"};

 @Inject
 protected SomeSearchKattaImpl(Logger logger, StandardAnalyzer analyzer,
   ILuceneClient client) {
  this.logger = logger;
  this.client = client;
  this.analyzer = analyzer;

  indexedFields = Lists.newArrayList();
  // The Lucene field names to search across; substitute your own
  indexedFields.add("field1");
  indexedFields.add("field2");
  this.logger.fine(getClass().getName() + " loaded...");
 }
 
 private Query getQuery(String query) throws ParseException {
  MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_30,
    indexedFields.toArray(new String[0]), analyzer);
  return parser.parse(query);
 }
 
 @Override
 public List<ResultLite> getResults(
   String query, int start, int count) throws Exception {
  Query q = getQuery(query);
  Hits hits = client.search(q, indexName, start + count);
  // Trim the hits down to just the requested page
  List<Hit> window = hits.getHits().subList(start,
    Math.min(hits.getHits().size(), start + count));
  List<ResultLite> results = Lists.newArrayList();
  //TODO: Should we limit this by field?
  for (MapWritable writable : client.getDetails(window)) {
   ResultLite result = new ResultLite();
   result.setSomething(writable.get(new Text("title")).toString());
   results.add(result);
  }
  return results;
 }
}

and I injected the LuceneClient like this:
@Provides
public ILuceneClient getLuceneClient() {
 ZkConfiguration config =
   new ZkConfiguration("/katta/katta.zk.properties");

 ILuceneClient client = new LuceneClient(config);
 return client;
}
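To exercise it, here's a rough sketch of the calling side; the module name SearchModule and the query string are made up for illustration:
import java.util.List;

import com.google.inject.Guice;
import com.google.inject.Injector;

public class SearchDemo {
 public static void main(String[] args) throws Exception {
  // SearchModule is a hypothetical module that binds SomeSearch to
  // SomeSearchKattaImpl and includes the getLuceneClient() provider above
  Injector injector = Guice.createInjector(new SearchModule());
  SomeSearch search = injector.getInstance(SomeSearch.class);

  // First ten hits for an arbitrary query string
  List<ResultLite> results = search.getResults("some query", 0, 10);
  System.out.println(results.size() + " results");
 }
}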

Thursday, February 18, 2010

Postgres JDBC as a Memory Hog

I had a strange problem today while building an offline index from a large Postgres ResultSet. I launched jconsole and just watched the memory disappear until the heap was exhausted. After digging around, it turns out that the Postgres JDBC driver (unlike other JDBC drivers) will not use a cursor unless you set auto-commit to false on your connection:
conn.setAutoCommit(false);
In addition, I set specific options on the Statement:
Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
which may also have helped, but I didn't test without those settings. Now I can watch the memory oscillate in jconsole, but it never gets above 50 MB and, more importantly, I can close my ticket...
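For reference, the whole pattern ends up looking something like the sketch below. The connection details, query, and indexDocument() method are placeholders; note that the Postgres documentation also calls for a non-zero fetch size before the driver will fetch with a cursor:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingIndexer {
 public static void main(String[] args) throws Exception {
  // URL and credentials are made up
  Connection conn = DriverManager.getConnection(
    "jdbc:postgresql://localhost/somedb", "someuser", "somepass");
  // Without this the driver materializes the entire ResultSet in memory
  conn.setAutoCommit(false);
  Statement stmt = conn.createStatement(
    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
  // Pull rows from the server in batches of 1000 instead of all at once
  stmt.setFetchSize(1000);
  ResultSet rs = stmt.executeQuery("SELECT id, content FROM documents");
  while (rs.next()) {
   // Index each row as it streams in rather than holding them all in memory
   indexDocument(rs.getLong("id"), rs.getString("content"));
  }
  rs.close();
  stmt.close();
  conn.close();
 }

 private static void indexDocument(long id, String content) {
  // placeholder for the real offline indexing work
 }
}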