March 21, 2008

Fast XML Pull Parser 0.3 released

I've been doing quite a bit of work on Faxpp recently. My enthusiasm had ground to a halt for a while after I realised the full complexity of implementing entities, but then I decided I just needed to knuckle down and get it finished. The fruit of my labours can now be downloaded from SourceForge.

I think I've got a robust framework for resolving and parsing internal and external entities - and I've learnt things about XML that I'm not sure many people in the world know:

Continue reading "Fast XML Pull Parser 0.3 released"

Posted by john at 12:18 AM | Comments (7)

March 18, 2008

XQilla in the News

Oracle officially announced the XQilla license change today. It feels like this has been a long time coming - I was involved in pushing for the original Pathan project to be open sourced in 2003 when I worked at Decisionsoft. Later when I worked for Sleepycat, I was involved in pushing for a liberally licensed release of the XQuery implementation and improvements to Pathan which became XQilla some 3 years later.

It's great to see something that I've worked on for the last 7 years start to get the exposure I always thought it deserved. XQuery has huge potential to change the way that people use their data, and its close relationship to the web means now might be the right time, and XQilla might be in the right place.

Thanks have to go to Mike Olson, who put in the lion's share of the work needed to make this happen. Hopefully his efforts will make it easier for even more Oracle code to reach its potential by being released as open source.

Posted by john at 11:19 PM | Comments (2)

March 12, 2008

Google Summer of Code 2008

I've just finished the application process for XQilla to be a part of Google Summer of Code 2008. I'm always coming up with way too many projects to do, and there's never enough time to get around to all of them - the ideas list we've put together has some of the most interesting and self-contained of those projects. Take a look if you fancy learning more about XQuery, open source or XQilla.

Posted by john at 01:39 PM | Comments (1)

February 14, 2008

Parsing JSON into XQuery

Doug Crockford stirred things up at XML 2007 with his comments on JSON and XML. He was wrong - mainly because he's only looking at structured data transfer, and ignoring the other things that XML is very good at, such as documents and semi-structured data. However, he got me thinking about how easy it would be to process JSON in XQuery, which, as Data Direct has shown, is very good at manipulating all sorts of data formats.

Continue reading "Parsing JSON into XQuery"

Posted by john at 05:21 PM | Comments (6)

February 12, 2008

DB XML with Python

At Oxford Geek Night 5 last week I bumped into James Gardner, apparently now known as a Pylons Guru, and a friend of mine from university.

I encouraged him to take a look at Berkeley DB XML's Python bindings, which he has done. His initial experience is written up in his latest blog post, which is worth reading if you're looking to get DB XML working from Python.

I said the same thing to James as I did to Greg Pollack when I met him at XML 2007 - why don't agile web frameworks like Pylons and Ruby on Rails support XML databases as a back end, rather than a SQL database? Surely XML databases are a much better fit for most data in web frameworks?

Posted by john at 04:23 PM | Comments (0)

January 25, 2008

Latency vs Throughput

I was helping a Berkeley DB XML user recently who complained that his query took as long as ~55s to run. It turned out he was trying to run 40 concurrent queries, so I tried out the query on his data set using our concurrency testing framework. I was confused when it reported to me that I was getting ~2 ops/s, which is a completely different ballpark.

It took me a while, but eventually I figured it out: When the user was saying that his query took ~55s what he meant was that a thread's view of how long a query takes can be as much as 55s. When I talked about getting ~2 ops/s, I was looking at the system as a whole. The problem was that he was talking about latency, and I was talking about throughput.
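The two numbers can be roughly reconciled with Little's law, which relates the average number of requests in flight L, the throughput λ, and the average latency W. This is only a back-of-the-envelope sketch, assuming all 40 queries stay in flight continuously and the system is in a steady state:

```latex
\[
L = \lambda W
\quad\Longrightarrow\quad
W = \frac{L}{\lambda} \approx \frac{40\ \text{queries}}{2\ \text{queries/s}} = 20\ \text{s}
\]
```

Under those assumptions the average latency would be around 20s, so a worst case of ~55s for an individual thread is not out of line with ~2 ops/s for the system as a whole.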

Typically the way that a server gets lower latency is to implement a thread pool and work queue. Since DB XML is an embedded library and does not spawn its own threads, it is down to the application using DB XML to manage its threads well.

So I created a Java test program that used a thread pool to manage the queries being run through DB XML (in Java it's really easy). I created a test function that could run the query with arguments of how many threads to create, how many queries to run, and how many queries to start per second.
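A minimal sketch of that kind of test harness might look like the following. It is not the original test program: the class name `QueryLoadTest`, the `Result` type, and the `Runnable` standing in for a DB XML query are all hypothetical, and no DB XML API is used. It only shows the shape of the idea, using the standard `java.util.concurrent` thread pool, with latency measured from submission (so time spent waiting in the work queue counts) and throughput measured over the whole run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class QueryLoadTest {

    /** Summary of one run (hypothetical type, for illustration only). */
    public static final class Result {
        public final int completed;
        public final double maxLatencySeconds;
        public final double throughputPerSecond;

        Result(int completed, double maxLatencySeconds, double throughputPerSecond) {
            this.completed = completed;
            this.maxLatencySeconds = maxLatencySeconds;
            this.throughputPerSecond = throughputPerSecond;
        }
    }

    /**
     * Runs `queries` copies of `query` on a fixed pool of `threads` workers,
     * submitting them at a rate of `startsPerSecond`.
     */
    public static Result run(int threads, int queries, double startsPerSecond, Runnable query) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long intervalNanos = (long) (1_000_000_000L / startsPerSecond);
        List<Future<Long>> latencies = new ArrayList<>();
        long begin = System.nanoTime();
        try {
            for (int i = 0; i < queries; i++) {
                // Pace submissions so that query i starts no earlier than its slot.
                long wait = begin + i * intervalNanos - System.nanoTime();
                if (wait > 0) {
                    TimeUnit.NANOSECONDS.sleep(wait);
                }
                final long submitted = System.nanoTime();
                // Each task returns its own latency, including time queued in the pool.
                latencies.add(pool.submit(() -> {
                    query.run();
                    return System.nanoTime() - submitted;
                }));
            }
            long maxLatency = 0;
            for (Future<Long> f : latencies) {
                maxLatency = Math.max(maxLatency, f.get());
            }
            double elapsedSeconds = (System.nanoTime() - begin) / 1e9;
            return new Result(queries, maxLatency / 1e9, queries / elapsedSeconds);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real query: sleep for 50 ms.
        Runnable fakeQuery = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Result r = run(4, 40, 100.0, fakeQuery);
        System.out.printf("completed=%d maxLatency=%.3fs throughput=%.1f ops/s%n",
                r.completed, r.maxLatencySeconds, r.throughputPerSecond);
    }
}
```

Varying the three arguments separates the two measurements: more queries started per second than the pool can service pushes the maximum latency up through queueing, while the throughput figure stays bounded by what the worker threads can complete.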

Continue reading "Latency vs Throughput"

Posted by john at 10:42 AM | Comments (2)

January 17, 2008

Why DB XML Doesn't Do Path Indexes

Every so often we have a Berkeley DB XML user ask us if we implement path indexes. Or maybe they assume that we do implement them and are confused about why their index isn't working. It turns out that a lot of these users have come to us after using eXist, which for a long time only allowed path indexes to be specified.

A path index is specified by describing the path to the nodes that you want to index - often with a subset of XPath. So you might want to index all "firstname" elements, but only if they are children of an "author" element - which you might write like this:

//author/firstname
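To make the selection concrete, here is a small sketch using the standard Java XPath API (`javax.xml.xpath`) to show which nodes that path picks out of a made-up document. It does not build an index and does not use the DB XML or eXist APIs; the class name `PathIndexDemo` and the sample XML are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PathIndexDemo {

    // Invented sample data: one "firstname" under "author", one under "editor".
    static final String XML =
            "<library><book>"
            + "<author><firstname>Ada</firstname></author>"
            + "<editor><firstname>Grace</firstname></editor>"
            + "</book></library>";

    /** Returns the text of every node selected by `path`, in document order. */
    public static List<String> select(String path) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    path, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Only the firstname whose parent is an author element is selected.
        System.out.println(select("//author/firstname")); // prints [Ada]
        // Without the parent step, every firstname element is selected.
        System.out.println(select("//firstname"));        // prints [Ada, Grace]
    }
}
```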

Continue reading "Why DB XML Doesn't Do Path Indexes"

Posted by john at 04:55 PM | Comments (0)