Nokogiri’s Xpath Search Is Fast(er)!

{{{
Recently I ran into an architectural problem when parsing XML with “Nokogiri”:https://github.com/sparklemotion/nokogiri. I used an xpath to find child elements in a document. Having concluded that replacing that xpath @#search@ with a hand-rolled @#find_all@ would lead to a better design, I set up a quick benchmark first.

The XML contains a root node with 1000 empty child elements.

<root>
  <item/>
  <item/>
  ... 998 times
</root>

This is the code I am using now.

Nokogiri::XML(xml).root.search("./item")

Note the use of @#search@, which evaluates the xpath expression and returns a list of matching nodes.

Here is the replacement code.

Nokogiri::XML(xml).root.children.find_all do |c|
  c.name == "item"
end

Instead of invoking the internal search I do it myself by querying each child.

Benchmarking time.

xpath:    0.003901151
find_all: 0.014400985

Going the “official” way by *using an xpath is about 3.7 times faster!* Wow.

It turns out that the manual comparison in @find_all@ is the bottleneck. I guess Nokogiri has some internal optimization which saves the creation of the child nodes.

Nokogiri::XML(xml).root.children

children: 0.003361085

Fetching the children alone takes about the same amount of time as the xpath search, and that is without filtering for matching elements.

Here’s the “benchmark code”:https://gist.github.com/4545761. I’ll keep going with the xpath search.
}}}


2 thoughts on “Nokogiri’s Xpath Search Is Fast(er)!”

  1. Yes, the underlying libxml2 library is implemented in C and is very fast, and in Nokogiri you’ll always be better off using built-ins than Ruby code. Note that you could have used .xpath("item") instead of search(), because search() first determines whether you are using XPath or CSS, and without the leading dot-slash it would have assumed CSS, which behaves differently.

  2. Mark: Thanks! I thought xpath("item") would return any item node in the tree, not just the children of the context node – RTFM would have helped 😉

    Also, I thought Nokogiri uses libxml2 just for parsing, not searching etc. Makes sense now.
