Thursday 15 December 2011

Improving the Feature Join Experience (or: Sort all the things!)

Remember this screenshot?


Here's the current numbers as of right now


How did we achieve such godly performance numbers from a provider (SDF) that cannot take advantage of the recently implemented Feature Join optimization shortcut yet produces numbers just a tad higher than the new shortcut?

The answer: The existing Feature Join engine can actually perform well, it just picked the absolutely wrong join algorithm to do the join!

As explained previously, the GWS Query Engine "joins" two input feature readers (from both sides of the join) and returns a merged feature reader. Depending on the capabilities of both sides of the join, the following join algorithms are employed:

Sort-Merge is the best algorithm of the lot here because it does not need to re-iterate any side of the join to process all the results. The new performance numbers in the second screenshot is due to the use of the Sort-Merge algorithm.

The reason we got such terrible numbers in the first screenshot is because Nested Loops was the algorithm chosen by the GWS Query Engine. It chose this algorithm because it determined that the SDF provider could not support ordered results. 

This is actually incorrect because both SDF and SHP can actually produce sorted results, but can only sort on a single property (which I would wager is the standard Feature Join use case. Joining on multiple properties would be quite uncommon). It turns out that the GWS Query Engine fails to notice this when selecting the join algorithm to use.

So after tweaking the sortability check in the GWS Query Engine, the existing Feature Join performs slightly slower than the optimization shortcut, which brings up another interesting point: Aside from some obscure providers (OGR, WFS), nearly all the vector FDO providers support ordering, meaning now there is a very high chance that the feature joins you set up will have both sides as sortable, and thus the most optimal join algorithm (Sort-Merge) will almost always be used. 

So with Sort-Merge being the most likely algorithm being used with this new modification, performance then boils down to how fast we can read the rows of data and in this arena, SQLite is the king. Another reason why SQLite should be the default format of choice for your spatial data.

So in summary: For best feature join performance, use SQLite or data formats that can be sorted by MapGuide (which will be nearly everything with this patch)

No comments: