Friday, 02 April 2010 - Nesterovsky bros

Friday, 02 April 2010

Xslt 1.0 has been designed based on the best intentions. Xslt 2.0 got a legacy baggage.

If you're not entirely concentrated during translation of your algorithms into xslt 2.0 you can get into trap, as we did.

Consider a code snapshot:

<xsl:variable name="elements" as="element()+"> <xsl:apply-templates/> </xsl:variable> <xsl:variable name="converted-elements" as="element()+" select="$elements/t:convert(.)"/>

Looks simple, isn't it?

Our intention was to get converted elements, which result from some xsl:apply-templates logic.

Well, this code works... but rather sporadically, as results are often in wrong order! This bug is very close to what is called a Heisenbug.

So, where is the problem?

Elementary, my dear Watson:

xsl:apply-templates constructs a sequence of rootless elements.
$elements/t:convert(.) converts elements and orders them in document order.

Here is a tricky part:

The relative order of nodes in distinct trees is stable but implementation-dependent...

Clearly each rootless element belongs to a unique tree.

After that we have realized what the problem is, code has been immediately rewritten as:

<xsl:variable name="elements" as="element()+"> <xsl:apply-templates/> </xsl:variable> <xsl:variable name="converted-elements" as="element()+" select=" for $element in $elements return t:convert($element)"/>

P.S. Taking into an accout a size of our xslt code base, it took a half an hour to localize the problem. Now, we're at position to review all uses of slashes in xslt. As you like it?

Friday, 02 April 2010 17:53:18 UTC

Comments [0] -
Thinking aloud | xslt

Saturday, 27 March 2010

Xml namespaces

Opinions on xml namespaces

olegtk: @srivatsn Originally the idea was that namespace URI would point to some schema definition. Long abandoned idea.

Not so long ago, I've seen a good reasoning about the same subject:

XML Namespaces by James Clark;
A comment by M. Kay.

Saturday, 27 March 2010 09:49:45 UTC

Comments [0] -
xslt

Monday, 22 March 2010

Inline functions in xslt 2.1

Inline functions in xslt 2.1 look often as a some strange aberration. Sure, there are very usefull cases when they are delegates of program logic (e.g. comparators, and filters), but often (probably more often) we can see that it's use is to model data structures.

As an example, suppose you want to model a structure with three properties say a, b, and c. You implement this creating functions that wrap and unwrap the data:

function make-data($a as item(), $b as item(), $c as item()) as function() as item()+ { function() { $a, $b, $c } }

function a($data as function() as item()+) as item() { $data()[1] }

function b($data as function() as item()+) as item() { $data()[2] }

function c($data as function() as item()+) as item() { $data()[3] }

Clever?

Sure, it is! Here, we have modeled structrue with the help of sequence, which we have wrapped into a function item.

Alas, clever is not always good (often it's a sign of a bad). We just wanted to define a simple structure. What it has in common with function?

There is a distance between what we want to express, designing an algorithm, and what we see looking at the code. The greater the distance, the more efforts are required to write, and to read the code.

It would be so good to have simpler way to express such concept as a structure. Let's dream a little. Suppose you already have a structure, and just want to access its members. An idea we can think about is an xpath like access method:

$data/a, $data/b, $data/c

But wait a second, doesn't $data looks very like an xml element, and its accessors are just node tests? That's correct, so data constructor may coincide with element constructor.

Then what pros and cons of using of xml elements to model structures?

Pros are: existing xml type system, and sensibly looking code (you just understand that here we're constructing a structure).

Cons are: xml trees are implemented the way that does not assume fast (from the perfromace perspective) composition, as when you construct a structure a copy of data is made.

But here we observe that "implemented" is very important word in this context. If xml tree implementation would not store reference to the parent node then subtrees could be composed very efficiently (note that tree is assumed to be immutable). Parent node could be available through a tree navigator, which would contain reference to a node itself and to a parent tree navigator (or to store child parent map somewhere near the root).

Such tree structure would probably help not only in this particular case but also with other conventional xslt code patterns.

P.S. Saxon probably could implement its NodeInfo this way.

Update: see also Custom tree model.

Monday, 22 March 2010 11:02:07 UTC

Comments [0] -
Thinking aloud | xslt

Monday, 15 March 2010

Maps in exslt2?

A while ago we have proposed to introduce maps as built-in types in xpath/xquery type system: Tuples and maps.

The suggestion has been declined (probably our arguments were not convincing). We, however, still think that maps along with sequences are primitives, required to build sensible (not obscure one) algorithms. To see that map is at times is the best way to resolve the problems we shall refer to an utility function to allocate names in scope. Its signature looks like this:

 <xsl:function name="t:allocate-names" as="xs:string*"> <xsl:param name="prefixes" as="xs:string*"/> <xsl:param name="names" as="xs:string*"/> <xsl:param name="name-max-length" as="xs:integer?"/>

Just try to program it and you will find yourselves coding something like one defined at cobolxom.

To be efficient such maps should provide, at worst, a logarithmic operation complexity:

Access to the map through a key (and/or by index) - complexity is LnN;
Creation of a new map with added or removed item - complexity is LnN;
Construction of the map from ordered items - complexity is O(N);
Construction of the map from unordered items - complexity is O(N*LnN);
Total enumeration of all items - complexity is O(N*LnN);

These performance metrics are found in many functional and procedural implementations of the maps. Typical RB and AVL tree based maps satisfy these restrictions.

What we suggest is to introduce map implementation into the exslt2 (assuming inline functions are present). As a sketch we have implemented pure AVL Tree in Xslt 2.0:

test.xslt - a little test;
tree.xslt - an AVL tree implementation;
tuple.xslt - a tuple interface and implementation.

We do not assume that implementation like this should be used, but rather think of some extension function(s) that provides a better performance.

What do you think?

Monday, 15 March 2010 13:59:19 UTC

Comments [1] -
Thinking aloud | xslt

Sunday, 28 February 2010

AVL Tree in Xslt

The story about immutable tree would not be complete without xslt implementation. To make it possible one needs something to approxomate tree nodes. You cannot implement such consruct efficiently in pure xslt 2.0 (it would be either unefficient or not pure).

To isolate the problem we have used tuple interface:

tuple:ref($items as item()*) as item() - to wrap items into a tuple;
tuple:deref($tuple as item()?) as item()* - to unwrap items from a tuple;
tuple:is-same($first as item(), $second as item()) as xs:boolean - to test whether two tuples are the same.

and defined inefficient implementation based on xml elements. Every other part of code is a regular AVL algorithm implementation.

We want to stress again that even assuming that there is a good tuple implementation we would prefer built-in associative container implementation. Why the heck you need to include about 1000 lines of code just to use a map?

Source code is:

test.xslt - a little test;
tree.xslt - an AVL tree implementation;
tuple.xslt - a tuple interface and implementation.

Sunday, 28 February 2010 19:28:07 UTC

Comments [0] -
Thinking aloud | xslt

VS 2010, Still Beta

We like Visual Studio very much, and try to adopt new version earlier.

For the last time our VS's use pattern is centered around xml and xslt. In our opinion VS 2008 is the best xslt 2 editor we have ever seen even with lack of support of xslt 2.0 debugging.

Unfortunatelly, that is still a true when VS 2010 is almost out. VS 2008 is just 2 - 3 times faster. You can observe this working with xslt files like those in languages-xom.zip (1000 - 2000 rows). Things just become slow.

We still hope that VS 2010 will make a final effort to outdo what VS 2008 has already delivered.

Sunday, 28 February 2010 18:37:58 UTC

Comments [0] -
Thinking aloud | xslt

Thursday, 25 February 2010

t:allocate-names() implementation

While bemoaning about lack of associative containers in xpath type system, we have came up with a good implementation of t:allocate-names(). Implementation can be seen at location cobol-names.xslt.

It is based on recursion and on the use of xsl:for-each-group. Alogrithmic worst case complexity is O(N*LogN*LogL), where N is number of names, and L is a length of a longest name.

This does not invalidate the idea that associative containers are very wishful, as blessed one who naturally types such implementation. For us, it went the hard way, and has taken three days to realize that original algorithm is problematic, and to work out the better one.

In practice this means 2 seconds for the new implementation against 25 minutes for the old one.

Thursday, 25 February 2010 07:19:06 UTC

Comments [0] -
xslt

Wednesday, 24 February 2010

Welcome to VS 2010!

Ladies and gentlemen of the jury...

Xml Editor in VS 2010

Wednesday, 24 February 2010 21:00:36 UTC

Comments [7] -

Again on immutable binary trees

Why do we return to this theme again?

Well, it's itching!

In cobolxom there is an utility function to allocate names in scope. Its signature looks like this:

We have created several different implementations (all use recursion). Every implementation works fair for relatively small input sequences, say N < 100, but we have cases when there are about 10000 items on input. Algorithm's worst case complexity, in absence of associative containers, is O(N*N), and be sure it's an O-o-o-oh... due to xslt engine implementation.

If there were associative containers with efficient access (complexity is O(LogN)), and construction of updated container (complexity is also O(LogN)) then implementation would be straightforward and had complexity O(N*LogN).

Wednesday, 24 February 2010 07:34:07 UTC

Comments [0] -
Thinking aloud | xslt

Wednesday, 17 February 2010

Integer and fraction in xslt

The very same simple tasks tend to appear in different languages (e.g. C# Haiku). Now we have to find:

integer and fractional part of a decimal;
length and precision of a decimal.

These tasks have no trivial solutions in xslt 2.0.

At present we have came up with the following answers:

Fractional part:

<xsl:function name="t:fraction" as="xs:decimal"> <xsl:param name="value" as="xs:decimal"/> <xsl:sequence select="$value mod 1"/> </xsl:function>

Integer part v1:

<xsl:function name="t:integer" as="xs:decimal"> <xsl:param name="value" as="xs:decimal"/> <xsl:sequence select="$value - t:fraction($value)"/> </xsl:function>

Integer part v2:

<xsl:function name="t:integer" as="xs:decimal"> <xsl:param name="value" as="xs:decimal"/> <xsl:sequence select=" if ($value ge 0) then floor($value) else -floor(-$value)"/> </xsl:function>

Length and precision:

<xsl:function name="t:decimal-spec" as="xs:integer+"> <xsl:param name="value" as="xs:decimal"/> <xsl:variable name="text" as="xs:string" select=" if ($value lt 0) then xs:string(-$value) else xs:string($value)"/> <xsl:variable name="length" as="xs:integer" select="string-length($text)"/> <xsl:variable name="integer-length" as="xs:integer" select="string-length(substring-before($text, '.'))"/> <xsl:sequence select=" if ($integer-length) then ($length - 1, $length - $integer-length - 1) else ($length, 0)"/> </xsl:function>

The last function looks odious. In many other languages its implementation would be considered as embarrassing.

Wednesday, 17 February 2010 07:29:55 UTC

Comments [0] -
Tips and tricks | xslt

Sunday, 07 February 2010

C# readonly ring

Given:

public class N { public readonly N next; }

What needs to be done to construct a ring of N: n1 refers to n2, n2 to n3, ... nk to n1? Is it possible?

Sunday, 07 February 2010 07:57:08 UTC

Comments [2] -
Thinking aloud | Tips and tricks

Saturday, 06 February 2010

Map, based on immutable trees

To end with immutable trees, at least for now, we've implemented IDictionary<K, V>. It's named Map<K, V>. Functionally it looks very like SortedDictionary<K, V>. there are some differences, however:

Map in contrast to SortedDictionary is very cheap on copy.
Bacause Map is based on AVL tree, which is more rigorly balanced than RB tree, so it's a little bit faster asymptotically for lookup than SortedDictionary, and a little bit slower on modification.
Due to the storage structure: node + navigator, Map consumes less memory than SortedDictionary, and is probably cheaper for GC (simple garbage graphs).
As AVL tree stores left and right subtree sizes, in contrast to a "color" in RB tree, we able to index data in two ways: with integer index, and with key value.

Sources are:

Update:

It was impossible to withstand temptation to commit some primitive performance comparision. Map outperforms SortedDictionary both in population and in access. this does not aggree with pure algorithm's theory, but there might be other unaccounted factors: memory consumption, quality of implementation, and so on.

Program.cs is updated with measurements.

Update 2:

More occurate tests show that for some key types Map's faster, for others SortedDictionary's faster. Usually Map's slower during population (mutable AVL tree navigator may fix this). the odd thing is that Map<string, int> is faster than SortedDictionary<string, int> both for allocaction and for access. See excel report.

Update 3:

Interesing observation. The following table shows maximal and average tree heights for different node sizes in AVL and RB trees after a random population:

	AVL		RB
Size	Max	Avg	Max	Avg
10	4	2.90	5	3.00
50	7	4.94	8	4.94
100	8	5.84	9	5.86
500	11	8.14	14	8.39
1000	12	9.14	16	9.38
5000	15	11.51	18	11.47
10000	16	12.53	20	12.47
50000	19	14.89	23	14.72
100000	20	15.90	25	15.72
500000	25	18.26	28	18.27
1000000	25	19.28	30	19.27

Here, according with theory, the height of AVL tree is shorter than the height of RB tree. But what is most interesting is that the depth of an "average node". This value describes a number of steps required to find a random key. RB tree is very close and often is better than AVL in this regard.

Saturday, 06 February 2010 18:31:13 UTC

Comments [0] -
Thinking aloud | Tips and tricks

Wednesday, 03 February 2010

Convenience of Generics in C#

It was obvious as hell from day one of generics that there will appear obscure long names when you will start to parametrize your types. It was the easiest thing in the world to take care of this in advanvce. Alas, C# inherits C++'s bad practices.

Read Associative containers in a functional languages and Program.cs to see what we're talking about.

Briefly, there is a pair (string, int), which in C# should be declared as:

System.Collections.Generic.KeyValuePair<string, int>

Obviously we would like to write it in a short way. These are our attempts, which fail:

1. Introduce generic alias Pair<K, V>:

using System.Collections.Generic; using Pair<K, V> = KeyValuePair<K, V>;

2. Introduce type alias for a generic type with specific types.

using System.Collections.Generic; using Pair = KeyValuePair<string, int>;

And this is only one that works:

using Pair = System.Collections.Generic.KeyValuePair<string, int>;

Do you think is it bearable? Well, consider the following:

There is a generic type ValueNode<T>, where T should be Pair.
There is a generic type TreeNavigator<N>, where N is should be ValueNode<Pair>.

The declaration looks like this:

using Pair = System.Collections.Generic.KeyValuePair<string, int>; using Node = NesterovskyBros.Collections.AVL.ValueNode< System.Collections.Generic.KeyValuePair<string, int>>; using Navigator = NesterovskyBros.Collections.AVL.TreeNavigator< NesterovskyBros.Collections.AVL.ValueNode< System.Collections.Generic.KeyValuePair<string, int>>>;

Do you still think is it acceptable?

P.S. Legacy thinking led C#'s and java's designers to the use of word "new" for the object construction. It is not required at all. Consider new Pair("A", 1) vs Pair("A", 1). C++ prefers second form. C# and java always use the first one.

Wednesday, 03 February 2010 11:59:19 UTC

Comments [1] -
Thinking aloud | Tips and tricks

Wednesday, 27 January 2010

Associative containers in a functional languages

Continuing with the post "Ongoing xslt/xquery spec update" we would like to articulate what options regarding associative containers do we have in a functional languages (e.g. xslt, xquery), assuming that variables are immutable and implementation is efficient (in some sense).

There are three common implementation techniques:

store data (keys, value pairs) in sorted array, and use binary search to access values by a key;
store data in a hash map;
store data in a binary tree (usually RB or AVL trees).

Implementation choice considerably depends on operations, which are taken over the container. Usually these are:

construction;
value lookup by key;
key enumeration (ordered or not);
container modification (add and remove data into the container);
access elements by index;

Note that modification in a functional programming means a creation of a new container, so here is a division:

If container's use pattern does not include modification, then probably the simplest solution is to build it as an ordered sequence of pairs, and use binary search to access the data. Alternatively, one could implement associative container as a hash map.
If modification is essential then neither ordered sequence of pairs, hash map nor classical tree implementation can be used, as they are either too slow or too greedy for a memory, either during modification or during access.

On the other hand to deal with container's modifications one can build an implementation, which uses "top-down" RB or AVL trees. To see the difference consider a classical tree structure and its functional variant:

	Classical	Functional
Node structure:	`node parent left right other data`	`node left right other data`
Node reference:	node itself	node path from a root of a tree
Modification:	either mutable or requires a completely new tree	O(LnN) nodes are created

Here we observe that:

one can implement efficient map (lookup time no worse than O(LnN)) with no modification support, using ordered array;
one can implement efficient map with support of modification, using immutable binary tree;
one can implement all these algorithms purely in xslt and xquery (provided that inline functions are supported);
any such imlementation will lose against the same implementation written in C++, C#, java;
the best implementation would probably start from sorted array and will switch to binary tree after some size threshold.

Here we provide a C# implementation of a functional AVL tree, which also supports element indexing:

Our intention was to show that the usual algorithms for associative containers apply in functional programming; thus a feature complete functional language must support associative containers to make development more conscious, and to free a developer from inventing basic things existing already for almost a half of century.

Wednesday, 27 January 2010 07:00:55 UTC

Comments [0] -
Thinking aloud | Tips and tricks | xslt

Tuesday, 19 January 2010

Unintentional help

Several years ago we have started a new project. We do not like neither hate any particular language, thus the decision what language to use was pragmatical: xslt 2.0 fitted perfectly.

At present it's a solid volume of xslt code. It exhibits all the virtues of any other good project in other language: clean design, modularity, documentation, sophisticationless (good code should not be clever).

Runtime profile of the project is that it deals with xml documents with sizes from a few dozens of bytes to several megabytes, and with xml schemas from simple ones like a tabular data, and to rich like xhtml and untyped. Pipeline of stylesheets processes gigabytes of data stored in the database and in files.

All the bragging above is needed here to introduce the context for the following couple of lines.

The diversity of load conditions and a big code base, exposed xslt engine of choice to a good proof test. The victim is Saxon. In the course of project we have found and reported many bugs. Some of them are nasty and important, and others are bearable. To Michael Kay's credit (he's owner of Saxon) all bugs are being fixed promtly (see the last one).

Such project helps us to understand a weak sides of xslt (it seems sometimes they, in WG, lack such experience, which should lead them through).

Well, it has happened so that we're helping to Saxon project. Unintentionally, however! :-)

P.S. About language preferences.

Nowdays we're polishing a COBOL generation. To this end we have familiarized ourselves with this language. That's the beatiful language. Its straightforwardness helps to see the evolution of computer languages and to understand what and why today's languages try to timidly hide.

Tuesday, 19 January 2010 19:56:04 UTC

Comments [0] -
Thinking aloud | xslt

Navigation