RSS 2.0
Sign In
# Friday, January 13, 2012

A database we support for a client contains multi-billion row tables. Many users query the data from that database, and it's permanently populated with a new data.

Every day we load several millions rows of a new data. Such loads can lock tables for a considerable time, so our loading procedures collect new data into intermediate tables and insert it into a final destination by chunks, and usually after work hours.

SQL Server 2008 R2 introduced READ_COMMITTED_SNAPSHOT database option. This feature trades locks for an increased tempdb size (to store row versions) and possible performance degradation during a transaction.

When we have switched the database to that option we did not notice any considerable performance change. Encouraged, we've decided to increase size of chunks of data we insert at once.

Earlier we have found that when we insert no more than 1000 rows at once, users don't notice impact, but for a bigger chunk sizes users start to complain on performance degradation. This has probably happened due to locks escalations.

Now, with chunks of 10000 or even 100000 rows we have found that no queries became slower. But load process became several times faster.

We were ready to pay for increased tempdb and transaction log size to increase performance, but in our case we didn't approach limits assigned by the DBA. Another gain is that we can easily load data at any time. This makes data we store more up to date.

Friday, January 13, 2012 1:43:56 PM UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks
# Saturday, December 17, 2011

Yesterday, by accident, we've seen an article about some design principles of V8 JavaScript Engine. It made clearer what techniques are used in today's script implementations.

In particular V8 engine optimizes property access using "dynamically created hidden classes". These are structures to store object's layout, they are derived when ש new property is created (deleted) on the object. When code accesses a property, and if a cached object's dynamic hidden class is available at the code point then access time is comparable to one of native fields.

In our opinion this tactics might lead to a proliferation of such dynamic hidden classes, which requires a considerable housekeeping, which also slows property write access, especially when it's written for the first time.

We would like to suggest a slightly different strategy, which exploits the cache matches, and does not require a dynamic hidden classes.

Consider an implementation data type with following characteristics:

  • object is implemented as a hash map of property id to property value: Map<ID, Value>;
  • it stores data as an array of pairs and can be accessed directly: Pair<ID, Value> values[];
  • property index can be acquired with a method: int index(ID);

A pseudo code for the property access looks like this:

pair = object.values[cachedIndex];

if (pair.ID == propertyID)
{
   value = pair.Value;
}
else
{
  // Cache miss.
  cachedIndex = object.index(propertyID);
  value = objec.values[cachedIndex].Value;
}

This approach brings us back to dictionary like implementation but with important optimization of array speed access when property index is cached, and with no dynamic hidden classes.

Saturday, December 17, 2011 10:18:14 AM UTC  #    Comments [0] -
Thinking aloud
# Saturday, December 10, 2011

@michaelhkay Saxon 9.4 is out.

But why author does not state that HE version is still xslt/xpath 2.0, as neither xslt maps, nor function items are supported.

Saturday, December 10, 2011 12:16:28 PM UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, December 03, 2011

Recently, we have found and reported the bug in the SQL Server 2008 (see SQL Server 2008 with(recompile), and also Microsoft Connect).

Persons, who's responsible for the bug evaluation has closed it, as if "By Design". This strange resolution, in our opinion, says about those persons only.

Well, we shall try once more (see Microsoft Connect). We have posted another trivial demonstartion of the bug, where we show that option(recompile) is not used, which leads to table scan (nothing worse can happen for a huge table).

Saturday, December 03, 2011 3:06:44 PM UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud
# Thursday, November 10, 2011

A bit history: the first release of this solution was about 9.5 years ago...

Today we've run into a strange situation. One of our clients ask us about automatic conversion of data from mainframe (that were defined as COBOL copybooks) into XML or Java/.NET objects. On our suggestion to use eXperanto, which is well known to him, he stated that he wouldn't like to use a tool of a company that is no more exists...

The situation, in our opinion, become more strange when you consider the following:

  • eXperanto (the design-time tool and run-time libraries for Java and .NET) were developed, well tested, and delivered by us to production already several years ago.
  • the client bought this set (the tool and libraries).
  • the set is in production yet already in another big company, and is used time to time by our company in different migration projects.
  • the client talks with developers of this tool and run-time libraries, and he knows about this fact.
  • the client uses widely open source solutions even without dedicated vendors or support warranties.
Thursday, November 10, 2011 10:14:09 PM UTC  #    Comments [2] -
.NET | Java | Thinking aloud
# Friday, October 28, 2011

It has happened so, that we have never worked with jQuery, however were aware of it.

In early 2000 we have developed a web application that contained rich javascript APIs, including UI components. Later, we were actively practicing in ASP.NET, and later in JSF.

At present, looking at jQuery more closely we regret that we have failed to start using it earlier.

Separation of business logic and presentation is remarkable when one uses JSON web services. In fact server part can be seen as a set of web services representing a business logic and a set of resources: html, styles, scripts, others. Nor ASP.NET or JSF approach such a consistent separation.

The only trouble, in our opinion, is that jQuery has no standard data binding: a way to bind JSON data to (and from) html controls. The technique that will probably be standardized is called jQuery Templates or JsViews .

Unfortunatelly after reading about this binding API, and being in love with Xslt and XQuery we just want to cry. We don't know what would be the best solution for the task, but what we see looks uncomfortable to us.

Friday, October 28, 2011 10:59:23 PM UTC  #    Comments [0] -
ASP.NET | JSF and Facelets | Thinking aloud | Tips and tricks | xslt
# Sunday, October 16, 2011

Incidentally, we have found one new implementation of yield return in java that is in the development stage. Sources can be found at https://github.com/peichhorn/lombok-pg/zipball/master. Just to be sure we have copied those sources at other place peichhorn-lombok-pg-0.10.0-39-g384fb7b.zip (you may search "yield" in the archive).

It's broken according to source tracker, but the funny thing is that sources, however different, still resemble our yield return implementation (Yield.jar, Yield.3.7.jar - Indigo, Yield.zip - sources) very much: variable names, error messages, algorithmic structure.

Those programmers probably have forgotten good manners: to reference a base work, at least.

Well, we generously forgive them this blunder.

P.S. our implementation, in contrast, works without bugs.

P.P.S. misunderstanding is resolved. See comments.

Sunday, October 16, 2011 1:49:33 PM UTC  #    Comments [3] -
Java | Thinking aloud
# Thursday, September 29, 2011

A couple of weeks ago, we have suggested to introduce a enumerator function into the XPath (see [F+O30] A enumerator function):

I would like the WG to consider an addition of a function that turns a sequence into a enumeration of values.

Consider a function like this:  fn:enumerator($items as item()*) as function() as item()?;

alternatively, signature could be:

 fn:enumerator($items as function() as item()*) as function() as item()?;

This function receives a sequence, and returns a function item, which upon N's call shall return N's element of the original sequence. This way, a sequence of items is turned into a function providing a enumeration of items of the sequence.

As an example consider two functions:

a) t:rand($seed as xs:double) as xs:double* - a function producing a random number sequence;
b) t:work($input as element()) as element() - a function that generates output from it's input, and that needs random numbers in the course of the execution.

t:work() may contain a code like this:
  let $rand := fn:enumerator(t:rand($seed)),

and later it can call $rand() to get a random numbers.

Enumerators will help to compose algorithms where one algorithm communicate with other independant algorithms, thus making code simpler. The most obvious class of enumerators are generators: ordered numbers, unique identifiers, random numbers.

Technically, function returned from fn:enumerator() is nondetermenistic, but its "side effect" is similar to a "side effect" of a function generate-id() from a newly created node (see bug #13747, and bug #13494).

The idea is inspired by a generator function, which returns a new value upon each call.

Such function can be seen as a stateful object. But our goal is to look at it in a more functional way. So, we look at the algorithm as a function that produces a sequence of output, which is pure functional; and an enumerator that allows to iterate over algorithm's output.

This way, we see the function that implements an algorithm and the function that uses it can be seen as two thread of functional programs that use messaging to communicate to each other.

Honestly, we doubt that WG will accept it, but it's interesting to watch the discussion.

Thursday, September 29, 2011 11:56:05 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, September 14, 2011

More than month has passed since we have reported a problem to the saxon forum (see Saxon optimizer bug and Saxon 9.2 generate-id() bug).

The essence of the problem is that we have constructed argumentless function to return a unique identifiers each time function is called. To achieve the effect we have created a temporary node and returned its generate-id() value.

Such a function is nondetermenistic, as we cannot state that its result depends on arguments only. This means that engine's optimizer is not free to reorder calls to such a function. That's what happens in Saxon 9.2, and Saxon 9.3 where engine elevates function call out of cycle thus producing invalid results.

Michael Kay, the author of the Saxon engine, argued that this is "a gray area of the xslt spec":

If the spec were stricter about defining exactly when you can rely on identity-dependent operations then I would be obliged to follow it, but I think it's probably deliberate that it currently allows implementations some latitude, effectively signalling to users that they should avoid depending on this aspect of the behaviour.

He adviced to raise a bug in the w3c bugzilla to resolve the issue. In the end two related bugs have been raised:

  • Bug 13494 - Node uniqueness returned from XSLT function;
  • Bug 13747 - [XPath 3.0] Determinism of expressions returning constructed nodes.

Yesterday, the WG has resolved the issue:

The Working Group agreed that default behavior should continue to require these nodes to be constructed with unique IDs. We believe that this is the kind of thing implementations can do with annotations or declaration options, and it would be best to get implementation experience with this before standardizing.

This means that the technique we used to generate unique identifiers is correct and the behaviour is well defined.

The only problem is to wait when Saxon will fix its behaviour accordingly.

Wednesday, September 14, 2011 5:54:56 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, September 06, 2011

We're not big fans of Entity Framework, as we don't directly expose the database structure to the client program but rather through stored procedures and functions. So, EF for us is a tool to expose those stored procedures as .NET wrappers. This limited use of EF still greatly automates the data access code.

But what we have lately found is that the EF has a problem with char parameters. Namely, if you import a procedure say MyProc that accepts char(1), and then will call it through the generated wrapper, the you will see in sql profiler that char(1) parameter is passed with many trailing spaces as if it were char(8000). There isn't necessity to prove that this is highly ineffective.

We can see that the problem happens in VS 2010 designer rather than in the EF runtime, as SP's parameters are not attributed with length, see model xml (*.edmx):

<Function Name="MyProc" Schema="Data">
  ...
  <Parameter Name="recipientType" Type="char" Mode="In" />
  ...
</Function>

while if we set:

  <Parameter Name="recipientType" Type="char" MaxLength="1" Mode="In" />

the runtime starts working as expected. So the workaround is to fix model file manually.

See also: Stored Proc and Char parm

Tuesday, September 06, 2011 9:11:38 PM UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Friday, August 12, 2011

1. query.dll vs tquery.dll

We have installed Windows Search 4 on a Windows 2003 server. The goal was to index huge compressed xml files (see Windows Search Notifications). But for some reason it did not want to index content.

No "select System.ItemUrl from SystemIndex where contains('...')" has ever returned a row.

We thought that the problem was in our protocol handler, and tried to localize it, but finally have discovered that Windows Search is not able to find anything within text files.

Registry comparision has shown that *.txt extension was indexed by the IFilter defined in the query.dll, while on the other computers, where everything worked, the implementation was in the tquery.dll.

Both libraries were present on the Windows 2003 server, so we have corrected the registry and everything has started to work.

As far as we understand query.dll is part of legacy Indexing Service, and tquery.dll is up to date implementation.

2. Search index size

We have to index a considerable amout of data. But before we can do it we have to estimate the size of index.

In the past it seems we saw somewhere a statement that search index needs a storage that's about 10% of original data for its purposes. Unfortunatelly we cannot find this estimation at present, neither we cannot find any other estimation. This complicates our planning.

To get empirical estimate we've indexed several thousands *.xml-gz files, which are gz'ed big xmls. The total size of this files is about 4.5GB. Total uncompressed size of xmls ~50GB. Xml contained about 10 millions pages of data.

According to 10% criteria we had to arrive to ~5GB search index.

But what we have discovered is that the index has grown to more than 50GB. That's very disappointing. We cannot afford such expense, as we've commited test on a tiny part of data, which increases over time.

So, the solution is to find out what's wrong, and how can it be cured, or to fulltext index only most recent subset of data.

P.S. We have tried to mark folder with search index as compressed, but it did not work.

P.P.S. We have found the reference to Windows Search 4 index size estimation. It is in Windows Search Frequently Asked Questions, see answer on "What is average size of a user's index?" question.

Friday, August 12, 2011 9:20:18 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Friday, July 22, 2011

We needed to track a stream position during creation of xml file. This is to allow random access to a huge xml file (the task is related to WindowsSearch).

This is a simplified form of the xml:

<data>
  <item>...</item>
   ...
  <item>...</item> 
</data>

The goal was to have stream position of each item element. With this in mind, we've decided to:

  • open a stream, and then xml writer over it;
  • write data into xml writer;
  • call Flush() method of the xml writer before measuring stream offset;

That's a code sample:

var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);

writer.WriteStartDocument();
writer.WriteStartElement("data");

for(var i = 0; i < 10; ++i)
{
  writer.Flush();

  Console.WriteLine("Flush offset: {0}, char: {1}",
    stream.Position,
    (char)stream.GetBuffer()[stream.Position - 1]);
 
  writer.WriteStartElement("item");
  writer.WriteValue("item " + i);
  writer.WriteEndElement();
}

writer.WriteEndElement();
writer.WriteEndDocument();

That's the output:

Flush offset: 46, char: a
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >

Funny, isn't it?

After feeding the start tag <data>, and flushing xml writer we observe that only "<data" has been written down to the stream. Well, Flush() have never promissed anything particular about the content of the stream, so we cannot claim any violation, however we expected to see whole start tag.

Inspection of the implementation of xml writer reveals laziness during writting data down the stream. In particular start tag is closed when one starts the content. This is probably to implement empty tags: <data/>.

To do the trick we had to issue empty content, moreover, to call a particular method with particular parameters of the xml writer. So the code after the fix looks like this:

var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);

writer.WriteStartDocument();
writer.WriteStartElement("data");

char[] empty = { ' ' };

for(var i = 0; i < 10; ++i)
{
  writer.WriteChars(empty, 0, 0);
  writer.Flush();

  Console.WriteLine("Flush offset: {0}, char: {1}",
    stream.Position,
    (char)stream.GetBuffer()[stream.Position - 1]);

  writer.WriteStartElement("item");
  writer.WriteValue("item " + i);
  writer.WriteEndElement();
}

writer.WriteEndElement();
writer.WriteEndDocument();

And output is:

Flush offset: 47, char: >
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >

While this code works, we feel uneasy with it.

What's the better way to solve the task?

Update: further analysis shows that it's only possible behaviour, as after the call to write srart element, you either can write attributes, content or end of element, so writer may write either space, '>' or '/>'. The only question is why it takes WriteChars(empty, 0, 0) into account and WriteValue("") it doesn't.

Friday, July 22, 2011 9:08:36 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Wednesday, July 13, 2011

As you probably know we have implemented our custom Protocol Handler for the Windows Search.

It's called .xml-gz, and has a goal to index compressed xml files and to have search results with a subtree precision. So, for xml:

<data>
  <item>...</item>
  <item>...</item>
  ...
</data>

search finds results within item and returns xml's url and stream offset of the item. Using ZLIB API we can compress data with stream bookmarks, so fast random access to the data is possible.

The only problem we have is about notification of changes (create, delete, update) of such files.

Spec describes several techniques (nothing has worked for us):

1. Call catalogManager.ReindexMatchingURLs() - it just returns without any impact.

2.Call changeSink.OnItemsChanged() - returns error.

3. Implement .xml-gz IFilter and call IGatherNotifyInline (see " have your .zip urls indexed when they are created or modified") - that's a mistery, as:

4. Implement root url in form .xml-gz:/// and perform Windows Search:

SELECT
  System.ItemUrl, System.DateModified
FROM
  SystemIndex WHERE System.FileExtension='.xml-gz'

to find all .xml-gz sources. This is not reliable, as your protocol handler can be (and is) called before file is indexed.

So, the only reliable way to index your data is to (re-)add indexing rule for the protocol handler, which in most cases reindexes everything.

The only bearable solution we found is to define indexing rule in the form: .xml-gz://file:d:/data/... and to use IShellFolder(2) interfaces to discover sub items and their modification times. This technique allows minimal data scan when you're (re-)add indexing rule.

Wednesday, July 13, 2011 8:21:00 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Saturday, July 09, 2011

Being unexperienced with Windows Search we tried to build queries to find data in the huge storage. We needed to find a document that matches some name pattern and contains some text.

Our naive query was like this:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemName like '...%' and
  contains('...')

In most cases this query returns nothing and runs very long. It's interesting to note that it may start returning data if "top" clause is missing or uses a bigger number, but in this cases query is slower even more.

Next try was like this:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemName >= '...' and System.ItemName < '...' and
  contains('...')

This query is also slow, but at least it returns some results.

At some point we have started to question the  utility of Windows Search if it's so slow, but then we have found that there is a property System.ItemNameDisplay, which in our case coincides with the value of property System.ItemName, so we have tried the query:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemNameDisplay like '...%' and
  contains('...')

This query worked fast, and produced good results. This hints that search engine has index on System.ItemNameDisplay in contrast to System.ItemName property.

We've looked at property definitions:

System.ItemNameDisplay

The display name in "most complete" form. It is the unique representation of the item name most appropriate for end users.

propertyDescription
    name = System.ItemNameDisplay
    shellPKey = PKEY_ItemNameDisplay
    formatID = B725F130-47EF-101A-A5F1-02608C9EEBAC
    propID = 10
    searchInfo
       inInvertedIndex = true
       isColumn = true
       isColumnSparse = false
       columnIndexType = OnDisk
       maxSize = 128

System.ItemName

The base name of the System.ItemNameDisplay property.

propertyDescription
    name = System.ItemName
    shellPKey = PKEY_ItemName
    formatID = 6B8DA074-3B5C-43BC-886F-0A2CDCE00B6F
    propID = 100
    searchInfo
       inInvertedIndex = false
       isColumn = true
       isColumnSparse = false
       columnIndexType = OnDisk
       maxSize = 128

Indeed, one property is indexed, while the other is not.

As with other databases, query is powerful when engine uses indices rather than performs data scan. This is also correct for Windows Search.

The differences in results that variations of query produce also manifests that Windows Search nevertheless is very different from relational database.

Saturday, July 09, 2011 10:01:36 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Tuesday, July 05, 2011

We have developed our custom Windows Search Protocol Handler. The role of this component is to expose items of complex content (or unusual storage) to Windows Search.

You can think of some virtual folder, so a Protocol Handler allows to enumerate it's files, file properties, and contents.

The goal of our Protocol Handler is to represent some data structure as a set of xml files. We expected that if we found a data within a folder with these files, then a search within Protocol Handler's scope would bring the same (or almost the same) results.

Reality is different.

For some reason .xml IFilter (a component to extract text data to index) works differently with file system and with our storage. We cannot state that it does not work, but for some reason many words that Windows Search finds within a file are never found within Protocol Handler scope.

We have observed that if, for purpose of indexing, we represent content xml items as .txt files, then search works as expected. So, our workaround was to present only xml's text data for the indexing, and to use .txt IFilter (this in fact roughly what .xml IFilter does by itself).

Is there a conclusion?

Well, Windows Search is a black box probably containing bugs. Its behaviour is not always obvious.

Tuesday, July 05, 2011 8:31:47 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Friday, June 24, 2011

Let's put it blatantly: Windows Search 4 has design and implementation problems.

You discover this immediatelly when you start implementing indexing of custom file format.

If you want to index simple file format then you need to imlement you IFilter interface. But if it has happened so that you want to index compound data then you should invent you own protocols.

If you will fugure out how to implement your protocol to index that compound data, then you will most probably stuck on the problem on how to notify indexer about the changes.

The problem is that Windows Search 4 has API to reindex urls, which simply does not work, or to notify indexer about changes, which throws an error (returns error HRESULT) for custom protocols. At least, we were not able to make it run.

Friday, June 24, 2011 7:39:36 PM UTC  #    Comments [0] -
Thinking aloud | Window Search
# Thursday, June 02, 2011

Do you know that the best JSF/Facelets visual editor, in our opinion, is ... Microsoft Visual Studio 2008? Another rather good JSF editor is presented in IBM RAD 7.xx. The most popular, open source Java IDE Eclipse contains an ugly implementation of such useful thing.

Thursday, June 02, 2011 8:37:09 AM UTC  #    Comments [0] -
JSF and Facelets | Thinking aloud
# Tuesday, May 03, 2011

Already for a couple of days we're trying to create a UserControl containing a TabContainer. To achieve the goal we have created a page with a ToolkitScriptManager and a user control itself.

Page:

<form runat="server">
  <ajax:ToolkitScriptManager ID="ToolkitScriptManager1" runat="server"/>
  <uc1:WebUserControl1 ID="WebUserControl11" runat="server" />
</form>

User control:

<%@ Control Language="C#" %>
<%@ Register
  Assembly="AjaxControlToolkit"
  Namespace="AjaxControlToolkit"
  TagPrefix="ajax" %>

<ajax:TabContainer ID="Tab" runat="server" Width="100%">
  <ajax:TabPanel runat="server" HeaderText="Tab1" ID="Tab1">
    <ContentTemplate>Panel 1</ContentTemplate>
  </ajax:TabPanel>
</ajax:TabContainer>

What could be more simple?

But no, there is a problem. At run-time control works perfectly, but at the designer it shows an error instead of a normal design view:

Error Rendering Control - TabContainer1
An unhandled exception has occurred.
Could not find any resources appropriate for the specified culture or the neutral culture. Make sure "AjaxControlToolkit.Properties.Resources.NET4.resources" was correctly embedded or linked into assembly "AjaxControlToolkit" at compile time, or that all the satellite assemblies required are loadable and fully signed.

That's a stupid error, which says nothing about the real problem reason. We had to attach a debugger to a Visual Studio just to realize what the problem is.

So, the error occurs at the following code of AjaxControlToolkit.ScriptControlBase:

private void EnsureScriptManager()
{
  if (this._scriptManager == null)
  {
    this._scriptManager = ScriptManager.GetCurrent(this.Page);

    if (this._scriptManager == null)
    {
      throw new HttpException(Resources.E_NoScriptManager);
    }
  }
}

Originally, the problem is due to the fact that ScriptManager is not found, and code wants to report an HttpException, but fun is that we recieve a different exception, which is releted to a missing resouce text for a message Resources.E_NoScriptManager. It turns out that E_NoScriptManager text is found neither in primary no in resource assemblies.

As for original problem, it's hard to say about reason of why ScriptManager is not available at design time. We, however, observed that a ScriptManager registers itself for a ScriptManager.GetCurrent() method at run-time only:

protected internal override void OnInit(EventArgs e)
{
...
  if (!base.DesignMode)
  {
...
    iPage.Items[typeof(ScriptManager)] = this;
...
  }
}

So, it's not clear what they (toolkit's developers) expected to get at design time.

These observations leave uneasiness regarding the quality of the library.

Tuesday, May 03, 2011 7:45:51 PM UTC  #    Comments [8] -
ASP.NET | Thinking aloud
# Tuesday, April 26, 2011

Earlier, we have described an approach to call Windows Search from SQL Server 2008. But it has turned out that our problem is more complicated...

All has started from the initial task:

  • to allow free text search in a store of huge xml files;
  • files should be compressed, so these are *.xml.gz;
  • search results should be addressable to a fragment within xml.

Later we shall describe how we have solved this task, and now it's enough to say that we have implemented a Protocol Handler for Windows Search named '.xml-gz:'. This way original file stored say at 'file:///c:/store/data.xml-gz' is seen as a container by the Windows Search:

  • .xml-gz:///file:c:/store/data.xml-gz/id1.xml
  • .xml-gz:///file:c:/store/data.xml-gz/id2.xml
  • ...

This way search xml should be like this:

select System.ItemUrl from SystemIndex where scope='.xml-gz:' and contains(...)

Everything has worked during test: we have succeeded to issue Windows Search selects from SQL Server and join results with other sql queries.

But later on when we considered a runtime environment we have seen that our design won't work. The reason is simple. Windows Search will work on a computer different from those where SQL Servers run. So, the search query should look like this:

select System.ItemUrl from Computer.SystemIndex where scope='.xml-gz:' and contains(...)

Here we have realized the limitation of current (Windows Search 4) implementation: remote search works for shared folders only, thus query may only look like:

select System.ItemUrl from Computer.SystemIndex where scope='file://Computer/share/' and contains(...)

Notice that search restricts the scope to a file protocol, this way remoter search will never return our results. The only way to search in our scope is to perform a local search.

We have considered following approaches to resolve the issue.

The simplest one would be to access Search protocol on remote computer using a connection string: "Provider=Search.CollatorDSO;Data Source=Computer" and use local queries. This does not work, as provider simply disregards Data Source parameter.

The other try was to use MS Remote OLEDB provider. We tried hard to configure it but it always returns obscure error, and more than that it's deprecated (Microsoft claims to remove it in future).

So, we decided to forward request manually:

  • SQL Server calls a web service (through a CLR function);
  • Web service queries Windows Search locally.

Here we considered WCF Data Services and a custom web service.

The advantage of WCF Data Services is that it's a technology that has ambitions of a standard but it's rather complex task to create implementation that will talk with Windows Search SQL dialect, so we have decided to build a primitive http handler to get query parameter. That's trivial and also has a virtue of simple implementation and high streamability.

So, that's our http handler (WindowsSearch.ashx):

<%@ WebHandler Language="C#" Class="WindowsSearch" %>

using System;
using System.Web;
using System.Xml;
using System.Text;
using System.Data.OleDb;

/// <summary>
/// A Windows Search request handler.
/// </summary>
public class WindowsSearch: IHttpHandler
{
  /// <summary>
  /// Handles the request.
  /// </summary>
  /// <param name="context">A request context.</param>
  public void ProcessRequest(HttpContext context)
  {
    var request = context.Request;
    var query = request.Params["query"];
    var response = context.Response;

    response.ContentType = "text/xml";
    response.ContentEncoding = Encoding.UTF8;

    var writer = XmlWriter.Create(response.Output);

    writer.WriteStartDocument();
    writer.WriteStartElement("resultset");

    if (!string.IsNullOrEmpty(query))
    {
      using(var connection = new OleDbConnection(provider))
      using(var command = new OleDbCommand(query, connection))
      {
        connection.Open();

        using(var reader = command.ExecuteReader())
        {
          string[] names = null;

          while(reader.Read())
          {
            if (names == null)
            {
              names = new string[reader.FieldCount];

              for (int i = 0; i < names.Length; ++i)
              {
                names[i] = XmlConvert.EncodeLocalName(reader.GetName(i));
              }
            }

            writer.WriteStartElement("row");

            for(int i = 0; i < names.Length; ++i)
            {
              writer.WriteElementString(
                names[i],
                Convert.ToString(reader[i]));
            }

            writer.WriteEndElement();
          }
        }
      }
    }

    writer.WriteEndElement();
    writer.WriteEndDocument();

    writer.Flush();
  }

  /// <summary>
  /// Indicates that a handler is reusable.
  /// </summary>
  public bool IsReusable { get { return true; } }

  /// <summary>
  /// A connection string.
  /// </summary>
  private const string provider =
    "Provider=Search.CollatorDSO;" +
    "Extended Properties='Application=Windows';" +
    "OLE DB Services=-4";
}

And a SQL CLR function looks like this:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Net;
using System.IO;
using System.Xml;

/// <summary>
/// A user defined function.
/// </summary>
public class UserDefinedFunctions
{
  /// <summary>
  /// A Windows Search returning result as xml strings.
  /// </summary>
  /// <param name="url">A search url.</param>
  /// <param name="userName">A user name for a web request.</param>
  /// <param name="password">A password for a web request.</param>
  /// <param name="query">A Windows Search SQL.</param>
  /// <returns>A result rows.</returns>
  [SqlFunction(
    IsDeterministic = false,
    Name = "WindowsSearch",
    FillRowMethodName = "FillWindowsSearch",
    TableDefinition = "value nvarchar(max)")]
  public static IEnumerable Search(
    string url,
    string userName,
    string password,
    string query)
  {
    return SearchEnumerator(url, userName, password, query);
  }

  /// <summary>
  /// A filler of WindowsSearch function.
  /// </summary>
  /// <param name="value">A value returned from the enumerator.</param>
  /// <param name="row">An output value.</param>
  public static void FillWindowsSearch(object value, out string row)
  {
    row = (string)value;
  }

  /// <summary>
  /// Gets a search row enumerator.
  /// </summary>
  /// <param name="url">A search url.</param>
  /// <param name="userName">A user name for a web request.</param>
  /// <param name="password">A password for a web request.</param>
  /// <param name="query">A Windows Search SQL.</param>
  /// <returns>A result rows.</returns>
  private static IEnumerable<string> SearchEnumerator(
    string url,
    string userName,
    string password,
    string query)
  {
    if (string.IsNullOrEmpty(url))
    {
      throw new ArgumentException("url");
    }

    if (string.IsNullOrEmpty(query))
    {
      throw new ArgumentException("query");
    }

    var requestUrl = url + "?query=" + Uri.EscapeDataString(query);

    var request = WebRequest.Create(requestUrl);

    request.Credentials = string.IsNullOrEmpty(userName) ?
      CredentialCache.DefaultCredentials :
      new NetworkCredential(userName, password);

    using(var response = request.GetResponse())
    using(var stream = response.GetResponseStream())
    using(var reader = XmlReader.Create(stream))
    {
      bool read = true;

      while(!read || reader.Read())
      {
        if ((reader.Depth == 1) && reader.IsStartElement())
        {
          // Note that ReadInnerXml() advances the reader similar to Read().
          yield return reader.ReadInnerXml();

          read = false;
        }
        else
        {
          read = true;
        }
      }
    }
  }
}

And, finally, when you call this service from SQL Server you write query like this:

with search as
(
  select
    cast(value as xml) value
  from
    dbo.WindowsSearch
    (
      N'http://machine/WindowsSearchService/WindowsSearch.ashx',
      null,
      null,
      N'
        select
          "System.ItemUrl"
        from
          SystemIndex
        where
          scope=''.xml-gz:'' and contains(''...'')'
    )
)
select
  value.value('/System.ItemUrl[1]', 'nvarchar(max)')
from
  search

Design is not trivial but it works somehow.

After dealing with all these problems some questions remain unanswered:

  • Why SQL Server does not allow to query Windows Search directly?
  • Why Windows Search OLEDB provider does not support "Data Source" parameter?
  • Why Windows Search does not support custom protocols during remote search?
  • Why SQL Server does not support web request/web services natively?
Tuesday, April 26, 2011 8:26:10 AM UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks | Window Search
# Friday, March 04, 2011

We were trying to query Windows Search from an SQL Server 2008.

Documentation states that Windows Search is exposed as OLE DB datasource. This meant that we could just query result like this:

SELECT
  *
FROM
  OPENROWSET(
    'Search.CollatorDSO.1',
    'Application=Windows',
    'SELECT "System.ItemName", "System.FileName" FROM SystemIndex');

But no, such select never works. Instead it returns obscure error messages:

OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" returned message "Command was not prepared.".
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" reported an error. Command was not prepared.
Msg 7350, Level 16, State 2, Line 1
Cannot get the column information from OLE DB provider "Search.CollatorDSO.1" for linked server "(null)".

Microsoft is silent about reasons of such behaviour. People came to a conclusion that the problem is in the SQL Server, as one can query search results through OleDbConnection without problems.

This is very unfortunate, as it bans many use cases.

As a workaround we have defined a CLR function wrapping Windows Search call and returning rows as xml fragments. So now the query looks like this:

select
  value.value('System.ItemName[1]', 'nvarchar(max)') ItemName,
  value.value('System.FileName[1]', 'nvarchar(max)') FileName
from
  dbo.WindowsSearch('SELECT "System.ItemName", "System.FileName" FROM SystemIndex')

Notice how we decompose xml fragment back to fields with the value() function.

The C# function looks like this:

using System;
using System.Collections;
using System.IO;
using System.Xml;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Data.OleDb;

using Microsoft.SqlServer.Server;

public class UserDefinedFunctions
{
  [SqlFunction(
    FillRowMethodName = "FillSearch",
    TableDefinition="value xml")]
  public static IEnumerator WindowsSearch(SqlString query)
  {
    const string provider =
      "Provider=Search.CollatorDSO;" +
      "Extended Properties='Application=Windows';" +
      "OLE DB Services=-4";

    var settings = new XmlWriterSettings
    {
      Indent = false,
      CloseOutput = false,
      ConformanceLevel = ConformanceLevel.Fragment,
      OmitXmlDeclaration = true
    };

    string[] names = null;

    using(var connection = new OleDbConnection(provider))
    using(var command = new OleDbCommand(query.Value, connection))
    {
      connection.Open();

      using(var reader = command.ExecuteReader())
      {
        while(reader.Read())
        {
          if (names == null)
          {
            names = new string[reader.FieldCount];

            for (int i = 0; i < names.Length; ++i)
            {
              names[i] = XmlConvert.EncodeLocalName(reader.GetName(i));
            }
          }

          var stream = new MemoryStream();
          var writer = XmlWriter.Create(stream, settings);

          for(int i = 0; i < names.Length; ++i)
          {
            writer.WriteElementString(names[i], Convert.ToString(reader[i]));
          }

          writer.Close();

          yield return new SqlXml(stream);
        }
      }
    }
  }

  public static void FillSearch(object value, out SqlXml row)
  {
    row = (SqlXml)value;
  }
}

Notes:

  •  Notice the use of "OLE DB Services=-4" in provider string to avoid transaction enlistment (required in SQL Server 2008).
  • Permission level of the project that defines this extension function should be set to unsafe (see Project Properties/Database in Visual Studio) otherwise it does not allow the use OLE DB.
  • SQL Server should be configured to allow CLR functions, see Server/Facets/Surface Area Configuration/ClrIntegrationEnabled in Microsoft SQL Server Management Studio
  • Assembly should either be signed or a database should be marked as trustworthy, see Database/Facets/Trustworthy in Microsoft SQL Server Management Studio.
Friday, March 04, 2011 9:22:49 AM UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks | Window Search
# Sunday, February 20, 2011

Last few days we were testing Java web-applications that expose web-services. During these tests we've found few interesting features.

The first feature allows to retrieve info about all endpoints supported by the web-application on GET request. The feature works at least for Metro that implements JAX-WS API v2.x. In order to get such info, a client sends any endpoint's URL to the server. The result is an HTML page with a table. Each row of such table contains an endpoint's data for each supported web-service method. This feature may be used as a web-services discovery mechanism.

The second feature is bad rather than good. JAX-WS API supposes that a developer annotates classes and methods that he/she wants to expose as web-services. Then, an implementation generates additional layer-bridge between developer's code and API that does all routine work behind the scene. May be that was a good idea, but Metro's implementation is imperfect. Metro dynamically generates such classes at run-time when a web-application starts. Moreover, Metro does such generation for all classes at once. So, in our case, when the generated web-based application contains dozens or even hundreds of web-services, the application's startup takes a lot of time.

Probably, Metro developers didn't want to deal with implementation of lazy algorithms, when a web-service is generated and cached on demand. We hope this issue will be solved in next releases.

Sunday, February 20, 2011 10:20:12 AM UTC  #    Comments [0] -
Java | Thinking aloud | Tips and tricks
# Tuesday, February 08, 2011

A while ago we have created a simple cache for Java application. It was modelled like a Map<K, V>: it cached values for keys.

Use cases were:

Cache<String, Object> cache = new Cache<String, Object>();
...
instance = cache.get("key");
cache.put("key", instance);

But now we thought of different implementation like a WeakReference<V> and with map access as additional utility methods.

 Consider an examples:

1. Free standing CachedReference<V> instance.

CachedReference<Data> ref = new CachedReference<Data>(1000, true);
...
ref.set(data);
...
data = ref.get();

2. Map of CachedReference<V> instances.

ConcurrentHashMap<String, CachedReference<Data>> cache =
  new ConcurrentHashMap<String, CachedReference<Data>>();

CachedReference.put(cache, "key", data, 1000, true);
...
data = CachedReference.get(cache, "key");

The first case is faster than original Cache<K, V> as it does not use any hash map at all. The later case provides the same performance as Cache<K, V> but gives a better control over the storage. Incidentally, CachedReference<V> is more compact than Cache<K, V>.

The new implementation is CachedReference.java, the old one Cache.java.

Tuesday, February 08, 2011 3:20:29 PM UTC  #    Comments [0] -
Java | Thinking aloud
# Thursday, February 03, 2011

michaelhkay: Just posted a new internal draft of XSLT 3.0. Moving forward on maps, nested sequences, and JSON support.

Hope they will finally appear there!

See also: Tuples and maps - next try, Tuples and maps - Status: CLOSED, WONTFIX, Tuples and maps in Saxon and other blog posts on our site about immutable maps.

Thursday, February 03, 2011 11:07:49 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Thursday, January 27, 2011

A method pattern we have suggested to use along with @Yield annotation brought funny questions like: "why should I mark my method with @Yield annotation at all?"

Well, in many cases you may live with ArrayList populated with data, and then to perform iteration. But in some cases this approach is not practical either due to amount of data or due to the time required to get first item.

In later case you usually want to build an iterator that calculates items on demand. The @Yield annotation is designed as a marker of such methods. They are refactored into state machines at compilation time, where each addition to a result list is transformed into a new item yielded by the iterator.

So, if you have decided to use @Yield annotation then at some point you will ask yourself what happens with resources acquired during iteration. Will resources be released if iteration is interrupted in the middle due to exception or a break statement?

To address the problem yield iterator implements Closeable interface.

This way when you call close() before iteration reached the end, the state machine works as if break statement of the method body is injected after the yield point. Thus all finally blocks of the original method are executed and resources are released.

Consider an example of data iterator:

@Yield
public Iterable<Data> getData(final Connection connection)
  throws Exception
{
  ArrayList<Data> result = new ArrayList<Data>();

  PreparedStatement statement =
    connection.prepareStatement("select key, value from table");

  try
  {
    ResultSet resultSet = statement.executeQuery();

    try
    {
      while(resultSet.next())
      {
        Data data = new Data();

        data.key = resultSet.getInt(1);
        data.value = resultSet.getString(2);

        result.add(data); // yield point
      }
    }
    finally
    {
      resultSet.close();
    }
  }
  finally
  {
    statement.close();
  }

  return result;
}

private static void close(Object value)
  throws IOException
{
  if (value instanceof Closeable)
  {
    Closeable closeable = (Closeable)value;

    closeable.close();
  }
}

public void daoAction(Connection connection)
  throws Exception
{
  Iterable<Data> items = getData(connection);

  try
  {
    for(Data data: items)
    {
      // do something that potentially throws exception.
    }
  }
  finally
  {
    close(items);
  }
}

getData() iterates over sql data. During the lifecycle it creates and releases PreparedStatement and ResultSet.

daoAction() iterates over results provided by getData() and performs some actions that potentially throw an exception. The goal of close() is to release opened sql resources in case of such an exception.

Here you can inspect how state machine is implemented for such a method:

@Yield()
public static Iterable<Data> getData(final Connection connection)
  throws Exception
{
  assert (java.util.ArrayList<Data>)(ArrayList<Data>)null == null;

  class $state implements java.lang.Iterable<Data>, java.util.Iterator<Data>, java.io.Closeable
  {
    public java.util.Iterator<Data> iterator() {
      if ($state$id == 0) {
        $state$id = 1;

        return this;
      } else return new $state();
    }

    public boolean hasNext() {
      if (!$state$nextDefined) {
        $state$hasNext = $state$next();
        $state$nextDefined = true;
      }

      return $state$hasNext;
    }

    public Data next() {
      if (!hasNext()) throw new java.util.NoSuchElementException();

      $state$nextDefined = false;

      return $state$next;
    }

    public void remove() {
      throw new java.lang.UnsupportedOperationException();
    }

    public void close() {
      do switch ($state$id) {
      case 3:
        $state$id2 = 8;
        $state$id = 5;

        continue;
      default:
        $state$id = 8;

        continue;
      } while ($state$next());
    }

    private boolean $state$next() {
      java.lang.Throwable $state$exception;

      while (true) {
        try {
          switch ($state$id) {
          case 0:
            $state$id = 1;
          case 1:
            statement = connection.prepareStatement("select key, value from table");
            $state$exception1 = null;
            $state$id1 = 8;
            $state$id = 2;
          case 2:
            resultSet = statement.executeQuery();
            $state$exception2 = null;
            $state$id2 = 6;
            $state$id = 3;
          case 3:
            if (!resultSet.next()) {
              $state$id = 4;

              continue;
            }

            data = new Data();
            data.key = resultSet.getInt(1);
            data.value = resultSet.getString(2);
            $state$next = data;
            $state$id = 3;

            return true;
          case 4:
            $state$id = 5;
          case 5:
            {
              resultSet.close();
            }

            if ($state$exception2 != null) {
              $state$exception = $state$exception2;

              break;
            }

            if ($state$id2 > 7) {
              $state$id1 = $state$id2;
              $state$id = 7;
            } else $state$id = $state$id2;

            continue;
          case 6:
            $state$id = 7;
          case 7:
            {
              statement.close();
            }

            if ($state$exception1 != null) {
              $state$exception = $state$exception1;

              break;
            }

            $state$id = $state$id1;

            continue;
          case 8:
          default:
            return false;
          }
        } catch (java.lang.Throwable e) {
          $state$exception = e;
        }

        switch ($state$id) {
        case 3:
        case 4:
          $state$exception2 = $state$exception;
          $state$id = 5;

          continue;
        case 2:
        case 5:
        case 6:
          $state$exception1 = $state$exception;
          $state$id = 7;

          continue;
        default:
          $state$id = 8;

          java.util.ConcurrentModificationException ce = new java.util.ConcurrentModificationException();

          ce.initCause($state$exception);

          throw ce;
        }
      }
    }

    private PreparedStatement statement;
    private ResultSet resultSet;
    private Data data;
    private int $state$id;
    private boolean $state$hasNext;
    private boolean $state$nextDefined;
    private Data $state$next;
    private java.lang.Throwable $state$exception1;
    private int $state$id1;
    private java.lang.Throwable $state$exception2;
    private int $state$id2;
  }

  return new $state();
}

Now, you can estimate for what it worth to write an algorithm as a sound state machine comparing to the conventional implementation.

Yield annotation processor can be downloaded from Yield.zip or Yield.jar

See also Yield return feature in java.

Thursday, January 27, 2011 10:33:54 AM UTC  #    Comments [0] -
Java | Thinking aloud | Tips and tricks
# Monday, January 24, 2011

We're happy to announce that we have implemented @Yield annotation both in javac and in eclipse compilers.

This way you get built-in IDE support for the feature!

To download yield annotation processor please use the following link: Yield.zip

It contains both yield annotation processor, and a test project.

If you do not want to compile the sources, you can download Yield.jar

We would like to reiterate on how @Yield annotation works:

  1. A developer defines a method that returns either Iterator<T> or Iterable<T> instance and marks it with @Yield annotation.
  2. A developer implements iteration logic following the pattern:
    • declare a variable to accumulate results:
        ArrayList<T> items = new ArrayList<T>();
    • use the following statement to add item to result:
        items.add(...);
    • use
        return items;
      or
        return items.iterator();
      to return result;
    • mark method's params, if any, as final.
  3. A devoloper ensures that yield annotation processor is available during compilation (see details below).
  4. YieldProcessor rewrites method into a state machine at compilation time.

The following is an example of such a method:

@Yield
public static Iterable<Integer> generate(final int from, final int to)
{
  ArrayList<Integer> items = new ArrayList<Integer>();

  for(int i = from; i < to; ++i)
  {
    items.add(i);
  }

  return items;
}

The use is like this:

for(int value: generate(7, 20))
{
  System.out.println("generator: " + value);
}

Notice that method's implementation still will be correct in absence of YieldProcessor.

Other important feature is that the state machine returned after the yield processor is closeable.

This means that if you're breaking the iteration before the end is reached you can release resources acquired during the iteration.

Consider the example where break exits iteration:

@Yield
public static Iterable<String> resourceIteration()
{
  ArrayList<String> items = new ArrayList<String>();

  acquire();

  try
  {
    for(int i = 0; i < 100; ++i)
    {
      items.add(String.valueOf(i));
    }
  }
  finally
  {
    release();
  }

  return items;
}

and the use

int i = 0;
Iterable<String> iterator = resourceIteration();

try
{
  for(String item: iterator)
  {
    System.out.println("item " + i + ":" + item);

    if (i++ > 30)
    {
      break;
    }
  }
}
finally
{
  close(iterator);
}

...

private static <T> void close(T value)
  throws IOException
{
  if (value instanceof Closeable)
  {
    Closeable closeable = (Closeable)value;

    closeable.close();
  }
}

Close will execute all required finally blocks. This way resources will be released.

To configure yield processor a developer needs to refer Yield.jar in build path, as it contains @Yield annotation. For javac it's enough, as compiler will find annotation processor automatically.

Eclipse users need to open project properties and:

  • go to the "Java Compiler"/"Annotation Processing"
  • mark "Enable project specific settings"
  • select "Java Compiler"/"Annotation Processing"/"Factory Path"
  • mark "Enable project specific settings"
  • add Yield.jar to the list of "plug-ins and JARs that contain annotation processors".

At the end we want to point that @Yield annotation is a syntactic suggar, but it's important the way the foreach statement is important, as it helps to write concise and an error free code.

See also
  Yield feature in java implemented!
  Yield feature in java

Monday, January 24, 2011 10:23:53 AM UTC  #    Comments [2] -
Announce | Java | Thinking aloud | Tips and tricks
# Tuesday, January 11, 2011

We could not stand the temptation to implement the @Yield annotation that we described earlier.

Idea is rather clear but people are saying that it's not an easy task to update the sources.

They were right!

Implementation has its price, as we were forced to access JDK's classes of javac compiler. As result, at present, we don't support other compilers such as EclipseCompiler. We shall look later what can be done in this area.

At present, annotation processor works perfectly when you run javac either from the command line, from ant, or from other build tool.

Here is an example of how method is refactored:

@Yield
public static Iterable<Long> fibonachi()
{
  ArrayList<Long> items = new ArrayList<Long>();

  long Ti = 0;
  long Ti1 = 1;

  while(true)
  {
    items.add(Ti);

    long value = Ti + Ti1;

    Ti = Ti1;
    Ti1 = value;
  }
}

And that's how we transform it:

@Yield()
public static Iterable<Long> fibonachi() {
  assert (java.util.ArrayList<Long>)(ArrayList<Long>)null == null : null;

  class $state$ implements java.lang.Iterable<Long>, java.util.Iterator<Long>, java.io.Closeable {

    public java.util.Iterator<Long> iterator() {
      if ($state$id == 0) {
        $state$id = 1;
        return this;
      } else return new $state$();
    }

    public boolean hasNext() {
      if (!$state$nextDefined) {
        $state$hasNext = $state$next();
        $state$nextDefined = true;
      }

      return $state$hasNext;
    }

    public Long next() {
      if (!hasNext()) throw new java.util.NoSuchElementException();

      $state$nextDefined = false;

      return $state$next;
    }

    public void remove() {
      throw new java.lang.UnsupportedOperationException();
    }

    public void close() {
      $state$id = 5;
    }

    private boolean $state$next() {
      while (true) switch ($state$id) {
      case 0:
        $state$id = 1;
      case 1:
        Ti = 0;
        Ti1 = 1;
      case 2:
        if (!true) {
          $state$id = 4;
          break;
        }

        $state$next = Ti;
        $state$id = 3;

        return true;
      case 3:
        value = Ti + Ti1;
        Ti = Ti1;
        Ti1 = value;
        $state$id = 2;

        break;
      case 4:
      case 5:
      default:
        $state$id = 5;

        return false;
      }
    }

    private long Ti;
    private long Ti1;
    private long value;
    private int $state$id;
    private boolean $state$hasNext;
    private boolean $state$nextDefined;
    private Long $state$next;
  }

  return new $state$();
}

Formatting is automatic, sorry, but anyway it's for diagnostics only. You will never see this code.

It's iteresting to say that this implementation is very precisely mimics xslt state machine implementation we have done back in 2008.

You can download YieldProcessor here. We hope that someone will find our solution very interesting.

Tuesday, January 11, 2011 4:08:41 PM UTC  #    Comments [0] -
Announce | Thinking aloud | Tips and tricks | xslt | Java
# Monday, December 20, 2010

Several times we have already wished to see yield feature in java and all the time came to the same implementation: infomancers-collections. And every time with dissatisfaction turned away, and continued with regular iterators.

Why? Well, in spite of the fact it's the best implementation of the feature we have seen, it's still too heavy, as it's playing with java byte code at run-time.

We never grasped the idea why it's done this way, while there is post-compile time annotation processing in java.

If we would implemented the yeild feature in java we would created a @Yield annotation and would demanded to implement some well defined code pattern like this:

@Yield
Iteratable<String> iterator()
{
  // This is part of pattern.
  ArrayList<String> list = new ArrayList<String>();

  for(int i = 0; i < 10; ++i)
  {
    // list.add() plays the role of yield return.
    list.add(String.valueOf(i));
  }

  // This is part of pattern.
  return list;
}

or

@Yield
Iterator<String> iterator()
{
  // This is part of pattern.
  ArrayList<String> list = new ArrayList<String>();

  for(int i = 0; i < 10; ++i)
  {
    // list.add() plays the role of yield return.
    list.add(String.valueOf(i));
  }

  // This is part of pattern.
  return list.iterator();
}

Note that the code will work correctly even, if by mischance, post-compile-time processing will not take place.

At post comile time we would do all required refactoring to turn these implementations into a state machines thus runtime would not contain any third party components.

It's iteresting to recall that we have also implemented similar refactoring in pure xslt.

See What you can do with jxom.

Update: implementation can be found at Yield.zip

Monday, December 20, 2010 4:28:35 PM UTC  #    Comments [0] -
Java | Thinking aloud | Tips and tricks | xslt
# Saturday, December 11, 2010

We have a class Beans used to serialize a list of generic objects into an xml. This is done like this:

public class Call
{
  public Beans input;
  public Beans output;
  ...
}

@XmlJavaTypeAdapter(value = BeanAdapter.class)
public class Beans
{
  public List<Object> bean;
}

Thanks to @XmlJavaTypeAdapter, we're able to write xml in whatever form we want.

When we're serializing a Call instance:

Call call = ...
Beans beans = ...;

call.setInput(beans);

JAXBContext context = ...;
Marshaller marshaler = context.createMarshaller();
ObjectFactory factory = ...;

marshaler.marshal(factory.createCall(call), result);

things work as expected, meaning that BeanAdapter is used during xml serialization. But if it's happened that you want to serialize a Beans instance itself, you start getting problems with the serialization of unknown objects. That's because JAXB does not use BeanAdapter.

We have found a similar case "How to assign an adapter to the root element?", unfortunately with no satisfactory explanation.

That is strange.

Saturday, December 11, 2010 8:48:00 AM UTC  #    Comments [0] -
Java | Thinking aloud
# Sunday, November 07, 2010

michaelhkay: Saxon 9.3 has been out for 8 days: only two bugs so far, one found by me. I think that's a record.

Not necessary. We, for example, who use Saxon HE, have found nothing new in Saxon 9.3, while expected to see xslt 3.0. Disappointed. No actual reason to migrate.

P.S. We were among the first who were finding early bugs in previous releases.

Sunday, November 07, 2010 9:07:11 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, November 02, 2010

Reading individual papers of C++ WG, you can find the following one:

N3174 10-0164 To move or not to move Bjarne Stroustrup 2010-10-17 2010-10 Core

There, Bjarne Stroustrup thinks about issues with implicitly generated copy and move operations in C++.

It's always a pleasure to see how one can deal with a problem burdened with antagonisms. To conduct his position Bjarne skilfully uses not only rational but also emotional argumentation:

...We may deem this “bad code that deserves to be broken” or “unrealistic”, but this example demonstrates that the problem with a generated move has an exact counterpart for copy (which we have lived with for 27 years)...

...In 1984, I missed the chance to protect us against copy and we have lived with the problems ever since. I should have instituted some rule along the lines “if a class has a destructor, no copy operations are generated” or “if a class has a pointer member, no copy operations are generated.”...

It's impossible to recall this numbers without shivering. :-)

Tuesday, November 02, 2010 10:16:08 AM UTC  #    Comments [0] -
Thinking aloud

We're following w3's "Bug 9069 - Function to invoke an XSLT transformation".

There, people argue about xpath API to invoke xslt transformations. Function should look roughly like this:

transform
(
  $node-tree as node()?,
  $stylesheet as item(),
  $parameters as XXX
) as node()

The discussion is spinning around the last argument: $parameters as XXX. Should it be an xml element describing parameters, a function returning values for parameter names, or some new type modelling immutable map?

What is most interesting in this discussion is the leak about plans to introduce a map type:

Comment 7 Michael Kay, 2010-09-14 22:46:58 UTC

We're currently talking about adding an immutable map to XSLT as a new data type (the put operation would return a new map). There appear to be a number of possible efficient implementations. It would be ideally suited for this purpose, because unlike the mechanism used for serialization parameters, the values can be any data type (including nodes), not only strings.

There is a hope that map will finally appear in xslt!

See also:
Bug 5630 - [DM] Tuples and maps,
Tuples and maps - Status: CLOSED, WONTFIX,
Map, based on immutable trees,
Maps in exslt2?

Tuesday, November 02, 2010 8:34:52 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Friday, October 22, 2010

In the previous post we have announced an API to parse a COBOL source into the cobolxom.

We exploited the incremental parser to build a grammar xml tree and then were planning to create an xslt transformation to generate cobolxom.

Now, we would like to declare that such xslt is ready.

At present all standard COBOL constructs are supported, but more tests are required. Preprocessor support is still in the todo list.

You may peek into an examples of COBOL:

Cobol grammar:

And cobolxom:

While we were building a grammar to cobolxom stylesheet we asked ourselves whether the COBOL parsing could be done entirely in xslt. The answer is yes, so who knows it might be that we shall turn this task into pure xslt one. :-)

Friday, October 22, 2010 1:24:31 PM UTC  #    Comments [0] -
Announce | Incremental Parser | Thinking aloud | xslt
# Wednesday, August 25, 2010

Accidentally we have found that implementation of String and StringBuilder have been considerably revised, while public interface has remained the same.

public sealed class String
{
  private int m_arrayLength;
  private int m_stringLength;
  private char m_firstChar;
}

This layout is dated to .NET 1.0.

VM, in fact, allocates more memory than that defined in C# class, as &m_firstChar refers to an inline char buffer.

This way string's buffer length and string's length were two different values, thus StringBuilder used this fact and stored its content in a private string which it modified in place.

In .NET 4, string is different:

public sealed class String
{
  private int m_stringLength;
  private char m_firstChar;
}

Memory footprint of such structure is smaller, but string's length should always be the same as its buffer. In fact layout of string is now the same as layout of char[].

This modification leads to implementation redesign of the StringBuilder.

Earlier, StringBuilder looked like the following:

public sealed class StringBuilder
{
  internal IntPtr m_currentThread;
  internal int m_MaxCapacity;
  internal volatile string m_StringValue;
}

Notice that m_StringValue is used as a storage, and m_currentThread is used to preserve thread affinity of the internal string value.

Now, guys at Microsoft have decided to implement StringBuilder very differently:

public sealed class StringBuilder
{
  internal int m_MaxCapacity;
  internal int m_ChunkLength;
  internal int m_ChunkOffset;
  internal char[] m_ChunkChars;
  internal StringBuilder m_ChunkPrevious;
}

Inspection of this layout immediately reveals implementation technique. It's a list of chunks. Instance itself references the last chunk (most recently appended), and the previous chunks.

Characteristics of this design are:

  • while Length is small, performance almost the same as it was earlier;
  • there are no more thread affinity checks;
  • Append(), and ToString() works as fast a in the old version.
  • Insert() in the middle works faster, as only a chuck should be splitted and probably reallocated (copied), instead of the whole string;
  • Random access is fast at the end O(1) and slows when you approaching the start O(chunk-count).

Personally, we would select a slightly different design:

public sealed class StringBuilder
{
  private struct Chunk
  {
    public int length; // Chunk length.
    public int offset; // Chunk offset.
    public char[] buffer; 
  }

  private int m_MaxCapacity;

  // Alternatively, one can use
  // private List<Chunk> chunks;
  private int chunkCount; // Number of used chunks.
  private Chunk[] chunks; // Array of chunks except last.

  private Chunk last; // Last chunk.
  private bool nonHomogenous; // false if all chunks are of the same size.
}

This design has better memory footprint, and random access time is O(1) when there were no inserts in the middle (nonHomogenous=false), and O(log(chunkCount)) after such inserts. All other characteristics are the same.

Wednesday, August 25, 2010 9:36:55 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Thursday, July 15, 2010

We have run into another xslt bug, which depends on several independent circumstances and often behaves differently being observed. That's clearly a Heisenbug.

Xslt designers failed to realize that a syntactic suggar they introduce into xpath can turn into obscure bugs. Well, it's easy to be wise afterwards...

To the point.

Consider you have a sequence consisting of text nodes and elements, and now you want to "normalize" this sequence wrapping adjacent text nodes into separate elements. The following stylesheet is supposed to do the work:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
  exclude-result-prefixes="xs t">

  <xsl:template match="/">
    <xsl:variable name="nodes" as="node()*">
      <xsl:text>Hello, </xsl:text>
      <string value="World"/>
      <xsl:text>! </xsl:text>
      <xsl:text>Well, </xsl:text>
      <string value="hello"/>
      <xsl:text>, if not joking!</xsl:text>
    </xsl:variable>
 
    <result>
      <xsl:sequence select="t:normalize($nodes)"/>
    </result>
  </xsl:template>

  <xsl:function name="t:normalize" as="node()*">
    <xsl:param name="nodes" as="node()*"/>

    <xsl:for-each-group select="$nodes" group-starting-with="*">
      <xsl:variable name="string" as="element()?" select="self::string"/>
      <xsl:variable name="texts" as="node()*"
        select="current-group() except $string"/>

      <xsl:sequence select="$string"/>

      <xsl:if test="exists($texts)">
        <string value="{string-join($texts, '')}"/>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:function>

</xsl:stylesheet>

We're expecting the following output:

<result>
  <string value="Hello, "/>
  <string value="World"/>
  <string value="! Well, "/>
  <string value="hello"/>
  <string value=", if not joking!"/>
</result>

But often we're getting other results, like:

<result>
  <string value="Hello, "/>
  <string value="World"/>
  <string value="Well, ! "/>
  <string value="hello"/>
  <string value=", if not joking!"/>
</result>

Such output may seriously confuse, unless you will recall the rule for the xpath except operator:

The except operator takes two node sequences as operands and returns a sequence containing all the nodes that occur in the first operand but not in the second operand.

... these operators eliminate duplicate nodes from their result sequences based on node identity. The resulting sequence is returned in document order..

...
The relative order of nodes in distinct trees is stable but implementation-dependent

These words mean that result sequence may be very different from original sequence.

In contrast, if we change $text definition to:

<xsl:variable name="texts" as="node()*"
  select="current-group()[not(. is $string)]"/>

then the result becomes stable, but less clear.

See also Xslt Heisenbug

Thursday, July 15, 2010 8:22:13 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | xslt
# Sunday, July 11, 2010

It does not matter that DataBindExtender looks not usual in the ASP.NET. It turns to be so handy that built-in data binding is not considered to be an option.

After a short try, you uderstand that people tried very hard and have invented many controls and methods like ObjectDataSource, FormView, Eval(), and Bind() with outcome, which is very specific and limited.

In contrast DataBindExtender performs:

  • Two or one way data binding of any business data property to any control property;
  • Converts value before it's passed to the control, or into the business data;
  • Validates the value.

See an example:

<asp:TextBox id=Field8 EnableViewState="false" runat="server"></asp:TextBox>
<bphx:DataBindExtender runat='server'
  EnableViewState='false'
  TargetControlID='Field8'
  ControlProperty='Text'
  DataSource='<%# Import.ClearingMemberFirm %>'
  DataMember='Id'
  Converter='<%# Converters.AsString("XXXXX", false) %>'
  Validator='<%# (extender, value) => Functions.CheckID(value as string) %>'/>

Here, we beside a regualar two way data binding of a property Import.ClearingMemberFirm.Id to a property Field8.Text, format (parse) Converters.AsString("XXXXX", false), and finally validate an input value with a lambda function (extender, value) => Functions.CheckID(value as string).

DataBindExtender works also well in template controls like asp:Repeater, asp:GridView, and so on. Having your business data available, you may reduce a size of the ViewState with EnableViewState='false'. This way DataBindExtender approaches page development to a pattern called MVC.

Recently, we have found that it's also useful to have a way to run a javascript during the page load (e.g. you want to attach some client side event, or register a component). DataBindExtender provides this with OnClientInit property, which is a javascript to run on a client, where this refers to a DOM element:

... OnClientInit='$addHandler(this, "change", function() { handleEvent(event, "Field8"); } );'/>

allows us to attach onchange javascript event to the asp:TextBox.

So, meantime we're very satisfied with what we can achieve with DataBindExtender. It's more than JSF allows, and much more stronger and neater to what ASP.NET has provided.

The sources can be found at DataBindExtender.cs

Sunday, July 11, 2010 7:07:03 AM UTC  #    Comments [2] -
ASP.NET | Thinking aloud | Tips and tricks
# Monday, July 05, 2010

Lately, we have found that we've accustomed to declare C#'s local variables using var:

var exitStateName = exitState == null ? "" : exitState.Name;
var rules = Environment.NavigationRules;
var rule = rules[caller.Name];
var flow = rule.NavigationCases[procedure.OriginExitState];

This makes code cleaner, and in presense of good IDE still allows to figure out types very easely.

We, howerer, found that var tends to have exceptions in its uses. E.g. for some reason most of boolean locals in our code tend to remain explicit (matter of taste?):

bool succeed = false;

try
{
  ...

  succeed = true;
}
finally
{
  if (!succeed)
  {
    ...
  }
}

Also, type often survives in for, but not in foreach:

for(int i = 0; i < sourceDataMapping.Length; ++i)
{
  ...
}

foreach(var property in properties)
{
  ...
}

In addition var has some limitations, as one cannot easily initialize such local with null. From the following we prefer the first approach:

  • IWindowContext context = null;
  • var context = (IWindowContext)null;
  • var context = null as IWindowContext;
  • var context = default(IWindowContext);

We might need to figure out a consistent code style as for var. It might be like that:

  • Numeric, booleans and string locals should use explicit type;
  • Try to avoid locals initialized with null, or without initializer, or use type if such variable cannot be avoided;
  • Use var in all other cases.

Another code style could be like that:

  • For the consistency, completely avoid the use of keyword var.
Monday, July 05, 2010 9:09:26 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Tuesday, June 22, 2010

Recently we were raising a question about serialization of ASPX output in xslt.

The question went like this:

What's the recommended way of ASPX page generation?
E.g.:

------------------------
 <%@ Page AutoEventWireup="true"
   CodeBehind="CurMainMenuP.aspx.cs"
   EnableSessionState="True"
   Inherits="Currency.CurMainMenuP"
   Language="C#"
   MaintainScrollPositionOnPostback="True"
   MasterPageFile="Screen.Master" %>

<asp:Content ID="Content1" runat="server" ContentPlaceHolderID="Title">CUR_MAIN_MENU_P</asp:Content>

<asp:Content ID="Content2" runat="server" ContentPlaceHolderID="Content">
  <span id="id1222146581" runat="server"
    class="inputField system UpperCase" enableviewstate="false">
    <%# Dialog.Global.TranCode %>
  </span>
  ...
------------------------

Notice aspx page directives, data binding expessions, and prefixed tag names without namespace declarations.

There was a whole range of expected answers. We, however, looked whether somebody have already dealed with the task and has a ready solution at hands.

In general it seems that xslt community is very angry about ASPX: both format and technology. Well, put this aside.

The task of producing ASPX, which is almost xml, is not solvable when you're staying with pure xml serializer. Xslt's xsl:character-map does not work at all. In fact it looks as a childish attempt to address the problem, as it does not support character escapes but only grabs characters and substitutes them with strings.

We have decided to create ASPX serializer API producing required output text. This way you use <xsl:output method="text"/> to generate ASPX pages.

With this goal in mind we have defined a little xml schema to describe ASPX irregularities in xml form. These are:

  • <xs:element name="declared-prefix"> - to describe known prefixes, which should not be declared;
  • <xs:element name="directive"> - to describe directives like <%@ Page %>;
  • <xs:element name="content"> - a transparent content wrapper;
  • <xs:element name="entity"> - to issue xml entity;
  • <xs:element name="expression"> - to describe aspx expression like <%# Eval("A") %>;
  • <xs:element name="attribute"> - to describe an attribute of the parent element.

This approach greately simplified for us an ASPX generation process.

The API includes:

Tuesday, June 22, 2010 10:25:41 AM UTC  #    Comments [0] -
Announce | ASP.NET | Thinking aloud | Tips and tricks | xslt
# Tuesday, June 15, 2010

In previous posts we were crying about problems with JSF to ASP.NET migration. Let's point to another one.

Consider that you have an input field, whose value should be validated:

<input type="text" runat="server" ID="id1222146409" maxlength="4"/>
<bphx:DataBindExtender runat="server"
  TargetControlID="id1222146409" ControlProperty="Value"
  DataSource="<%# Import.AaControlAttributes %>"
  DataMember="UserEnteredTrancode"/>

Here we have an input control, whose value is bound to Import.AaControlAttributes.UserEnteredTrancode property. But what is missed is a value validation. Somewhere we have a function that could answer the question whether the value is valid. It should be called like this: Functions.IsTransactionCodeValid(value).

Staying within standard components we can use a custom validator on the page:

<asp:CustomValidator runat="server"
  ControlToValidate="id1222146409"
  OnServerValidate="ValidateTransaction"
  ErrorMessage="Invalid transaction code."/>

and add the following code-behind:

protected void ValidateTransaction(object source, ServerValidateEventArgs args)
{
  args.IsValid = Functions.IsTransactionCodeValid(args.Value);
}

This approach works, however it pollutes the code-behind with many very similar methods. The problem is that the validation rules in most cases are not property of page but one of data model. That's why page validation methods just forward check to somewhere.

While thinking on how to simplify the code we have came up with more conscious and short way to express validators, namely using lambda functions. To that end we have introduced a Validator property of type ValueValidator over DataBindExtender. Where

/// <summary>A delegate to validate values.</summary>
/// <param name="extender">An extender instance.</param>
/// <param name="value">A value to validate.</param>
/// <returns>true for valid value, and false otherwise.</returns>
public delegate bool ValueValidator(DataBindExtender extender, object value);

/// <summary>An optional data member validator.</summary>
public virtual ValueValidator Validator { get; set; }

With this new property the page markup looks like this:

<input type="text" runat="server" ID="id1222146409" maxlength="4"/>
<bphx:DataBindExtender runat="server"
  TargetControlID="id1222146409" ControlProperty="Value"
  DataSource="<%# Import.AaControlAttributes %>"
  DataMember="UserEnteredTrancode"
  Validator='<%# (extender, value) => Functions.IsTransactionCodeValid(value as string) %>'
  ErrorMessage="Invalid transaction code."/>

This is almost like an event handler, however it allowed us to call data model validation logic without unnecessary code-behind.

The updated DataBindExtender can be found at DataBindExtender.cs.

Tuesday, June 15, 2010 6:36:44 AM UTC  #    Comments [0] -
ASP.NET | Thinking aloud | Tips and tricks
# Thursday, June 10, 2010

Being well behind of the latest news and traps of the ASP.NET, we're readily falling on each problem. :-)

This time it's a script injection during data binding.

In JSF there is a component to output data called h:outputText. Its use is like this:

<span jsfc="h:outputText" value="#{myBean.myProperty}"/>

The output is a span element with data bound value embeded into content. The natural alternative in ASP.NET seems to be an asp:Label control:

<asp:Label runat="server" Text="<%# Eval("MyProperty") %>"/>

This almost works except that the h:outputText escapes data (you may override this and specify attribute escape="false"), and asp:Label never escapes the data.

This looks as a very serious omission in ASP.NET (in fact very close to a security hole). What are chances that when you're creating a new page, which uses data binding, you will not forget to fix code that wizard created for you and to change it to:

<asp:Label runat="server" Text="<%# Server.HtmlEncode(Eval("MyProperty")) %>"/>

Eh? Think what will happen if MyProperty will return a text that looks like a script (e.g.: <script>alert(1)</script>), while you just wanted to output a label?

To address the issue we've also introduced a property Escape into DataBindExtender. So at present we have a code like this:

<asp:Label runat="server" ID="MyLabel"/>
<bphx:DataBindExtender runat="server" TargetControlID="MyLabel"
  ControlProperty="Text" ReadOnly="true" Escape="true"
  DataSource="<%# MyBean %>" DataMember="MyProperty"/>

See also: A DataBindExtender, Experience of JSF to ASP.NET migration

Thursday, June 10, 2010 1:06:19 PM UTC  #    Comments [0] -
ASP.NET | Thinking aloud | Tips and tricks
# Saturday, June 05, 2010

After struggling with ASP.NET data binding we found no other way but to introduce our little extender control to address the issue.

We were trying to be minimalistic and to introduce two way data binding and to support data conversion. This way extender control (called DataBindExtender) have following page syntax:

<asp:TextBox id=TextBox1 runat="server"></asp:TextBox>
<cc1:DataBindExtender runat="server"
  DataSource="<%# Data %>"
  DataMember="ID"
  TargetControlID="TextBox1"
  ControlProperty="Text" />

Two way data binding is provided with DataSource object (notice data binding over this property) and a DataMember property from the one side, and TargetControlID and ControlProperty from the other side. DataBindExtender supports Converter property of type TypeConverter to support custom converters.

DataBindExtender is based on AjaxControlToolkit.ExtenderControlBase class and implements System.Web.UI.IValidator. ExtenderControlBase makes implementation of extenders extremely easy, while IValidator plugs natuarally into page validation (Validate method, Validators collections, ValidationSummary control).

The good point about extenders is that they are not visible in designer, while it exposes properties in extended control itself. The disadvantage is that it requires Ajax Control Toolkit, and also ScriptManager component of the page.

To simplify the use DataBindExtender gets data from control and puts the value into data source in Validate method, and puts data into control in OnPreRender method; thus no specific action is required to perform data binding.

Source for the DataBindExtender is DataBindExtender.cs.

Saturday, June 05, 2010 11:22:03 AM UTC  #    Comments [0] -
ASP.NET | Thinking aloud | Tips and tricks
# Saturday, May 29, 2010

We used to think that ASP.NET is a way too powerful than JSF. It might be still true, but not when you are accustomed to JSF and spoiled with its code practice...

Looking at both technologies from a greater distance, we now realize that they give almost the same level of comfort during development, but they are different. You can feel this after you were working for some time with one technology and now are to implement similar solution in opposite one. That is where we have found ourselves at present.

The funny thing is that we did expect some problems but in a different place. Indeed, both ASP.NET and JSF are means to define a page layout and to map input and output of business data. While with the presentation (controls, their compositions, masters, styles and so on) you can find more or less equal analogies, the differences of implementation of data binding is a kind of a pain.

We have found that data binding in ASP.NET is somewhat awkward. Its Eval and Bind is bearable in simple cases but almost unusable when you business data is less trivial, or if you have to apply custom data formatting.

In JSF, with its Expression Language, we can perform two way data binding for rather complex properties like ${data.items[index + 5].property}, or to create property adapters ${my:asSomething(data.bean, "property").Value}, or add standard or custom property converters. In contrast data binding in ASP.NET is limited to simple property path (no expressions are supported), neither custom formatters are supported (try to format number as a telephone number).

Things work well when you're designing ASP.NET application from scratch, as you naturally avoid pitfalls, however when you got existing business logic and need to expose it to the web, you have no other way but to write a lot of code behind just to smooth out the problems that ASP.NET exhibits.

Another solution would be to design something like extender control that would attach more proper data binding and formatting facilities to control properties. That would allow to make page definitions in more declarative way, like what we have now in JSF.

Saturday, May 29, 2010 2:16:05 PM UTC  #    Comments [0] -
ASP.NET | JSF and Facelets | Thinking aloud
# Sunday, May 23, 2010

While porting a solution from JSF to ASP.NET we have seen an issue with synchronization of access to a data stored in a session from multiple requests.

Consider a case when you store a business object in a session.

Going through the request lifecycle we observe that this business object may be accessed at different stages: data binding, postback event handler, security filters, other.

Usually this business object is mutable and does not assume concurent access. Browsers, however, may easily issue multiple requests to the same session at the same time. In fact, such behaviour, is not even an exception, as browsers nowadays are often sending concurrent requests.

In the JSF we're using a sync object, which is part of business object itself; lock it and unlock at the begin and at the end of a request correspondingly. This works perfectly as JSF guarantees that:

  • lock is released after it's acquired (we use request scope bean with @PostConstruct and @PreDestroy annotations to lock and unlock);
  • both lock and unlock take place in the same thread.

ASP.NET, in contrast, tries to be more asynchronous, and allows for different stages of request to take place in different threads. This could be seen indirectly in the documentation, which does not give any commitments in this regards, and with code inspection where you can see that request can begin in one thread, and a next stage can be queued for the execution into the other thread.

In addition, ASP.NET does not guarantee that if BeginRequest has been executed then EndRequest will also run.

The conclusion is that we should not use locks to synchronize access to the same session object, but rather try to invent other means to avoid data races.

Update msdn states:

Concurrent Requests and Session State

Access to ASP.NET session state is exclusive per session, which means that if two different users make concurrent requests, access to each separate session is granted concurrently. However, if two concurrent requests are made for the same session (by using the same SessionID value), the first request gets exclusive access to the session information. The second request executes only after the first request is finished. (The second session can also get access if the exclusive lock on the information is freed because the first request exceeds the lock time-out.)

This means that the required synchronization is already built into ASP.NET. That's good.

Sunday, May 23, 2010 12:22:35 PM UTC  #    Comments [0] -
ASP.NET | Thinking aloud
# Friday, May 14, 2010

We have implemented report parser in C#. Bacause things are spinned around C#, a schema definition is changed.

We have started from classes defining a report definition tree, annotated these classes for xml serialization, and, finally, produced xml schema for such tree. So, at present, it is not an xml schema with annotations but a separate xml schema.

In addition we have defined APIs:

  • to enumerate report data (having report definition and report data one can get IEnumerable<ViewValue> to iterate report data in structured form);
  • to read report through XmlReader, which allows, for example, to have report as input for an xslt tranformation.
  • to write report directly into XmlWriter.

An example of report definition as C# code is: MyReport.cs. The very same report definition but serialized into xml is my-report.xml. A generated xml schema for a report definition is: schema0.xsd.

The good point about this solution is that it's already flexible enough to describe every report layout we have at hands, and it's extendable. Our measurments show that report parsing is extremely fast and have very small memory footprint due to forward only nature of report definitions.

From the design point of view report definition is a view of original text data with view info attached.

At present we have defined following views:

  • Element - a named view to generate output from a content view;
  •  Content - a view to aggregate other views together;
  • Choice - a view to produce output from one of content views;
  • Sequence - a view to sequence input view by key expressions, and to attach an index to each sequence item;
  • Iterator - a view to generate output from input view while some condition is true, and to attach an iteration index to each part of output view;
  • Page - a view to remove page headers and footers in the input view, and to attach an index to each page;
  • Compute - a named view to produce result of evaluation of expression as output view;
  •  Data - a named view to produce output value from some bounds of input view, and optionally to convert, validate and format the value.

To specify details of definitions there are:

  • expressions to deal with integers: Add, Div, Integer, MatchProperty, Max, Min, Mod, Mul, Neg, Null, Sub, VariableRef, ViewProperty, Case;
  • conditions to deal with booleans: And, EQ, GE, GT, IsMatch, LE, LT, NE, Not, Or.

At present there is no a specification of a report definitions. Probably, it's the most complex part to create such a spec for a user without deep knowledge. At present, our idea is that one should use xml schema (we should polish generated schema) for the report definition and schema aware editor to build report definitions. That's very robust approach working perfectly with languages xom.

C# sources can be found at: ReportLayout.zip including report definition classes and a sample report.

Friday, May 14, 2010 12:45:42 PM UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Sunday, May 09, 2010
Ribbon of Saint George

We're facing a task of parsing reports produced from legacy applications and converting them into a structured form, e.g. into xml. These xml files can be processed further with up to date tools to produce good looking reports.

Reports at hands are of very different structure and of size: from a couple of KB to a several GB. The good part is that they mostly have a tabular form, so it's easy to think of specific parsers in case of each report type.

Our goal is to create an environment where a less qualified person(s) could create and manage such parsers, and only rarely to engage someone who will handle less untrivial cases.

Our analysis has shown that it's possible to write such parser in almost any language: xslt, C#, java.

Our approach was to create an xml schema annotations that from one side define a data structure, and from the other map report layout. Then we're able to create an xslt that will generate either xslt, C#, or java parser according to the schema definitions. Because of languages xom, providing XML Object Model and serialization stylesheets for C# and java, it does not really matter what we shall generate xslt or C#/java, as code will look the same.

The approach we're going to use to describe reports is not as powerfull as conventional parsers. Its virtue, however, is simplicity of specification.

Consider a report sample (a data to extract is in bold):

1 TITLE ...                    PAGE:            1
 BUSINESS DATE: 09/30/09   ... RUN DATE: 02/23/10
 CYCLE : ITD      RUN: 001 ... RUN TIME: 09:22:39

        CM         BUS   ...
  CO    NBR  FRM   FUNC  ...
 ----- ----- ----- -----  
 XXX   065   065   CLR   ...
 YYY   ...
...
1 TITLE ...                    PAGE:            2
 BUSINESS DATE: 09/30/09   ... RUN DATE: 02/23/10
 CYCLE : ITD      RUN: 001 ... RUN TIME: 09:22:39

        CM         BUS   ...
  CO    NBR  FRM   FUNC  ...
 ----- ----- ----- -----  
 AAA   NNN   MMM   PPP   ...
 BBB   ...
...

* * * * *  E N D   O F   R E P O R T  * * * * *

We're approaching to the report through a sequence of views (filters) of this report. Each veiw localizes some report data either for the subsequent filterring or for the extraction of final data.

Looking into the example one can build following views of the report:

  1. View of data before the "E N D   O F   R E P O R T" line.
  2. View of remaining data without page headers and footers.
  3. Views of table rows.
  4. Views of cells.

A sequence of filters allows us to build a pipeline of transformations of original text. This also allows us to generate a clean xslt, C# or java code to parse the data.

At first, our favorite language for such parser was xslt. Unfortunatelly, we're dealing with Saxon xslt implementation, which is not very strong in streaming processing. Without a couple of extension functions to prevent caching, it tends to cache whole input in the memory, which is not acceptable.

At present we have decided to start from C# code, which is pure C# naturally. :-)

Code still is in the development but at present we would like to share the xml schema annotations describing report layout: report-mapping.xsd, and a sample of report description: test.xsd.

Sunday, May 09, 2010 5:18:57 AM UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Saturday, May 01, 2010

To see that the problem with Generator functions in xslt is a bit more complicated compare two functions.

The first one is quoted from the earlier post:

  <xsl:function name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value"/>
    <xsl:sequence select="t:generate($value * 2)"/>
  </xsl:function>

It does not work in Saxon: crashes with out of memory.

The second one is slightly modified version of the same function:

  <xsl:function name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value + 0"/>
    <xsl:sequence select="t:generate($value * 2)"/>
  </xsl:function>

It's working without problems. In first case Saxon decides to cache all function's output, in the second case it decides to evaluate data lazily on demand.

It seems that optimization algorithms implemented in Saxon are so plentiful and complex that at times they fool one another. :-)

See also: Generator functions

Saturday, May 01, 2010 7:18:24 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | xslt
# Friday, April 23, 2010

There are some complications with streamed tree that we have implemented in saxon. They are due to the fact that only a view of input data is available at any time. Whenever you access some element that's is not available you're getting an exception.

Consider an example. We have a log created with java logging. It looks like this:

<log>
  <record>
    <date>...</date>
    <millis>...</millis>
    <sequence>...</sequence>
    <logger>...</logger>
    <level>INFO</level>
    <class>...</class>
    <method>...</method>
    <thread>...</thread>
    <message>...</message>
  </record>
  <record>
    ...
  </record>
  ...

We would like to write an xslt that returns a page of log as html:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xs t">

  <xsl:param name="start-page" as="xs:integer" select="1"/>
  <xsl:param name="page-size" as="xs:integer" select="50"/>

  <xsl:output method="xhtml" byte-order-mark="yes" indent="yes"/>

  <!-- Entry point. -->
  <xsl:template match="/log">
    <xsl:variable name="start" as="xs:integer"
      select="($start-page - 1) * $page-size + 1"/>
    <xsl:variable name="records" as="element()*"
      select="subsequence(record, $start, $page-size)"/>

    <html>
      <head>
        <title>
          <xsl:text>A log file. Page: </xsl:text>
          <xsl:value-of select="$start-page"/>
        </title>
      </head>
      <body>
        <table border="1">
          <thead>
            <tr>
              <th>Level</th>
              <th>Message</th>
            </tr>
          </thead>
          <tbody>
            <xsl:apply-templates mode="t:record" select="$records"/>
          </tbody>
        </table>
      </body>
    </html>
  </xsl:template>

  <xsl:template mode="t:record" match="record">
    <!-- Make a copy of record to avoid streaming access problems. -->
    <xsl:variable name="log">
      <xsl:copy-of select="."/>
    </xsl:variable>

    <xsl:variable name="level" as="xs:string"
      select="$log/record/level"/>
    <xsl:variable name="message" as="xs:string"
      select="$log/record/message"/>

    <tr>
      <td>
        <xsl:value-of select="$level"/>
      </td>
      <td>
        <xsl:value-of select="$message"/>
      </td>
    </tr>
  </xsl:template>

</xsl:stylesheet>

This code does not work. Guess why? Yes, it's subsequence(), which is too greedy. It always wants to know what's the next node, so it naturally skips a content of the current node. Algorithmically, such saxon code could be rewritten, and could possibly work better also in modes other than streaming.

A viable workaround, which does not use subsequence, looks rather untrivial:

<!-- Entry point. -->
<xsl:template match="/log">
  <xsl:variable name="start" as="xs:integer"
    select="($start-page - 1) * $page-size + 1"/>
  <xsl:variable name="end" as="xs:integer"
    select="$start + $page-size"/>

  <html>
    <head>
      <title>
        <xsl:text>A log file. Page: </xsl:text>
        <xsl:value-of select="$start-page"/>
      </title>
    </head>
    <body>
      <table border="1">
        <thead>
          <tr>
            <th>Level</th>
            <th>Message</th>
          </tr>
        </thead>
        <tbody>
          <xsl:sequence select="
            t:generate-records(record, $start, $end, ())"/>
        </tbody>
      </table>
    </body>
  </html>
</xsl:template>

<xsl:function name="t:generate-records" as="element()*">
  <xsl:param name="records" as="element()*"/>
  <xsl:param name="start" as="xs:integer"/>
  <xsl:param name="end" as="xs:integer?"/>
  <xsl:param name="result" as="element()*"/>

  <xsl:variable name="record" as="element()?" select="$records[$start]"/>

  <xsl:choose>
    <xsl:when test="(exists($end) and ($start > $end)) or empty($record)">
      <xsl:sequence select="$result"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Make a copy of record to avoid streaming access problems. -->
      <xsl:variable name="log">
        <xsl:copy-of select="$record"/>
      </xsl:variable>

      <xsl:variable name="level" as="xs:string"
        select="$log/record/level"/>
      <xsl:variable name="message" as="xs:string"
        select="$log/record/message"/>

      <xsl:variable name="next-result" as="element()*">
        <tr>
          <td>
            <xsl:value-of select="$level"/>
          </td>
          <td>
            <xsl:value-of select="$message"/>
          </td>
        </tr>
      </xsl:variable>

      <xsl:sequence select="
        t:generate-records
        (
          $records,
          $start + 1,
          $end,
          ($result, $next-result)
        )"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>

Here we observed the greediness of saxon, which too early tried to consume more input than it's required. In the other cases we have seen that it may defer actual data access to the point when there is no data anymore.

So, without tuning internal saxon logic it's possible but not easy to write stylesheets that exploit streaming features.

P.S. Updated sources are at streamedtree.zip

Friday, April 23, 2010 10:12:38 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Thursday, April 22, 2010

At some point we needed to have an array with volatile elements in java.

We knew that such beast is not found in the java world. So we searched the Internet and found the answers that are so wrong, and introduce so obscure threading bugs that the guys who provided them would better hide them and run immediately to fix their buggy programs...

The first one is Volatile arrays in Java. They suggest such solution:

volatile int[] arr = new int[...];

...
arr[4] = 100;
arr = arr;

The number two: What Volatile Means in Java

A guy assures that this code works:

Fields:

int answer = 0;
volatile boolean ready = false;

Thread1:

answer = 42;
ready = true;

Thread2:

if (ready)
{
  print(answer);
}

They are very wrong! Non volatile access can be reordered by the implementation. See Java's Threads and Locks:

The rules for volatile variables effectively require that main memory be touched exactly once for each use or assign of a volatile variable by a thread, and that main memory be touched in exactly the order dictated by the thread execution semantics. However, such memory actions are not ordered with respect to read and write actions on nonvolatile variables.

They probably thought of locks when they argued about volatiles:

a lock action acts as if it flushes all variables from the thread's working memory; before use they must be assigned or loaded from main memory.

P.S. They would better recommend AtomicReferenceArray.

Thursday, April 22, 2010 1:05:48 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Wednesday, April 21, 2010

When time has come to process big xml log files we've decided to implement streamable tree in saxon the very same way it was implemented in .net eight years ago (see How would we approach to streaming facility in xslt).

It's interesting enough that the implementation is similar to one of composable tree. There a node never stores a reference to a parent, while in the streamed tree no references to children are stored. This way only a limited subview of tree is available at any time. Implementation does not support preceding and preceding-sibling axes. Also, one cannot navigate to a node that is out of scope.

Implementation is external (there are no changes to saxon itself). To use it one needs to create an instance of DocumentInfo, which pulls data from XMLStreamReader, and to pass it as an input to a transformation:

Controller controller = (Controller)transformer;
XMLInputFactory factory = XMLInputFactory.newInstance();
StreamSource inputSource = new StreamSource(new File(input));
XMLStreamReader reader = factory.createXMLStreamReader(inputSource);
StaxBridge bridge = new StaxBridge();

bridge.setPipelineConfiguration(
  controller.makePipelineConfiguration());
bridge.setXMLStreamReader(reader);
inputSource = new DocumentImpl(bridge);

transformer.transform(inputSource, new StreamResult(output));

This helped us to format an xml log file of arbitrary size. An xslt like this can do the work:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xs">

  <xsl:template match="/log">
    <html>
      <head>
        <title>Log</title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="message">
   ...
  </xsl:template>

  <xsl:template match="message[@error]">
    ...
  </xsl:template>

  ...

</xsl:stylesheet>

Implementation can be found at: streamedtree.zip

Wednesday, April 21, 2010 7:10:34 AM UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Friday, April 09, 2010

By the generator we assume a function that produces an infinitive output sequence for a particular input.

That's a rather theoretical question, as xslt does not allow infinitive sequence, but look at the example:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt"
  exclude-result-prefixes="xs t">

  <xsl:template match="/">
    <xsl:variable name="value" as="xs:string" select="'10100101'"/>

    <xsl:variable name="values" as="xs:integer+" select="t:generate(1)"/>

    <!--<xsl:variable name="values" as="xs:integer+">
      <xsl:call-template name="t:generate">
        <xsl:with-param name="value" select="1"/>
      </xsl:call-template>
    </xsl:variable>-->

    <xsl:variable name="integer" as="xs:integer" select="
      sum
      (
        for $index in 1 to string-length($value) return
          $values[$index][substring($value, $index, 1) = '1']
      )"/>

    <xsl:message select="$integer"/>
  </xsl:template>

  <xsl:function name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value"/>
    <xsl:sequence select="t:generate($value * 2)"/>
  </xsl:function>

  <!--<xsl:template name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value"/>

    <xsl:call-template name="t:generate">
      <xsl:with-param name="value" select="$value * 2"/>
    </xsl:call-template>
  </xsl:template>-->

</xsl:stylesheet>

Here the logic uses such a generator and decides by itself where to break.

Should such code be valid?

From the algorithmic perspective example would better to work, as separation of generator logic and its use are two different things.

Friday, April 09, 2010 2:38:34 PM UTC  #    Comments [0] -
Thinking aloud | xslt

Lately, after playing a little with saxon tree models, we thought that design would be more cleaner and implementation faster if NamePool were implemented differently.

Now, saxon is very pessimistic about java objects thus it prefers to encode qualified names with integers. The encoding and decoding is done in the NamePool. Other parts of code use these integer values.

Operations done over these integers are:

  • equality comparision of two such integers in order to check whether to qualified or extended names are equal;
  • get different parts of qualified name from NamePool.

We would design this differently. We would:

  1. create a QualifiedName class to store all name parts.
  2. declare NamePool to create and cache QualifiedName instances.

This way:

  • equality comparision would be a reference comparision of two instances;
  • different parts of qualified name would become a trivial getter;
  • contention of such name pool would be lower.

That's the implementation we would propose: QualifiedName.java, NameCache.java

Friday, April 09, 2010 1:05:30 PM UTC  #    Comments [0] -
Thinking aloud | xslt
# Thursday, April 08, 2010

Earlier, in the entry "Inline functions in xslt 2.1" we've described an implementation of xml tree model that may share subtrees among different trees.

This way, in a code:

<xsl:variable name="elements" as="element()*" select="..."/>

<xsl:variable name="result" as="element()">
  <result>
    <xsl:sequence select="$elements"/>
  </result>
</xsl:variable>

the implementation shares internal representation among $elements and subtree of $result. From the perspective of xslt it looks as completely different subtrees with different node identities, which is in the accordance with its view of the world.

After a short study we've decided to create a research implementation of this tree model in saxon. It's took only a couple of days to introduce a minimal changes to engine, to refactor linked tree into a new composable tree, and to perform some tests.

In many cases saxon has benefited immediately from this new tree model, in some other cases more tunings are required.

Our tests've showed that this new tree performed better than linked tree, but a little bit worser than tiny tree. On the other hand, it's obvious that conventional code patterns avoid subtree copying, assuming it's expensive operation, thus one should rethink some code practices to benefit from composable tree.

Implementation can be downloaded at: saxon.composabletree.zip

Thursday, April 08, 2010 6:26:02 AM UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Sunday, April 04, 2010

From the web we know that xslt WG is thinking now on how to make xslt more friendly to a huge documents. They will probably introduce some xslt syntax to allow implementation to be ready for a such processing.

They will probably introduce an indicator marking a specific mode for streaming. XPath in this mode will probably be restricted to a some its subset.

The funny part is that we have implemented similar feature back in 2002 in .net. It was called XPathForwardOnlyNavigator.

Implementation stored only several nodes at a time (context node and its ancestors), and read data from XmlReader perforce. Thus one could navigate to ancestor elements, to children, and to the following siblings, but never to any previous node. When one tried to reach a node that was already not available we threw an exception.

It was simple, not perfect (too restrictive) but it was pluggable in .net's xslt, and allowed to process files of any size.

That old implementation looks very attractive even now in 2010. We expect that WG with their decisions will not rule out such or similar solutions, and will not force implementers to write alternative engine for xslt streaming.

Sunday, April 04, 2010 8:53:27 PM UTC  #    Comments [0] -
Thinking aloud | xslt
# Friday, April 02, 2010

Xslt 1.0 has been designed based on the best intentions. Xslt 2.0 got a legacy baggage.

If you're not entirely concentrated during translation of your algorithms into xslt 2.0 you can get into trap, as we did.

Consider a code snapshot:

<xsl:variable name="elements" as="element()+">
  <xsl:apply-templates/>
</xsl:variable>

<xsl:variable name="converted-elements" as="element()+"
  select="$elements/t:convert(.)"/>

Looks simple, isn't it?

Our intention was to get converted elements, which result from some xsl:apply-templates logic.

Well, this code works... but rather sporadically, as results are often in wrong order! This bug is very close to what is called a Heisenbug.

So, where is the problem?

Elementary, my dear Watson:

  1. xsl:apply-templates constructs a sequence of rootless elements.
  2. $elements/t:convert(.) converts elements and orders them in document order.

Here is a tricky part:

The relative order of nodes in distinct trees is stable but implementation-dependent...

Clearly each rootless element belongs to a unique tree.

After that we have realized what the problem is, code has been immediately rewritten as:

<xsl:variable name="elements" as="element()+">
  <xsl:apply-templates/>
</xsl:variable>

<xsl:variable name="converted-elements" as="element()+" select="
  for $element in $elements return
    t:convert($element)"/>

P.S. Taking into an accout a size of our xslt code base, it took a half an hour to localize the problem. Now, we're at position to review all uses of slashes in xslt. As you like it?

Friday, April 02, 2010 5:53:18 PM UTC  #    Comments [0] -
Thinking aloud | xslt
# Monday, March 22, 2010

Inline functions in xslt 2.1 look often as a some strange aberration. Sure, there are very usefull cases when they are delegates of program logic (e.g. comparators, and filters), but often (probably more often) we can see that it's use is to model data structures.

As an example, suppose you want to model a structure with three properties say a, b, and c. You implement this creating functions that wrap and unwrap the data:

function make-data($a as item(), $b as item(), $c as item()) as function() as item()+
{
  function() { $a, $b, $c }
}

function a($data as function() as item()+) as item()
{
  $data()[1]
}

function b($data as function() as item()+) as item()
{
  $data()[2]
}

function c($data as function() as item()+) as item()
{
  $data()[3]
}

Clever?

Sure, it is! Here, we have modeled structrue with the help of sequence, which we have wrapped into a function item.

Alas, clever is not always good (often it's a sign of a bad). We just wanted to define a simple structure. What it has in common with function?

There is a distance between what we want to express, designing an algorithm, and what we see looking at the code. The greater the distance, the more efforts are required to write, and to read the code.

It would be so good to have simpler way to express such concept as a structure. Let's dream a little. Suppose you already have a structure, and just want to access its members. An idea we can think about is an xpath like access method:

$data/a, $data/b, $data/c

But wait a second, doesn't $data looks very like an xml element, and its accessors are just node tests? That's correct, so data constructor may coincide with element constructor.

Then what pros and cons of using of xml elements to model structures?

Pros are: existing xml type system, and sensibly looking code (you just understand that here we're constructing a structure).

Cons are: xml trees are implemented the way that does not assume fast (from the perfromace perspective) composition, as when you construct a structure a copy of data is made.

But here we observe that "implemented" is very important word in this context. If xml tree implementation would not store reference to the parent node then subtrees could be composed very efficiently (note that tree is assumed to be immutable). Parent node could be available through a tree navigator, which would contain reference to a node itself and to a parent tree navigator (or to store child parent map somewhere near the root).

Such tree structure would probably help not only in this particular case but also with other conventional xslt code patterns.

P.S. Saxon probably could implement its NodeInfo this way.

Update: see also Custom tree model.

Monday, March 22, 2010 11:02:07 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Monday, March 15, 2010

A while ago we have proposed to introduce maps as built-in types in xpath/xquery type system: Tuples and maps.

The suggestion has been declined (probably our arguments were not convincing). We, however, still think that maps along with sequences are primitives, required to build sensible (not obscure one) algorithms. To see that map is at times is the best way to resolve the problems we shall refer to an utility function to allocate names in scope. Its signature looks like this:

<!--
Allocates unique names in the form $prefix{number}?.
Note that prefixes may coincide.
$prefixes - a name prefixes.
$names - allocated names pool.
$name-max-length - a longest allowable name length.
Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
  <xsl:param name="prefixes" as="xs:string*"/>
  <xsl:param name="names" as="xs:string*"/>
  <xsl:param name="name-max-length" as="xs:integer?"/>

Just try to program it and you will find yourselves coding something like one defined at cobolxom.

To be efficient such maps should provide, at worst, a logarithmic operation complexity:

  • Access to the map through a key (and/or by index) - complexity is LnN;
  • Creation of a new map with added or removed item - complexity is LnN;
  • Construction of the map from ordered items - complexity is O(N);
  • Construction of the map from unordered items - complexity is O(N*LnN);
  • Total enumeration of all items - complexity is O(N*LnN);

These performance metrics are found in many functional and procedural implementations of the maps. Typical RB and AVL tree based maps satisfy these restrictions.

What we suggest is to introduce map implementation into the exslt2 (assuming inline functions are present). As a sketch we have implemented pure AVL Tree in Xslt 2.0:

We do not assume that implementation like this should be used, but rather think of some extension function(s) that provides a better performance.

What do you think?

Monday, March 15, 2010 1:59:19 PM UTC  #    Comments [1] -
Thinking aloud | xslt
# Sunday, February 28, 2010

The story about immutable tree would not be complete without xslt implementation. To make it possible one needs something to approxomate tree nodes. You cannot implement such consruct efficiently in pure xslt 2.0 (it would be either unefficient or not pure).

To isolate the problem we have used tuple interface:

  • tuple:ref($items as item()*) as item() - to wrap items into a tuple;
  • tuple:deref($tuple as item()?) as item()* - to unwrap items from a tuple;
  • tuple:is-same($first as item(), $second as item()) as xs:boolean - to test whether two tuples are the same.

and defined inefficient implementation based on xml elements. Every other part of code is a regular AVL algorithm implementation.

We want to stress again that even assuming that there is a good tuple implementation we would prefer built-in associative container implementation. Why the heck you need to include about 1000 lines of code just to use a map?

Source code is:

Sunday, February 28, 2010 7:28:07 PM UTC  #    Comments [0] -
Thinking aloud | xslt

We like Visual Studio very much, and try to adopt new version earlier.

For the last time our VS's use pattern is centered around xml and xslt. In our opinion VS 2008 is the best xslt 2 editor we have ever seen even with lack of support of xslt 2.0 debugging.

Unfortunatelly, that is still a true when VS 2010 is almost out. VS 2008 is just 2 - 3 times faster. You can observe this working with xslt files like those in languages-xom.zip (1000 - 2000 rows). Things just become slow.

We still hope that VS 2010 will make a final effort to outdo what VS 2008 has already delivered.

Sunday, February 28, 2010 6:37:58 PM UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, February 24, 2010

Why do we return to this theme again?

Well, it's itching!

In cobolxom there is an utility function to allocate names in scope. Its signature looks like this:

<!--
  Allocates unique names in the form $prefix{number}?.
  Note that prefixes may coincide.
    $prefixes - a name prefixes.
    $names - allocated names pool.
    $name-max-length - a longest allowable name length.
    Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
  <xsl:param name="prefixes" as="xs:string*"/>
  <xsl:param name="names" as="xs:string*"/>
  <xsl:param name="name-max-length" as="xs:integer?"/>

We have created several different implementations (all use recursion). Every implementation works fair for relatively small input sequences, say N < 100, but we have cases when there are about 10000 items on input. Algorithm's worst case complexity, in absence of associative containers, is O(N*N), and be sure it's an O-o-o-oh... due to xslt engine implementation.

If there were associative containers with efficient access (complexity is O(LogN)), and construction of updated container (complexity is also O(LogN)) then implementation would be straightforward and had complexity O(N*LogN).

Wednesday, February 24, 2010 7:34:07 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Sunday, February 07, 2010

Given:

public class N
{
  public readonly N next;
}

What needs to be done to construct a ring of N: n1 refers to n2, n2 to n3, ... nk to n1? Is it possible?

Sunday, February 07, 2010 7:57:08 AM UTC  #    Comments [2] -
Thinking aloud | Tips and tricks
# Saturday, February 06, 2010

To end with immutable trees, at least for now, we've implemented IDictionary<K, V>. It's named Map<K, V>. Functionally it looks very like SortedDictionary<K, V>. there are some differences, however:

  • Map in contrast to SortedDictionary is very cheap on copy.
  • Bacause Map is based on AVL tree, which is more rigorly balanced than RB tree, so it's a little bit faster asymptotically for lookup than SortedDictionary, and a little bit slower on modification.
  • Due to the storage structure: node + navigator, Map consumes less memory than SortedDictionary, and is probably cheaper for GC (simple garbage graphs).
  • As AVL tree stores left and right subtree sizes, in contrast to a "color" in RB tree, we able to index data in two ways: with integer index, and with key value.

Sources are:

Update:

It was impossible to withstand temptation to commit some primitive performance comparision. Map outperforms SortedDictionary both in population and in access. this does not aggree with pure algorithm's theory, but there might be other unaccounted factors: memory consumption, quality of implementation, and so on.

Program.cs is updated with measurements.

Update 2:

More occurate tests show that for some key types Map's faster, for others SortedDictionary's faster. Usually Map's slower during population (mutable AVL tree navigator may fix this). the odd thing is that Map<string, int> is faster than SortedDictionary<string, int> both for allocaction and for access. See excel report.

Update 3:

Interesing observation. The following table shows maximal and average tree heights for different node sizes in AVL and RB trees after a random population:

AVL RB
Size Max Avg Max Avg
10 4 2.90 5 3.00
50 7 4.94 8 4.94
100 8 5.84 9 5.86
500 11 8.14 14 8.39
1000 12 9.14 16 9.38
5000 15 11.51 18 11.47
10000 16 12.53 20 12.47
50000 19 14.89 23 14.72
100000 20 15.90 25 15.72
500000 25 18.26 28 18.27
1000000 25 19.28 30 19.27

Here, according with theory, the height of AVL tree is shorter than the height of RB tree. But what is most interesting is that the depth of an "average node". This value describes a number of steps required to find a random key. RB tree is very close and often is better than AVL in this regard.

Saturday, February 06, 2010 6:31:13 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Wednesday, February 03, 2010

It was obvious as hell from day one of generics that there will appear obscure long names when you will start to parametrize your types. It was the easiest thing in the world to take care of this in advanvce. Alas, C# inherits C++'s bad practices.

Read Associative containers in a functional languages and Program.cs to see what we're talking about.

Briefly, there is a pair (string, int), which in C# should be declared as:

System.Collections.Generic.KeyValuePair<string, int>

Obviously we would like to write it in a short way. These are our attempts, which fail:

1. Introduce generic alias Pair<K, V>:

using System.Collections.Generic;
using Pair<K, V> = KeyValuePair<K, V>;

2. Introduce type alias for a generic type with specific types.

using System.Collections.Generic;
using Pair = KeyValuePair<string, int>;

And this is only one that works:

using Pair = System.Collections.Generic.KeyValuePair<string, int>;

Do you think is it bearable? Well, consider the following:

  • There is a generic type ValueNode<T>, where T should be Pair.
  • There is a generic type TreeNavigator<N>, where N is should be ValueNode<Pair>.

The declaration looks like this:

using Pair = System.Collections.Generic.KeyValuePair<string, int>;
using Node = NesterovskyBros.Collections.AVL.ValueNode<
  System.Collections.Generic.KeyValuePair<string, int>>;
using Navigator = NesterovskyBros.Collections.AVL.TreeNavigator<
  NesterovskyBros.Collections.AVL.ValueNode<
    System.Collections.Generic.KeyValuePair<string, int>>>;

Do you still think is it acceptable?

P.S. Legacy thinking led C#'s and java's designers to the use of word "new" for the object construction. It is not required at all. Consider new Pair("A", 1) vs Pair("A", 1). C++ prefers second form. C# and java always use the first one.

Wednesday, February 03, 2010 11:59:19 AM UTC  #    Comments [1] -
Thinking aloud | Tips and tricks
# Wednesday, January 27, 2010

Continuing with the post "Ongoing xslt/xquery spec update" we would like to articulate what options regarding associative containers do we have in a functional languages (e.g. xslt, xquery), assuming that variables are immutable and implementation is efficient (in some sense).

There are three common implementation techniques:

  • store data (keys, value pairs) in sorted array, and use binary search to access values by a key;
  • store data in a hash map;
  • store data in a binary tree (usually RB or AVL trees).

Implementation choice considerably depends on operations, which are taken over the container. Usually these are:

  1. construction;
  2. value lookup by key;
  3. key enumeration (ordered or not);
  4. container modification (add and remove data into the container);
  5. access elements by index;

Note that modification in a functional programming means a creation of a new container, so here is a division:

  1. If container's use pattern does not include modification, then probably the simplest solution is to build it as an ordered sequence of pairs, and use binary search to access the data. Alternatively, one could implement associative container as a hash map.
  2. If modification is essential then neither ordered sequence of pairs, hash map nor classical tree implementation can be used, as they are either too slow or too greedy for a memory, either during modification or during access.

On the other hand to deal with container's modifications one can build an implementation, which uses "top-down" RB or AVL trees. To see the difference consider a classical tree structure and its functional variant:

Classical Functional
Node structure: node
  parent
  left
  right
  other data
node
 
  left
  right
  other data
Node reference: node itself node path from a root of a tree
Modification: either mutable or requires a completely new tree O(LnN) nodes are created

Here we observe that:

  1. one can implement efficient map (lookup time no worse than O(LnN)) with no modification support, using ordered array;
  2. one can implement efficient map with support of modification, using immutable binary tree;
  3. one can implement all these algorithms purely in xslt and xquery (provided that inline functions are supported);
  4. any such imlementation will lose against the same implementation written in C++, C#, java;
  5. the best implementation would probably start from sorted array and will switch to binary tree after some size threshold.

Here we provide a C# implementation of a functional AVL tree, which also supports element indexing:

Our intention was to show that the usual algorithms for associative containers apply in functional programming; thus a feature complete functional language must support associative containers to make development more conscious, and to free a developer from inventing basic things existing already for almost a half of century.

Wednesday, January 27, 2010 7:00:55 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | xslt
# Tuesday, January 19, 2010

Several years ago we have started a new project. We do not like neither hate any particular language, thus the decision what language to use was pragmatical: xslt 2.0 fitted perfectly.

At present it's a solid volume of xslt code. It exhibits all the virtues of any other good project in other language: clean design, modularity, documentation, sophisticationless (good code should not be clever).

Runtime profile of the project is that it deals with xml documents with sizes from a few dozens of bytes to several megabytes, and with xml schemas from simple ones like a tabular data, and to rich like xhtml and untyped. Pipeline of stylesheets processes gigabytes of data stored in the database and in files.

All the bragging above is needed here to introduce the context for the following couple of lines.

The diversity of load conditions and a big code base, exposed xslt engine of choice to a good proof test. The victim is Saxon. In the course of project we have found and reported many bugs. Some of them are nasty and important, and others are bearable. To Michael Kay's credit (he's owner of Saxon) all bugs are being fixed promtly (see the last one).

Such project helps us to understand a weak sides of xslt (it seems sometimes they, in WG, lack such experience, which should lead them through).

Well, it has happened so that we're helping to Saxon project. Unintentionally, however! :-)

P.S. About language preferences.

Nowdays we're polishing a COBOL generation. To this end we have familiarized ourselves with this language. That's the beatiful language. Its straightforwardness helps to see the evolution of computer languages and to understand what and why today's languages try to timidly hide.

Tuesday, January 19, 2010 7:56:04 PM UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, December 19, 2009

In spite of the fact that our last projects are being developed in Java, the .NET is definitly our favorite platform.

In a twitter I saw the phrase: "Java the language is a stagnant mess". It's said in favour of C#. It's true that C# significantly affects now even on Java (let's remember generics, jaxb, web services, etc.), but in my opinion, the C# won't be the leading language for worldwide enterprise applications in the nearest future.

One of causes is that the main platform for .NET still is Windows. The situation could be changed by Mono project, but I think there are yet not enough projects on platforms other than Windows.

My guess is confirmed by some of observations that I did as a software engineer of an IT company. Our company performs different software porting projects from legacy programming languages like COBOL, ADSO, Natural etc. into up to date languages like Java, C# etc. It worth to say that clients rarely select to migrate to .NET despite to our advices.

The main reason of such choice, according to most of our clients, is that they want to be platform independent and only Java gives them this choice.

It worth for Microsoft to think about cooperation with Mono in order to make .NET really platform indpendent, otherwise C# will always be a step behind Java despite apparent advantages of C# as a programming language.

Saturday, December 19, 2009 5:08:23 PM UTC  #    Comments [1] -
Thinking aloud
# Tuesday, September 22, 2009

Suppose you have a library, which is out in the production and is used by many clients. At the same time the library evolves: API is extended, bugs are being fixed, code becomes faster and cleaner, bla bla bla...

At some point you're fixing some important bug that's been hiding for a long time in the bowels of your library. You're happy that you've spotted it before clients got into troubles. You're notifying all the clients that there is an important fix, and that they need to update the library.

What do you think you hear in return?

Well, we're not perfect, there are bugs in our software. We and our clients realize this. Nothing will eliminate bugs to creep into a code from time to time.

That's a train of thought of a particular client:

We agree that there is a bug and that it has to be fixed. We, however, want to touch the library in a minimal way, as who knows what other new bugs had they introduced, so let's ask them to fix this particular bug in our version of the library.

That's fair from the client's perspective. They don't want better code, they just want that particular bug fixed!

For us, however, this means branching some old version of the library, fixing bug and supporting this branch for the particular client. It's fair to expect similar position from each client, thus should we create and support library branches per client, and branch a main branch for a new client only?

For us (Arthur and Vladimir) it looks as enormous waste of resources. We (our company) should either hire more and more scaled people or experience gradual slowdown of support and development.

Our answer could be obvious if not position of top managers who value client relations so much that they easily promise whatever client wishes. No arguments that latest version is better tested, more conforming to specifications, more reliable, faster and so on are accepted. The main argument against our position is that the client's applications run in the production, and no new potential bugs are acceptable.

Here is our dilemma: we neither can convince the client (more precisely our managers) that we're right, nor are convinced with their arguments...

Tuesday, September 22, 2009 8:16:11 AM UTC  #    Comments [3] -
Thinking aloud
# Sunday, January 18, 2004

From time to time I like to tell, to express (or even to brag about) what keeps our minds busy. Sometimes it's even useful, and you unexpectedly discover, you are on the edge of community interests, other time you are outsider.

So, what are we doing now? Well, already several years we are working in Multiconn. One of goals of our company is to help to expose mainframe functionality. Many different solutions of this task are available, and even Multiconn provides several different approaches. Now go right to our business - web service as delegate of mainframe.

To be honest, I think mainframes are legacy monsters, dinosaurs. I realize, it's only a bad perspective, but the first experience is often very strong. For us as developers mainframe is a source of data in different formats (sure formats are legacy also): COBOL records, terminal messages, and so on. Our idea is to allow to mainframe to rest in peace and to gracefully consume its data. Combining consumed data in xml form and operation bindings that describe mainframe's application flow, we arrive to a view of mainframe application as a web service.

To achieve this goal we have worked up extension to xml schema that allows mapping schema elements to a data. It should be pointed, it's perfectly legal to extend xml schema. One way to extend it, is to use elements in custom namespace in appInfo element.

The following was to create importer of schema with our extensions. It's somehow similar to xsd.exe tool. Our tool generates classes with annotations XmlXxxAttribute for xml serialization and LayoutXxxAttribute (this is our custom attributes) for instance serializing and deserializing to and from a mainframe data.

The next stage was to create data serializer. This is a counterpart to XmlSerializer. It inspects class meta-data and creates plan for serialization and deserialization.

On the next stage we have worked up wsdl schema bindings for our technology.

After that we have created tool to import wsdl (similar to wsdl.exe) that generates web service which passes input messages to communication layer that serializes and forwards data to mainframe, accepts and deserializes response and returns it to the web service, which in its turn returns result to a client.

The next was communication layer that interacts with mainframe. This layer consists of abstract (general) sublayer and specializations which support different patterns of communications.

As result we have web service implemented in .NET. This web service represents some mainframe's application.

There are plans to generate similar web services in the Java.

Sunday, January 18, 2004 9:40:59 AM UTC  #    Comments [0] -
Thinking aloud
Archive
<February 2012>
SunMonTueWedThuFriSat
2930311234
567891011
12131415161718
19202122232425
26272829123
45678910
Statistics
Total Posts: 241
This Year: 5
This Month: 1
This Week: 0
Comments: 180
Locations of visitors to this page
Disclaimer
The opinions expressed herein are our own personal opinions and do not represent our employer's view in anyway.

© 2012, Nesterovsky bros
All Content © 2012, Nesterovsky bros
DasBlog theme 'Business' created by Christoph De Baene (delarou)