# Friday, 26 January 2024

Introduction

We migrate code from mainframes (MF) to Azure. The tool we use produces plain, functionally equivalent C# code. But it turns out that's not enough!

So, what's the problem?

The converted code is very slow, especially for batch processing: where the MF completes a job in, say, 30 minutes, the converted code finishes in 8 hours.

At this point someone usually appears and whispers in your ear:

Look, those old technologies are proven by time. It's worth sticking to good old COBOL, or better yet to Assembler, if you want to do the real thing.

We're curious though: why is there a difference?

It turns out the issue lies in the difference in network topology between the MF and Azure solutions. On the MF, all programs, the database and the file storage virtually sit in a single box, so network latency is negligible.

It's quite common to see chatty SQL programs on the MF that issue a lot of small SQL queries.

In Azure, programs, database and file storage are different services, most certainly sitting in different physical boxes. You should be thankful if they are co-located in a single datacenter. So network latency immediately becomes a factor. Even if it adds just 1 millisecond per SQL roundtrip, it adds up in loops and becomes a showstopper.

There is no easy workaround on the hardware level.

People advise writing programs differently: "Tune applications and databases for performance in Azure SQL Database".

That's good advice for new development but discouraging for a migration done by a tool.

So, what is the way forward?

Well, there is one. While accepting Azure's weaknesses, we can exploit its strengths.

Parallel refactoring

Before continuing, let's consider code that demonstrates the problem:

  public void CreateReport(StringWriter writer)
  {
    var index = 0;

    foreach(var transaction in dataService.
      GetTransactions().
      OrderBy(item => (item.At, item.SourceAccountId)))
    {
      var sourceAccount = dataService.GetAccount(transaction.SourceAccountId);
      var targetAccount = transaction.TargetAccountId != null ?
        dataService.GetAccount(transaction.TargetAccountId) : null;

      ++index;

      if (index % 100 == 0)
      { 
        Console.WriteLine(index);
      }

      writer.WriteLine($"{index},{transaction.Id},{
        transaction.At},{transaction.Type},{transaction.Amount},{
        transaction.SourceAccountId},{sourceAccount?.Name},{
        transaction.TargetAccountId},{targetAccount?.Name}");
    }
  }

This loop queries transactions and, for each transaction, issues two more small queries to get the source and target accounts. The results are printed into a report.

If we assume a query latency of just 1 millisecond and run such code for 100K transactions, we easily come to 200+ seconds of execution (100K iterations × 2 account queries × 1 ms ≈ 200 seconds).

Reality turns out to be much worse. The program spends most of its life waiting for database results, and an iteration doesn't advance until all the work of the previous iteration is complete.

We can do better even without rewriting all the code! Let's articulate our goals:

  1. To make the code fast.
  2. To keep the code recognizable.

The idea is to form two processing pipelines:

  • (a) one that processes data in parallel, out of order;
  • (b) another that processes data serially, in the original order.

Each pipeline may post sub-tasks to the other, so (a) runs its tasks in parallel and unordered, while (b) runs its tasks as if everything were running serially.

So, the parallel plan would be like this:

  1. Queue a parallel sub-task in (a) for each transaction.
  2. Each parallel sub-task in (a) reads the source and target accounts and queues a serial sub-task in (b), passing the transaction and accounts.
  3. The serial sub-task in (b) increments the index and writes the report record.
  4. Wait for all tasks to complete.

To reduce the burden of managing task pipelines we use Dataflow (Task Parallel Library) and encapsulate everything in a small wrapper; a rough sketch of such a wrapper appears after the explanation below.

Consider the refactored code:

  public void CreateReport(StringWriter writer)
  {
    using var parallel = new Parallel(options.Value.Parallelism); //     ⬅️ 1
    var index = 0;

    parallel.ForEachAsync( //     ⬅️ 2
      dataService.
        GetTransactions().
        OrderBy(item => (item.At, item.SourceAccountId)),
      transaction => //     ⬅️ 3
      {
        var sourceAccount = dataService.GetAccount(transaction.SourceAccountId);
        var targetAccount = transaction.TargetAccountId != null ?
          dataService.GetAccount(transaction.TargetAccountId) : null;

        parallel.PostSync(  //     ⬅️ 4
          (transaction, sourceAccount, targetAccount),  //     ⬅️ 5
          data =>
          {
            var (transaction, sourceAccount, targetAccount) = data; //     ⬅️ 6

            ++index;

            if (index % 100 == 0)
            {
              Console.WriteLine(index);
            }

            writer.WriteLine($"{index},{transaction.Id},{
              transaction.At},{transaction.Type},{transaction.Amount},{
              transaction.SourceAccountId},{sourceAccount?.Name},{
              transaction.TargetAccountId},{targetAccount?.Name}");
          });
      });
  }

Consider the ⬅️ points:

  1. We create the Parallel utility class, passing the requested degree of parallelism.
  2. We iterate transactions using parallel.ForEachAsync(), which queues a parallel sub-task for each transaction and then waits until all tasks are complete.
  3. Each parallel sub-task receives a transaction. It may be called from a different thread.
  4. Having received the required accounts, we queue a sub-task for synchronous execution using parallel.PostSync(), and
  5. pass it the data collected in the parallel sub-task: the transaction and accounts.
  6. We deconstruct the passed data into variables and then proceed with the serial logic.
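For illustration, here is a rough sketch of what such a wrapper could look like on top of TPL Dataflow. The class and member names (ParallelPipelines, ForEach, PostSerial) are our illustrative assumptions and differ from the actual Parallel class in the project; in particular, the real class also preserves the original iteration order of the serial sub-tasks, while this sketch only guarantees that they run one at a time, in posting order.

using System;
using System.Collections.Generic;
using System.Threading.Tasks.Dataflow;

// Two cooperating pipelines: (a) parallel and unordered, (b) serial.
public class ParallelPipelines
{
  private readonly ActionBlock<Action> parallelBlock;
  private readonly ActionBlock<Action> serialBlock;

  public ParallelPipelines(int parallelism)
  {
    // (a) runs up to `parallelism` actions at once, in no particular order.
    parallelBlock = new ActionBlock<Action>(
      action => action(),
      new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = parallelism });

    // (b) runs actions one at a time, in the order they were posted.
    serialBlock = new ActionBlock<Action>(action => action());
  }

  // Queue a parallel sub-task per item, then wait for both pipelines to drain.
  public void ForEach<T>(IEnumerable<T> items, Action<T> body)
  {
    foreach(var item in items)
    {
      var current = item;

      parallelBlock.Post(() => body(current));
    }

    parallelBlock.Complete();
    parallelBlock.Completion.Wait();
    serialBlock.Complete();
    serialBlock.Completion.Wait();
  }

  // Queue a sub-task on the serial pipeline from within a parallel sub-task.
  public void PostSerial<T>(T data, Action<T> body) =>
    serialBlock.Post(() => body(data));
}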

What we achieve with this refactoring:

  1. The top-level query that brings the transactions is executed and iterated serially.
  2. But each iteration body runs in parallel. By default we allow up to 100 parallel executions. Those parallel sub-tasks do not wait on each other, so their wait times do not add up.
  3. Sync sub-tasks are queued and executed in the order of their serial appearance, so the index increments and report records are subject neither to race conditions nor to reordering of the output records.

We think that such refactored code is still recognizable.

As for performance, this is what the log shows:

Serial test
100
...
Execution time: 00:01:33.8152540

Parallel test
100
...
Execution time: 00:00:05.8705468

Reference

Please take a look at the project to understand the implementation details, in particular the Parallel class that implements the API to post parallel and serial tasks, run loops, and more.

Please continue reading on GitHub.

Friday, 26 January 2024 13:27:16 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Monday, 01 January 2024

As we wrote some time ago, we started using AI and ML APIs in our applications, but this time we've tried to use Bing chat as a tool.

The initial task was to take sequences of integers and return a list of pairs of integers, where the first number is a range's start and the second is the count of values in the range. For example, for the sequence 1, 2, 3, 5, 8, 9 we'd like to get the following results: (1, 3), (5, 1), (8, 2).

We both set out to solve the task. My brother started to play with SSMS, using his deep knowledge of SQL and analytic functions, whereas I started to play with Bing chat. Looking ahead, I can state that we both solved the task with almost the same SQL statement in almost the same time.

Here are my prompts and Bing's answers:

Prompt:
Act as database analyst and software engineer. Given the table (T) with a numeric field (N), which values are orderd ranges of integers. In a range each next value is greater than previous one on 1. When value differs from the previous one more than on 1 then in belongs to another range. Generate select statement that returns rows with two fields. The first field (R) is a number from which a range starts and the second field (C) is count elements of that range:
Answer:
Sure, I can help with that. Here's a SQL query that should do the trick. This query uses a self-join to identify the start of each range, and then groups by the start of the range to count the number of elements.

```sql
SELECT start_of_range AS R, COUNT(*) AS C
FROM
(
  SELECT N, N - ROW_NUMBER() OVER (ORDER BY N) AS start_of_range
  FROM T
) subquery
GROUP BY start_of_range
ORDER BY start_of_range;
```

In this query, `start_of_range` is calculated as the difference between the current value of `N` and its row number (when ordered by `N`). This difference will be the same for all `N` in the same range. Then we group by `start_of_range` to get the count of elements in each range. The outer query selects the start of each range (`R`) and the count of elements in that range (`C`).
Me, scratching my head: hmmm...
Question:
This is a bit incorrect SQL select. For example for values 1, 2, 3, 5, 8, 9 I'd like to get the following results: 1, 3 5, 1 8, 2 What a SQL select shall I use to get such results?
Answer:
I apologize for the confusion earlier. You're right, the previous query doesn't handle the scenario you've described correctly. Here's a corrected SQL query that should give you the desired output. This query uses a window function to compare each row with the previous one and identify the start of a new range. Then it groups by the start of the range to count the elements.

```sql
WITH ranges AS
(
  SELECT
    N,
    CASE
      WHEN N - LAG(N, 1, N) OVER (ORDER BY N) > 1 THEN 1
      ELSE 0
    END AS is_start_of_range
  FROM T
),
range_groups AS
(
  SELECT N, SUM(is_start_of_range) OVER (ORDER BY N) AS range_group
  FROM ranges
)
SELECT MIN(N) AS R, COUNT(*) AS C
FROM range_groups
GROUP BY range_group
ORDER BY R;
```

In this query, `is_start_of_range` is a flag that indicates the start of a new range (when the difference between the current and previous `N` is more than 1). `range_group` is a running total of `is_start_of_range` that serves as a unique identifier for each range. The outer query then groups by `range_group` to get the start (`R`) and count (`C`) of each range. I hope this helps! Let me know if you have any other questions.

That answer satisfied me more, and I checked it with the following example:

with T as
(
  select 10 N union all select 11 N union all select 30 N union all
  select 31 N union all select 32 N union all select 50 N union all
  select 80 N union all select 81 N union all select 82 N union all
  select 83 N union all select 90 N
),
ranges AS
(
  SELECT
    N,
    CASE
      WHEN N - LAG(N, 1, N) OVER (ORDER BY N) > 1 THEN 1
      ELSE 0
    END AS is_start_of_range
  FROM T
),
range_groups AS
(
  SELECT N, SUM(is_start_of_range) OVER (ORDER BY N) AS range_group
  FROM ranges
)
SELECT MIN(N) AS R, COUNT(*) AS C
FROM range_groups
GROUP BY range_group
ORDER BY R;

The result was:

R    C
---- ----
10   2
30   3
50   1
80   4
90   1

Conclusion:
Nowadays, using AI as a tool, you may improve your performance to expert level and, maybe, learn something new...

P.S. Don't fool yourself that AI can replace a developer (see the first answer of the chat), but together they can be much stronger than separately.

P.P.S. Our colleague gave us another interesting solution. She used an additional field to avoid analytic functions, and she reached the same result:

with T as
(
  select 10 ID, 1 N union all select 11 ID, 1 N union all select 30 ID, 1 N union all
  select 31 ID, 1 N union all select 32 ID, 1 N union all select 50 ID, 1 N union all
  select 80 ID, 1 N union all select 81 ID, 1 N union all select 82 ID, 1 N union all
  select 83 ID, 1 N union all select 90 ID, 1 N
),
Groups AS
(
  SELECT ID, N, ROW_NUMBER() OVER (ORDER BY ID) - ID AS GroupNumber
  FROM T
)
SELECT MIN(ID) AS R, SUM(N) AS C
FROM Groups
GROUP BY GroupNumber
ORDER BY R;
Monday, 01 January 2024 14:02:01 UTC  #    Comments [0] -
AI | SQL Server puzzle | Thinking aloud | Tips and tricks
# Thursday, 22 June 2023

Many years ago we implemented an Akinator-like engine purely within SQL Server.

Today we use exactly the same technique to implement a vector database.

Please see our GitHub repo: vector-database.

Thursday, 22 June 2023 22:01:45 UTC  #    Comments [0] -
Announce | SQL Server puzzle | Thinking aloud
# Thursday, 23 March 2023

For the last few days we have been playing with the OpenAI API and, out of pure interest, asked it for a few slogans for our team. As input we fed it the info from our "About us" page. One of the first slogans we got caught our eye:

Excellence Through Experience: Nesterovsky Bros.

ChatGPT is not a bad copywriter at all...

Thursday, 23 March 2023 11:30:03 UTC  #    Comments [0] -
Thinking aloud
# Saturday, 16 April 2022

XSLT is often thought of as a tool that takes input XML and runs a transformation to get HTML or some other XML as output. Our use case is more complex and is closer to batch data mining of big data. Our transformation pipelines often take an hour or more to run, even with SSD disks and with the CPU cores fully loaded with work.

So, we're looking for performance opportunities, and xml vs json might be a promising one.

Here are our hypotheses:

  • json is lighter than xml to serialize and deserialize;
  • json stored as map(*), array(*) and other items() is lighter than node() at runtime; in particular, a subtree copy is zero cost in json;
  • templates with match patterns can be efficiently implemented with maps;
  • there is an incremental way forward from the use of xml to the use of json.

If it pays off, we might switch from xml to json all over, even though it is a development effort.

But to proceed we need to run an experiment to measure the processing speed of xml vs json in xslt.

Now our task is to find an isolated small representative sample to prove or reject our hypotheses.

It's better to start off with some existing transformation and change it from xml to json.

The question is whether there is such a candidate.

Saturday, 16 April 2022 19:03:04 UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, 15 January 2022
A couple of days ago, while integrating with someone's C# library, we had to debug it, as something went wrong. The code is big and obscure, but for integration purposes it's rather simple: you just create and call a class, that's all. Yet something just did not work. We had to prove that it was not our fault, as the other side was uncooperative and would not run a common debug session to resolve the problem.

To simplify the matter as much as possible here is the case:

var input = ...
var x = new X();
var output = x.Execute(input);

You pass correct input and get correct output. Simple, right? But it did not work! So we delved into the foreign code, and this is what we saw:

class X: Y
{
  public Output Execute(Input input)
  {
    return Perform(input);
  }

  protected override Output Run(Input input)
  { 
    ...

     return output;
  }
}

class Y: Z
{
  ...
}

class Z
{
  protected Output Perform(Input input)
  {
    return Run(input);
  }
        
  protected virtual Output Run(Input input)
  {
    return null;
  }
}

Do you see? The flow is still simple, right? We call X.Execute(), it calls Z.Perform(), which in turn calls the overridden X.Run() that returns the result.

But to our puzzlement we got null on output, as if Z.Run() was called!

We stepped through the code in the debugger and confirmed that Z.Perform() calls Z.Run(), even though "this" instance is of type X.

How can it be? It's nonsense! Yet no overridden method was ever called.

No matter how much scrutiny we applied to the sources of X and Z, it just did not work.

We verified that the signature of X.Run() matches the signature of Z.Run(), so it overrides the method.

Then what do we see here?

And then enlightenment came! Yes, X.Run() overrides a method, but which method?

We looked closely at class Y, and bingo, we saw the following:

class Y: Z
{
  ...

  protected virtual Output Run(Input input)
  {
    return null;
  }
      
  ...
}

So, X.Run() overrides Y.Run() and not Z.Run()!

Per .NET, Y.Run() and Z.Run() are two independent virtual methods, where Y.Run() in addition hides Z.Run().

The IDE even issued a warning that it's better to declare Y.Run() as:

protected new virtual Output Run(Input input)
{
  return null;
}

So, someone's code was plainly wrong: Y.Run() had to use override rather than virtual.

We won, right?

Well, it's hard to call it a win.

We spent an hour looking at someone's ugly code just to prove we're still sane.

So, what is the conclusion of this story?

We think here it is:

  • be cautious while looking at someone else's code;
  • look at IDE warnings, don't disregard them, and try to resolve all of them in your code base;
  • try to write simple code.
Saturday, 15 January 2022 13:55:08 UTC  #    Comments [0] -
.NET | Thinking aloud
# Wednesday, 20 October 2021

Lately we have been spending a great deal of time with Azure's CosmosDB.

This is how it's defined:

"It is schema-agnostic, horizontally scalable, and generally classified as a NoSQL database."

This definition, rather vague in itself, is clarified as:

"The SQL API lets clients create, update and delete containers and items. Items can be queried with a read-only, JSON-friendly SQL dialect."

To be honest, this SQL API is what made us favor CosmosDB.

So, we started development with CosmosDB as the data storage.

The next development ingredient, which we learned the hard way, is to refrain from clever techniques.

The lesson we learned is simple: after you finish a project, provided it's not a toy, you give it to the people who will be supporting it. You should think about those future developers before you insert some cleverness into your code.

With this common sense in mind, we selected EF Core as the library that would serve as an interface between C# and the database.

Initially all went well, until we needed to store a list of strings as a document property and found it's not possible.

Why? - was a naive question.

The answer puzzled us a lot: because string is not an "Entity" (whatever that means), and EF is a framework of entities.

You can argue with this reasoning as long as you like; it just does not work. It is especially bad if you need to store a class that you do not directly control, e.g. structures returned from other services.

The next pothole with EF appeared when we tried to run an innocent query that joins data from a document: e.g. a document contains items, and you want to query some items from some documents.

Guess what?

Right, EF Core does not support it.

Why?

Because!

Later we found that many other constructs and functions that you easily use in the SQL dialect of CosmosDB are not possible or not supported in EF Core.

We were very upset with those crutches and came to the conclusion that EF Core harms more than it helps when you work with CosmosDB.

We went on and looked at how you work directly with the CosmosDB client, and found that it has all the features ready (see the sketch after this list):

  • you may give it SQL and bind parameters;
  • you may convert result items to objects;
  • you may create, delete, update and query data.
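For illustration, here is a minimal sketch of direct access through the Microsoft.Azure.Cosmos SDK. The database, container, entity and property names are illustrative assumptions, not part of our actual project.

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Illustrative document type; property names match the JSON stored in the container.
public class Order
{
  public string id { get; set; }
  public string customerId { get; set; }
  public decimal total { get; set; }
}

public class OrderRepository
{
  private readonly Container container;

  // "sales" and "orders" are assumed database and container names.
  public OrderRepository(CosmosClient client) =>
    container = client.GetContainer("sales", "orders");

  // Give it SQL, bind parameters, and convert result items to objects.
  public async Task<List<Order>> GetOrdersAsync(string customerId)
  {
    var query = new QueryDefinition(
      "SELECT * FROM c WHERE c.customerId = @customerId").
      WithParameter("@customerId", customerId);

    var result = new List<Order>();
    var iterator = container.GetItemQueryIterator<Order>(query);

    while(iterator.HasMoreResults)
    {
      result.AddRange(await iterator.ReadNextAsync());
    }

    return result;
  }

  // Create, update and delete are equally direct.
  public Task CreateAsync(Order order) =>
    container.CreateItemAsync(order, new PartitionKey(order.customerId));
}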

So, do we need EF Core?

We answered, no.

This does not mean we reject the value of EF Core, but here our conclusion was that this API layer just complicated things instead of making them simpler. It might be that EF Core for CosmosDB is not mature enough at this time.

Wednesday, 20 October 2021 16:11:39 UTC  #    Comments [0] -
Azure | Thinking aloud
# Tuesday, 02 February 2021

Recently we found that the BinaryFormatter.Serialize and BinaryFormatter.Deserialize methods are marked as obsolete in .NET 5.0 and are declared dangerous:

The BinaryFormatter type is dangerous and is not recommended for data processing. Applications should stop using BinaryFormatter as soon as possible, even if they believe the data they're processing to be trustworthy. BinaryFormatter is insecure and can't be made secure.

See BinaryFormatter security guide for more details.

That guide, along with its links, expands on the problems BinaryFormatter poses. The schema of the dangerous use cases we have seen so far is like this:

  • two different sides communicate with each other;
  • one side supplies input in BinaryFormatter's format;
  • the other side reads the input using BinaryFormatter and instantiates classes.

A threat arises when the two sides cannot trust each other or cannot establish a trusted communication channel. In these cases malicious input can be supplied to the side reading the data, which might lead to unexpected code execution, denial of service, data exposure and other bad consequences.

Arguing like this, today's maintainers of .NET concluded that it's better to tear BinaryFormatter and similar APIs out of the framework.

Note that they don't point to BinaryFormatter itself, or to the Reflection API that it uses, as the core of the problem. They blame the communication.

Spelling out clearly what the concerns are could help everyone better understand how to address them. In the area of communication security there are a lot of ready solutions, like:

  • use signatures to prevent tampering with the data;
  • use encryption to prevent spying on the data;
  • use access rights to prevent even access to the data;
  • use secure communication channels.

We can surely state that without applying these measures, no other serialization format is reliable either and is subject to the same vulnerabilities. The first measure, signing, is sketched below.
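As an illustration of the signature measure, here is a minimal sketch that signs a serialized payload with an HMAC so the reading side can detect tampering. It uses System.Text.Json only to keep the sketch self-contained; key management is out of scope, and all names here are illustrative assumptions.

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text.Json;

public static class SignedPayload
{
  // Serialize the value and prepend an HMAC-SHA256 signature computed with a shared key.
  public static byte[] Pack<T>(T value, byte[] key)
  {
    var data = JsonSerializer.SerializeToUtf8Bytes(value);

    using var hmac = new HMACSHA256(key);

    // Layout: 32-byte signature followed by the payload.
    return hmac.ComputeHash(data).Concat(data).ToArray();
  }

  // Verify the signature before deserializing; reject tampered payloads.
  public static T Unpack<T>(byte[] packed, byte[] key)
  {
    var signature = packed.AsSpan(0, 32).ToArray();
    var data = packed.AsSpan(32).ToArray();

    using var hmac = new HMACSHA256(key);

    if (!CryptographicOperations.FixedTimeEquals(signature, hmac.ComputeHash(data)))
    {
      throw new InvalidOperationException("The payload has been tampered with.");
    }

    return JsonSerializer.Deserialize<T>(data)!;
  }
}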

 

After all, it looks like an attempt to throw out the baby with the bathwater. The good news is that, thanks to the now modular structure of the .NET runtime, we're still able to access the binary serialization library, which is (and will be) available in NuGet repositories. So it's a futile effort to erase this useful API.

Tuesday, 02 February 2021 12:39:37 UTC  #    Comments [0] -
.NET | Thinking aloud
# Friday, 01 January 2021

Earlier we wrote that recently we've gotten a few tasks related to Machine Learning. The prerequisite to such tasks is to collect and prepare the input data. Usually the required data is scattered across public sites; some of them are in plain text format (or close to it), but others are accessible as the output of public applications. To obtain the required data from such sites you have to navigate through pages, which often requires keeping state between navigations.

In order to implement this task you need some kind of crawler/scraper of websites. Fortunately, there are a lot of frameworks, libraries and tools in C# (and in other languages too) that allow you to do this (visit this or this site to see the most popular of them), for example:

  • ScrapySharp
  • ABot
  • HtmlAgilityPack
  • DotnetSpider

There are pros and cons to using these libraries. The most crucial con is the lack of support for rich UIs based on heavy client-side scripts and client-side state. Not all such libraries implement full browser emulation, and some of them do not support JavaScript execution at all. So they are suited to gathering information from simple web pages, but no library makes it easy to navigate to some page of a web application that keeps rich client-side state. Even the best of them, like ScrapySharp, require heavy programming to achieve the result.

Then, suddenly, we recalled that for several years already we have been using Selenium and web drivers to automate web tests for AngularJS/Angular projects. After a short discussion we came to the conclusion that there is no big difference between testing a web application and collecting data, since one of the testing stages is collecting actual results (data) from the tested page, and usually our tests consist of chains of actions performed on consecutively visited pages.

Thus we came to the idea of using the WebDriver API implemented by the Selenium project. There are implementations of this API in different languages, including C#.

Using WebDriver we can easily implement cumbersome navigation of a complex web application and collect the required data. Moreover, it allows running WebDriver in screenless (headless) mode. Some of its features allow creating snapshots of the virtual screen and storing the HTML source that results from JavaScript execution. These features are very useful during run-time troubleshooting. To implement navigation of a complex web application we need only a bit more knowledge than a usual user of that application: we need to identify the pages' elements somehow, for example by CSS selectors or by the id of HTML elements (as we do for tests). All the rest, like cookies, view state (if any), values of hidden fields, and some JavaScript events, is transparent in this case.
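For illustration, here is a minimal sketch of such data collection with the Selenium WebDriver C# bindings in headless mode (assuming a recent Selenium 4 version). The URL and CSS selectors are illustrative assumptions; real navigation chains are, of course, longer.

using System;
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class Scraper
{
  public static List<string> CollectTitles()
  {
    var options = new ChromeOptions();

    options.AddArgument("--headless"); // run without a visible browser window

    using var driver = new ChromeDriver(options);

    driver.Navigate().GoToUrl("https://example.com/catalog");

    // Wait until the client-side script has rendered the list.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

    wait.Until(d => d.FindElements(By.CssSelector(".item-title")).Count > 0);

    // A snapshot of the virtual screen helps with run-time troubleshooting.
    driver.GetScreenshot().SaveAsFile("page.png");

    // Collect the data rendered by JavaScript.
    return driver.
      FindElements(By.CssSelector(".item-title")).
      Select(e => e.Text).
      ToList();
  }
}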

Although one may say that the approach with Selenium is rather heavyweight, it ought to be mentioned that it is rather scalable. You may either run several threads with a separate WebDriver instance in each thread, or run several processes simultaneously.

However, besides the pros there are cons to the solution with Selenium. They appear when you decide to publish it, e.g. to the Azure environment. Note that the approach with Selenium requires a browser on the server; there is also a problem with Azure itself, as it's Microsoft's platform and Selenium is a product of their main competitor, Google... So some issues aren't technical. The only possible solution is to use a PaaS approach instead of SaaS, but in this case you have to support everything by yourself...

The other problem is that if your application implements rather aggressive crawling, either the servers you gather data from or your own host might ban it. So be gentle, play nice, and implement delays between requests.

Also, take into account that when you implement any crawler, some problems may appear at the legal level, since not all web sites allow you to pull anything you want. Many sites have terms & conditions that define rules for the site's users (which your crawler should follow); otherwise legal action may be taken against the crawler (or its owners). There is a very interesting article that describes many pitfalls of implementing your own crawler.

To summarize everything we said earlier: the Selenium project can be used in many scenarios, and one of them is creating a powerful crawler.

Friday, 01 January 2021 14:34:37 UTC  #    Comments [0] -
ML.NET | Thinking aloud | Tips and tricks
# Tuesday, 29 December 2020

While doing Cool:GEN migrations to Java and C# we produce rather big Angular applications.

Everything is fine: the server runs REST APIs, and the client is an Angular application with a component per window, dialog or screen. The only problem is with the word big.

We observe that enterprises that used Cool:GEN to develop their workflow come to the migration stage with applications containing thousands of windows. In simple cases, after assessment, clients are able to split their monolithic workflow into a set of independent applications. But even then we are dealing with Angular applications containing hundreds to many thousands of components.

Now, let's look at the Angular world. Best practices advise you to (and actually almost force you to) use Ahead Of Time, Ivy compilation of all components and their templates.

A naive attempt to build such a monolithic Angular application will almost surely fail. Angular uses nodejs for the build, and the chances are close to 100% that nodejs will run out of memory during ng build.

You may fight it and throw a capable build machine at it, with 16 or better 32 GB of RAM, and instruct nodejs to use all of it.

Well, it's a rather poor and extensive way of dealing with scale problems, but it works.

The next hurdle you run into is time. We know it might take days just to build it.

You may ask why?

Well, Angular is doing its best to validate templates!

Unfortunately the only viable workaround is to switch this nice feature off for such a big project.

With such a setup you're able to build the Angular project in just 20-30 minutes!

Well, this is big progress if you compare complete failure to something that at least passes the build.

But what's next?

Sure, there are further problems:

  • the scripts, both development and production, are of nonsensical size: several dozen megabytes for production, and an even higher number for development;
  • ng serve eats even more memory and builds even longer, making a nightmare out of the development and support of such an application;
  • startup of such an application, if it starts at all, is very slow.

So, what can be done? How can we create a manageable Angular application containing that many components?

Angular advises lazy-loading feature modules.

That's reasonable advice!

We can create coarse-grained modules containing subsets of components, or fine-grained modules containing one component each.

So, does it help?

Yes, but it does not solve all problems:

  • ng build and ng serve are still very slow;
  • the build produces multiple small scripts that are loaded on demand, so at least the application works in the browser.

Yet another important problem is still there: we have multiple strictly separated server REST controllers, with components that represent them.

On the server side we have Java or C# REST controllers hosting business logic. They are separated into multiple projects, probably managed by separate teams, and probably kept in separate Git repositories (or whatever). On the client side we have one fat Angular project storing everything in a single source repository.

This is not going to work from a management perspective.

So the next step is to try to split the fat Angular project into multiple small projects. Let's say we shall have 100 small Angular libraries combined by a master project.

This is not going to work either, due to the nature of npm projects, as it would require terabytes of cloned node_modules folders, one per library, and many hours to build each of them.

It seems there is room for improvement in the npm area. There is no point in making dedicated copies of node_modules; it's much easier to have a local cache of artifacts.

So, what is the direction? How do we create a big Angular project after all?

What we have here is:

  • a big enterprise-level application;
  • it is modular, but the modules must work together to form the desired workflow;
  • different modules are supported by different teams (both server and client side);
  • client Angular components correspond to REST controllers on the server;
  • all pages use the same styles and the same set of UI controls.

Looking from this perspective all development can be seen as:

  • development and support of unified styles and UI components that must be reused throughout the application;
  • development of the server side and REST controllers that implement the business logic;
  • development of the templates of components (note that the components themselves do nothing except expose their templates).

Studying this design suggests an important and independent role for templates, just like in AngularJS!

In contrast, Angular templates are only a tool used by components. It's not obvious how to use thousands of templates without first building thousands of components; nor is it obvious how to dynamically host those templates (routes do not help here).

Though not obvious, it's still possible to do, but it requires using a somewhat lower-level API than tutorials suggest. The ingredients are:

  • use of Just In Time compilation (in contrast to Ahead Of Time), and use of View Engine (in contrast to Ivy);
  • use of ViewContainerRef to host components dynamically;
  • dynamic components and modules that you can create on demand using templates loaded, e.g., through HttpClient.

To keep things short, we shall show an example of dynamic components in the next article.

Here we shall emphasize that such a design allows us to create a small Angular application that builds in under 20 seconds, with component templates served along with the REST controllers and stored in the same Git.

So, if you have, say, a server subproject exposing a REST controller in the form api/area/MyComponent, then its template may be exposed as the resource page/area/MyComponent. Templates are loaded and compiled on demand at runtime, thus making the application light. At the same time, templates may be cached in the browser cache, thus reducing the number of roundtrips to the server.

Tuesday, 29 December 2020 18:15:25 UTC  #    Comments [0] -
Angular | AngularJS | Thinking aloud
# Wednesday, 05 August 2020

Recently a colleague turned to us and asked for help with a complex query.

It turned out that the complex part was understanding what he wanted to achieve.

After listening to him, we formulated the task in our own words and confirmed that this is what he wants.

So, here is the task in our formulation:

  • Assume you have events.
  • Each event acts upon one or more accounts.
  • Find all events that act on the same set of accounts.
  • Note that we deal with multiple millions of events and accounts.

Data is defined like this:

create table dbo.Event
(
  EventID bigint not null,
  AccountID varchar(18) not null,
  primary key(EventID, AccountID)
);

The requested query turned out to be very simple, yet not as simple as one would think, in order to account for the big amount of data:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    count(*) Items,
    checksum_agg(checksum(AccountID)) Hash
  from
    D
  group by
    EventID
)
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items = S2.Items and
    S1.Hash = S2.Hash and
    not exists
    (
      select AccountID from D where EventID = S1.EventID
      except
      select AccountID from D where EventID = S2.EventID
    );

The idea is to:

  1. calculate a hash derived from list of accounts for each group;
  2. join groups with the same hash;
  3. verify that matched groups fit perfectly.

An even simpler solution that does not use hashes is not scalable, as its performance is no better than O(N^2), where N is the number of events. It takes unacceptable time with N ~1e4, to say nothing of N ~1e7.

 

At this point our colleague was already satisfied, as he got the result in a couple of minutes for a task that he could not even formalize in SQL.

But we felt it could be even better.

We looked at statistics:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    count(*) Items
  from
    D
  group by
    EventID
)
select
  Items, count(*) EventCount
from
  S
group by
  Items
order by
  EventCount desc;

and saw that most of the events, about 90%, deal with a single account, and all the others with two or more (some of them act upon a big number of accounts).

The nature of the dataset gave us a hint for a more verbose but faster query:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    min(AccountID) AccountID,
    count(*) Items,
    checksum_agg(checksum(AccountID)) Hash
  from
    D
  group by
    EventID
)
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items = 1 and
    S2.Items = 1 and
    S1.AccountID = S2.AccountID
union all
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items > 1 and
    S2.Items > 1 and
    S1.Items = S2.Items and
    S1.Hash = S2.Hash and
    not exists
    (
      select AccountID from D where EventID = S1.EventID
      except
      select AccountID from D where EventID = S2.EventID
    );

This query produced results in twenty seconds instead of a couple of minutes for a dataset with ~1e7 rows.

Wednesday, 05 August 2020 07:44:07 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks
# Sunday, 24 May 2020

While working on an algorithm to trace biconnected components for the Graph API in XSLT, we realized that we had implemented it unconventionally.

The pseudocode in Wikipedia is:

GetArticulationPoints(i, d)
    visited[i] := true
    depth[i] := d
    low[i] := d
    childCount := 0
    isArticulation := false

    for each ni in adj[i] do
        if not visited[ni] then
            parent[ni] := i
            GetArticulationPoints(ni, d + 1)
            childCount := childCount + 1
            if low[ni] ≥ depth[i] then
                isArticulation := true
            low[i] := Min (low[i], low[ni])
        else if ni ≠ parent[i] then
            low[i] := Min (low[i], depth[ni])
    if (parent[i] ≠ null and isArticulation) or (parent[i] = null and childCount > 1) then
        Output i as articulation point

That algorithm is based on the fact that a connected graph can be represented as a tree of biconnected components. Vertices of such a tree are called articulation points. The implementation deals with the depth of each vertex, and with a lowpoint parameter that is also related to vertex depth during Depth-First Search.

Out of interest, we approached the problem from a different perspective. A vertex is an articulation point if it has neighbors that cannot be combined into a path not containing this vertex. Like the classical algorithm, we use Depth-First Search to navigate the graph, but in contrast we collect cycles that pass through each vertex. If, during the backward pass of the Depth-First Search, we find no cycle from a "child" to an "ancestor", then the vertex is necessarily an articulation point.

Here is pseudocode:

GetArticulationPoints(v, p) -> result
    index = index + 1
    visited[v] = index 
    result = index
    articulation = p = null ? -1 : 0

    for each n in neighbors of v except p do
        if visited[n] = 0 then
            nresult = GetArticulationPoints(n, v)
            result = min(result, nresult)

            if nresult >= visited[v] then
                articulation = articulation + 1
        else
            result = min(result, visited[n])

    if articulation > 0 then
        Output v as articulation point
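
For readers who prefer code to pseudocode, here is a minimal C# sketch of the algorithm above, assuming the graph is given as adjacency lists; the actual implementation in the project is in XSLT.

using System;
using System.Collections.Generic;

public static class ArticulationPoints
{
  // adj[v] is the list of neighbors of vertex v.
  public static List<int> Find(List<int>[] adj)
  {
    var visited = new int[adj.Length];
    var result = new List<int>();
    var index = 0;

    int Visit(int v, int p)
    {
      visited[v] = ++index;

      var low = visited[v];
      var articulation = p < 0 ? -1 : 0;

      foreach(var n in adj[v])
      {
        if (n == p)
        {
          continue;
        }

        if (visited[n] == 0)
        {
          var nlow = Visit(n, v);

          low = Math.Min(low, nlow);

          // No cycle from the subtree of n back above v: v separates them.
          if (nlow >= visited[v])
          {
            ++articulation;
          }
        }
        else
        {
          low = Math.Min(low, visited[n]);
        }
      }

      if (articulation > 0)
      {
        result.Add(v);
      }

      return low;
    }

    for(var v = 0; v < adj.Length; ++v)
    {
      if (visited[v] == 0)
      {
        Visit(v, -1);
      }
    }

    return result;
  }
}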

The algorithms' complexity is the same.

What is interesting is that we see no obvious way to transform one algorithm into the other, except by starting from graph theory.

More is on Wiki.

Sunday, 24 May 2020 12:15:02 UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, 19 May 2020

Michael Key's "A Proposal for XSLT 4.0" has spinned our interest in what could be added or changed in XSLT. This way we decided to implement Graph API purely in xslt. Our goal was to prove that:

  • it's possible to provide an efficient implementation of different graph algorithms in XSLT;
  • it's possible to build the Graph API in such a way that an engine could provide native implementations of graph algorithms;
  • through experiments, it's possible to find what could be added to XSLT as a language.

At present we may confirm that the first two goals are reachable; and the experiments have shown that XSLT could provide more help to make programs better, e.g. we have seen that the language could simplify coding loops.

Graph algorithms are often expressed with while loops, e.g. "Dijkstra's algorithm" has:

12      while Q is not empty:
13          u ← vertex in Q with min dist[u]  

The body is executed while the condition is satisfied, but the condition is affected by the body itself.

In XSLT 3.0 we did this with simple recursion:

<xsl:template name="f:while" as="item()*">
  <xsl:param name="condition" as="function(item()*) as xs:boolean"/>
  <xsl:param name="action" as="function(item()*) as item()*"/>
  <xsl:param name="next" as="function(item()*, item()*) as item()*"/>
  <xsl:param name="state" as="item()*"/>

  <xsl:if test="$condition($state)">
    <xsl:variable name="items" as="item()*" select="$action($state)"/>

    <xsl:sequence select="$items"/>

    <xsl:call-template name="f:while">
      <xsl:with-param name="condition" select="$condition"/>
      <xsl:with-param name="action" select="$action"/>
      <xsl:with-param name="next" select="$next"/>
      <xsl:with-param name="state" select="$next($state, $items)"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>

But here is the point: it could be done in a more comprehensible way, e.g. by letting xsl:iterate without @select loop until xsl:break is reached.

<xsl:iterate>
  <xsl:param name="name" as="..." value="..."/>
  
  <xsl:if test="...">
    <xsl:break/>
  </xsl:if>

  ...
</xsl:iterate>

So, what we propose is to make xsl:iterate/@select optional, and to change the processor's behavior when the attribute is missing from a compilation error to valid behavior. This should not impact any existing valid XSLT 3.0 program.

Tuesday, 19 May 2020 07:00:25 UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, 12 May 2020

Recently we read the article "A Proposal for XSLT 4.0" and thought it worthwhile to suggest one more idea. We wrote a message to Michael Kay, the author of this proposal. Here it is:

A&V

Historically xslt, xquery and xpath have dealt with trees. Nowadays it has become much more common to process graphs. Many tasks can be formulated in terms of graphs, and in particular any task processing trees is also a graph task.

I suggest taking a deeper look in this direction.

As an inspiration I may suggest to look at "P1709R2: Graph Library" - the C++ proposal.


Michael Kay

I have for many years found it frustrating that XML is confined to hierarchic relationships (things like IDREF and XLink are clumsy workarounds); also the fact that the arbitrary division of data into "documents" plays such a decisive role: documents should only exist in the serialized representation of the model, not in the model itself.

I started my career working with the Codasyl-defined network data model. It's a fine and very flexible data model; its downfall was the (DOM-like) procedural navigation language. So I've often wondered what one could do trying to re-invent the Codasyl model in a more modern idiom, coupling it with an XPath-like declarative access language extended to handle networks (graphs) rather than hierarchies.

I've no idea how close a reinventiion of Codasyl would be to some of the modern graph data models; it would be interesting to see. The other interesting aspect of this is whether you can make it work for schema-less data.

But I don't think that would be an incremental evolution of XSLT; I think it would be something completely new.


A&V

I was not so radical in my thoughts. :-)

Even the C++ API is not so radical, as they do not impose hard requirements on the internal graph representation but rather define a template API that will work both with third-party representations (they even mention Fortran) and with several built-in implementations that use standard vectors.

Their strong point is in the algorithms provided as part of the library and not in the graph's internal structure (I think the authors of that paper have not structured it in the best way). E.g. in the second part they list graph algorithms: Depth First Search (DFS); Breadth First Search (BFS); Topological Sort (TopoSort); Shortest Paths Algorithms; Dijkstra Algorithms; and so on.

If we try to map it to the xpath world, then a graph at the API level might be represented as a user function or as a map of user functions.

On the storage level a user may implement a graph using a sequence of maps, or a map of maps, or even using xdm elements.

So, my approach is evolutionary. In fact, I suggest a pure API that could even be implemented now.


Michael Kay

Yes, there's certainly scope for graph-oriented functions such as closure($origin, $function) and is-reachable($origin, $function) and find-path($origin, $destination, $function) where we use the existing data model, treating any item as a node in a graph, and representing the arcs using functions. There are a few complications, e.g. what's the identity comparison between arbitrary items, but it can probably be done.


A&V
> There are a few complications, e.g. what's the identity comparison between arbitrary items, but it can probably be done.

One approach to address this is through the definition of a graph API, e.g. to define a graph as a map (an interface analogy) of functions, with equality functions if required:

map
{
  vertices: function(),
  edges: function(),
  value: function(vertex),
  in-vertex: function(edge),
  out-vertex: function(edge),
  edges: function(vertex),
  is-in-vertex: function(edge, vertex),
  is-out-vertex: function(edge, vertex)
  ...
}

Not sure how far this will go but who knows.

Tuesday, 12 May 2020 06:08:51 UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, 15 April 2020

It's not a secret that the COVID-19 epidemic will change our world significantly. It impacts economies and public services, especially the social services of our countries. We saw this in our country, and now the same happens in the US. Probably the same thing happens in all the countries all over the world that suffer from COVID-19.

Usually, nowadays, such services are exposed online, but nobody expected such an extreme load on these services. They start to crumble under such load. Programs start to crash... and somebody has to fix them. It's just a temporary technical inconvenience when there is staff familiar with such programs and technologies, but what about the situation when the programs and technologies are obsolete? When the staff that may support them are about to retire due to age, when the knowledge is almost lost... It's very scary when such applications rule very important spheres of our life such as social services, finance, medicine, defence etc. Something similar happened in the US, so the US government asked IBM for help with their stuff written in COBOL.

Probably, in the short term, they'll close the gaps, but taking into account the fact that the epidemic won't dissolve in a month, there is a risk of remaining stuck in the same hole... There are two ways to solve this issue in the long term:

  1. to make COBOL a widely used programming language and to train enough programmers to use it. This is exactly what IBM is trying to do (see this article), but this is a road to nowhere, since COBOL is not popular in the software developer community.
  2. to migrate such applications to new technologies and a new platform (e.g. Java or C# on UNIX/Windows). In this case organizations obtain scalable applications and the ability to find human resources that may fix/modernize such applications step by step, in their spare time, without losing existing functionality. This is what our company, Advanced, may provide. And we are not alone: there are many such companies that may implement such a migration with a high level of quality.

And many professionals (even those that deal with COBOL on a day-to-day basis) think that only the 2nd way is viable. Let's see what happens... For more about the issue, see here.

Wednesday, 15 April 2020 06:00:51 UTC  #    Comments [0] -
Thinking aloud
# Saturday, 04 April 2020

People compare these two technologies, and it seems an established fact that Angular is the evolutionarily more advanced framework. We're not going to contradict this; on the contrary, we agree with it, but it's better for an opinion to be grounded on facts that one can evaluate and verify.

Fortunately we got a chance to make such a comparison.

We support conversions of Cool:GEN (a legacy CASE tool with roots in the 80s) to Java or C#. In its time Cool:GEN allowed one to greatly automate enterprise development, using mainframes as the server side and mainframe terminals or Win32 GUIs as clients.

The legacy of this tool is probably hundreds of business and database models, and millions of programs generated in COBOL on mainframes and in C or Java on Windows and Linux. All this runs to this day in many economic sectors.

Usually the client is some enterprise that invested a lot into the design, development and support of their business model using Cool:GEN, but now most such clients are trying not to lose this legacy and to convert it into something that goes in parallel with today's technologies.

As the original technology is sound, it is possible to map it to today's Java or C# on the server, REST or SOAP as a transport, and Angular, AngularJS or some other framework on the client. Such automatic conversion is the essence of our conversion efforts.

To understand the scope, consider a typical enterprise client that has 2-3 thousand windows backed by 20-30 thousand programs.

Now, consider that the conversion is done. As output, among other things, we produce a clean Java or C# web application with a REST and SOAP interface, and an Angular or AngularJS web client that encapsulates those 2-3 thousand windows.

Each window definition is rather small, 5-10 KB in HTML form, but the whole mass of windows takes 10-30 MB, which is not small any more.

For AngularJS we generate just those HTML templates, but for Angular we need to generate a separate component for each window, which includes a TypeScript class, a template and a style.

While the amount of generated resources for AngularJS stays within those 10-30 MB, the generated Angular code takes at least 5-10 MB more.

The next step is build.

AngularJS builds a distribution that includes all the used libraries and a set of templates, and it takes something like a minute of CPU time. The produced output is about 300 KB of minified script and those 10-30 MB of templates (multiple files of 5-10 KB each).

Angular (here we talk about version 9) builds a distribution that includes all the used libraries and a set of compiled components that are to be loaded lazily on demand. With the out-of-the-box Angular builder that performs tree shaking, the build takes days. With tree shaking off, it takes 40 minutes. This is the first notable difference. The produced output for ES2015 (the latest JavaScript) is about 1 MB, plus 15-100 KB per compiled component. This is the second notable difference, one that already impacts the end user rather than the developer.

The third difference is in the end user experience. Though we have built equivalent Angular and AngularJS frontends, we observe that the load time of Angular is higher. This cannot be ascribed only to bigger file sizes; it seems internal initialization also takes more time for Angular.

So, our experience in this particular test shows that Angular has more room to improve, in particular: compile time, bundle size, runtime speed, and simplicity of dynamic loading (we have strong cases where template compilation is not the best approach).

Saturday, 04 April 2020 12:37:15 UTC  #    Comments [0] -
AngularJS | Java | Thinking aloud | Tips and tricks
# Friday, 26 July 2019

We were asked to help with a search service at one enterprise. We were told that their SharePoint portal did not serve their needs. The main complaints were about the quality of the search results.

They had decided to implement an external index of the SharePoint content using Elastic, and to expose a custom search API within the enterprise.

We questioned their conclusions: we asked why they thought Elastic would give much better results, and whether they had tried to figure out why SharePoint gave no desired results.

The answers did not convince us, though we joined the project.

What do you think? Elastic did not help at all, though they had hoped very much that its query language would help to rank results in a way that the matching documents would be found. After all, they thought it was a problem of result ranking.

Here we started our analysis. We took a specific document that must be found but was never returned by the search.

It turned out to be a well-known problem; at least, we had dealt with a closely related one in the past. There are two ingredients here:

  • the documents that have low chances of being found are PDFs;
  • we live in Israel, so most texts are in Hebrew, which means words are written from right to left, while some other texts go from left to right. See Bi-directional text.

Traditionally, PDF documents are produced in a way that only distantly resembles the logical structure of the original content. E.g., paragraphs of text are often represented as unrelated runs of text lines, or as sets of text runs representing single words, or even independent characters. Needless to say, an additional complication comes from the fact that Hebrew text is often represented visually (stored from left to right, as if "hello" were stored as "olleh" and just printed from right to left). Another common feature of PDF is custom fonts with non-canonical mappings, or images with glyphs of letters.

You can implement these tricks in other document formats, but for some reason PDF is the only format we have seen that regularly and intensively uses these techniques.

At this point we realized that it is not the fault of the search engine that it cannot find the document, but a feature of the PDF, which exposes its text to a crawler in a form that cannot be used for search. In fact, PDF cannot search such documents by itself: when you try to find some text that you clearly see in a document opened in a PDF viewer, you often find nothing.

A question: what should you do when no PDF text extractor can give you the correct text, but the text is there when you look at the document in a PDF viewer?

We decided it was time to go in the direction of image recognition. Thankfully, nowadays it's just a matter of available processing resources.

Our goal was:

  1. Render an image of each PDF page. This task is immediately solved with Apache PDFBox (a Java PDF library) - it's time to say this is a Java project.
  2. Run Optical Character Recognition (OCR) over the images and get the extracted texts. This is perfectly done by tesseract-ocr/tesseract, and thanks to its Java wrapper bytedeco/javacpp-presets we can immediately call this C++ API from Java.

The only small nuisance of tesseract is that it does not expose table recognition info, but we can easily overcome that (we solved this task in the past), as along with each text run tesseract exposes its position.

What are the results of running such a program?

  1. Full success! It works with high recognition quality. Indeed, there is no physical noise that impacts the quality.
  2. Slow speed: up to several seconds of recognition per page.
  3. A scalable solution: the slow speed can be compensated by almost unlimited theoretical scalability.

So, what is the lesson we have taken from this experience?

Well, you should question yourself, and test and verify ideas on the ground, before building any theories that will lead you in a completely wrong direction. After all, people started to realize there was no need to blame SharePoint, to throw it away, and to spend a great deal of time and money just to prove that the problem was in a different place.

A sample source code can be found at App.java

Friday, 26 July 2019 16:38:11 UTC  #    Comments [0] -
C++ | Java | Thinking aloud | Tips and tricks
# Sunday, 24 March 2019

This story started half a year ago when Michael Kay, the author of the Saxon XSLT processor, was dealing with performance in a multithreaded environment. See Bug #3958.

The problem is like this.

Given XSLT:

<xsl:stylesheet exclude-result-prefixes="#all" 
  version="3.0" 
  xmlns:saxon="http://saxon.sf.net/"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" />

  <xsl:template name="main">
    <xsl:for-each saxon:threads="4" select="1 to 10">
      <xsl:choose>
        <xsl:when test=". eq 1">
          <!-- Will take 10 seconds -->
          <xsl:sequence select="
            json-doc('https://httpbin.org/delay/10')?url"/>
        </xsl:when>
        <xsl:when test=". eq 5">
          <!-- Will take 9 seconds -->
          <xsl:sequence select="
            json-doc('https://httpbin.org/delay/9')?url"/>
        </xsl:when>
        <xsl:when test=". eq 10">
          <!-- Will take 8 seconds -->
          <xsl:sequence select="
            json-doc('https://httpbin.org/delay/8')?url"/>
        </xsl:when>
      </xsl:choose>
    </xsl:for-each>
    <xsl:text>
</xsl:text>
  </xsl:template>
</xsl:stylesheet>

Implement an engine to achieve the best performance of the parallel for-each.

A naive implementation that distributes iterations across threads will run into unfair load on the threads, so some load balancing is required. That was the case in Saxon EE.

Michael Kay was trying to find the most elegant way to implement it and wrote this comment:

I can't help feeling that the answer to this must lie in using the Streams machinery, and Spliterators in particular. I've spent another hour or so reading all about Spliterators, and I have to confess I really don't understand the paradigm. If someone can enlighten me, please go ahead...

We decided to take the challenge and model the expected behavior using Streams. Here is our attempt:

import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.function.Consumer;
import java.util.function.Function;

public class Streams
{
  public static class Item<T>
  {
    public Item(int index, T data)
    {
      this.index = index;
      this.data = data;
    }
    
    int index;
    T data;
  }

  public static void main(String[] args)
  {
    run(
      "Sequential",
      input(),
      Streams::action,
      Streams::output,
      true);
    
    run(
      "Parallel ordered", 
      input().parallel(),
      Streams::action,
      Streams::output,
      true);
    
    run(
      "Parallel unordered", 
      input().parallel(),
      Streams::action,
      Streams::output,
      false);    
  }
  
  private static void run(
    String description,
    Stream<Item<String>> input,
    Function<Item<String>, String[]> action,
    Consumer<String[]> output,
    boolean ordered)
  {
    System.out.println(description);
    
    long start = System.currentTimeMillis();
   
    if (ordered)
    {
      input.map(action).forEachOrdered(output);
    }
    else
    {
      input.map(action).forEach(output);
    }
    
    long end = System.currentTimeMillis();
    
    System.out.println("Execution time: " + (end - start) + "ms.");
    System.out.println();
  }
  
  private static Stream<Item<String>> input()
  {
    return IntStream.range(0, 10).
      mapToObj(i -> new Item<String>(i + 1, "Data " + (i + 1)));
  }
  
  private static String[] action(Item<String> item)
  {
    switch(item.index)
    {
      case 1:
      {
        sleep(10);
        
        break;
      }
      case 5:
      {
        sleep(9);
        
        break;
      }
      case 10:
      {
        sleep(8);
        
        break;
      }
      default:
      {
        sleep(1);
        
        break;
      }
    }
    
    String[] result = { "data:", item.data, "index:", item.index + "" };
    
    return result;
  }
  
  private synchronized static void output(String[] value)
  {
    boolean first = true;
    
    for(String item: value)
    {
      if (first)
      {
        first = false;
      }
      else
      {
        System.out.print(' ');
      }
    
      System.out.print(item);
    }

    System.out.println();
  }
  
  private static void sleep(int seconds)
  {
    try
    {
      Thread.sleep(seconds * 1000);
    }
    catch(InterruptedException e)
    {
      throw new IllegalStateException(e);
    }
  }
}

We model three cases:

"Sequential"
slowest, single-threaded execution, with output:
data: Data 1 index: 1
data: Data 2 index: 2
data: Data 3 index: 3
data: Data 4 index: 4
data: Data 5 index: 5
data: Data 6 index: 6
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 9 index: 9
data: Data 10 index: 10
Execution time: 34009ms.
"Parallel ordered"
fast, multithreaded execution preserving the order, with output:
data: Data 1 index: 1
data: Data 2 index: 2
data: Data 3 index: 3
data: Data 4 index: 4
data: Data 5 index: 5
data: Data 6 index: 6
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 9 index: 9
data: Data 10 index: 10
Execution time: 10019ms.
"Parallel unordered"
fastest, multithreaded execution not preserving the order, with output:
data: Data 6 index: 6
data: Data 2 index: 2
data: Data 4 index: 4
data: Data 3 index: 3
data: Data 9 index: 9
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 5 index: 5
data: Data 10 index: 10
data: Data 1 index: 1
Execution time: 10001ms.

What we can add in conclusion is that the XSLT engine could try to automatically decide which approach to use, as many SQL engines do, rather than forcing the developer to go into low-level engine details.

Sunday, 24 March 2019 07:52:02 UTC  #    Comments [0] -
Java | Thinking aloud | xslt
# Saturday, 03 November 2018

Recently we observed how we solved the same task in different versions of XPath: 2.0, 3.0, and 3.1.

Consider: you have a sequence $items, and you want to call some function for each item of the sequence and return the combined result.

In XPath 2.0 this was solved like this:

for $item in $items return
  f:func($item)

In XPath 3.0 this was solved like this:

$items!f:func(.)

And now, with XPath 3.1, which defines an arrow operator =>, we attempted to write something as simple as:

$items=>f:func()

That definitely does not work, as it is the same as f:func($items).

The next attempt was:

$items!=>f:func()

That does not even compile.

So, finally, a working expression using => looks like this:

$items!(.=>f:func())

This looks like a step back compared to the XPath 3.0 variant.

More than that, the XPath grammar of the arrow operator forbids the use of predicates, axis steps, or mapping operators, so these won't compile:

$items!(.=>f:func()[1])
$items!(.=>f:func()!something)

Our conclusion is that the arrow operator is a rather confusing addition to XPath.

Saturday, 03 November 2018 20:59:28 UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, 02 October 2018

XSLT 3.0 defines a feature called streamability: a technique to write XSLT code that is able to handle arbitrarily sized inputs.

This contrasts with conventional XSLT code (and XSLT engines), where inputs are completely loaded into memory.

To make code streamable, a developer should declare it as such, and the code should pass streamability analysis.

The goal is to define a subset of XSLT/XPath operations that allows processing the input in one pass.

In simple cases it's indeed easy to verify that code is streamable, but the more complex your code is, the less trivial it is to establish that it is streamable.

On the forums we have seen a lot of discussions where experts were trying to figure out whether a particular XSLT is streamable. At times it's a remarkably nontrivial task!

This, in our opinion, clearly shows that the feature is a largely failed attempt to inscribe an optimization technique into the XSLT spec.

The place for such an optimization is in the implementation space, not in the spec. The engine should attempt such an optimization and fall back to the traditional implementation when it does not apply.

The latest such example is: Getting SXST0060 "No streamable path found in expression" when trying to push a map with grounded nodes to a template of a streamable mode, where both the XSLT author and the engine developers are not sure whether the code is streamable in the first place.

By the way, besides streamability there is another optimization technique that works in probably all SQL engines: when data does not fit into memory, the engine may spill it to disk, thus trading memory pressure for disk access. So, why didn't such a technique find its way into the XSLT or SQL specs?

Tuesday, 02 October 2018 12:50:22 UTC  #    Comments [0] -
Thinking aloud | xslt
# Monday, 06 August 2018

For more than 25 years a discussion about exceptions has continued in the C++ community. In our opinion this can only be compared with the math community and their open problems, like Hilbert's 23 problems dated 1900.

In essence the C++ exception discussion is about the efficiency of exceptions vs status codes. This problem is not so acute in other languages (like Java or C#) because those languages postulate different goals.

C++ designers have introduced a zero-overhead principle for any language feature, which is:

  1. If you don’t use some feature you don’t pay for it.
  2. If you do use it you cannot (reasonably) write it more efficiently by hand.

Exceptions, compared to status codes, do not withstand this demand. This has led to the fragmentation of the C++ community, where many big projects and code styles ban exceptions partially or completely.

Make no mistake: all this time people were trying to make exceptions better, and have found techniques to make them space and runtime efficient to some extent, but still, plain old status codes outperform them both in speed (especially in the predictability of the exception handling time) and in code size.

We guess the solution has finally been found after a quarter of a century of discussion!

WG paper: Zero-overhead deterministic exceptions: Throwing values by Herb Sutter. This "paper aims to extend C++’s exception model to let functions declare that they throw a statically specified type by value. This lets the exception handling implementation be exactly as efficient and deterministic as a local return by value, with zero dynamic or non-local overheads."

In other words, the author suggests to:

  • extend the exception model (in fact, implement an additional one);
  • make exceptions as fast and as predictable as status codes (which virtually means designating a status code as a new exception);


Here are the author's arguments:

  1. A status code is usually just a number, and handling an error amounts to performing some if or switch statements.
  2. Handling errors with status codes is predictable in time and visible in the code, though it burdens the logic very much.
  3. Exceptions give a clear separation between control flow and error handling.
  4. Exceptions are mostly invisible in the code, so it's not always clear how much they add to code size and how they impact performance.
  5. Statistics show that exceptions add 15 to 30% to the size of the code, even if no exceptions are used (a violation of the zero-overhead principle).
  6. Exceptions require Run Time Type Information about types, and have no predictable memory (stack or heap) footprint (a violation of the zero-overhead principle).


What the author suggests might seem too radical, but at present it's the only viable solution to reestablish the zero-overhead principle and to reunite the two C++ camps.

Here is the approach.

  1. Clarify what should be considered an exception.
    1. Contract violation.

      Are contract violations, like invalid argument values or invalid postconditions (broken invariants), exceptions or programmer's bugs?

      If the latter, then it's best to terminate, as you cannot correctly recover from a bug.

    2. Virtual Machine fault.

      What can a user program do about a stack overflow?

      The best, according to the author, is to terminate.

    3. OOM - Out Of Memory error.

      What is the best way to deal with OOM during dynamic allocation?

      At present there are two operators:

      • new - to allocate memory dynamically and throw bad_alloc on failure.
      • new(nothrow) - to allocate memory dynamically and return nullptr on failure.

      Herb Sutter suggests changing new's behavior to terminate on failure (it is very hard to properly handle bad_alloc anyway), while new(nothrow) will still allow building code that can deal with OOM.

    4. Partial success

      This should never be reported as an error, and status codes should be used to report the state.

    5. Error condition distinct from any type of success.

      This is where exceptions should be used.

    Statistics show that with such a separation more than 90% of what is currently an exception will not be an exception any more, so no hidden exception logic is required: the program either works or terminates.

  2. Refactor exceptions

    Redefine what an exception object is and how it is propagated.

    It should be a thin value type. At minimum it needs to contain an error code. The suggested size is up to a couple of pointers.

    The compiler should be able to cheaply allocate and copy it on the stack or even in the processor's registers.

    Such a simple exception type resolves the problems with the lifetime of the exception object, and makes exception handling as light as checking status codes.

    The exception should be propagated through the return channel, so it's like a new calling convention that yields either a function result or an error outcome.

It's not our intention to quote the whole paper here. If you're interested then please read it. Here we want to outline our concerns.

  1. Exception payload.

    The paper emphasizes that the exception type should be small.

    So, what to do with the exception payload, if any (at least an error message, if it's not static text)?

    If this is not resolved then developers will start to invent custom mechanisms like GetLastErrorMessage().

    And what to do with aggregate exceptions?

    We think this must be clearly addressed.

  2. Implementation shift.

    We can accept that most of the current exceptions will terminate.

    But now consider some container that serves requests, like a web container or a database.

    It may be built from multiple components and serve multiple requests concurrently. If one request terminates, we don't want the container to terminate.

    If the terminate handler is called then we cannot rely on the state of the application. At the least we can expect heap leaks and unreleased resources.

    So, we either want to be able to release the heap and other resources per request, or we want to go down with the whole process and let the OS deal with it.

    In the latter case we need to design such containers differently: as a set of cooperating processes; the OS should allow spinning up processes more easily.

  3. VMs with exceptions

    There are virtual machines that allow an exception to be thrown on almost any instruction (like the JVM or the CLI).

    In this case you cannot tell that the code would never throw an exception, as it can happen out of the blue (see the sketch after this list)!

    Even on x86 you can get a PAGE FAULT on memory access, which can be translated into an exception.

    So, it's still a question whether the terminate() solution is sound in every case, and whether the compiler can optimize out exception handling if it proves statically that no exception can be thrown.

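To illustrate the last point, here is a minimal Java sketch of our own (not from the paper): a method with no explicit throw that still fails out of the blue on the JVM.

public class OutOfTheBlue
{
  private static int depth(int n)
  {
    // No explicit throw here, yet the call itself may
    // raise StackOverflowError at some recursion depth.
    return n == 0 ? 0 : 1 + depth(n - 1);
  }

  public static void main(String[] args)
  {
    try
    {
      System.out.println(depth(10_000_000));
    }
    catch(StackOverflowError e)
    {
      System.out.println("The JVM threw where the source had no throw at all.");
    }
  }
}
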
Monday, 06 August 2018 12:00:10 UTC  #    Comments [0] -
C++ | Thinking aloud
# Saturday, 14 April 2018

We often deal with different SQL DBs, and in particular DB2, Oracle, and SQL Server, and this is what we have found lately.

Our client has reported a problem with a SQL insert into DB2:

  • the subject table has a small number of columns, but a large number of rows;
  • the insert should attempt to insert a row but tolerate duplicates.

The prototype table is like this:

create table Link(FromID int, ToID int, primary key(FromID, ToID));

DB2 SQL insert is like this:

insert into Link(FromID, ToID)
values(1, 2)
except
select FromID, ToID from Link;

The idea is to have an empty row set to insert if there is a duplicate.

SQL Server variant looks like this:

insert into Link(FromID, ToID)
select 1, 2
except
select FromID, ToID from Link;

The client reported ridiculously slow performance of this SQL, due to a table scan to calculate the results of the except operator.

Out of interest we performed the same experiment with SQL Server, and found that the execution plan is optimal, and an index seek is used to check for duplicates. See:

Execution Plan


The only reasonable way of dealing with such a DB2 peculiarity, other than trying to insert and handle the duplicate exception, was to qualify the select with a where clause:

insert into Link(FromID, ToID)
values(1, 2)
except
select FromID, ToID from Link where FromID = 1 and ToID = 2;

We think DB2 could do better.

Saturday, 14 April 2018 19:38:20 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud
# Tuesday, 22 August 2017

Today we wanted to write some code in Java that performs one or another action depending on a condition. At the same time, if the first action fails, we wanted to fall back to the other action.

We've written it like this:

switch(boolean_expression)
{
  case true:
  {
    try
    {
      // Some actions.
      break;
    }
    catch(Exception e)
    {
      // Fall back to false route. 
    }
  }
  case false:
  {
    // Other actions.
    break;
  }
}

The fun part is that it's not valid Java code.

Why?

The answer can be found in spec: 14.11. The switch Statement

The type of the Expression must be char, byte, short, int, Character, Byte, Short, Integer, String, or an enum type (§8.9), or a compile-time error occurs.

But why?

Who knows...

Sure, there are workarounds, even with switch (see the sketch below), but it's just not a justified restriction...

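For completeness, here is one possible workaround of our own (a sketch, not a recommendation) that keeps the switch by mapping the boolean to an int and preserving the fall-through on failure:

switch(boolean_expression ? 1 : 0)
{
  case 1:
  {
    try
    {
      // Some actions.
      break;
    }
    catch(Exception e)
    {
      // Fall back to the false route.
    }
  }
  // Falls through to case 0 when the try block fails.
  case 0:
  {
    // Other actions.
    break;
  }
}
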
Tuesday, 22 August 2017 15:18:29 UTC  #    Comments [0] -
Java | Thinking aloud
# Tuesday, 16 May 2017

We have found that Saxon HE 9.7.0-18 has finally exposed partial support for map and array item types. So, now you can encapsulate your data in a map rather than keeping a single sequence and treating its odd and even elements specially.

Basic example is:

<xsl:stylesheet version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="t"
  xmlns:map="http://www.w3.org/2005/xpath-functions/map"
  exclude-result-prefixes="xs t map">

  <xsl:template match="/">
    <xsl:variable name="map" as="map(xs:string, xs:string)" select="
      map 
      {
        'Su': 'Sunday',
        'Mo': 'Monday',
        'Tu': 'Tuesday',
        'We': 'Wednesday',
        'Th': 'Thursday',
        'Fr': 'Friday',
        'Sa': 'Saturday'
      }"/>
      
     <xsl:message select="map:keys($map)"/>
  </xsl:template>  

</xsl:stylesheet>

A list of map functions can be found at http://www.w3.org/2005/xpath-functions/map/, though not all of them are available, as Saxon HE still does not allow inline functions.

P.S. From the development perspective it's a great shortcoming that Saxon HE is so limited: basically XSLT 2.0 plus some selected parts of 3.0.

Tuesday, 16 May 2017 06:20:48 UTC  #    Comments [0] -
Thinking aloud | xslt
# Sunday, 26 March 2017

Lately we do not program in XSLT too often, but rather in Java, C#, SQL and JavaScript; still, from time to time we have tasks in XSLT.

People claim that those languages are too different and use this argument to explain why XSLT is only a niche language. We, on the other hand, often spot similarities between them.

So, what is it in other languages that is implemented as tunnel parameters in XSLT?

To get an answer we reiterated how they work in XSLT. So, you:

  • define a template with parameters marked as tunnel="yes";
  • use these parameters the same way as regular parameters;
  • pass template parameters down to other templates marking them as tunnel="yes";

The important difference between regular template parameters and tunnel parameters is that tunnel parameters are implicitly passed down the call chain of templates. This means that you:

  • define your API that is expected to receive some parameter;
  • pass these parameters somewhere high in the stack, or override them later in the stack chain;
  • do not bother to propagate them (you might not even know all of the tunnel parameters passed, so encapsulation is in action);

As a result we have a template with some parameters passed explicitly, while others receive values from somewhere else, usually not from the direct caller. One could say that these tunnel parameters are injected into the template call. This strongly resembles injection APIs in other languages, where you configure some parameters to be prepared for you by a container rather than by the direct caller.

Now that we have expressed this idea it seems obvious, but before we thought of it we did not realize that tunnel parameters in XSLT and Dependency Injection in other languages are the same thing.

Sunday, 26 March 2017 04:21:36 UTC  #    Comments [0] -
Thinking aloud | xslt
# Thursday, 06 October 2016

Our genuine love is C++. Unfortunately clients don't always share our preferences, so we are mostly occupied with C#, Java and JavaScript. Nevertheless, we're closely watching the evolution of C++. It has become more mature in the latest specs.

Recently, we wondered how we would deal with dependency injection in C++. What we found only strengthened our commitment to C++.

Parameter packs introduced in C++11 allow a trivial implementation of constructor injection, while std::type_index, std::type_info and std::any enable service containers.

In fact there are many DI implementations out there. The one we refer to here is Boost.DI. It's not standard, nor can we claim it's the best, but it's a good example of how this concept can be implemented.

So, consider their example as seen in Java with CDI, in C# with .NET Core injection, and in C++:

Java:

@Dependent
public class Renderer 
{
  @Inject @Device
  private int device;
};

@Dependent
public class View 
{
  @Inject @Title
  private String title;
  @Inject
  private Renderer renderer;
};

@Dependent
public class Model {};

@Dependent
public class Controller 
{
  @Inject
  private Model model;
  @Inject
  private View view;
};

@Dependent
public class User {};

@Dependent
public class App 
{
  @Inject
  private Controller controller;
  @Inject
  private User user;
};

...
  Provider<App> provider = ...

  App app = provider.get();

C#:

public class RendererOptions
{
  public int Device { get; set; }
}
    
public class ViewOptions
{
  public string Title { get; set; }
}
    
public class Renderer 
{
  public Renderer(IOptions<RendererOptions> options)
  {
    Device = options.Value.Device;
  }

  public int Device { get; set; }
}

public class View 
{
  public View(IOptions<ViewOptions> options, Renderer renderer)
  {
    Title = options.Value.Title;
    Renderer = renderer;
  }

  public string Title { get; set; }
  public Renderer Renderer { get; set; }
}

public class Model {}

public class Controller 
{
  public Controller(Model model, View view) 
  {
    Model = model;
    View = view;
  }

  public Model Model { get; set; }
  public View View { get; set; }
};

public class User {};

public class App 
{
  public App(Controller controller, User user) 
  {
    Controller = controller;
    User = user;
  }

  public Controller Controller { get; set; }
  public User User { get; set; }
};

...
  IServiceProvider serviceProvider = ...

  serviceProvider.GetService<App>();

C++:

#include <boost/di.hpp>

namespace di = boost::di;

struct renderer 
{
  int device;
};

class view 
{
public:
  view(std::string title, const renderer&) {}
};

class model {};

class controller 
{
public:
  controller(model&, view&) {}
};

class user {};

class app 
{
public:
  app(controller&, user&) {}
};

int main()
{
  /**
   * renderer renderer_;
   * view view_{"", renderer_};
   * model model_;
   * controller controller_{model_, view_};
   * user user_;
   * app app_{controller_, user_};
   */

  auto injector = di::make_injector();
  injector.create<app>();
}

What is different between these DI flavors?

Not too much from the perspective of the final task achieved.

In Java we used member injection, with qualifiers to inject scalars.

In C# we used constructor injection with the Options pattern to inject scalars.

In C++ we used constructor injection with constants injected directly.

All technologies have their own API to initialize the DI container but, again, while the API differs, the idea is the same.

So, the expressiveness of C++ matches that of Java and C#.

Deeper analysis shows that Java's CDI is more feature-rich than the DI of C# and C++ but, personally, we consider it an advantage of C# and C++ that they have such a light DI.

At the same time there is an important difference between C++ and Java or C#.

While both Java and C# are bound to use reflection (C# in theory could use on-the-fly code generation to avoid reflection), C++'s DI natively constructs and injects services.

What does it mean for the user?

Well, a lot! Both in Java and in C# you would not want to use DI in a performance-critical part of the code (e.g. in a tight loop), while it's OK in C++ due to the near-zero performance impact of DI. This may result in more modular and performant code in C++.

Thursday, 06 October 2016 11:27:42 UTC  #    Comments [0] -
.NET | C++ | Java | Thinking aloud
# Wednesday, 28 September 2016

While reading about ASP.NET Core Session, and analyzing the difference from the previous version of ASP.NET, we bumped into a problem...

At Managing Application State they note:

Session is non-locking, so if two requests both attempt to modify the contents of session, the last one will win. Further, Session is implemented as a coherent session, which means that all of the contents are stored together. This means that if two requests are modifying different parts of the session (different keys), they may still impact each other.

This is different from previous versions of ASP.NET, where the session was blocking, which meant that if you had multiple concurrent requests to the session, then all requests were synchronized. So, you could keep a consistent state.

In ASP.NET Core you have no built-in means to keep a consistent state of the session. Even the assurance that the session is coherent does not help in any way.

Your options are:

  • build your own synchronization to deal with this problem (e.g. around the database);
  • decree that your application cannot handle concurrent requests to the same session, so the client should not attempt it; otherwise the behaviour is undefined.
Wednesday, 28 September 2016 19:22:15 UTC  #    Comments [0] -
.NET | ASP.NET | Thinking aloud
# Monday, 29 February 2016

8 Ways to Become a Better Coder is a good article. Read it and apply it to yourself. Never mind what your occupation is: replace "coder" with your profession. It suits everybody who wants to be the best.

Monday, 29 February 2016 18:33:04 UTC  #    Comments [0] -
Thinking aloud
# Tuesday, 09 February 2016

The Visitor pattern is often used to separate an operation from the object graph it operates on. Here we assume that the reader is familiar with the subject.

The idea is like this:

  • The operation over the object graph is implemented as a type called Visitor.
  • Visitor defines methods for each type of object in the graph, which are called during traversal of the graph.
  • Traversal of the graph is implemented by a type called Traverser, or by the Visitor, or by each object type in the graph.

The implementation should collect, aggregate or perform other actions during the visit of objects in the graph, so that at the end of the visit the purpose of the operation is complete.

Such an implementation is push-like: you create an operation object and call a method that takes the object graph as input and returns the operation result as output.

In the past we often dealt with big graphs (usually virtual graphs backed by a database or a file system).

Also, having strong experience in XSLT, we see that the visitor pattern in OOP maps directly onto the xsl:template and xsl:apply-templates technique.

Another thought was that in XML processing there are two camps:

  • SAX (push-like) - those who process XML in callbacks, which is very similar to the visitor pattern; and
  • XML Reader (pull-like) - those who pull XML components from a source, and then iterate over and process them.

As with SAX vs XML Reader or, more generally, push vs pull processing models, there is no single best one. One or the other is preferable in particular circumstances. E.g. a pull-like component fits into a transformation pipeline where one pull component has another as its source; another example is when one needs to process two sources at once, which is nontrivial with a push-like model. On the other hand, push processing fits better into the Reduce part of the MapReduce pattern, where you need to accumulate results from a source.

So, our idea was to complement the classic push-like visitor pattern with an example of a pull-like implementation (a rough sketch follows below).

For the demonstration we have selected the Java language and the simplest boolean expression calculator.

Please follow GitHub nesterovsky-bros/VisitorPattern to see the detailed explanation.
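
Here is a rough pull-style sketch of our own (an illustration, not the code from the repository): the evaluator pulls expression nodes in post-order from an iterator and folds them over a value stack, instead of being called back by a traverser.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public class PullCalculator
{
  // A tiny boolean expression tree.
  interface Expr {}
  static class Const implements Expr { final boolean value; Const(boolean v) { value = v; } }
  static class Not implements Expr { final Expr operand; Not(Expr e) { operand = e; } }
  static class And implements Expr { final Expr left, right; And(Expr l, Expr r) { left = l; right = r; } }
  static class Or implements Expr { final Expr left, right; Or(Expr l, Expr r) { left = l; right = r; } }

  // Returns the i-th operand of a node, or null if there is none.
  static Expr operand(Expr e, int i)
  {
    return
      e instanceof Not ? (i == 0 ? ((Not)e).operand : null) :
      e instanceof And ? (i == 0 ? ((And)e).left : i == 1 ? ((And)e).right : null) :
      e instanceof Or ? (i == 0 ? ((Or)e).left : i == 1 ? ((Or)e).right : null) :
      null;
  }

  // Pull source: lazily yields nodes in post-order, one node per next() call.
  static Iterator<Expr> postOrder(Expr root)
  {
    final Deque<Object[]> stack = new ArrayDeque<>();

    stack.push(new Object[] { root, 0 });

    return new Iterator<Expr>()
    {
      public boolean hasNext() { return !stack.isEmpty(); }

      public Expr next()
      {
        while(true)
        {
          Object[] frame = stack.peek();
          Expr node = (Expr)frame[0];
          int stage = (Integer)frame[1];
          Expr child = operand(node, stage);

          if (child == null)
          {
            // All operands are done: emit the node itself.
            stack.pop();

            return node;
          }

          // Descend into the next operand.
          frame[1] = stage + 1;
          stack.push(new Object[] { child, 0 });
        }
      }
    };
  }

  // Pull consumer: evaluates by pulling nodes and folding a value stack.
  static boolean evaluate(Expr root)
  {
    Deque<Boolean> values = new ArrayDeque<>();

    for(Iterator<Expr> nodes = postOrder(root); nodes.hasNext();)
    {
      Expr node = nodes.next();

      if (node instanceof Const)
      {
        values.push(((Const)node).value);
      }
      else if (node instanceof Not)
      {
        values.push(!values.pop());
      }
      else
      {
        boolean right = values.pop();
        boolean left = values.pop();

        values.push(node instanceof And ? left && right : left || right);
      }
    }

    return values.pop();
  }

  public static void main(String[] args)
  {
    // (true and not false) or false
    Expr expr = new Or(
      new And(new Const(true), new Not(new Const(false))),
      new Const(false));

    System.out.println(evaluate(expr)); // prints true
  }
}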

Tuesday, 09 February 2016 12:37:10 UTC  #    Comments [0] -
Java | Thinking aloud | xslt
# Monday, 04 January 2016

The essence of the problem (see Error during transformation in Saxon 9.7, a thread on the forum):

  1. An XPath engine may arbitrarily reorder predicates whose expressions do not depend on the context position.
  2. While an XPath expression $N[@x castable as xs:date][xs:date(@x) gt xs:date("2000-01-01")] cannot raise an error if it's evaluated from left to right, an expression with reordered predicates $N[xs:date(@x) gt xs:date("2000-01-01")][@x castable as xs:date] may generate an error when @x is not an xs:date.

To avoid a potential problem one should rewrite the expression like this: $N[if (@x castable as xs:date) then xs:date(@x) gt xs:date("2000-01-01") else false()].

Please note that the following rewrite will not work: $N[(@x castable as xs:date) and (xs:date(@x) gt xs:date("2000-01-01"))], as the arguments of an and expression can be evaluated in any order, and an error that occurs during evaluation of either argument may be propagated.

With these facts in mind we faced the task of checking our code base and fixing possible problems.

A search brought up ~450 instances of XPath expressions that use two or more consecutive predicates. Careful analysis limited this to ~20 instances that should be rewritten. But then, all of a sudden, we decided to conduct an experiment: what if we split the XPath expression into two sub-expressions, can the error still resurface?

Consider:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:variable name="elements" as="element()+"><a/><b value="c"/></xsl:variable>

  <xsl:template match="/">
    <xsl:variable name="a" as="element()*" select="$elements[self::d or self::e]"/>
    <xsl:variable name="b" as="element()*" select="$a[xs:integer(@value) = 1]"/>

    <xsl:sequence select="$b"/>
  </xsl:template>

</xsl:stylesheet>

As we expected, Saxon 9.7 internally assembles the final XPath with two predicates and reorders them. As a result we get an error:

Error at char 20 in xsl:variable/@select on line 8 column 81 of Saxon9.7-filter_speculation.xslt:
  FORG0001: Cannot convert string "c" to an integer

This turn of events greatly complicates the code review we have to perform.

Michael Kay's answer to this example:

I think your argument that the reordering is inappropriate when the expression is written using variables is very powerful. I shall raise the question with my WG colleagues.

In fact we think that either reordering of predicates is inappropriate, or (weaker, to allow reordering) an error during the evaluation of a predicate expression should be treated as false(). This is what is done in XSLT patterns. Other solutions make XPath less intuitive.

In other words, we should use XPath (the language) to express ideas, and the engine should correctly and efficiently implement them. So, we should not be forced to rewrite an expression to please the implementation.

Monday, 04 January 2016 10:07:12 UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, 02 January 2016

On December 30 we opened a thread in the Saxon help forum that shows a stylesheet generating an error. This is the stylesheet:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:variable name="elements" as="element()+"><a/><b value="c"/></xsl:variable>

  <xsl:template match="/">
    <xsl:sequence select="$elements[self::d or self::e][xs:integer(@value) = 1]"/>
  </xsl:template>

</xsl:stylesheet>

We get an error:

Error at char 47 in xsl:sequence/@select on line 7 column 83 of Saxon9.7-filter_speculation.xslt:
  FORG0001: Cannot convert string "c" to an integer
Exception in thread "main" ; SystemID: .../Saxon9.7-filter_speculation.xslt; Line#: 7; Column#: 47
ValidationException: Cannot convert string "c" to an integer
  at ...

It's interesting that the error happens in Saxon 9.7 but not in earlier versions.

The answer we got was expected but disheartening:

The XPath specification (section 2.3.4, Errors and Optimization) explicitly allows the predicates of a filter expression to be reordered by an optimizer. See this example, which is very similar to yours:

The expression in the following example cannot raise a casting error if it is evaluated exactly as written (i.e., left to right). Since neither predicate depends on the context position, an implementation might choose to reorder the predicates to achieve better performance (for example, by taking advantage of an index). This reordering could cause the expression to raise an error.

$N[@x castable as xs:date][xs:date(@x) gt xs:date("2000-01-01")]

Following the spec, Michael Kay advises us to rewrite the XPath:

$elements[self::d or self::e][xs:integer(@value) = 1]

like this:

$elements[if (self::d or self::e) then xs:integer(@value) = 1 else false()]

Such subtleties make it hard to reason about and to teach XPath. We doubt many people will spot the difference immediately.

We think that if such an optimization was so important to the spec writers, then they should have changed the filter rules to treat failed predicates as false(). This would avoid any obscure differences between these two, otherwise equal, expressions. In fact something similar already exists with templates, where a failed evaluation of a pattern is treated as a non-match.

Saturday, 02 January 2016 21:32:16 UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, 16 December 2015

A colleague has approached us with a question about how the Akinator engine may work.

To our shame we had never heard about this amazing game before. To fill the gap we immediately started to play it, and identified it as a troubleshooting solver.

It took us a couple of minutes to come up with a brilliant solution: "We just need to google and find the engine on the internet". :-)

Unfortunately, this led nowhere, as Akinator itself is not open sourced, and no other good quality open source solutions are available.

After another hour we have got two more ideas:

  1. The task should fit into SQL;
  2. The task is a good candidate for a neural network.

In fact, the first might be required to teach the second, so we have decided to formalize the problem in terms of SQL, while still keeping a neural network in mind.

With this goal we have created a GitHub project. Please see the algorithm and its implementation at github.com/nesterovsky-bros/KB.

Wednesday, 16 December 2015 12:33:41 UTC  #    Comments [0] -
Announce | SQL Server puzzle | Thinking aloud
# Monday, 27 July 2015

Though ADO.NET and ORM frameworks like EntityFramework and Dapper support the async pattern, you should remember that database drivers (at least all we know about) do not support concurrent DB commands running against a single connection.

To see what we mean, consider a bug we have recently identified. Consider the code:

await Task.WhenAll(
  newImages.
    Select(
      async image =>
      {
        // Load data from url.
        image.Content = await HttpUtils.ReadData(image.Url);

        // Insert image into the database.
        image.ImageID = await context.InsertImage(image);
      }));

The code runs multiple tasks to read images, and to write them into a database.

The framework decides to run all these tasks in parallel. HttpUtils.ReadData() has no problem with parallel execution, while context.InsertImage() does not run well in parallel, and is subject to race conditions.

To work around the problem we had to use an async variant of a critical section. So the fixed code looks like this:

using(var semaphore = new SemaphoreSlim(1))
{
  await Task.WhenAll(
    newImages.
      Select(
        async image =>
        {
          // Load data from url.
          image.Content = await HttpUtils.ReadData(image.Url);

          await semaphore.WaitAsync();

          try
          {
            // Insert image into the database.
            image.ImageID = await context.InsertImage(image);
          }
          finally
          {
            semaphore.Release();
          }
        }));
}

So, in the async world we should still care about race conditions.

Monday, 27 July 2015 06:44:45 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Monday, 04 May 2015

Having strong experience in ASP.NET and JSF, we found angular's transclusion concept obscure and counterintuitive. It took a while for both of us to grasp the transclusion ideas described in the Developer Guide. We suspect that this is due to bad design: a bad design leads to bad wording.

The other consequence of the bad design is that transclusion is limited to one template per directive, which limits the use of the feature.

Consider:

  • A directive my-page that encapsulates a page with menu and content.
  • my-page uses templateUrl: my-page.html to render the page.
  • my-page.html defines two sites where menu and page content have to be embedded.
  • Two content fragments are passed to my-page to fill content sites.

Unfortunately, you cannot immediately implement this design in angularjs. On the other hand ASP.NET's Master Pages, and JSF's ui:composition readily solve this task.

Here is one of JSF's approaches:

  1. Define page template my-page.xhtml:
    <html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ui="http://java.sun.com/jsf/facelets"
      xmlns:h="http://java.sun.com/jsf/html">
      <h:body>
        <table>
          <tr>
            <td><ui:insert name="menu"/></td>
          </tr>
          <tr>
            <td><ui:insert name="content"/></td>
          </tr>
        </table>
      </h:body>
    </html>
  2. Use ui:composition tag to pass parts to the template:
    <html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ui="http://java.sun.com/jsf/facelets"
      xmlns:h="http://java.sun.com/jsf/html">
      <h:body>
        <ui:composition template="my-page.xhtml">
          <ui:define name="content">
            My Content
          </ui:define>
          <ui:define name="menu">
            <a href="#file">File</a>
            <a href="#edit">Edit</a>
            <a href="#view">View</a>
          </ui:define>
        </ui:composition>
      </h:body>
    </html>

We have decided to model angular directives after JSF, and have defined three simple directives: ui-template, ui-insert, ui-define (see angularjs-api/template/ui-lib.js).

To define a template one writes the following markup (see angularjs-api/template/my-page.html):

<table ui-template>
  <tr>
    <td ui-insert="menu"></td>
  </tr>
  <tr>
    <td ui-insert="content"></td>
  </tr>
</table>

and declares a directive (see angularjs-api/template/my-page.js):

var myPage =
{
  templateUrl: "my-page.html",
  transclude: true
};

angular.module("app").
  directive("myPage", function() { return myPage; });

and finally, to instantiate the directive one needs to write (see angularjs-api/template/sample.html):

<my-page>
  <div ui-define="content">
    My content
  </div>
  <div ui-define="menu">
    <a href="#file">File</a>
    <a href="#edit">Edit</a>
    <a href="#view">View</a>
  </div>
</my-page>

The working sample can be seen through rawgit: sample.html

The other sample that integrates with routing can be found at sample-routing.html

Monday, 04 May 2015 13:07:53 UTC  #    Comments [0] -
AngularJS | javascript | Thinking aloud
# Sunday, 30 November 2014

Farewell Entity Framework and hello Dapper!

For many years we were using Entity Framework. It's still very popular and is Microsoft's primary object-relational mapping library.

Clearly, the decision is subjective but here are our arguments.

We know and love SQL, and think that in its domain it occupies a strong position. What SQL leaves out of scope is a bridge between itself and other languages. That's where an ORM should help.

We strongly believe that no ORM library should try to hide SQL behind the object language itself. We believe in a separation of roles in development. Database design and the Data Access Layer should be separated from the client's logic. Thus we strive, if possible, to encapsulate data access in SQL functions and stored procedures.

Entity Framework, in contrast, tries to factor out SQL, giving the client a perspective of an object graph. Initially it looks promising, but in the end a developer should remember that any object query is mapped back to SQL. Without keeping this in mind, either the query won't compile, or performance will be poor.

E.g. this query will probably fail to build SQL, as Regex cannot be mapped to SQL:

var result = context.Content.
  Where(data => Regex.IsMatch(data.Content, pattern)).
  ToArray();

This query might be slow if no suitable SQL index is defined:

var result = context.Content.
  Where(data => data.Field == value).
  ToArray();

Thus EF's goal is not achieved completely: SQL power is limited, and the Data Access Layer is often fused into the client's logic.

We think that Entity Framework is an over-engineered library, which tries to be more than an ORM. Its generality often bumps into the limits of SQL support in EF: SQL dialects, types, operators, functions, and so on. One can observe that for years people have appealed for support of xml, hierarchyid, geometry/geography types, full text search, and so on. This state cannot be different, as EF will never be able, and does not aim, to support all SQL features.

EF has both a design-time and a runtime part. Each database vendor should implement an EF adapter for EF to play well with that database. This cooperation is not always smooth. E.g. see Database first create entity framework 6.1.1 model using system.data.sqlite 1.0.93.

At some point the cost of dealing with EF became too high for us, so we started to look into alternatives: from plain ADO.NET to lighter ORM libraries.

To our delight we immediately found Dapper - a simple object mapper for .NET. It provides simple extensions to the IDbConnection interface to deal with mapping of query parameters to object properties, and of query results to plain types. Here are some examples:

// Get Customer
var customer = connection.
  Query<Customer>("select * from Customers where CustomerId = @id", new { id = customerID }).
  Single();

// Insert a value
connection.Execute("insert into MyTable(A, B) values(@a, @b)", new { a = 2, b = 3 });

So, Dapper leaves you with plain SQL, which we consider an advantage.

Besides being minimalistic compared to EF, Dapper claims performance close to pure hand-written ADO.NET. Indeed, it builds dynamic methods to populate parameters and to create row instances, so reflection is used during the warm-up period only.

Sunday, 30 November 2014 12:47:46 UTC  #    Comments [0] -
.NET | Thinking aloud
# Sunday, 17 August 2014

Among the latest C++ proposals the most ambitious is N4021.

The goal of that proposal is "to define a 2D drawing API for the C++ programming language".

The motivation goes like this:

Today, computer graphics are pervasive in modern life, and are even replacing console-style I/O for basic user interaction on many platforms. For example, a simple cout << "Hello, world!" statement doesn’t do anything useful on many tablets and smartphones. We feel that C++ programmers should have a simple, standard way of displaying 2D graphics to users.

The authors compare several public and proprietary APIs and select the cairo graphics library as a base.

Reflecting on the starting point they write:

Taken as a whole, starting from cairo allows for the creation of a 2D C++ drawing library that is already known to be portable, implementable, and useful without the need to spend years drafting, implementing, and testing a library to make sure that it meets those criteria.
...
An alternative design would be to create a new API via a synthesis of existing 2D APIs. This has the benefit of being able to avoid any perceived design flaws that existing APIs suffer from. Unfortunately this would not have implementation and usage experience. Further, doing so would not provide any guarantee that design flaws would not creep in.

What follows is a discussion of the best way to transform that C library into a std-style C++ API.


Our thoughts on this proposal are threefold:

  1. This proposal seems a decade or two late.
  2. The C++ standard should be modular, supporting basic and optional features.
  3. We feel that programmers will not be satisfied with bare 2D graphics. It's not enough nowadays.


Indeed, appeals to create a standard C++ API for UI are as old as the C++ standardization process. It's clear why the committee has not produced such an API yet: it is a bureaucracy that can only approve an API. In fact it's the role of the community to invent and implement libraries that may make their way into the standard. Without consensus in the community no standard will reflect such an API.

On the other hand, the C++ spec at present is too fat. Probably not many people are satisfied with the pace of its evolution. Any big chunk of new API makes the progress even slower. The C++ spec should go through a refactoring and be split into core(s) and libraries, to allow individual progress of each part. This will simplify both the specification and the implementation. After that refactoring an API can be added or deprecated much more easily. In fact, implementations were always like this. It's the spec that tries to be a monolith.

As for a new 2D graphics API, it looks like an idea from the late 90s. We think that today's programmers (at least several samples :-) ) would wish to deal with an industry-standard UI API, and not to start from basic drawing. Looking around we observe that HTML 5 is such a de-facto standard. Take into account that it supports rich layout, SVG, canvas and user input; in addition it lends itself to GPU optimization. Even if you want to deal with simple graphics, you can build SVG markup or draw on the canvas.

So, what we would rather prefer to see in the C++ spec is an HTML binding API (both for DOM and JavaScript).

Just think of a standard C++ program that uses an HTML engine as its UI!

Sunday, 17 August 2014 08:56:08 UTC  #    Comments [0] -
C++ | Thinking aloud
# Monday, 28 July 2014

Looking at Guava Cache we think its API is more convenient than .NET's Cache API.

Just consider:

  • .NET has getters and setters of objects by string keys.
    You should provide a caching policy with each setter.

  • Guava cache operates with a typed storage of Key to Value.
    It provides a value factory and a caching policy in advance, at cache construction.

Guava's advantage is based on the idea that homogeneous storage assumes a uniform way of creating values and a uniform caching policy. Thus a great part of the logic is factored out into cache initialization (see the sketch below).

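For comparison, here is a minimal Guava sketch of our own (the key and value types are made up): the value factory and the caching policy are supplied once, at construction, and the caller just asks for values.

import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class GuavaCacheExample
{
  public static void main(String[] args) throws Exception
  {
    // The value factory and the caching policy are given up front.
    LoadingCache<String, Integer> lengths = CacheBuilder.newBuilder().
      maximumSize(1000).
      expireAfterWrite(3, TimeUnit.MINUTES).
      build(
        new CacheLoader<String, Integer>()
        {
          @Override
          public Integer load(String key)
          {
            // A stand-in for an expensive computation or lookup.
            return key.length();
          }
        });

    // The caller just asks for a value; the cache creates it on a miss.
    System.out.println(lengths.get("hello"));
    System.out.println(lengths.get("hello"));
  }
}
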
We have decided to create a simple adapter over MemoryCache to achieve the same goal. Here is the result of such an experiment:

public class Cache<K, V>
  where V: class
{
  /// <summary>
  /// A cache builder.
  /// </summary>
  public struct Builder
  {
    /// <summary>
    /// A memory cache. If not specified then MemoryCache.Default is used.
    /// </summary>
    public MemoryCache MemoryCache;

    /// <summary>
    /// An expiration value.
    /// Alternatively CachePolicyFunc can be used.
    /// </summary>
    public TimeSpan Expiration;

    /// <summary>
    /// Indicates whether to use sliding (true), or absolute (false)
    /// expiration.
    /// Alternatively CachePolicyFunc can be used.
    /// </summary>
    public bool Sliding;

    /// <summary>
    /// Optional function to get caching policy.
    /// Alternatively Expiration and Sliding property can be used.
    /// </summary>
    public Func<V, CacheItemPolicy> CachePolicyFunc;

    /// <summary>
    /// Optional value validator.
    /// </summary>
    public Func<V, bool> Validator;

    /// <summary>
    /// A value factory.
    /// Alternatively FactoryAsync can be used.
    /// </summary>
    public Func<K, V> Factory;

    /// <summary>
    /// Async value factory.
    /// Alternatively Factory can be used.
    /// </summary>
    public Func<K, Task<V>> FactoryAsync;

    /// <summary>
    /// A key to string converter.
    /// </summary>
    public Func<K, string> KeyFunc;

    /// <summary>
    /// Converts builder to a Cache<K, V> instance.
    /// </summary>
    /// <param name="builder">A builder to convert.</param>
    /// <returns>A Cache<K, V> instance.</returns>
    public static implicit operator Cache<K, V>(Builder builder)
    {
      return new Cache<K, V>(builder);
    }
  }

  /// <summary>
  /// Creates a cache from a cache builder.
  /// </summary>
  /// <param name="builder">A cache builder instance.</param>
  public Cache(Builder builder)
  {
    if ((builder.Factory == null) && (builder.FactoryAsync == null))
    {
      throw new ArgumentException("builder.Factory");
    }

    if (builder.MemoryCache == null)
    {
      builder.MemoryCache = MemoryCache.Default;
    }

    this.builder = builder;
  }

  /// <summary>
  /// Cached value by key.
  /// </summary>
  /// <param name="key">A key.</param>
  /// <returns>A cached value.</returns>
  public V this[K key]
  {
    get { return Get(key); }
    set { Set(key, value); }
  }

  /// <summary>
  /// Sets a value for a key.
  /// </summary>
  /// <param name="key">A key to set.</param>
  /// <param name="value">A value to set.</param>
  public void Set(K key, V value)
  {
    SetImpl(GetKey(key), IsValid(value) ? value : null);
  }

  /// <summary>
  /// Gets a value for a key.
  /// </summary>
  /// <param name="key">A key to get value for.</param>
  /// <returns>A value instance.</returns>
  public V Get(K key)
  {
    var keyValue = GetKey(key);
    var value = builder.MemoryCache.Get(keyValue) as V;

    if (!IsValid(value))
    {
      value = CreateValue(key);
      SetImpl(keyValue, value);
    }

    return value;
  }

  /// <summary>
  /// Gets a task to return an async value.
  /// </summary>
  /// <param name="key">A key.</param>
  /// <returns>A cached value.</returns>
  public async Task<V> GetAsync(K key)
  {
    var keyValue = GetKey(key);
    var value = builder.MemoryCache.Get(keyValue) as V;

    if (!IsValid(value))
    {
      value = await CreateValueAsync(key);
      SetImpl(keyValue, value);
    }

    return value;
  }

  /// <summary>
  /// Gets string key value for a key.
  /// </summary>
  /// <param name="key">A key.</param>
  /// <returns>A string key value.</returns>
  protected string GetKey(K key)
  {
    return builder.KeyFunc != null ? builder.KeyFunc(key) :
      key == null ? null : key.ToString();
  }

  /// <summary>
  /// Creates a value for a key.
  /// </summary>
  /// <param name="key">A key to create value for.</param>
  /// <returns>A value instance.</returns>
  protected V CreateValue(K key)
  {
    return builder.Factory != null ? builder.Factory(key) :
      builder.FactoryAsync(key).Result;
  }

  /// <summary>
  /// Creates a task for value for a key.
  /// </summary>
  /// <param name="key">A key to create value for.</param>
  /// <returns>A task for a value instance.</returns>
  protected Task<V> CreateValueAsync(K key)
  {
    return builder.FactoryAsync != null ? builder.FactoryAsync(key) :
      Task.FromResult(builder.Factory(key));
  }

  /// <summary>
  /// Validates the value.
  /// </summary>
  /// <param name="value">A value to validate.</param>
  /// <returns>
  /// true if value is valid for a cache, and false otherise.
  /// </returns>
  protected bool IsValid(V value)
  {
    return (value != null) &&
      ((builder.Validator == null) || builder.Validator(value));
  }

  /// <summary>
  /// Set implementation.
  /// </summary>
  /// <param name="key">A key to set value for.</param>
  /// <param name="value">A value to set.</param>
  /// <returns>A set value.</returns>
  private V SetImpl(string key, V value)
  {
    if (value == null)
    {
      builder.MemoryCache.Remove(key);
    }
    else
    {
      builder.MemoryCache.Set(
        key,
        value,
        builder.CachePolicyFunc != null ? builder.CachePolicyFunc(value) :
        builder.Sliding ?
          new CacheItemPolicy { SlidingExpiration = builder.Expiration } :
          new CacheItemPolicy
          {
            AbsoluteExpiration = DateTime.Now + builder.Expiration
          });
    }

    return value;
  }

  /// <summary>
  /// Cache builder.
  /// </summary>
  private Builder builder;
}

The usage consists of initialization:

Cache<MyKey, MyValue> MyValues = new Cache<MyKey, MyValue>.Builder
{
  KeyFunc = key => ...key to string value...,
  Factory = key => ...create a value for a key...,
  Expiration = new TimeSpan(0, 3, 0),
  Sliding = true
};

and a trivial cache access:

var value = MyValues[key];

This contrasts with the MemoryCache coding pattern:

MemoryCache cache = MemoryCache.Default;
...

var keyAsString = ...key to string value...
var value = cache.Get(keyAsString) as MyValue;

if (value == null)
{
  value = ...create a value for a key...

  cache.Set(keyAsString, value, ...caching policy...);
}

Monday, 28 July 2014 05:36:06 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Thursday, 10 July 2014

The Enumerable class contains many overloads with an IEqualityComparer<T> argument. The most notable methods are:

  • Contains;
  • Distinct;
  • Except;
  • GroupBy;
  • Intersect;
  • Join;
  • ToDictionary;
  • ToLookup;
  • Union.

Recently we dealt with a simple case:

source.
  Select(
    item =>
      new Word
      {
        Text = ...,
        LangID = ...,
        Properties = ...
        ...
      }).
  Distinct(equality comparer by Text and LangID);

In other words, how do you produce an enumeration of distinct words from an enumeration of words, where two words are considered equal if their Text and LangID are equal?

It turns out it's cumbersome to implement the IEqualityComparer<T> interface (as any other interface in C#); at least it's nothing close to the conciseness of lambda functions.

Here we've decided to step into the framework space and to introduce an API to define simple equality comparers for a class.

We start from the use case:

var wordComparer = KeyEqualityComparer.Null<Word>().
  ThenBy(item => item.Text).
  ThenBy(item => item.LangID);

...
source.Select(...).Distinct(wordComparer);

And then proceed to the API:

namespace NesterovskyBros.Linq
{
  using System;
  using System.Collections;
  using System.Collections.Generic;

  /// <summary>
  /// A equality comparer extensions.
  /// </summary>
  public static class KeyEqualityComparer
  {
    /// <summary>
    /// Gets null as equality comparer for a type.
    /// </summary>
    /// <typeparam name="T">A type.</typeparam>
    /// <returns>
    /// null as equality comparer for a type.
    /// </returns>
    public static IEqualityComparer<T> Null<T>()
    {
      return null;
    }

    /// <summary>
    /// Creates an equality comparer for a enumeration item.
    /// </summary>
    /// <typeparam name="T">A type.</typeparam>
    /// <param name="source">A source items.</param>
    /// <param name="keyFunc">A key function.</param>
    /// <returns>
    /// null as equality comparer for a type.
    /// </returns>
    public static IEqualityComparer<T> EqualityComparerBy<T, K>(
      this IEnumerable<T> source,
      Func<T, K> keyFunc)
    {
      return new KeyEqualityComparer<T, K>(keyFunc);
    }

    /// <summary>
    /// Creates an equality comparer that uses this comparer as a base.
    /// </summary>
    /// <typeparam name="T">A type.</typeparam>
    /// <typeparam name="K">A key type.</typeparam>
    /// <param name="equalityComparer">A base equality comparer.</param>
    /// <param name="keyFunc">A key function.</param>
    /// <returns>
    /// An equality comparer that uses this comparer as a base.
    /// </returns>
    public static KeyEqualityComparer<T, K> ThenBy<T, K>(
      this IEqualityComparer<T> equalityComparer,
      Func<T, K> keyFunc)
    {
      return new KeyEqualityComparer<T, K>(keyFunc, equalityComparer);
    }
  }

  /// <summary>
  /// Equality comparer that uses a function to extract a comparision key.
  /// </summary>
  /// <typeparam name="T">A type.</typeparam>
  /// <typeparam name="K">A key type.</typeparam>
  public struct KeyEqualityComparer<T, K>: IEqualityComparer<T>
  {
    /// <summary>
    /// Creates an equality comparer.
    /// </summary>
    /// <param name="keyFunc">A key function.</param>
    /// <param name="equalityComparer">A base equality comparer.</param>
    public KeyEqualityComparer(
      Func<T, K> keyFunc,
      IEqualityComparer<T> equalityComparer = null)
    {
      KeyFunc = keyFunc;
      EqualityComparer = equalityComparer;
    }

    /// <summary>
    /// Determines whether the specified objects are equal.
    /// </summary>
    /// <param name="x">The first object of type T to compare.</param>
    /// <param name="y">The second object of type T to compare.</param>
    /// <returns>
    /// true if the specified objects are equal; otherwise, false.
    /// </returns>
    public bool Equals(T x, T y)
    {
      return ((EqualityComparer == null) || EqualityComparer.Equals(x, y)) &&
        EqualityComparer<K>.Default.Equals(KeyFunc(x), KeyFunc(y));
    }

    /// <summary>
    /// Returns a hash code for the specified object.
    /// </summary>
    /// <param name="obj">
    /// The value for which a hash code is to be returned.
    /// </param>
    /// <returns>A hash code for the specified object.</returns>
    public int GetHashCode(T obj)
    {
      var hash = EqualityComparer<K>.Default.GetHashCode(KeyFunc(obj));

      if (EqualityComparer != null)
      {
        var hash2 = EqualityComparer.GetHashCode(obj);

        hash ^= (hash2 << 5) + hash2;
      }

      return hash;
    }

    /// <summary>
    /// A key function.
    /// </summary>
    public readonly Func<T, K> KeyFunc;

    /// <summary>
    /// Optional base equality comparer.
    /// </summary>
    public readonly IEqualityComparer<T> EqualityComparer;
  }
}

So, now you can easily build simple equality comparers, caching them or instantiating them on the fly. These comparers are usually based on property values or functions of source values.

See also LINQ extensions

Thursday, 10 July 2014 20:31:42 UTC  #    Comments [0] -
.NET | Thinking aloud
# Saturday, 28 June 2014

This is a small post about refactoring lock statements in async methods.

Before refactoring we had a code like this:

lock(sync)
{
  result = methodToRefactorIntoAsync();
}

...

private object sync = new object();

A lock is bound to a thread, thus there is no way to use it in async code. As an alternative you may use the SemaphoreSlim class:

await sync.WaitAsync(cancellationToken);

try
{
  result = await methodAsync(cancellationToken);
}
finally
{
  sync.Release();
}

...

private SemaphoreSlim sync = new SemaphoreSlim(1, 1);
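
If such blocks appear in many places, the wait/release pair can be wrapped into a small disposable scope; this is a sketch of ours, not part of the framework:

using System;
using System.Threading;
using System.Threading.Tasks;

public static class SemaphoreSlimExtensions
{
  // Waits on the semaphore and returns a scope whose Dispose() releases it.
  public static async Task<IDisposable> LockAsync(
    this SemaphoreSlim semaphore,
    CancellationToken cancellationToken = default(CancellationToken))
  {
    await semaphore.WaitAsync(cancellationToken);

    return new Releaser { semaphore = semaphore };
  }

  private sealed class Releaser: IDisposable
  {
    public SemaphoreSlim semaphore;

    public void Dispose()
    {
      semaphore.Release();
    }
  }
}

...

using(await sync.LockAsync(cancellationToken))
{
  result = await methodAsync(cancellationToken);
}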

Saturday, 28 June 2014 11:56:32 UTC  #    Comments [0] -
.NET | Thinking aloud
# Friday, 27 June 2014

What will you do if you have an async Web API method that runs on the server for some time but your client has dropped the connection?

There are two solutions:

  1. Run the method to the end and let the framework deal with the disconnect;
  2. Try to be notified about the client's disconnect and break early.

The first approach is the simplest but might result in overconsumption of server resources. The other method requires you to check the client status from time to time.

Fortunately, ASP.NET provides an HttpResponse.ClientDisconnectedToken property, which is limited to IIS 7.5+ in integrated mode, but still fits our needs. So, you should request the ClientDisconnectedToken, if any, and implement your async code using that token.

The following extension function gets that token:

using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading;
using System.Web;

public static class HttpApiExtensions
{
  public static CancellationToken GetCancellationToken(
    this HttpRequestMessage request)
  {
    // Defaults to a non-cancellable token if none can be obtained below.
    var cancellationToken = CancellationToken.None;
    object value;
    var key = typeof(HttpApiExtensions).Namespace + ":CancellationToken";

    if (request.Properties.TryGetValue(key, out value))
    {
      return (CancellationToken)value;
    }

    var httpContext = HttpContext.Current;

    if (httpContext != null)
    {
      var httpResponse = httpContext.Response;

      if (httpResponse != null)
      {
        try
        {
          cancellationToken = httpResponse.ClientDisconnectedToken;
        }
        catch
        {
          // Do not support cancellation.
        }
      }
    }

    request.Properties[key] = cancellationToken;

    return cancellationToken;
  }
}

And here is a Web API WordCount service described in the previous post:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ValuesController: ApiController
{
  public async Task<int> GetWordCount([FromUri(Name = "url")] string[] urls)
  {
    var cancellationToken = Request.GetCancellationToken();

    using(var client = new HttpClient())
    {
      return (await Task.WhenAll(
        urls.Select(url => WordCountAsync(client, url, cancellationToken)))).Sum();
    }
  }

  public static async Task<int> WordCountAsync(
    HttpClient client,
    string url,
    CancellationToken cancellationToken)
  {
    string content = await (await client.GetAsync(url, cancellationToken)).
      Content.ReadAsStringAsync();

    return WordCount(content);
  }

  private static int WordCount(string text)
  {
    var count = 0;
    var space = true;

    for (var i = 0; i < text.Length; ++i)
    {
      if (space != char.IsWhiteSpace(text[i]))
      {
        space = !space;

        if (!space)
        {
          ++count;
        }
      }
    }

    return count;
  }
}

Though it's simple, there is a nuisance: you should pass the cancellation token here and there, which adds to the pollution that comes with async.

Friday, 27 June 2014 14:42:44 UTC  #    Comments [0] -
.NET | Thinking aloud
# Wednesday, 25 June 2014

Though parallel and async algorithms solve different tasks, they converge in some cases. And it's not always immediately clear which is best.

Consider the following task: get the total word count for a given set of urls.

At first we solved it as a parallel task: indeed this fits the MapReduce pattern, where you get the urls' contents and count the number of words in parallel (Map), and then sum the word counts per url to get the final result (Reduce). But then we decided that the very same MapReduce algorithm can be implemented with async.

This is a parallel word count:

public static int ParallelWordCount(IEnumerable<string> urls)
{
  var result = 0;

  Parallel.ForEach(
    urls,
    url =>
    {
      string content;

      using(var client = new WebClient())
      {
        content = client.DownloadString(url);
      }

      var count = WordCount(content);

      Interlocked.Add(ref result, count);
    });

  return result;
}

Here is async word count:

public static async Task<int> WordCountAsync(IEnumerable<string> urls)
{
  return (await Task.WhenAll(urls.Select(url => WordCountAsync(url)))).Sum();
}

public static async Task<int> WordCountAsync(string url)
{
  string content;

  using(var client = new WebClient())
  {
    content = await client.DownloadStringTaskAsync(url);
  }

  return WordCount(content);
}

And this is an implementation of word count for a text (it's less important for this discussion):

public static int WordCount(string text)
{
  var count = 0;
  var space = true;

  for(var i = 0; i < text.Length; ++i)
  {
    if (space != char.IsWhiteSpace(text[i]))
    {
      space = !space;

      if (!space)
      {
        ++count;
      }
    }
  }

  return count;
}

Our impressions are:

  1. The parallel version is contained in one method, while the async one is implemented with two methods.

    This is due to the fact that the C# compiler fails to generate an async lambda function here. We attribute this to Microsoft, who leads and implements the C# spec. Features should be composable: if one can implement a method as a lambda function, and one can implement a method as async, then one should be able to implement a method as an async lambda function.

  2. Both parallel and async versions are using thread pool to run their logic.

  3. While both implementations follow the MapReduce pattern, we can see that the async version is much more scalable. This is because the parallel threads stay blocked while waiting for an http response. On the other hand, async tasks are not bound to any thread and simply are not running while waiting for I/O.

This sample helped us to answer the question of when to use parallel and when to use async. The simple answer goes like this (illustrated by a sketch after the list):

  • if your logic is only CPU bound then use the parallel API;
  • otherwise use the async API (this accounts for I/O waits).
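
And here is that sketch, a minimal illustration of our own (not taken from the sample above):

using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class Samples
{
  // CPU bound: the parallel API keeps all cores busy.
  public static long SumOfSquares(int[] numbers)
  {
    return numbers.AsParallel().Sum(n => (long)n * n);
  }

  // I/O bound: the async API; no thread is blocked while responses are awaited.
  public static async Task<int> TotalLengthAsync(string[] urls)
  {
    using(var client = new HttpClient())
    {
      var pages = await Task.WhenAll(urls.Select(url => client.GetStringAsync(url)));

      return pages.Sum(page => page.Length);
    }
  }
}
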
Wednesday, 25 June 2014 12:51:28 UTC  #    Comments [5] -
.NET | Thinking aloud
# Monday, 23 June 2014

Not long ago C# introduced special language constructs to simplify asynchronous programming. It seems C++1x will follow the async trend. But only recently, when frameworks like ASP.NET Web API and Entity Framework started to catch up, have we felt what it's like to program with the async and await keywords.

At first glance it seems it's a pure pleasure to write async methods:

private async Task SumPageSizesAsync()
{
  // To use the HttpClient type in desktop apps, you must include a using directive and add a
  // reference for the System.Net.Http namespace.
  HttpClient client = new HttpClient();
  // . . .

  byte[] urlContents = await client.GetByteArrayAsync(url);
  // . . .
}

To dereference a Task<T> into T you just write await task_t_expression, mark your method with async specifier, and adjust output type (if not void) to Task or Task<Result>. Compiler applies its magic to convert your code into an asynchronous state machine.
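
For example, a small synchronous method and its async counterpart might look like this (a sketch of ours following that recipe):

using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class PageSizeService
{
  // Synchronous version.
  public int GetPageSize(string url)
  {
    using(var client = new WebClient())
    {
      return client.DownloadData(url).Length;
    }
  }

  // Asynchronous version: async modifier, Task<int> return type, await inside.
  public async Task<int> GetPageSizeAsync(string url)
  {
    using(var client = new HttpClient())
    {
      return (await client.GetByteArrayAsync(url)).Length;
    }
  }
}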

We liked this feature and immediately started to use it. But, as we said, async/await shone in full only when frameworks made it a core element, and at that point we started to see that while async/await solve the task, they do not abstract the developer from implementation details, as the code gets considerably polluted.

Consider a method where the async-related pollution (async, await, Task, the async versions of API calls) is plainly visible:

public static async Task<UserAuthorization> GetAuthorizationAsync(string accessToken)
{
  var key = "oauth2:" + accessToken;
  var authorization = cache.Get<UserAuthorization>(key);

  if (authorization != null)
  {
    return authorization;
  }

  using(var model = new ModelContainer())
  {
    authorization =
      (await model.UserAuthorizations.
        Where(item => item.AccessToken == accessToken).
        ToListAsync()).
      FirstOrDefault();
  }

  if (authorization == null)
  {
    authorization = await ValidateAsync(accessToken);
  }

  cache.Set(key, cache.ShortDelay, authorization);

  return authorization;
}

The more you use async, the more pollution you will see in your code. Ideally we would like to see the quoted method without any of those async-specific parts.

Monday, 23 June 2014 06:15:55 UTC  #    Comments [0] -
.NET | Thinking aloud
# Sunday, 25 May 2014

After several years of experience with KendoUI we turned our attention to AngularJS. Like many other libraries it has its strong and weak sides. There are many resources describing what AngularJS is, and what it is not. Our approach to studying AngularJS was through an attempt to integrate it into an existing KendoUI web application.

It's rather straightforward to convert a model from KendoUI to AngularJS, as logically both frameworks are equivalent in this regard. But tactically KendoUI implements model-view binding very differently than AngularJS does. KendoUI binds model to view immediately per each model field, where AngularJS delays the binding of each model field and performs the whole model binding in one go. Angular's approach is more performant, and even more appealing to a developer, though the problem is that the time it takes to perform the whole model binding is proportional to the size (number of objects and properties) of the model. This means that if you have a relatively big model you will experience tangible halts in the browser's UI while the javascript updating view/model is running.

AngularJS advises some workarounds, which in essence boil down to avoiding a big model. The problem is that a couple of thousands or even several hundreds of objects and properties are already considered a big model. So, you should immediately plan your model and view to avoid any potential impact. This seriously distracts from the task you're solving.

The idea that your UI will halt for a time proportional to the size of your whole model looks flawed in our opinion. KendoUI has no such problem. That's the reason why our KendoUI to AngularJS transition experience was not smooth.

Our analysis of AngularJS sources shows that the issue could be resolved provided model to view binding (it's called digest in that library) was asynchronous.

To verify our ideas we have created a branch nesterovsky-bros/angular.js where we implemented required refactorings. It includes:

  • API based on existing deferred/promise to write algorithms in async way, and
  • refactored digest logic.

At the end we have proposed to integrate our changes into the main branch: Make $digest async.

We're not sure whether our proposition will be integrated (rather no than yes). Nevertheless, what we have come up with is an interesting extension of the deferred object that we have seen neither in AngularJS nor in jQuery, so later we will quote that API from q.js and scheduler.js.

Sunday, 25 May 2014 08:02:41 UTC  #    Comments [2] -
AngularJS | javascript | kendoui | Thinking aloud
# Tuesday, 11 February 2014

These are initial positions for this writing:

  • SQL Server allows executing dynamic SQL.
  • Dynamic SQL is useful and often unavoidable, e.g. when you have to filter or order data in a way that you cannot code efficiently in advance.
  • Dynamic SQL has proven to be a dangerous area, as with improper use it can open a security hole.

In general nothing stops you from building and then executing an SQL string. Our goal, however, is to define rules that make work with dynamic SQL more manageable and verifiable.

Here we outline these rules, and then give some examples and tips.

Rule #1. Isolate dynamic SQL

Put all logic related to building dynamic SQL into a separate function.
We usually define a separate schema Dynamic, and define functions like Dynamic.GetSQL_XXX(params).
This makes it simple to perform a code review.

Rule #2. Xml as parameters

Use the xml type to pass parameters to a function that builds dynamic SQL.
In many cases dynamic SQL depends on a variable number of parameters (like a list of values to check against).
Xml fits well here to represent structured information.
On a client (e.g. in C# or java) you can define a class with all parameters, populate an instance and serialize it to xml.

Rule #3. XQuery as template language

Use XQuery to define an SQL template and to generate an SQL tree from the input parameters.
Here is an example of such XQuery:

@data.query('
<sql>
select
  T.*
from
  Data.Ticket T
where
{
  for $ticketID in data/ticketID return
    <sql>(T.TicketID = <int>{$ticketID}</int>) and </sql>
}
(1 = 1)
</sql>')

You can see that the output is an xml with an sql element to represent literal SQL, and an int element to represent an integer literal.

In fact whole output schema can be defined like this:

<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="sql"/>
  <xs:element name="name"/>
  <xs:element name="string" nillable="true"/>
  <xs:element name="int" nillable="true"/>
  <xs:element name="decimal" nillable="true"/>
  <xs:element name="date" nillable="true"/>
  <xs:element name="time" nillable="true"/>
  <xs:element name="datetime" nillable="true"/>
</xs:schema>

where sql is to represent literal content, name to represent a name, and other elements to represent different literal values.

Rule #4. Escape literals

Use function Dynamic.ToSQL(@template) to build final SQL text.
Here we quote the definition:

-- Builds a text of SQL function for an sql template.
create function Dynamic.ToSQL
(
  -- SQL template.
  @template xml
)
returns nvarchar(max)
with returns null on null input
as
begin
  return
  (
    select
      case
        when N.Node.exist('*[xs:boolean(@xsi:nil)]') = 1 then
          'null'

        when N.Node.exist('self::int') = 1 then
          isnull(N.Node.value('xs:int(.)', 'nvarchar(max)'), '# int #')

        when N.Node.exist('self::string') = 1 then
          'N''' +
          replace
          (
            N.Node.value('.', 'nvarchar(max)'),
            '''',
            ''''''
          ) +
          ''''

        when N.Node.exist('self::name') = 1 then
          isnull
          (
            quotename(N.Node.value('.', 'nvarchar(128)'), '['),
            '# name #'
          )

        when N.Node.exist('self::datetime') = 1 then
          isnull
          (
            'convert(datetime2, ''' +
            N.Node.value('xs:dateTime(.)', 'nvarchar(128)') +
            ''', 126)',
            '# datetime #'
          )

        when N.Node.exist('self::date') = 1 then
          isnull
          (
            'convert(date, ''' +
            N.Node.value('xs:date(.)', 'nvarchar(128)') +
            ''', 126)',
            '# date #'
          )

        when N.Node.exist('self::time') = 1 then
          isnull
          (
            'convert(time, ''' +
            N.Node.value('xs:time(.)', 'nvarchar(128)') +
            ''', 114)',
            '# time #'
          )

        when N.Node.exist('self::decimal') = 1 then
          isnull
          (
            N.Node.value('xs:decimal(.)', 'nvarchar(128)'),
            '# decimal #'
          )

        when N.Node.exist('self::*') = 1 then
          '# invalid template #'

        else
          N.Node.value('.', 'nvarchar(max)')
      end
    from
      @template.nodes('//sql/node()[not(self::sql)]') N(Node)
    for xml path(''), type
  ).value('.', 'nvarchar(max)');
end;

Now, we want to stress that this function plays an important role in the prevention of SQL injection, as it escapes the literals from the SQL tree.

Rule #5 (optional). Collect data

Use SQL to collect additional data required to build dynamic SQL. Here is an example of how we get a Ticket by StatusID, while on input we receive a StatusName:

create function Dynamic.GetSQL_GetTicketByStatus(@data xml)
returns nvarchar(max)
as
begin
  set @data =
    (
      select
        @data,
        (
          select
            T.StatusID
          from
            @data.nodes('/data/status') N(Node)
            inner join
            Metadata.Status T
            on
              T.StatusName = Node.value('.', 'nvarchar(128)')
            for xml auto, type, elements
        )
      for xml path('')
    );

  return Dynamic.ToSQL
  (
    @data.query
    ('
<sql>
select
  T.*
from
  Data.Ticket T
where
  T.Status in ({ for $status in /T/StatusID return <sql><int>{$status}</int>,</sql> } null)
</sql>
    ')
  );
end;

Notice the code that collects some more data (resolving the StatusName values to StatusID) before calling the XQuery.

Rule #6. Execute

The final step is to call dynamic SQL.
This is done like this:

-- build
declare @sql nvarchar(max) = Dynamic.GetSQL_GetTicket(@data);

-- execute
execute sp_executesql
  @sql
  -- {, N'@parameter_name data_type [ OUT | OUTPUT ][ ,...n ]' }
  -- { , [ @param1 = ] 'value1' [ ,...n ] }
with result sets
(
  (
    TicketID int not null,
    CreatedAt datetime2 not null,
    Summary nvarchar(256) null,
    Status int,
    Severity int,
    DeadLineAt datetime2 null
  )
);

Notice that the use of dynamic SQL does not prevent the use of static parameters.
Notice also that the with result sets clause is used to specify the shape of the output.

Example. Tickets system

Let's assume you're dealing with a tickets system (like Bugzilla), and you have a table Data.Ticket to describe tickets. Assume that DDL for this table is like this:

create table Data.Ticket
(
  TicketID bigint not null primary key,
  CreatedAt datetime2 not null,
  Summary nvarchar(128) null,
  Status int not null,
  UpdatedAt datetime2(7) not null
)

Suppose you have to build C# code to search different tickets, where Entity Framework is used to access the database.
Search should be done by a range of CreatedAt, a range of UpdatedAt, Summary, or by different Status values. It should be possible to order results in different ways.

We start our solution from the C# side and define classes for a request:

public enum Direction
{
  Asc,
  Desc
}

public struct Order
{
  public string Field { get; set; }
  public Direction Direction {get; set; }
}

public class DateRange
{
  public DateTime? From { get; set; }

  // This property is to omit From element if value is null.
  // See rules for xml serialization.
  public bool FromSpecified { get { return From != null; } }

  public DateTime? To { get; set; }
  public bool ToSpecified { get { return To != null; } }
}

public class TicketsRequest
{
  public DateRange CreatedAt { get; set; }
  public string Summary { get; set; }
  public DateRange UpdatedAt { get; set; }
  [XmlElement]
  public Order[] Order { get; set; }
  [XmlElement]
  public int[] Status { get; set; }
}

Notice that we're going to use XmlSerializer to convert the request to xml and then pass the parameter into EF's model. Here is a utility method to perform such a conversion:

public static string ToXmlString<T>(T value)
{
  if (value == null)
  {
    return null;
  }

  var serializer = new XmlSerializer(typeof(T));
  var builder = new StringBuilder();

  var writer = XmlWriter.Create(
    builder,
    new XmlWriterSettings
    {
      OmitXmlDeclaration = true,
      Indent = false
    });

  serializer.Serialize(writer, value);
  writer.Flush();

  return builder.ToString();
}
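
For instance, the request used in the sample later in this post can be built and serialized like this (a usage sketch):

var request = new TicketsRequest
{
  CreatedAt = new DateRange { From = new DateTime(2014, 1, 1) },
  Summary = "hello%",
  Order = new[] { new Order { Field = "Status", Direction = Direction.Desc } },
  Status = new[] { 1, 3 }
};

// Produces the <TicketsRequest>...</TicketsRequest> xml shown below,
// ready to be passed to the Data.GetTickets procedure.
var xml = ToXmlString(request);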

Now we proceed to the database and define a procedure that runs the search:

-- Gets tickets.
create procedure Data.GetTickets
(
  -- A query parameters.
  @params xml
)
as
begin
  set nocount on;

  -- This is for EF to guess type of result.
  if (1 = 0)
  begin
    select
      TicketID,
      CreatedAt,
      Summary,
      Status,
      UpdatedAt
    from
      Data.Ticket;
  end;

  declare @sql nvarchar(max) = Dynamic.GetSQL_GetTickets(@params);

  execute sp_executesql @sql
  with result sets
  (
    (
      TicketID int not null,
      CreatedAt datetime2 not null,
      Summary nvarchar(256) null,
      Status int,
      UpdatedAt datetime2 null
    )
  );
end;

Switch back to C#, import the Data.GetTickets into the EF model, and create a search method:

public IEnumerable<Ticket> GetTickets(TicketsRequest request)
{
  var model = new Model();

  return model.GetTickets(ToXmlString(request));
}

The last ingredient is Dynamic.GetSQL_GetTickets() function.

create function Dynamic.GetSQL_GetTickets(@data xml)
returns nvarchar(max)
as
begin
  return Dynamic.ToSQL
  (
    @data.query('
<sql>
select
  T.TicketID,
  T.CreatedAt,
  T.Summary,
  T.Status,
  T.UpdatedAt
from
  Data.Ticket T
where
{
  for $range in */CreatedAt return
  (
    for $date in $range/From return
    <sql>
      (T.CreatedAt >= <datetime>{$date}</datetime>) and
    </sql>,

    for $date in $range/To return
    <sql>
      (<datetime>{$date}</datetime> > T.CreatedAt) and
    </sql>
  ),

  for $range in */UpdatedAt return
  (
    for $date in $range/From return
    <sql>
      (T.UpdatedAt >= <datetime>{$date}</datetime>) and
    </sql>,

    for $date in $range/To return
    <sql>
      (<datetime>{$date}</datetime> > T.UpdatedAt) and
    </sql>
  ),

  for $summary in */Summary return
  <sql>
    (T.Summary like <string>{$summary}</string>) and
  </sql>,

  if (*/Status) then
  <sql>
    T.Status in
      ({
        for $status in */Status return
          <sql><int>{$status}</int>, </sql>
      } null) and
  </sql>
  else ()
}
(1 = 1)
order by
{
  for $order in
    */Order
    [
      Field = ("TicketID", "CreatedAt", "Summary", "UpdatedAt", "Status")
    ]
  return
  <sql>
    <name>{$order/Field}</name>
    {" desc"[$order[Direction = "Desc"]]},
  </sql>
}
(select null)
</sql>
    ')
  );
end;

SQL text from Dynamic.GetSQL_GetTickets()

Consider now SQL text produced by this function. For an input:

<TicketsRequest>
  <CreatedAt>
    <From>2014-01-01T00:00:00</From>
  </CreatedAt>
  <Summary>hello%</Summary>
  <Order>
    <Field>Status</Field>
    <Direction>Desc</Direction>
  </Order>
  <Status>1</Status>
  <Status>3</Status>
</TicketsRequest>

the output is:

select
  T.TicketID,
  T.CreatedAt,
  T.Summary,
  T.Status,
  T.UpdatedAt
from
  Data.Ticket T
where

      (T.CreatedAt >= convert(datetime2, '2014-01-01T00:00:00', 126)) and

    (T.Summary like N'hello%') and

    T.Status in
      (1, 3, null) and

  (1 = 1)
order by
[Status] desc,

  (select null)

Though the text is not formatted as we would like, it's perfectly valid SQL.

Tips for building XQuery templates

What is called XQuery in SQL Server is in fact a very limited subset of XQuery 1.0. Microsoft clearly states this fact. What is trivial in XQuery is often impossible or ugly in XQuery of SQL Server.

Nevertheless XQuery in SQL Server works rather well as an SQL template language. To use it most effectively, however, you should learn several tips.

Tip #1. Where clause

In template you might want to build a where clause:

<sql>
select
...
where
{
  if (...) then
    <sql>...</sql>
  else ()
}
</sql>

and it might happen that for a certain input the condition under where collapses, and you are left with the where keyword without a real condition, which is wrong. A simple workaround is to always add some true condition under the where, like this:

<sql>
select
...
where
{
  if (...) then
    <sql>... and </sql>
  else ()
} (1 = 1)
</sql>

Tip #2. "in" expression

If you want to generate "in" expression like this:

value in (item1, item2,...)

then you might find that it's much easier to generate the equivalent code like this:

value in (item1, item2,..., null).

Here is an XQuery to generate such a template:

value in
  ({
    for $item in ... return
      <sql><int>{$item}</int>, </sql>
  } null) and

Tip #3. Order by

You can terminate an order by clause built from data with a dummy expression like this:

order by
{
  for $item in ... return
    <sql>
      <name>{$item/Field}</name>
      {" desc"[$item/Direction = "Desc"]},
    </sql>
} (select null)

Alternatively you can use the first column of a clustered index.

Tip #4. Group by

In a group by clause we cannot introduce a terminator expression as with order by, so the code is less trivial:

{
  let $items := ... return

  if ($items) then
    <sql>
      group by <name>{$items[1]}</name>
      {
        for $item in $items[position() > 1] return
          <sql>, <name>{$item}</name></sql>
      }
    </sql>
  else ()
}

In fact similar logic may work with order by.

Tip #5. Escape literals

It's crucial not to introduce SQL injection while building SQL. Thus use:

<int>{...}</int> - for literal int;
<decimal>{...}</decimal> - for literal decimal;
<string>{...}</string> - for literal string;
<datetime>{...}</datetime> - for literal datetime2;
<date>{...}</date> - for literal date;
<time>{...}</time> - for literal time;
<name>{...}</name> - for a name to quote.

Note that you can use xsi:nil, so <int xsi:nil="true"/> means null.

If you generate a field name from input data then it's worth validating it against a list of allowed names.

Tip #6. Validate input.

It's worth defining an xml schema for the input xml, and validating the parameters against it.
This makes the code more secure, and also serves as documentation.
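
For example, on the client side the parameters xml can be validated before it is sent to the database; this is a sketch using the standard XmlReader validation API (the schema file name and the requestXml variable are hypothetical):

using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class RequestValidator
{
  // Throws XmlSchemaValidationException when the xml does not match the schema.
  public static void Validate(string xml, string schemaPath)
  {
    var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };

    settings.Schemas.Add(null, schemaPath);

    using(var reader = XmlReader.Create(new StringReader(xml), settings))
    {
      while(reader.Read())
      {
      }
    }
  }
}

...

RequestValidator.Validate(requestXml, "TicketsRequest.xsd");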

Tip #7. Don't abuse dynamic SQL

There are not too many cases when you need dynamic SQL. Usually the SQL engine knows how to build a good execution plan. If your query contains optional conditions then you can write it in a way that SQL Server can optimize, e.g.:

select
  *
from
  T
where
  ((@name is null) or (Name = @name)) and
  ((@date is null) or (Date = @date))
option(recompile)

Tuesday, 11 February 2014 09:48:07 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks
# Wednesday, 22 January 2014

Consider how you would implement the Style object of the HTML DOM.

These are some characteristics of that object:

  • It has a long list of properties, e.g. in IE 11 there are more than 300 properties over a style object.
  • Any specific instance usually has only several properties assigned.
  • Reads of properties are much more frequent than writes. In fact style often stays unchanged after initialization.
  • DOM contains many style instances (often thousands).
  • The number of distinct instances in terms of values of properties is moderate (usually dozens).

Here is how we would approach such an object.

1. Styles are sparse objects, thus there is no point in implementing a plain class with all those properties, as it's wasteful.

We would rather use two techniques to keep style's state:

  • A dictionary of properties with their values;
  • An aggregation of objects, where all properties are grouped into families, each group is defined by a separate type, and a style's state is an aggregation of those groups.

The current style of an element is an aggregation of the styles of its ancestor elements. It can either be dynamic or be fused into a single style instance.

2. Make style's state immutable, and share all these states among all style instances.

In this implementation a property write turns into a state transition operation: state = set(state, property, value). Thus no state is modified; it is replaced with another state that corresponds to the required change.

If state is seen as a dictionary then the API may look like this:

public class State<K, V>
{
  // Gets shared dictionary for an input dictionary.
  public IDictionary<K, V> Get(IDictionary<K, V> dictionary);

  // Gets a shared dictionary for an input dictionary with key set to a value.
  public IDictionary<K, V> Set(IDictionary<K, V> dictionary, K key, V value);

  // Gets a shared dictionary for an input dictionary with the key removed.
  public IDictionary<K, V> Remove(IDictionary<K, V> dictionary, K key);

  // Gets typed value.
  public T Get<T>(IDictionary<K, V> dictionary, K key)
    where T: V
  {
    V value;

    if ((dictionary == null) || !dictionary.TryGetValue(key, out value))
    {
      return default(T);
    }

    return (T)value;
  }

  // Sets or removes a typed value.
  // dictionary can be null.
  // null returned if output dictionary would be empty.
  public IDictionary<K, V> Set<T>(IDictionary<K, V> dictionary, K key, T value)
    where T : V
  {
    return value == null ? Remove(dictionary, key) :
      Set(dictionary, key, (V)value);
  }
}

States can be cached. Provided the cache keeps states in a weak way, no unused state will be stored for a long time. We may use a weak table from dictionary to dictionary, WeakTable<Dictionary<K, V>, Dictionary<K, V>>, as the storage for such a cache. All the required API is described in the WeakTable and Hash Code of Dictionary posts.

3. Style can be implemented as a structure with a shared state as storage. Here is a sketch:

[Serializable]
public struct Style
{
  // All properties.
  public enum Property
  {
    Background,
    BorderColor,
    BorderStyle,
    Color,
    FontFamily,
    FontSize,
    // ...
  }

  public int? Background
  {
    get { return states.Get<int?>(state, Property.Background); }
    set { state = states.Set(state, Property.Background, value); }
  }

  public int? BorderColor
  {
    get { return states.Get<int?>(state, Property.BorderColor); }
    set { state = states.Set(state, Property.BorderColor, value); }
  }

  public string BorderStyle
  {
    get { return states.Get<string>(state, Property.BorderStyle); }
    set { state = states.Set(state, Property.BorderStyle, value); }
  }

  public int? Color
  {
    get { return states.Get<int?>(state, Property.Color); }
    set { state = states.Set(state, Property.Color, value); }
  }

  public string FontFamily
  {
    get { return states.Get<string>(state, Property.FontFamily); }
    set { state = states.Set(state, Property.FontFamily, value); }
  }

  public double? FontSize
  {
    get { return states.Get<double?>(state, Property.FontSize); }
    set { state = states.Set(state, Property.FontSize, value); }
  }

  // ...

  [OnDeserialized]
  private void OnDeserialized(StreamingContext context)
  {
    state = states.Get(state);
  }

  // A state.
  private IDictionary<Property, object> state;

  // A states cache.
  private static readonly State<Property, object> states =
    new State<Property, object>();
}

Note that:

  • the default state is a null dictionary;
  • states are shared application-wide.

The following link is our implementation of State<K, V> class: State.cs.
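
A short usage sketch of the Style struct above (our illustration): since State<K, V> canonicalizes dictionaries by content, styles with equal property sets end up sharing one state instance.

var a = new Style { Color = 0x333333, FontSize = 12 };
var b = new Style { Color = 0x333333, FontSize = 12 };

// a and b are distinct struct values, but their private state dictionaries
// are the same shared instance taken from the cache, so thousands of
// identical styles cost a single dictionary.

// A property write does not mutate the shared state; it just makes b
// reference another shared state:
b.FontSize = 14;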

 

Here we have outlined the idea of a shared state object, and how it can be applied to sparse, mostly immutable objects. We used the HTML style as an example of such an object. A shared state object may work in many other areas, but for it to shine, its use case should fit the task.

Wednesday, 22 January 2014 19:43:25 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Monday, 13 January 2014

Dealing recently with some task (the same that inspired us to implement WeakTable), we were in a position to use a dictionary as a key in another dictionary.

What are the rules for a class to be used as a key?

  • the key should be immutable;
  • the key should implement a GetHashCode() method;
  • the key should implement an Equals() method.

The first requirement is usually implemented as a documentation contract like this:

As long as an object is used as a key in the Dictionary<TKey, TValue>, it must not change in any way that affects its hash value.

The third requirement, about Equals(), is trivially implemented as a method:

public bool Equals(IDictionary<K, V> x, IDictionary<K, V> y)
{
  if (x == y)
  {
    return true;
  }

  if ((x == null) || (y == null) || (x.Count != y.Count))
  {
    return false;
  }

  foreach(var entry in x)
  {
    V value;

    if (!y.TryGetValue(entry.Key, out value) ||
      !valueComparer.Equals(entry.Value, value))
    {
      return false;
    }
  }

  return true;
}

But how would you implement hash code?

We argued like this.

1. Let's consider the dictionary as a sparse array of values where the only populated items are those that correspond to key hash codes.

2. The hash code is constructed using some fair algorithm, e.g. like the one used in java to calculate a string's hash code:

       n-1
h(s) = SUM (s[i]*p^(n-1-i)) mod m,  where m = 2^31
       i=0

In our case:

  • n can be an arbitrarily large int value, so in fact it's 2^32;
  • items are enumerated in an unknown order;
  • there is only a limited set of items, so most s[i] are zeros.

As a result we cannot use a recurrent function to calculate the power p^k mod m. Fortunately, one can build a fast exponentiation arguing like this: split k into s-bit chunks k[i], so that

          32/s - 1
p^k = p^(   SUM   k[i]*2^(s*i)) mod m,  where s is some int: 1, 2, 4, 8, 16, or 32.
            i=0

Thus

     32/s - 1
p^k = PRODUCT (p^(2^(s*i)))^k[i] mod m
        i=0

If s = 1 then k[i] is either 1 or 0 (a bit), and there are 32 different p^(2^i) mod m values, which can be precalculated.

On the other hand, if we select s = 8 we can write the formula as:

p^k = p^k[0] * (p^(2^8))^k[1] * (p^(2^16))^k[2] * (p^(2^24))^k[3] mod m

where k[i] is an 8-bit value (a byte).

Precalculating all the values p^n, (p^(2^8))^n, (p^(2^16))^n, (p^(2^24))^n for n from 0 to 255, we reach a formula with 4 multiplications and 1024 precalculated values.

Here is the whole utility to calculate hash factors:

/// <summary>
/// Hash utilities.
/// </summary>
public class Hash
{
  /// <summary>
  /// Returns a P^value mod 2^31, where P is hash base.
  /// </summary>
  /// <param name="value">A value to get hash factor for.</param>
  /// <returns>A hash factor value.</returns>
  public static int GetHashFactor(int value)
  {
    return factors[(uint)value & 0xff] *
      factors[(((uint)value >> 8) & 0xff) | 0x100] *
      factors[(((uint)value >> 16) & 0xff) | 0x200] *
      factors[(((uint)value >> 24) & 0xff) | 0x300];
  }

  /// <summary>
  /// Initializes hash factors.
  /// </summary>
  static Hash()
  {
    var values = new int[4 * 256];
    var value = P;
    var current = 1;
    var i = 0;

    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 256);

    value = current;
    current = 1;

    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 512);

    value = current;
    current = 1;

    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 768);

    value = current;
    current = 1;

    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 1024);

    factors = values;
  }

  /// <summary>
  /// A base to calculate hash factors.
  /// </summary>
  public const int P = 1103515245;

  /// <summary>
  /// Hash factors.
  /// </summary>
  private static readonly int[] factors;
}

With this API the hash code for a dictionary is a trivial operation:

public int GetHashCode(IDictionary<K, V> dictionary)
{
  if (dictionary == null)
  {
    return 0;
  }

  var result = 0;

  foreach(var entry in dictionary)
  {
    if ((entry.Key == null) || (entry.Value == null))
    {
      continue;
    }

    result += Hash.GetHashFactor(keyComparer.GetHashCode(entry.Key)) *
      valueComparer.GetHashCode(entry.Value);
  }

  return result;
}

And finally, here is a reference to a class DictionaryEqualityComparer<K, V>: IEqualityComparer<IDictionary<K, V>> that allows a dictionary to be a key in another dictionary.
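
A usage sketch (assuming the comparer referenced above exposes a parameterless constructor that falls back to the default key and value comparers):

var cache = new Dictionary<IDictionary<string, object>, string>(
  new DictionaryEqualityComparer<string, object>());

IDictionary<string, object> key = new Dictionary<string, object>
{
  { "name", "John" },
  { "age", 42 }
};

cache[key] = "some cached value";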

Update

We have committed some tests, and have found that with a sufficiently "good" implementation of GetHashCode() for the key or the value we achieve results of almost the same quality as those of the algorithm outlined above, using a much simpler and more straightforward algorithm like this:

public int GetHashCode(IDictionary<K, V> dictionary)
{
  if (dictionary == null)
  {
    return 0;
  }

  var result = 0;

  foreach(var entry in dictionary)
  {
    if ((entry.Key == null) || (entry.Value == null))
    {
      continue;
    }

    var k = entry.Key.GetHashCode();
    var v = entry.Value.GetHashCode();

    k = (k << 5) + k;
    v = (v << (k >> 3)) + v;

    result += k ^ v;

    //result += Hash.GetHashFactor(keyComparer.GetHashCode(entry.Key)) *
    // valueComparer.GetHashCode(entry.Value);
  }

  return result;
}

It was worth blogging about this just to find out that we had outwitted ourselves, and to finally arrive at a trivial hash code implementation for the dictionary.

Monday, 13 January 2014 20:33:31 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Wednesday, 08 January 2014

Dealing recently with some task, we were in a position to use a weak dictionary in .NET. Instinctively we assumed that it should exist somewhere in the standard library. We definitely knew that there is a WeakReference class for a single instance. We also knew that there is a WeakHashMap in java, and that it's based on java's WeakReference.

So, we were surprised to find that there is no such thing out of the box in .NET.

We have found that java's and .NET's weak references are different. In java, weak references whose targets are GCed can be automatically put into a queue, which can be used to build cleanup logic to remove dead keys from a weak hash map. There is nothing similar in .NET, where a weak reference just silently loses its value.

The Internet is full of custom implementations of weak dictionaries in .NET.

.NET 4 finally defines a class ConditionalWeakTable<TKey, TValue>, which solves the problem in the case when you need to match keys by instance identity.

Unfortunately, in our case we needed to match keys using the key's GetHashCode() and Equals(). So, ConditionalWeakTable<TKey, TValue> did not work directly, but then we found a way to make it work for us.

Here is a quote from the definition:

A ConditionalWeakTable<TKey, TValue> object is a dictionary that binds a managed object, which is represented by a key, to its attached property, which is represented by a value. The object's keys are the individual instances of the TKey class to which the property is attached, and its values are the property values that are assigned to the corresponding objects.

...in the ConditionalWeakTable<TKey, TValue> class, adding a key/value pair to the table does not ensure that the key will persist, even if it can be reached directly from a value stored in the table... Instead, ConditionalWeakTable<TKey, TValue> automatically removes the key/value entry as soon as no other references to a key exist outside the table.

This property of ConditionalWeakTable<TKey, TValue> has helped us to build a way to get a notification when the key is being finalized, which is the missing ingredient in .NET's weak references.

Assume you have an instance key of type Key. To get a notification you should define a class Finalizer that will call some handler when it's finalized, and you should bind the key and a finalizer instance using the weak table.

The code looks like this:

public class Finalizer<K>
  where K: class
{
  public static void Bind(K key, Action<K> handler)
  {
    var finalizer = table.GetValue(key, k => new Finalizer<K> { key = k });

    finalizer.Handler += handler;
  }

  public static void Unbind(K key, Action<K> handler)
  {
    Finalizer<K> finalizer;

    if (table.TryGetValue(key, out finalizer))
    {
      finalizer.Handler -= handler;
    }
  }

  ~Finalizer()
  {
    var handler = Handler;

    if (handler != null)
    {
      handler(key);
    }
  }

  private event Action<K> Handler;
  private K key;

  private static readonly ConditionalWeakTable<K, Finalizer<K>> table =
    new ConditionalWeakTable<K, Finalizer<K>>();
}


Key key = ...

Finalizer<Key>.Bind(key, k => { /* clean up. */ });

Using this approach we have created a class WeakTable<K, V> modeled after ConditionalWeakTable<TKey, TValue>.

So, this is our take on the problem: WeakTable.cs.

Wednesday, 08 January 2014 21:57:16 UTC  #    Comments [0] -
.NET | Java | Thinking aloud | Tips and tricks
# Monday, 04 November 2013

A day ago we installed Win 8.1 on our laptops. We thought it would be better than Win 8, but now we are afraid of the next release of Windows... So far, so worse.

ICS became unusable: in Windows 7 we shared the Internet connection from a USB 3G dongle between our two computers; in Windows 8 we succeeded in sharing only file system access, but not the Internet. In Windows 8.1 neither the Internet nor the file system is accessible...

Monday, 04 November 2013 11:20:28 UTC  #    Comments [0] -
Thinking aloud
# Monday, 14 October 2013

Till recently we were living in the simple world of SQL-style string comparisons, and now everything has changed.

From our university years we knew that strings in SQL are compared by first trimming trailing spaces, and then comparing in C style.

Well, the picture was a little more complex, as collations were involved (national, case sensitivity), and different SQL vendors implemented it differently.

Next, we're dealing with programs converted from COBOL, which we originally thought follow SQL rules when strings are compared.

 

Here is where the problem has started.

Once we found that a java program branched differently than the original COBOL, and the reason was that COBOL and java compared two strings differently:

  • COBOL: "A\n" < "A";
  • Java: "A\n" > "A"

We have looked into COBOL Language Reference and found the rules:

Operands of equal size
Characters in corresponding positions of the two operands are compared, beginning with the leftmost character and continuing through the rightmost character.

If all pairs of characters through the last pair test as equal, the operands are considered as equal.

If a pair of unequal characters is encountered, the characters are tested to determine their relative positions in the collating sequence. The operand that contains the character higher in the sequence is considered the greater operand.

Operands of unequal size
If the operands are of unequal size, the comparison is made as though the shorter operand were extended to the right with enough spaces to make the operands equal in size.

You can see that strings must not be trimmed but padded with spaces to the length of the longer string, and only then compared. This subtle difference has a significant impact for characters that collate below the space.

So, here we've found that COBOL and SQL comparisons are different.

But then we questioned how SQL really behaves.

We've tested comparisons in SQL Server and DB2, and have seen that our understanding of SQL comparison holds. It works as if spaces were trimmed, and the strings then compared.

But again we looked into the SQL-92 definition, and this is what we see there:

8.2 <comparison predicate>
3) The comparison of two character strings is determined as follows:

a) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD attribute, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a <space>.

So, what we see is that SQL-92 rules are very close to COBOL rules, but then we reach the question: how come that at least SQL Server and DB2 implement string comparison differently than SQL-92 dictates?

Update: we have found that both SQL Server and DB2 have their string collation defined in such a way that <space> collates lower than any other character. So the following is always true: '[' + char(13) + ']' > '[ ]'.
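
For illustration, here is a small C# sketch of ours contrasting plain ordinal comparison with the pad-then-compare rule (in both ASCII and EBCDIC the newline character collates below the space):

using System;

public static class PaddedCompareDemo
{
  // COBOL/SQL-92 style: pad the shorter operand with spaces, then compare.
  public static int ComparePadded(string x, string y)
  {
    var length = Math.Max(x.Length, y.Length);

    return string.CompareOrdinal(x.PadRight(length), y.PadRight(length));
  }

  public static void Main()
  {
    // Plain ordinal comparison: "A\n" > "A", the longer string wins on a common prefix.
    Console.WriteLine(string.CompareOrdinal("A\n", "A") > 0);  // True

    // Padded comparison: "A\n" vs "A " => '\n' < ' ', so "A\n" < "A".
    Console.WriteLine(ComparePadded("A\n", "A") < 0);          // True
  }
}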

Monday, 14 October 2013 20:23:11 UTC  #    Comments [0] -
Java | SQL Server puzzle | Thinking aloud | Tips and tricks
# Sunday, 14 July 2013

Recently we've seen an article Why mobile web apps are slow.

While the title mentions web apps, the criticism is directed purely at the javascript language. The answer presented there is twofold:

  • Raw javascript performance is ~5 times slower than that of native code solving the same task.

    This accounts for the most modern implementations that exploit JIT. The author does not expect that this proportion will change significantly in javascript's favor in the near future.

  • Garbage Collection, which is an essential part of javascript, does not work well in the constrained environment of mobile devices.

    Here the author quotes many references that show that:

    • for GC to work on par with a non-GC application, it needs ~6-8 times more memory than the application actually needs;
    • at the same time, for hardware reasons, mobile devices cannot provide such an amount of memory;
    • on the other hand, with the rise of CPU performance, GC pressure rises even faster.

In the end the author, while mentioning some attempts to change the state of affairs, gives no final verdict on whether anything can be done to remedy the problem.

Having roots in C++, we're GC unbelievers. But you know, who will ask your opinion on that, while there are all those modern languages that try to abstract away from memory and implicitly or explicitly assume GC: java, C#, javascript, xslt, xquery, and so on.

There is always the option to avoid GC completely, as C++ and others (e.g. Microsoft's C++/CX, or Apple's ARC) do. But, assuming you've entered the GC world, what can be done with it? How do you make it more predictable, less greedy, and perhaps faster?

Our arguments are like this.

How does native code manage object graphs?

Today's solution is reference counting along with weak references to break cycles in the graph.

Can GC be based on this?

Yes.

In fact the upcoming javascript spec contains weak references. So, provided a developer accurately defines relations in an object graph, one may immediately achieve the same efficiency as a native solution.

If one does not use weak references consistently then object cycles can be created, and a graph may appear that is not referenced anywhere in the program. Such a graph can be collected with a classical GC that scans object roots.

The classical GC part can be used as a debug-mode leak detector, and will collect graph cycles at runtime.

Thus, we claim that a hybrid memory management (reference counting with weak references plus a classical GC) is possible; it will be equal to native memory management when weak references are properly used, and it will be close to classical GC without the use of weak references.

This solution gives a rule to mitigate GC drawbacks: just use weak references at the appropriate points, and you can continue to live in the GC world, where GC is only a fallback.

Sunday, 14 July 2013 12:20:29 UTC  #    Comments [0] -
javascript | Thinking aloud
# Friday, 12 July 2013

REST is a well known and established architecture. It does not need our approval or recommendations.

It's something different that we want to express: we'd like to thank all those people who formulated these ideas.

While one can argue that REST has "always" existed along with the web, from our experience we can see that most of the web applications we created up to the late 2000s were stateful.

We can see that many web applications have to support application (session) state. And here REST gives a rule:

  • client stores a session state;

  • server does not store a session, and is represented as set of services;

  • if there is state that cannot be stored on a client (e.g. for security reasons), then the server should use caches, and be able to reconstruct that state upon a cache miss.

Now is a good time for the proliferation of REST, as even the weakest clients (browsers) can implement its requirements.

REST allowed us to drastically simplify development and support, to unload the server, and to build better web applications.

We compare web applications we wrote a decade ago using ASP.NET with those written in the last 3-4 years using RESTful ideas, both in java and in .NET. Clearly, the latter have better performance, but from the support standpoint the most appealing point is that you can instantly upgrade the application without impacting users, as there are no sessions on the server. For the same reason you do not have to puzzle over whether to use in-process sessions and session stickiness with a load-balancing server, or out-of-process sessions.

Things became simpler:

  • the server is now pure logic exposed through services (WCF, Web API, JAX-RS);

  • the client is the gui - jquery, kendoui or other;

  • aspx/jsf pages are gone completely.

Friday, 12 July 2013 13:25:28 UTC  #    Comments [0] -
Thinking aloud
# Sunday, 07 July 2013

Earlier, in the article How To: Load KendoUI Templates from External Files, we talked about a way to combine a project's templates into a single file using Text Templates. Now we would like to suggest the next step.

KendoUI defines text templates that it knows how to transform into functions, at runtime obviously. Thus a template like this:

<tr>
  <td data-bind=" text: name"></td>
  <td>#: kendo.toString(get("price"), "C") #</td>
  <td data-bind="text: unitsInStock"></td>
  <td><button class="k-button" data-bind="click: deleteProduct"> Delete</button></td>
</tr>

is transformed into a function:

function(data)
{
  var o,e=kendo.htmlEncode;

  with(data)
  {
    o='<tr><td data-bind="text: name"></td><td>'+
      e( kendo.toString(get("price"), "C") )+
      '</td><td data-bind="text: unitsInStock"></td>' +
      '<td><button class="k-button" ' +
      'data-bind="click: deleteProduct">Delete</button></td></tr>';
  }

  return o;
}

The transformation is done through a sequence of regex replaces.

Now, what's the fastest javascript template engine?

Right! That, which does not work at runtime. :-)

What we thought is that we could generate those functions at compile time rather than define templates that are transformed at runtime.

We have updated templates.tt to generate template functions, and optionally to generate <script> tags that call those functions. This way, for an input footer.tmpl.html:

<!DOCTYPE html>
<html>
<head>
  <title>Test</title>
  <base href="/" />
  <link href="styles/kendo.common.min.css" rel="stylesheet" />
  <link href="styles/kendo.default.min.css" rel="stylesheet" />
</head>
<body>
  <table data-template-id="view">
    <tr>
      <td>Products count: #: total() #</td>
      <td>Total price: #: totalPrice() #</td>
      <td colspan="2">Units in stock: #: totalUnitsInStock() #</td>
   </tr>
  </table>
</body>
</html>

 templates.js will look like this:

nesterovskyBros.templates=
{
  ...
  "footer-view":function(data)
  {
    var o,e=kendo.htmlEncode;

    with(data)
    {
      ...
    }

    return o;
  },
  ...
};

document.write('<script id="footer-view-template" type="text/x-kendo-template">#=nesterovskyBros.templates["footer-view"](data)#</script>');

To get the template function at runtime you simply refer to nesterovskyBros.templates["footer-view"].

templates.tt now allows you to specify:

  • scope - a javascript scope for template functions, e.g. "nesterovskyBros.templates";
  • data-script attribute over each template (default is true) to prevent generation of <script> tag;
  • data-with-block attribute (default is true) to prevent with(data) {...} statement in javascript.

See a sample application that shows how nicely KendoUI UserControls work with those compiled templates.

Sunday, 07 July 2013 18:54:57 UTC  #    Comments [2] -
javascript | kendoui | Thinking aloud
# Wednesday, 03 July 2013

A while ago we created a set of xml schemas and xslt to represent different languages as xml, and to generate source from those xmls. This way we know how to represent and generate java, c#, cobol, and several sql dialects (read about languages xom on this site).

Here we'd like to expose a nuisance we had with the sql dialects schema.

Our goal was to define a basic sql schema, and dialect extensions. This way we assumed we could express both general and dialect-specific constructs. So, let's consider an example.

General:

-- Select one row
select * from A

DB2:

select * from A fetch first row only

T-SQL:

select top 1 * from A

Oracle:

select * from A where rownum = 1

All these queries have a common core syntax, while at the same time they have dialect-specific means to express the intention to return the first row only.

Down at the xml schema level the basic select statement looks like this:

<xs:complexType name="select-statement">
  <xs:complexContent>
    <xs:extension base="full-select-statement">
      <xs:sequence>
        <xs:element name="columns" type="columns-clause">
        <xs:element name="from" type="from-clause" minOccurs="0">
        <xs:element name="where" type="unary-expression" minOccurs="0"/>
        <xs:element name="group-by" type="expression-list" minOccurs="0"/>
        <xs:element name="having" type="unary-expression" minOccurs="0"/>
        <xs:element name="order-by" type="order-by-clause" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="specification" type="query-specification"
        use="optional" default="all"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

Here all is relatively clear. The generic select looks like:

<sql:select>
  <sql:columns>
    <sql:column wildcard="true"/>
  </sql:columns>
  <sql:from>
    <sql:table name="A"/>
  </sql:from>
</sql:select>

But how would you define dialect specifics?

E.g. for T-SQL we would like to see a markup:

<sql:select>
  <tsql:top>
    <sql:number value="1"/>
  </tsql:top>
  <sql:columns>
    <sql:column wildcard="true"/>
  </sql:columns>
  <sql:from>
    <sql:table name="A"/>
  </sql:from>
</sql:select>

While for DB2 there should be:

<sql:select>
  <sql:columns>
    <sql:column wildcard="true"/>
  </sql:columns>
  <sql:from>
    <sql:table name="A"/>
  </sql:from>
  <db2:fetch-first rows="1"/>
</sql:select>

So, again the questions are:

  • how to define the basic sql schema with the goal of extending it in the direction of DB2 or T-SQL?
  • how to define an xslt sql serializer that will also be extendable?

Though we have tried several solutions to this problem, none is satisfactory enough. To allow extensions we have defined that all elements in the sql schema are based on sql-element, which permits extensions:

<xs:complexType name="sql-element" abstract="true">
  <xs:sequence>
    <xs:element ref="extension" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

<xs:element name="extension" type="extension"/>

<xs:complexType name="extension" abstract="true">
  <xs:complexContent>
    <xs:extension base="sql-element"/>
  </xs:complexContent>
</xs:complexType>

...

<xs:element name="top" type="top-extension" substitutionGroup="sql:extension"/>

<xs:complexType name="top-extension">
  <xs:complexContent>
    <xs:extension base="sql:extension">
      <xs:sequence>
        <xs:element ref="sql:expression"/>
      </xs:sequence>
      <xs:attribute name="percent" type="xs:boolean" use="optional" default="false"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

Unfortunately, this creates too weakly typed a schema for extensions, thus intellisense suggests too many options.

Wednesday, 03 July 2013 05:50:43 UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, 26 June 2013

On the way home we usually listen to audio books in our car. In most cases these are science fiction stories. One of our favorite authors is Ray Bradbury. Probably you know his short story collection "The Martian Chronicles". Also, we like to glance over technology and science news to stay tuned to the latest innovations.

Recently we read an article stating that NASA is going to send a group of scientist-colonists to Mars in the foreseeable future. It sounds like Ray Bradbury's style.

Although we're not specialists in this field, we discussed what difficulties such an expedition may face.

As far as we understand, the following issues must be solved before such an expedition may start:

  • the speed of a spaceship is still too slow;

  • a shield from radiation doesn't exist;

  • a cargo section big enough to contain a sufficient reserve of water, food and oxygen for the expedition to survive;

  • a way to produce enough energy to survive on Mars.

Actually, the second issue depends on the first one. At present there is no reliable long-term protection from cosmic radiation that could be installed on a spacecraft. At least we haven't heard of such a thing. Thus, a long stay in open space will seriously harm the colonists and may bring the whole mission to naught.

Considering these facts, we concluded that space travel farther than the moon is not possible at present.

What could be done to solve these issues?

A solution for space colonization could be the following. The earthlings won't send people but robots with the corresponding equipment and containers for human, animal and plant DNA or embryos. At the destination the robots will build a colony and will begin to grow people, animals and plants; to train them and to serve them later (at least the humans ;-) ).

How may all this solve the mentioned issues?

  • It's much easier to create a reliable radiation shield for small DNA or embryo containers.

  • The time and speed of the spacecraft, in such a case, will have less impact on the mission.

  • The weight of the spacecraft could be less, or it may carry more payload.

Thus, to bring about the era of cosmic expansion, humankind must invest in robotics, artificial intelligence, genetic engineering and rapid learning, among other scientific fields relevant to space exploration.

Right now you may see the beginning of this trend in science here: DFKI's robot ape to colonize the Moon.

In the far future, after the beginning is forgotten, all this may lead to the question: who came first, a man or a robot? ;-)

Wednesday, 26 June 2013 20:23:35 UTC  #    Comments [0] -
Thinking aloud
# Tuesday, 14 May 2013

We heavily use kendo.ui.Splitter widget. Unfortunately it has several drawbacks:

  • you cannot easily configure panes declaratively;
  • you cannot define a pane that takes space according to its content.

Although we don't like to patch widgets, in this case we found no better way than to patch two functions: kendo.ui.Splitter.fn._initPanes and kendo.ui.Splitter.fn._resize.

After the fix, splitter markup may look like the following:

<div style="height: 100%" data-role="splitter" data-orientation="vertical">
  <div data-pane='{ size: "auto", resizable: false, scrollable: false }'>
    Header with size depending on content.
  </div>
  <div data-pane='{ resizable: false, scrollable: true }'>
    Body with size equal to a remaining area.
  </div>
  <div data-pane='{ size: "auto", resizable: false, scrollable: false }'>
    Footer with size depending on content.
  </div>
</div>

Each pane may define a data-pane attribute with pane parameters. A pane may specify size = "auto" to take space according to its content.

The code can be found at splitter.js. A test can be seen at splitter.html.

Tuesday, 14 May 2013 07:34:59 UTC  #    Comments [0] -
javascript | kendoui | Thinking aloud | Tips and tricks
# Wednesday, 03 April 2013

To simplify KendoUI development we have defined nesterovskyBros.data.Model, which extends the kendo.data.Model class.

Extensions in nesterovskyBros.data.Model

  1. As with kendo.data.Model there is a fields Object - a set of key/value pairs to configure the model fields, but the fields have some more options:
    • fields.fieldName.serializable Boolean - indicates whether the field appears in an object returned by model.toJSON(). Default is true.
    • fields.fieldName.updateDirty Boolean - indicates whether a change of the property should trigger a dirty field change. Default is true.
  2. When the model defines a field and there is a prototype function with the same name, then this function is used to get and set the field value.
  3. When a property is changed through the model.set() method, a dirty change event is triggered (provided that fields.fieldName.updateDirty !== false). This helps to build a dependency graph on that property.
  4. When a model instance is constructed, the data passed in is validated, and nullable and default values are set.

Model example

Here is an example of a model:

nesterovskyBros.data.ProductModel = nesterovskyBros.data.Model.define(
{

fields:
{
  name: { type: "string", defaultValue: "Product Name" },
  price: { type: "number", defaultValue: 10 },
  unitsInStockValue: { type: "number", defaultValue: 10, serializable: false },
  unitsInStock: { type: "string" }
},

unitsInStock: function(value)
{
  if (value === undefined)
  {
    var count = this.get("unitsInStockValue");

    return ["one", "two", "three", "four"][count] || (count + "");
  }
  else
  {
    this.set("unitsInStockValue", ({one: 1, two: 2, three: 3, four: 4 })[value] || value);
  }
}

});

Notice that:

  • unitsInStock property is implemented as a function - this helps to map model values to presentation values.
  • when you call model.toJSON() or JSON.stringify() you will see in the result only the name, price and unitsInStock values - this helps to get the model's state and to store it somewhere (e.g. in sessionStorage).
  • in a code:
      var model = new nesterovskyBros.data.ProductModel({ price: "7", unitsInStock: "one" });
    the following is true:
      (typeof(model.price) == "number") && (model.price == 7) && (model.name == "Product Name") && (model.unitsInStockValue == 1)

As with UserControl, the implementation is defined in controls.js. The sample page is the same index.html.

Wednesday, 03 April 2013 20:37:49 UTC  #    Comments [0] -
javascript | Thinking aloud | Tips and tricks
# Thursday, 28 March 2013

Two weeks ago we've gotten new Lenovo 13" laptops (Yoga-13 with touch screens and Windows 8 Pro on board).

The first impression was WOW! Touch screens! Windows 8! Now we'll try our hand at that new (for us) API. So new, so cool...

A day later: what a mess this new UI is. Where are my desktop, "Start" button, all the programs... After googling we've understood - we're not alone.

A few more days later. We've realized that our SSD hard disk won't live a long life with our projects: we generate several GB of output a day. Thus we've decided to buy external SD cards - an additional 64GB, class 10. That's enough for us. No sooner said than done. After several attempts to copy our projects from the hard drive to the SD card (~9GB of sources) we strongly believe that such a vigorous mix (Lenovo + Win 8 + external SD card) won't survive: Windows 8 hangs up when the display turns off (in the middle of a data copy, after an hour of work). What a .... of .... this Windows 8, Lenovo and SD cards all together.

Thursday, 28 March 2013 22:39:55 UTC  #    Comments [2] -
Thinking aloud
# Tuesday, 26 March 2013

Developing with KendoUI we try to formalize tasks. With this in mind we would like to have user controls.

We define a user control as follows:

It is a javascript class that extends Widget.
It offers a way to reuse UI.
It allows one to define a model and a template with UI and data binding.

Unfortunately, KendoUI does not have such an API, though one can easily define it; so we have defined our own version.

Here we review our solution. We have taken a grid KendoUI example and converted it into a user control.

User control on the page

See index.html

<!DOCTYPE html>
<html>
<head>
  <title>Test</title>

  <!-- (1) Include templates for controls. -->
  <script src="scripts/templates.js"></script>

  <script src="scripts/jquery/jquery.js"></script>
  <script src="scripts/kendo/kendo.web.min.js"></script>

  <!-- (2) UserControl definition. -->
  <script src="scripts/controls.js"></script>

  <!-- (3) Confirm dialog user control. -->
  <script src="scripts/controls/confirm.js"></script>

  <!-- (4) Products user control. -->
  <script src="scripts/controls/products.js"></script>

  <link href="styles/kendo.common.min.css" rel="stylesheet" />
  <link href="styles/kendo.default.min.css" rel="stylesheet" />
  <script>
$(function ()
{
  // (5) Bind the page.
  kendo.bind(
    document.body,
    // (6) Model as a datasource.
    { source: [new nesterovskyBros.data.ProductsModel] });
});
  </script>
</head>
<body>
  <!-- (7) User control and its binding. -->
  <div data-role="products" data-bind="source: source"></div>
</body>
</html>

That's what we see here:

  1. Templates that define layouts. See "How To: Load KendoUI Templates from External Files", and templates.tt.
  2. Definition of the UserControl widget.
  3. Confirm dialog user control (we shall mention it later).
  4. Products user control.
  5. Data binding that instantiates page controls.
  6. Model is passed to a user control through the dataSource.
  7. Use of Products user control. Notice that "data-role" defines control type, "source" refers to the model.

User Control declaration

Declaration consists of a view and a model.

View is html with data binding. See products.tmpl.html

We build our project using Visual Studio, so templates packaging is done with templates.tt. This transformation converts products template into a tag:

<script id="products-template" type="text/x-kendo-template">

thus the template can be referred to by a utility function: nesterovskyBros.template("products-template").

Model inherits kendo.data.Model. Here is how it looks:

// (1) Define a ProductsModel class.
nesterovskyBros.data.ProductsModel = kendo.data.Model.define(
{

// (2) Model properties.
fields:
{
  productName: { type: "string", defaultValue: "Product Name" },
  productPrice: { type: "number", defaultValue: 10 },
  productUnitsInStock: { type: "number", defaultValue: 10 },
  products: { type: "default", defaultValue: [] }
},

// (3) Model methods.
addProduct: function () { ... },
deleteProduct: function (e) { ... },
...

});

// (4) Register user control.
nesterovskyBros.ui.Products = nesterovskyBros.defineControl(
{
  name: "Products",
  model: nesterovskyBros.data.ProductsModel
});

That's what we have here:

  1. We define a model that inherits KendoUI Model.
  2. We define model fields.
  3. We define model methods.
  4. Register user control with  nesterovskyBros.defineControl(proto) call, where:
    • proto.name - defines user control name;
    • proto.model - defines model type;
    • proto.template - defines optional template. If not specified, a template is retrieved from $("#" + proto.name.toLowerCase() + "-template").html().

UserControl API

Now, what's remained is API for the UserControl. See controls.js.

  1. UserControl defines the following events:
    • change - triggered when data source is changed;
    • dataBound - triggered when widget is data bound;
    • dataBinding - triggered before widget data binding;
    • save - used to notify user to save model state.
  2. UserControl defines the following options:
    • autoBind (default false) - autoBind data source;
    • template (default $.noop) - user control template.
  3. UserControl defines dataSource field and setDataSource() method.
  4. UserControl defines rebind() method to manually rebuild widget's view from the template and model.
  5. UserControl sets/deletes model.owner, which is a function returning a user control widget when model is bound/unbound to the widget.
  6. When UserControl binds/unbinds model a model.refresh method is called, if any.
  7. You usually define your control with a nesterovskyBros.defineControl(proto) call. See above.
  8. There is also a convenience method to build a dialog based on a user control: nesterovskyBros.defineDialog(options), where
    • options.name - a user control name (used in the data-role);
    • options.model - a model type;
    • options.windowOptions - window options.
    This method returns a function that receives a user control model, and returns a dialog (kendo.ui.Window) based on the user control.
    Dialog has model() function that returns an instance of model.
    Model has dialog() function that returns an instance of the dialog.
    Dialog and model have result() function that returns an instance of deferred object used to track dialog completion.
    The example of user control dialog is confirm.js and confirm.tmpl.html. The use is in the products.js deleteProduct():

    deleteProduct: function(e)
    {
      var that = this;

      return nesterovskyBros.dialog.confirm(
      {
        title: "Please confirm",
        message: "Do you want to delete the record?",
        confirm: "Yes",
        cancel: "No"
      }).
      open().
      center().
      result().
      then(
        function(confirmed)
        {
          if (!confirmed)
          {
            return;
          }
          ...
       });
    }

Last

User controls, along with a technique to manage and cache templates, allow us to build robust web applications. As an added value, it has become a trivial task to build an SPA.

See also: Compile KendoUI templates.

Tuesday, 26 March 2013 21:40:05 UTC  #    Comments [0] -
javascript | Thinking aloud | Tips and tricks
# Tuesday, 20 November 2012

Our goal is to generate reports in streaming mode.

At some point we need to deal with data streams (e.g. xml streams for xslt transformations). Often the nature of a report demands several passes through the data. To increase performance we have defined a class named StreamResource. This class encapsulates input data, reads it once and caches it into a temp file; thus the data can be traversed many times. StreamResource can read data lazily or in an eager way, thus releasing resources early. This class can be used as a variation of PipeStream which never blocks, as if the size of its buffer were unlimited, and which can be read many times.

The API looks like this:

public class StreamResource: IDisposable
{
  /// <summary>
  /// Creates a StreamSource instance.
  /// </summary>
  /// <param name="source">
  /// A function that returns source as an input stream.
  /// </param>
  /// <param name="settings">Optional settings.</param>
  public StreamResource(Func<Stream> source, Settings settings = null);

  /// <summary>
  /// Creates a StreamSource instance.
  /// </summary>
  /// <param name="source">
  /// A function that writes source data into an output stream.
  /// </param>
  /// <param name="settings">Optional settings.</param>
  public StreamResource(Action<Stream> source, Settings settings = null);

  /// <summary>
  /// Gets an input stream.
  /// </summary>
  /// <param name="shared">
  /// Indicates that this StreamResource should be disposed when the returned
  /// stream is closed and there are no more currently opened cache streams.
  /// </param>
  /// <returns>An input stream.</returns>
  public Stream GetStream(bool shared = false);
}

The use pattern is as follows:

// Acquire resource.
using(var resource = new StreamResource(() => CallService(params...)))
{
  // Read stream.
  using(var stream = resource.GetStream())
  {
    ...
  }

  ...

  // Read stream again.
  using(var stream = resource.GetStream())
  {
    ...
  }
}

StreamResource is efficient even if you need to process the content only once, as it monitors the timing of reading the source data and compares it with the timing of data consumption. If the difference exceeds some threshold then StreamResource caches the source greedily, otherwise the source is pulled lazily. Thus, input resources can be released promptly. This is important, for example, when the source depends on a database connection.

The use pattern is as follows:

// Acquire resource and get shared stream.
using(var stream = new StreamResource(() => CallService(params...)).GetStream(true))
{
  ...
}

Finally, StreamResource allows one to process data in pipe stream mode. This is when you have a generator function Action&lt;Stream&gt; that can write to a stream, and you want to read that data. The advantage of StreamResource over a real pipe stream is that it can work without blocking the generator, thus releasing resources early.

The use pattern is similar to the previous one:

using(var stream = new StreamResource(output => Generate(output, params...)).GetStream(true))
{
  ...
}

The source of the class can be found at Streaming.zip.

Tuesday, 20 November 2012 07:01:57 UTC  #    Comments [0] -
.NET | Thinking aloud
# Monday, 29 October 2012

If you deal with web applications you have probably already dealt with exporting data to Excel. There are several options to prepare data for Excel:

  • generate CSV;
  • generate HTML that Excel understands;
  • generate XML in Spreadsheet 2003 format;
  • generate data using Open XML SDK or some other 3rd party libraries;
  • generate data in XLSX format, according to Open XML specification.

You may find a good article with the pros and cons of each solution here. We, in turn, would like to share our experience in this field. Let's start with the requirements:

  • Often we have to export huge data-sets.
  • We should be able to format, parameterize and apply different styles to the exported data.
  • There are cases when exported data may contain more than one table per sheet or even more than one sheet.
  • Some exported data have to be illustrated with charts.

All these requirements led us to a solution based on XSLT processing of streamed data. The advantage of this solution is that the result is forwarded to the client as soon as XSLT starts to generate output. Such an approach is much more productive than generating XLSX using the Open XML SDK or any other third-party library, since it avoids keeping huge data-sets in memory on the server side.

Another advantage is simple maintenance, as we achieve a clear separation of the data and presentation layers. For each request to change formatting or apply another style to a cell, you just have to modify the xslt file(s) that generate the variable parts of the XLSX.
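
To make the idea concrete, here is a minimal sketch (the handler class, file paths and the tiny data source are hypothetical placeholders, not our production code) of the core of the approach: the xslt output is written directly into the response stream, shown here for a single generated part; packaging the parts into the XLSX container is left out.

using System.IO;
using System.Text;
using System.Web;
using System.Xml;
using System.Xml.Xsl;

public class ExportPartHandler: IHttpHandler
{
  public bool IsReusable { get { return true; } }

  public void ProcessRequest(HttpContext context)
  {
    context.Response.ContentType = "text/xml";

    var transform = new XslCompiledTransform();

    // An xslt that generates one variable part of the workbook.
    transform.Load(context.Server.MapPath("~/xslt/sheet.xslt"));

    // The data is read forward-only, so no full data-set is kept in memory;
    // the client starts receiving bytes as soon as the transformation emits them.
    using(var reader = XmlReader.Create(CreateDataStream()))
    {
      transform.Transform(
        reader, new XsltArgumentList(), context.Response.OutputStream);
    }
  }

  // A stand-in for the real streamed data source.
  private static Stream CreateDataStream()
  {
    return new MemoryStream(
      Encoding.UTF8.GetBytes("<rows><row id='1' value='test'/></rows>"));
  }
}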

As a result, our clients get XLSX files conforming to the Open XML specification. The details of the implementation of our solution will appear in our next posts.

Monday, 29 October 2012 15:34:38 UTC  #    Comments [0] -
.NET | ASP.NET | Thinking aloud | xslt
# Friday, 07 September 2012

We're implementing UDT changes in a big database. Earlier, that User Defined Type was based on smallint, and now we have to use int as the base.

The impact here is manifold:

  1. Clients of the database should be prepared to use wider types.
  2. All stored procedures, functions, triggers, and views should be updated accordingly.
  3. Impact on the database size should be analyzed.
  4. Types of columns in tables should be changed.
  5. Performance impact should be minimal.

Now, we're trying to address (3), (5) and to implement (4), while trying to keep interface with clients using old types.

As for the database size impact, we have found that index fragmentation is the primary disk space waster (see Reorganize index in SQL Server). We have performed some partial index reorganization and can see now that we can gain back hundreds of GB of disk space. On the other hand we use page compression, so we expect that the change of types will not increase the sizes of tables considerably. Indeed, our measurements show that tables will only be ~1-3% bigger.

The change of column types is a nontrivial task. The problem is that if you try to change a column's type (which is part of the clustered index) directly, then you have to temporarily remove foreign keys and rebuild all indices. This won't work, neither due to the disk space required for the operation (a huge transaction log is required), nor due to the availability of the tables (we're talking about days or even weeks to rebuild the indices).

To work around the problem we have chosen another way. For each target table T we performed the following:

  • Renamed table T to T_old;
  • Created a table T_new with the required type changes;
  • Created a view named T, which is a union of T_old for the dates before a split date and T_new for the dates after it;
  • Created instead of insert/update/delete triggers for the view T;
  • Created procedures that move data in bulk from T_old to T_new, update the split date in the view definition, and delete the moved data from T_old.

Note that:

  • the new view uses wider column types, so we had to change stored procedures that clients use to cast those columns back to shorter types to prevent side effects (fortunately all access to this database is through stored procedures and functions);
  • the procedures that transfer data between new and old tables may work online;
  • the quality of execution plans did not degrade due to switch from table to a view;
  • all data related to the date after the split date are inserted into T_new table.

After the transfer is complete we shall drop the T_old tables and the T views, and rename the T_new tables to T.

This will complete part 4 of the whole task. Our estimate is that it will take a month or even more to complete the transfer. Though the solution is rather slow, the database will stay online for the whole period, which is a required condition.

The next task is to deal with type changes in parameters of stored procedures and column types of output result sets. We're not sure yet what the best way to deal with it is, and shall probably complain about it in the next posts.

Friday, 07 September 2012 20:57:36 UTC  #    Comments [2] -
SQL Server puzzle | Thinking aloud | Tips and tricks
# Sunday, 19 August 2012

We have a large table in the form:

create table dbo.Data
(
  Date date not null,
  Type int not null,
  Value nvarchar(50) null,
  primary key clustered(Date, Type)
);

create unique nonclustered index IX_Data on dbo.Data(Type, Date);

Among other queries we often need a snapshot of the data per Type for the latest Date available:

select
  max(Date) Date,
  Type
from
  dbo.Data
group by
  Type

We have found that the above select does not run well on our data set. In fact dbo.Data grows with time, while the snapshot we need stays more or less the same size. The best solution for such a query is to precalculate it. One way would be to create an indexed view, but SQL Server does not support the max() aggregate in indexed views.

So, we have decided to add an additional bit field dbo.Data.Last indicating that a row belongs to the last date snapshot, and to create a filtered index to access that snapshot:

create table dbo.Data
(
  Date date not null,
  Type int not null,
  Value nvarchar(50) null,
  Last bit not null default 0,
  primary key clustered(Date, Type)
);

create unique nonclustered index IX_Data on dbo.Data(Type, Date);

create unique nonclustered index IX_Data_Last on dbo.Data(Type)
include(Date)
where Last = 1;

One way to support Last indicator is to create a trigger that will adjust Last value:

create trigger dbo.Data_Update on dbo.Data
after insert,delete,update
as
begin
  if (trigger_nestlevel(@@procid) < 2)
  begin
    set nocount on;

    with D as
    (
      select Date, Type from deleted
      union
      select Date, Type from inserted
    ),
    U as
    (
      select
        V.Date, V.Type
      from
        D
        inner join
        dbo.Data V
        on
          (V.Last = 1) and
          (V.Type = D.Type)
      union
      select
        max(V.Date) Date,
        V.Type
      from
        D
        inner join
        dbo.Data V
        on
          V.Type = D.Type
      group by
        V.Type
    ),
    V as
    (
      select
        rank() over(partition by V.Type order by V.Date desc) Row,
        V.*
      from
        dbo.Data V
        inner join
        U
        on
          (V.Date = U.Date) and
          (V.Type = U.Type)
    )
    update V set Last = 1 - cast(Row - 1 as bit);
  end;
end;

With Last indicator in action, our original query has been transformed to:

select Date, Type from dbo.Data where Last = 1

The execution plan shows that the new filtered index IX_Data_Last is used. Execution speed has increased considerably. As our actual table contains other bit fields, the Last indicator did not increase the table size, since SQL Server packs every 8 bit fields into one byte.

Sunday, 19 August 2012 05:57:55 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks
# Friday, 03 August 2012

Earlier we have shown how to build a streaming xml reader from business data and have reminded about ForwardXPathNavigator, which helps to create a streaming xslt transformation. Now we want to show how to stream content produced with xslt out of a WCF service.

To achieve streaming in WCF one needs:

1. Configure the service to use streaming. A description of how to do this can be found on the internet. See web.config of the sample Streaming.zip for the details.

2. Create a service with a method returning a Stream:

[ServiceContract(Namespace = "http://www.nesterovsky-bros.com")]
[AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)]
public class Service
{
  [OperationContract]
  [WebGet(RequestFormat = WebMessageFormat.Json)]
  public Stream GetPeopleHtml(int count, int seed)
  {
    ...
  }
}

3. Return a Stream from the xsl transformation.

Unfortunately (we mentioned it already), XslCompiledTransform generates its output into an XmlWriter (or into an output Stream) rather than exposing the result as an XmlReader, while WCF gets an input stream and passes it to the client.

We could generate the xslt output into a file or a memory Stream and then return that content as an input Stream, but this would defeat the goal of streaming, as the client would start to get data no earlier than the xslt completed its work. What we need instead is a pipe from the xslt output Stream to an input Stream returned from WCF.

.NET implements pipe streams, so our task is trivial. We have defined a utility method that creates an input Stream from a generator populating an output Stream:

public static Stream GetPipedStream(Action<Stream> generator)
{
  var output = new AnonymousPipeServerStream();
  var input = new AnonymousPipeClientStream(
    output.GetClientHandleAsString());

  Task.Factory.StartNew(
    () =>
    {
      using(output)
      {
        generator(output);
        output.WaitForPipeDrain();
      }
    },
    TaskCreationOptions.LongRunning);

  return input;
}

We wrapped xsl transformation as such a generator:

[OperationContract]
[WebGet(RequestFormat = WebMessageFormat.Json)]
public Stream GetPeopleHtml(int count, int seed)
{
  var context = WebOperationContext.Current;

  context.OutgoingResponse.ContentType = "text/html";
  context.OutgoingResponse.Headers["Content-Disposition"] =
    "attachment;filename=reports.html";

  var cache = HttpRuntime.Cache;
  var path = HttpContext.Current.Server.MapPath("~/People.xslt");
  var transform = cache[path] as XslCompiledTransform;

  if (transform == null)
  {
    transform = new XslCompiledTransform();
    transform.Load(path);
    cache.Insert(path, transform, new CacheDependency(path));
  }

  return Extensions.GetPipedStream(
    output =>
    {
      // We have a streamed business data.
      var people = Data.CreateRandomData(count, seed, 0, count);

      // We want to see it as streamed xml data.
      using(var stream =
        people.ToXmlStream("people", "http://www.nesterovsky-bros.com"))
      using(var reader = XmlReader.Create(stream))
      {
        // XPath forward navigator is used as an input source.
        transform.Transform(
          new ForwardXPathNavigator(reader),
          new XsltArgumentList(),
          output);
      }
    });
}

This way we have built code that streams data directly from business data to a client in the form of a report. A set of utility functions and classes helped us to overcome .NET's limitations and to build simple code that one can easily support.

The sources can be found at Streaming.zip.

Friday, 03 August 2012 22:32:49 UTC  #    Comments [0] -
.NET | ASP.NET | Thinking aloud | Tips and tricks | xslt
# Thursday, 26 July 2012

In the previous post about streaming we stopped at the point where we had an XmlReader in hand, which continuously gets data from an IEnumerable&lt;Person&gt; source. Now we shall remind about ForwardXPathNavigator - a class we built back in 2002, which adds streaming transformations to .NET's xslt processor.

While XslCompiledTransform is desperately obsolete, and no upgrade will likely follow, it's still among the fastest xslt 1.0 processors. With ForwardXPathNavigator we add to this processor the ability to transform input data of arbitrary size.

We find it interesting that xslt 3.0 Working Draft defines streaming processing in a way that closely matches rules for ForwardXPathNavigator:

Streaming achieves two important objectives: it allows large documents to be transformed without requiring correspondingly large amounts of memory; and it allows the processor to start producing output before it has finished receiving its input, thus reducing latency.

The rules for streamability, which are defined in detail in 19.3 Streamability Analysis, impose two main constraints:

  • The only nodes reachable from the node that is currently being processed are its attributes and namespaces, its ancestors and their attributes and namespaces, and its descendants and their attributes and namespaces. The siblings of the node, and the siblings of its ancestors, are not reachable in the tree, and any attempt to use their values is a static error. However, constructs (for example, simple forms of xsl:number, and simple positional patterns) that require knowledge of the number of preceding elements by name are permitted.

  • When processing a given node in the tree, each descendant node can only be visited once. Essentially this allows two styles of processing: either visit each of the children once, and then process that child with the same restrictions applied; or process all the descendants in a single pass, in which case it is not possible while processing a descendant to make any further downward selection.

The only significant difference between ForwardXPathNavigator and xslt 3.0 streaming is that we reported violations of the rules for streamability at runtime, while xslt 3.0 attempts to perform this analysis at compile time.

Here is the C# code for the streamed xslt transformation:

var transform = new XslCompiledTransform();

transform.Load("People.xslt");

// We have a streamed business data.
var people = Data.CreateRandomData(10000, 0, 0, 10000);

// We want to see it as streamed xml data.
using(var stream =
  people.ToXmlStream("people", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
using(var output = File.Create("people.html"))
{
  // XPath forward navigator is used as an input source.
  transform.Transform(
    new ForwardXPathNavigator(reader),
    new XsltArgumentList(),
    output);
}

Notice how XmlReader is wrapped into ForwardXPathNavigator.

To complete the picture we need xslt that follows the streaming rules:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:msxsl="urn:schemas-microsoft-com:xslt"
  xmlns:d="http://www.nesterovsky-bros.com"
  exclude-result-prefixes="msxsl d">

  <xsl:output method="html" indent="yes"/>

  <!-- Root template processed in the streaming mode. -->
  <xsl:template match="/d:people">
    <html>
      <head>
        <title>List of persons</title>
        <style type="text/css">
          .even
          {
          }

          .odd
          {
            background: #d0d0d0;
          }
        </style>
      </head>
      <body>
        <table border="1">
          <tr>
            <th>ID</th>
            <th>First name</th>
            <th>Last name</th>
            <th>City</th>
            <th>Title</th>
            <th>Age</th>
          </tr>

          <xsl:for-each select="d:person">
            <!--
              Get element snapshot.
              A snapshot allows arbitrary access to the element's content.
            -->
            <xsl:variable name="person">
              <xsl:copy-of select="."/>
            </xsl:variable>

            <xsl:variable name="position" select="position()"/>

            <xsl:apply-templates mode="snapshot" select="msxsl:node-set($person)/d:person">
              <xsl:with-param name="position" select="$position"/>
            </xsl:apply-templates>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>

  <xsl:template mode="snapshot" match="d:person">
    <xsl:param name="position"/>

    <tr>
      <xsl:attribute name="class">
        <xsl:choose>
          <xsl:when test="$position mod 2 = 1">
            <xsl:text>odd</xsl:text>
          </xsl:when>
          <xsl:otherwise>
            <xsl:text>even</xsl:text>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>

      <td>
        <xsl:value-of select="d:Id"/>
      </td>
      <td>
        <xsl:value-of select="d:FirstName"/>
      </td>
      <td>
        <xsl:value-of select="d:LastName"/>
      </td>
      <td>
        <xsl:value-of select="d:City"/>
      </td>
      <td>
        <xsl:value-of select="d:Title"/>
      </td>
      <td>
        <xsl:value-of select="d:Age"/>
      </td>
    </tr>
  </xsl:template>

</xsl:stylesheet>

So, we have started with streamed entity data, proceeded to a streamed XmlReader and arrived at a streamed xslt transformation.

In the final post about streaming we shall remind about a simple way of building a WCF service returning an html stream from our xslt transformation.

The sources can be found at Streaming.zip.

Thursday, 26 July 2012 18:49:51 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks | xslt
# Sunday, 22 July 2012

For some reason neither .NET's XmlSerializer nor DataContractSerializer allows reading data through an XmlReader. These APIs work the other way round, writing data into an XmlWriter. To get data through an XmlReader one needs to write it to some destination like a file or a memory stream, and then read it back using an XmlReader. This complicates streaming design considerably.

In fact the very same happens with other .NET APIs.

We think the reason why .NET designers preferred XmlWriter to XmlReader in those APIs is that an XmlReader implementation is state-machine-like, while an XmlWriter implementation looks like a regular procedure. It's much harder to write and maintain correct state machine logic manually than a procedure.

If history had gone a slightly different way, and if yield return, lambdas, and the Enumerator API had appeared before XmlReader and XmlWriter, then, we think, both of these classes would have looked different. An xml source would have been described as an IEnumerable&lt;XmlEvent&gt; instead of an XmlReader, and XmlWriter would have looked like a function receiving an IEnumerable&lt;XmlEvent&gt;. Implementing an XmlReader would have meant creating an enumerator, and yield return and the Enumerable API would have helped to implement it in a procedural way.
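
To make the thought experiment concrete, here is a small sketch of such a hypothetical API (XmlEvent, its shape, and the helper names are our invention for illustration; nothing like this exists in .NET):

using System.Collections.Generic;
using System.Xml;

public enum XmlEventType { ElementStart, Text, ElementEnd }

public sealed class XmlEvent
{
  public XmlEventType Type;
  public string Name;
  public string Value;
}

public static class XmlEvents
{
  // A "reader" is just a lazy enumeration of events, trivially
  // implemented with yield return.
  public static IEnumerable<XmlEvent> Element(string name, string text)
  {
    yield return new XmlEvent { Type = XmlEventType.ElementStart, Name = name };
    yield return new XmlEvent { Type = XmlEventType.Text, Value = text };
    yield return new XmlEvent { Type = XmlEventType.ElementEnd };
  }

  // A "writer" is a plain procedure consuming the same enumeration.
  public static void Write(XmlWriter writer, IEnumerable<XmlEvent> events)
  {
    foreach(var e in events)
    {
      switch(e.Type)
      {
        case XmlEventType.ElementStart: writer.WriteStartElement(e.Name); break;
        case XmlEventType.Text: writer.WriteString(e.Value); break;
        case XmlEventType.ElementEnd: writer.WriteEndElement(); break;
      }
    }
  }
}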

But in our present we have to deal with the fact that DataContractSerializer writes the data into an XmlWriter, so let's assume we have a project that uses Entity Framework to access the database, and that you have a data class Person and a data access method GetPeople():

[DataContract(Name = "person", Namespace = "http://www.nesterovsky-bros.com")]
public class Person
{
  [DataMember] public int Id { get; set; }
  [DataMember] public string FirstName { get; set; }
  [DataMember] public string LastName { get; set; }
  [DataMember] public string City { get; set; }
  [DataMember] public string Title { get; set; }
  [DataMember] public DateTime BirthDate { get; set; }
  [DataMember] public int Age { get; set; }
}

public static IEnumerable<Person> GetPeople() { ... }

And your goal is to expose the result of GetPeople() as an XmlReader. We achieve this with three simple steps:

  1. Define JoinedStream - an input Stream implementation that reads data from an enumeration of streams (IEnumerable&lt;Stream&gt;).
  2. Build xml parts in the form of IEnumerable<Stream>.
  3. Combine parts into final xml stream.

The code is rather simple, so here we quote its essential part:

public static class Extensions
{
  public static Stream JoinStreams(this IEnumerable<Stream> streams, bool closeStreams = true)
  {
    return new JoinedStream(streams, closeStreams);
  }

  public static Stream ToXmlStream<T>(
    this IEnumerable<T> items,
    string rootName = null,
    string rootNamespace = null)
  {
    return items.ToXmlStreamParts<T>(rootName, rootNamespace).
      JoinStreams(false);
  }

  private static IEnumerable<Stream> ToXmlStreamParts<T>(
    this IEnumerable<T> items,
    string rootName = null,
    string rootNamespace = null)
  {
    if (rootName == null)
    {
      rootName = "ArrayOfItems";
    }

    if (rootNamespace == null)
    {
      rootNamespace = "";
    }

    var serializer = new DataContractSerializer(typeof(T));
    var stream = new MemoryStream();
    var writer = XmlDictionaryWriter.CreateTextWriter(stream);

    writer.WriteStartDocument();
    writer.WriteStartElement(rootName, rootNamespace);
    writer.WriteXmlnsAttribute("s", XmlSchema.Namespace);
    writer.WriteXmlnsAttribute("i", XmlSchema.InstanceNamespace);

    foreach(var item in items)
    {
      serializer.WriteObject(writer, item);
      writer.WriteString(" ");

      writer.Flush();
      stream.Position = 0;

      yield return stream;

      stream.Position = 0;
      stream.SetLength(0);
    }

    writer.WriteEndElement();
    writer.WriteEndDocument();

    writer.Flush();
    stream.Position = 0;

    yield return stream;
  }

  private class JoinedStream: Stream
  {
    public JoinedStream(IEnumerable<Stream> streams, bool closeStreams = true)
    ...
  }
}

The use is even simpler:

// We have a streamed business data.
var people = GetPeople();

// We want to see it as streamed xml data.
using(var stream = people.ToXmlStream("persons", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
{
  ...
}

We have packed the sample into the project Streaming.zip.

In the next post we're going to remind about streaming processing in xslt.

Sunday, 22 July 2012 20:38:29 UTC  #    Comments [2] -
.NET | Thinking aloud | Tips and tricks | xslt
# Sunday, 17 June 2012

We're pleased to work with Kendo UI. Its design is good, however here and there we find things we wish were done better. Here is a list of problems, in no particular order, that we would like to be addressed in the next release:

  • RTL is not supported (including correct scroll bar position see Tunning KendoUI).
  • Templates and binding should support a context information along with the data source. (Why do they use with statement?)
  • attr binding should use jquery.attr() method; there should be prop binding which is analogous to attr binding.
  • There should be custom binding that allows any json object to bind to different aspects of a widget or an element.
  • One should be able to use format/parse functions during binding. (Allow binding to express as a triple json object?)
  • parseExact(value, format, culture) method should be rewritten, as it has nothing in common with parsing data string according to exact format.
  • Type inference during binding is poor (parseOption() method). It works neither for string "1,2", nor json " { x: 0 } ", nor for date.
  • Binding is not implemented for many components: splitter, grid.
  • Splitter's pane should support size="auto".
  • Grid does not support totals in group headers, nor does it support header selection.
  • DataSource does not work after a remote error, nor does it allow cancelling a request.
  • innerHtml is used all over the code, thus one cannot rely on jquery.data().
  • Grid does not support customization (localization) of a column filter.
  • Grid should support data binding of its content.
  • One should be able to destroy any widget.
Sunday, 17 June 2012 20:03:37 UTC  #    Comments [1] -
javascript | Thinking aloud
# Tuesday, 08 May 2012

Some time ago we took part in a project where 95% of all sources were xslt 2.0. It was a great experience for us.

The interesting part is that we used xslt in areas where we would never have expected it in the early 2000s. It crunched gigabytes of data offline, while earlier we generally saw xslt applied in a browser or on a server as an engine to render data.

Web applications (both .NET and java) are in our focus today, and it has become hard to find an application for xslt or xquery.

Indeed, the client side now has very strong APIs: jquery, jqueryui, jsviews, jqgrid, kendoui, and so on. These libraries and today's browsers cover developers' needs in building manageable applications. In contrast, native support of xslt (at least v2) does not exist in browsers.

The server side at present is seen as a set of web services. These services support both xml and json formats and implement business logic only. It would be torture to try to write such a frontend in xslt/xquery. Server logic itself often deals with a diversity of data sources like databases and files (including xml files).

As for the database (we primarily work with SQL Server 2008 R2), we think that all communication should go through stored procedures, which implement all the data logic. Clearly, this is not a place for xslt. However, those who know sql beyond its basics can confirm that sql is very similar to xquery. More than that, SQL Server (and other databases) integrates xquery to work with xml data, and we do use it extensively.

Server logic itself uses APIs like LINQ to manipulate different data sources. In fact, we think that one could build a compiler from xquery 3.0 to C# with LINQ. A compiler the other way round would be a whole different story.

The net result is that we see little place for xslt and xquery. Well, after all it's only a personal perspective on the subject. A similar thing has happened to us with C++. As with xslt/xquery, we love C++ very much and are fond of C++11, but at present we have no place for C++ in our current projects. That's a pity.

P.S. Among other things that play against xslt/xquery is a shortage of people who know these languages and thus can support such projects.

Tuesday, 08 May 2012 20:28:51 UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, 24 March 2012

Let's start from a distance.

We support a busy database for a customer. The customer's requirement (in fact, the state's requirement) is that the database should have audit logs. This means that all important requests should be logged. These logs help both for offline security analysis and for database health monitoring.

Until the end of last year we used SQL Server 2005, and then the customer upgraded to SQL Server 2008 R2.

As, by design, the database is accessed through Stored Procedures only, the logging was done using a small SP that traced input parameters and execution time. The call to that SP was inserted throughout the code of the other SPs.

We expected SQL Server 2008 R2 to simplify the task, and to allow us to switch the audit on and off at a fine-grained level without the need to change an SP in production (see Understanding SQL Server Audit for details).

Unfortunately, we almost immediately found that the current audit implementation traces SP calls but does not store parameter values. This way, you can see that there was a call "execute X @param1, @param2", but you have no idea what values were passed. An internet search shows that this is a known problem (see SQL Server 2008 Database Audit on INSERT UPDATE and DELETE actual SQL and not parameter values), which renders SQL Server Audit useless.

Nevertheless, looking at how we can simplify our hand-made audit, we have found a brilliant solution: "Light weight SQL Server procedure auditing without using SQL Server auditing". It's so simple that it's a shame we did not invent it ourselves! The approach is to insert or remove the tracing code automatically. Indeed, there is nothing but data in the database; even the text of an SP is only data.

To automate it even more, we have defined a small table with names of procedures and their log levels, and have defined a procedure "Log.SetLevel @level" to configure all logging in one go. In addition we have simplified the logging procedures and tables, and started to store parameters in xml columns rather than in pipe-concatenated strings.

Now, to the negative SP execution times.

The logging code, among other things, measures current_timestamp at the beginning and at the end of the execution of an SP. This helps us (as developers) to monitor how the database performs on a day-to-day basis, and to build many useful statistics.

For example we can see that the duration of about 10% of nontrivial selects is 0ms (execution time is under 1ms). This means that SQL Server is good at data caching. But what is most interesting is that about 0.1% of requests have a negative duration!

You could speculate about parallel or out-of-order execution, but the paradox is resolved when you look closely at the value of the duration. It's always around -7,200,000ms. No one would assume that execution ended two hours before it started. So what does -2 hours mean? Well, we live in the (UTC+02:00) Jerusalem time zone. We think that the UTC offset somehow crawls into the result. To prove our hypothesis we would like to change the time zone on the sql servers, but the customer won't agree to such an experiment. :-)

This effect probably means that there is some hidden bug in SQL Server 2008 R2 that we cannot reliably reproduce, but we can see that datediff(ms, start_timestamp, end_timestamp) may return a negative value when it's known that start_timestamp was acquired before end_timestamp.

Update: What a shame. During tuning of the original logging procedures we changed the type from datetime to datetime2, and calls from GETUTCDATE() to current_timestamp, except in one place (a default value in the table definition) where GETUTCDATE() remained.

So, negative durations meant an operation timeout (in our case a duration greater than 30 secs).

Saturday, 24 March 2012 14:44:04 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud
# Friday, 16 March 2012

After the C++11 revision was approved, a new cycle of C++ design has begun:

N3370: The C++ standards committee is soliciting proposals for additional library components. Such proposals can range from small (addition of a single signature to an existing library) to large (something bigger than any current standard library component).

At this stage it's interesting to read the papers, as the authors try to express ideas rather than to formulate sentences that should go into the spec, as was lately the case.

These are just several papers that we've found interesting:

N3322 12-0012 A Preliminary Proposal for a Static if Walter E. Brown
N3329 12-0019 Proposal: static if declaration H. Sutter, W. Bright, A. Alexandrescu

These proposals argue for a compile-time "if statement". The feature can replace the #if preprocessor directive, SFINAE, or in some cases template specializations.

A static if declaration can appear wherever a declaration or a statement is legal. The authors also propose to add a static if clause to class and function declarations to conditionally exclude them from the scope.

Examples:

// Compile time factorial.
template <unsigned n>
struct factorial
{
  static if (n <= 1)
  {
    enum : unsigned { value = 1 };
  }
  else
  {
    enum : unsigned { value = factorial<n - 1>::value * n };
  }
};

// Declare class provided a condition is true.
class Internals if (sizeof(void*) == sizeof(int));

The paper presents a strong rationale for why this addition helps to build better programs; however, questions arise about the relations between static if and concepts, and between the static if clause and error diagnostics.

 

N3327 12-0017 A Standard Programmatic Interface for Asynchronous Operations N. Gustafsson, A. Laksberg
N3328 12-0018 Resumable Functions Niklas Gustafsson

That's our favorite.

The authors propose an API and language extensions to make asynchronous programs simpler.

In fact, an asynchronous function will look very much like a regular one but with small additions. It's similar to yield return in C# (a construct that has been available in C# for many years and is well vetted), and to the await expression in C# 5 (.NET 4.5). The compiler will rewrite such a function into a state machine, thus the function can suspend its execution, wait for the data and resume when the data is available.

Example:

// read data asynchronously from an input and write it into an output.
int cnt = 0;

do
{
  cnt = await streamR.read(512, buf);

  if (cnt == 0)
  {
    break;
  }

  cnt = await streamW.write(cnt, buf);
}
while(cnt > 0);

It's interesting to see how the authors will address yield return: either with an additional keyword, or in terms of resumable functions.

 

N3340 12-0030 Rich Pointers D. M. Berris, M. Austern, L. Crowl

Here the authors try to justify rich type-info but mask it under the name "rich pointers". To make things even more obscure, they argue about dynamic code generation.

If you want rich type-info then you should talk about it and not about a thousand other things.

We would rather appeal for a standard API to access the post-compile object model, which could be used to produce different type-infos or other source derivatives.

This paper is our outsider. :-)

 

N3341 12-0031 Transactional Language Constructs for C++ M. Wong, H. Boehm, J. Gottschlich, T. Shpeisman, et al.

Here people try to generalize (move you away from) locking and replace it with another word: "transaction".

It seems it's not a viable proposition. It's better to teach a functional style of programming, with its immutable objects.

 

N3347 12-0037 Modules in C++ (Revision 6) Daveed Vandevoorde

The author argues against C-style source composition with the #include directive, and proposes an alternative called "modules".

We think that many C++ developers would agree that the C pre-processor is a legacy that should better never have existed, but for the same reason (legacy and compatibility) it should stay.

In our opinion the current proposition is just immature; at least it's not intuitive. In other words, there should be something to replace the C pre-processor (and #include as its part), but we don't like this paper from an aesthetic perspective.

 

N3365 12-0055 Filesystem Library Proposal (Revision 2)

This proposal says not a word about the asynchronous nature of file access, while it should be designed around it.

Friday, 16 March 2012 19:21:58 UTC  #    Comments [0] -
C++ | Thinking aloud
# Thursday, 08 March 2012

For a long time we were developing web applications with ASP.NET and JSF. At present we prefer rich clients and a server with page templates and RESTful web services.

This transition brings technical questions. Consider this one.

Browsers allow storing session state entirely on the client, so should we maintain a session on the server?

Since the server is just a set of web services, we may supply all required arguments on each call.

At first glance we can assume that no session is required on the server. However, looking further we see that we should deal with data validation (security) on the server.

Think about a classic ASP.NET application, where a user can select a value from a dropdown. Either ASP.NET itself or your program (against a list from the session) verifies that the value received is valid for the user. That list of values, and possibly other parameters, constitutes a user profile, which we stored in the session. The user profile played an important role (often indirectly) in the validation of input data.

When the server is just a set of web services then we have to validate all parameters manually. There are two sources that we can rely to: (a) a session, (b) a user principal.

Case (a) is very similar to a classic ASP.NET application, except that with EnableEventValidation="true" the runtime did it for us most of the time.
Case (b) requires reconstruction of the user profile for a user principal, and then we proceed with validation of the parameters.

We may cache the user profile in the session, in which case we reduce (b) to (a); on the other hand we may cache the user profile in the Cache, which is also similar to (a) but might be lighter than (at least not heavier than) the solution with the session.
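
A minimal sketch of that last variant (UserProfile and LoadProfile are hypothetical placeholders): the profile is rebuilt from the user principal on demand and kept in the ASP.NET Cache, so parameter validation does not depend on a server session:

using System;
using System.Security.Principal;
using System.Web;
using System.Web.Caching;

public static class Profiles
{
  public static UserProfile GetProfile(IPrincipal principal)
  {
    var key = "profile:" + principal.Identity.Name;
    var profile = HttpRuntime.Cache[key] as UserProfile;

    if (profile == null)
    {
      // Hypothetical: rebuild the profile (allowed values, parameters)
      // from the database for this principal.
      profile = LoadProfile(principal);

      HttpRuntime.Cache.Insert(
        key,
        profile,
        null,
        DateTime.UtcNow.AddMinutes(10),
        Cache.NoSlidingExpiration);
    }

    return profile;
  }

  private static UserProfile LoadProfile(IPrincipal principal)
  {
    return new UserProfile();
  }
}

public class UserProfile
{
  // Allowed values and other per-user parameters used to validate input.
}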

What we see is that the client session does not free us from server session (or its alternative).

Thursday, 08 March 2012 21:56:19 UTC  #    Comments [0] -
.NET | ASP.NET | Java | JSF and Facelets | Thinking aloud
# Wednesday, 29 February 2012

We were dealing with a datasource of (int? id, string value) pairs in LINQ. The data originated from a database where id is a unique field. In the program this datasource had to be seen as a dictionary, so we wrote code like this:

var dictionary = CreateIDValuePairs().ToDictionary(item => item.ID, item => item.Value);

That was too simple-minded. This code compiles but crashes at runtime when there is an id == null.

Well, the help warns about this behaviour, but that does not make the pain easier.

In our opinion this restriction is not justified and just complicates the use of Dictionary.
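
One possible workaround (just a sketch; CreateIDValuePairs() is the method from the snippet above) is to split off the null key before building the dictionary; note that the key type becomes int rather than int?:

var pairs = CreateIDValuePairs().ToList();

// Dictionary<TKey, TValue> cannot hold a null key, so keep its value aside.
var nullValue = pairs.
  Where(item => item.ID == null).
  Select(item => item.Value).
  FirstOrDefault();

var dictionary = pairs.
  Where(item => item.ID != null).
  ToDictionary(item => item.ID.Value, item => item.Value);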

Wednesday, 29 February 2012 20:42:46 UTC  #    Comments [0] -
.NET | Thinking aloud
# Friday, 13 January 2012

A database we support for a client contains multi-billion row tables. Many users query the data from that database, and it's constantly populated with new data.

Every day we load several million rows of new data. Such loads can lock tables for a considerable time, so our loading procedures collect new data into intermediate tables and insert it into the final destination in chunks, usually after work hours.

SQL Server 2008 R2 supports the READ_COMMITTED_SNAPSHOT database option. This feature trades locks for an increased tempdb size (to store row versions) and possible performance degradation during a transaction.

When we switched the database to that option we did not notice any considerable performance change. Encouraged, we decided to increase the size of the chunks of data we insert at once.

Earlier we had found that when we insert no more than 1000 rows at once, users don't notice any impact, but for bigger chunk sizes users start to complain about performance degradation. This probably happened due to lock escalations.

Now, with chunks of 10000 or even 100000 rows, we have found that no queries became slower, but the load process became several times faster.

We were ready to pay with increased tempdb and transaction log sizes for the increased performance, but in our case we didn't approach the limits assigned by the DBA. Another gain is that we can easily load data at any time. This makes the data we store more up to date.

Friday, 13 January 2012 13:43:56 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks
# Saturday, 17 December 2011

Yesterday, by accident, we saw an article about some design principles of the V8 JavaScript Engine. It made clearer what techniques are used in today's script engine implementations.

In particular, the V8 engine optimizes property access using "dynamically created hidden classes". These are structures that store an object's layout; they are derived when a new property is created (or deleted) on the object. When code accesses a property, and a cached dynamic hidden class of the object is available at that code point, then the access time is comparable to that of native fields.

In our opinion this tactic might lead to a proliferation of such dynamic hidden classes, which requires considerable housekeeping and also slows property write access, especially when a property is written for the first time.

We would like to suggest a slightly different strategy, which exploits cache hits and does not require dynamic hidden classes.

Consider an implementation data type with the following characteristics:

  • object is implemented as a hash map of property id to property value: Map<ID, Value>;
  • it stores data as an array of pairs and can be accessed directly: Pair<ID, Value> values[];
  • property index can be acquired with a method: int index(ID);

Pseudo code for the property access looks like this:

pair = object.values[cachedIndex];

if (pair.ID == propertyID)
{
   value = pair.Value;
}
else
{
  // Cache miss.
  cachedIndex = object.index(propertyID);
  value = object.values[cachedIndex].Value;
}

This approach brings us back to a dictionary-like implementation, but with an important optimization of array-speed access when the property index is cached, and with no dynamic hidden classes.
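
For illustration, here is a small C# sketch (the type and method names are ours) of such an object layout together with the cached-index access; it is a model of the idea only, not of a real engine:

using System.Collections.Generic;

public sealed class PropertyMap
{
  // Data is stored as an array of pairs and can be accessed directly.
  public KeyValuePair<string, object>[] Values;

  // The property index is acquired with a lookup (linear here, hashed in practice).
  public int Index(string id)
  {
    for(var i = 0; i < Values.Length; ++i)
    {
      if (Values[i].Key == id)
      {
        return i;
      }
    }

    return -1;
  }
}

public static class PropertyAccess
{
  // The call site keeps cachedIndex; on a hit the access is a plain array
  // read, on a miss the index is re-resolved and re-cached.
  public static object Get(PropertyMap map, string propertyID, ref int cachedIndex)
  {
    var values = map.Values;

    if ((cachedIndex >= 0) && (cachedIndex < values.Length) &&
      (values[cachedIndex].Key == propertyID))
    {
      return values[cachedIndex].Value;
    }

    cachedIndex = map.Index(propertyID);

    return cachedIndex < 0 ? null : values[cachedIndex].Value;
  }
}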

Saturday, 17 December 2011 10:18:14 UTC  #    Comments [0] -
Thinking aloud
# Saturday, 10 December 2011

@michaelhkay Saxon 9.4 is out.

But why doesn't the author state that the HE version is still xslt/xpath 2.0, as neither xslt maps nor function items are supported?

Saturday, 10 December 2011 12:16:28 UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, 03 December 2011

Recently, we found and reported a bug in SQL Server 2008 (see SQL Server 2008 with(recompile), and also Microsoft Connect).

The persons responsible for the bug evaluation have closed it as "By Design". This strange resolution, in our opinion, speaks only about those persons.

Well, we shall try once more (see Microsoft Connect). We have posted another trivial demonstration of the bug, where we show that option(recompile) is not used, which leads to a table scan (nothing worse can happen to a huge table).

Saturday, 03 December 2011 15:06:44 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud
# Thursday, 10 November 2011

A bit of history: the first release of this solution was about 9.5 years ago...

Today we've run into a strange situation. One of our clients asked us about automatic conversion of data from a mainframe (defined as COBOL copybooks) into XML or Java/.NET objects. On our suggestion to use eXperanto, which is well known to him, he stated that he wouldn't like to use a tool of a company that no longer exists...

The situation, in our opinion, become more strange when you consider the following:

  • eXperanto (the design-time tool and run-time libraries for Java and .NET) were developed, well tested, and delivered by us to production already several years ago.
  • the client bought this set (the tool and libraries).
  • the set is in production yet already in another big company, and is used time to time by our company in different migration projects.
  • the client talks with developers of this tool and run-time libraries, and he knows about this fact.
  • the client uses widely open source solutions even without dedicated vendors or support warranties.
Thursday, 10 November 2011 22:14:09 UTC  #    Comments [2] -
.NET | Java | Thinking aloud
# Friday, 28 October 2011

It has happened that we have never worked with jQuery, though we were aware of it.

In early 2000 we developed a web application that contained rich JavaScript APIs, including UI components. Later, we were actively practicing ASP.NET, and later JSF.

At present, looking at jQuery more closely, we regret that we failed to start using it earlier.

The separation of business logic and presentation is remarkable when one uses JSON web services. In fact, the server part can be seen as a set of web services representing the business logic and a set of resources: html, styles, scripts, and others. Neither ASP.NET nor JSF approaches such a consistent separation.

The only trouble, in our opinion, is that jQuery has no standard data binding: a way to bind JSON data to (and from) html controls. The technique that will probably be standardized is called jQuery Templates or JsViews.

Unfortunately, after reading about this binding API, and being in love with Xslt and XQuery, we just want to cry. We don't know what would be the best solution for the task, but what we see looks uncomfortable to us.

Friday, 28 October 2011 22:59:23 UTC  #    Comments [0] -
ASP.NET | JSF and Facelets | Thinking aloud | Tips and tricks | xslt
# Sunday, 16 October 2011

Incidentally, we have found one new implementation of yield return in Java that is at the development stage. Sources can be found at https://github.com/peichhorn/lombok-pg/zipball/master. Just to be sure, we have copied those sources to another place: peichhorn-lombok-pg-0.10.0-39-g384fb7b.zip (you may search for "yield" in the archive).

It's broken according to the source tracker, but the funny thing is that the sources, however different, still resemble our yield return implementation (Yield.jar, Yield.3.7.jar - Indigo, Yield.zip - sources) very much: variable names, error messages, algorithmic structure.

Those programmers have probably forgotten good manners: to reference the base work, at least.

Well, we generously forgive them this blunder.

P.S. Our implementation, in contrast, works without bugs.

P.P.S. The misunderstanding is resolved. See comments.

Sunday, 16 October 2011 13:49:33 UTC  #    Comments [3] -
Java | Thinking aloud
# Thursday, 29 September 2011

A couple of weeks ago, we suggested introducing an enumerator function into XPath (see [F+O30] A enumerator function):

I would like the WG to consider the addition of a function that turns a sequence into an enumeration of values.

Consider a function like this: fn:enumerator($items as item()*) as function() as item()?;

alternatively, the signature could be:

 fn:enumerator($items as function() as item()*) as function() as item()?;

This function receives a sequence and returns a function item, which upon its Nth call shall return the Nth element of the original sequence. This way, a sequence of items is turned into a function providing an enumeration of the items of the sequence.

As an example consider two functions:

a) t:rand($seed as xs:double) as xs:double* - a function producing a random number sequence;
b) t:work($input as element()) as element() - a function that generates output from its input, and that needs random numbers in the course of the execution.

t:work() may contain code like this:
  let $rand := fn:enumerator(t:rand($seed)),

and later it can call $rand() to get random numbers.

Enumerators will help to compose algorithms where one algorithm communicates with other independent algorithms, thus making code simpler. The most obvious class of enumerators is generators: ordered numbers, unique identifiers, random numbers.

Technically, the function returned from fn:enumerator() is nondeterministic, but its "side effect" is similar to the "side effect" of the function generate-id() applied to a newly created node (see bug #13747, and bug #13494).

The idea is inspired by a generator function, which returns a new value upon each call.

Such a function can be seen as a stateful object. But our goal is to look at it in a more functional way. So, we look at the algorithm as a function that produces a sequence of output, which is purely functional, and an enumerator that allows iterating over the algorithm's output.

This way, the function that implements an algorithm and the function that uses it can be seen as two threads of functional programs that use messaging to communicate with each other.
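
To illustrate the intent in a more familiar setting, here is a rough Java analogue of the proposed function (the names Enumerators and enumerator() are ours, for illustration only):

import java.util.Iterator;
import java.util.Optional;
import java.util.function.Supplier;
import java.util.stream.Stream;

public final class Enumerators
{
  // Wraps a sequence into a function that yields the next item on each call,
  // or an empty Optional once the sequence is exhausted.
  public static <T> Supplier<Optional<T>> enumerator(Iterable<T> items)
  {
    final Iterator<T> iterator = items.iterator();

    return () -> iterator.hasNext() ?
      Optional.of(iterator.next()) : Optional.empty();
  }

  public static void main(String[] args)
  {
    // A lazily produced number sequence, playing the role of t:rand($seed).
    Iterable<Double> rand = () -> Stream.generate(Math::random).iterator();

    // The consumer, playing the role of t:work(), pulls values on demand.
    Supplier<Optional<Double>> next = enumerator(rand);

    for(int i = 0; i < 3; ++i)
    {
      System.out.println(next.get().orElse(0.0));
    }
  }
}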

Honestly, we doubt that the WG will accept it, but it's interesting to watch the discussion.

Thursday, 29 September 2011 11:56:05 UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, 14 September 2011

More than a month has passed since we reported a problem to the Saxon forum (see Saxon optimizer bug and Saxon 9.2 generate-id() bug).

The essence of the problem is that we have constructed an argumentless function that returns a unique identifier each time the function is called. To achieve the effect we create a temporary node and return its generate-id() value.

Such a function is nondeterministic, as we cannot state that its result depends on its arguments only. This means that the engine's optimizer is not free to reorder calls to such a function. That's what happens in Saxon 9.2 and Saxon 9.3, where the engine hoists the function call out of a loop, thus producing invalid results.

Michael Kay, the author of the Saxon engine, argued that this is "a gray area of the xslt spec":

If the spec were stricter about defining exactly when you can rely on identity-dependent operations then I would be obliged to follow it, but I think it's probably deliberate that it currently allows implementations some latitude, effectively signalling to users that they should avoid depending on this aspect of the behaviour.

He advised raising a bug in the W3C Bugzilla to resolve the issue. In the end two related bugs were raised:

  • Bug 13494 - Node uniqueness returned from XSLT function;
  • Bug 13747 - [XPath 3.0] Determinism of expressions returning constructed nodes.

Yesterday, the WG has resolved the issue:

The Working Group agreed that default behavior should continue to require these nodes to be constructed with unique IDs. We believe that this is the kind of thing implementations can do with annotations or declaration options, and it would be best to get implementation experience with this before standardizing.

This means that the technique we used to generate unique identifiers is correct and the behaviour is well defined.

The only remaining problem is to wait until Saxon fixes its behaviour accordingly.

Wednesday, 14 September 2011 05:54:56 UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, 06 September 2011

We're not big fans of Entity Framework, as we don't expose the database structure directly to the client program, but rather through stored procedures and functions. So, EF for us is a tool to expose those stored procedures as .NET wrappers. This limited use of EF still greatly automates the data access code.

But what we have lately found is that EF has a problem with char parameters. Namely, if you import a procedure, say MyProc, that accepts char(1), and then call it through the generated wrapper, you will see in SQL Profiler that the char(1) parameter is passed with many trailing spaces, as if it were char(8000). There is no need to prove that this is highly inefficient.

We can see that the problem happens in the VS 2010 designer rather than in the EF runtime, as the SP's parameters are not attributed with a length; see the model xml (*.edmx):

<Function Name="MyProc" Schema="Data">
  ...
  <Parameter Name="recipientType" Type="char" Mode="In" />
  ...
</Function>

while if we set:

  <Parameter Name="recipientType" Type="char" MaxLength="1" Mode="In" />

the runtime starts working as expected. So the workaround is to fix the model file manually.

See also: Stored Proc and Char parm

Tuesday, 06 September 2011 21:11:38 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Friday, 12 August 2011

1. query.dll vs tquery.dll

We have installed Windows Search 4 on a Windows 2003 server. The goal was to index huge compressed xml files (see Windows Search Notifications). But for some reason it did not want to index content.

No "select System.ItemUrl from SystemIndex where contains('...')" has ever returned a row.

We thought that the problem was in our protocol handler, and tried to localize it, but finally discovered that Windows Search was not able to find anything within text files.

A registry comparison has shown that the *.txt extension was indexed by the IFilter defined in query.dll, while on the other computers, where everything worked, the implementation was in tquery.dll.

Both libraries were present on the Windows 2003 server, so we corrected the registry and everything started to work.

As far as we understand, query.dll is part of the legacy Indexing Service, while tquery.dll is the up-to-date implementation.

2. Search index size

We have to index a considerable amount of data. But before we can do it we have to estimate the size of the index.

In the past we seem to have seen somewhere a statement that a search index needs storage of about 10% of the original data. Unfortunately we cannot find this estimate at present, nor can we find any other estimate. This complicates our planning.

To get an empirical estimate we've indexed several thousand *.xml-gz files, which are gzipped big xmls. The total size of these files is about 4.5GB. The total uncompressed size of the xmls is ~50GB. The xml contained about 10 million pages of data.

According to the 10% criterion we should have arrived at a ~5GB search index.

But what we have discovered is that the index has grown to more than 50GB. That's very disappointing. We cannot afford such an expense, especially as we've run the test on only a tiny part of the data, which grows over time.

So, the solution is to find out what's wrong and how it can be cured, or to fulltext index only the most recent subset of the data.

P.S. We have tried to mark the folder with the search index as compressed, but it did not work.

P.P.S. We have found the reference to the Windows Search 4 index size estimation. It is in Windows Search Frequently Asked Questions; see the answer to the "What is average size of a user's index?" question.

Friday, 12 August 2011 09:20:18 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Friday, 22 July 2011

We needed to track the stream position during creation of an xml file. This is to allow random access to a huge xml file (the task is related to WindowsSearch).

This is a simplified form of the xml:

<data>
  <item>...</item>
   ...
  <item>...</item> 
</data>

The goal was to have the stream position of each item element. With this in mind, we've decided to:

  • open a stream, and then an xml writer over it;
  • write data into the xml writer;
  • call the Flush() method of the xml writer before measuring the stream offset;

That's a code sample:

var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);

writer.WriteStartDocument();
writer.WriteStartElement("data");

for(var i = 0; i < 10; ++i)
{
  writer.Flush();

  Console.WriteLine("Flush offset: {0}, char: {1}",
    stream.Position,
    (char)stream.GetBuffer()[stream.Position - 1]);
 
  writer.WriteStartElement("item");
  writer.WriteValue("item " + i);
  writer.WriteEndElement();
}

writer.WriteEndElement();
writer.WriteEndDocument();

That's the output:

Flush offset: 46, char: a
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >

Funny, isn't it?

After feeding the start tag <data> and flushing the xml writer we observe that only "<data" has been written to the stream. Well, Flush() never promised anything particular about the content of the stream, so we cannot claim any violation; however, we expected to see the whole start tag.

Inspection of the xml writer implementation reveals laziness in writing data down to the stream. In particular, a start tag is closed only when the content starts. This is probably to support empty tags: <data/>.

To do the trick we had to issue empty content, and moreover, to call a particular method of the xml writer with particular parameters. So the code after the fix looks like this:

var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);

writer.WriteStartDocument();
writer.WriteStartElement("data");

char[] empty = { ' ' };

for(var i = 0; i < 10; ++i)
{
  writer.WriteChars(empty, 0, 0);
  writer.Flush();

  Console.WriteLine("Flush offset: {0}, char: {1}",
    stream.Position,
    (char)stream.GetBuffer()[stream.Position - 1]);

  writer.WriteStartElement("item");
  writer.WriteValue("item " + i);
  writer.WriteEndElement();
}

writer.WriteEndElement();
writer.WriteEndDocument();

And output is:

Flush offset: 47, char: >
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >

While this code works, we feel uneasy with it.

What's the better way to solve the task?

Update: further analysis shows that this is the only possible behaviour, as after the call to write a start element you can write either attributes, content, or the end of the element, so the writer may write either a space, '>', or '/>'. The only question is why it takes WriteChars(empty, 0, 0) into account but not WriteValue("").

Friday, 22 July 2011 21:08:36 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Wednesday, 13 July 2011

As you probably know, we have implemented our custom Protocol Handler for Windows Search.

It's called .xml-gz, and its goal is to index compressed xml files and to have search results with subtree precision. So, for xml:

<data>
  <item>...</item>
  <item>...</item>
  ...
</data>

search finds results within item elements and returns the xml's url and the stream offset of the item. Using the ZLIB API we can compress data with stream bookmarks, so fast random access to the data is possible.

The only problem we have is the notification of changes (create, delete, update) of such files.

The spec describes several techniques (nothing has worked for us):

1. Call catalogManager.ReindexMatchingURLs() - it just returns without any impact.

2. Call changeSink.OnItemsChanged() - it returns an error.

3. Implement a .xml-gz IFilter and call IGatherNotifyInline (see "have your .zip urls indexed when they are created or modified") - that's a mystery to us.

4. Implement a root url in the form .xml-gz:/// and perform a Windows Search:

SELECT
  System.ItemUrl, System.DateModified
FROM
  SystemIndex WHERE System.FileExtension='.xml-gz'

to find all .xml-gz sources. This is not reliable, as your protocol handler can be (and is) called before the file is indexed.

So, the only reliable way to index your data is to (re-)add an indexing rule for the protocol handler, which in most cases reindexes everything.

The only bearable solution we found is to define an indexing rule in the form .xml-gz://file:d:/data/... and to use the IShellFolder(2) interfaces to discover sub-items and their modification times. This technique allows a minimal data scan when you (re-)add an indexing rule.

Wednesday, 13 July 2011 20:21:00 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Saturday, 09 July 2011

Being inexperienced with Windows Search, we tried to build queries to find data in a huge storage. We needed to find a document that matches some name pattern and contains some text.

Our naive query was like this:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemName like '...%' and
  contains('...')

In most cases this query returns nothing and runs for a very long time. It's interesting to note that it may start returning data if the "top" clause is missing or uses a bigger number, but in these cases the query is even slower.

The next try was like this:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemName >= '...' and System.ItemName < '...' and
  contains('...')

This query is also slow, but at least it returns some results.

At some point we started to question the utility of Windows Search if it's so slow, but then we found that there is a property System.ItemNameDisplay, which in our case coincides with the value of the System.ItemName property, so we tried the query:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemNameDisplay like '...%' and
  contains('...')

This query worked fast and produced good results. This hints that the search engine has an index on System.ItemNameDisplay, in contrast to the System.ItemName property.

We've looked at the property definitions:

System.ItemNameDisplay

The display name in "most complete" form. It is the unique representation of the item name most appropriate for end users.

propertyDescription
    name = System.ItemNameDisplay
    shellPKey = PKEY_ItemNameDisplay
    formatID = B725F130-47EF-101A-A5F1-02608C9EEBAC
    propID = 10
    searchInfo
       inInvertedIndex = true
       isColumn = true
       isColumnSparse = false
       columnIndexType = OnDisk
       maxSize = 128

System.ItemName

The base name of the System.ItemNameDisplay property.

propertyDescription
    name = System.ItemName
    shellPKey = PKEY_ItemName
    formatID = 6B8DA074-3B5C-43BC-886F-0A2CDCE00B6F
    propID = 100
    searchInfo
       inInvertedIndex = false
       isColumn = true
       isColumnSparse = false
       columnIndexType = OnDisk
       maxSize = 128

Indeed, one property is indexed, while the other is not.

As with other databases, a query is fast when the engine uses indices rather than performing a data scan. This is also true for Windows Search.

The differences in results that variations of the query produce also show that Windows Search is nevertheless very different from a relational database.

Saturday, 09 July 2011 10:01:36 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Tuesday, 05 July 2011

We have developed our custom Windows Search Protocol Handler. The role of this component is to expose items of complex content (or unusual storage) to Windows Search.

You can think of it as a virtual folder: a Protocol Handler allows enumerating its files, file properties, and contents.

The goal of our Protocol Handler is to represent some data structure as a set of xml files. We expected that if we could find data within a folder containing these files, then a search within the Protocol Handler's scope would bring the same (or almost the same) results.

Reality is different.

For some reason the .xml IFilter (a component that extracts text data to index) works differently with the file system and with our storage. We cannot state that it does not work, but for some reason many words that Windows Search finds within a file are never found within the Protocol Handler's scope.

We have observed that if, for the purpose of indexing, we represent content xml items as .txt files, then search works as expected. So, our workaround was to present only the xml's text data for indexing, and to use the .txt IFilter (this is in fact roughly what the .xml IFilter does by itself).

Is there a conclusion?

Well, Windows Search is a black box probably containing bugs. Its behaviour is not always obvious.

Tuesday, 05 July 2011 20:31:47 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Friday, 24 June 2011

Let's put it blatantly: Windows Search 4 has design and implementation problems.

You discover this immediately when you start implementing indexing of a custom file format.

If you want to index a simple file format then you need to implement the IFilter interface. But if it happens that you want to index compound data then you have to invent your own protocol.

If you figure out how to implement your protocol to index that compound data, then you will most probably get stuck on the problem of how to notify the indexer about changes.

The problem is that Windows Search 4 has an API to reindex urls, which simply does not work, and an API to notify the indexer about changes, which throws an error (returns an error HRESULT) for custom protocols. At least, we were not able to make them work.

Friday, 24 June 2011 19:39:36 UTC  #    Comments [0] -
Thinking aloud | Window Search
# Thursday, 02 June 2011

Do you know that the best JSF/Facelets visual editor, in our opinion, is ... Microsoft Visual Studio 2008? Another rather good JSF editor is present in IBM RAD 7.xx. The most popular open source Java IDE, Eclipse, contains an ugly implementation of such a useful thing.

Thursday, 02 June 2011 08:37:09 UTC  #    Comments [0] -
JSF and Facelets | Thinking aloud
# Tuesday, 03 May 2011

For a couple of days already we've been trying to create a UserControl containing a TabContainer. To achieve the goal we have created a page with a ToolkitScriptManager and the user control itself.

Page:

<form runat="server">
  <ajax:ToolkitScriptManager ID="ToolkitScriptManager1" runat="server"/>
  <uc1:WebUserControl1 ID="WebUserControl11" runat="server" />
</form>

User control:

<%@ Control Language="C#" %>
<%@ Register
  Assembly="AjaxControlToolkit"
  Namespace="AjaxControlToolkit"
  TagPrefix="ajax" %>

<ajax:TabContainer ID="Tab" runat="server" Width="100%">
  <ajax:TabPanel runat="server" HeaderText="Tab1" ID="Tab1">
    <ContentTemplate>Panel 1</ContentTemplate>
  </ajax:TabPanel>
</ajax:TabContainer>

What could be simpler?

But no, there is a problem. At run-time the control works perfectly, but in the designer it shows an error instead of a normal design view:

Error Rendering Control - TabContainer1
An unhandled exception has occurred.
Could not find any resources appropriate for the specified culture or the neutral culture. Make sure "AjaxControlToolkit.Properties.Resources.NET4.resources" was correctly embedded or linked into assembly "AjaxControlToolkit" at compile time, or that all the satellite assemblies required are loadable and fully signed.

That's a stupid error, which says nothing about the real reason for the problem. We had to attach a debugger to Visual Studio just to realize what the problem is.

So, the error occurs at the following code of AjaxControlToolkit.ScriptControlBase:

private void EnsureScriptManager()
{
  if (this._scriptManager == null)
  {
    this._scriptManager = ScriptManager.GetCurrent(this.Page);

    if (this._scriptManager == null)
    {
      throw new HttpException(Resources.E_NoScriptManager);
    }
  }
}

Originally, the problem is due to the fact that the ScriptManager is not found, and the code wants to report an HttpException, but the fun part is that we receive a different exception, which is related to a missing resource text for the message Resources.E_NoScriptManager. It turns out that the E_NoScriptManager text is found neither in the primary nor in the resource assemblies.

As for the original problem, it's hard to say why the ScriptManager is not available at design time. We, however, observed that a ScriptManager registers itself for the ScriptManager.GetCurrent() method at run-time only:

protected internal override void OnInit(EventArgs e)
{
...
  if (!base.DesignMode)
  {
...
    iPage.Items[typeof(ScriptManager)] = this;
...
  }
}

So, it's not clear what they (the toolkit's developers) expected to get at design time.

These observations leave us uneasy regarding the quality of the library.

Tuesday, 03 May 2011 19:45:51 UTC  #    Comments [8] -
ASP.NET | Thinking aloud
# Tuesday, 26 April 2011

Earlier, we have described an approach to call Windows Search from SQL Server 2008. But it has turned out that our problem is more complicated...

It all started from the initial task:

  • to allow free text search in a store of huge xml files;
  • files should be compressed, so these are *.xml.gz;
  • search results should be addressable to a fragment within xml.

Later we shall describe how we solved this task; for now it's enough to say that we have implemented a Protocol Handler for Windows Search named '.xml-gz:'. This way an original file stored, say, at 'file:///c:/store/data.xml-gz' is seen as a container by Windows Search:

  • .xml-gz:///file:c:/store/data.xml-gz/id1.xml
  • .xml-gz:///file:c:/store/data.xml-gz/id2.xml
  • ...

This way the search query should be like this:

select System.ItemUrl from SystemIndex where scope='.xml-gz:' and contains(...)

Everything worked during testing: we succeeded in issuing Windows Search selects from SQL Server and joining the results with other sql queries.

But later on, when we considered the runtime environment, we saw that our design won't work. The reason is simple: Windows Search will run on a computer different from those where the SQL Servers run. So, the search query should look like this:

select System.ItemUrl from Computer.SystemIndex where scope='.xml-gz:' and contains(...)

Here we realized the limitation of the current (Windows Search 4) implementation: remote search works for shared folders only, thus the query may only look like:

select System.ItemUrl from Computer.SystemIndex where scope='file://Computer/share/' and contains(...)

Notice that the search restricts the scope to the file protocol; this way remote search will never return our results. The only way to search in our scope is to perform a local search.

We have considered the following approaches to resolve the issue.

The simplest one would be to access the Search provider on the remote computer using a connection string: "Provider=Search.CollatorDSO;Data Source=Computer" and to use local queries. This does not work, as the provider simply disregards the Data Source parameter.

The other try was to use the MS Remote OLEDB provider. We tried hard to configure it, but it always returns an obscure error; more than that, it's deprecated (Microsoft plans to remove it in the future).

So, we decided to forward the request manually:

  • SQL Server calls a web service (through a CLR function);
  • Web service queries Windows Search locally.

Here we considered WCF Data Services and a custom web service.

The advantage of WCF Data Services is that it's a technology with the ambitions of a standard, but it's a rather complex task to create an implementation that will talk the Windows Search SQL dialect, so we decided to build a primitive http handler that accepts a query parameter. That's trivial and also has the virtue of simple implementation and high streamability.

So, that's our http handler (WindowsSearch.ashx):

<%@ WebHandler Language="C#" Class="WindowsSearch" %>

using System;
using System.Web;
using System.Xml;
using System.Text;
using System.Data.OleDb;

/// <summary>
/// A Windows Search request handler.
/// </summary>
public class WindowsSearch: IHttpHandler
{
  /// <summary>
  /// Handles the request.
  /// </summary>
  /// <param name="context">A request context.</param>
  public void ProcessRequest(HttpContext context)
  {
    var request = context.Request;
    var query = request.Params["query"];
    var response = context.Response;

    response.ContentType = "text/xml";
    response.ContentEncoding = Encoding.UTF8;

    var writer = XmlWriter.Create(response.Output);

    writer.WriteStartDocument();
    writer.WriteStartElement("resultset");

    if (!string.IsNullOrEmpty(query))
    {
      using(var connection = new OleDbConnection(provider))
      using(var command = new OleDbCommand(query, connection))
      {
        connection.Open();

        using(var reader = command.ExecuteReader())
        {
          string[] names = null;

          while(reader.Read())
          {
            if (names == null)
            {
              names = new string[reader.FieldCount];

              for (int i = 0; i < names.Length; ++i)
              {
                names[i] = XmlConvert.EncodeLocalName(reader.GetName(i));
              }
            }

            writer.WriteStartElement("row");

            for(int i = 0; i < names.Length; ++i)
            {
              writer.WriteElementString(
                names[i],
                Convert.ToString(reader[i]));
            }

            writer.WriteEndElement();
          }
        }
      }
    }

    writer.WriteEndElement();
    writer.WriteEndDocument();

    writer.Flush();
  }

  /// <summary>
  /// Indicates that a handler is reusable.
  /// </summary>
  public bool IsReusable { get { return true; } }

  /// <summary>
  /// A connection string.
  /// </summary>
  private const string provider =
    "Provider=Search.CollatorDSO;" +
    "Extended Properties='Application=Windows';" +
    "OLE DB Services=-4";
}

And a SQL CLR function looks like this:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Net;
using System.IO;
using System.Xml;

/// <summary>
/// A user defined function.
/// </summary>
public class UserDefinedFunctions
{
  /// <summary>
  /// A Windows Search returning result as xml strings.
  /// </summary>
  /// <param name="url">A search url.</param>
  /// <param name="userName">A user name for a web request.</param>
  /// <param name="password">A password for a web request.</param>
  /// <param name="query">A Windows Search SQL.</param>
  /// <returns>A result rows.</returns>
  [SqlFunction(
    IsDeterministic = false,
    Name = "WindowsSearch",
    FillRowMethodName = "FillWindowsSearch",
    TableDefinition = "value nvarchar(max)")]
  public static IEnumerable Search(
    string url,
    string userName,
    string password,
    string query)
  {
    return SearchEnumerator(url, userName, password, query);
  }

  /// <summary>
  /// A filler of WindowsSearch function.
  /// </summary>
  /// <param name="value">A value returned from the enumerator.</param>
  /// <param name="row">An output value.</param>
  public static void FillWindowsSearch(object value, out string row)
  {
    row = (string)value;
  }

  /// <summary>
  /// Gets a search row enumerator.
  /// </summary>
  /// <param name="url">A search url.</param>
  /// <param name="userName">A user name for a web request.</param>
  /// <param name="password">A password for a web request.</param>
  /// <param name="query">A Windows Search SQL.</param>
  /// <returns>A result rows.</returns>
  private static IEnumerable<string> SearchEnumerator(
    string url,
    string userName,
    string password,
    string query)
  {
    if (string.IsNullOrEmpty(url))
    {
      throw new ArgumentException("url");
    }

    if (string.IsNullOrEmpty(query))
    {
      throw new ArgumentException("query");
    }

    var requestUrl = url + "?query=" + Uri.EscapeDataString(query);

    var request = WebRequest.Create(requestUrl);

    request.Credentials = string.IsNullOrEmpty(userName) ?
      CredentialCache.DefaultCredentials :
      new NetworkCredential(userName, password);

    using(var response = request.GetResponse())
    using(var stream = response.GetResponseStream())
    using(var reader = XmlReader.Create(stream))
    {
      bool read = true;

      while(!read || reader.Read())
      {
        if ((reader.Depth == 1) && reader.IsStartElement())
        {
          // Note that ReadInnerXml() advances the reader similar to Read().
          yield return reader.ReadInnerXml();

          read = false;
        }
        else
        {
          read = true;
        }
      }
    }
  }
}

And, finally, when you call this service from SQL Server you write a query like this:

with search as
(
  select
    cast(value as xml) value
  from
    dbo.WindowsSearch
    (
      N'http://machine/WindowsSearchService/WindowsSearch.ashx',
      null,
      null,
      N'
        select
          "System.ItemUrl"
        from
          SystemIndex
        where
          scope=''.xml-gz:'' and contains(''...'')'
    )
)
select
  value.value('/System.ItemUrl[1]', 'nvarchar(max)')
from
  search

The design is not trivial but it works somehow.

After dealing with all these problems some questions remain unanswered:

  • Why does SQL Server not allow querying Windows Search directly?
  • Why does the Windows Search OLEDB provider not support the "Data Source" parameter?
  • Why does Windows Search not support custom protocols during remote search?
  • Why does SQL Server not support web requests/web services natively?
Tuesday, 26 April 2011 08:26:10 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks | Window Search
# Friday, 04 March 2011

We were trying to query Windows Search from SQL Server 2008.

The documentation states that Windows Search is exposed as an OLE DB datasource. This meant that we could just query results like this:

SELECT
  *
FROM
  OPENROWSET(
    'Search.CollatorDSO.1',
    'Application=Windows',
    'SELECT "System.ItemName", "System.FileName" FROM SystemIndex');

But no, such a select never works. Instead it returns obscure error messages:

OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" returned message "Command was not prepared.".
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" reported an error. Command was not prepared.
Msg 7350, Level 16, State 2, Line 1
Cannot get the column information from OLE DB provider "Search.CollatorDSO.1" for linked server "(null)".

Microsoft is silent about the reasons for such behaviour. People have come to the conclusion that the problem is in SQL Server, as one can query search results through an OleDbConnection without problems.

This is very unfortunate, as it rules out many use cases.

As a workaround we have defined a CLR function that wraps the Windows Search call and returns rows as xml fragments. So now the query looks like this:

select
  value.value('System.ItemName[1]', 'nvarchar(max)') ItemName,
  value.value('System.FileName[1]', 'nvarchar(max)') FileName
from
  dbo.WindowsSearch('SELECT "System.ItemName", "System.FileName" FROM SystemIndex')

Notice how we decompose the xml fragment back into fields with the value() function.

The C# function looks like this:

using System;
using System.Collections;
using System.IO;
using System.Xml;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Data.OleDb;

using Microsoft.SqlServer.Server;

public class UserDefinedFunctions
{
  [SqlFunction(
    FillRowMethodName = "FillSearch",
    TableDefinition="value xml")]
  public static IEnumerator WindowsSearch(SqlString query)
  {
    const string provider =
      "Provider=Search.CollatorDSO;" +
      "Extended Properties='Application=Windows';" +
      "OLE DB Services=-4";

    var settings = new XmlWriterSettings
    {
      Indent = false,
      CloseOutput = false,
      ConformanceLevel = ConformanceLevel.Fragment,
      OmitXmlDeclaration = true
    };

    string[] names = null;

    using(var connection = new OleDbConnection(provider))
    using(var command = new OleDbCommand(query.Value, connection))
    {
      connection.Open();

      using(var reader = command.ExecuteReader())
      {
        while(reader.Read())
        {
          if (names == null)
          {
            names = new string[reader.FieldCount];

            for (int i = 0; i < names.Length; ++i)
            {
              names[i] = XmlConvert.EncodeLocalName(reader.GetName(i));
            }
          }

          var stream = new MemoryStream();
          var writer = XmlWriter.Create(stream, settings);

          for(int i = 0; i < names.Length; ++i)
          {
            writer.WriteElementString(names[i], Convert.ToString(reader[i]));
          }

          writer.Close();

          yield return new SqlXml(stream);
        }
      }
    }
  }

  public static void FillSearch(object value, out SqlXml row)
  {
    row = (SqlXml)value;
  }
}

Notes:

  • Notice the use of "OLE DB Services=-4" in the provider string to avoid transaction enlistment (required in SQL Server 2008).
  • The permission level of the project that defines this extension function should be set to unsafe (see Project Properties/Database in Visual Studio), otherwise it does not allow the use of OLE DB.
  • SQL Server should be configured to allow CLR functions, see Server/Facets/Surface Area Configuration/ClrIntegrationEnabled in Microsoft SQL Server Management Studio.
  • The assembly should either be signed or the database should be marked as trustworthy, see Database/Facets/Trustworthy in Microsoft SQL Server Management Studio.
Friday, 04 March 2011 09:22:49 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks | Window Search
# Sunday, 20 February 2011

Over the last few days we were testing Java web applications that expose web services. During these tests we've found a few interesting features.

The first feature allows retrieving info about all endpoints supported by the web application with a GET request. The feature works at least for Metro, which implements the JAX-WS API v2.x. In order to get such info, a client requests any endpoint's URL from the server. The result is an HTML page with a table; each row of the table contains an endpoint's data for a supported web-service method. This feature may be used as a web-services discovery mechanism.

The second feature is bad rather than good. The JAX-WS API supposes that a developer annotates the classes and methods that he/she wants to expose as web services. Then an implementation generates an additional bridge layer between the developer's code and the API that does all the routine work behind the scenes. Maybe that was a good idea, but Metro's implementation is imperfect. Metro dynamically generates such classes at run-time, when a web application starts. Moreover, Metro does such generation for all classes at once. So, in our case, when the generated web application contains dozens or even hundreds of web services, the application's startup takes a lot of time.

Probably, the Metro developers didn't want to deal with the implementation of a lazy algorithm, where a web-service bridge is generated and cached on demand; a sketch of what we mean is shown below. We hope this issue will be solved in the next releases.
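
By a lazy algorithm we mean roughly the following memoizing pattern (a schematic Java sketch with placeholder types; this is not Metro's code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// The bridge for a web service is built the first time the service is hit,
// then cached, instead of generating all bridges at startup.
final class LazyBridges
{
  private final Map<Class<?>, Object> bridges = new ConcurrentHashMap<>();
  private final Function<Class<?>, Object> generator;

  LazyBridges(Function<Class<?>, Object> generator)
  {
    this.generator = generator;
  }

  Object bridgeFor(Class<?> serviceClass)
  {
    // computeIfAbsent() generates the bridge once per service, on first use.
    return bridges.computeIfAbsent(serviceClass, generator);
  }
}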

Sunday, 20 February 2011 10:20:12 UTC  #    Comments [0] -
Java | Thinking aloud | Tips and tricks
# Tuesday, 08 February 2011

A while ago we created a simple cache for a Java application. It was modelled as a Map<K, V>: it cached values for keys.

Use cases were:

Cache<String, Object> cache = new Cache<String, Object>();
...
instance = cache.get("key");
cache.put("key", instance);

But now we have thought of a different implementation: one modelled as a WeakReference<V>, with map access provided as additional utility methods.

Consider some examples:

1. Free standing CachedReference<V> instance.

CachedReference<Data> ref = new CachedReference<Data>(1000, true);
...
ref.set(data);
...
data = ref.get();

2. Map of CachedReference<V> instances.

ConcurrentHashMap<String, CachedReference<Data>> cache =
  new ConcurrentHashMap<String, CachedReference<Data>>();

CachedReference.put(cache, "key", data, 1000, true);
...
data = CachedReference.get(cache, "key");

The first case is faster than the original Cache<K, V>, as it does not use any hash map at all. The latter case provides the same performance as Cache<K, V> but gives better control over the storage. Incidentally, CachedReference<V> is more compact than Cache<K, V>.

The new implementation is CachedReference.java, the old one Cache.java.
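
For the curious, here is a minimal sketch of how such a reference might look. The semantics are our guess made for illustration only (an expiration period in milliseconds and a flag telling whether a read prolongs the life of the value); the actual code is in CachedReference.java:

import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

// A time- and memory-bounded reference; a sketch only, not the real class.
public class CachedReference<V>
{
  private final long timeout;
  private final boolean touchOnGet;
  private volatile SoftReference<V> reference;
  private volatile long expiresAt;

  public CachedReference(long timeout, boolean touchOnGet)
  {
    this.timeout = timeout;
    this.touchOnGet = touchOnGet;
  }

  public void set(V value)
  {
    reference = value == null ? null : new SoftReference<V>(value);
    expiresAt = System.currentTimeMillis() + timeout;
  }

  public V get()
  {
    SoftReference<V> ref = reference;

    if ((ref == null) || (System.currentTimeMillis() > expiresAt))
    {
      return null;
    }

    if (touchOnGet)
    {
      expiresAt = System.currentTimeMillis() + timeout;
    }

    return ref.get();
  }

  // Map access as utility methods, as in the use case above.
  public static <K, V> void put(
    ConcurrentHashMap<K, CachedReference<V>> cache,
    K key,
    V value,
    long timeout,
    boolean touchOnGet)
  {
    CachedReference<V> ref = new CachedReference<V>(timeout, touchOnGet);

    ref.set(value);
    cache.put(key, ref);
  }

  public static <K, V> V get(
    ConcurrentHashMap<K, CachedReference<V>> cache,
    K key)
  {
    CachedReference<V> ref = cache.get(key);

    return ref == null ? null : ref.get();
  }
}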

Tuesday, 08 February 2011 15:20:29 UTC  #    Comments [0] -
Java | Thinking aloud
# Thursday, 03 February 2011

michaelhkay: Just posted a new internal draft of XSLT 3.0. Moving forward on maps, nested sequences, and JSON support.

Hope they will finally appear there!

See also: Tuples and maps - next try, Tuples and maps - Status: CLOSED, WONTFIX, Tuples and maps in Saxon and other blog posts on our site about immutable maps.

Thursday, 03 February 2011 11:07:49 UTC  #    Comments [0] -
Thinking aloud | xslt
# Thursday, 27 January 2011

The method pattern we have suggested to use along with the @Yield annotation brought funny questions like: "why should I mark my method with the @Yield annotation at all?"

Well, in many cases you may live with an ArrayList populated with data and then perform the iteration. But in some cases this approach is not practical, either due to the amount of data or due to the time required to get the first item.

In the latter case you usually want to build an iterator that calculates items on demand. The @Yield annotation is designed as a marker for such methods. They are refactored into state machines at compilation time, where each addition to the result list is transformed into a new item yielded by the iterator.

So, if you have decided to use the @Yield annotation, then at some point you will ask yourself what happens with resources acquired during iteration. Will resources be released if the iteration is interrupted in the middle due to an exception or a break statement?

To address the problem the yield iterator implements the Closeable interface.

This way, when you call close() before the iteration has reached the end, the state machine works as if a break statement were injected into the method body after the yield point. Thus all finally blocks of the original method are executed and resources are released.

Consider an example of data iterator:

@Yield
public Iterable<Data> getData(final Connection connection)
  throws Exception
{
  ArrayList<Data> result = new ArrayList<Data>();

  PreparedStatement statement =
    connection.prepareStatement("select key, value from table");

  try
  {
    ResultSet resultSet = statement.executeQuery();

    try
    {
      while(resultSet.next())
      {
        Data data = new Data();

        data.key = resultSet.getInt(1);
        data.value = resultSet.getString(2);

        result.add(data); // yield point
      }
    }
    finally
    {
      resultSet.close();
    }
  }
  finally
  {
    statement.close();
  }

  return result;
}

private static void close(Object value)
  throws IOException
{
  if (value instanceof Closeable)
  {
    Closeable closeable = (Closeable)value;

    closeable.close();
  }
}

public void daoAction(Connection connection)
  throws Exception
{
  Iterable<Data> items = getData(connection);

  try
  {
    for(Data data: items)
    {
      // do something that potentially throws exception.
    }
  }
  finally
  {
    close(items);
  }
}

getData() iterates over sql data. During its lifecycle it creates and releases a PreparedStatement and a ResultSet.

daoAction() iterates over the results provided by getData() and performs some actions that potentially throw an exception. The goal of close() is to release the opened sql resources in case of such an exception.

Here you can inspect how the state machine is implemented for such a method:

@Yield()
public static Iterable<Data> getData(final Connection connection)
  throws Exception
{
  assert (java.util.ArrayList<Data>)(ArrayList<Data>)null == null;

  class $state implements java.lang.Iterable<Data>, java.util.Iterator<Data>, java.io.Closeable
  {
    public java.util.Iterator<Data> iterator() {
      if ($state$id == 0) {
        $state$id = 1;

        return this;
      } else return new $state();
    }

    public boolean hasNext() {
      if (!$state$nextDefined) {
        $state$hasNext = $state$next();
        $state$nextDefined = true;
      }

      return $state$hasNext;
    }

    public Data next() {
      if (!hasNext()) throw new java.util.NoSuchElementException();

      $state$nextDefined = false;

      return $state$next;
    }

    public void remove() {
      throw new java.lang.UnsupportedOperationException();
    }

    public void close() {
      do switch ($state$id) {
      case 3:
        $state$id2 = 8;
        $state$id = 5;

        continue;
      default:
        $state$id = 8;

        continue;
      } while ($state$next());
    }

    private boolean $state$next() {
      java.lang.Throwable $state$exception;

      while (true) {
        try {
          switch ($state$id) {
          case 0:
            $state$id = 1;
          case 1:
            statement = connection.prepareStatement("select key, value from table");
            $state$exception1 = null;
            $state$id1 = 8;
            $state$id = 2;
          case 2:
            resultSet = statement.executeQuery();
            $state$exception2 = null;
            $state$id2 = 6;
            $state$id = 3;
          case 3:
            if (!resultSet.next()) {
              $state$id = 4;

              continue;
            }

            data = new Data();
            data.key = resultSet.getInt(1);
            data.value = resultSet.getString(2);
            $state$next = data;
            $state$id = 3;

            return true;
          case 4:
            $state$id = 5;
          case 5:
            {
              resultSet.close();
            }

            if ($state$exception2 != null) {
              $state$exception = $state$exception2;

              break;
            }

            if ($state$id2 > 7) {
              $state$id1 = $state$id2;
              $state$id = 7;
            } else $state$id = $state$id2;

            continue;
          case 6:
            $state$id = 7;
          case 7:
            {
              statement.close();
            }

            if ($state$exception1 != null) {
              $state$exception = $state$exception1;

              break;
            }

            $state$id = $state$id1;

            continue;
          case 8:
          default:
            return false;
          }
        } catch (java.lang.Throwable e) {
          $state$exception = e;
        }

        switch ($state$id) {
        case 3:
        case 4:
          $state$exception2 = $state$exception;
          $state$id = 5;

          continue;
        case 2:
        case 5:
        case 6:
          $state$exception1 = $state$exception;
          $state$id = 7;

          continue;
        default:
          $state$id = 8;

          java.util.ConcurrentModificationException ce = new java.util.ConcurrentModificationException();

          ce.initCause($state$exception);

          throw ce;
        }
      }
    }

    private PreparedStatement statement;
    private ResultSet resultSet;
    private Data data;
    private int $state$id;
    private boolean $state$hasNext;
    private boolean $state$nextDefined;
    private Data $state$next;
    private java.lang.Throwable $state$exception1;
    private int $state$id1;
    private java.lang.Throwable $state$exception2;
    private int $state$id2;
  }

  return new $state();
}

Now you can estimate what it is worth to write an algorithm as a sound state machine, compared to the conventional implementation.

The Yield annotation processor can be downloaded from Yield.zip or Yield.jar.

See also Yield return feature in java.

Thursday, 27 January 2011 10:33:54 UTC  #    Comments [0] -
Java | Thinking aloud | Tips and tricks
# Monday, 24 January 2011

We're happy to announce that we have implemented the @Yield annotation for both the javac and Eclipse compilers.

This way you get built-in IDE support for the feature!

To download the yield annotation processor please use the following link: Yield.zip

It contains both the yield annotation processor and a test project.

If you do not want to compile the sources, you can download Yield.jar

We would like to reiterate how the @Yield annotation works:

  1. A developer defines a method that returns either Iterator<T> or Iterable<T> instance and marks it with @Yield annotation.
  2. A developer implements iteration logic following the pattern:
    • declare a variable to accumulate results:
        ArrayList<T> items = new ArrayList<T>();
    • use the following statement to add item to result:
        items.add(...);
    • use
        return items;
      or
        return items.iterator();
      to return result;
    • mark method's params, if any, as final.
  3. A developer ensures that the yield annotation processor is available during compilation (see details below).
  4. YieldProcessor rewrites method into a state machine at compilation time.

The following is an example of such a method:

@Yield
public static Iterable<Integer> generate(final int from, final int to)
{
  ArrayList<Integer> items = new ArrayList<Integer>();

  for(int i = from; i < to; ++i)
  {
    items.add(i);
  }

  return items;
}

The use is like this:

for(int value: generate(7, 20))
{
  System.out.println("generator: " + value);
}

Notice that the method's implementation will still be correct in the absence of YieldProcessor.

Another important feature is that the state machine produced by the yield processor is closeable.

This means that if you break the iteration before the end is reached, you can release resources acquired during the iteration.

Consider the example where break exits the iteration:

@Yield
public static Iterable<String> resourceIteration()
{
  ArrayList<String> items = new ArrayList<String>();

  acquire();

  try
  {
    for(int i = 0; i < 100; ++i)
    {
      items.add(String.valueOf(i));
    }
  }
  finally
  {
    release();
  }

  return items;
}

and the use

int i = 0;
Iterable<String> iterator = resourceIteration();

try
{
  for(String item: iterator)
  {
    System.out.println("item " + i + ":" + item);

    if (i++ > 30)
    {
      break;
    }
  }
}
finally
{
  close(iterator);
}

...

private static <T> void close(T value)
  throws IOException
{
  if (value instanceof Closeable)
  {
    Closeable closeable = (Closeable)value;

    closeable.close();
  }
}

Close will execute all required finally blocks. This way resources will be released.

To configure the yield processor a developer needs to reference Yield.jar in the build path, as it contains the @Yield annotation. For javac that's enough, as the compiler will find the annotation processor automatically.

Eclipse users need to open project properties and:

  • go to the "Java Compiler"/"Annotation Processing"
  • mark "Enable project specific settings"
  • select "Java Compiler"/"Annotation Processing"/"Factory Path"
  • mark "Enable project specific settings"
  • add Yield.jar to the list of "plug-ins and JARs that contain annotation processors".

At the end we want to point out that the @Yield annotation is syntactic sugar, but it's important in the way the foreach statement is important, as it helps to write concise and error-free code.

See also
  Yield feature in java implemented!
  Yield feature in java

Monday, 24 January 2011 10:23:53 UTC  #    Comments [2] -
Announce | Java | Thinking aloud | Tips and tricks
# Tuesday, 11 January 2011

We could not resist the temptation to implement the @Yield annotation that we described earlier.

The idea is rather clear, but people were saying that it's not an easy task to update the sources.

They were right!

The implementation has its price, as we were forced to access the JDK's javac compiler classes. As a result, at present, we don't support other compilers such as the EclipseCompiler. We shall look later at what can be done in this area.

At present, the annotation processor works perfectly when you run javac from the command line, from ant, or from another build tool.

Here is an example of how a method is refactored:

@Yield
public static Iterable<Long> fibonachi()
{
  ArrayList<Long> items = new ArrayList<Long>();

  long Ti = 0;
  long Ti1 = 1;

  while(true)
  {
    items.add(Ti);

    long value = Ti + Ti1;

    Ti = Ti1;
    Ti1 = value;
  }
}

And that's how we transform it:

@Yield()
public static Iterable<Long> fibonachi() {
  assert (java.util.ArrayList<Long>)(ArrayList<Long>)null == null : null;

  class $state$ implements java.lang.Iterable<Long>, java.util.Iterator<Long>, java.io.Closeable {

    public java.util.Iterator<Long> iterator() {
      if ($state$id == 0) {
        $state$id = 1;
        return this;
      } else return new $state$();
    }

    public boolean hasNext() {
      if (!$state$nextDefined) {
        $state$hasNext = $state$next();
        $state$nextDefined = true;
      }

      return $state$hasNext;
    }

    public Long next() {
      if (!hasNext()) throw new java.util.NoSuchElementException();

      $state$nextDefined = false;

      return $state$next;
    }

    public void remove() {
      throw new java.lang.UnsupportedOperationException();
    }

    public void close() {
      $state$id = 5;
    }

    private boolean $state$next() {
      while (true) switch ($state$id) {
      case 0:
        $state$id = 1;
      case 1:
        Ti = 0;
        Ti1 = 1;
      case 2:
        if (!true) {
          $state$id = 4;
          break;
        }

        $state$next = Ti;
        $state$id = 3;

        return true;
      case 3:
        value = Ti + Ti1;
        Ti = Ti1;
        Ti1 = value;
        $state$id = 2;

        break;
      case 4:
      case 5:
      default:
        $state$id = 5;

        return false;
      }
    }

    private long Ti;
    private long Ti1;
    private long value;
    private int $state$id;
    private boolean $state$hasNext;
    private boolean $state$nextDefined;
    private Long $state$next;
  }

  return new $state$();
}

Formatting is automatic, sorry, but anyway it's for diagnostics only. You will never see this code.

It's interesting to note that this implementation very precisely mimics the xslt state machine implementation we did back in 2008.

You can download YieldProcessor here. We hope that someone will find our solution very interesting.

Tuesday, 11 January 2011 16:08:41 UTC  #    Comments [0] -
Announce | Thinking aloud | Tips and tricks | xslt | Java
# Monday, 20 December 2010

Several times already we have wished to see a yield feature in Java, and every time we came to the same implementation: infomancers-collections. And every time we turned away with dissatisfaction and continued with regular iterators.

Why? Well, in spite of the fact that it's the best implementation of the feature we have seen, it's still too heavy, as it plays with Java byte code at run-time.

We never grasped why it's done this way, while there is post-compile-time annotation processing in Java.

If we were to implement the yield feature in Java, we would create a @Yield annotation and would demand a well defined code pattern like this:

@Yield
Iteratable<String> iterator()
{
  // This is part of pattern.
  ArrayList<String> list = new ArrayList<String>();

  for(int i = 0; i < 10; ++i)
  {
    // list.add() plays the role of yield return.
    list.add(String.valueOf(i));
  }

  // This is part of pattern.
  return list;
}

or

@Yield
Iterator<String> iterator()
{
  // This is part of pattern.
  ArrayList<String> list = new ArrayList<String>();

  for(int i = 0; i < 10; ++i)
  {
    // list.add() plays the role of yield return.
    list.add(String.valueOf(i));
  }

  // This is part of pattern.
  return list.iterator();
}

Note that the code will work correctly even if, by mischance, post-compile-time processing does not take place.

At post-compile time we would do all the required refactoring to turn these implementations into state machines, so the runtime would not contain any third-party components.

It's interesting to recall that we have also implemented a similar refactoring in pure xslt.

See What you can do with jxom.

Update: implementation can be found at Yield.zip

Monday, 20 December 2010 16:28:35 UTC  #    Comments [0] -
Java | Thinking aloud | Tips and tricks | xslt
# Saturday, 11 December 2010

We have a class Beans used to serialize a list of generic objects into xml. This is done like this:

public class Call
{
  public Beans input;
  public Beans output;
  ...
}

@XmlJavaTypeAdapter(value = BeanAdapter.class)
public class Beans
{
  public List<Object> bean;
}

Thanks to @XmlJavaTypeAdapter, we're able to write xml in whatever form we want.

When we're serializing a Call instance:

Call call = ...
Beans beans = ...;

call.setInput(beans);

JAXBContext context = ...;
Marshaller marshaler = context.createMarshaller();
ObjectFactory factory = ...;

marshaler.marshal(factory.createCall(call), result);

things work as expected, meaning that BeanAdapter is used during xml serialization. But if it happens that you want to serialize a Beans instance itself, you start getting problems with the serialization of unknown objects. That's because JAXB does not use BeanAdapter in this case.

We have found a similar case "How to assign an adapter to the root element?", unfortunately with no satisfactory explanation.

That is strange.

Saturday, 11 December 2010 08:48:00 UTC  #    Comments [0] -
Java | Thinking aloud
# Sunday, 07 November 2010

michaelhkay: Saxon 9.3 has been out for 8 days: only two bugs so far, one found by me. I think that's a record.

Not necessarily. We, for example, who use Saxon HE, have found nothing new in Saxon 9.3, while we expected to see xslt 3.0. Disappointed. No actual reason to migrate.

P.S. We were among the first to find early bugs in previous releases.

Sunday, 07 November 2010 09:07:11 UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, 02 November 2010

Reading individual papers of C++ WG, you can find the following one:

N3174 10-0164 To move or not to move Bjarne Stroustrup 2010-10-17 2010-10 Core

There, Bjarne Stroustrup thinks about issues with implicitly generated copy and move operations in C++.

It's always a pleasure to see how one can deal with a problem burdened with antagonisms. To advance his position Bjarne skilfully uses not only rational but also emotional argumentation:

...We may deem this “bad code that deserves to be broken” or “unrealistic”, but this example demonstrates that the problem with a generated move has an exact counterpart for copy (which we have lived with for 27 years)...

...In 1984, I missed the chance to protect us against copy and we have lived with the problems ever since. I should have instituted some rule along the lines “if a class has a destructor, no copy operations are generated” or “if a class has a pointer member, no copy operations are generated.”...

It's impossible to recall these numbers without shivering. :-)

Tuesday, 02 November 2010 10:16:08 UTC  #    Comments [0] -
Thinking aloud

We're following w3's "Bug 9069 - Function to invoke an XSLT transformation".

There, people argue about xpath API to invoke xslt transformations. Function should look roughly like this:

transform
(
  $node-tree as node()?,
  $stylesheet as item(),
  $parameters as XXX
) as node()

The discussion is spinning around the last argument: $parameters as XXX. Should it be an xml element describing parameters, a function returning values for parameter names, or some new type modelling immutable map?

What is most interesting in this discussion is the leak about plans to introduce a map type:

Comment 7 Michael Kay, 2010-09-14 22:46:58 UTC

We're currently talking about adding an immutable map to XSLT as a new data type (the put operation would return a new map). There appear to be a number of possible efficient implementations. It would be ideally suited for this purpose, because unlike the mechanism used for serialization parameters, the values can be any data type (including nodes), not only strings.

There is a hope that map will finally appear in xslt!

See also:
Bug 5630 - [DM] Tuples and maps,
Tuples and maps - Status: CLOSED, WONTFIX,
Map, based on immutable trees,
Maps in exslt2?

Tuesday, 02 November 2010 08:34:52 UTC  #    Comments [0] -
Thinking aloud | xslt
# Friday, 22 October 2010

In the previous post we have announced an API to parse a COBOL source into the cobolxom.

We exploited the incremental parser to build a grammar xml tree and then were planning to create an xslt transformation to generate cobolxom.

Now, we would like to declare that such xslt is ready.

At present all standard COBOL constructs are supported, but more tests are required. Preprocessor support is still in the todo list.

You may peek into examples of COBOL:

Cobol grammar:

And cobolxom:

While we were building the grammar-to-cobolxom stylesheet we asked ourselves whether the COBOL parsing could be done entirely in xslt. The answer is yes, so who knows, it might be that we shall turn this task into a pure xslt one. :-)

Friday, 22 October 2010 13:24:31 UTC  #    Comments [0] -
Announce | Incremental Parser | Thinking aloud | xslt
# Wednesday, 25 August 2010

Accidentally we have found that the implementations of String and StringBuilder have been considerably revised, while the public interface has remained the same.

public sealed class String
{
  private int m_arrayLength;
  private int m_stringLength;
  private char m_firstChar;
}

This layout dates back to .NET 1.0.

The VM, in fact, allocates more memory than is defined in the C# class, as &m_firstChar refers to an inline char buffer.

This way the string's buffer length and the string's length were two different values; StringBuilder exploited this fact and stored its content in a private string, which it modified in place.

In .NET 4, string is different:

public sealed class String
{
  private int m_stringLength;
  private char m_firstChar;
}

The memory footprint of such a structure is smaller, but the string's length must always be the same as its buffer's. In fact the layout of string is now the same as the layout of char[].

This modification led to a redesign of the StringBuilder implementation.

Earlier, StringBuilder looked like the following:

public sealed class StringBuilder
{
  internal IntPtr m_currentThread;
  internal int m_MaxCapacity;
  internal volatile string m_StringValue;
}

Notice that m_StringValue is used as a storage, and m_currentThread is used to preserve thread affinity of the internal string value.

Now, guys at Microsoft have decided to implement StringBuilder very differently:

public sealed class StringBuilder
{
  internal int m_MaxCapacity;
  internal int m_ChunkLength;
  internal int m_ChunkOffset;
  internal char[] m_ChunkChars;
  internal StringBuilder m_ChunkPrevious;
}

Inspection of this layout immediately reveals the implementation technique. It's a list of chunks. The instance itself holds the last chunk (most recently appended) and references the previous chunks through m_ChunkPrevious.

Characteristics of this design are:

  • while Length is small, performance is almost the same as it was earlier;
  • there are no more thread affinity checks;
  • Append() and ToString() work about as fast as in the old version;
  • Insert() in the middle works faster, as only a chunk has to be split and possibly reallocated (copied), instead of the whole string;
  • random access is fast at the end, O(1), and slows down as you approach the start, O(chunk-count).

Personally, we would select a slightly different design:

public sealed class StringBuilder
{
  private struct Chunk
  {
    public int length; // Chunk length.
    public int offset; // Chunk offset.
    public char[] buffer; 
  }

  private int m_MaxCapacity;

  // Alternatively, one can use
  // private List<Chunk> chunks;
  private int chunkCount; // Number of used chunks.
  private Chunk[] chunks; // Array of chunks except last.

  private Chunk last; // Last chunk.
  private bool nonHomogenous; // false if all chunks are of the same size.
}

This design has a better memory footprint, and random access time is O(1) when there were no inserts in the middle (nonHomogenous=false) and O(log(chunkCount)) after such inserts. All other characteristics are the same.
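
The random access figures above follow directly from the layout. Here is a minimal sketch of the lookup, written in Java purely for illustration of the algorithm; the fields mirror the design above in a simplified form, and homogeneous chunks are assumed to share one nominal size:

final class Chunks
{
  static final class Chunk
  {
    int offset;   // Absolute offset of the chunk within the whole string.
    int length;   // Number of used characters in the buffer.
    char[] buffer;
  }

  Chunk[] chunks;        // Chunks in offset order.
  int chunkCount;        // Number of used chunks.
  int chunkSize;         // Nominal chunk size.
  boolean nonHomogenous; // true after inserts in the middle.

  char charAt(int index)
  {
    if (!nonHomogenous)
    {
      // All chunks have the same size: O(1).
      return chunks[index / chunkSize].buffer[index % chunkSize];
    }

    // Otherwise a binary search over chunk offsets: O(log chunkCount).
    int low = 0;
    int high = chunkCount - 1;

    while(low < high)
    {
      int middle = (low + high + 1) >>> 1;

      if (chunks[middle].offset <= index)
      {
        low = middle;
      }
      else
      {
        high = middle - 1;
      }
    }

    Chunk chunk = chunks[low];

    return chunk.buffer[index - chunk.offset];
  }
}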

Wednesday, 25 August 2010 09:36:55 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Thursday, 15 July 2010

We have run into another xslt bug, which depends on several independent circumstances and often behaves differently when observed. That's clearly a Heisenbug.

Xslt designers failed to realize that the syntactic sugar they introduced into xpath can turn into obscure bugs. Well, it's easy to be wise afterwards...

To the point.

Consider you have a sequence consisting of text nodes and elements, and now you want to "normalize" this sequence wrapping adjacent text nodes into separate elements. The following stylesheet is supposed to do the work:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
  exclude-result-prefixes="xs t">

  <xsl:template match="/">
    <xsl:variable name="nodes" as="node()*">
      <xsl:text>Hello, </xsl:text>
      <string value="World"/>
      <xsl:text>! </xsl:text>
      <xsl:text>Well, </xsl:text>
      <string value="hello"/>
      <xsl:text>, if not joking!</xsl:text>
    </xsl:variable>
 
    <result>
      <xsl:sequence select="t:normalize($nodes)"/>
    </result>
  </xsl:template>

  <xsl:function name="t:normalize" as="node()*">
    <xsl:param name="nodes" as="node()*"/>

    <xsl:for-each-group select="$nodes" group-starting-with="*">
      <xsl:variable name="string" as="element()?" select="self::string"/>
      <xsl:variable name="texts" as="node()*"
        select="current-group() except $string"/>

      <xsl:sequence select="$string"/>

      <xsl:if test="exists($texts)">
        <string value="{string-join($texts, '')}"/>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:function>

</xsl:stylesheet>

We're expecting the following output:

<result>
  <string value="Hello, "/>
  <string value="World"/>
  <string value="! Well, "/>
  <string value="hello"/>
  <string value=", if not joking!"/>
</result>

But often we're getting other results, like:

<result>
  <string value="Hello, "/>
  <string value="World"/>
  <string value="Well, ! "/>
  <string value="hello"/>
  <string value=", if not joking!"/>
</result>

Such output may seriously confuse, unless you recall the rule for the xpath except operator:

The except operator takes two node sequences as operands and returns a sequence containing all the nodes that occur in the first operand but not in the second operand.

... these operators eliminate duplicate nodes from their result sequences based on node identity. The resulting sequence is returned in document order..

...
The relative order of nodes in distinct trees is stable but implementation-dependent

These words mean that the result sequence may be ordered very differently from the original sequence.

In contrast, if we change the $texts definition to:

<xsl:variable name="texts" as="node()*"
  select="current-group()[not(. is $string)]"/>

then the result becomes stable, but less clear.

See also Xslt Heisenbug

Thursday, 15 July 2010 08:22:13 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | xslt
# Sunday, 11 July 2010

It does not matter that DataBindExtender does not look usual in ASP.NET. It turns out to be so handy that built-in data binding is no longer considered an option.

After a short try, you understand that people tried very hard and have invented many controls and methods like ObjectDataSource, FormView, Eval(), and Bind(), with an outcome that is very specific and limited.

In contrast, DataBindExtender:

  • performs two way or one way data binding of any business data property to any control property;
  • converts the value before it's passed to the control, or into the business data;
  • validates the value.

See an example:

<asp:TextBox id=Field8 EnableViewState="false" runat="server"></asp:TextBox>
<bphx:DataBindExtender runat='server'
  EnableViewState='false'
  TargetControlID='Field8'
  ControlProperty='Text'
  DataSource='<%# Import.ClearingMemberFirm %>'
  DataMember='Id'
  Converter='<%# Converters.AsString("XXXXX", false) %>'
  Validator='<%# (extender, value) => Functions.CheckID(value as string) %>'/>

Here, besides a regular two way data binding of the property Import.ClearingMemberFirm.Id to the property Field8.Text, we format (parse) the value with Converters.AsString("XXXXX", false), and finally validate the input value with a lambda function (extender, value) => Functions.CheckID(value as string).

DataBindExtender also works well in template controls like asp:Repeater, asp:GridView, and so on. Having your business data available, you may reduce the size of the ViewState with EnableViewState='false'. This way DataBindExtender brings page development closer to the pattern called MVC.

Recently, we have found that it's also useful to have a way to run a javascript during the page load (e.g. you want to attach some client side event, or register a component). DataBindExtender provides this with OnClientInit property, which is a javascript to run on a client, where this refers to a DOM element:

... OnClientInit='$addHandler(this, "change", function() { handleEvent(event, "Field8"); } );'/>

allows us to attach onchange javascript event to the asp:TextBox.

So, meantime we're very satisfied with what we can achieve with DataBindExtender. It's more than JSF allows, and much stronger and neater than what ASP.NET has provided.

The sources can be found at DataBindExtender.cs

Sunday, 11 July 2010 07:07:03 UTC  #    Comments [4] -
ASP.NET | Thinking aloud | Tips and tricks
# Monday, 05 July 2010

Lately, we have found that we've become accustomed to declaring C# local variables using var:

var exitStateName = exitState == null ? "" : exitState.Name;
var rules = Environment.NavigationRules;
var rule = rules[caller.Name];
var flow = rule.NavigationCases[procedure.OriginExitState];

This makes code cleaner, and in the presence of a good IDE still allows one to figure out types very easily.

We, however, found that var tends to have exceptions in its use. E.g. for some reason most boolean locals in our code tend to remain explicitly typed (a matter of taste?):

bool succeed = false;

try
{
  ...

  succeed = true;
}
finally
{
  if (!succeed)
  {
    ...
  }
}

Also, the explicit type often survives in for, but not in foreach:

for(int i = 0; i < sourceDataMapping.Length; ++i)
{
  ...
}

foreach(var property in properties)
{
  ...
}

In addition, var has some limitations, as one cannot easily initialize such a local with null. Of the following we prefer the first approach:

  • IWindowContext context = null;
  • var context = (IWindowContext)null;
  • var context = null as IWindowContext;
  • var context = default(IWindowContext);

We might need to figure out a consistent code style for var. It might be like this:

  • Numeric, boolean and string locals should use an explicit type;
  • Try to avoid locals initialized with null or without an initializer, or use an explicit type if such a variable cannot be avoided;
  • Use var in all other cases.

Another code style could be like this:

  • For the consistency, completely avoid the use of keyword var.
Monday, 05 July 2010 09:09:26 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Tuesday, 22 June 2010

Recently we raised a question about serialization of ASPX output in xslt.

The question went like this:

What's the recommended way of ASPX page generation?
E.g.:

------------------------
 <%@ Page AutoEventWireup="true"
   CodeBehind="CurMainMenuP.aspx.cs"
   EnableSessionState="True"
   Inherits="Currency.CurMainMenuP"
   Language="C#"
   MaintainScrollPositionOnPostback="True"
   MasterPageFile="Screen.Master" %>

<asp:Content ID="Content1" runat="server" ContentPlaceHolderID="Title">CUR_MAIN_MENU_P</asp:Content>

<asp:Content ID="Content2" runat="server" ContentPlaceHolderID="Content">
  <span id="id1222146581" runat="server"
    class="inputField system UpperCase" enableviewstate="false">
    <%# Dialog.Global.TranCode %>
  </span>
  ...
------------------------

Notice the aspx page directives, data binding expressions, and prefixed tag names without namespace declarations.

There was a whole range of expected answers. We, however, looked to see whether somebody had already dealt with the task and had a ready solution at hand.

In general it seems that the xslt community is very angry about ASPX: both the format and the technology. Well, let's put this aside.

The task of producing ASPX, which is almost xml, is not solvable if you stay with a pure xml serializer. Xslt's xsl:character-map does not work at all. In fact it looks like a childish attempt to address the problem, as it does not support character escapes but only grabs characters and substitutes them with strings.

We have decided to create an ASPX serializer API producing the required output text. This way you use <xsl:output method="text"/> to generate ASPX pages.

With this goal in mind we have defined a little xml schema to describe ASPX irregularities in xml form. These are:

  • <xs:element name="declared-prefix"> - to describe known prefixes, which should not be declared;
  • <xs:element name="directive"> - to describe directives like <%@ Page %>;
  • <xs:element name="content"> - a transparent content wrapper;
  • <xs:element name="entity"> - to issue xml entity;
  • <xs:element name="expression"> - to describe aspx expression like <%# Eval("A") %>;
  • <xs:element name="attribute"> - to describe an attribute of the parent element.

This approach greatly simplified the ASPX generation process for us.

The API includes:

Tuesday, 22 June 2010 10:25:41 UTC  #    Comments [0] -
Announce | ASP.NET | Thinking aloud | Tips and tricks | xslt
# Tuesday, 15 June 2010

In previous posts we were complaining about problems with the JSF to ASP.NET migration. Let's point out another one.

Consider that you have an input field, whose value should be validated:

<input type="text" runat="server" ID="id1222146409" maxlength="4"/>
<bphx:DataBindExtender runat="server"
  TargetControlID="id1222146409" ControlProperty="Value"
  DataSource="<%# Import.AaControlAttributes %>"
  DataMember="UserEnteredTrancode"/>

Here we have an input control, whose value is bound to the Import.AaControlAttributes.UserEnteredTrancode property. But what is missing is value validation. Somewhere we have a function that can answer the question whether the value is valid. It should be called like this: Functions.IsTransactionCodeValid(value).

Staying within standard components we can use a custom validator on the page:

<asp:CustomValidator runat="server"
  ControlToValidate="id1222146409"
  OnServerValidate="ValidateTransaction"
  ErrorMessage="Invalid transaction code."/>

and add the following code-behind:

protected void ValidateTransaction(object source, ServerValidateEventArgs args)
{
  args.IsValid = Functions.IsTransactionCodeValid(args.Value);
}

This approach works, however it pollutes the code-behind with many very similar methods. The problem is that in most cases the validation rules are not a property of the page but of the data model. That's why page validation methods just forward the check elsewhere.

While thinking about how to simplify the code we came up with a more concise and short way to express validators, namely using lambda functions. To that end we have introduced a Validator property of type ValueValidator on DataBindExtender, where

/// <summary>A delegate to validate values.</summary>
/// <param name="extender">An extender instance.</param>
/// <param name="value">A value to validate.</param>
/// <returns>true for valid value, and false otherwise.</returns>
public delegate bool ValueValidator(DataBindExtender extender, object value);

/// <summary>An optional data member validator.</summary>
public virtual ValueValidator Validator { get; set; }

With this new property the page markup looks like this:

<input type="text" runat="server" ID="id1222146409" maxlength="4"/>
<bphx:DataBindExtender runat="server"
  TargetControlID="id1222146409" ControlProperty="Value"
  DataSource="<%# Import.AaControlAttributes %>"
  DataMember="UserEnteredTrancode"
  Validator='<%# (extender, value) => Functions.IsTransactionCodeValid(value as string) %>'
  ErrorMessage="Invalid transaction code."/>

This is almost like an event handler, however it allowed us to call data model validation logic without unnecessary code-behind.

The updated DataBindExtender can be found at DataBindExtender.cs.

Tuesday, 15 June 2010 06:36:44 UTC  #    Comments [0] -
ASP.NET | Thinking aloud | Tips and tricks
# Thursday, 10 June 2010

Being well behind the latest news and traps of ASP.NET, we readily fall into each problem. :-)

This time it's a script injection during data binding.

In JSF there is a component to output data called h:outputText. Its use is like this:

<span jsfc="h:outputText" value="#{myBean.myProperty}"/>

The output is a span element with the data bound value embedded into its content. The natural alternative in ASP.NET seems to be an asp:Label control:

<asp:Label runat="server" Text="<%# Eval("MyProperty") %>"/>

This almost works except that the h:outputText escapes data (you may override this and specify attribute escape="false"), and asp:Label never escapes the data.

This looks like a very serious omission in ASP.NET (in fact very close to a security hole). What are the chances that, when you're creating a new page that uses data binding, you will not forget to fix the code that the wizard created for you and change it to:

<asp:Label runat="server" Text="<%# Server.HtmlEncode(Eval("MyProperty")) %>"/>

Eh? Think what will happen if MyProperty returns text that looks like a script (e.g.: <script>alert(1)</script>), while you just wanted to output a label.

To address the issue we've also introduced a property Escape into DataBindExtender. So at present we have a code like this:

<asp:Label runat="server" ID="MyLabel"/>
<bphx:DataBindExtender runat="server" TargetControlID="MyLabel"
  ControlProperty="Text" ReadOnly="true" Escape="true"
  DataSource="<%# MyBean %>" DataMember="MyProperty"/>

See also: A DataBindExtender, Experience of JSF to ASP.NET migration

Thursday, 10 June 2010 13:06:19 UTC  #    Comments [0] -
ASP.NET | Thinking aloud | Tips and tricks
# Saturday, 05 June 2010

After struggling with ASP.NET data binding we found no other way but to introduce our little extender control to address the issue.

We were trying to be minimalistic: to introduce two way data binding and to support data conversion. This way the extender control (called DataBindExtender) has the following page syntax:

<asp:TextBox id=TextBox1 runat="server"></asp:TextBox>
<cc1:DataBindExtender runat="server"
  DataSource="<%# Data %>"
  DataMember="ID"
  TargetControlID="TextBox1"
  ControlProperty="Text" />

Two way data binding is provided with the DataSource object (notice data binding over this property) and a DataMember property from one side, and TargetControlID and ControlProperty from the other side. DataBindExtender supports a Converter property of type TypeConverter to support custom converters.

DataBindExtender is based on the AjaxControlToolkit.ExtenderControlBase class and implements System.Web.UI.IValidator. ExtenderControlBase makes implementation of extenders extremely easy, while IValidator plugs naturally into page validation (Validate method, Validators collection, ValidationSummary control).

The good point about extenders is that they are not visible in the designer, while their properties are exposed on the extended control itself. The disadvantage is that this requires the Ajax Control Toolkit, and also a ScriptManager component on the page.

To simplify the use, DataBindExtender gets data from the control and puts the value into the data source in the Validate method, and puts data into the control in the OnPreRender method; thus no specific action is required to perform data binding.

Source for the DataBindExtender is DataBindExtender.cs.

Saturday, 05 June 2010 11:22:03 UTC  #    Comments [0] -
ASP.NET | Thinking aloud | Tips and tricks
# Saturday, 29 May 2010

We used to think that ASP.NET is way more powerful than JSF. It might still be true, but not when you are accustomed to JSF and spoiled by its code practices...

Looking at both technologies from a greater distance, we now realize that they give almost the same level of comfort during development, but they are different. You can feel this after you have worked for some time with one technology and now have to implement a similar solution in the other one. That is where we find ourselves at present.

The funny thing is that we did expect some problems, but in a different place. Indeed, both ASP.NET and JSF are means to define a page layout and to map input and output of business data. While with the presentation (controls, their compositions, masters, styles and so on) you can find more or less equal analogies, the differences in the implementation of data binding are a kind of pain.

We have found that data binding in ASP.NET is somewhat awkward. Its Eval and Bind are bearable in simple cases but almost unusable when your business data is less trivial, or if you have to apply custom data formatting.

In JSF, with its Expression Language, we can perform two way data binding for rather complex properties like ${data.items[index + 5].property}, create property adapters ${my:asSomething(data.bean, "property").Value}, or add standard or custom property converters. In contrast, data binding in ASP.NET is limited to a simple property path (no expressions are supported), and no custom formatters are supported (try to format a number as a telephone number).

Things work well when you're designing an ASP.NET application from scratch, as you naturally avoid pitfalls, however when you have existing business logic and need to expose it to the web, you have no other way but to write a lot of code-behind just to smooth out the problems that ASP.NET exhibits.

Another solution would be to design something like an extender control that would attach more proper data binding and formatting facilities to control properties. That would allow making page definitions in a more declarative way, like what we have now in JSF.

Saturday, 29 May 2010 14:16:05 UTC  #    Comments [0] -
ASP.NET | JSF and Facelets | Thinking aloud
# Sunday, 23 May 2010

While porting a solution from JSF to ASP.NET we have seen an issue with synchronization of access to data stored in a session from multiple requests.

Consider a case when you store a business object in a session.

Going through the request lifecycle we observe that this business object may be accessed at different stages: data binding, postback event handler, security filters, other.

Usually this business object is mutable and does not assume concurrent access. Browsers, however, may easily issue multiple requests to the same session at the same time. In fact, such behaviour is not even an exception, as browsers nowadays often send concurrent requests.

In JSF we're using a sync object, which is part of the business object itself; we lock it at the beginning and unlock it at the end of a request (see the sketch after the list). This works perfectly as JSF guarantees that:

  • the lock is released after it's acquired (we use a request scope bean with @PostConstruct and @PreDestroy annotations to lock and unlock);
  • both lock and unlock take place in the same thread.
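
For reference, a minimal sketch of such a request scope bean, assuming the session scoped business object exposes a ReentrantLock; the names here are illustrative, not the actual code:

import java.util.concurrent.locks.ReentrantLock;

import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;

public class RequestLockBean
{
  // A session scoped business object, injected by JSF.
  private BusinessObject data;

  @PostConstruct
  public void acquire()
  {
    data.getLock().lock();
  }

  @PreDestroy
  public void release()
  {
    data.getLock().unlock();
  }

  public void setData(BusinessObject data)
  {
    this.data = data;
  }

  // A stand-in for the real business object.
  public static class BusinessObject
  {
    private final ReentrantLock lock = new ReentrantLock();

    public ReentrantLock getLock()
    {
      return lock;
    }
  }
}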

ASP.NET, in contrast, tries to be more asynchronous, and allows different stages of a request to take place in different threads. This can be seen indirectly in the documentation, which does not give any commitments in this regard, and by code inspection, where you can see that a request can begin in one thread, while a next stage can be queued for execution in another thread.

In addition, ASP.NET does not guarantee that if BeginRequest has been executed then EndRequest will also run.

The conclusion is that we should not use locks to synchronize access to the same session object, but rather try to invent other means to avoid data races.

Update: msdn states:

Concurrent Requests and Session State

Access to ASP.NET session state is exclusive per session, which means that if two different users make concurrent requests, access to each separate session is granted concurrently. However, if two concurrent requests are made for the same session (by using the same SessionID value), the first request gets exclusive access to the session information. The second request executes only after the first request is finished. (The second session can also get access if the exclusive lock on the information is freed because the first request exceeds the lock time-out.)

This means that the required synchronization is already built into ASP.NET. That's good.

Sunday, 23 May 2010 12:22:35 UTC  #    Comments [0] -
ASP.NET | Thinking aloud
# Friday, 14 May 2010

We have implemented the report parser in C#. Because things are now spun around C#, the schema definition has changed.

We started from classes defining a report definition tree, annotated these classes for xml serialization, and, finally, produced an xml schema for such a tree. So, at present, it is not an xml schema with annotations but a separate xml schema.

In addition we have defined APIs:

  • to enumerate report data (having a report definition and report data one can get an IEnumerable<ViewValue> to iterate the report data in structured form);
  • to read the report through XmlReader, which allows, for example, having the report as input for an xslt transformation;
  • to write the report directly into XmlWriter.

An example of report definition as C# code is: MyReport.cs. The very same report definition but serialized into xml is my-report.xml. A generated xml schema for a report definition is: schema0.xsd.

The good point about this solution is that it's already flexible enough to describe every report layout we have at hand, and it's extendable. Our measurements show that report parsing is extremely fast and has a very small memory footprint due to the forward only nature of report definitions.

From the design point of view report definition is a view of original text data with view info attached.

At present we have defined following views:

  • Element - a named view to generate output from a content view;
  • Content - a view to aggregate other views together;
  • Choice - a view to produce output from one of content views;
  • Sequence - a view to sequence input view by key expressions, and to attach an index to each sequence item;
  • Iterator - a view to generate output from input view while some condition is true, and to attach an iteration index to each part of output view;
  • Page - a view to remove page headers and footers in the input view, and to attach an index to each page;
  • Compute - a named view to produce result of evaluation of expression as output view;
  • Data - a named view to produce output value from some bounds of input view, and optionally to convert, validate and format the value.

To specify details of definitions there are:

  • expressions to deal with integers: Add, Div, Integer, MatchProperty, Max, Min, Mod, Mul, Neg, Null, Sub, VariableRef, ViewProperty, Case;
  • conditions to deal with booleans: And, EQ, GE, GT, IsMatch, LE, LT, NE, Not, Or.

At present there is no specification of report definitions. Probably, the most complex part is to create such a spec for a user without deep knowledge. At present, our idea is that one should use the xml schema (we should polish the generated schema) for the report definition, and a schema aware editor to build report definitions. That's a very robust approach working perfectly with languages xom.

C# sources can be found at: ReportLayout.zip including report definition classes and a sample report.

Friday, 14 May 2010 12:45:42 UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Sunday, 09 May 2010
Ribbon of Saint George

We're facing the task of parsing reports produced by legacy applications and converting them into a structured form, e.g. into xml. These xml files can be processed further with up-to-date tools to produce good looking reports.

The reports at hand are of very different structure and size: from a couple of KB to several GB. The good part is that they mostly have a tabular form, so it's easy to think of a specific parser for each report type.

Our goal is to create an environment where less qualified persons could create and manage such parsers, and only rarely engage someone to handle the nontrivial cases.

Our analysis has shown that it's possible to write such parser in almost any language: xslt, C#, java.

Our approach was to create xml schema annotations that on one side define a data structure, and on the other map the report layout. Then we're able to create an xslt that will generate either an xslt, C#, or java parser according to the schema definitions. Because of languages xom, providing an XML Object Model and serialization stylesheets for C# and java, it does not really matter whether we generate xslt or C#/java, as the code will look the same.

The approach we're going to use to describe reports is not as powerful as conventional parsers. Its virtue, however, is simplicity of specification.

Consider a report sample (a data to extract is in bold):

1 TITLE ...                    PAGE:            1
 BUSINESS DATE: 09/30/09   ... RUN DATE: 02/23/10
 CYCLE : ITD      RUN: 001 ... RUN TIME: 09:22:39

        CM         BUS   ...
  CO    NBR  FRM   FUNC  ...
 ----- ----- ----- -----  
 XXX   065   065   CLR   ...
 YYY   ...
...
1 TITLE ...                    PAGE:            2
 BUSINESS DATE: 09/30/09   ... RUN DATE: 02/23/10
 CYCLE : ITD      RUN: 001 ... RUN TIME: 09:22:39

        CM         BUS   ...
  CO    NBR  FRM   FUNC  ...
 ----- ----- ----- -----  
 AAA   NNN   MMM   PPP   ...
 BBB   ...
...

* * * * *  E N D   O F   R E P O R T  * * * * *

We're approaching the report through a sequence of views (filters) of this report. Each view localizes some report data either for subsequent filtering or for the extraction of the final data.

Looking into the example one can build following views of the report:

  1. View of data before the "E N D   O F   R E P O R T" line.
  2. View of remaining data without page headers and footers.
  3. Views of table rows.
  4. Views of cells.

A sequence of filters allows us to build a pipeline of transformations of original text. This also allows us to generate a clean xslt, C# or java code to parse the data.

At first, our favorite language for such a parser was xslt. Unfortunately, we're dealing with the Saxon xslt implementation, which is not very strong in streaming processing. Without a couple of extension functions to prevent caching, it tends to cache the whole input in memory, which is not acceptable.

At present we have decided to start from C# code, which is pure C# naturally. :-)

The code is still in development, but at present we would like to share the xml schema annotations describing the report layout: report-mapping.xsd, and a sample report description: test.xsd.

Sunday, 09 May 2010 05:18:57 UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Saturday, 01 May 2010

To see that the problem with Generator functions in xslt is a bit more complicated, compare two functions.

The first one is quoted from the earlier post:

  <xsl:function name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value"/>
    <xsl:sequence select="t:generate($value * 2)"/>
  </xsl:function>

It does not work in Saxon: crashes with out of memory.

The second one is slightly modified version of the same function:

  <xsl:function name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value + 0"/>
    <xsl:sequence select="t:generate($value * 2)"/>
  </xsl:function>

It works without problems. In the first case Saxon decides to cache all of the function's output; in the second case it decides to evaluate the data lazily on demand.

It seems that optimization algorithms implemented in Saxon are so plentiful and complex that at times they fool one another. :-)

See also: Generator functions

Saturday, 01 May 2010 07:18:24 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | xslt
# Friday, 23 April 2010

There are some complications with the streamed tree that we have implemented in saxon. They are due to the fact that only a view of the input data is available at any time. Whenever you access some element that is not available you get an exception.

Consider an example. We have a log created with java logging. It looks like this:

<log>
  <record>
    <date>...</date>
    <millis>...</millis>
    <sequence>...</sequence>
    <logger>...</logger>
    <level>INFO</level>
    <class>...</class>
    <method>...</method>
    <thread>...</thread>
    <message>...</message>
  </record>
  <record>
    ...
  </record>
  ...

We would like to write an xslt that returns a page of log as html:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xs t">

  <xsl:param name="start-page" as="xs:integer" select="1"/>
  <xsl:param name="page-size" as="xs:integer" select="50"/>

  <xsl:output method="xhtml" byte-order-mark="yes" indent="yes"/>

  <!-- Entry point. -->
  <xsl:template match="/log">
    <xsl:variable name="start" as="xs:integer"
      select="($start-page - 1) * $page-size + 1"/>
    <xsl:variable name="records" as="element()*"
      select="subsequence(record, $start, $page-size)"/>

    <html>
      <head>
        <title>
          <xsl:text>A log file. Page: </xsl:text>
          <xsl:value-of select="$start-page"/>
        </title>
      </head>
      <body>
        <table border="1">
          <thead>
            <tr>
              <th>Level</th>
              <th>Message</th>
            </tr>
          </thead>
          <tbody>
            <xsl:apply-templates mode="t:record" select="$records"/>
          </tbody>
        </table>
      </body>
    </html>
  </xsl:template>

  <xsl:template mode="t:record" match="record">
    <!-- Make a copy of record to avoid streaming access problems. -->
    <xsl:variable name="log">
      <xsl:copy-of select="."/>
    </xsl:variable>

    <xsl:variable name="level" as="xs:string"
      select="$log/record/level"/>
    <xsl:variable name="message" as="xs:string"
      select="$log/record/message"/>

    <tr>
      <td>
        <xsl:value-of select="$level"/>
      </td>
      <td>
        <xsl:value-of select="$message"/>
      </td>
    </tr>
  </xsl:template>

</xsl:stylesheet>

This code does not work. Guess why? Yes, it's subsequence(), which is too greedy. It always wants to know what's the next node, so it naturally skips a content of the current node. Algorithmically, such saxon code could be rewritten, and could possibly work better also in modes other than streaming.

A viable workaround, which does not use subsequence, looks rather nontrivial:

<!-- Entry point. -->
<xsl:template match="/log">
  <xsl:variable name="start" as="xs:integer"
    select="($start-page - 1) * $page-size + 1"/>
  <xsl:variable name="end" as="xs:integer"
    select="$start + $page-size"/>

  <html>
    <head>
      <title>
        <xsl:text>A log file. Page: </xsl:text>
        <xsl:value-of select="$start-page"/>
      </title>
    </head>
    <body>
      <table border="1">
        <thead>
          <tr>
            <th>Level</th>
            <th>Message</th>
          </tr>
        </thead>
        <tbody>
          <xsl:sequence select="
            t:generate-records(record, $start, $end, ())"/>
        </tbody>
      </table>
    </body>
  </html>
</xsl:template>

<xsl:function name="t:generate-records" as="element()*">
  <xsl:param name="records" as="element()*"/>
  <xsl:param name="start" as="xs:integer"/>
  <xsl:param name="end" as="xs:integer?"/>
  <xsl:param name="result" as="element()*"/>

  <xsl:variable name="record" as="element()?" select="$records[$start]"/>

  <xsl:choose>
    <xsl:when test="(exists($end) and ($start > $end)) or empty($record)">
      <xsl:sequence select="$result"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Make a copy of record to avoid streaming access problems. -->
      <xsl:variable name="log">
        <xsl:copy-of select="$record"/>
      </xsl:variable>

      <xsl:variable name="level" as="xs:string"
        select="$log/record/level"/>
      <xsl:variable name="message" as="xs:string"
        select="$log/record/message"/>

      <xsl:variable name="next-result" as="element()*">
        <tr>
          <td>
            <xsl:value-of select="$level"/>
          </td>
          <td>
            <xsl:value-of select="$message"/>
          </td>
        </tr>
      </xsl:variable>

      <xsl:sequence select="
        t:generate-records
        (
          $records,
          $start + 1,
          $end,
          ($result, $next-result)
        )"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>

Here we observed the greediness of saxon, which tried to consume more input than required too early. In other cases we have seen that it may defer actual data access to the point when there is no data anymore.

So, without tuning internal saxon logic it's possible but not easy to write stylesheets that exploit streaming features.

P.S. Updated sources are at streamedtree.zip

Friday, 23 April 2010 10:12:38 UTC  #    Comments [0] -
Thinking aloud | xslt
# Thursday, 22 April 2010

At some point we needed to have an array with volatile elements in java.

We knew that such a beast is not found in the java world. So we searched the Internet and found answers that are so wrong, and introduce such obscure threading bugs, that the guys who provided them had better hide them and run immediately to fix their buggy programs...

The first one is Volatile arrays in Java. They suggest such solution:

volatile int[] arr = new int[...];

...
arr[4] = 100;
arr = arr;

The number two: What Volatile Means in Java

A guy assures that this code works:

Fields:

int answer = 0;
volatile boolean ready = false;

Thread1:

answer = 42;
ready = true;

Thread2:

if (ready)
{
  print(answer);
}

They are very wrong! Non volatile access can be reordered by the implementation. See Java's Threads and Locks:

The rules for volatile variables effectively require that main memory be touched exactly once for each use or assign of a volatile variable by a thread, and that main memory be touched in exactly the order dictated by the thread execution semantics. However, such memory actions are not ordered with respect to read and write actions on nonvolatile variables.

They probably thought of locks when they argued about volatiles:

a lock action acts as if it flushes all variables from the thread's working memory; before use they must be assigned or loaded from main memory.

P.S. They would do better to recommend AtomicReferenceArray.
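
For completeness, a minimal sketch of that safe alternative: AtomicIntegerArray for the int example above (AtomicReferenceArray would serve for object elements); per-element reads and writes get volatile semantics:

import java.util.concurrent.atomic.AtomicIntegerArray;

public class Data
{
  private final AtomicIntegerArray arr = new AtomicIntegerArray(10);

  public void writer()
  {
    // A volatile write of the element.
    arr.set(4, 100);
  }

  public int reader()
  {
    // A volatile read of the element.
    return arr.get(4);
  }
}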

Thursday, 22 April 2010 13:05:48 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Wednesday, 21 April 2010

When the time came to process big xml log files, we decided to implement a streamable tree in saxon the very same way it was implemented in .net eight years ago (see How would we approach to streaming facility in xslt).

It's interesting enough that the implementation is similar to that of the composable tree. There a node never stores a reference to its parent, while in the streamed tree no references to children are stored. This way only a limited subview of the tree is available at any time. The implementation does not support the preceding and preceding-sibling axes. Also, one cannot navigate to a node that is out of scope.

Implementation is external (there are no changes to saxon itself). To use it one needs to create an instance of DocumentInfo, which pulls data from XMLStreamReader, and to pass it as an input to a transformation:

Controller controller = (Controller)transformer;
XMLInputFactory factory = XMLInputFactory.newInstance();
StreamSource inputSource = new StreamSource(new File(input));
XMLStreamReader reader = factory.createXMLStreamReader(inputSource);
StaxBridge bridge = new StaxBridge();

bridge.setPipelineConfiguration(
  controller.makePipelineConfiguration());
bridge.setXMLStreamReader(reader);
inputSource = new DocumentImpl(bridge);

transformer.transform(inputSource, new StreamResult(output));

This helped us to format an xml log file of arbitrary size. An xslt like this can do the work:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xs">

  <xsl:template match="/log">
    <html>
      <head>
        <title>Log</title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="message">
   ...
  </xsl:template>

  <xsl:template match="message[@error]">
    ...
  </xsl:template>

  ...

</xsl:stylesheet>

Implementation can be found at: streamedtree.zip

Wednesday, 21 April 2010 07:10:34 UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Friday, 09 April 2010

By a generator we mean a function that produces an infinite output sequence for a particular input.

That's a rather theoretical question, as xslt does not allow infinite sequences, but look at the example:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt"
  exclude-result-prefixes="xs t">

  <xsl:template match="/">
    <xsl:variable name="value" as="xs:string" select="'10100101'"/>

    <xsl:variable name="values" as="xs:integer+" select="t:generate(1)"/>

    <!--<xsl:variable name="values" as="xs:integer+">
      <xsl:call-template name="t:generate">
        <xsl:with-param name="value" select="1"/>
      </xsl:call-template>
    </xsl:variable>-->

    <xsl:variable name="integer" as="xs:integer" select="
      sum
      (
        for $index in 1 to string-length($value) return
          $values[$index][substring($value, $index, 1) = '1']
      )"/>

    <xsl:message select="$integer"/>
  </xsl:template>

  <xsl:function name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value"/>
    <xsl:sequence select="t:generate($value * 2)"/>
  </xsl:function>

  <!--<xsl:template name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value"/>

    <xsl:call-template name="t:generate">
      <xsl:with-param name="value" select="$value * 2"/>
    </xsl:call-template>
  </xsl:template>-->

</xsl:stylesheet>

Here the logic uses such a generator and decides by itself where to break.

Should such code be valid?

From the algorithmic perspective the example should rather work, as the generator logic and its use are two separate concerns.

Friday, 09 April 2010 14:38:34 UTC  #    Comments [0] -
Thinking aloud | xslt

Lately, after playing a little with saxon tree models, we thought that the design would be cleaner and the implementation faster if NamePool were implemented differently.

Now, saxon is very pessimistic about java objects, thus it prefers to encode qualified names with integers. The encoding and decoding are done in the NamePool. Other parts of the code use these integer values.

Operations done over these integers are:

  • equality comparison of two such integers in order to check whether two qualified or extended names are equal;
  • getting different parts of a qualified name from the NamePool.

We would design this differently. We would:

  1. create a QualifiedName class to store all name parts.
  2. declare NamePool to create and cache QualifiedName instances.

This way:

  • equality comparison would be a reference comparison of two instances;
  • getting different parts of a qualified name would become a trivial getter;
  • contention on such a name pool would be lower.

That's the implementation we would propose: QualifiedName.java, NameCache.java
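
For illustration only (this is a simplified sketch, not the QualifiedName.java and NameCache.java linked above): the cache hands out canonical instances, so name equality becomes a reference comparison.

import java.util.concurrent.ConcurrentHashMap;

public final class QualifiedName
{
  public final String namespaceUri;
  public final String prefix;
  public final String localName;

  QualifiedName(String namespaceUri, String prefix, String localName)
  {
    this.namespaceUri = namespaceUri;
    this.prefix = prefix;
    this.localName = localName;
  }
}

final class NameCache
{
  private final ConcurrentHashMap<String, QualifiedName> names =
    new ConcurrentHashMap<String, QualifiedName>();

  public QualifiedName get(String namespaceUri, String prefix, String localName)
  {
    // A simplified composite key; a real cache would avoid string building.
    String key = namespaceUri + '|' + prefix + '|' + localName;
    QualifiedName name = names.get(key);

    if (name == null)
    {
      QualifiedName candidate =
        new QualifiedName(namespaceUri, prefix, localName);
      QualifiedName existing = names.putIfAbsent(key, candidate);

      name = existing == null ? candidate : existing;
    }

    return name;
  }
}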

Friday, 09 April 2010 13:05:30 UTC  #    Comments [0] -
Thinking aloud | xslt
# Thursday, 08 April 2010

Earlier, in the entry "Inline functions in xslt 2.1" we've described an implementation of xml tree model that may share subtrees among different trees.

This way, in a code:

<xsl:variable name="elements" as="element()*" select="..."/>

<xsl:variable name="result" as="element()">
  <result>
    <xsl:sequence select="$elements"/>
  </result>
</xsl:variable>

the implementation shares the internal representation between $elements and the subtree of $result. From the perspective of xslt they look like completely different subtrees with different node identities, which is in accordance with its view of the world.

After a short study we've decided to create a research implementation of this tree model in saxon. It took only a couple of days to introduce minimal changes to the engine, to refactor the linked tree into a new composable tree, and to perform some tests.

In many cases saxon has benefited immediately from this new tree model, in some other cases more tunings are required.

Our tests showed that this new tree performs better than the linked tree, but a little bit worse than the tiny tree. On the other hand, it's obvious that conventional code patterns avoid subtree copying, assuming it's an expensive operation, thus one should rethink some code practices to benefit from the composable tree.

Implementation can be downloaded at: saxon.composabletree.zip

Thursday, 08 April 2010 06:26:02 UTC  #    Comments [0] -
Announce | Thinking aloud | xslt
# Sunday, 04 April 2010

From the web we know that the xslt WG is now thinking about how to make xslt more friendly to huge documents. They will probably introduce some xslt syntax to allow an implementation to be ready for such processing.

They will probably introduce an indicator marking a specific mode for streaming. XPath in this mode will probably be restricted to some subset of it.

The funny part is that we have implemented similar feature back in 2002 in .net. It was called XPathForwardOnlyNavigator.

The implementation stored only several nodes at a time (the context node and its ancestors), and read data from XmlReader as needed. Thus one could navigate to ancestor elements, to children, and to the following siblings, but never to any previous node. When one tried to reach a node that was no longer available, we threw an exception.

It was simple, not perfect (too restrictive), but it was pluggable into .net's xslt, and allowed processing files of any size.

That old implementation looks very attractive even now in 2010. We expect that WG with their decisions will not rule out such or similar solutions, and will not force implementers to write alternative engine for xslt streaming.

Sunday, 04 April 2010 20:53:27 UTC  #    Comments [0] -
Thinking aloud | xslt
# Friday, 02 April 2010

Xslt 1.0 has been designed based on the best intentions. Xslt 2.0 got a legacy baggage.

If you're not entirely concentrated during translation of your algorithms into xslt 2.0 you can fall into a trap, as we did.

Consider a code snippet:

<xsl:variable name="elements" as="element()+">
  <xsl:apply-templates/>
</xsl:variable>

<xsl:variable name="converted-elements" as="element()+"
  select="$elements/t:convert(.)"/>

Looks simple, doesn't it?

Our intention was to get converted elements, which result from some xsl:apply-templates logic.

Well, this code works... but rather sporadically, as results are often in wrong order! This bug is very close to what is called a Heisenbug.

So, where is the problem?

Elementary, my dear Watson:

  1. xsl:apply-templates constructs a sequence of rootless elements.
  2. $elements/t:convert(.) converts elements and orders them in document order.

Here is a tricky part:

The relative order of nodes in distinct trees is stable but implementation-dependent...

Clearly each rootless element belongs to a unique tree.

Once we realized what the problem was, the code was immediately rewritten as:

<xsl:variable name="elements" as="element()+">
  <xsl:apply-templates/>
</xsl:variable>

<xsl:variable name="converted-elements" as="element()+" select="
  for $element in $elements return
    t:convert($element)"/>

P.S. Taking into account the size of our xslt code base, it took half an hour to localize the problem. Now, we're in a position to review all uses of slashes in xslt. How do you like it?

Friday, 02 April 2010 17:53:18 UTC  #    Comments [0] -
Thinking aloud | xslt
# Monday, 22 March 2010

Inline functions in xslt 2.1 often look like some strange aberration. Sure, there are very useful cases when they are delegates of program logic (e.g. comparators and filters), but often (probably more often) we can see that their use is to model data structures.

As an example, suppose you want to model a structure with three properties, say a, b, and c. You implement this by creating functions that wrap and unwrap the data:

function make-data($a as item(), $b as item(), $c as item()) as function() as item()+
{
  function() { $a, $b, $c }
}

function a($data as function() as item()+) as item()
{
  $data()[1]
}

function b($data as function() as item()+) as item()
{
  $data()[2]
}

function c($data as function() as item()+) as item()
{
  $data()[3]
}

Clever?

Sure, it is! Here, we have modeled a structure with the help of a sequence, which we have wrapped into a function item.

Alas, clever is not always good (often it's a sign of the bad). We just wanted to define a simple structure. What does it have in common with a function?

There is a distance between what we want to express, designing an algorithm, and what we see looking at the code. The greater the distance, the more effort is required to write and to read the code.

It would be so good to have a simpler way to express such a concept as a structure. Let's dream a little. Suppose you already have a structure, and just want to access its members. An idea we can think of is an xpath-like access method:

$data/a, $data/b, $data/c

But wait a second, doesn't $data look very much like an xml element, and aren't its accessors just node tests? That's correct, so the data constructor may coincide with the element constructor.

Then what are the pros and cons of using xml elements to model structures?

Pros are: existing xml type system, and sensibly looking code (you just understand that here we're constructing a structure).

Cons are: xml trees are implemented in a way that does not assume fast (from the performance perspective) composition, as when you construct a structure a copy of the data is made.

But here we observe that "implemented" is the key word in this context. If the xml tree implementation did not store a reference to the parent node, then subtrees could be composed very efficiently (note that the tree is assumed to be immutable). The parent node could be made available through a tree navigator, which would contain a reference to the node itself and to a parent tree navigator (or a child-parent map could be stored somewhere near the root).
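
A minimal sketch of that navigator idea (in Java, names are ours and for illustration only): nodes store only their children, so subtrees can be shared, while the parent axis comes from the navigator, which pairs a node with its parent navigator.

import java.util.List;

final class Node
{
  final String name;
  final List<Node> children; // Immutable; may be shared between trees.

  Node(String name, List<Node> children)
  {
    this.name = name;
    this.children = children;
  }
}

final class Navigator
{
  final Node node;
  final Navigator parent; // null for the root.

  Navigator(Node node, Navigator parent)
  {
    this.node = node;
    this.parent = parent;
  }

  Navigator parent()
  {
    return parent;
  }

  Navigator child(int index)
  {
    // The child navigator remembers where it came from.
    return new Navigator(node.children.get(index), this);
  }
}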

Such tree structure would probably help not only in this particular case but also with other conventional xslt code patterns.

P.S. Saxon probably could implement its NodeInfo this way.

Update: see also Custom tree model.

Monday, 22 March 2010 11:02:07 UTC  #    Comments [0] -
Thinking aloud | xslt
# Monday, 15 March 2010

A while ago we proposed introducing maps as built-in types into the xpath/xquery type system: Tuples and maps.

The suggestion was declined (probably our arguments were not convincing). We, however, still think that maps along with sequences are primitives required to build sensible (not obscure) algorithms. To see that a map is at times the best way to resolve a problem, we shall refer to a utility function to allocate names in scope. Its signature looks like this:

<!--
Allocates unique names in the form $prefix{number}?.
Note that prefixes may coincide.
$prefixes - a name prefixes.
$names - allocated names pool.
$name-max-length - a longest allowable name length.
Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
  <xsl:param name="prefixes" as="xs:string*"/>
  <xsl:param name="names" as="xs:string*"/>
  <xsl:param name="name-max-length" as="xs:integer?"/>

Just try to program it and you will find yourself coding something like the one defined at cobolxom.

To be efficient such maps should provide, at worst, a logarithmic operation complexity:

  • Access to the map through a key (and/or by index) - complexity is LnN;
  • Creation of a new map with added or removed item - complexity is LnN;
  • Construction of the map from ordered items - complexity is O(N);
  • Construction of the map from unordered items - complexity is O(N*LnN);
  • Total enumeration of all items - complexity is O(N*LnN);

These performance metrics are found in many functional and procedural implementations of the maps. Typical RB and AVL tree based maps satisfy these restrictions.

What we suggest is to introduce map implementation into the exslt2 (assuming inline functions are present). As a sketch we have implemented pure AVL Tree in Xslt 2.0:

We do not assume that an implementation like this should be used, but rather think of some extension function(s) that provide better performance.

What do you think?

Monday, 15 March 2010 13:59:19 UTC  #    Comments [1] -
Thinking aloud | xslt
# Sunday, 28 February 2010

The story about the immutable tree would not be complete without an xslt implementation. To make it possible one needs something to approximate tree nodes. You cannot implement such a construct efficiently in pure xslt 2.0 (it would be either inefficient or not pure).

To isolate the problem we have used a tuple interface:

  • tuple:ref($items as item()*) as item() - to wrap items into a tuple;
  • tuple:deref($tuple as item()?) as item()* - to unwrap items from a tuple;
  • tuple:is-same($first as item(), $second as item()) as xs:boolean - to test whether two tuples are the same.

and defined an inefficient implementation based on xml elements. Every other part of the code is a regular AVL algorithm implementation.

We want to stress again that even assuming there is a good tuple implementation, we would still prefer a built-in associative container. Why the heck do you need to include about 1000 lines of code just to use a map?

Source code is:

Sunday, 28 February 2010 19:28:07 UTC  #    Comments [0] -
Thinking aloud | xslt

We like Visual Studio very much and try to adopt new versions early.

Lately our VS use pattern has been centered around xml and xslt. In our opinion VS 2008 is the best xslt 2.0 editor we have ever seen, even with its lack of support for xslt 2.0 debugging.

Unfortunately, that is still true now that VS 2010 is almost out. VS 2008 is just 2 - 3 times faster. You can observe this working with xslt files like those in languages-xom.zip (1000 - 2000 rows). Things just become slow.

We still hope that VS 2010 will make a final effort to outdo what VS 2008 has already delivered.

Sunday, 28 February 2010 18:37:58 UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, 24 February 2010

Why do we return to this theme again?

Well, it's itching!

In cobolxom there is a utility function to allocate names in a scope. Its signature looks like this:

<!--
  Allocates unique names in the form $prefix{number}?.
  Note that prefixes may coincide.
    $prefixes - name prefixes.
    $names - a pool of already allocated names.
    $name-max-length - the longest allowable name length.
    Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
  <xsl:param name="prefixes" as="xs:string*"/>
  <xsl:param name="names" as="xs:string*"/>
  <xsl:param name="name-max-length" as="xs:integer?"/>

We have created several different implementations (all use recursion). Every implementation works fine for relatively small input sequences, say N < 100, but we have cases with about 10000 items on input. The algorithm's worst case complexity, in the absence of associative containers, is O(N*N), and be sure it's an O-o-o-oh... due to the xslt engine implementation.

If there were associative containers with efficient access (complexity O(log N)) and efficient construction of an updated container (also O(log N)), then the implementation would be straightforward and would have complexity O(N*log N).
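
To show what we mean, here is a minimal C# sketch of the same allocation (our illustration, not the cobolxom code): with a container that offers O(log N) lookup and insertion, the whole loop stays within O(N*log N).

using System;
using System.Collections.Generic;

static class NameAllocator
{
  public static IEnumerable<string> AllocateNames(
    IEnumerable<string> prefixes,
    IEnumerable<string> names,      // pool of already allocated names
    int? nameMaxLength)
  {
    // SortedSet gives O(log N) lookup and insertion, standing in for the
    // tree-based map we would like to have in xslt.
    var used = new SortedSet<string>(names, StringComparer.OrdinalIgnoreCase);

    foreach(var prefix in prefixes)
    {
      var name = Truncate(prefix, nameMaxLength);

      // Append a growing numeric suffix until a free name is found.
      for(var i = 1; !used.Add(name); ++i)
      {
        var suffix = i.ToString();

        name = Truncate(prefix, nameMaxLength - suffix.Length) + suffix;
      }

      yield return name;
    }
  }

  // Edge cases (e.g. a suffix longer than the allowed length) are ignored here.
  private static string Truncate(string value, int? maxLength) =>
    maxLength is int max && value.Length > max ?
      value.Substring(0, max) : value;
}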

Wednesday, 24 February 2010 07:34:07 UTC  #    Comments [0] -
Thinking aloud | xslt
# Sunday, 07 February 2010

Given:

public class N
{
  public readonly N next;
}

What needs to be done to construct a ring of N: n1 refers to n2, n2 to n3, ... nk to n1? Is it possible?
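
One possible answer, as a sketch (assuming we are allowed to add constructors to N): a readonly field can only be assigned in a constructor, so the whole ring has to be built within a single constructor chain that passes the first node down until the last node closes the ring on it.

public class N
{
  public readonly N next;

  // Builds a ring of 'count' nodes starting from this one.
  public N(int count)
  {
    next = count > 1 ? new N(count - 1, this) : this;
  }

  // Continues the ring; 'first' is the node created by the public constructor.
  private N(int count, N first)
  {
    next = count > 1 ? new N(count - 1, first) : first;
  }
}

// var ring = new N(5);   // ring.next.next.next.next.next == ring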

Sunday, 07 February 2010 07:57:08 UTC  #    Comments [2] -
Thinking aloud | Tips and tricks
# Saturday, 06 February 2010

To end with immutable trees, at least for now, we've implemented IDictionary<K, V>. It's named Map<K, V>. Functionally it looks very much like SortedDictionary<K, V>. There are some differences, however:

  • Map, in contrast to SortedDictionary, is very cheap to copy.
  • Because Map is based on an AVL tree, which is more rigorously balanced than an RB tree, it's a little faster asymptotically for lookup than SortedDictionary, and a little slower on modification.
  • Due to its storage structure (node + navigator), Map consumes less memory than SortedDictionary and is probably cheaper for the GC (simpler garbage graphs).
  • As the AVL tree stores left and right subtree sizes, in contrast to a "color" in an RB tree, we are able to index data in two ways: by integer index and by key value (see the sketch after this list).
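
A minimal sketch of the integer indexing (an illustration only, not the actual Map<K, V> sources): because each node stores its subtree size, the k-th smallest key is found in O(log N) without enumerating the tree.

public sealed class TreeNode<K>
{
  public K Key;
  public TreeNode<K> Left;
  public TreeNode<K> Right;
  public int Size = 1;   // number of nodes in the subtree; maintained on insertion (not shown)

  // Returns the key at the given in-order position;
  // the index is assumed to be valid (0 <= index < root.Size).
  public static K Select(TreeNode<K> node, int index)
  {
    while(true)
    {
      var leftSize = node.Left?.Size ?? 0;

      if (index < leftSize)
      {
        node = node.Left;
      }
      else if (index == leftSize)
      {
        return node.Key;
      }
      else
      {
        index -= leftSize + 1;
        node = node.Right;
      }
    }
  }
}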

Sources are:

Update:

It was impossible to withstand the temptation to run some primitive performance comparison. Map outperforms SortedDictionary both in population and in access. This does not agree with pure algorithm theory, but there might be other unaccounted factors: memory consumption, quality of implementation, and so on.

Program.cs is updated with measurements.

Update 2:

More accurate tests show that for some key types Map is faster, while for others SortedDictionary is faster. Usually Map is slower during population (a mutable AVL tree navigator may fix this). The odd thing is that Map<string, int> is faster than SortedDictionary<string, int> both for allocation and for access. See the excel report.

Update 3:

An interesting observation. The following table shows maximal and average tree heights for different tree sizes in AVL and RB trees after a random population:

Size     AVL Max  AVL Avg  RB Max  RB Avg
10             4     2.90       5    3.00
50             7     4.94       8    4.94
100            8     5.84       9    5.86
500           11     8.14      14    8.39
1000          12     9.14      16    9.38
5000          15    11.51      18   11.47
10000         16    12.53      20   12.47
50000         19    14.89      23   14.72
100000        20    15.90      25   15.72
500000        25    18.26      28   18.27
1000000       25    19.28      30   19.27

Here, in accordance with theory, the height of the AVL tree is smaller than the height of the RB tree. But most interesting is the depth of an "average" node. This value describes the number of steps required to find a random key. The RB tree is very close to, and often better than, the AVL tree in this regard.

Saturday, 06 February 2010 18:31:13 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks
# Wednesday, 03 February 2010

It was obvious as hell from day one of generics that obscure long names would appear once you start to parametrize your types. It would have been the easiest thing in the world to take care of this in advance. Alas, C# inherits C++'s bad practices.

Read Associative containers in a functional languages and Program.cs to see what we're talking about.

Briefly, there is a pair (string, int), which in C# should be declared as:

System.Collections.Generic.KeyValuePair<string, int>

Obviously we would like to write it in a shorter way. Here are our attempts, which fail:

1. Introduce generic alias Pair<K, V>:

using System.Collections.Generic;
using Pair<K, V> = KeyValuePair<K, V>;

2. Introduce a type alias for a generic type with specific type arguments.

using System.Collections.Generic;
using Pair = KeyValuePair<string, int>;

And this is the only one that works:

using Pair = System.Collections.Generic.KeyValuePair<string, int>;

Do you think it is bearable? Well, consider the following:

  • There is a generic type ValueNode<T>, where T should be Pair.
  • There is a generic type TreeNavigator<N>, where N should be ValueNode<Pair>.

The declarations look like this:

using Pair = System.Collections.Generic.KeyValuePair<string, int>;
using Node = NesterovskyBros.Collections.AVL.ValueNode<
  System.Collections.Generic.KeyValuePair<string, int>>;
using Navigator = NesterovskyBros.Collections.AVL.TreeNavigator<
  NesterovskyBros.Collections.AVL.ValueNode<
    System.Collections.Generic.KeyValuePair<string, int>>>;

Do you still think it is acceptable?

P.S. Legacy thinking led C#'s and java's designers to use the word "new" for object construction. It is not required at all. Consider new Pair("A", 1) vs Pair("A", 1). C++ prefers the second form; C# and java always use the first one.

Wednesday, 03 February 2010 11:59:19 UTC  #    Comments [1] -
Thinking aloud | Tips and tricks
# Wednesday, 27 January 2010

Continuing the post "Ongoing xslt/xquery spec update", we would like to articulate what options regarding associative containers we have in functional languages (e.g. xslt, xquery), assuming that variables are immutable and the implementation is efficient (in some sense).

There are three common implementation techniques:

  • store data (key, value pairs) in a sorted array, and use binary search to access values by key;
  • store data in a hash map;
  • store data in a binary tree (usually RB or AVL trees).

The implementation choice considerably depends on the operations performed on the container. Usually these are:

  1. construction;
  2. value lookup by key;
  3. key enumeration (ordered or not);
  4. container modification (adding and removing data);
  5. access elements by index;

Note that modification in functional programming means the creation of a new container, so here is a division:

  1. If the container's use pattern does not include modification, then probably the simplest solution is to build it as an ordered sequence of pairs and use binary search to access the data. Alternatively, one could implement the associative container as a hash map.
  2. If modification is essential, then neither an ordered sequence of pairs, a hash map, nor a classical tree implementation can be used, as they are either too slow or too greedy for memory, either during modification or during access.

On the other hand, to deal with container modifications one can build an implementation which uses "top-down" RB or AVL trees. To see the difference consider a classical tree structure and its functional variant:

Classical tree:
  Node structure: node { parent, left, right, other data }
  Node reference: the node itself
  Modification: either mutable, or a completely new tree is required

Functional tree:
  Node structure: node { left, right, other data }
  Node reference: a path from the root of the tree to the node
  Modification: only O(log N) new nodes are created

Here we observe that:

  1. one can implement an efficient map (lookup time no worse than O(log N)) with no modification support, using an ordered array;
  2. one can implement an efficient map with support for modification, using an immutable binary tree (see the sketch after this list);
  3. one can implement all these algorithms purely in xslt and xquery (provided that inline functions are supported);
  4. any such implementation will lose against the same implementation written in C++, C#, or java;
  5. the best implementation would probably start from a sorted array and switch to a binary tree after some size threshold.
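
A minimal sketch of point 2 (an illustration under our assumptions, not the implementation offered below): an immutable node keeps no parent reference, and insertion rebuilds only the nodes on the search path, so O(log N) new nodes are created while the rest of the tree is shared.

using System;

public sealed class AvlNode
{
  public AvlNode(int key, AvlNode left, AvlNode right)
  {
    Key = key;
    Left = left;
    Right = right;
    Height = 1 + Math.Max(left?.Height ?? 0, right?.Height ?? 0);
  }

  public int Key { get; }
  public AvlNode Left { get; }
  public AvlNode Right { get; }
  public int Height { get; }

  // Returns the root of a new tree; nodes off the search path are shared.
  public static AvlNode Insert(AvlNode node, int key) =>
    node == null ? new AvlNode(key, null, null) :
    key < node.Key ?
      Balance(new AvlNode(node.Key, Insert(node.Left, key), node.Right)) :
    key > node.Key ?
      Balance(new AvlNode(node.Key, node.Left, Insert(node.Right, key))) :
    node;

  // AVL rotations are omitted from this sketch.
  private static AvlNode Balance(AvlNode node) => node;
}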

Here we provide a C# implementation of a functional AVL tree, which also supports element indexing:

Our intention was to show that the usual algorithms for associative containers apply in functional programming; thus a feature-complete functional language must support associative containers to make development more conscious, and to free the developer from reinventing basic things that have existed for almost half a century.

Wednesday, 27 January 2010 07:00:55 UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | xslt
# Tuesday, 19 January 2010

Several years ago we started a new project. We neither like nor hate any particular language, so the decision about which language to use was pragmatic: xslt 2.0 fitted perfectly.

At present it's a solid volume of xslt code. It exhibits all the virtues of any good project in any other language: clean design, modularity, documentation, no needless sophistication (good code should not be clever).

The runtime profile of the project is that it deals with xml documents of sizes from a few dozen bytes to several megabytes, and with xml schemas ranging from simple ones, like tabular data, to rich and untyped ones like xhtml. A pipeline of stylesheets processes gigabytes of data stored in the database and in files.

All the bragging above is needed here to introduce the context for the following couple of lines.

The diversity of load conditions and a big code base have exposed the xslt engine of choice to a good proof test. The victim is Saxon. In the course of the project we have found and reported many bugs. Some of them are nasty and important, others are bearable. To Michael Kay's credit (he's the owner of Saxon) all bugs are being fixed promptly (see the last one).

Such a project helps us to understand the weak sides of xslt (it seems that they, in the WG, sometimes lack such experience, which should guide them).

Well, it has happened that we're helping the Saxon project. Unintentionally, however! :-)

P.S. About language preferences.

Nowadays we're polishing COBOL generation. To this end we have familiarized ourselves with this language. That's a beautiful language. Its straightforwardness helps to see the evolution of computer languages and to understand what and why today's languages try to timidly hide.

Tuesday, 19 January 2010 19:56:04 UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, 19 December 2009

In spite of the fact that our latest projects are being developed in Java, .NET is definitely our favorite platform.

On twitter I saw the phrase: "Java the language is a stagnant mess". It was said in favour of C#. It's true that C# now significantly influences even Java (let's remember generics, jaxb, web services, etc.), but in my opinion C# won't be the leading language for worldwide enterprise applications in the nearest future.

One of the causes is that the main platform for .NET is still Windows. The situation could be changed by the Mono project, but I think there are not yet enough projects on platforms other than Windows.

My guess is confirmed by some observations that I have made as a software engineer in an IT company. Our company performs various software porting projects from legacy programming languages like COBOL, ADSO, Natural, etc. into up-to-date languages like Java, C#, etc. It is worth saying that clients rarely choose to migrate to .NET despite our advice.

The main reason for this choice, according to most of our clients, is that they want to be platform independent, and only Java gives them this choice.

It would be worth it for Microsoft to think about cooperation with Mono in order to make .NET really platform independent, otherwise C# will always be a step behind Java despite the apparent advantages of C# as a programming language.

Saturday, 19 December 2009 17:08:23 UTC  #    Comments [1] -
Thinking aloud
# Tuesday, 22 September 2009

Suppose you have a library which is out in production and used by many clients. At the same time the library evolves: the API is extended, bugs are being fixed, code becomes faster and cleaner, bla bla bla...

At some point you're fixing an important bug that has been hiding for a long time in the bowels of your library. You're happy that you've spotted it before clients got into trouble. You're notifying all the clients that there is an important fix, and that they need to update the library.

What do you think you hear in return?

Well, we're not perfect; there are bugs in our software. We and our clients realize this. Nothing will prevent bugs from creeping into the code from time to time.

That's the train of thought of a particular client:

We agree that there is a bug and that it has to be fixed. We, however, want to touch the library in a minimal way, as who knows what other new bugs they have introduced, so let's ask them to fix this particular bug in our version of the library.

That's fair from the client's perspective. They don't want better code, they just want that particular bug fixed!

For us, however, this means branching some old version of the library, fixing the bug, and supporting this branch for that particular client. It's fair to expect a similar position from each client, so should we create and support library branches per client, and branch the main branch only for a new client?

For us (Arthur and Vladimir) it looks like an enormous waste of resources. We (our company) should either hire more and more people or experience a gradual slowdown of support and development.

Our answer would be obvious if not for the position of top managers, who value client relations so much that they easily promise whatever the client wishes. No arguments that the latest version is better tested, more conformant to specifications, more reliable, faster and so on are accepted. The main argument against our position is that the client's applications run in production, and no new potential bugs are acceptable.

Here is our dilemma: we neither can convince the client (more precisely, our managers) that we're right, nor are we convinced by their arguments...

Tuesday, 22 September 2009 08:16:11 UTC  #    Comments [3] -
Thinking aloud
# Sunday, 18 January 2004

From time to time I like to tell about, to express (or even to brag about) what keeps our minds busy. Sometimes it's even useful: you unexpectedly discover that you are on the edge of community interests; other times you are an outsider.

So, what are we doing now? Well, for several years already we have been working at Multiconn. One of the goals of our company is to help expose mainframe functionality. Many different solutions to this task are available, and even Multiconn provides several different approaches. Now let's go right to our business: a web service as a delegate of the mainframe.

To be honest, I think mainframes are legacy monsters, dinosaurs. I realize it's only a prejudiced view, but the first experience is often very strong. For us as developers the mainframe is a source of data in different formats (and sure, the formats are legacy too): COBOL records, terminal messages, and so on. Our idea is to allow the mainframe to rest in peace and to gracefully consume its data. Combining consumed data in xml form and operation bindings that describe the mainframe application's flow, we arrive at a view of a mainframe application as a web service.

To achieve this goal we have worked out an extension to xml schema that allows mapping schema elements to data. It should be pointed out that it's perfectly legal to extend xml schema. One way to extend it is to use elements in a custom namespace inside the appInfo element.

The next step was to create an importer of schemas with our extensions. It's somewhat similar to the xsd.exe tool. Our tool generates classes with XmlXxxAttribute annotations for xml serialization and LayoutXxxAttribute (these are our custom attributes) for serializing and deserializing instances to and from mainframe data.
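
For illustration only, here is what such a generated class could look like (a hypothetical sketch: LayoutFieldAttribute below is a stand-in that we define ourselves, not the actual LayoutXxxAttribute family):

using System;
using System.Xml.Serialization;

// Stand-in for a layout attribute describing a field's fixed position
// and length within a COBOL record.
[AttributeUsage(AttributeTargets.Property)]
public class LayoutFieldAttribute: Attribute
{
  public LayoutFieldAttribute(int offset, int length)
  {
    Offset = offset;
    Length = length;
  }

  public int Offset { get; }
  public int Length { get; }
}

[XmlRoot("customer")]
public class Customer
{
  [XmlElement("id"), LayoutField(0, 8)]
  public string Id { get; set; }

  [XmlElement("name"), LayoutField(8, 30)]
  public string Name { get; set; }
}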

The next stage was to create the data serializer. This is a counterpart to XmlSerializer. It inspects class metadata and creates a plan for serialization and deserialization.

At the next stage we have worked out wsdl schema bindings for our technology.

After that we have created a tool to import wsdl (similar to wsdl.exe) that generates a web service which passes input messages to a communication layer; this layer serializes and forwards data to the mainframe, accepts and deserializes the response, and returns it to the web service, which in its turn returns the result to a client.

Next came the communication layer that interacts with the mainframe. This layer consists of an abstract (general) sublayer and specializations which support different patterns of communication.

As a result we have a web service implemented in .NET. This web service represents some mainframe application.

There are plans to generate similar web services in Java.

Sunday, 18 January 2004 09:40:59 UTC  #    Comments [0] -
Thinking aloud